Messaging Alarms from Metis to Kapacitor Lionel Cons - IT/CM/MM Section Meeting - 2018/09/27

Messaging Alarms from Metis to Kapacitor

Lionel Cons - IT/CM/MM Section Meeting - 2018/09/27

Context

Replacement of the old Messaging Based Monitoring (MBM) by MONIT.

Log files are (already) handled by Logstash.

Metrics are (already) generated by Collectd.

How to generate alarms?

Use case 1: alarms based on aggregated metrics like #messages in a cluster.

Use case 2: alarms not based on metrics like STOMP “loopback” checks.

Requirements

Be able to assess how well the messaging service is working.

Have simple dashboards showing the current situation (what is working or not).

Have historical dashboards showing how the situation evolved in the past.

Be fully integrated with MONIT: Flume, HDFS, InfluxDB/ElasticSearch, Grafana/Kibana, Service Now...

Terminology (Metis)

Metric: result of measuring something = number.

Check: result of checking something = enum UNKNOWN | OK | WARNING | CRITICAL.

Status: to a first approximation, the most recent check value.

However, status is more complex and takes care of flapping, masking, validity…

Metis notifications (e.g. via Service Now) are based on status events.
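
The difference between a metric and a check can be sketched in Python. This is only an illustration of the terminology, not Metis code: the `Check` enum mirrors the four values listed above, while `check_from_metric` and its `warn`/`crit` upper thresholds are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class Check(Enum):
    """The four possible check results named in the terminology."""
    UNKNOWN = "UNKNOWN"
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


def check_from_metric(value: Optional[float], warn: float, crit: float) -> Check:
    """Turn a metric (a number) into a check (an enum value).

    Hypothetical helper: thresholds are upper bounds with crit >= warn.
    """
    if value is None:
        return Check.UNKNOWN  # nothing was measured
    if value > crit:
        return Check.CRITICAL
    if value > warn:
        return Check.WARNING
    return Check.OK
```

A status would then be built on top of a series of such check values rather than on a single one.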

Examples of Check versus Status

A check is oscillating between OK and WARNING: its status could be WARNING with a “flapping flag”, along with more information like “flapping since”…

A service enters scheduled downtime: its checks start to fail but its status could indicate something like “in maintenance” (to be linked to Roger states).

A check does not get a new value: after some time (aka validity expiration), its status could become UNKNOWN.
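
These three examples can be sketched as a small Python function that derives a status from recent check results. It is an illustration under stated assumptions, not how Metis or Kapacitor implement it: the `Status` fields, the 600 second validity and the "three changes means flapping" rule are all invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (timestamp in seconds, check value) pairs, oldest first
History = List[Tuple[float, str]]

SEVERITY = {"UNKNOWN": 0, "OK": 1, "WARNING": 2, "CRITICAL": 3}


@dataclass
class Status:
    value: str                    # UNKNOWN | OK | WARNING | CRITICAL
    flapping: bool = False        # the "flapping flag"
    in_maintenance: bool = False  # e.g. scheduled downtime


def derive_status(history: History, now: float, validity: float = 600.0,
                  flap_changes: int = 3, in_maintenance: bool = False) -> Status:
    """Derive a status from recent check results (all thresholds are assumptions)."""
    if not history or now - history[-1][0] > validity:
        # No fresh value: validity has expired.
        return Status("UNKNOWN", in_maintenance=in_maintenance)
    values = [value for _, value in history]
    changes = sum(1 for a, b in zip(values, values[1:]) if a != b)
    if changes >= flap_changes:
        # Oscillating check: report the worst recent value plus a flag.
        worst = max(values, key=SEVERITY.__getitem__)
        return Status(worst, flapping=True, in_maintenance=in_maintenance)
    return Status(values[-1], in_maintenance=in_maintenance)
```

With a history alternating between OK and WARNING every minute, the function reports WARNING with the flapping flag set instead of following each individual check value.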

Continuous Checking ⇒ Continuous Results

Each time something is checked, its result should be recorded.

This makes it easy to find out what was working or not at some point in the past.

Recording only the changes (like Collectd’s threshold plugin with persist=false) complicates the situation:
• how far in the past should we go to find the checks that did not change?
• how to deal with checks that are no longer executed?
• how to know if the service has not been checked?
• how to work around InfluxDB limitations (e.g. lack of “having” clause)?
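
A short Python sketch of why continuous results simplify these questions: when every execution is recorded, the state at any past moment is a lookup in a bounded time window, and a check that was not executed is visible as a missing record. The record layout and the function names are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# One record per check execution: (timestamp, check name, result)
Record = Tuple[float, str, str]


def state_at(records: List[Record], when: float, window: float) -> Dict[str, str]:
    """Return the latest result of every check executed in [when - window, when]."""
    latest: Dict[str, Tuple[float, str]] = {}
    for ts, name, result in records:
        if when - window <= ts <= when:
            if name not in latest or ts >= latest[name][0]:
                latest[name] = (ts, result)
    return {name: result for name, (_, result) in latest.items()}


def not_checked(records: List[Record], expected: Iterable[str],
                when: float, window: float) -> List[str]:
    """Checks we expected to run but that left no record in the window."""
    seen = state_at(records, when, window)
    return sorted(set(expected) - set(seen))
```

With change-only recording, the same questions would require scanning arbitrarily far back in time, and a missing record could mean either "unchanged" or "never executed".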

Design Decisions

Generate (continuous) “check metrics” and send them as MONIT metrics.

Generate “service alarms” from “check metrics” and send them as MONIT alarms.

To generate alarms, use the same logic regardless of the type of checks:
• based on metrics like #messages in a cluster
• based on functional tests like STOMP “loopback” (send and receive message)

Use standard tools and minimize the need for coding.

Technology Survey

Collectd cannot be used for aggregation.

Grafana requires a graph for each alarm (alerts cannot use templates).

Spark requires a lot of coding.

Kapacitor has been created “to process alerts with dynamic thresholds, match metrics for patterns, compute statistical anomalies”. It seems to have all the features we need.

Kapacitor Features

● part of the “TICK stack” so fully integrated with InfluxDB
● two modes of operation: stream and batch
● high-level Domain Specific Language named TICKscript to extend what InfluxDB can do
● built-in support for script templates, fine-grained thresholds, hysteresis, flapping, masking, state handling, statistical analysis…
● flexible connection to the outside world: Email, HTTP POST, Kafka, Slack…

Kapacitor Overview

https://github.com/influxdata/kapacitor/blob/master/examples/telegraf/cpu/cpu_alert_batch.tick

// TELEGRAF CONFIGURATION
// [[inputs.cpu]]
//   percpu = true
//   totalcpu = true
//   fielddrop = ["time_*"]

// Parameters
var info = 70
var warn = 80
var crit = 90
var infoSig = 2.5
var warnSig = 3
var critSig = 3.5
var period = 10s
var every = 10s

https://github.com/influxdata/kapacitor/blob/master/examples/telegraf/cpu/cpu_alert_batch.tick

var data = batch
    |query("SELECT 100 - mean(idle) AS stat FROM dbrp.cpu WHERE cpu = 'cpu-total'")
        .period(period)
        .every(every)
        .groupBy('host')

var alert = data
    |eval(lambda: sigma("stat"))
        .as('sigma')
        .keep()
    |alert()
        .id('{{ index .Tags "host"}}/cpu_used')
        .message('{{ .ID }}:{{ index .Fields "stat" }}')
        .info(lambda: "stat" > info OR "sigma" > infoSig)
        .warn(lambda: "stat" > warn OR "sigma" > warnSig)
        .crit(lambda: "stat" > crit OR "sigma" > critSig)
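
The alert conditions above combine a fixed threshold with a statistical one. The Python sketch below illustrates the idea: `RunningSigma` reports how many standard deviations a value lies from a running mean (computed here with Welford's online algorithm, which is an assumption of this sketch and not a claim about Kapacitor's internals), and `level` applies the same OR logic as the `.info()`, `.warn()` and `.crit()` lambdas with the parameter values shown earlier.

```python
import math


class RunningSigma:
    """Number of standard deviations a value lies from the running mean.

    The running statistics are updated on every call.
    """

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the mean

    def __call__(self, x: float) -> float:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2 or self.m2 == 0.0:
            return 0.0  # not enough spread to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std


def level(stat: float, sigma: float,
          info: float = 70, warn: float = 80, crit: float = 90,
          info_sig: float = 2.5, warn_sig: float = 3.0, crit_sig: float = 3.5) -> str:
    """Most severe level whose condition holds, else OK."""
    if stat > crit or sigma > crit_sig:
        return "CRITICAL"
    if stat > warn or sigma > warn_sig:
        return "WARNING"
    if stat > info or sigma > info_sig:
        return "INFO"
    return "OK"
```

The sigma term lets a host that is normally idle raise an alert on an unusual jump even when its CPU usage stays below the fixed thresholds.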

https://github.com/influxdata/kapacitor/blob/master/examples/telegraf/cpu/cpu_alert_stream.tick

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        .groupBy('host')
        .where(lambda: "cpu" == 'cpu-total')
    |eval(lambda: 100.0 - "usage_idle")
        .as('used')
    |window()
        .period(period)
        .every(every)
    |mean('used')
        .as('stat')

var alert = data...

https://gitlab.cern.ch/ai/it-puppet-hostgroup-mig/blob/master/code/files/.../activemq-cluster.tick

var data = batch
    |query('select mean_value as value ' +
           'from monit_production_mig.one_week.activemq_broker ' +
           'where toplevel_hostgroup = \'mig\' ' +
           'order by time desc limit ' + string(window))
        .every(every)
        .period(period)
        .groupBy('host', 'value_instance', 'cluster')
    |groupBy('host', 'value_instance', 'cluster')
    |mean('value').as('value')
    |groupBy('value_instance', 'cluster')
    |sum('value').as('value')
    |httpOut('last')
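
The pipeline above aggregates in two stages: first the mean per broker (host, value_instance, cluster), then the sum of those means per cluster. A plain Python sketch of the same two steps, using made-up sample values and a hypothetical `cluster_totals` helper:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# One point per sample: (host, value_instance, cluster, value)
Point = Tuple[str, str, str, float]


def cluster_totals(points: List[Point]) -> Dict[Tuple[str, str], float]:
    """Mean per (host, value_instance, cluster), then sum per (value_instance, cluster)."""
    # Stage 1: collect the samples of each broker.
    per_broker: Dict[Tuple[str, str, str], List[float]] = defaultdict(list)
    for host, instance, cluster, value in points:
        per_broker[(host, instance, cluster)].append(value)
    # Stage 2: add up the per-broker means within each cluster.
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for (_host, instance, cluster), values in per_broker.items():
        totals[(instance, cluster)] += mean(values)
    return dict(totals)
```

Averaging per broker before summing keeps a broker that reported more samples in the window from weighing more in the cluster total.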

https://gitlab.cern.ch/ai/it-puppet-hostgroup-mig/blob/master/code/files/.../activemq-cluster.tick

data
    |where(lambda: "value_instance" == 'messages_stored')
    |sideload()
        .source(source)
        .order('{{.cluster}}.yaml')
        .field('cluster_messages_stored_warn', 100000.0)
        .field('cluster_messages_stored_crit', 1000000.0)
    |alert()
        .id(id)
        .message('stored on {{ index .Tags "where" }}: {{ index .Fields "value" }}')
        .details(details)
        .info(lambda: TRUE)
        .warn(lambda: "value" > "cluster_messages_stored_warn")
        .crit(lambda: "value" > "cluster_messages_stored_crit")
        .post()
            .endpoint('kapacitor2monit')
            .captureResponse()
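
The sideload step gives each cluster its own thresholds while keeping defaults for clusters without a dedicated file. A Python sketch of that lookup and of the resulting alert level; the `overrides` dictionary stands in for the per-cluster YAML files and its contents are invented for the example.

```python
from typing import Dict, Mapping

# Defaults taken from the .field() calls above.
DEFAULTS: Dict[str, float] = {
    "cluster_messages_stored_warn": 100000.0,
    "cluster_messages_stored_crit": 1000000.0,
}


def thresholds_for(cluster: str,
                   overrides: Mapping[str, Mapping[str, float]]) -> Dict[str, float]:
    """Defaults, overridden by whatever the cluster's own entry defines."""
    merged = dict(DEFAULTS)
    merged.update(overrides.get(cluster, {}))
    return merged


def alert_level(value: float, thresholds: Mapping[str, float]) -> str:
    """Same ordering as the alert node: crit wins over warn, warn over info."""
    if value > thresholds["cluster_messages_stored_crit"]:
        return "CRITICAL"
    if value > thresholds["cluster_messages_stored_warn"]:
        return "WARNING"
    return "INFO"  # .info(lambda: TRUE): every point is at least INFO
```

Because the info condition is always true, every evaluated point produces an event, which matches the "continuous results" approach described earlier.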

Kapacitor’s httpOut() Node

# curl -s http://localhost:9092/kapacitor/v1/tasks/activemq-cluster/last | jsonpp
...
    {
      "name": "activemq_broker",
      "tags": {
        "cluster": "dashb",
        "value_instance": "messages_stored"
      },
      "columns": [ "time", "value" ],
      "values": [
        [ "2018-09-26T11:11:12.364816682+02:00", 74 ]
      ]
    },
...
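
Output of this shape is easy to consume from a script. The Python sketch below extracts the latest value per (cluster, value_instance) from such a document. The slide elides the surrounding structure, so the top-level `"series"` list used in `SAMPLE` is an assumption, and `latest_values` is a hypothetical helper.

```python
import json
from typing import Dict, Tuple

# Sample built from the fragment above; the "series" wrapper is assumed.
SAMPLE = """
{"series": [
  {"name": "activemq_broker",
   "tags": {"cluster": "dashb", "value_instance": "messages_stored"},
   "columns": ["time", "value"],
   "values": [["2018-09-26T11:11:12.364816682+02:00", 74]]}
]}
"""


def latest_values(payload: str) -> Dict[Tuple[str, str], float]:
    """Map (cluster, value_instance) to the value of the last row of each series."""
    result: Dict[Tuple[str, str], float] = {}
    for series in json.loads(payload).get("series", []):
        tags = series["tags"]
        # Pair column names with the last row so fields are looked up by name.
        row = dict(zip(series["columns"], series["values"][-1]))
        result[(tags["cluster"], tags["value_instance"])] = row["value"]
    return result
```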

https://gitlab.cern.ch/ai/it-puppet-hostgroup-mig/blob/master/code/files/.../alarm.tick

batch
    |query('select message, status ' +
           'from monit_production_mig.raw.check ' +
           'order by time desc limit 1')
        .cluster('mig-metrics')
        .every(every)
        .period(period)
        .groupBy('what', 'where')
    |alert()
        .id(id)
        .message('{{ index .Fields "message" }}')
        .details(details)
        .info(lambda: TRUE)
        .warn(lambda: "status" == 'WARNING')
        .crit(lambda: "status" == 'CRITICAL')
        .stateChangesOnly(1d)
        .topic('alert')
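
The `.stateChangesOnly(1d)` property limits what reaches the topic handlers: an event is passed on when the level changes, or when more than the given interval has elapsed since the last event that was passed on. The Python generator below is a sketch of that filtering rule as understood here, not Kapacitor's implementation; the function name and event layout are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

DAY = 86400.0  # seconds


def state_changes_only(events: Iterable[Tuple[float, str]],
                       max_interval: float = DAY) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, level) events on level change or after max_interval."""
    last_level: Optional[str] = None
    last_sent: Optional[float] = None
    for ts, level in events:
        changed = level != last_level
        stale = last_sent is not None and ts - last_sent > max_interval
        if changed or stale:
            yield ts, level
            last_sent = ts
        last_level = level
```

The periodic re-send keeps long-lived states visible downstream even though the checks themselves are evaluated continuously.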

Kapacitor’s alert() Node

# kapacitor show-topic alert
ID: alert
Level: INFO
Collected: 1676
Handlers: [log-alert, post-alert]
Events:
Event                                  Level  Message                                  Date
brksvc:stomp:queue:dashb@mb128:61113   INFO   STOMP loopback took 17ms                 26 Sep 18 11:32
brksvc:stomps:topic:dashb@mb128:61123  INFO   STOMP loopback took 21ms                 26 Sep 18 11:32
broker:connections:dashb@mb118         INFO   connections on dashb@mb118: 224          26 Sep 18 11:37
broker:messages_stored:dashb@mb134     INFO   stored messages on dashb@mb134: 29       26 Sep 18 11:33
broker:threads:dashb@mb117             INFO   threads on dashb@mb117: 102              26 Sep 18 11:34
cluster:dns:dashb-test                 INFO   mb083 (a.b.c.d) in dashb-test-mb         26 Sep 18 11:35
cluster:messages_stored:dashb          INFO   stored messages on dashb: 165            26 Sep 18 11:33
df:bytes:mb117:root                    INFO   disk bytes usage for root on mb117: 8.9  26 Sep 18 11:35
host:load:mb117                        INFO   load on mb117.cern.ch: 0.1               26 Sep 18 11:35
host:tcpconns_established:mb114        INFO   established connections on mb114: 13003  26 Sep 18 11:36
host:tcpconns_waiting:mb114            INFO   waiting connections on mb114: 26         26 Sep 18 11:37
...

Kapacitor for Messaging

Grafana Dashboard

Grafana Dashboard (zoomed)

Kibana View

Numbers

● 14 clusters
● 37 brokers
● 63 VMs
● 33 check types (aka “what”)
● 1486 different checks (“what” + “where”)
● ~25 MONIT check metrics per second
● ~2 POST to monit-metrics per second

Kapacitor has a small footprint:

PR  NI     VIRT     RES    SHR  S  %CPU  %MEM     TIME+  COMMAND
20   0  1398408  345144  19268  S   0.3   2.3  50:18.13  kapacitord

> show tag values with key = "what"
what brksvc:http:jmx4perl
what brksvc:https:console
what brksvc:stomp:queue
what brksvc:stomp:topic
what brksvc:stomps:queue
what brksvc:stomps:topic
what broker:connections
what broker:connections_rate
what broker:consumers
what broker:cpu_load
what broker:heap
what broker:memory
what broker:messages_pending
what broker:messages_received
what broker:messages_sent
what broker:messages_stored
what broker:open_fd
what broker:store
what broker:temp
what broker:threads
what cluster:dns
what cluster:messages_pending
what cluster:messages_received
what cluster:messages_sent
what cluster:messages_stored
...

Summary

Kapacitor has a wide range of useful features.

It can be used to replace Metis.

It is not specific to the messaging use cases so other IT services could use it.

Like InfluxDB, the Open Source version lacks clustering and scalability.

Like InfluxDB, there are many issues on GitHub (Telegraf: 488, InfluxDB: 813, Chronograf: 718, Kapacitor: 545).
