Extensible Monitoring with Nagios and Messaging Middleware LISA 2012 Jonathan Reams <[email protected]>

Dec 17, 2015

Page 1: Extensible Monitoring with Nagios and Messaging Middleware LISA 2012 Jonathan Reams.

Extensible Monitoring with Nagios and Messaging Middleware
LISA 2012

Jonathan Reams <[email protected]>

Symon Says Nagios Project

• Replace 12-year-old home grown monitoring system
  – Very customized
  – Very engineered
  – Very unsupported
• ~17,000 checks
• Mandate to move to Nagios

False Start

1. Installed Nagios

2. Ported checks from old system to new

3. Went out for coffee

4. Problems

a. High check latency

b. High load

Stock Nagios

Nagios Problems

• Trapped on one host:
  – Check results
  – Status data
  – Configuration data
• Nagios isn’t a great executor
  – Forks 2 processes per check
  – Everything is basically synchronous – async achieved with multiple processes
• Data format is simple but non-standard

Nagios Problems

• Implementation is all in C – hard to customize
• Can be I/O bound by reading/writing check result files
• Cannot query data from status file/configuration without reading/parsing all of it
• Input via FIFO gives no feedback and has a limited buffer size

Nagios Problems

Communication is hard!

My Solution

NagMQ

A ZeroMQ-based API for Nagios

Background on ZeroMQ

• Broker-less messaging kernel in a single library
• Emulates Berkeley socket API
• Supports IPC/TCP/Multicast transports
• Fanout, pub/sub, pipeline, and request/reply messaging patterns
• All I/O is asynchronous after connections are established with dedicated I/O threads
• Bindings available for a large number of operating systems and languages
• Agnostic of data being sent – no defined data format

NagMQ

Event Publisher & Commands

Host check result from publisher:

host_check_processed localhost
{ "host_name": "localhost", "check_type": 0, "check_options": 0,
  "scheduled_check": 1, "reschedule_check": 1,
  "current_attempt": 1, "max_attempts": 1,
  "state": 0, "last_state": 0, "last_hard_state": 0,
  "last_check": 1354996955, "last_state_change": 1337098090,
  "latency": 1.63600, "timeout": 60, "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "early_timeout": 0, "execution_time": 0.07324, "return_code": 0,
  "output": "Host up", "long_output": null, "perf_data": null,
  "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 } }
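A consumer of the event above has to separate the envelope line from the JSON payload before it can use either. The sketch below uses only the Python standard library to apply a ZeroMQ-style prefix subscription to the envelope and to derive the wall-clock time between start_time and end_time. The two-part framing and the helper names (matches_subscription, check_duration) are assumptions made for illustration, not NagMQ code, and the payload is trimmed from the slide.

```python
import json

# Envelope and payload as shown on the slide (payload trimmed to the fields used here).
ENVELOPE = "host_check_processed localhost"
PAYLOAD = """
{ "host_name": "localhost", "state": 0, "latency": 1.63600,
  "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "execution_time": 0.07324, "return_code": 0, "output": "Host up" }
"""


def matches_subscription(envelope, prefix):
    """Mimic ZeroMQ SUB filtering, which is a plain prefix match (hypothetical helper)."""
    return envelope.startswith(prefix)


def check_duration(event):
    """Seconds between start_time and end_time, computed from the timeval fields."""
    start, end = event["start_time"], event["end_time"]
    return (end["tv_sec"] - start["tv_sec"]) + (end["tv_usec"] - start["tv_usec"]) / 1_000_000


event = json.loads(PAYLOAD)
if matches_subscription(ENVELOPE, "host_check_processed"):
    print(f'{event["host_name"]}: {event["output"]} '
          f'(state {event["state"]}, {check_duration(event):.3f}s start to end)')
```

Keeping the subscription check on the envelope means a consumer can discard unwanted events without parsing any JSON.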

Command to add an acknowledgement to service problem:

{'comment_data': 'Stop alerting me!!', 'notify_contacts': False,
 'author_name': 'jreams', 'persistent_comment': False,
 'host_name': 'localhost', 'service_description': 'rotate-unix',
 'time_stamp': {'tv_sec': 1355074576}, 'type': 'acknowledgement'}
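The acknowledgement command above is shown as a Python dictionary. A client would presumably serialize it to JSON before sending it to the command interface. The following is a minimal sketch of that step using only the standard library; build_ack is a hypothetical helper, and the socket send itself is left out because the slides do not show it.

```python
import json
import time


def build_ack(host_name, service_description, author_name, comment_data,
              notify_contacts=False, persistent_comment=False, now=None):
    """Build an acknowledgement command with the fields shown on the slide."""
    tv_sec = int(time.time() if now is None else now)
    return {
        "type": "acknowledgement",
        "host_name": host_name,
        "service_description": service_description,
        "author_name": author_name,
        "comment_data": comment_data,
        "notify_contacts": notify_contacts,
        "persistent_comment": persistent_comment,
        "time_stamp": {"tv_sec": tv_sec},
    }


# Reproduce the slide's command; `now` is pinned so the result is repeatable.
command = build_ack("localhost", "rotate-unix", "jreams", "Stop alerting me!!",
                    now=1355074576)
wire = json.dumps(command)  # the string a client would hand to its socket
print(wire)
```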

State Data

Request:

{'keys': ['host_name', 'services', 'hosts', 'service_description',
          'current_state', 'members', 'type', 'name',
          'problem_has_been_acknowledged', 'plugin_output', 'checks_enabled',
          'notifications_enabled', 'event_handler_enabled'],
 'include_services': True, 'host_name': 'localhost'}

Response:

[{'checks_enabled': True, 'notifications_enabled': True, 'current_state': 0,
  'plugin_output': 'Host up', 'problem_has_been_acknowledged': 0,
  'event_handler_enabled': True, 'host_name': 'localhost',
  'services': ['rotate-unix'], 'type': 'host'},
 {'checks_enabled': False, 'notifications_enabled': True, 'current_state': 1,
  'plugin_output': 'You are now on call',
  'problem_has_been_acknowledged': False, 'event_handler_enabled': True,
  'host_name': 'localhost', 'service_description': 'rotate-unix',
  'type': 'service'}]
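Because the state interface returns only the requested keys for the requested objects, a tool can work on a small list instead of parsing a whole status file. The sketch below builds a request shaped like the one on the slide and picks out unacknowledged problems from the slide's response. The helper names (build_state_request, label, open_problems) are hypothetical, JSON is assumed as the wire format, and the transport is omitted.

```python
import json

# Subset of the keys requested on the slide, limited to what this sketch uses.
KEYS = ["host_name", "service_description", "current_state",
        "problem_has_been_acknowledged", "plugin_output", "type"]


def build_state_request(host_name, keys, include_services=True):
    """Build a state request shaped like the one on the slide."""
    return {"host_name": host_name, "include_services": include_services,
            "keys": list(keys)}


def label(obj):
    """Name an object as host, or as service@host for services."""
    if obj["type"] == "service":
        return f'{obj["service_description"]}@{obj["host_name"]}'
    return obj["host_name"]


def open_problems(response):
    """Return objects in a non-zero state that have not been acknowledged."""
    return [obj for obj in response
            if obj["current_state"] != 0
            and not obj["problem_has_been_acknowledged"]]


# Response from the slide, trimmed to the requested keys.
RESPONSE = [
    {"type": "host", "host_name": "localhost", "current_state": 0,
     "plugin_output": "Host up", "problem_has_been_acknowledged": 0},
    {"type": "service", "host_name": "localhost",
     "service_description": "rotate-unix", "current_state": 1,
     "plugin_output": "You are now on call",
     "problem_has_been_acknowledged": False},
]

request = json.dumps(build_state_request("localhost", KEYS))
for obj in open_problems(RESPONSE):
    print(f'[{label(obj)}]: {obj["plugin_output"]}')
```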

Some examples

• Distributed check execution (mqexec)
• Custom user interfaces (nag.py, etc.)
• High availability (haagent.py, halib.py)

mqexec

mqexec

• Asynchronous command executor
• Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes command line specified
• Can filter which commands to execute based on any attribute in message
• Receives messages as
  – Fair-queued worker pool (pull from MQ broker)
  – Individual worker (subscribe directly to NagMQ)

• Sends results back to command interface of NagMQ
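The filter-then-execute loop described above can be sketched in a few lines of standard-library Python. This is not mqexec itself: it runs one command synchronously, whereas mqexec is asynchronous, and it leaves out the ZeroMQ sockets. The command_line field name, the "_initiate" to "_processed" renaming, and the helpers wanted and execute are assumptions for illustration.

```python
import shlex
import subprocess
import sys
import time


def wanted(message, filters):
    """True when every filtered attribute of the message equals the wanted value."""
    return all(message.get(key) == value for key, value in filters.items())


def execute(message, timeout=60):
    """Run the message's command line and build a result for the command interface."""
    started = time.time()
    try:
        proc = subprocess.run(shlex.split(message["command_line"]),
                              capture_output=True, text=True, timeout=timeout)
        return_code, output = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return_code, output = 3, "(check timed out)"  # 3 is the plugin exit code for UNKNOWN
    return {
        "type": message["type"].replace("_initiate", "_processed"),  # assumed naming
        "host_name": message["host_name"],
        "service_description": message.get("service_description"),
        "return_code": return_code,
        "output": output,
        "execution_time": time.time() - started,
    }


# Hypothetical check message; the command just prints a plugin-style status line.
message = {
    "type": "service_check_initiate",
    "host_name": "localhost",
    "service_description": "rotate-unix",
    "command_line": shlex.join([sys.executable, "-c", "print('OK - sketch')"]),
}

if wanted(message, {"host_name": "localhost"}):
    result = execute(message)
    print(result["return_code"], result["output"])
```

Filtering on arbitrary message attributes is what lets different workers take different subsets of checks, for example by host name as shown here.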

Performance: Stock Nagios

[Chart: latency in seconds (y-axis, 0 to 18) against time in minutes (x-axis, 1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

Performance: NagMQ/mqexec

[Chart: latency in seconds (y-axis, 0 to 18) against time in minutes (x-axis, 1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

User Interfaces

• Command-line

$ nag.py -c 'Stop alerting me!!' add ack localhost
[localhost]: No problem found
[uptime@localhost]: Acknowledgement added

• Python/JavaScript/Twitter Bootstrap web interface using NagMQ (see demo)

• Interface to Twitter

High Availability – Stock Nagios

High Availability - NagMQ

High Availability - NagMQ

• Use regular program_status to provide heartbeat
• Retrieve active state from state interface to bring passive node into sync with active node on startup
• Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface

• Passive host’s mqexec(s) run checks for whatever host is active

• Use VIFs owned by the message broker to direct traffic to active host
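The heartbeat idea in the first bullet can be illustrated with a small timer: the passive node records when it last saw a program_status message and treats the active node as down once too much time has passed. This is a sketch under stated assumptions, not how haagent.py is implemented; the HeartbeatMonitor class, the 30-second timeout, and the FakeClock used for the demonstration are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track program_status heartbeats published by the active node."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # assumed threshold, in seconds
        self.clock = clock
        self.last_seen = clock()

    def on_message(self, message):
        """Record the arrival time of every program_status message."""
        if message.get("type") == "program_status":
            self.last_seen = self.clock()

    def active_is_alive(self):
        """True while the last heartbeat is no older than the timeout."""
        return self.clock() - self.last_seen <= self.timeout


class FakeClock:
    """Manually advanced clock so the demonstration does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
monitor = HeartbeatMonitor(timeout=30.0, clock=clock)

clock.now = 20.0
monitor.on_message({"type": "program_status"})  # heartbeat arrives at t=20

clock.now = 45.0
alive_before = monitor.active_is_alive()        # 25 s since the last heartbeat

clock.now = 60.0
alive_after = monitor.active_is_alive()         # 40 s since the last heartbeat

print(alive_before, alive_after)
```

What the passive node does once the heartbeat lapses (for example, taking over the VIF) is outside this sketch.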

Why not use one of these?

• LiveStatus – live state query module with check execution workers

• Mod_gearman – distributed check execution based on gearman job queue

• Merlin – database/distributed backend for Nagios
• NDOUtils – database backend for Nagios
• NSCA – allows check/command submission over network
• NRPE – remote check executor

API – not a product

• NagMQ is just an interface into Nagios, not a product
• Better communication with clients comes from larger ZeroMQ project – leaving NagMQ to focus on Nagios
• Implement ad-hoc tools for Nagios without having to write any compiled code
• Doing expensive data processing of monitoring data doesn’t have to create latency in monitoring system
• Re-use one interface for many tools

Future Work

• Pluggable authentication/encryption for NagMQ
• Pluggable parser/emitter for custom data formats (XML, YAML, etc.)
• NDOUtils database replacement
• More user interfaces (Jabber, SMS, email gateway, REST API)
• Nagios 4

NagMQ

https://github.com/jbreams/nagmq

Jonathan Reams

[email protected]