HEPSYSMAN Conference 29 April 2003 – Ian Stokes-Rees Grid Monitoring using Nagios and RRDtool Ian Stokes-Rees Oxford Particle Physics.

Grid Monitoring using Nagios and RRDtool

Ian Stokes-Rees, Oxford Particle Physics
HEPSYSMAN Conference, 29 April 2003

In a perfect world …

• Individual node status
  o Is it up?
  o What is its load?
  o What is the memory and swap usage?
  o NFS and network load?
  o Are the partitions full?
  o Are applications and services running properly?

• Amalgamated node status
  o Same info, but across groups of nodes

In a perfect world …

• Historical information
  o Trends

• Notification of service states
  o e.g. Storage down to 100 megs free = Warning
  o Storage down to 10 megs free = Critical
  o sshd no longer running = Failure
  o notify by email, pager, mobile

• Easy access to monitoring information
  o web, email, digest, mobile

In a perfect world …

• Avoidance of “Too many red flashing lights”
  o “Just the facts, ma’am”: only want root cause failures to be reported, not a cascade of every downstream failure.
  o also includes avoiding unnecessary checks
  o e.g. HTTP responding, therefore no need to ping
  o e.g. power outage, doesn’t ping, so don’t bother trying anything else

• Other wish list requirements?

Aspects of Current Grid Monitoring

1. LDAP (Lightweight Directory Access Protocol) is the current foundation for MDS. It is designed for frequent reads and infrequent writes.

2. MDS (Monitoring and Discovery Service) uses LDAP for maintaining static and dynamic system details.

3. R-GMA (Relational Grid Monitoring Architecture) is meant to address the shortcomings of the LDAP-based MDS system by using a hierarchy of relational databases. Now being deployed.

4. GRIS (Grid Resource Information Service) stores details about the state of “the grid” (at least from the local node)

5. GIIS (Grid Index Information Service) ties together several GRISes

6. HBM (Heart Beat Monitor) monitors Globus services; it seems to have died a quiet death

Existing Grid Monitoring Lacks…

• Historical information for trends
• Simple interface for accessing information
• Automated response to changes in system state

Here is where RRDtool and Nagios can contribute

RRDtool

• Round Robin Database for time series data storage

• Command line based
• From the author of MRTG
• Made to be faster and more flexible
• Includes CGI and Graphing tools, plus APIs
• Solves the Historical Trends and Simple Interface problems

www.rrdtool.com

Define Data Sources (Inputs)

• DS:speed:COUNTER:600:U:U
• DS:fuel:GAUGE:600:U:U

  o DS = Data Source
  o speed, fuel = “variable” names
  o COUNTER, GAUGE = variable type
  o 600 = heart beat; UNKNOWN is returned for the interval if nothing is received after this amount of time
  o U:U = limits on minimum and maximum variable values (U means unknown and any value is permitted)
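The difference between the two variable types is easiest to see with numbers. The sketch below is a simplified model, not RRDtool's own code: it assumes a COUNTER ends up stored as the per-second rate of change between two consecutive readings, a GAUGE is stored as reported, and a gap longer than the heart beat yields UNKNOWN (modelled here as None). Counter wrap-around and the U:U limits are ignored, and the function name is illustrative.

```python
from typing import Optional

HEARTBEAT = 600  # seconds, as in DS:...:600:U:U


def stored_value(ds_type: str, prev_time: int, prev_value: float,
                 time: int, value: float) -> Optional[float]:
    """Simplified model of the value stored for one update."""
    elapsed = time - prev_time
    if elapsed <= 0:
        raise ValueError("updates must move forward in time")
    if elapsed > HEARTBEAT:
        return None  # UNKNOWN: nothing arrived within the heart beat
    if ds_type == "GAUGE":
        return value  # stored as reported
    if ds_type == "COUNTER":
        return (value - prev_value) / elapsed  # rate per second
    raise ValueError(f"type not covered by this sketch: {ds_type}")


# Two odometer readings five minutes apart: 12 km in 300 s.
print(stored_value("COUNTER", 0, 12345, 300, 12357))  # 0.04 (km per second)
# Two fuel readings five minutes apart: the latest level is kept.
print(stored_value("GAUGE", 0, 7.0, 300, 5.8))        # 5.8 (litres)
```

This is why an odometer suits COUNTER (only the change matters) while a fuel level suits GAUGE (the reading itself matters).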

Define Archives (Outputs)

• RRA:AVERAGE:0.5:1:24
• RRA:AVERAGE:0.5:6:10

  o RRA = Round Robin Archive
  o AVERAGE = consolidation function
  o 0.5 = up to 50% of consolidated points may be UNKNOWN
  o 1:24 = this RRA keeps each sample (average over one 5 minute primary sample), 24 times (which is 2 hours worth)
  o 6:10 = this RRA keeps an average over every six 5 minute primary samples (30 minutes), 10 times (which is 5 hours worth)

• Clear as mud!
  o all depends on original step size which defaults to 5 minutes

RRDtool Database Format

One RRD file with --step 300 (5 minute input step size) holds three archives:

• RRA 1:24: recent data stored once every 5 minutes for the past 2 hours
• RRA 6:10: medium length data averaged to one entry per half hour for the last 5 hours
• RRA 288:365: old data averaged to one entry per day for the last 365 days
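The retention periods quoted for each archive follow from one multiplication: step size times primary samples per row times number of rows. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def rra_coverage_seconds(step: int, steps_per_row: int, rows: int) -> int:
    """Time span covered by one RRA, in seconds."""
    return step * steps_per_row * rows


STEP = 300  # seconds, the 5 minute input step

for steps_per_row, rows in [(1, 24), (6, 10), (288, 365)]:
    seconds = rra_coverage_seconds(STEP, steps_per_row, rows)
    print(f"RRA {steps_per_row}:{rows} covers {seconds / 3600:g} hours")
# RRA 1:24 covers 2 hours
# RRA 6:10 covers 5 hours
# RRA 288:365 covers 8760 hours (365 days)
```

Changing the step size changes every figure, which is why the same 1:24 archive would cover only 24 minutes with a 60 second step.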

RRDtool Example

• Monitoring a car: fuel in the tank plus odometer

    Time    Odometer    Fuel     Event
    12:05   12345 KM    7.0 L
    12:10   12357 KM    5.8 L
    12:15   12363 KM    5.2 L    STOP
    12:20   12363 KM    5.2 L
    12:25   12363 KM    5.2 L    RESTART
    12:30   12373 KM    4.2 L
    12:35   12383 KM    3.2 L
    12:40   12393 KM    2.2 L
    12:45   12399 KM    1.6 L
    12:50   12405 KM    9.0 L    REFUEL
    12:55   12411 KM    8.4 L
    13:00   12415 KM    8.0 L
    13:05   12420 KM    7.5 L
    13:10   12422 KM    7.3 L
    13:15   12423 KM    7.2 L

RRDtool Example

• Create an RRD to store distance and fuel

    rrdtool create car.rrd \
        --start 920804400 \
        DS:speed:COUNTER:600:U:U \
        DS:fuel:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10

• --start defines the earliest time the RRD accepts
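The value given to --start is a Unix timestamp (seconds since 1 January 1970 UTC). A quick way to see which moment it refers to, using only the Python standard library:

```python
from datetime import datetime, timezone

start = 920804400  # value passed to --start
print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat())
# 1999-03-07T11:00:00+00:00
```

The car example labels its first reading 12:05, five minutes after the start time, so the clock times in that example appear to be local times one hour ahead of UTC. That reading is an inference from the numbers, not something the slides state.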

RRDtool Example

• Input data:

    rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
    rrdtool update car.rrd 920805300:12363:5.2 920805600:12363:5.2
    rrdtool update car.rrd 920805900:12363:5.2 920806200:12373:4.2
    rrdtool update car.rrd 920806500:12383:3.2 920806800:12393:2.2
    rrdtool update car.rrd 920807100:12399:1.6 920807400:12405:9.0
    rrdtool update car.rrd 920807700:12411:8.4 920808000:12415:8.0
    rrdtool update car.rrd 920808300:12420:7.5 920808600:12422:7.3
    rrdtool update car.rrd 920808900:12423:7.2
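Each update argument has the form timestamp:speed:fuel, with values in the same order as the DS definitions and timestamps spaced one step (300 seconds) apart. The sketch below rebuilds the arguments from the odometer and fuel readings, assuming one reading per step beginning one step after the start time. The helper name is illustrative; the script only prints text and does not call rrdtool.

```python
START = 920804400  # value passed to --start
STEP = 300         # seconds between readings

# (odometer in km, fuel in litres), in the order the readings were taken
readings = [
    (12345, 7.0), (12357, 5.8), (12363, 5.2), (12363, 5.2), (12363, 5.2),
    (12373, 4.2), (12383, 3.2), (12393, 2.2), (12399, 1.6), (12405, 9.0),
    (12411, 8.4), (12415, 8.0), (12420, 7.5), (12422, 7.3), (12423, 7.2),
]


def update_args(readings, start=START, step=STEP):
    """Build 'timestamp:odometer:fuel' strings, one per reading."""
    return [f"{start + step * (i + 1)}:{km}:{fuel}"
            for i, (km, fuel) in enumerate(readings)]


args = update_args(readings)
print("rrdtool update car.rrd " + " ".join(args[:2]))
# rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
```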

RRDtool Graphing

• Now with data in the RRD, RRDtool can generate graphs:

    rrdtool graph speed.gif \
        --start 920804400 --end 920808000 \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,\* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
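The CDEF expression myspeed,1000,* is written in reverse Polish notation: operands come first, then the operator. Here it multiplies the stored speed (km per second) by 1000 to get m/s, matching the vertical label. The toy evaluator below illustrates the notation only; it handles the four basic arithmetic operators and is not RRDtool's implementation, which supports many more.

```python
def eval_rpn(expression: str, variables: dict) -> float:
    """Evaluate a comma-separated RPN expression with + - * / only."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in expression.split(","):
        if token in ops:
            b = stack.pop()  # right operand is on top of the stack
            a = stack.pop()
            stack.append(ops[token](a, b))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


# 0.04 km/s (12 km in 5 minutes) expressed in m/s
print(eval_rpn("myspeed,1000,*", {"myspeed": 0.04}))  # 40.0
```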

RRDtool Graphing Output

• Much more interesting graphs possible
• Multiple RRDs may be used as sources for variables
• Auto-interpolation of points
• Functions and calculations can be applied to variables
• Legends, labels, and text can be inserted

[Figure: RRDtool graphing output]

Nagios

• Instantaneous service level monitoring
• Web based interface
• Somewhat complicated set of configuration files to manually edit
• Automated notification of change in service level (email, phone, etc.)
• Defines WARNING, CRITICAL, FAILED levels

www.nagios.org

What Do We Want to Monitor?

    Static             Dynamic                Services
    CPU (SPECint)      Load                   Live
    RAM (swap)         Mem/swap usage         Accessible
    HD capacity        Storage available      Globus
    Network b/w        Network utilisation    SSH
    OS                 Users                  Etc.
    Applications       Processes
    Location, Admin    Queues (PBS)

Nagios Host Definitions

• Define details about each node and their hierarchy in the network:

    define host{
        host_name               tbce01
        alias                   Testbed CE
        address                 163.1.243.105
        parents                 edg-testbed
        notifications_enabled   1
        process_perf_data       1
        check_command           check-host-alive
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }
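The parents entry is what lets a monitor tell a host that is really down from one that is merely unreachable because everything between it and the server has failed, which is the root cause reporting asked for earlier. The sketch below is a simplified model of that idea, not Nagios's own logic. It assumes an acyclic parent graph, and the host tbse01 is a hypothetical second child added for illustration.

```python
def classify_hosts(parents: dict, responding: set) -> dict:
    """Label each host UP, DOWN, or UNREACHABLE.

    parents maps a host name to the list of its parent hosts
    (an empty list for hosts directly attached to the monitor).
    """
    status = {}

    def resolve(host):
        if host in status:
            return status[host]
        if host in responding:
            status[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # A silent host is DOWN if it has no parents or if some
            # parent is UP; otherwise the fault lies further upstream.
            if not host_parents or any(resolve(p) == "UP" for p in host_parents):
                status[host] = "DOWN"
            else:
                status[host] = "UNREACHABLE"
        return status[host]

    for host in parents:
        resolve(host)
    return status


topology = {
    "edg-testbed": [],
    "tbce01": ["edg-testbed"],
    "tbse01": ["edg-testbed"],  # hypothetical host, for illustration
}

# Nothing answers: only the parent is reported as the root cause.
print(classify_hosts(topology, responding=set()))
# The parent answers but tbce01 does not: tbce01 itself is DOWN.
print(classify_hosts(topology, responding={"edg-testbed", "tbse01"}))
```

With notification_options d,u,r the host above asks to be notified on down, unreachable, and recovery events.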

Nagios Service Definitions

• Define details about each service:

    define service{
        name                    ping
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          linux-admins
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
    }
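In check_command, the ! characters separate the command name from its arguments. For check_ping the two arguments are the warning and critical thresholds, each a pair of round trip time in milliseconds and packet loss percentage. The sketch below shows how such a line can be read and applied. Treating a value equal to the threshold as a breach is an assumption made here, and the function names are illustrative.

```python
def parse_check_command(check_command: str):
    """Split 'name!arg1!arg2' into the name and its argument list."""
    name, *args = check_command.split("!")
    return name, args


def parse_threshold(arg: str):
    """Turn '100.0,20%' into (round trip ms, packet loss percent)."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_state(rta_ms: float, loss_pct: float, warning, critical) -> str:
    """Classify one ping measurement against the two thresholds."""
    if rta_ms >= critical[0] or loss_pct >= critical[1]:
        return "CRITICAL"
    if rta_ms >= warning[0] or loss_pct >= warning[1]:
        return "WARNING"
    return "OK"


name, (warn_arg, crit_arg) = parse_check_command("check_ping!100.0,20%!500.0,60%")
warning = parse_threshold(warn_arg)    # (100.0, 20.0)
critical = parse_threshold(crit_arg)   # (500.0, 60.0)

print(name, ping_state(35.0, 0.0, warning, critical))    # check_ping OK
print(name, ping_state(150.0, 0.0, warning, critical))   # check_ping WARNING
print(name, ping_state(20.0, 60.0, warning, critical))   # check_ping CRITICAL
```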

Nagios Service and Host Polling

• Pull model, where Nagios server executes command to fetch host or service status

• Requires remote hosts and services to cooperate
  o NRPE installed on clients allows server to execute “plugins” to poll for information
  o Alternatively use existing client reporting mechanisms (ping, wget, http)

• Server responsible for configuration of polling intervals and details to be polled
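A plugin in this pull model is a small program that the server runs, directly or through NRPE, and that reports back through its exit status plus one line of text. The usual convention is 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The sketch below applies the storage thresholds mentioned earlier in the talk (100 MB free for warning, 10 MB free for critical). The function names and message wording are illustrative, not those of any shipped plugin.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
MB = 1024 * 1024


def disk_state(free_mb: float, warn_mb: float = 100, crit_mb: float = 10):
    """Map free space to a (status code, one-line message) pair."""
    if free_mb <= crit_mb:
        return CRITICAL, f"DISK CRITICAL - {free_mb:.0f} MB free"
    if free_mb <= warn_mb:
        return WARNING, f"DISK WARNING - {free_mb:.0f} MB free"
    return OK, f"DISK OK - {free_mb:.0f} MB free"


def check_path(path: str = "/"):
    """Measure free space on the local file system holding path."""
    try:
        free_mb = shutil.disk_usage(path).free / MB
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return disk_state(free_mb)


code, message = disk_state(42)
print(code, message)  # 1 DISK WARNING - 42 MB free
```

A real plugin would print the message from check_path and exit with the returned code, for example via sys.exit(code), so that the server can read both.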

Nagios Service and Host Reporting

• Push model, where services and hosts decide when to report status to Nagios server
  o push data when available/relevant
  o generally full access to node-local data
  o requires configuring every node independently
  o authentication of nodes at server
  o nodes need to know who to send data to
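In the push model, a node hands the server a finished result instead of waiting to be polled. Nagios accepts such passive results as external commands of the form [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output. The sketch below only formats that line. How it reaches the server (which command file, which transport) depends on the installation and is deliberately left out, and the exact format should be checked against the documentation for the Nagios version in use.

```python
import time


def passive_service_result(host: str, service: str, return_code: int,
                           output: str, timestamp=None) -> str:
    """Format one passive service check result as an external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")


# Fixed example timestamp so the output is reproducible.
print(passive_service_result("tbce01", "ping", 0, "PING OK", timestamp=1051574400))
# [1051574400] PROCESS_SERVICE_CHECK_RESULT;tbce01;ping;0;PING OK
```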

Host and Service Status

[Figures: Nagios host and service status screens]

Finally, some other monitors

• NWS (Network Weather Service): attempts to predict network utilisation from historical information

• Ganglia: cluster monitoring system that provides aggregate graphs of cluster performance; Globus/EDG tie-ins underway

• Map Center: EDG project to monitor Grid status and services

• ActiveMap, GridPortal, and InfoPortal* appear to be inactive projects