How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

Hadoop Monitoring Best Practices

Jan 22, 2015

Edward Capriolo

Monitoring Hadoop with Cacti and Nagios
Transcript
  • 1. How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

2. Relevant Hadoop Information

  • Scales from 3 to 3,000 nodes
  • Hardware/Software failures common
  • Redundant Components DataNode, TaskTracker
  • Non-redundant Components NameNode, JobTracker, SecondaryNameNode
  • Fast Evolving Technology (Best Practices?)

3. Monitoring Software

  • Nagios
    • Red Yellow Green Alerts, Escalations
    • De facto standard, widely deployed
    • Text-based configuration
    • Web Interface
    • Pluggable with shell scripts/external apps
      • Return 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
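That exit-code contract is the entire plugin interface: print one status line and return the matching code. A minimal sketch (the checked value and thresholds are made up for illustration, not from the deck):

```shell
#!/bin/sh
# Minimal Nagios plugin sketch: print one status line and exit
# 0 (OK), 1 (WARNING), or 2 (CRITICAL). Thresholds are illustrative.
check_threshold() {
  value=$1 warn=$2 crit=$3
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - value=$value (>= $crit)"; return 2
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - value=$value (>= $warn)"; return 1
  fi
  echo "OK - value=$value"; return 0
}

check_threshold 50 80 90   # prints: OK - value=50
```

Any script following this pattern can be wired into a Nagios check_command with no further glue.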

4. Cacti

  • Performance Graphing System
  • RRD/RRA Front End
  • Slick Web Interface
  • Template System for Graph Types
  • Pluggable
    • SNMP input
    • Shell script /external program
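As a sketch of the "shell script / external program" input: Cacti's script data sources parse space-separated `field:value` pairs printed on one line of stdout. The field names and the canned meminfo sample below are illustrative; a real deployment would read `/proc/meminfo` directly.

```shell
#!/bin/sh
# Sketch of a Cacti data-input script: Cacti's script/command data
# source expects space-separated "field:value" pairs on one line.
meminfo_to_cacti() {
  # $1: path to a meminfo-style file (/proc/meminfo on a real node)
  awk '/^MemTotal:/ {t=$2} /^SwapFree:/ {s=$2}
       END {printf "mem_total:%s swap_free:%s\n", t, s}' "$1"
}

# Canned sample so the sketch is self-contained:
cat > /tmp/meminfo.sample <<'EOF'
MemTotal:       16384 kB
MemFree:         8192 kB
SwapFree:        4096 kB
EOF
meminfo_to_cacti /tmp/meminfo.sample   # prints: mem_total:16384 swap_free:4096
```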

5. 6. hadoop-cacti-jtg

  • JMX Fetching Code w/ (kick off) scripts
  • Cacti templates For Hadoop
  • Premade Nagios Check Scripts
  • Helper/Batch/automation scripts
  • Apache License

7. Hadoop JMX

8. A Sample Cluster p1

  • NameNode & SecNameNode
    • Hardware RAID
    • 8 GB RAM
    • 1x QUAD CORE
    • DerbyDB (hive) on SecNameNode
  • JobTracker
    • 8GB RAM
    • 1x QUAD CORE

9. A Sample Cluster p2

  • Slave (hadoopdata1-XXXX)
    • JBOD 8x 1TB SATA Disk
    • RAM 16GB
    • 2x Quad Core

10. Prerequisites

  • Nagios (install) DAG RPMs
  • Cacti (install) Several RPMS
  • Liberal network access to the cluster

11. Alerts & Escalations

  • X Nodes * Y Services = Less Sleep
  • Define a policy
    • Wake Me Ups (SMS)
    • Don't Wake Me Ups (EMAIL)
    • Review (Daily, Weekly, Monthly)
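One way to encode such a policy in Nagios object configuration (the contact group and service names here are illustrative; the directives themselves are standard Nagios):

```
# "Wake Me Up": page the SMS on-call group from the first alert.
define service {
        service_description     check_remote_namenode
        host_name               hadoopname1
        use                     generic-service
        check_command           check_remote_namenode!50070
        contact_groups          hadoop-oncall-sms     ; illustrative group
        notification_interval   15
}

# "Don't Wake Me Up": email first; escalate to SMS only if it persists.
define serviceescalation {
        host_name               hadoopdata1
        service_description     check_data_node
        first_notification      3
        last_notification       0
        contact_groups          hadoop-oncall-sms
        notification_interval   30
}
```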

12. Wake Me Ups

  • NameNode
    • Disk Full (Big Big Headache)
    • RAID Array Issues (failed disk)
  • JobTracker
  • SecNameNode
    • Don't realize too late that it is not working

13. Don't Wake Me Ups

  • Or Wake someone else up
  • DataNode
    • Warning: currently a failed disk will take down the DataNode (see Jira)
  • TaskTracker
  • Hardware
    • Bad Disk (Start RMA)
  • Slaves are expendable (up to a point)

14. Monitoring Battle Plan

  • Start With the Basics
    • Ping, Disk
  • Add Hadoop Specific Alarms
    • check_data_node
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

15. The Basics Nagios

  • Nagios (All Nodes)
    • Host up (Ping check)
    • Disk % Full
    • SWAP > 85 %
  • Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville

16. The Basics Cacti

  • Cacti (All Nodes)
    • CPU (full CPU)
    • RAM/SWAP
    • Network
    • Disk Usage

17. Disk Utilization

18. RAID Tools

  • hpacucli (not a Street Fighter move)
    • Alerts on RAID events (NameNode)
      • Disk failed
      • Rebuilding
    • JBOD (DataNode)
      • Failed Drive
      • Drive Errors
  • Dell, SUN, Vendor Specific Tools

19. Before you jump in

  • X Nodes * Y Checks = Lots of work
  • About 3 Nodes into the process
    • Wait!!! I need some interns!!!
  • Solution: S.I.C.C.T. (Semi-Intelligent Configuration Cloning Tools)
    • (I made that up)
    • (for this presentation)

20. Nagios

  • Answers: IS IT RUNNING?
  • Text based Configuration

21. Cacti

  • Answers: HOW WELL IS IT RUNNING?
  • Web Based configuration
    • php-cli tools

22. Monitoring Battle Plan Thus Far

  • Start With the Basics
    • Ping, Disk !!!!!!Done!!!!!!
  • Add Hadoop Specific Alarms
    • check_data_node
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

23. Add Hadoop Specific Alarms

  • Hadoop Components with a Web Interface
    • NameNode 50070
    • JobTracker 50030
    • TaskTracker 50060
    • DataNode 50075
  • check_http + regex = simple + effective
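The same idea can be sketched in plain shell: fetch a daemon's status page and require a regex match, returning Nagios exit codes. The host, port, and page follow the deck's NameNode example; `curl` is assumed available where Nagios's stock check_http would normally be used instead.

```shell
#!/bin/sh
# check_http-style probe: fetch a Hadoop daemon's status page and
# require it to match a pattern; exit 0 (OK) or 2 (CRITICAL).
check_hadoop_web() {
  url=$1 pattern=$2
  if curl -sf --max-time 10 "$url" | grep -q "$pattern"; then
    echo "OK - matched '$pattern' at $url"; return 0
  fi
  echo "CRITICAL - no '$pattern' at $url"; return 2
}

# Real use (sketch):
# check_hadoop_web "http://hadoopname1:50070/dfshealth.jsp" "NameNode"
```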

24. nagios_check_commands.cfg

  • Component Failure
  • (Future) Newer Hadoop will have XML status

define command {
        command_name    check_remote_namenode
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
        service_description     check_remote_namenode
        use                     generic-service
        host_name               hadoopname1
        check_command           check_remote_namenode!50070
}

25. Monitoring Battle Plan

  • Start With the Basics
    • Ping, Disk (Done)
  • Add Hadoop Specific Alarms
    • check_data_node (Done)
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

26. JMX Graphing

  • Enable JMX
  • Import Templates
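"Enable JMX" amounts to adding the standard JVM remote-JMX flags to the daemon options in hadoop-env.sh. A sketch: the port numbers are arbitrary choices, and authentication/SSL are disabled here only for brevity; lock these down on a real cluster.

```shell
# hadoop-env.sh (sketch): expose JMX on the NameNode and DataNode.
# Ports are arbitrary; authenticate=false / ssl=false is for brevity only.
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8005 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```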

27.-29. JMX Graphing

31. Standard Java JMX

32. Monitoring Battle Plan Thus Far

  • Start With the Basics !!!!!!Done!!!!!
    • Ping, Disk
  • Add Hadoop Specific Alarms !Done!
    • check_data_node
  • Add JMX Graphing !Done!
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

33. Add JMX based Alarms

  • hadoop-cacti-jtg is flexible
    • extend fetch classes
    • Don't call output()
    • Write your own check logic
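A sketch of what that check logic looks like once wrapped as a Nagios plugin, using the LiveNodes < 50% example from the battle plan. How the live-node count is fetched (via an extended hadoop-cacti-jtg fetch class, per the bullets above) is left abstract here; only the alarm logic is shown.

```shell
#!/bin/sh
# JMX-based Nagios alarm sketch: CRITICAL when fewer than half the
# expected DataNodes report in (LiveNodes < 50%). The fetch step is
# abstract; a hadoop-cacti-jtg subclass would supply the numbers.
check_live_nodes() {
  live=$1 expected=$2
  if [ $((live * 2)) -lt "$expected" ]; then
    echo "CRITICAL - LiveNodes=$live of $expected"; return 2
  fi
  echo "OK - LiveNodes=$live of $expected"; return 0
}

# live=$(java ... )   # fetch LiveNodes over JMX (abstract)
check_live_nodes 60 100   # prints: OK - LiveNodes=60 of 100
```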

34. Quick JMX Base Walkthrough

  • url, user, pass, object specified from CLI
  • wantedVariables, wantedOperations by inheritance
  • fetch() and output() provided

35. Extend for NameNode

36. Extend for Nagios

37. Monitoring Battle Plan

  • Start With the Basics !DONE!
    • Ping, Disk
  • Add Hadoop Specific Alarms !DONE!
    • check_data_node
  • Add JMX Graphing !DONE!
    • NameNodeOperations
  • Add JMX Based alarms !DONE!
    • FilesTotal > 1,000,000 or LiveNodes < 50%

38. Review

  • File System Growth
    • Size
    • Number of Files
    • Number of Blocks
    • Ratios
  • Utilization
    • CPU/Memory
    • Disk
  • Email (nightly)
    • FSCK
    • DFSADMIN
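The nightly email can be a pair of cron jobs around the stock CLI calls; cron mails each job's stdout to MAILTO. The address and schedule below are assumptions; `hadoop fsck /` and `hadoop dfsadmin -report` are the standard commands.

```
# Hypothetical crontab for the nightly review mail
MAILTO=hadoop-ops@example.com
# Filesystem health summary (last lines carry the verdict)
0 6 * * *  hadoop fsck / 2>&1 | tail -20
# Capacity and per-DataNode report
15 6 * * * hadoop dfsadmin -report 2>&1
```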

39. The Future

  • JMX Coming to JobTracker and TaskTracker (0.21)
    • Collect and Graph Jobs Running
    • Collect and Graph Map / Reduce per node
    • Profile Specific Jobs in Cacti?