Health Monitoring circa 1999
• Nagios Core
• Event scheduler • Event processor • Alert manager
• Host groups config • Ping • HTTP • SSH
• Nagios Remote Plugin Executor • SNMP • load • disk
photo credit: https://en.wikipedia.org/wiki/Nagios
Health Monitoring circa 2012
• AMI • Chef / Ansible
• ELB / Health Check
  • Protocol: HTTP (or HTTPS, TCP, SSL)
  • Port: 80
  • Path: /index.html
  • Timeout / Interval: 5s / 30s
  • Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
  • Loss of network
  • Loss of power
  • Host software problems
  • Host hardware problems
• ASG
photo credit: http://aws.amazon.com/architecture/ http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
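The ELB health check settings on this slide map onto the classic Elastic Load Balancing `ConfigureHealthCheck` call. The following is a minimal sketch, assuming a boto3 classic `elb` client is passed in by the caller; the helper names and the load balancer name `web-elb` are hypothetical.

```python
def health_check_config():
    """Return the slide's health check settings in classic ELB form."""
    return {
        "Target": "HTTP:80/index.html",  # protocol:port/path
        "Timeout": 5,                    # seconds to wait for a response
        "Interval": 30,                  # seconds between checks
        "UnhealthyThreshold": 2,         # consecutive failures before unhealthy
        "HealthyThreshold": 10,          # consecutive successes before healthy
    }


def apply_health_check(elb_client, load_balancer_name):
    """Apply the settings with a client such as boto3.client("elb")."""
    return elb_client.configure_health_check(
        LoadBalancerName=load_balancer_name,
        HealthCheck=health_check_config(),
    )
```

With real credentials this would be called as `apply_health_check(boto3.client("elb"), "web-elb")`.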
But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit: http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
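Feeding an external monitor's verdict back to the ASG can be done with the Auto Scaling `SetInstanceHealth` API. The following is a minimal sketch, assuming a boto3 `autoscaling` client is passed in; the function name `mark_unhealthy` and the instance IDs are hypothetical.

```python
def mark_unhealthy(autoscaling_client, instance_ids):
    """Report each instance as Unhealthy so the ASG can replace it."""
    marked = []
    for instance_id in instance_ids:
        autoscaling_client.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",
            # Assumption: honor the group's health check grace period.
            ShouldRespectGracePeriod=True,
        )
        marked.append(instance_id)
    return marked
```

A monitoring system would call this with `boto3.client("autoscaling")` once it decides an instance is beyond repair.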
Health Monitoring circa 2016, the age of containers
• Generic AMI • Docker
• ECS • Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks • Simple monitoring container
photo credit: https://github.com/docker/swarm
[Diagram: three EC2 instances in an ASG, each running ecs-agent and dockerd, form an ECS cluster. Containers scheduled across them: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), data worker.1 and worker.2 (512 MB each), rails worker.1 to worker.4 (256 MB each). An api ELB and a rails ELB sit in front.]
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
photo credit: http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task
Container Schedulers are the new watchman
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
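The re-scheduling idea can be illustrated with a toy reconciliation step: compare the desired task count per service with what is running and start the difference. This is only a sketch of the general technique, not how ECS is implemented; the function name and the service names are hypothetical.

```python
def reconcile(desired, running):
    """Return how many tasks to start per service to restore desired counts.

    desired: e.g. {"web": 3, "worker": 4}
    running: e.g. {"web": 2, "worker": 4}
    """
    to_start = {}
    for service, count in desired.items():
        missing = count - running.get(service, 0)
        if missing > 0:
            to_start[service] = missing
    return to_start
```

If web.2 crashes, `running["web"]` drops to 2 and the scheduler starts one replacement task.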
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need to configure an ASG to maintain capacity…
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need a monitor…
Health Monitoring circa 2016, the age of containers
• Schedule a monitor process in container cluster
• Describe ASG and ECS membership
• Mark all instances unregistered with ECS unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker unhealthy
• Mark instances that fail user space health check unhealthy
No Nagios server + plugins!
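The monitor's decision steps above can be sketched as a pure function over data the monitor has already gathered. This is a hedged illustration, not the open source monitor's actual code; the function name, argument shapes, and reason strings are assumptions.

```python
def unhealthy_instances(asg_instances, ecs_instances, docker_ok, check_ok):
    """Map each unhealthy instance ID to the reason it was flagged.

    asg_instances: set of instance IDs in the ASG
    ecs_instances: set of instance IDs registered with ECS
    docker_ok:     dict of instance ID -> True if the Docker connection worked
    check_ok:      dict of instance ID -> True if the user space check passed
    """
    unhealthy = {}
    for instance_id in sorted(asg_instances):
        if instance_id not in ecs_instances:
            unhealthy[instance_id] = "not registered with ECS"
        elif not docker_ok.get(instance_id, False):
            unhealthy[instance_id] = "cannot connect to Docker"
        elif not check_ok.get(instance_id, False):
            unhealthy[instance_id] = "user space health check failed"
    return unhealthy
```

Each returned instance ID would then be reported to the ASG as unhealthy, so that the group replaces it.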
Partial Failure Scenarios (battle scars)
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
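Two of the partial failures above, a full disk and a read-only filesystem, can be probed from user space with the Python standard library. This is a minimal sketch for Unix hosts; the function names and the 90% threshold are assumptions, not values from the talk.

```python
import os
import shutil


def disk_full(path="/", threshold=0.9):
    """Return True if the filesystem holding path is at least threshold full."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat them as not full.
        return False
    return usage.used / usage.total >= threshold


def read_only(path="/"):
    """Return True if the filesystem holding path is mounted read-only."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)
```

Run inside a container, these only see the host's disks if the relevant paths are mounted into it.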
User Space Health Check
$ docker run busybox sh -c \
    'dmesg | grep "Remounting filesystem read-only"'
# why not: $ docker run health-check
To package, distribute and run common binaries and scripts (top, netstat, smartmontools, etc.)
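A monitor could drive the `docker run` check above from code and interpret its exit status. The following is a minimal sketch, assuming the `docker` CLI is on the PATH; the function name is hypothetical, and the `--rm` flag is an addition not shown on the slide.

```python
import subprocess

# The shell pipeline from the slide; its exit status is grep's.
CHECK = 'dmesg | grep "Remounting filesystem read-only"'


def filesystem_remounted_read_only(run=subprocess.run):
    """Return True if the kernel log reports a read-only remount.

    grep exits 0 on a match and 1 on no match. Any other status means
    the check itself could not run, which is raised as an error.
    """
    result = run(
        # --rm (not on the slide) removes the container once it exits.
        ["docker", "run", "--rm", "busybox", "sh", "-c", CHECK],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(f"health check could not run (exit {result.returncode})")
```

The `run` parameter exists so the function can be exercised without Docker. A monitor might treat the raised error as a failed connection to Docker and mark the instance unhealthy for that reason instead.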
Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or [email protected]