Host Health Monitoring with Docker Run


Noah Zoschke @nzoschke

10/28/2015

https://twitter.com/nzoschke


Health Monitoring circa 1999

• Nagios Core
  • Event scheduler
  • Event processor
  • Alert manager

• Host groups config
  • Ping
  • HTTP
  • SSH

• Nagios Remote Plugin Executor
  • SNMP
  • load
  • disk

photo credit: https://en.wikipedia.org/wiki/Nagios


Health Monitoring circa 2012

• AMI
  • Chef / Ansible

• ELB / Health Check
  • Protocol: HTTP (or HTTPS, TCP, SSL)
  • Port: 80
  • Path: /index.html
  • Timeout / Interval: 5s / 30s
  • Unhealthy / Healthy Threshold: 2 / 10

• EC2 / Status Checks
  • Loss of network
  • Loss of power
  • Host software problems
  • Host hardware problems

• ASG

photo credit: http://aws.amazon.com/architecture/ http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
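The Unhealthy / Healthy Threshold pair above can be read as a small state machine over consecutive probe results. The following is a minimal sketch of that idea, not ELB's actual implementation; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Tracks one target's state from consecutive probe results."""
    unhealthy_threshold: int = 2   # consecutive failures before "unhealthy"
    healthy_threshold: int = 10    # consecutive successes before "healthy"
    healthy: bool = True
    streak: int = 0                # consecutive results contradicting the current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            # The result agrees with the current state, so the opposing streak resets.
            self.streak = 0
            return self.healthy
        self.streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= limit:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

With a 30s interval, two failed probes take on the order of a minute to mark a target unhealthy, while ten successful probes take about five minutes to mark it healthy again.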


But you probably still need…

• Nagios for monitoring

• or Zabbix, Ganglia, Sensu…

• or OpsView, SolarWinds…

• or Pingdom, Datadog…

• To provide system feedback

• ASG SetInstanceHealth

photo credit: http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
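Feeding monitor results back into the ASG could look roughly like the sketch below. It assumes a boto3-style Auto Scaling client, whose `set_instance_health` call maps to the SetInstanceHealth API; the helper name, the fake client and the instance ids are hypothetical.

```python
def mark_unhealthy(autoscaling, instance_ids):
    """Report each instance as Unhealthy so the ASG can replace it.

    `autoscaling` is any object with a boto3-style set_instance_health
    method, e.g. boto3.client("autoscaling").
    """
    marked = []
    for instance_id in instance_ids:
        autoscaling.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",
            ShouldRespectGracePeriod=True,
        )
        marked.append(instance_id)
    return marked


class FakeAutoScaling:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def set_instance_health(self, **kwargs):
        self.calls.append(kwargs)
```

Against a real account the call would be `mark_unhealthy(boto3.client("autoscaling"), [...])`; the fake client only exists so the sketch can be exercised without AWS credentials.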


Health Monitoring circa 2016, the age of containers

• Generic AMI
  • Docker

• ECS
  • Container scheduling and re-scheduling as a service

• ASG / EC2 / Status Checks
  • Simple monitoring container

photo credit: https://github.com/docker/swarm


[Diagram: an ASG of three ECS instances, each running ecs-agent and dockerd, hosting the containers api (128 MB), registry (256 MB), rails web.1, web.2 and web.3 (1024 MB each), data worker.1 and worker.2 (512 MB each) and rails worker.2 (256 MB), behind an api ELB and a rails ELB]



Failure Scenarios

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

• Instance hardware fails

• Instance fails to register with ECS

• Instance userspace gets wacky




photo credit: http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693

> reschedule task


Container Schedulers are the new watchman

• Container process monitoring

• Service health check monitoring

• Automatic re-scheduling

photo credit: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
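Automatic re-scheduling can be pictured as finding a new home for a task whose container died. The sketch below is a toy first-fit placement by memory, assuming each instance advertises its free MB; it is not how ECS actually places tasks, and the instance names are hypothetical.

```python
def reschedule(tasks, instances):
    """Place each task on the first instance with enough free memory.

    tasks: list of (name, memory_mb); instances: dict of instance -> free MB.
    Returns {task name: instance, or None if nothing fits}.
    """
    free = dict(instances)  # copy so the caller's view is not mutated
    placement = {}
    for name, memory_mb in tasks:
        placement[name] = None
        for instance, available in free.items():
            if available >= memory_mb:
                free[instance] = available - memory_mb
                placement[name] = instance
                break
    return placement
```

A task that fits nowhere maps to `None`, which is the case where the cluster needs more capacity rather than a smarter scheduler.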






• ecs-agent fails

• dockerd fails




Still need to configure an ASG to maintain capacity…





• ecs-agent fails

• dockerd fails




Still need a monitor…



Health Monitoring circa 2016, the age of containers

• Schedule a monitor process in container cluster

• Describe ASG and ECS membership

• Mark all instances unregistered with ECS unhealthy

• `docker run` a user space health check on every instance

• Mark instances that fail to connect to Docker unhealthy

• Mark instances that fail user space health check unhealthy

No Nagios server + plugins!
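One pass of such a monitor could be sketched as follows. This is an illustration of the steps listed above under simple assumptions, not the open source monitor itself; the function name and the two callables are hypothetical.

```python
def find_unhealthy(asg_instances, ecs_instances, docker_ok, check_ok):
    """Return {instance id: reason} for instances the monitor should mark unhealthy.

    asg_instances: ids the ASG reports; ecs_instances: ids registered with ECS.
    docker_ok / check_ok: callables taking an instance id and returning bool.
    """
    registered = set(ecs_instances)
    unhealthy = {}
    for instance_id in asg_instances:
        if instance_id not in registered:
            unhealthy[instance_id] = "not registered with ECS"
        elif not docker_ok(instance_id):
            unhealthy[instance_id] = "cannot connect to Docker"
        elif not check_ok(instance_id):
            unhealthy[instance_id] = "user space health check failed"
    return unhealthy
```

The checks run from cheapest to most expensive, so an instance that never registered with ECS is flagged without any attempt to reach its Docker daemon.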

Partial Failure Scenarios: battle scars

• web.2 container crashes


• ecs-agent fails

• dockerd fails




• Disk full

• Disk partition corrupt / read-only

• Network packet loss

• CPU steal

• Kernel bugs triggered

• Security vulnerabilities

• Security breaches

• …

User Space Health Check

$ docker run busybox sh -c 'dmesg | grep "Remounting filesystem read-only"'

# why not: $ docker run health-check

To package, distribute, and run common binaries and scripts such as top, netstat, and smartmontools
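A monitor could wrap the `docker run` check above roughly like this. It is a sketch under assumptions: the function names are hypothetical, `--rm` is added here to clean up the container afterwards, and whether dmesg is readable inside the container depends on host configuration. It relies on grep exiting 0 on a match and 1 on no match, with any other exit code meaning the check itself could not run.

```python
import subprocess

READ_ONLY_MARKER = "Remounting filesystem read-only"


def health_check_command(image="busybox"):
    """Build the argv for the docker run check."""
    script = f'dmesg | grep "{READ_ONLY_MARKER}"'
    return ["docker", "run", "--rm", image, "sh", "-c", script]


def classify(returncode):
    """Interpret the exit code that docker run passes through from grep."""
    if returncode == 1:
        return "healthy"               # grep found no marker
    if returncode == 0:
        return "read-only filesystem"  # grep matched the marker
    return "check failed"              # docker or the shell could not run the check


def run_health_check(image="busybox"):
    """Run the check against the local Docker daemon (needs docker on PATH)."""
    result = subprocess.run(health_check_command(image), capture_output=True, text=True)
    return classify(result.returncode)
```

Treating unexpected exit codes as "check failed" rather than "healthy" keeps an unreachable Docker daemon from passing the check silently.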

Thanks!

Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run

Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go

Questions / feedback to @nzoschke
