Monitoring Cloudflare's planet-scale edge network with Prometheus Matt Bostock

May 25, 2020

Transcript
Page 1:

Monitoring Cloudflare's planet-scale edge network with Prometheus

Matt Bostock

Page 2:

@mattbostock

Platform Operations

Page 3:

Prometheus for monitoring

● Alerting on critical production issues

● Incident response

● Post-mortem analysis

● Metrics, but not long-term storage

Page 4:

What does Cloudflare do?

CDN: Moving content physically closer to visitors with our CDN.

Website Optimization: Caching, TLS 1.3, HTTP/2, Server push, AMP, Origin load-balancing, Smart routing

DNS: Cloudflare is one of the fastest managed DNS providers in the world.

Page 5:

Cloudflare’s anycast edge network

115+ Data centers globally

1.2M DNS requests/second

5M HTTP requests/second

10% Internet requests every day

6M+ websites, apps & APIs in 150 countries

Page 6:

Cloudflare’s Prometheus deployment

185 Prometheus servers currently in Production

4 Top-level Prometheus servers

72k Samples ingested per second max per server

4.6M Time-series max per server

250GB Max size of data on disk

Page 7:

Edge Points of Presence (PoPs)

● Routing via anycast

● Configured identically

● Independent

Page 8:

Services in each PoP

● HTTP

● DNS

● Replicated key-value store

● Attack mitigation

Page 9:

Core data centers

● Enterprise log share (HTTP access logs for Enterprise customers)

● Customer analytics

● Logging: auditd, HTTP errors, DNS errors, syslog

● Application and operational metrics

● Internal and customer-facing APIs

Page 10:

Services in core data centers

● PaaS: Marathon, Mesos, Chronos, Docker, Sentry

● Object storage: Ceph

● Data streams: Kafka, Flink, Spark

● Analytics: ClickHouse (OLAP), CitusDB (shared PostgreSQL)

● Hadoop: HDFS, HBase, OpenTSDB

● Logging: Elasticsearch, Kibana

● Config management: Salt

● Misc: MySQL

Page 11:

Prometheus queries

Page 12:

node_md_disks_active / node_md_disks * 100

Page 13:

count(count(node_uname_info) by (release))

Page 14:

rate(node_disk_read_time_ms[2m]) / rate(node_disk_reads_completed[2m])
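The query above divides two per-second rates, which works out to the average time spent per completed read over the window. A minimal Python sketch of that arithmetic follows; the function names and sample values are hypothetical, and it deliberately ignores the counter-reset handling and extrapolation that Prometheus's own rate() performs.

```python
def per_second_rate(first, last):
    """Simplified rate(): counter increase divided by elapsed seconds.

    Each argument is a (timestamp_seconds, counter_value) sample.
    Unlike Prometheus, this does not handle counter resets or extrapolation.
    """
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def avg_read_latency_ms(read_time_ms, reads_completed):
    """Ratio of the two rates: milliseconds spent reading per completed read.

    Each argument is a pair of samples for one counter. Raises
    ZeroDivisionError if no reads completed in the window.
    """
    return per_second_rate(*read_time_ms) / per_second_rate(*reads_completed)


# Hypothetical samples taken 120 seconds apart (a 2m window).
latency = avg_read_latency_ms(
    read_time_ms=((0, 1000), (120, 1600)),    # 600 ms spent reading
    reads_completed=((0, 100), (120, 220)),   # 120 reads completed
)
print(latency)
```

With these made-up samples, 600 ms of read time over 120 reads gives 5 ms per read.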

Page 15:

Metrics for alerting

Page 16:

sum(rate(http_requests_total{job="alertmanager", code=~"5.."}[2m]))
/ sum(rate(http_requests_total{job="alertmanager"}[2m]))
* 100 > 0

Page 17:

count(abs(
  (hbase_namenode_FSNamesystemState_CapacityUsed / hbase_namenode_FSNamesystemState_CapacityTotal)
  - ON() GROUP_RIGHT()
  (hadoop_datanode_fs_DfsUsed / hadoop_datanode_fs_Capacity)
) * 100 > 10)

Page 18:

Prometheus architecture

Page 19:

Before, we used Nagios

● Tuned for high volume of checks

● Hundreds of thousands of checks

● One machine in one central location

● Alerting backend for our custom metrics pipeline

Page 20:

Specification

Comments

Page 21:

Inside each PoP

Server

Server

Server

Prometheus

Page 22:

Inside each PoP

Server

Server

Server

Prometheus

Page 23:

Inside each PoP: High availability

Prometheus

Server

Server

Server

Prometheus

Page 24:

Federation

San Jose

Frankfurt

Santiago

Prometheus

CORE

Page 25:

Federation configuration

- job_name: 'federate'
  scheme: https
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Scrape target health
      - '{__name__="up"}'
      # Colo-level aggregate metrics
      - '{__name__=~"colo(?:_.+)?:.+"}'

Page 26:

Federation configuration

- job_name: 'federate'
  scheme: https
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Scrape target health
      - '{__name__="up"}'
      # Colo-level aggregate metrics
      - '{__name__=~"colo(?:_.+)?:.+"}'

colo:*

colo_job:*
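To see which metric names the match[] selectors above would federate, the regular expression can be tried out in isolation. The Python sketch below is only an approximation: Prometheus uses Go's RE2 engine and anchors label regexes at both ends, which re.fullmatch stands in for here. The metric names in the loop are hypothetical.

```python
import re

# Pattern copied from the match[] selector in the federation config.
# Prometheus anchors label regexes at both ends; re.fullmatch approximates that.
COLO_AGGREGATE = re.compile(r"colo(?:_.+)?:.+")


def is_federated(metric_name: str) -> bool:
    """Return True if either selector (up, or the colo regex) matches the name."""
    return metric_name == "up" or COLO_AGGREGATE.fullmatch(metric_name) is not None


# Hypothetical metric names, for illustration only.
for name in [
    "up",
    "colo:http_requests:rate2m",
    "colo_job:http_requests:rate2m",
    "node_load1",
]:
    print(name, is_federated(name))
```

Under these assumptions, only `up` and names carrying a `colo:` or `colo_<something>:` prefix reach the top-level servers; raw per-server series such as `node_load1` stay in the PoP.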

Page 27:

Federation

San Jose

Frankfurt

Santiago

Prometheus

CORE

Page 28:

Federation: High availability

Prometheus

Prometheus

San Jose

Frankfurt

Santiago

CORE

Page 29:

Federation: High availability

Prometheus

Prometheus

San Jose

Frankfurt

Santiago

CORE US

CORE EU

Page 30:

Retention and sample frequency

● 15 days’ retention

● Metrics scraped every 60 seconds

○ Federation: every 30 seconds

● No downsampling

Page 31:

Exporters we use

Purpose                                        Name
System (CPU, memory, TCP, RAID, etc.)          Node exporter
Network probes (HTTP, TCP, ICMP ping)          Blackbox exporter
Log matches (hung tasks, controller errors)    mtail

Page 32:

Deploying exporters

● One exporter per service instance

● Separate concerns

● Deploy in same failure domain

Page 33:

Alerting

Page 34:

Alerting

Alertmanager

San Jose

Frankfurt

Santiago

CORE

Page 35:

Alerting: High availability (soon)

Alertmanager

Alertmanager

San Jose

Frankfurt

Santiago

CORE US

CORE EU

Page 36:

Writing alerting rules

● Test the query on past data

Page 37:

Writing alerting rules

● Test the query on past data

● Descriptive name with adjective or adverb

Page 38:

RAID_Array

Page 39:

RAID_Health_Degraded

Page 40:

Writing alerting rules

● Test the query on past data

● Descriptive name with adjective/adverb

● Must have an alert reference

Page 41:

Writing alerting rules

● Test the query on past data

● Descriptive name with adjective/adverb

● Must have an alert reference

● Must be actionable

Page 42:

Writing alerting rules

● Test the query on past data

● Descriptive name with adjective/adverb

● Must have an alert reference

● Must be actionable

● Keep it simple

Page 43:

Example alerting rule

ALERT RAID_Health_Degraded
  IF node_md_disks - node_md_disks_active > 0
  LABELS { notify="jira-sre" }
  ANNOTATIONS {
    summary = `{{ $value }} disks in {{ $labels.device }} on {{ $labels.instance }} are faulty`,
    dashboard = `https://grafana.internal/disk-health?var-instance={{ $labels.instance }}`,
    link = "https://wiki.internal/ALERT+Raid+Health",
  }

Page 44:

Monitoring your monitoring

Page 45:

PagerDuty escalation drill

ALERT SRE_Escalation_Drill
  IF (hour() % 8 == 1 and minute() >= 35) or (hour() % 8 == 2 and minute() < 20)
  LABELS { notify="escalate-sre" }
  ANNOTATIONS {
    dashboard="https://cloudflare.pagerduty.com/",
    link="https://wiki.internal/display/OPS/ALERT+Escalation+Drill",
    summary="This is a drill to test that alerts are being correctly escalated. Please ack the PagerDuty notification."
  }
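The time arithmetic in the drill's IF clause is easier to read once the windows are enumerated. The Python sketch below mirrors the expression (PromQL's hour() and minute() return UTC values) and scans one day minute by minute; the function names are hypothetical and it says nothing about rule evaluation intervals or how PagerDuty handles the resulting notification.

```python
def drill_is_firing(hour: int, minute: int) -> bool:
    """Mirror of the rule's IF clause; hour and minute are UTC, as in PromQL."""
    return (hour % 8 == 1 and minute >= 35) or (hour % 8 == 2 and minute < 20)


def drill_windows() -> list:
    """Scan one UTC day minute by minute; return (first, last) firing minute per window."""

    def fmt(m: int) -> str:
        return f"{m // 60:02d}:{m % 60:02d}"

    windows, start = [], None
    for m in range(24 * 60 + 1):  # the extra step closes a window still open at 23:59
        firing = m < 24 * 60 and drill_is_firing(m // 60, m % 60)
        if firing and start is None:
            start = m
        elif not firing and start is not None:
            windows.append((fmt(start), fmt(m - 1)))
            start = None
    return windows


print(drill_windows())
```

The expression is true for three 45-minute windows per day, starting at 01:35, 09:35 and 17:35 UTC, so each window straddles an hour boundary.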

Page 46:

Monitoring Prometheus

● Mesh: each Prometheus monitors other Prometheus servers in same datacenter

● Top-down: top-level Prometheus servers monitor datacenter-level Prometheus servers

Page 47:

Monitoring Alertmanager

● Use Grafana’s alerting mechanism to page

● Alert if notifications sent is zero even though alerts were received

Page 48:

Monitoring Alertmanager

(
  sum(rate(alertmanager_alerts_received_total{job="alertmanager"}[5m]))
    without(status, instance) > 0
  and
  sum(rate(alertmanager_notifications_total{job="alertmanager"}[5m]))
    without(integration, instance) == 0
)
or vector(0)

Page 49:

Page 50:

Alert routing

Page 51:

Alert routing

notify="hipchat-sre escalate-sre"

Page 52:

Alert routing

- match_re:
    notify: (?:.*\s+)?hipchat-sre(?:\s+.*)?
  receiver: hipchat-sre
  continue: true
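The route above treats the notify label as a whitespace-separated list of receivers and matches when hipchat-sre appears as a whole word. The Python sketch below exercises the same pattern against a few hypothetical label values; Alertmanager uses Go's RE2 engine and anchors match_re patterns, which re.fullmatch approximates here.

```python
import re

# Pattern copied from the match_re route; Alertmanager anchors it at both ends.
HIPCHAT_SRE = re.compile(r"(?:.*\s+)?hipchat-sre(?:\s+.*)?")


def route_matches(notify: str) -> bool:
    """Return True if the notify label value would be routed to hipchat-sre."""
    return HIPCHAT_SRE.fullmatch(notify) is not None


# Hypothetical notify label values, for illustration only.
for value in [
    "hipchat-sre",
    "hipchat-sre escalate-sre",
    "escalate-sre hipchat-sre",
    "jira-sre",
    "hipchat-sre-test",
]:
    print(repr(value), route_matches(value))
```

Because the receiver name must be bounded by whitespace or by the ends of the string, a value such as `hipchat-sre-test` does not match. With `continue: true`, an alert that matches this route is still evaluated against the routes that follow, so one alert can reach every receiver listed in its notify label.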

Page 53:

Routing tree

Page 54:

Page 55:

Page 56:

Page 57:

Page 58:

Page 59:

amtool

matt➜~» go get -u github.com/prometheus/alertmanager/cmd/amtool

matt➜~» amtool silence add \
    --expire 4h \
    --comment https://jira.internal/TICKET-1234 \
    alertname=HDFS_Capacity_Almost_Exhausted

Page 60:

Pain points

Page 61:

Storage pressure

● Use -storage.local.target-heap-size

● Set -storage.local.series-file-shrink-ratio to 0.3 or above

Page 62:

Alertmanager races, deadlocks, timeouts, oh my

Page 63:

Cardinality explosion

mbostock@host:~$ sudo cp /data/prometheus/data/heads.db ~

mbostock@host:~$ sudo chown mbostock: ~/heads.db

mbostock@host:~$ storagetool dump-heads heads.db | awk '{ print $2 }' | sed 's/{.*//' | sed 's/METRIC=//' | sort | uniq -c | sort -n

...snip...

678869 eyom_eyomCPTOPON_numsub

678876 eyom_eyomCPTOPON_hhiinv

679193 eyom_eyomCPTOPON_hhi

2314366 eyom_eyomCPTOPON_rank

2314988 eyom_eyomCPTOPON_speed

2993974 eyom_eyomCPTOPON_share
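The awk/sed/sort/uniq pipeline above counts how many series exist per metric name. The same counting step can be sketched in Python as shown below. The function name and the sample input lines are hypothetical; the real storagetool dump format is not shown in the slides, so the sketch assumes only what the pipeline itself implies: the second whitespace-separated field holds `METRIC=<name>{<labels>}`.

```python
from collections import Counter


def series_per_metric(lines):
    """Count series per metric name, sorted by ascending count.

    Mirrors the shell pipeline: take the second whitespace-separated field,
    drop everything from the first '{', and remove the 'METRIC=' marker.
    Lines with fewer than two fields are skipped (awk would emit an empty name).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        name = fields[1].split("{", 1)[0].replace("METRIC=", "", 1)
        counts[name] += 1
    return sorted(counts.items(), key=lambda kv: kv[1])


# Hypothetical dump lines, for illustration only.
sample = [
    '0001 METRIC=probe_rank{target="a"}',
    '0002 METRIC=probe_share{target="a"}',
    '0003 METRIC=probe_share{target="b"}',
]
for name, count in series_per_metric(sample):
    print(count, name)
```

As with the shell version, the metrics at the bottom of the sorted output are the ones contributing the most series.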

Page 64:

Standardise on metric labels early

● Especially probes: source versus target

● Identifying environments

● Identifying clusters

● Identifying deployments of same app in different roles

Page 65:

Next steps

Page 66:

Prometheus 2.0

● Lower disk I/O and memory requirements

● Better handling of metrics churn

Page 67:

Integration with long term storage

● Ship metrics from Prometheus (remote write)

● One query language: PromQL

Page 68:

More improvements

● Federate one set of metrics per datacenter

● Highly-available Alertmanager

● Visual similarity search

● Alert menus; loading alerting rules dynamically

● Priority-based alert routing

Page 69:

More information

blog.cloudflare.com

github.com/cloudflare

Try Prometheus 2.0: prometheus.io/blog

Questions? @mattbostock

Page 70:

Thanks!

blog.cloudflare.com

github.com/cloudflare

Try Prometheus 2.0: prometheus.io/blog

Questions? @mattbostock