Alerting with Time Series github.com/fabxc @fabxc Fabian Reinartz, CoreOS

Alerting with Time Series - FOSDEM...action Conclusion-Symptom-based pages + cause based warnings provide good coverage and insight into service availability - Design alerts that are

May 20, 2020

Transcript
Page 1:

Alerting with Time Series

github.com/fabxc

@fabxc

Fabian Reinartz, CoreOS

Page 2:

Stream of <timestamp, value> pairs associated with an identifier

http_requests_total{job="nginx",instance="1.2.3.4:80",path="/status",status="200"}

1348 @ 1480502384

1899 @ 1480502389

2023 @ 1480502394

http_requests_total{job="nginx",instance="1.2.3.1:80",path="/settings",status="201"}

http_requests_total{job="nginx",instance="1.2.3.5:80",path="/",status="500"}

...

Time Series
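
As a rough illustration of the data model on this slide, the sketch below represents one series as a metric name plus labels (the identifier) with a list of (timestamp, value) samples. The `Series` class and its `identifier` method are names invented for this example; they are not part of Prometheus.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: metric name + labels identify a stream of samples."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)  # (timestamp, value) pairs

    def identifier(self):
        # Render the identifier in the notation used on the slide.
        pairs = ",".join(f'{k}="{v}"' for k, v in self.labels.items())
        return f"{self.name}{{{pairs}}}"


# Values copied from the first series on the slide.
series = Series(
    name="http_requests_total",
    labels={"job": "nginx", "instance": "1.2.3.4:80", "path": "/status", "status": "200"},
    samples=[(1480502384, 1348), (1480502389, 1899), (1480502394, 2023)],
)
print(series.identifier())
```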

Page 3:

Stream of <timestamp, value> pairs associated with an identifier

sum by(path,status) (rate(http_requests_total{job="nginx"}[5m]))

{path="/status",status="200"} 32.13 @ 1480502384

{path="/status",status="500"} 19.133 @ 1480502394

{path="/profile",status="200"} 44.52 @ 1480502389

Time Series
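
The query on this slide can be approximated in a few lines to show what `rate` and `sum by` do conceptually. This is a simplified sketch, not how Prometheus computes it: `simple_rate` ignores counter resets and window-edge extrapolation, and the second series is hypothetical data added so there is something to sum.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    Simplification: no counter-reset handling and no extrapolation to the
    window boundaries.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def sum_by(rates, keys):
    """Sum per-series values, keeping only the label names in `keys`."""
    out = defaultdict(float)
    for labels, value in rates:
        group = tuple((k, labels.get(k)) for k in keys)
        out[group] += value
    return dict(out)


series = [
    # Samples from the earlier slide.
    ({"instance": "1.2.3.4:80", "path": "/status", "status": "200"},
     [(1480502384, 1348), (1480502389, 1899), (1480502394, 2023)]),
    # Hypothetical second instance serving the same path and status.
    ({"instance": "1.2.3.9:80", "path": "/status", "status": "200"},
     [(1480502384, 100), (1480502394, 225)]),
]
rates = [(labels, simple_rate(samples)) for labels, samples in series]
by_path_status = sum_by(rates, ("path", "status"))
```

The `instance` label disappears from the result because it is not listed in `keys`, which is the same effect `sum by(path,status)` has in the query.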

Page 4:

Prometheus

Targets

Service Discovery

(Kubernetes, AWS, Consul, custom...)

Grafana

HTTP API

UI

Page 5:

A lot of traffic to monitor

Monitoring traffic should not be proportional to user traffic

Page 6:

A lot of targets to monitor

A single host can run hundreds of machines/procs/containers/...

Page 7:

Targets constantly change

Deployments, scaling up, scaling down, rolling updates

Page 8:

Need a fleet-wide view

What’s my 99th percentile request latency across all frontends?

Page 9:

Drill-down for investigation

Which pod/node/... has turned unhealthy? How and why?

Page 10:

Monitor all levels, with the same system

Query and correlate metrics across the stack

Page 11:

Translate that to

Meaningful Alerting

Page 12:

Anomaly Detection

Automated Alert Correlation

Self-Healing

Machine Learning

Page 13:

Anomaly Detection

If you are actually monitoring at scale, something will always correlate.

Huge effort to eliminate a huge number of false positives.

Huge chance to introduce false negatives.

Page 14:

Prometheus Alerts

current state != desired state = alerts

Page 15:

Symptom-based pages

Urgent issues – Does it hurt your user?

Diagram: user, system, and four dependencies

Page 16:

Latency

Four Golden Signals

Diagram: user, system, and four dependencies

Page 17:

Traffic

Four Golden Signals

Diagram: user, system, and four dependencies

Page 18:

Errors

Four Golden Signals

Diagram: user, system, and four dependencies

Page 19:

Cause-based warnings

Helpful context, non-urgent problems

Diagram: user, system, and four dependencies

Page 20:

Saturation / Capacity

Four Golden Signals

Diagram: user, system, and four dependencies

Page 21:

Prometheus Alerts

ALERT <alert name>

IF <PromQL vector expression>

FOR <duration>

LABELS { ... }

ANNOTATIONS { ... }

Each result entry is one alert:

<elem1> <val1>

<elem2> <val2>

<elem3> <val3>

...

Page 22:

etcd_has_leader{job="etcd", instance="A"} 0

etcd_has_leader{job="etcd", instance="B"} 0

etcd_has_leader{job="etcd", instance="C"} 1

Page 23:

Prometheus Alerts

ALERT EtcdNoLeader

IF etcd_has_leader == 0

FOR 1m

LABELS {

severity="page"

}

{job="etcd",instance="A"} 0.0

{job="etcd",instance="B"} 0.0

{job="etcd",alertname="EtcdNoLeader",severity="page",instance="A"}

{job="etcd",alertname="EtcdNoLeader",severity="page",instance="B"}
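
The step from query result to alerts can be sketched as follows. This is an illustrative model only: `evaluate_alert` is a name made up for this example, and the `FOR 1m` pending period is deliberately not modeled.

```python
def evaluate_alert(name, vector, predicate, extra_labels):
    """Turn every matching element of an instant vector into one alert."""
    alerts = []
    for labels, value in vector:
        if predicate(value):
            alert = dict(labels)          # keep the labels of the result element
            alert["alertname"] = name     # add the rule name
            alert.update(extra_labels)    # add the LABELS from the rule
            alerts.append(alert)
    return alerts


# The three series from the previous slide.
etcd_has_leader = [
    ({"job": "etcd", "instance": "A"}, 0),
    ({"job": "etcd", "instance": "B"}, 0),
    ({"job": "etcd", "instance": "C"}, 1),
]
alerts = evaluate_alert(
    "EtcdNoLeader", etcd_has_leader, lambda v: v == 0, {"severity": "page"}
)
```

Instance C has a leader, so it drops out of the result and produces no alert.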

Page 24:

requests_total{instance="web-1", path="/index", method="GET"} 8913435

requests_total{instance="web-1", path="/index", method="POST"} 34845

requests_total{instance="web-3", path="/api/profile", method="GET"} 654118

requests_total{instance="web-2", path="/api/profile", method="GET"} 774540

request_errors_total{instance="web-1", path="/index", method="GET"} 84513

request_errors_total{instance="web-1", path="/index", method="POST"} 434

request_errors_total{instance="web-3", path="/api/profile", method="GET"} 6562

request_errors_total{instance="web-2", path="/api/profile", method="GET"} 3571

Page 25:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) > 500

{} 534

Page 26:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) > 500

{} 534

WRONG: Absolute threshold

alerting rule needs constant tuning as traffic changes

Page 27:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) > 500

{} 534

traffic changes over days

Page 28:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) > 500

{} 534

traffic changes over months

Page 29:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) > 500

{} 534

traffic when you release awesome feature X

Page 30:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) /

sum(rate(requests_total[5m])) > 0.01

{} 1.8354


Page 32:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) /

sum(rate(requests_total[5m])) > 0.01

{} 1.8354

WRONG: No dimensionality in result

loss of detail, signal cancellation

Page 33:

ALERT HighErrorRate

IF sum(rate(request_errors_total[5m])) /

sum(rate(requests_total[5m])) > 0.01

{} 1.8354

Plot labels: high error / low traffic; low error / high traffic; total sum
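
The cancellation effect is easy to reproduce with numbers. The sketch below uses hypothetical per-path rates (not taken from the slides): the ratio of the totals stays under the 0.01 threshold even though one low-traffic path fails half of its requests.

```python
# Hypothetical per-path rates in requests/s and errors/s.
rates = {
    "/index":       {"requests": 1000.0, "errors": 2.0},   # low error, high traffic
    "/api/profile": {"requests": 10.0,   "errors": 5.0},   # high error, low traffic
}
THRESHOLD = 0.01

# Equivalent of sum(errors) / sum(requests): a single value without labels.
total_errors = sum(r["errors"] for r in rates.values())
total_requests = sum(r["requests"] for r in rates.values())
overall_ratio = total_errors / total_requests

# Equivalent of keeping the path dimension in both sums.
per_path_ratio = {path: r["errors"] / r["requests"] for path, r in rates.items()}
firing_paths = [p for p, ratio in per_path_ratio.items() if ratio > THRESHOLD]
```

With these numbers the overall ratio is about 0.007, so the label-less rule stays silent while `/api/profile` is badly broken.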

Page 34:

ALERT HighErrorRate

IF sum by(instance, path) (rate(request_errors_total[5m])) /

sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/comments"} 0.02435

{instance="web-1", path="/api/comments"} 0.01055

{instance="web-2", path="/api/profile"} 0.34124

Page 35:

ALERT HighErrorRate

IF sum by(instance, path) (rate(request_errors_total[5m])) /

sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"} 0.022435

...

WRONG: Wrong dimensions

aggregates away dimensions of fault-tolerance

Page 36:

ALERT HighErrorRate

IF sum by(instance, path) (rate(request_errors_total[5m])) /

sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"} 0.02435

...

Plot labels: instance 1; instance 2..1000

Page 37:

ALERT HighErrorRate

IF sum without(instance) (rate(request_errors_total[5m])) /

sum without(instance) (rate(requests_total[5m])) > 0.01

{method="GET", path="/api/v1/comments"} 0.02435

{method="POST", path="/api/v1/comments"} 0.015

{method="POST", path="/api/v1/profile"} 0.34124
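
A small sketch of what `sum without(instance)` buys you, under made-up numbers: one of 200 hypothetical instances fails every request. Per instance that is a 100% error ratio, but across the service it is 0.5%, below the 0.01 threshold. The helper names `sum_without` and `ratio` are invented for this example.

```python
from collections import defaultdict


def sum_without(vector, drop):
    """Sum values over all series that agree on every label except `drop`."""
    out = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop))
        out[key] += value
    return dict(out)


def ratio(errors, requests):
    """Divide two aggregated vectors that share the same label sets."""
    return {key: errors[key] / requests[key] for key in errors if requests.get(key)}


# Hypothetical rates: web-1 fails every request, the other 199 are healthy.
requests = [({"instance": f"web-{i}", "path": "/index"}, 10.0) for i in range(1, 201)]
errors = [({"instance": f"web-{i}", "path": "/index"}, 10.0 if i == 1 else 0.0)
          for i in range(1, 201)]

service_ratio = ratio(sum_without(errors, "instance"), sum_without(requests, "instance"))
worst_instance_ratio = max(e / r for (_, e), (_, r) in zip(errors, requests))
```

The remaining labels (here only `path`) are preserved without having to list them, which is why `without` is used instead of `by`.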

Page 38:

ALERT DiskWillFillIn4Hours

IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0

FOR 5m

...

Plot labels: 0; -1h, now, +4h
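
The idea behind `predict_linear` can be sketched as a least-squares line fitted to the samples of the last hour and extended four hours ahead. The function below is a simplified stand-in written for this transcript, measuring the prediction horizon from the last sample; it is not Prometheus's implementation, and the free-space samples are hypothetical.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it `seconds_ahead` seconds after the last sample."""
    n = len(samples)
    t_ref = samples[-1][0]                       # last timestamp as time origin
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x          # fitted value at the last sample
    return intercept + slope * seconds_ahead


# Hypothetical free bytes over one hour, sampled every 10 minutes,
# shrinking from 10 GB to 4 GB.
GB = 10**9
samples = [(i * 600, (10 - i) * GB) for i in range(7)]
predicted = predict_linear(samples, 4 * 3600)
will_fill = predicted < 0
```

At this rate of one GB per ten minutes the remaining 4 GB are gone well within four hours, so the predicted value is negative and the rule condition holds.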

Page 39:

ALERT DiskWillFillIn4Hours

IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0

FOR 5m

ANNOTATIONS {

summary = "device filling up",

description = "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."

}

Page 40:

Alertmanager

Aggregate, deduplicate, and route alerts

Page 41:

Prometheus

Targets

Service Discovery

(Kubernetes, AWS, Consul, custom...)

Alertmanager

Email, Slack, PagerDuty, OpsGenie, ...

Page 42:

Alerting Rule Alerting Rule Alerting Rule Alerting Rule...

04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
. . .
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

Page 43:

Alerting Rule Alerting Rule Alerting Rule Alerting Rule...

Alertmanager

Chat

JIRA

PagerDuty

...

You have 15 alerts for Service X in zone eu-west

3x HighLatency 10x HighErrorRate 2x CacheServerSlow

Individual alerts: ...
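
The grouping step can be sketched as bucketing alerts by a chosen set of labels and emitting one summary per bucket. The function names below are hypothetical and the sketch ignores timing (group waits, repeat intervals); it only illustrates how many raw alerts collapse into a single notification.

```python
from collections import Counter, defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[k] for k in group_by)].append(alert)
    return groups


def summarize(key, alerts):
    """Build one notification line for a group of alerts."""
    counts = Counter(a["alertname"] for a in alerts)
    parts = " ".join(f"{n}x {name}" for name, n in counts.items())
    return f"You have {len(alerts)} alerts for service {key[0]} in zone {key[1]}: {parts}"


# A few alerts in the shape shown on the previous slide.
alerts = [
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "GET"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "GET"},
    {"alertname": "HighErrorRate", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "POST"},
    {"alertname": "CacheServerSlow", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "POST"},
]
groups = group_alerts(alerts, ("service", "zone"))
messages = [summarize(key, members) for key, members in groups.items()]
```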

Page 44:

Inhibition

{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}

...

{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}

{alertname="ErrorsHigh", severity="page", ..., zone="eu-west"}

...

{alertname="ServiceDown", severity="page", ..., zone="eu-west"}

{alertname="DatacenterOnFire", severity="page", zone="eu-west"}

if active, mute everything else in same zone
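
Inhibition as described on this slide can be modeled as a filter: while a source alert is active, every other alert with the same value for the chosen labels is dropped. The `inhibit` function and its parameters are names made up for this sketch, and the `us-east` alert is hypothetical data added to show that other zones are unaffected.

```python
def inhibit(alerts, source_match, equal):
    """Drop alerts that share the `equal` labels with an active source alert."""
    def is_source(alert):
        return all(alert.get(k) == v for k, v in source_match.items())

    sources = [a for a in alerts if is_source(a)]
    kept = []
    for alert in alerts:
        # Source alerts themselves are never muted.
        muted = not is_source(alert) and any(
            all(alert.get(k) == src.get(k) for k in equal) for src in sources
        )
        if not muted:
            kept.append(alert)
    return kept


alerts = [
    {"alertname": "LatencyHigh", "severity": "page", "zone": "eu-west"},
    {"alertname": "ErrorsHigh", "severity": "page", "zone": "eu-west"},
    {"alertname": "ServiceDown", "severity": "page", "zone": "eu-west"},
    {"alertname": "DatacenterOnFire", "severity": "page", "zone": "eu-west"},
    {"alertname": "LatencyHigh", "severity": "page", "zone": "us-east"},  # hypothetical
]
remaining = inhibit(alerts, {"alertname": "DatacenterOnFire"}, ["zone"])
```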

Page 45:

Anomaly Detection

Page 46:

Practical Example 1

job:requests:rate5m = sum by(job) (rate(requests_total[5m]))

job:requests:holt_winters_rate1h = holt_winters( job:requests:rate5m[1h], 0.6, 0.4)

Page 47:

Practical Example 1

ALERT AbnormalTraffic

IF abs( job:requests:rate5m - job:requests:holt_winters_rate1h offset 7d ) > 0.2 * job:requests:holt_winters_rate1h offset 7d

FOR 10m

...
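
The rule compares the current request rate with a smoothed value from one week earlier and fires when they differ by more than 20%. The sketch below uses the textbook form of double exponential smoothing to produce such a baseline; it is an assumption that this resembles what `holt_winters` does internally, and the sample rates are hypothetical.

```python
def double_exponential_smoothing(values, sf, tf):
    """Textbook double exponential smoothing; returns the final smoothed level.

    sf weights the newest sample, tf weights the newest trend estimate.
    Requires at least two values.
    """
    level = values[0]
    trend = values[1] - values[0]
    for x in values[1:]:
        prev_level = level
        level = sf * x + (1 - sf) * (prev_level + trend)
        trend = tf * (level - prev_level) + (1 - tf) * trend
    return level


def abnormal(current, baseline, tolerance=0.2):
    """True when `current` deviates from `baseline` by more than `tolerance`."""
    return abs(current - baseline) > tolerance * baseline


# Hypothetical request rates from the same hour one week ago.
last_week = [100.0, 102.0, 101.0, 103.0, 102.0, 104.0]
baseline = double_exponential_smoothing(last_week, 0.6, 0.4)
is_abnormal = abnormal(150.0, baseline)
```

Smoothing the baseline keeps a single noisy sample from a week ago from deciding whether today's traffic counts as abnormal.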

Page 48:

Practical Example 2

instance:latency_seconds:mean5m
  > on (job) group_left()
  (
      avg by (job)(instance:latency_seconds:mean5m)
    + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
  )

Page 49:

Practical Example 2

(
  instance:latency_seconds:mean5m
    > on (job) group_left()
    (
        avg by (job)(instance:latency_seconds:mean5m)
      + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job)(instance:latency_seconds:mean5m)

Page 50:

Practical Example 2

(
  instance:latency_seconds:mean5m
    > on (job) group_left()
    (
        avg by (job)(instance:latency_seconds:mean5m)
      + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job)(instance:latency_seconds:mean5m)

and on (job)
  avg by (job)(instance:latency_seconds_count:rate5m) > 1
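
The three conditions of this expression can be restated for a single job as: an instance is an outlier if its latency is more than two standard deviations above the job average, and at least 20% above the average, and the job sees more than one request per second on average. The sketch below applies that to hypothetical numbers; `latency_outliers` is a made-up helper and uses the population standard deviation.

```python
from statistics import mean, pstdev


def latency_outliers(latency, request_rate, sigmas=2.0, min_ratio=1.2, min_rate=1.0):
    """Instances of one job whose mean latency stands out from their peers."""
    values = list(latency.values())
    avg = mean(values)
    spread = pstdev(values)
    # Skip jobs with too little traffic for the statistics to mean anything.
    if mean(list(request_rate.values())) <= min_rate:
        return []
    return [
        inst for inst, value in latency.items()
        if value > avg + sigmas * spread and value > min_ratio * avg
    ]


# Hypothetical job: nine instances at 100 ms, one at 1 s, 5 requests/s each.
latency = {f"web-{i}": 0.1 for i in range(1, 10)}
latency["web-10"] = 1.0
busy = {inst: 5.0 for inst in latency}
idle = {inst: 0.5 for inst in latency}

outliers = latency_outliers(latency, busy)
outliers_when_idle = latency_outliers(latency, idle)
```

The 1.2x guard matters when all instances are nearly identical: the standard deviation is then tiny, and without the guard an insignificant difference would already count as an outlier.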

Page 51:

Self Healing

Page 52:

Diagram: Prom, Alertmanager, wh, and node, connected by arrows labeled scrape, alert, notify, and action

Page 53:

Conclusion

- Symptom-based pages + cause-based warnings provide good coverage and insight into service availability

- Design alerts that are adaptive to change, preserve as many dimensions as possible, aggregate away dimensions of fault tolerance

- Use linear prediction for capacity planning and saturation detection

- Advanced alerting expressions allow for well-scoped and practical anomaly detection

- Raw alerts are not meant for human consumption

- The Alertmanager aggregates, silences, and routes groups of alerts as meaningful notifications

Page 54:

coreos.com/fest

@coreosfest

May 31 - June 1, 2017

San Francisco

Page 55:

Join us!

careers: coreos.com/careers (now in Berlin!)