Alerting: More Signal, Less Noise, Less Pain, by Alexis Lê-Quôc (@alq)
Transcript
Page 1: Alerting: more signal, less noise, less pain

Alerting: More Signal, Less Noise, Less Pain

Alexis Lê-Quôc (@alq)

Page 2: Alerting: more signal, less noise, less pain

Is this talk for me?

✓ I am or will be on-call

✓ I don’t like being alerted

✓ I want the pain to go away

Page 3: Alerting: more signal, less noise, less pain

The next 40 minutes

1. Alerts == pain?

2. Measure alerts

3. Concrete (& fun) steps

Page 4: Alerting: more signal, less noise, less pain

Alleviate the pain

Page 5: Alerting: more signal, less noise, less pain
Page 6: Alerting: more signal, less noise, less pain
Page 7: Alerting: more signal, less noise, less pain

Pain

Page 8: Alerting: more signal, less noise, less pain

Man vs. Machine

Page 9: Alerting: more signal, less noise, less pain

“too frequently”

“odd hours”

“always the same”

Page 10: Alerting: more signal, less noise, less pain

3 simple things to measure

Page 11: Alerting: more signal, less noise, less pain

“Always the same”

Page 12: Alerting: more signal, less noise, less pain

Steps

• Group alert stream by “alert signature”

• Rank by occurrences

• Graph
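
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the sample alerts are made up, `rank_signatures` is a hypothetical helper, and the signature is assumed to be the naive one (the normalized alert header).

```python
from collections import Counter

# Hypothetical sample: each alert is reduced to its header text.
alerts = [
    "Root disk space", "redis-queue", "Root disk space",
    "Zombies", "redis-queue", "Root disk space",
]

def rank_signatures(alert_headers):
    """Return (signature, count) pairs, most frequent first."""
    # Naive signature (an assumption): the trimmed, lower-cased header.
    signatures = (header.strip().lower() for header in alert_headers)
    return Counter(signatures).most_common()

ranked = rank_signatures(alerts)
for name, count in ranked:
    print(f"{name:<20} | {count}")
```

The ranked list is what gets graphed in the third step; a smarter signature would only change the normalization line.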

Page 13: Alerting: more signal, less noise, less pain

Alert Signatures (example)

name               | count
-------------------+------
Root disk space    |    88
redis-queue        |    71
Zombies            |    50
Total Processes    |    47
dispatcher         |    37
pgsql backends     |    35
cassandra JVM Heap |    32
SSH                |    30

Naive signature: alert headers

Page 14: Alerting: more signal, less noise, less pain

Case 1: Top 5 = 25% in volume

[Figure: alert count by signature, in % of total volume, with a zoom on the top 10]

Sample size: 1123 alerts over 6 months
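
A "top 5 = N% in volume" figure like the one above can be computed as follows. This is a sketch under assumptions: `top_k_share` is a hypothetical helper and the stream is invented, not data from the talk.

```python
from collections import Counter

def top_k_share(signatures, k=5):
    """Percentage of total alert volume taken by the k most frequent signatures."""
    counts = Counter(signatures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return 100.0 * top / total

# Hypothetical stream: 6 distinct signatures, 20 alerts in total.
stream = ["a"] * 8 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + ["f"] * 1

share_top_2 = top_k_share(stream, k=2)   # (8 + 4) / 20
share_top_5 = top_k_share(stream)        # (8 + 4 + 3 + 2 + 2) / 20
print(share_top_2, share_top_5)
```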

Page 15: Alerting: more signal, less noise, less pain

Case 2: Top 5 = 38% in volume

[Figure: alert count by signature, in % of total volume, with a zoom on the top 10]

Sample size: 2324 alerts over 6 months

Page 16: Alerting: more signal, less noise, less pain

[Figure: % in volume of top 5 plotted against alert sample size, with a histogram of the frequency of % in volume of top 5]

Solve the top 5 alerts and drop the volume by 20-80%

Outlier (due to naive signature)

Top 5 over 103 alert streams, min. 100 alerts per stream

Page 17: Alerting: more signal, less noise, less pain

“Odd hours”

Page 18: Alerting: more signal, less noise, less pain

Steps

• Group alert stream by signature, then by day of week and hour of day

• Graph
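
A minimal sketch of the day-of-week and hour-of-day grouping, assuming each alert carries an ISO 8601 timestamp. The sample rows and the helper name `count_by_day_and_hour` are hypothetical; the signature is left out of the grouping key to keep the example short.

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical alerts: (signature, ISO 8601 timestamp).
alerts = [
    ("Root disk space", "2013-10-13T03:15:00+00:00"),  # a Sunday
    ("Root disk space", "2013-10-14T03:40:00+00:00"),  # a Monday
    ("redis-queue", "2013-10-14T15:05:00+00:00"),      # a Monday
    ("Zombies", "2013-10-19T22:30:00+00:00"),          # a Saturday
]

def count_by_day_and_hour(rows):
    """Count alerts per (day of week, hour of day), both taken in UTC."""
    counts = Counter()
    for _signature, stamp in rows:
        when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
        counts[(DAYS[when.weekday()], when.hour)] += 1
    return counts

counts = count_by_day_and_hour(alerts)
for (day, hour), n in sorted(counts.items()):
    print(f"{day:<9} {hour:02d}:00  {n}")
```

Counting in UTC matches the axis of the chart in the talk; converting to the on-call person's local time zone first may make the odd hours easier to spot.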

Page 19: Alerting: more signal, less noise, less pain

[Figure: alert count by hour of day (UTC), one panel per day of week, Sunday to Saturday]

Highlighted on the chart: work days, sleeping hours (ZZZzzz), evenings, Saturday, Sunday

Page 20: Alerting: more signal, less noise, less pain

“Too frequently”

Page 21: Alerting: more signal, less noise, less pain

Steps

• Group alerts by signature

• Measure the time elapsed between first and last occurrence, and the average (or a percentile) of the time elapsed between occurrences

• Graph
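
The two measurements can be sketched as below, assuming timestamped alerts. The sample data and the helper name `signature_stats` are hypothetical. Only the mean gap is computed here; a percentile could be taken from the same `gaps` list.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86400

# Hypothetical alerts: (signature, ISO 8601 timestamp).
alerts = [
    ("Root disk space", "2013-04-01T00:00:00"),
    ("Root disk space", "2013-04-03T00:00:00"),
    ("Root disk space", "2013-04-07T00:00:00"),
    ("SSH", "2013-05-10T12:00:00"),
]

def signature_stats(rows):
    """Per signature: occurrence count, age in days, mean gap in days."""
    by_signature = defaultdict(list)
    for signature, stamp in rows:
        by_signature[signature].append(datetime.fromisoformat(stamp))

    stats = {}
    for signature, times in by_signature.items():
        times.sort()
        # Gaps between consecutive occurrences, in days.
        gaps = [(later - earlier).total_seconds() / SECONDS_PER_DAY
                for earlier, later in zip(times, times[1:])]
        stats[signature] = {
            "count": len(times),
            # Age = days between first and last occurrence.
            "age_days": (times[-1] - times[0]).total_seconds() / SECONDS_PER_DAY,
            # Undefined for a signature seen only once.
            "mean_gap_days": mean(gaps) if gaps else None,
        }
    return stats

stats = signature_stats(alerts)
print(stats)
```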

Page 22: Alerting: more signal, less noise, less pain

[Figure: alert count plotted against alert age in days]

Age = days between first and last occurrence

Old and frequent

New and frequent

Page 23: Alerting: more signal, less noise, less pain

[Figure: average period between occurrences plotted against occurrences per alert]

Occur every 2-3 days on average

Once in a blue moon

Page 24: Alerting: more signal, less noise, less pain

“too frequently”

“odd hours”

“always the same”

Page 25: Alerting: more signal, less noise, less pain

“too frequently”

“odd hours”

“always the same”

Quantified

Page 26: Alerting: more signal, less noise, less pain

Concrete steps

Page 27: Alerting: more signal, less noise, less pain

Measure your alerts

1. Collect

2. Massage

3. Visualize

4. Learn

Page 28: Alerting: more signal, less noise, less pain

Collect your alerts

• From PagerDuty (OpsGenie, Nagios, etc.)

• Import with Python (pygerduty)

• Store in PostgreSQL
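
A minimal sketch of the storage step. To stay self-contained it uses Python's built-in `sqlite3` in place of PostgreSQL, and it skips the API import entirely: the `records` list stands in for whatever the alerting service export returns. The table layout and the helper name `store_alerts` are assumptions, not the schema used in the talk.

```python
import sqlite3

# Hypothetical records: (alert id, signature, creation time in ISO 8601).
records = [
    ("P1", "Root disk space", "2013-04-01T00:00:00+00:00"),
    ("P2", "redis-queue", "2013-04-01T02:30:00+00:00"),
    ("P1", "Root disk space", "2013-04-01T00:00:00+00:00"),  # re-imported duplicate
]

def store_alerts(conn, rows):
    """Create the alerts table if needed and insert rows, skipping known ids."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        " id TEXT PRIMARY KEY,"
        " signature TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )
    # INSERT OR IGNORE is SQLite syntax; in PostgreSQL the equivalent is
    # INSERT ... ON CONFLICT DO NOTHING.
    conn.executemany("INSERT OR IGNORE INTO alerts VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
store_alerts(conn, records)
stored = conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0]
print(stored)
```

Keying the table on the alert id keeps repeated imports idempotent, which matters when the same time window is fetched more than once.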

Page 29: Alerting: more signal, less noise, less pain

Massage your alerts

• Use any of:

• SQL (windowing functions)

• R (reshape)

• Python (pandas)

Page 30: Alerting: more signal, less noise, less pain

Visualize

• R (or d3.js, Excel, etc.)

• Key is quick feedback

Page 31: Alerting: more signal, less noise, less pain

Slides, Code & Data

https://github.com/alq666/velocity-ny-2013

Page 32: Alerting: more signal, less noise, less pain

Enjoyed it? Hated it? Don’t care?

Let me know @alq