MILAN 20/21.11.2015 Alert overload: How to adopt a microservices architecture without being overwhelmed with noise Sarah Wells - Financial Times @sarahjwells
Codemotion Milan 2015 Alerts Overload

Jan 22, 2018

sarahjwells
Transcript
Page 1: Codemotion Milan 2015 Alerts Overload

MILAN 20/21.11.2015

Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Sarah Wells - Financial Times

@sarahjwells

Page 2: Codemotion Milan 2015 Alerts Overload
Page 3: Codemotion Milan 2015 Alerts Overload

Microservices make it worse

Page 4: Codemotion Milan 2015 Alerts Overload

microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems

@drsnooks

Page 5: Codemotion Milan 2015 Alerts Overload

You have a lot more systems

Page 6: Codemotion Milan 2015 Alerts Overload

45 microservices

Page 7: Codemotion Milan 2015 Alerts Overload

3 environments

Page 8: Codemotion Milan 2015 Alerts Overload

2 instances for each service

Page 9: Codemotion Milan 2015 Alerts Overload

20 checks per service

Page 10: Codemotion Milan 2015 Alerts Overload

running every 5 minutes

Page 11: Codemotion Milan 2015 Alerts Overload

> 1,500,000 system checks per day
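The daily total follows from the numbers on the preceding slides. Here is a minimal sketch of the arithmetic, assuming the factors simply multiply together (every check runs on every instance in every environment) and that checks run around the clock. The class and method names are illustrative only.

```java
// Sketch: reproduces the "> 1,500,000 system checks per day" figure,
// assuming every listed factor multiplies and checks run 24 hours a day.
final class CheckVolume {
    static long checksPerDay(int services, int environments, int instancesPerService,
                             int checksPerService, int intervalMinutes) {
        // Checks executed each time the monitoring cycle runs
        long checksPerRun = (long) services * environments * instancesPerService * checksPerService;
        // Number of monitoring cycles in one day
        long runsPerDay = (24 * 60) / intervalMinutes;
        return checksPerRun * runsPerDay;
    }

    public static void main(String[] args) {
        // 45 services, 3 environments, 2 instances, 20 checks, every 5 minutes
        System.out.println(checksPerDay(45, 3, 2, 20, 5)); // prints 1555200
    }
}
```

That is 5,400 checks per run and 288 runs per day, which gives 1,555,200 checks per day.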

Page 12: Codemotion Milan 2015 Alerts Overload

Over 19,000 system monitoring alerts in 50 days

Page 13: Codemotion Milan 2015 Alerts Overload

An average of 380 per day

Page 14: Codemotion Milan 2015 Alerts Overload

Functional monitoring is also an issue

Page 15: Codemotion Milan 2015 Alerts Overload

12,745 response time/error alerts in 50 days

Page 16: Codemotion Milan 2015 Alerts Overload

An average of 255 per day

Page 17: Codemotion Milan 2015 Alerts Overload

Why so many?

Page 18: Codemotion Milan 2015 Alerts Overload
Page 19: Codemotion Milan 2015 Alerts Overload
Page 20: Codemotion Milan 2015 Alerts Overload
Page 21: Codemotion Milan 2015 Alerts Overload
Page 22: Codemotion Milan 2015 Alerts Overload

http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

Page 23: Codemotion Milan 2015 Alerts Overload

How can you make it better?

Page 24: Codemotion Milan 2015 Alerts Overload

Quick starts: attack your problem

See our EngineRoom blog for more:

http://bit.ly/1PP7uQQ

Page 25: Codemotion Milan 2015 Alerts Overload

1 2 3

Page 26: Codemotion Milan 2015 Alerts Overload

1. Think about monitoring from the start

Page 27: Codemotion Milan 2015 Alerts Overload

It's the business functionality you care about

Page 28: Codemotion Milan 2015 Alerts Overload
Page 29: Codemotion Milan 2015 Alerts Overload
Page 30: Codemotion Milan 2015 Alerts Overload

Page 31: Codemotion Milan 2015 Alerts Overload
Page 32: Codemotion Milan 2015 Alerts Overload
Page 33: Codemotion Milan 2015 Alerts Overload
Page 34: Codemotion Milan 2015 Alerts Overload

We care about whether published content made it to us

Page 35: Codemotion Milan 2015 Alerts Overload

When people call our APIs, we care about speed

Page 36: Codemotion Milan 2015 Alerts Overload

… we also care about errors

Page 37: Codemotion Milan 2015 Alerts Overload

But it's the end-to-end that matters

https://www.flickr.com/photos/robef/16537786315/

Page 38: Codemotion Milan 2015 Alerts Overload

You only want an alert where you need to take action

Page 39: Codemotion Milan 2015 Alerts Overload

If you just want information, create a dashboard or report

Page 40: Codemotion Milan 2015 Alerts Overload

Make sure you can't miss an alert

Page 41: Codemotion Milan 2015 Alerts Overload

Make the alert great

http://www.thestickerfactory.co.uk/

Page 42: Codemotion Milan 2015 Alerts Overload

Build your system with support in mind

Page 43: Codemotion Milan 2015 Alerts Overload

Transaction ids tie all microservices together
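One way to picture this: each service reuses the transaction id it receives, passes it on to every downstream call, and writes it into every log line, so a single search in the log aggregator shows the whole journey of a request. The sketch below is a hedged illustration, not the FT implementation. The header name `X-Request-Id` and the `transaction_id=` log format are assumptions. The `tid_` plus ten lowercase letters shape is copied from the example id that appears in the alert later in this deck.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TransactionIds {
    // Assumed header name; the slides do not say which header carries the id.
    static final String HEADER = "X-Request-Id";

    // Reuse the caller's id if there is one, otherwise mint a new id shaped
    // like "tid_pbueyqnsqe": "tid_" followed by ten random lowercase letters.
    // Note: this lookup is case-sensitive, unlike real HTTP header matching.
    static String fromIncoming(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        StringBuilder id = new StringBuilder("tid_");
        for (int i = 0; i < 10; i++) {
            id.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return id.toString();
    }

    // Headers to attach to every downstream call made while handling the request.
    static Map<String, String> outgoingHeaders(String transactionId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(HEADER, transactionId);
        return headers;
    }

    // Log line that carries the id so it can be searched for across services.
    static String logLine(String transactionId, String message) {
        return "transaction_id=" + transactionId + " " + message;
    }
}
```

A service would call `fromIncoming` once at the edge of each request, then use the same id for all outgoing calls and log lines.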

Page 44: Codemotion Milan 2015 Alerts Overload
Page 45: Codemotion Milan 2015 Alerts Overload

Healthchecks tell you whether a service is OK

GET http://{service}/__health

Page 46: Codemotion Milan 2015 Alerts Overload

returns 200 if the service can run the healthcheck

Page 47: Codemotion Milan 2015 Alerts Overload

each check will return "ok": true or "ok": false
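The contract described here can be sketched roughly as follows: the endpoint answers 200 whenever it managed to run its checks, and the outcome of each individual check is reported in the body. This is an illustration only. The JSON shape (`checks`, `name`) and the check names are assumptions; only the `"ok": true` / `"ok": false` field comes from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

final class HealthEndpoint {
    // Named checks, reported in registration order. Names are assumed to be
    // plain identifiers, so no JSON escaping is done in this sketch.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    HealthEndpoint register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // The status says "the healthcheck ran", not "everything is fine":
    // failures of individual checks are reported in the body instead.
    int statusCode() {
        return 200;
    }

    // Body for GET /__health; a check that throws is reported as "ok": false.
    String body() {
        return checks.entrySet().stream()
            .map(e -> "{\"name\":\"" + e.getKey() + "\",\"ok\":" + runSafely(e.getValue()) + "}")
            .collect(Collectors.joining(",", "{\"checks\":[", "]}"));
    }

    private static boolean runSafely(BooleanSupplier check) {
        try {
            return check.getAsBoolean();
        } catch (RuntimeException ex) {
            return false;
        }
    }
}
```

Keeping the status at 200 lets a monitoring tool distinguish "the service could not even answer" from "the service answered and one dependency is unhealthy".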

Page 48: Codemotion Milan 2015 Alerts Overload
Page 49: Codemotion Milan 2015 Alerts Overload
Page 50: Codemotion Milan 2015 Alerts Overload

Synthetic requests tell you about problems early

https://www.flickr.com/photos/jted/5448635109

Page 51: Codemotion Milan 2015 Alerts Overload

2. Use the right tools for the job

Page 52: Codemotion Milan 2015 Alerts Overload

There are basic tools you need

Page 53: Codemotion Milan 2015 Alerts Overload

FT Platform: An internal PaaS

Page 54: Codemotion Milan 2015 Alerts Overload

Service monitoring (e.g. Nagios)

Page 55: Codemotion Milan 2015 Alerts Overload

Log aggregation (e.g. Splunk)

Page 56: Codemotion Milan 2015 Alerts Overload

Graphing (e.g. Graphite/Grafana)

Page 57: Codemotion Milan 2015 Alerts Overload

metrics:
  reporters:
    - type: graphite
      frequency: 1 minute
      durationUnit: milliseconds
      rateUnit: seconds
      host: <%= @graphite.host %>
      port: 2003
      prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
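Port 2003 is the default port for Graphite's plaintext protocol, in which each data point is a single line: metric path, value, and timestamp in epoch seconds. The sketch below shows what a reporter configured with a prefix like the one above would roughly send. The environment, hostname and metric name used in the example are hypothetical, and the helper is an illustration rather than the reporter's real code.

```java
final class GraphiteLine {
    // Builds one line of Graphite's plaintext protocol:
    // "<metric path> <value> <timestamp in epoch seconds>\n"
    static String format(String prefix, String metric, double value, long epochSeconds) {
        return prefix + "." + metric + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for the templated parts of the prefix.
        String prefix = "content.prod.api-policy-component.host01";
        System.out.print(format(prefix, "requests.p95", 12.5, 1444635834L));
    }
}
```

Because the environment and hostname are part of the metric path, Grafana queries can use wildcards over them to compare instances or environments on one graph.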

Page 58: Codemotion Milan 2015 Alerts Overload
Page 59: Codemotion Milan 2015 Alerts Overload
Page 60: Codemotion Milan 2015 Alerts Overload

Real time error analysis (e.g. Sentry)

Page 61: Codemotion Milan 2015 Alerts Overload

Build other tools to support you

Page 62: Codemotion Milan 2015 Alerts Overload

SAWS

Built by Silvano Dossan

See our Engine room blog: http://bit.ly/1GATHLy

Page 63: Codemotion Milan 2015 Alerts Overload

"I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"

Page 64: Codemotion Milan 2015 Alerts Overload

"Our screens have a viewing angle of about 10 degrees"

Page 65: Codemotion Milan 2015 Alerts Overload

"It never seems to show the page I want"

Page 66: Codemotion Milan 2015 Alerts Overload

Code at: https://github.com/muce/SAWS

Page 67: Codemotion Milan 2015 Alerts Overload

Dashing

Page 68: Codemotion Milan 2015 Alerts Overload
Page 69: Codemotion Milan 2015 Alerts Overload

Nagios chart

Built by Simon Gibbs

@simonjgibbs

Page 70: Codemotion Milan 2015 Alerts Overload
Page 71: Codemotion Milan 2015 Alerts Overload
Page 72: Codemotion Milan 2015 Alerts Overload
Page 73: Codemotion Milan 2015 Alerts Overload
Page 74: Codemotion Milan 2015 Alerts Overload

Use the right communication channel

Page 75: Codemotion Milan 2015 Alerts Overload

It's not email

Page 76: Codemotion Milan 2015 Alerts Overload

Slack integration

Page 77: Codemotion Milan 2015 Alerts Overload
Page 78: Codemotion Milan 2015 Alerts Overload

Radiators everywhere

Page 79: Codemotion Milan 2015 Alerts Overload

3. Cultivate your alerts

Page 80: Codemotion Milan 2015 Alerts Overload

Review the alerts you get

Page 81: Codemotion Milan 2015 Alerts Overload

If it isn't helpful, make sure you don't get sent it again

Page 82: Codemotion Milan 2015 Alerts Overload

See if you can improve it

www.workcompass.com/

Page 83: Codemotion Milan 2015 Alerts Overload

Splunk Alert: PROD - MethodeAPIResponseTime5MAlert

Business Impact

The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.

...

Page 84: Codemotion Milan 2015 Alerts Overload

Page 85: Codemotion Milan 2015 Alerts Overload

Technical Impact

The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.

Page 86: Codemotion Milan 2015 Alerts Overload

Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below.

Please see the run book for more information.

_time                     transaction_id  uuid
Mon Oct 12 07:43:54 2015  tid_pbueyqnsqe  a56a2698-6e90-11e5-8608-a0853fb4e1fe

Page 87: Codemotion Milan 2015 Alerts Overload

Page 88: Codemotion Milan 2015 Alerts Overload

Page 89: Codemotion Milan 2015 Alerts Overload

When you didn't get an alert

Page 90: Codemotion Milan 2015 Alerts Overload

What would have told you about this?

Page 91: Codemotion Milan 2015 Alerts Overload
Page 92: Codemotion Milan 2015 Alerts Overload

Setting up an alert is part of fixing the problem

✔ code

✔ test

alerts

Page 93: Codemotion Milan 2015 Alerts Overload

System boundaries are more difficult

Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Page 94: Codemotion Milan 2015 Alerts Overload

Make sure you would know if an alert stopped working

Page 95: Codemotion Milan 2015 Alerts Overload

Add a unit test

public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {

}
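The idea behind a test like this is that a log-based alert only fires while the log message still contains the words the Splunk search looks for, so a harmless-looking rewording of a log line can silently disable the alert. The sketch below is a plain-Java stand-in written without a test framework. The trigger phrase `Publish failed`, the message format and the `PublishFailureLog` class are all hypothetical; only the test name comes from the slide.

```java
final class PublishFailureLog {
    // Hypothetical trigger words: whatever phrase the Splunk alert search matches on.
    static final String TRIGGER_WORDS = "Publish failed";

    // Builds the log line written when a publish fails.
    static String message(String transactionId, String uuid) {
        return TRIGGER_WORDS + " transaction_id=" + transactionId + " uuid=" + uuid;
    }
}

final class PublishFailureLogTest {
    // Stand-in for the unit test named on the slide. The expected phrase is
    // repeated as a literal on purpose: editing the constant in production
    // code must make this test fail rather than quietly follow the change.
    static void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
        String line = PublishFailureLog.message(
            "tid_pbueyqnsqe", "a56a2698-6e90-11e5-8608-a0853fb4e1fe");
        if (!line.contains("Publish failed")) {
            throw new AssertionError("Splunk alert would no longer fire for: " + line);
        }
    }

    public static void main(String[] args) {
        shouldIncludeTriggerWordsForPublishFailureAlertInSplunk();
    }
}
```

In a real project the same check would normally live in the team's usual test framework and run on every build.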

Page 96: Codemotion Milan 2015 Alerts Overload

Deliberately break things

Page 97: Codemotion Milan 2015 Alerts Overload

Chaos snail

Page 98: Codemotion Milan 2015 Alerts Overload

The thing that sends you alerts needs to be up and running

https://www.flickr.com/photos/davidmasters/2564786205/

Page 99: Codemotion Milan 2015 Alerts Overload

What's happened to our alerts?

Page 100: Codemotion Milan 2015 Alerts Overload

We turned off ALL emails from system monitoring

Page 101: Codemotion Milan 2015 Alerts Overload

Our two most important alerts come in via our team Slack channel

Page 102: Codemotion Milan 2015 Alerts Overload

We have dashboards for our read APIs in Grafana

Page 103: Codemotion Milan 2015 Alerts Overload

To summarise...

Page 104: Codemotion Milan 2015 Alerts Overload

Build microservices

Page 105: Codemotion Milan 2015 Alerts Overload

1 2 3

Page 106: Codemotion Milan 2015 Alerts Overload

About technology at the FT:

Look us up on Stack Overflow

http://bit.ly/1H3eXVe

Read our blog

http://engineroom.ft.com/

Page 107: Codemotion Milan 2015 Alerts Overload

The FT on github

https://github.com/Financial-Times/

https://github.com/ftlabs

Page 108: Codemotion Milan 2015 Alerts Overload

Thank you!

Page 109: Codemotion Milan 2015 Alerts Overload

Questions?