Observability for Startups

Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM.

Jun 20, 2020


Transcript

Observability for Startups


Luke Demi

@luke_demi_


No vendors were harmed in the making of this presentation


Backend RPM


Backend RPM


Capacity Planning for Crypto Mania


Web Service Time Breakdown


Web Service Time Breakdown


Web Service Time Breakdown


We finally began to ask the hard questions about what was happening in our environment.


User Traffic

Backend RPM


Lesson: Good instrumentation will surface your problems. Bad instrumentation will obscure them.


What is good instrumentation?

all we want is to know when and why our services break


Which really means:

● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface


Logging: Rich, open-ended context around discrete events

Example

● Audit trails
● Debug/error messages

Pros

● High cardinality events
● Extreme granularity (down to the microsecond!)

Cons

● Very expensive (not optimized for storage)
● Slow aggregations
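As a rough illustration of "rich, open-ended context around discrete events", here is a minimal sketch of a structured log event. The `log_event` helper and the field names (`user_id`, `request_id`, `route`) are hypothetical and do not come from the talk.

```python
import json
import time


def log_event(message, **context):
    """Serialize one discrete event as a JSON line (hypothetical helper)."""
    event = {
        # Microsecond-resolution timestamp, the granularity logs can offer.
        "ts_us": time.time_ns() // 1_000,
        "message": message,
    }
    # Arbitrary key/value context; fields such as user_id are high cardinality.
    event.update(context)
    return json.dumps(event, sort_keys=True)


line = log_event("login failed", user_id="u-48213", request_id="r-9f2c", route="/signin")
print(line)
```

Every event carries its own full context, which is what makes logs flexible to query and also what makes them expensive to store and slow to aggregate.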


Tracing: End-to-end visibility into real requests

Example

● Entire lifecycle of a request or transaction

Pros

● Granular timings from the perspective of the application
● Magical

Cons

● Lacks full context - aggregations are approximations
● Performance penalties


Metrics: Structured, lightweight and easy to aggregate

Example

● Depth of a queue (gauge)
● Count of incoming requests (counter)

Pros

● Fast aggregations
● Storage/retention optimized (burst-proof!)

Cons

● Lacks granularity and cardinality for deeper analysis
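The difference between the two metric types named above can be sketched in a few lines. The `Counter` and `Gauge` classes here are illustrative stand-ins, not any particular library's API.

```python
class Counter:
    """Monotonically increasing count, e.g. incoming requests."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can rise or fall, e.g. queue depth."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


requests_total = Counter()
queue_depth = Gauge()

for _ in range(3):
    requests_total.inc()   # three requests arrive
queue_depth.set(5)         # queue grows to 5
queue_depth.set(2)         # then drains to 2

print(requests_total.value, queue_depth.value)  # 3 2
```

Both types reduce to a single number per series, which is why they are cheap to store and fast to aggregate, and also why they carry no per-event detail.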


Logs provide many, many, many of the features we wanted:

● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface


enrichment


We piled so much into logs...


40 nodes * 64 GB RAM = 2.5 TB = 12-24 hours of logs

… so aggregating over 7 days is a little bit painful
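The arithmetic behind that pain can be worked through directly. This sketch assumes the slide means the cluster's combined RAM covers roughly 12 to 24 hours of logs, and it uses 1 TB = 1024 GB; both are interpretations, not statements from the talk.

```python
nodes = 40
ram_gb_per_node = 64

cluster_ram_gb = nodes * ram_gb_per_node   # 2560 GB
cluster_ram_tb = cluster_ram_gb / 1024     # 2.5 TB

# Assumed reading of the slide: this much memory covers 12-24 hours of logs.
hours_covered_low, hours_covered_high = 12, 24

window_hours = 7 * 24                      # 168 hours in 7 days

# How many times larger the 7-day window is than what the RAM covers.
multiple_best = window_hours / hours_covered_high   # 7.0
multiple_worst = window_hours / hours_covered_low   # 14.0

print(cluster_ram_tb, multiple_best, multiple_worst)  # 2.5 7.0 14.0
```

Under these assumptions a 7-day aggregation touches 7 to 14 times more data than the cluster holds in memory.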


What was wrong with tripling down on Elasticsearch/Kibana?

1. Bad interface for building alerts
2. Challenges around long term retention and management
3. Iffy aggregation support


Lesson: Hammers don't work for everything. Sometimes you need a screwdriver?


● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface
● + Alerts
● + Stability


Maybe Kibana was just a bad metrics provider?


What does Datadog Provide?

● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface
● + Alerts
● + Stability


Looks like we're compensating for something?

● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface
● + Alerts
● + Stability


[Architecture diagram. Labels: app1, app2, EC2 Instance, DB, udp, /proc/loadavg, /proc/meminfo, Coinbase VPC, /tmp/safe-docker.sock]
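The "udp" label in the diagram suggests the apps emit metrics as UDP datagrams to a local agent, although the slide text does not confirm this. A minimal sketch under that assumption, using the StatsD line format with DogStatsD-style `|#key:value` tags; the metric name, tag, and port 8125 are illustrative choices.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a StatsD-style datagram with an optional DogStatsD-style tag suffix."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line.encode("utf-8")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected and none is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


payload = format_metric("backend.requests", 1, "c", {"app": "app1"})
print(payload)  # b'backend.requests:1|c|#app:app1'
# send_metric(payload) would deliver it to an agent listening on localhost:8125.
```

UDP keeps the application from blocking on the metrics pipeline, at the cost of silently dropping datagrams when nothing is listening.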


Cardinality?!


Δt


Δt

10s flush time for aggregations


Δt

10s flush time for aggregations

aggregations stored per tag
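A minimal sketch of what a 10-second flush with per-tag aggregation implies. The event tuple layout and the `aggregate` function are assumptions made for illustration, not the agent's actual implementation.

```python
from collections import defaultdict

FLUSH_INTERVAL_S = 10  # flush window from the slide


def aggregate(events):
    """Sum values per (flush window, metric name, tag set).

    Each event is (timestamp_s, name, value, tags_dict).
    """
    buckets = defaultdict(float)
    for ts, name, value, tags in events:
        window_start = int(ts // FLUSH_INTERVAL_S) * FLUSH_INTERVAL_S
        tag_key = tuple(sorted(tags.items()))
        buckets[(window_start, name, tag_key)] += value
    return dict(buckets)


events = [
    (0.5, "requests", 1, {"app": "app1"}),
    (3.2, "requests", 1, {"app": "app1"}),
    (9.9, "requests", 1, {"app": "app2"}),
    (10.1, "requests", 1, {"app": "app1"}),
]

for key, total in sorted(aggregate(events).items()):
    print(key, total)
```

Four events collapse into three stored values. The timing of the two app1 requests inside the first window is gone, and every distinct tag combination produces its own series, so a tag with many possible values multiplies what has to be stored.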


definitely a darker square

definitely 40 px x 40 px


definitely a darker square

definitely 40 px x 40 px

STILL


● A lot can happen in 10 seconds...
● Pre-aggregations feel kind of dumb
  ○ We're limiting the questions we can ask in the future
● No built-in context/discoverability
● What about high cardinality?


Lesson: The grass isn't always greener on the other side of the observability fence.


When Nirvana?

● Low latency (under 1 minute)
● Long retention
● Low granularity
● High cardinality
● In-house ("secure"?)
● Fast aggregations
● Intuitive interface
● + Alerts
● + Stability


sampling

Alternative 1
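The slide gives only the word "sampling", so the following is one common approach rather than the speaker's method: keep a fraction of raw events and record the sample rate on each so totals can be estimated later. The function names and event fields are hypothetical.

```python
import random


def sample_events(events, sample_rate, rng):
    """Keep each event with probability 1/sample_rate and tag it with that rate."""
    kept = []
    for event in events:
        if rng.random() < 1.0 / sample_rate:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_count(sampled):
    """Each kept event stands in for sample_rate original events."""
    return sum(e["sample_rate"] for e in sampled)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
events = [{"route": "/signin", "status": 200} for _ in range(10_000)]
sampled = sample_events(events, sample_rate=10, rng=rng)

print(len(sampled), estimated_count(sampled))
```

Storage drops by roughly the sample rate while each kept event retains its full context. The trade-off is that counts become estimates and rare events may be missed entirely.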


Is a perfect solution too much to ask?


Double down into events

Alternative 2


Questions?

@luke_demi_