
What Should I Instrument, And How Should I Do It?

Jan 21, 2018

Transcript
Page 1: What Should I Instrument, And How Should I Do It?

@xaprb

What Should I Instrument

And How Should I Do It?

Page 2: Logistics

● I’m Baron Schwartz: @xaprb or [email protected]
● I will post the slides from this talk
● This is a follow-on to What Should I Monitor And How Should I Do It

○ https://youtu.be/zLjhFrUhqxg

Page 3: What’s The Goal?

Assumption: you’re building and operating a service.

You want to instrument it so you can build and operate it better.

You want observability.

● In the present
● In the past
● In the future? (Predictability)

Observability is how well an external observer can infer a system’s internal state.

Page 4: What Should I Observe?

There’s a lot to measure in a complex system. What’s important?

● It’s more important to observe the work than the service itself.

● But it’s important to observe how the service responds to the workload.

Page 5: Some Convenient Blueprints

Brendan Gregg’s USE Method

● Utilization, Saturation, Errors
● http://www.brendangregg.com/usemethod.html

Tom Wilkie’s RED Method

● Measure request {Rate, Errors, Duration} (see the sketch after this list)
● https://www.slideshare.net/weaveworks/interactive-monitoring-for-kubernetes

The SRE Book’s 4 Golden Signals

● Latency, traffic, errors, and saturation
● https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
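As a rough illustration of the RED method, here is a minimal Go sketch (names and structure are my own, not from the talk): HTTP middleware that counts requests, counts 5xx responses as errors, and accumulates total duration, from which rate and mean latency can be derived.

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// RED counters: rate is derived from requests over time, errorCount counts
// 5xx responses, and durationNs accumulates total latency in nanoseconds.
var (
	requests   int64
	errorCount int64
	durationNs int64
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// redMiddleware records Rate, Errors, and Duration for every request.
func redMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		atomic.AddInt64(&requests, 1)
		atomic.AddInt64(&durationNs, int64(time.Since(start)))
		if rec.status >= 500 {
			atomic.AddInt64(&errorCount, 1)
		}
	})
}

func main() {
	http.Handle("/", redMiddleware(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte("ok"))
	})))
	http.ListenAndServe(":8080", nil)
}
```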

Page 6: Some Formal Laws

Queueing Theory

● Utilization, arrival rate, throughput, latency

Little’s Law

● Concurrency, latency, throughput

Universal Scalability Law

● Throughput, concurrency
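To make Little’s Law concrete, a tiny worked example (the numbers are invented for illustration): average concurrency equals throughput times mean latency.

```go
package main

import "fmt"

func main() {
	// Little's Law: N = X * R
	// N = average concurrency, X = throughput, R = mean residence time (latency).
	throughput := 500.0 // requests per second (illustrative)
	latency := 0.020    // seconds per request (illustrative)

	concurrency := throughput * latency
	fmt.Printf("average concurrency ≈ %.1f requests in flight\n", concurrency) // 10.0
}
```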

Page 7: The Zen of Performance

The unifying concept in observing a service is two perspectives on requests.

External (customer’s) view:

● Request (singular), and its latency and success.

Internal (operator’s) view:

● Requests (plural, population), and their latency distribution, rates, and concurrency.

● System resources/components and their throughput, utilization, and backlog.

Page 8: Much Confusion Comes From One-Sided Views

Many people, when asked whether a service is working well, will look at the service itself for problems.

But you can only answer that question by looking at the service’s work. From that, you may need to examine the service to see why it isn’t working well.

Both are necessary. You need instrumentation that enables both perspectives.

Page 9: Metrics That Matter

All of the metrics in all of the methods & laws mentioned are important.

● Throughput, concurrency, latency, utilization, backlog/load/saturation, rates

All of them are time-related, either point-in-time or over-a-duration.

● Time is the zeroth performance metric (perfdynamics.com).

Page 10: Your Service Must Provide These Data

If your service is to be observable, it needs to be possible to observe these things.

● You can provide the data directly, by instrumenting your service.
● An instrumented system (e.g. the OS) can implicitly offer a framework.
● Or you can use a framework to build your service (e.g. Coda Hale’s Metrics library).
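One minimal way to provide such data directly from a Go service, sketched here as an assumption rather than the talk’s prescription, is the standard library’s expvar package, which publishes registered variables as JSON at /debug/vars.

```go
package main

import (
	"expvar"
	"net/http"
)

// Counters published via expvar appear automatically at /debug/vars.
var (
	requestsTotal = expvar.NewInt("requests_total")
	errorsTotal   = expvar.NewInt("errors_total")
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Add(1)
	w.Write([]byte("ok"))
}

func main() {
	// expvar registers its handler on http.DefaultServeMux,
	// so /debug/vars is served alongside the application's routes.
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```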

Page 11: Service and Component Instrumentation

It’s not enough to instrument just your service’s input and output.

● You need internal components and subsystems to be observable too.
● Common examples: buffers, queues, locks, mutexes, persistence.

It’s easy to see that a clear architecture can help.

● Are subsystems loosely coupled and cohesive, with clear boundaries?
● Are they well defined?
● Can you draw an architecture/block diagram of them? (cf. Brendan Gregg)

Metrics on components rarely help much, beyond the basics.

Page 12: The Process List Is Golden

Focus more on requests/work than on components. This is a well-trodden path. Every mature request-oriented service has a process table/list.

● UNIX: process table, visible with `ps`
● Apache: ServerStatus
● MySQL: SHOW PROCESSLIST
● PostgreSQL: pg_stat_activity
● MongoDB: db.currentOp()

A process table tracks the existence and state of every process/worker in the system, and the tasks/requests each one is executing.

Page 13: Common Attributes Of Process Tables

Request itself

● E.g. SQL text, command line + args, verb + URL + query params
● Parent request/stage/span, if possible

State of request

● At a minimum: working or waiting (where? func/module/mutex…)
● Ideally: stages/states of execution (parsing, planning, checking auth…)

Timings

● Timestamp of start; ideally timestamps of state changes too
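Put together, a process-list entry with these attributes might look like the following hypothetical Go sketch; the type and field names are illustrative, not from the talk.

```go
package main

import (
	"fmt"
	"time"
)

// ProcessEntry is one row in a hypothetical in-memory process list.
type ProcessEntry struct {
	ID         uint64               // unique identifier for this request/worker
	Request    string               // e.g. SQL text, command line + args, or verb + URL
	ParentSpan string               // parent request/stage/span, if known
	State      string               // e.g. "parsing", "planning", "waiting on mutex"
	StartedAt  time.Time            // timestamp of start
	StateTimes map[string]time.Time // timestamp of each state change
}

func main() {
	e := ProcessEntry{
		ID:        42,
		Request:   "GET /api/v1/hosts?limit=10",
		State:     "checking auth",
		StartedAt: time.Now(),
		StateTimes: map[string]time.Time{
			"parsing": time.Now(),
		},
	}
	fmt.Printf("%+v\n", e)
}
```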

Page 14: One Example

At VividCortex, we built github.com/VividCortex/pm for API/service process lists.

● It’s for #golang
● HTTP and web browser interface
● See every request in-flight
● Kill requests
● Check request state and timings

This provides observability “now,” but not historical observability.

Page 15: Extending Observability To Historical Views

“Current state” observability is the foundation of historical views. The process list can be the foundation of request history and metrics.

For requests:

● Log every state transition/change a request makes.
● Emit metrics on aggregates at these points, or at regular intervals.

○ See previous slides for which metrics to emit!

● Capture traces of requests for distributed tracing.

For components:

● Emit metrics from each component at regular intervals (ditto on prev. slides).
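A minimal Go sketch of the request side of this idea, assuming a simple in-process counter and the standard log package (all names are illustrative): every state change is logged, and an aggregate metric is emitted on a fixed interval.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// transitions counts state changes; a real system would keep one counter per state.
var transitions int64

// Request is a minimal stand-in for a process-list entry.
type Request struct {
	ID    uint64
	State string
}

// SetState logs every state transition and counts it, so the "current state"
// view also produces a historical record.
func (r *Request) SetState(next string) {
	log.Printf("request=%d state %q -> %q", r.ID, r.State, next)
	r.State = next
	atomic.AddInt64(&transitions, 1)
}

func main() {
	// Emit aggregate metrics at a regular interval.
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("metric state_transitions_total=%d", atomic.LoadInt64(&transitions))
		}
	}()

	r := &Request{ID: 1, State: "received"}
	r.SetState("parsing")
	r.SetState("executing")
	r.SetState("done")
	time.Sleep(100 * time.Millisecond) // let this short-lived sketch exit cleanly
}
```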

Page 16: Logging, Metrics, Traces

Peter Bourgon drew a diagram that helps illustrate some concepts.

https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Page 17: Logging

What should you log? I tend to agree with Dave Cheney:

I believe that there are only two things you should log:

1. Things that developers care about when they are developing or debugging software.
2. Things that users care about when using your software.

Obviously these are debug and info levels, respectively.

https://dave.cheney.net/2015/11/05/lets-talk-about-logging


Page 18: Logging and Traces

I am not a fan of “sampling” the way it’s commonly done.

● It’s a euphemism for “let’s ignore most things.”
● Every request should be measured.

It’s typically implemented in terribly biased ways that cause all kinds of problems (e.g. “slow” query logs ignore fast-but-frequent requests).

● I prefer keeping representative samples of raw data.
● But not ignoring/dropping the rest: at least aggregating it into metrics.


Page 19: Representative Sampling Is Possible To Do

https://www.vividcortex.com/resources/sampling-a-stream-with-probabilistic-sketch
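The linked post describes a probabilistic-sketch approach; as a simpler, generic illustration of representative sampling, here is a hedged Go sketch of reservoir sampling (Algorithm R), which keeps a uniform random sample of a stream while the full stream can still be aggregated into metrics.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Reservoir keeps a uniform random sample of up to k items from a stream,
// while the caller can still aggregate every item into metrics.
type Reservoir struct {
	k     int
	seen  int
	items []string
}

func NewReservoir(k int) *Reservoir {
	return &Reservoir{k: k, items: make([]string, 0, k)}
}

// Observe considers one item from the stream (Algorithm R).
func (r *Reservoir) Observe(item string) {
	r.seen++
	if len(r.items) < r.k {
		r.items = append(r.items, item)
		return
	}
	// Replace an existing sample with probability k/seen.
	if j := rand.Intn(r.seen); j < r.k {
		r.items[j] = item
	}
}

func main() {
	res := NewReservoir(3)
	for i := 0; i < 1000; i++ {
		res.Observe(fmt.Sprintf("query-%d", i))
	}
	fmt.Println("representative sample:", res.items)
}
```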

Page 20: Observability Culture

Observability is more than a Silicon Valley buzzword. It’s a culture, like DevOps.

How can you build a culture of observability?

● You get what you incentivize: incentivize the data/metrics, and that’s what you’ll get.
● Prioritize the end, not the means.
● Understand the difference between a culture and the visible artifacts of a culture.

Many a company has tried to imitate Netflix or Etsy and gotten different results.

● See McFunley’s talk, for example: http://pushtrain.club/

Page 21: What Should You Reward?

● Clarity and intentionality; purposefulness
● Empathy
● Shared ownership and responsibility
● Attendance at DevOpsDays

What should you think twice about rewarding?

● Metrics/data/graphs in a vacuum, for their own sake
● Keep in mind that Etsy’s “if it moves, graph it” slogan is a means, not an end

○ https://codeascraft.com/2011/02/15/measure-anything-measure-everything/

Page 22: Parting Thoughts

I’m a fan of defining the problem before working on the solution.

● Clarity of purpose tends to influence decisions for the better.
● An explicit goal of observability and intelligibility tends to improve operability.
● A clear understanding of performance focuses you on KPIs, not vanity metrics.

Some further thoughts at https://www.vividcortex.com/resources/architecting-highly-monitorable-apps