Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Aug 15, 2015

SignalFx
Transcript
Page 1: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Page 2: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Phillip Liu [email protected]

@SignalFx - signalfx.com

Page 3: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Agenda

• My background

• Microservices, a review

• Analytics approach to monitoring

• Code push side effects, an example

• Summary

Page 4: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

My Background

Page 5: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Experience

[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics

[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics

[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk

[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool

[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software

[ … ]

Page 6: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Microservices, a Review

Page 7: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

A Microservices Definition

Loosely coupled service oriented architecture with bounded context.

Adrian Cockcroft

Page 8: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx’s Microservices

More than 15 internal services. Spanning hundreds of instances. Across 3 AZs.

Have dependencies on tens of external services.

Page 9: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Monitoring Challenges

• High iteration rate leads to shortened test cycles

• Integration test combinations are intractable

• Catch problems during rolling deployments

• Identify upstream/downstream side effects

• e.g. backpressure

• Identify brownouts before the customer

• etc.

Page 10: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Analytics Approach to Monitoring

Page 11: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Measure

Page 12: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Store

Page 13: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Analyze

Page 14: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Detect

Page 15: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Examples

Page 16: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Monitoring at SignalFx

• We use SignalFx to monitor SignalFx

• CollectD for OS and Docker metrics on all VMs

• Yammer metrics for all Java app servers

• Custom logger to count exception types

• All metrics are sent to an analytics service

• Each service deploys at its own cadence

• Push to lab, then canary in prod, then rest of tier
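
The "custom logger to count exception types" idea can be sketched as follows. This is a minimal illustration in Python, not the implementation used at SignalFx (the slides mention Java app servers); the class name `ExceptionCountingHandler` and the logger name `metadata-api` are assumptions made for the example. It attaches a handler that increments a counter keyed by exception class whenever an exception is logged, and those counts could then be reported as metrics.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Hypothetical handler: counts logged exceptions by exception class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when logger.exception() or exc_info=True is used.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


logger = logging.getLogger("metadata-api")  # illustrative service name
logger.propagate = False                    # keep the demo quiet on stderr
handler = ExceptionCountingHandler()
logger.addHandler(handler)

for bad_input in ["x", "y"]:
    try:
        int(bad_input)
    except ValueError:
        logger.exception("failed to parse %r", bad_input)

try:
    {}["missing"]
except KeyError:
    logger.exception("lookup failed")

# handler.counts now holds one counter per exception type,
# ready to be emitted as a metric time series.
```

Counting by type rather than logging free text is what makes the later steps (summing, time-shifting, finding outliers) possible, since each exception type becomes a numeric signal.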

Page 17: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

Page 18: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

Push canary instance and Metadata API dashboard shows healthy tier.

Page 19: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

However, upstream UI dashboard showed unusual # of timeouts.

Page 20: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

In search of root cause. Always safe to start by looking at exception counts.

Can’t derive much from all the noise.

Page 21: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

Sum the # of exceptions to create a single signal.
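
As a rough sketch of this step, many per-host exception series can be collapsed into one total per timestamp. The tuple layout `(timestamp, host, count)`, the host names, and the function name `sum_signal` are assumptions for illustration; the original slide shows a chart from the analytics service rather than code.

```python
from collections import defaultdict


def sum_signal(datapoints):
    """Collapse (timestamp, host, count) datapoints into one total per timestamp."""
    totals = defaultdict(int)
    for timestamp, _host, count in datapoints:
        totals[timestamp] += count
    return dict(sorted(totals.items()))


# Hypothetical exception counts reported by three hosts at two timestamps.
datapoints = [
    (0, "host-a", 2), (0, "host-b", 1), (0, "host-c", 0),
    (60, "host-a", 3), (60, "host-b", 1), (60, "host-c", 40),
]
total = sum_signal(datapoints)
```

The single summed series hides which host is responsible, but it makes a jump in overall exception volume much easier to see than dozens of overlapping lines.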

Page 22: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

Compare sum with time-shifted sum from a day ago.
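
A minimal sketch of the comparison, assuming the summed signal is available as a mapping from epoch seconds to exception count. The function name `compare_with_day_ago` and the sample values are hypothetical; the point is only that each value is paired with the value observed exactly one day earlier.

```python
DAY = 24 * 60 * 60  # seconds


def compare_with_day_ago(series, shift=DAY):
    """Pair each value with the value one `shift` earlier.

    Returns {timestamp: (now, previous, ratio)} for timestamps that have
    a matching earlier sample.
    """
    result = {}
    for timestamp, now in series.items():
        previous = series.get(timestamp - shift)
        if previous is None:
            continue  # no baseline sample to compare against
        if previous:
            ratio = now / previous
        else:
            ratio = float("inf") if now else 1.0
        result[timestamp] = (now, previous, ratio)
    return result


# Hypothetical summed exception counts: yesterday, then today after the push.
series = {0: 10, 60: 12, DAY: 11, DAY + 60: 48}
comparison = compare_with_day_ago(series)
```

Using the same time of day as the baseline accounts for daily traffic patterns, so a ratio well above 1 points at the push rather than at normal load variation.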

Page 23: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

Look at an outlier host - an Analytics service host.
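
One simple way to surface an outlier host, sketched here under assumptions: compare each host's exception count with the median across hosts and flag anything far above it. The factor of 5, the host names, and the function name `outlier_hosts` are illustrative choices, not values from the talk.

```python
from statistics import median


def outlier_hosts(counts_by_host, factor=5.0):
    """Return hosts whose count exceeds `factor` times the median count."""
    baseline = median(counts_by_host.values())
    # Floor the baseline at 1 so an all-quiet fleet does not flag every host.
    threshold = max(baseline, 1) * factor
    return sorted(h for h, c in counts_by_host.items() if c > threshold)


# Hypothetical per-host exception counts in the window after the push.
counts = {"metadata-1": 2, "metadata-2": 3, "ui-1": 1, "analytics-7": 40}
suspects = outlier_hosts(counts)
```

The median is used instead of the mean because a single misbehaving host would otherwise pull the baseline up toward itself.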

Page 24: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …

Looking at the Analytics service’s logs revealed the source of the problem.

Page 25: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Code Push Side Effects

• Analytics across multiple microservices reduced time to identify the problem. From push to resolution was ~15 min

• Service instrumentation helped narrow down the root cause

• Discovery allowed us to create a detector using analytics to notify us of similar problems in the future
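
A detector of the kind described in the last bullet might look like the sketch below. It is an assumption-laden illustration, not SignalFx's detector: the function name `detect`, the 3x ratio, and the three-sample duration are all hypothetical. It fires only when the current signal stays above a multiple of its baseline for several consecutive samples, which avoids alerting on a single spike.

```python
def detect(current, baseline, ratio=3.0, min_points=3):
    """Return the index at which `current` has exceeded `ratio` x `baseline`
    for `min_points` consecutive samples, or None if that never happens."""
    streak = 0
    for i, (now, base) in enumerate(zip(current, baseline)):
        if now > ratio * max(base, 1):
            streak += 1
            if streak >= min_points:
                return i
        else:
            streak = 0  # condition must hold without interruption
    return None


# Hypothetical summed exception counts versus a flat day-ago baseline.
baseline = [10] * 8
current = [10, 12, 50, 55, 11, 60, 62, 70]
fired_at = detect(current, baseline)
```

In this sample the first burst (two samples) is too short to fire; the detector triggers on the third sample of the second, sustained burst.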

Page 26: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Other Examples

• A customer started dropping data because they reverted to an unsupported API

• Compare tsdb write throughput of two different write strategies

• Create per-service capacity reports

• Identify memory usage patterns across our Analytics service

• Create a detector for every previously uncaught error condition (postmortem output)

Page 27: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Summary

Page 28: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

• Measure and store as many metrics and events as possible

• Use data analytics techniques to

  • Identify problems

  • Chase down root cause

  • Create analytics-based detectors to notify you of recurrence

Page 29: Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

SignalFx

Thank You!

Phillip Liu [email protected]

WE’RE HIRING [email protected]

@SignalFx - signalfx.com