ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems


Victor [email protected]


Today

App

? ? ?


Containers Infrastructure

Manage containers @ Google

Everything runs in a container

2B+ containers started per week

Images by Connie Zhou


You may Know Some of our OSS Work

Let Me Contain That For You


What about at Google?



Borg


What is Borg?

Large-scale cluster management at Google with Borg

http://research.google.com/pubs/pub43438.html



Borglet

Google’s node agent

Borglet = init + Docker + a few other things

Primary goals

➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)


How do we get to task performance management?

Dremel: Interactive Analysis of Web-Scale Datasets




Task Performance Analysis (TPA)

Our system for container-based black-box application performance analysis

Containers are the main enabler

Manage, monitor, and improve application performance

Today’s Talk

➔ How does it work
➔ User stories: stories from the front-lines!



How does it work?


Overall Flow

Collection → Aggregation → Baselines → SLOs → Enforcement


Low-Level Performance Metrics

Key: collect lots of container-based low-level metrics from the kernel

Custom kernel patches to give us even more stats and metrics

Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g.: netlink, ioctls, etc)
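
As a rough illustration of the cgroups source, here is a minimal sketch that parses the "key value" layout used by the upstream Linux cgroup v2 cpu.stat file. The sample values are made up, the function name parse_flat_keyed is hypothetical, and nothing here reflects Google's custom kernel patches or collectors.

```python
def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroup 'flat keyed' stat file: one 'key value' pair per line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Sample text in the layout of a cgroup v2 cpu.stat file (values are invented).
# On a real machine this would be read from the container's cgroup directory.
SAMPLE_CPU_STAT = """\
usage_usec 1200000
user_usec 900000
system_usec 300000
nr_periods 100
nr_throttled 7
throttled_usec 35000
"""

stats = parse_flat_keyed(SAMPLE_CPU_STAT)

# Fraction of scheduling periods in which the container was throttled.
throttled_fraction = stats["nr_throttled"] / stats["nr_periods"]
print(stats["usage_usec"], throttled_fraction)
```

Because every container has its own cgroup, the same parser applies to any task on the machine without knowing anything about the application inside it.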

[Figure: an App inside a Container, with low-level performance metrics and telemetry]



Low-Level Performance Metrics

Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time

Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis

Collection is very low-overhead
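
To show why histograms capture "number, breakdown, and tail" cheaply, here is a minimal sketch of a fixed-bucket latency histogram. The class name, the bucket bounds, and the sample values are all assumptions for illustration; the real collection happens in the kernel and is not described in the talk.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram; bucket i counts values <= bounds_us[i]."""

    def __init__(self, bounds_us):
        self.bounds_us = sorted(bounds_us)
        # One extra bucket at the end for values above the largest bound.
        self.counts = [0] * (len(self.bounds_us) + 1)

    def record(self, latency_us):
        self.counts[bisect.bisect_left(self.bounds_us, latency_us)] += 1

    def total(self):
        return sum(self.counts)

    def percentile_upper_bound(self, pct):
        """Upper bound of the bucket that contains the pct-th percentile."""
        total = self.total()
        if total == 0:
            return None
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            # Integer comparison avoids floating-point rounding surprises.
            if running * 100 >= pct * total:
                if i < len(self.bounds_us):
                    return self.bounds_us[i]
                return float("inf")


hist = LatencyHistogram([10, 100, 1000, 10000])
for latency in [5] * 90 + [50] * 9 + [5000]:
    hist.record(latency)

print(hist.total(), hist.counts)
print(hist.percentile_upper_bound(50), hist.percentile_upper_bound(99))
```

Recording a sample is one binary search and one increment, and the storage is a handful of counters regardless of how many operations occur, which is one plausible reason such collection stays low-overhead.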



Cluster-Wide Aggregation

Cluster service that collects all metrics and exports them to Dremel

Push data for all tasks on all machines, keep them for a while

Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless

Best news: You can use it too! Google BigQuery
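
The kind of query this enables can be sketched with Python's built-in sqlite3 module as a local stand-in for Dremel or BigQuery. The table name, column names, and rows below are invented for illustration and are not the actual schema.

```python
import sqlite3

# In-memory database standing in for the cluster-wide performance data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (cell TEXT, task TEXT, wakeup_latency_us INTEGER)"
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("cell-a", "search/0", 120),
        ("cell-a", "search/1", 80),
        ("cell-a", "search/2", 4000),
        ("cell-b", "search/0", 90),
        ("cell-b", "search/1", 110),
    ],
)

# Per-cell summary: sample count, mean latency, and worst latency.
rows = conn.execute(
    """
    SELECT cell,
           COUNT(*) AS n,
           AVG(wakeup_latency_us) AS avg_us,
           MAX(wakeup_latency_us) AS max_us
    FROM samples
    GROUP BY cell
    ORDER BY cell
    """
).fetchall()

for row in rows:
    print(row)
```

Changing the GROUP BY clause is all it takes to slice the same data by task, by cell, or by time window, which is the flexibility the slide credits to SQL.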

Performance Data DB

BigQuery



Performance Baselines

Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends

Gives us insights into performance trends and helps us develop performance baselines

Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
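
One way to picture a baseline is as a tail percentile computed per slice of the data. The sketch below assumes a 95th percentile with the nearest-rank method and groups by (cell, task type); the talk does not say which percentile or grouping is actually used, and the sample numbers are invented.

```python
from collections import defaultdict


def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def baselines(samples, pct=95):
    """Map (cell, task_type) to the pct-th percentile latency seen in that slice."""
    groups = defaultdict(list)
    for cell, task_type, latency_ms in samples:
        groups[(cell, task_type)].append(latency_ms)
    return {key: nearest_rank(vals, pct) for key, vals in groups.items()}


# Hypothetical wakeup-latency samples in milliseconds.
samples = [("cell-a", "serving", ms) for ms in range(1, 21)]
samples += [("cell-b", "batch", ms) for ms in (5, 7, 9, 50)]

result = baselines(samples)
print(result)
```

A value like this answers "what latency can we achieve for this kind of task in this cell", which is the raw material for the promises described next.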



Baselines → SLOs

From baselines we provide performance SLOs: a promise to the user

You promise to do X

➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue less than X I/Os per second

We promise to give you Y performance

➔ CPU: You will get scheduled on a CPU within Yms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Yms
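
The two-sided promise can be sketched as a check that only holds the infrastructure accountable when the task kept its own side of the deal. The data class, field names, and the three result labels are hypothetical; the thresholds X and Y are left as parameters because the talk does not give values.

```python
from dataclasses import dataclass


@dataclass
class CpuWindow:
    """One measurement window for a task (all fields are illustrative)."""
    cpu_requested: float      # CPUs the task asked for
    cpu_used: float           # CPUs the task actually used
    wakeup_latency_ms: float  # observed tail wakeup latency


def check_cpu_slo(window: CpuWindow, promised_latency_ms: float) -> str:
    # The user's promise: use at most as much CPU as requested.
    if window.cpu_used > window.cpu_requested:
        return "not_covered"
    # Our promise: get scheduled within the promised latency.
    if window.wakeup_latency_ms <= promised_latency_ms:
        return "met"
    return "violated"


print(check_cpu_slo(CpuWindow(2.0, 1.5, 3.0), promised_latency_ms=5.0))
print(check_cpu_slo(CpuWindow(2.0, 1.5, 9.0), promised_latency_ms=5.0))
print(check_cpu_slo(CpuWindow(2.0, 2.5, 9.0), promised_latency_ms=5.0))
```

The "not_covered" case is the point of the contract: a task that exceeds its request cannot claim a violation for the resulting latency.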



Enacting SLOs

Monitor SLOs closely and aggressively ensure they are met

Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)

Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
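
The per-node and cluster-wide remedies above could be combined into an escalation policy along these lines. The ordering (local remedies first, cluster-wide after repeated violations) and the escalate_after threshold are assumptions made for this sketch; the talk lists the available actions but not how they are prioritized.

```python
def choose_action(consecutive_violations, antagonist=None,
                  node_has_spare=False, escalate_after=3):
    """Return (scope, action) for a task, given how long its SLO has been missed."""
    if consecutive_violations == 0:
        return ("none", "no action")

    # Early on, prefer remedies the node agent can apply by itself.
    if consecutive_violations < escalate_after:
        if node_has_spare:
            return ("node", "give the task more or better resources")
        if antagonist is not None:
            return ("node", f"throttle {antagonist}")

    # Local remedies exhausted or unavailable: ask the cluster for help.
    if antagonist is not None:
        return ("cluster", f"move {antagonist} to a different machine")
    return ("cluster", "move the task to a different machine")


print(choose_action(0))
print(choose_action(1, antagonist="batch-job"))
print(choose_action(3, antagonist="batch-job"))
```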




Metrics
➔ CPU
➔ NUMA
➔ Disk I/O


CPU

Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g.: sleep, wait, run, queue)
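
Two of these metrics can be derived from a per-thread log of state transitions, as in the sketch below. The event format, the state names used as strings, and the choice to treat each "queue" interval that ends in "run" as one wakeup latency sample are assumptions for illustration, not the kernel's actual accounting.

```python
from collections import defaultdict


def analyze(transitions):
    """transitions: list of (timestamp_us, new_state), sorted by time.

    Returns (time_per_state, wakeup_latencies_us). The final transition has
    no known end time, so it contributes nothing to the totals.
    """
    time_per_state = defaultdict(int)
    wakeup_latencies = []
    for (start, state), (end, next_state) in zip(transitions, transitions[1:]):
        duration = end - start
        time_per_state[state] += duration
        # Wanting to run (queued) followed by actually running.
        if state == "queue" and next_state == "run":
            wakeup_latencies.append(duration)
    return dict(time_per_state), wakeup_latencies


events = [
    (0, "sleep"), (100, "queue"), (130, "run"), (400, "sleep"),
    (500, "queue"), (510, "run"), (600, "sleep"),
]
time_per_state, wakeup_latencies = analyze(events)
print(time_per_state, wakeup_latencies)
```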


CPU

SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved


NUMA

Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)

SLOs
➔ NUMA score of 0.85 or above given certain job shapes
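
The slide does not spell out the formula behind the NUMA score. The sketch below assumes the simplest reading of "product of both": the local fraction of CPU multiplied by the local fraction of memory, which naturally stays between 0.0 and 1.0. Treat the formula and the function names as guesses for illustration only.

```python
NUMA_SLO = 0.85  # threshold taken from the slide


def numa_score(cpu_local_fraction: float, mem_local_fraction: float) -> float:
    """Assumed score: product of the two locality fractions (each in 0.0..1.0)."""
    for value in (cpu_local_fraction, mem_local_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("locality fractions must be within [0.0, 1.0]")
    return cpu_local_fraction * mem_local_fraction


def meets_numa_slo(cpu_local_fraction: float, mem_local_fraction: float) -> bool:
    return numa_score(cpu_local_fraction, mem_local_fraction) >= NUMA_SLO


print(meets_numa_slo(0.9, 0.95))  # mostly local CPU and memory
print(meets_numa_slo(0.9, 0.9))   # slightly worse on both axes
```

Under this reading a task needs good locality on both axes at once: being fully local on CPU cannot compensate for memory that mostly lives on a remote node.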

The NUMA Experience




Disk I/O

Low-level metrics
➔ Service time latency: time it took kernel to service request to disk
➔ Wait time latency: time it took kernel to queue and service request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did

SLOs
➔ Small amount of disk time when well-behaved
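
The difference between service time and wait time is easiest to see from per-request timestamps. This sketch assumes three hypothetical timestamps per request (submitted, dispatched to the device, completed); the field names and sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class IoRequest:
    submitted_us: int   # request entered the kernel queue
    dispatched_us: int  # request was sent to the disk
    completed_us: int   # disk finished the request

    @property
    def service_time_us(self) -> int:
        """Time spent servicing the request at the disk."""
        return self.completed_us - self.dispatched_us

    @property
    def wait_time_us(self) -> int:
        """Time spent queued plus time spent being serviced."""
        return self.completed_us - self.submitted_us


# The second request sits in the queue while the first one is serviced.
requests = [IoRequest(0, 50, 250), IoRequest(10, 250, 400)]
service_times = [r.service_time_us for r in requests]
wait_times = [r.wait_time_us for r in requests]
print(service_times, wait_times)
```

A growing gap between wait time and service time points at queueing, typically contention from other work on the same disk, rather than at a slow device.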


User Stories


Performance Regression

User: VM environment

User Problem: … silence ...

SLO not met: CPU

Signal: CPU queue other

Root cause: Subtle, but expensive, new periodic operation

Make it better: Give the application more debug information


Performance Variation #1

User: Flight search

User Problem: QPS variation on some tasks

SLO not met: NUMA

Signal: CPU and memory locality

Root cause: Bad NUMA allocation by infrastructure

Make it better: Improve NUMA allocation


Performance Variation #2

User: Web search

User Problem: Latency variation on some tasks

SLO not met: CPI variation

Signal: CPI from perf_events

Root cause: Bad actors co-scheduled on the machine

Make it better: Throttle or move these bad actors
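
CPI (cycles per instruction) is the ratio of two hardware counters that perf_events can report. The sketch below computes it per task and flags tasks whose CPI is well above the median of their peers. The 1.5x threshold, the median comparison, and the counter values are assumptions for illustration; the talk does not state the actual detection rule.

```python
from statistics import median


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction; higher means each instruction costs more time."""
    return cycles / instructions


def cpi_outliers(counters, factor=1.5):
    """counters: {task: (cycles, instructions)}. Returns tasks with high CPI."""
    values = {task: cpi(c, i) for task, (c, i) in counters.items()}
    threshold = factor * median(values.values())
    return sorted(task for task, value in values.items() if value > threshold)


# Hypothetical counter readings for replicas of the same job.
counters = {
    "search/0": (1000, 1000),
    "search/1": (1100, 1000),
    "search/2": (900, 1000),
    "search/3": (2400, 1000),
}
outliers = cpi_outliers(counters)
print(outliers)
```

Comparing replicas of the same job against each other works because they run the same code, so a replica with much higher CPI is likely suffering from interference on its machine rather than doing different work.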


Performance Degradation Under Load

User: Borglet

User Problem: Stuckness under heavy load

SLO not met: Disk access

Signal: Disk I/O wait time latencies

Root cause: Heavy disk operations blocking other operations

Make it better: Move disk operations away from latency sensitive operations


Future Work

➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components

Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves


Takeaways

➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On by default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper



You can do this too!


Questions?



Victor [email protected]

Join our Microservices Customer Roundtable

● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap

† Must be able to sign digital NDA

http://g.co/microservicesroundtable

