ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Jan 29, 2018

Victor Marmol
Transcript
Page 1: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Confidential + Proprietary

Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Victor Marmol
[email protected]

Page 2: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Today

App

? ? ?

Page 3: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Containers Infrastructure

Manage containers @ Google

Everything runs in a container

2B+ containers started per week

Images by Connie Zhou

Page 4: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

You May Know Some of Our OSS Work

Let Me Contain That For You

Page 5: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

What about at Google?

Images by Connie Zhou

Page 6: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Borg

Page 7: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

What is Borg?

Large-scale cluster management at Google with Borg

Page 8: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Borglet

Google’s node agent

Borglet = init + Docker + a few other things

Primary goals

➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)

Page 9: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

How do we get to task performance management?

Dremel: Interactive Analysis of Web-Scale Datasets

Page 10: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Task Performance Analysis (TPA)

Our system for container-based black-box application performance analysis

Containers are the main enabler

Manage, monitor, and improve application performance

Today’s Talk

➔ How does it work?
➔ User stories: stories from the front lines!

Container

App

Page 11: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

How does it work?

Page 12: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Overall Flow

Collection → Aggregation → Baselines → SLOs → Enforcement

Page 13: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Low-Level Performance Metrics

Key: collect lots of container-based low-level metrics from the kernel

Custom kernel patches to give us even more stats and metrics

Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g.: netlink, ioctls, etc)

Container

App

low-level performance metrics and telemetry
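As a rough illustration of the collection step, the sketch below parses the flat `key value` text that cgroup v2 files such as `cpu.stat` expose and turns two cumulative reads into a per-interval sample. The sample text, the function names, and the choice of `cpu.stat` are assumptions made for illustration; the talk does not say which files or kernel patches the real collector reads.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse 'key value' lines, the layout used by cgroup v2 files like cpu.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def delta(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, int]:
    """Counters are cumulative, so one sample is the difference between two reads."""
    return {key: curr[key] - prev[key] for key in curr if key in prev}


# Hypothetical contents of <cgroup>/cpu.stat, read one second apart.
SAMPLE_T0 = "usage_usec 1000000\nnr_periods 10\nnr_throttled 1\nthrottled_usec 5000\n"
SAMPLE_T1 = "usage_usec 1400000\nnr_periods 20\nnr_throttled 3\nthrottled_usec 25000\n"

interval = delta(parse_flat_keyed(SAMPLE_T0), parse_flat_keyed(SAMPLE_T1))
```

On a real machine the two strings would come from reading the container's cgroup directory at the start and end of each collection interval.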

Page 14: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Low-Level Performance Metrics

Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time

Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis

Collection is very low-overhead
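To make the appeal of histograms concrete, here is a minimal sketch of a power-of-two bucketed latency histogram. It keeps the count and the tail cheaply, which is what makes it suitable for frequent, low-overhead collection. The class name, bucket layout, and microsecond unit are assumptions; the talk does not describe the actual histogram format.

```python
import math
from collections import Counter


class LatencyHistogram:
    """Bucket i holds latencies in [2**i, 2**(i+1)) microseconds."""

    def __init__(self) -> None:
        self.buckets = Counter()
        self.count = 0

    def record(self, latency_us: float) -> None:
        # Values below 1us are clamped into bucket 0.
        index = int(math.log2(max(latency_us, 1.0)))
        self.buckets[index] += 1
        self.count += 1

    def percentile_upper_bound(self, pct: float) -> int:
        """Upper edge (in us) of the bucket that contains the requested percentile."""
        if self.count == 0:
            raise ValueError("empty histogram")
        target = math.ceil(self.count * pct / 100.0)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= target:
                return 2 ** (index + 1)
        raise AssertionError("unreachable: target never exceeds count")
```

With 98 fast operations and two slow ones, the median bucket stays small while the 99th and 100th percentile buckets expose the tail that an average would hide.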

Page 15: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Cluster-Wide Aggregation

Cluster service that collects all metrics and exports them to Dremel

Push data for all tasks on all machines, keep them for a while

Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless

Best news: You can use it too! Google BigQuery

Performance Data DB

BigQuery
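The kind of question SQL makes easy can be sketched with an in-memory SQLite database standing in for Dremel or BigQuery. The table name, columns, sample rows, and the "more than twice the group average" rule are all hypothetical; only the idea of querying per-task metrics across a cluster comes from the talk.

```python
import sqlite3

# In-memory SQLite stands in for the cluster-wide store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_samples (cell TEXT, job TEXT, task INTEGER, wakeup_latency_us REAL)"
)
conn.executemany(
    "INSERT INTO task_samples VALUES (?, ?, ?, ?)",
    [
        ("cell-a", "search", 0, 40.0),
        ("cell-a", "search", 1, 55.0),
        ("cell-a", "search", 2, 900.0),  # one replica is much slower
        ("cell-b", "search", 0, 45.0),
        ("cell-b", "search", 1, 50.0),
    ],
)

# Which replicas sit far above the average for their job in their cell?
OUTLIER_QUERY = """
SELECT s.cell, s.job, s.task, s.wakeup_latency_us
FROM task_samples AS s
JOIN (
    SELECT cell, job, AVG(wakeup_latency_us) AS avg_us
    FROM task_samples
    GROUP BY cell, job
) AS g ON s.cell = g.cell AND s.job = g.job
WHERE s.wakeup_latency_us > 2 * g.avg_us
ORDER BY s.cell, s.task
"""
outliers = conn.execute(OUTLIER_QUERY).fetchall()
```

Here the query singles out task 2 of the search job in cell-a, the one replica whose latency is well above its peers.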

Page 16: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Baselines

Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends

Gives us insights into performance trends and helps us develop performance baselines

Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
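One plausible way to turn sliced data into a baseline is to compute percentiles per group. The sketch below groups samples by cell and task type and reports p50, p95, and p99. The grouping keys, the choice of percentiles, and the function name are assumptions; the talk does not specify how baselines are computed.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, float]  # (cell, task_type, wakeup_latency_ms), hypothetical


def build_baselines(samples: Iterable[Sample]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group samples by (cell, task_type) and summarize each group's distribution."""
    grouped = defaultdict(list)
    for cell, task_type, latency_ms in samples:
        grouped[(cell, task_type)].append(latency_ms)
    baselines = {}
    for key, values in grouped.items():
        # n=100 yields 99 cut points; each group needs at least two samples.
        cuts = quantiles(values, n=100, method="inclusive")
        baselines[key] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return baselines
```

A baseline built this way answers the slide's question directly: for this kind of task in this cell, what scheduling latency is actually achievable.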

Page 17: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Baselines → SLOs

From baselines we provide performance SLOs: a promise to the user

You promise to do X

➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue fewer than X I/Os per second

We promise to give you Y performance

➔ CPU: You will get scheduled on a CPU within Y ms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Y ms
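The two-sided nature of the SLO can be sketched as a small check: the infrastructure's promise only applies while the task keeps its own promise. The class, field names, and result labels are hypothetical, and only the CPU case is shown.

```python
from dataclasses import dataclass


@dataclass
class CpuSlo:
    cpu_limit_cores: float        # what the task asked for (its side of the deal)
    max_wakeup_latency_ms: float  # what the infrastructure promises in return


def evaluate(slo: CpuSlo, cpu_used_cores: float, wakeup_latency_ms: float) -> str:
    """Classify one observation of a task against its SLO."""
    if cpu_used_cores > slo.cpu_limit_cores:
        # The task broke its side of the contract, so the promise does not apply.
        return "task_misbehaving"
    if wakeup_latency_ms > slo.max_wakeup_latency_ms:
        return "slo_violated"
    return "ok"
```

Checking the task's behavior first is the design point: a slow task that is over its CPU request is not an infrastructure failure.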

Page 18: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Enacting SLOs

Monitor SLOs closely and aggressively ensure they are met

Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)

Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine

Container

App

Container

App
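The per-node and cluster-wide options above could be combined into an escalation policy along these lines. The ordering (local fixes first, then asking the cluster to move a task), the parameter names, and the action strings are all assumptions; the talk lists the options but not how they are prioritized.

```python
from typing import List, Optional


def choose_action(
    victim_slo_violated: bool,
    antagonists: List[str],
    node_has_spare_resources: bool,
    throttle_already_tried: bool,
) -> Optional[str]:
    """Pick one remediation, preferring per-node fixes before cluster-wide ones."""
    if not victim_slo_violated:
        return None
    if node_has_spare_resources:
        return "give_victim_more_resources"
    if antagonists and not throttle_already_tried:
        return f"throttle:{antagonists[0]}"
    if antagonists:
        # Local throttling did not help, so escalate to the cluster.
        return f"ask_cluster_to_move:{antagonists[0]}"
    return "ask_cluster_to_move_victim"
```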

Page 19: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Metrics
➔ CPU
➔ NUMA
➔ Disk I/O

Page 20: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

CPU

Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g.: sleep, wait, run, queue)
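The metrics in the talk come from Google's own kernel patches, but a rough stand-in is available on stock Linux. When scheduler statistics are enabled, `/proc/<pid>/schedstat` reports three cumulative fields: time on CPU (ns), time waiting on a run queue (ns), and the number of timeslices run. The sketch below derives a mean run-queue wait from two reads. It is only an approximation of wakeup latency (a mean, not a histogram), and the function and field names are hypothetical.

```python
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ns: int      # time spent running on a CPU
    wait_ns: int     # time spent runnable but waiting on a run queue
    timeslices: int  # number of times the thread was scheduled


def parse_schedstat(text: str) -> SchedStat:
    """Parse the three whitespace-separated fields of /proc/<pid>/schedstat."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return SchedStat(run_ns, wait_ns, timeslices)


def mean_queue_wait_us(prev: SchedStat, curr: SchedStat) -> float:
    """Average run-queue wait per scheduling event between two reads, in microseconds."""
    slices = curr.timeslices - prev.timeslices
    if slices <= 0:
        return 0.0
    return (curr.wait_ns - prev.wait_ns) / slices / 1000.0
```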

Page 21: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

CPU

SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved

Page 22: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

NUMA

Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)

SLOs
➔ NUMA score of 0.85 or above given certain job shapes
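The slide does not give the exact formula for the NUMA score. One reading of "product of both" is the CPU locality fraction multiplied by the memory locality fraction, which stays within 0.0 to 1.0. The sketch below uses that reading; the function names and input units are assumptions.

```python
def locality(local: float, remote: float) -> float:
    """Fraction of a resource that was on the local NUMA node."""
    total = local + remote
    return 1.0 if total == 0 else local / total


def numa_score(
    cpu_local_s: float,
    cpu_remote_s: float,
    mem_local_pages: float,
    mem_remote_pages: float,
) -> float:
    # Assumption: the score is cpu_locality * memory_locality.
    return locality(cpu_local_s, cpu_remote_s) * locality(mem_local_pages, mem_remote_pages)


NUMA_SLO = 0.85  # threshold quoted in the talk


def meets_numa_slo(score: float) -> bool:
    return score >= NUMA_SLO
```

Under this reading, a task with 90% local CPU time and 95% local memory scores about 0.855 and just meets the SLO, while a task split evenly across nodes on CPU scores 0.5 even with fully local memory.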

The NUMA Experience

Page 23: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Disk I/O

Low-level metrics
➔ Service time latency: time it took kernel to service request to disk
➔ Wait time latency: time it took kernel to queue and service request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did

SLOs
➔ Small amount of disk time when well-behaved
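The difference between the two latencies is easiest to see from per-request timestamps. In this sketch, service time covers only the device, while wait time also includes time spent queued. The record layout and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IoRequest:
    submitted_s: float   # request entered the kernel queue
    dispatched_s: float  # request was handed to the device
    completed_s: float   # device reported completion

    @property
    def service_time_ms(self) -> float:
        return (self.completed_s - self.dispatched_s) * 1000.0

    @property
    def wait_time_ms(self) -> float:
        # Wait time covers queueing plus service, so it is never below service time.
        return (self.completed_s - self.submitted_s) * 1000.0
```

A wait time far above the service time suggests the request spent most of its life queued behind other work rather than on a slow device.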

Page 24: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

User Stories

Page 25: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Regression

User: VM environment

User Problem: … silence ...

SLO not met: CPU

Signal: CPU queue other

Root cause: Subtle, but expensive, new periodic operation

Make it better: Give the application more debug information

Page 26: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Variation #1

User: Flight search

User Problem: QPS variation on some tasks

SLO not met: NUMA

Signal: CPU and memory locality

Root cause: Bad NUMA allocation by infrastructure

Make it better: Improve NUMA allocation

Page 27: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Variation #2

User: Web search

User Problem: Latency variation on some tasks

SLO not met: CPI variation

Signal: CPI from perf_events

Root cause: Bad actors co-scheduled on the machine

Make it better: Throttle or move these bad actors
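CPI (cycles per instruction) is computed from two hardware counters that perf_events can report. A simple way to spot a task that is suffering from a co-scheduled antagonist is to compare its CPI with that of its peers. The two-standard-deviation threshold and the function names below are assumptions, not the detection rule used in the talk.

```python
from statistics import mean, pstdev
from typing import List


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction; higher means each instruction is taking longer."""
    return cycles / instructions


def is_cpi_outlier(sample_cpi: float, peer_cpis: List[float], num_stddev: float = 2.0) -> bool:
    # Assumption: "abnormal" means more than num_stddev standard deviations above peers.
    threshold = mean(peer_cpis) + num_stddev * pstdev(peer_cpis)
    return sample_cpi > threshold
```

Because replicas of the same job run the same code, their CPI values tend to cluster, so a replica far above the cluster is a reasonable signal of interference on its machine.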

Page 28: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Degradation Under Load

User: Borglet

User Problem: Stuckness under heavy load

SLO not met: Disk access

Signal: Disk I/O wait time latencies

Root cause: Heavy disk operations blocking other operations

Make it better: Move disk operations away from latency sensitive operations

Page 29: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Future Work

➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components

Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves

Page 30: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Takeaways

➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On by default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper

Page 31: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Takeaways

You can do this too!

Page 32: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Questions?

Victor Marmol
[email protected]

Page 33: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Join our Microservices Customer Roundtable

● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap

g.co/microservicesroundtable
† Must be able to sign digital NDA

Page 34: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Questions?

Images by Connie Zhou