What To Monitor For Black Friday / Cyber Monday Baron Schwartz - VividCortex

What To Monitor For Black Friday / Cyber Monday

Apr 14, 2017



VividCortex
Transcript
Page 1: What To Monitor For Black Friday / Cyber Monday

What To Monitor For Black Friday / Cyber Monday

Baron Schwartz - VividCortex

Page 2: What To Monitor For Black Friday / Cyber Monday

Purpose Of This Webinar

● Why talk about Black Friday and Cyber Monday?

○ Isn’t it just jumping on buzzwords and fluff?

● What kinds of apps are affected?

● What companies don’t see peaks?

● What can we learn from this topic?

Page 3: What To Monitor For Black Friday / Cyber Monday

Themes

● What could go wrong?

● Capacity planning

● Detecting latent issues early

● Understanding technology-specific limits

Page 4: What To Monitor For Black Friday / Cyber Monday

What Could Go Wrong?

Page 5: What To Monitor For Black Friday / Cyber Monday

The Voice Of Your Peers

● Disk space capacity

● Disk I/O capacity / IO wait

● CPU versus query latency

● Mutex bottlenecks / waits

● History list length / VACUUM

● DDoS attacks from botnets

● “Legit DoS” from buying bots

● Noisy neighbors / shared cust

Page 6: What To Monitor For Black Friday / Cyber Monday

Example from Shopify

“Shopify uses Rails, which creates a lot of connections to shard masters. At peak times this has the potential to consume a lot of extra memory. I keep a close eye on this.”

-- Sergio Roysen, Shopify

Page 7: What To Monitor For Black Friday / Cyber Monday

Example from a Hosted E-Commerce Platform

“We’ve had issues where some customers fell victim to Drupal Commerce inserting/updating a lot of records for each order. It worked fine during normal operation, but failed during Black Friday and after.”

-- Anonymous DBA

Page 8: What To Monitor For Black Friday / Cyber Monday

Capacity Planning

Page 9: What To Monitor For Black Friday / Cyber Monday

Capacity

● Ability to serve desired workload within performance tolerances

● Soft limits on capacity

● Hard limits on capacity

Page 10: What To Monitor For Black Friday / Cyber Monday

Desired Workload

● Workload = both the user population and their requests

● Do you know what to expect?

Page 11: What To Monitor For Black Friday / Cyber Monday

Trending and Projections

● Use long-term metrics to project future demand from historical data

● Use forecasting methods such as Holt-Winters for metrics with trend and seasonality
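As a rough illustration of the forecasting bullet above, here is a minimal sketch of additive Holt-Winters smoothing in plain Python. The function name, the initialization scheme (means of the first two seasons), and the smoothing parameters are assumptions made for this example, not something taken from the slides. In practice a maintained library implementation with fitted parameters is likely a better choice.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` future points using additive Holt-Winters smoothing."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of history")

    # Initialize level and trend from the means of the first two seasons.
    first_mean = sum(series[:season_len]) / season_len
    second_mean = sum(series[season_len:2 * season_len]) / season_len
    level = first_mean
    trend = (second_mean - first_mean) / season_len
    # Initial seasonal offsets: how far each slot sits from the first-season mean.
    seasonal = [value - first_mean for value in series[:season_len]]

    # Smooth over the remaining history, one observation at a time.
    for t in range(season_len, len(series)):
        slot = t % season_len
        previous_level = level
        level = alpha * (series[t] - seasonal[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[slot] = gamma * (series[t] - level) + (1 - gamma) * seasonal[slot]

    # Project forward: last level, plus trend per step, plus the matching seasonal offset.
    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % season_len]
            for h in range(1, horizon + 1)]
```

For hourly metrics with a daily cycle, season_len would be 24. For a weekly cycle on daily data it would be 7. The values of alpha, beta, and gamma (each between 0 and 1) control how quickly the level, trend, and seasonal terms react to new data.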

Page 12: What To Monitor For Black Friday / Cyber Monday

Key Resources

● Compute resources: CPU, memory, IO (network/disk)

● Models of the user population (e.g. connections, sessions)

● Consider the application, the database, and the OS

Page 13: What To Monitor For Black Friday / Cyber Monday

Hard Limits

● Configured limits

○ Example: max_connections

○ Example: size of redo log

● Inherent limits

○ Example: network bandwidth
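One way to keep an eye on hard limits is to compare observed peaks against each limit and flag anything that is getting close. The sketch below assumes you already collect peak values somewhere. The function name, the dictionary keys, and the 80% threshold are all hypothetical choices for illustration.

```python
def limits_near_exhaustion(observed_peaks, limits, threshold=0.8):
    """Return {name: fraction_used} for every limit at or above `threshold`."""
    flagged = {}
    for name, limit in limits.items():
        # A limit with no observation is treated as unused.
        used = observed_peaks.get(name, 0) / limit
        if used >= threshold:
            flagged[name] = used
    return flagged


# Hypothetical numbers: a configured limit and an inherent one.
limits = {"max_connections": 500, "network_mbps": 1000}
peaks = {"max_connections": 450, "network_mbps": 300}
flagged = limits_near_exhaustion(peaks, limits)
```

With these made-up numbers only max_connections is flagged, since 450 of 500 is 90% of the limit.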

Page 14: What To Monitor For Black Friday / Cyber Monday

Soft Limits

Many resources have “burstable” capacity or will degrade gradually

● Example: the redo log

● Example: NIC buffers

Others will degrade as you approach capacity

● Example: latency under load

Page 15: What To Monitor For Black Friday / Cyber Monday

Application/Architecture Features

● Does your app degrade gracefully?

● Can you do load shedding?

● Is backpressure built-in? At what tier?

● Do you have feature flags?

● Do you know your most expensive features?
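To make the load shedding question concrete, here is a minimal sketch of one common approach: cap the number of requests in flight and reject the excess immediately instead of queueing it. The class name and its methods are hypothetical. A real system would likely also need timeouts, prioritization, and metrics.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; shed the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        """Call when an admitted request finishes."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1
```

A request handler would call try_acquire() first, return a fast "try again later" response when it gets False, and call release() in a finally block otherwise. Rejecting early is one simple form of backpressure at the application tier.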

Page 16: What To Monitor For Black Friday / Cyber Monday

How Much Runway Do We Have?

Use models/simulations to estimate what % of capacity you’re consuming now

Use forecasting to project what you will need to handle for peaks

Are you going to make it?

Page 17: What To Monitor For Black Friday / Cyber Monday

The Universal Scalability Law

● Simple, fast, real model of capacity under load

● Black-box, easy to measure for

● Gives an idea what % capacity you have used

● Download our ebook and Excel workbook to learn more and do your own modeling
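For readers who want to experiment before opening a workbook, the sketch below writes out the Universal Scalability Law in its usual form, X(N) = lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1)). The function and parameter names are choices made for this example. The coefficients lam, sigma, and kappa are assumed to have been fitted to your own measurements, which this sketch does not do.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput at concurrency n.

    lam   -- throughput of a single worker (n = 1)
    sigma -- contention (serialization) coefficient
    kappa -- coherency (crosstalk) coefficient
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which predicted throughput is highest."""
    if kappa <= 0:
        raise ValueError("kappa must be positive for a finite peak")
    return math.sqrt((1 - sigma) / kappa)


def capacity_used(current_throughput, lam, sigma, kappa):
    """Fraction of the model's peak throughput currently being consumed."""
    peak_n = usl_peak_concurrency(sigma, kappa)
    return current_throughput / usl_throughput(peak_n, lam, sigma, kappa)
```

The result of capacity_used is only as good as the fit. It estimates the share of the modeled peak you are using, which is one way to read the "what % capacity you have used" bullet.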

Page 18: What To Monitor For Black Friday / Cyber Monday

Actual Customer Server with USL Model

Page 19: What To Monitor For Black Friday / Cyber Monday

Another Real Server with USL Model

Page 20: What To Monitor For Black Friday / Cyber Monday

Query Latency vs Disk Utilization

● Queueing theory explains why latency spikes at high utilization

● In database servers, it’s often IO that’s the bottleneck, not CPU

● The spike is highly nonlinear

● See our queueing theory ebook

● Real customer screenshot -->
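The nonlinearity can be illustrated with the simplest queueing model, a single-server M/M/1 queue, where response time is service time divided by (1 - utilization). This is a deliberately simplified assumption: real disks and databases serve requests in parallel and rarely match M/M/1 exactly, so treat the numbers as a shape, not a prediction. The function name is hypothetical.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time (queueing + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


# Stretch factor relative to an idle system at several utilization levels.
stretch = {u: mm1_response_time(1.0, u) for u in (0.5, 0.75, 0.9, 0.99)}
```

Under this model, response time is about 2x the service time at 50% utilization, 4x at 75%, roughly 10x at 90%, and roughly 100x at 99%. The last few percent of utilization cost far more than the first fifty.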

Page 21: What To Monitor For Black Friday / Cyber Monday

Micro-Stalls

All systems stall constantly

When conditions are right, small problems become big, again nonlinearly

If you have 1-second pauses/freezes, would you know it?

VividCortex’s Adaptive Fault Detection algorithm is specifically for this use case
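A very simple way to check whether short freezes are visible at all is to look for gaps in a stream of timestamps that should arrive at a regular interval, such as a heartbeat or high-frequency metric samples. The sketch below is only that basic gap check. It is not VividCortex's Adaptive Fault Detection algorithm, and the function name and tolerance are assumptions for illustration.

```python
def find_stalls(timestamps, expected_interval, tolerance=0.5):
    """Return (start_time, extra_delay) for each gap longer than expected.

    timestamps        -- increasing sample times in seconds
    expected_interval -- seconds expected between consecutive samples
    tolerance         -- extra seconds allowed before a gap counts as a stall
    """
    stalls = []
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = current - previous
        if gap > expected_interval + tolerance:
            stalls.append((previous, gap - expected_interval))
    return stalls
```

Note that this only works if samples are taken often enough. With one-minute resolution metrics, a one-second pause is invisible.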

Page 22: What To Monitor For Black Friday / Cyber Monday

Are You Gonna Make It?

If you know your X factor and it looks like you’re going to fall short on capacity to serve it, you could have a problem.

Next steps? Load simulation / load testing could be a good idea.

Page 23: What To Monitor For Black Friday / Cyber Monday

Detecting Latent Problems

Page 24: What To Monitor For Black Friday / Cyber Monday

Latent Problems

● These are the problems that existed long before

● They manifest when they are least convenient

● Together with other issues, they become jointly sufficient to cause problems/outages

● (There’s no single root cause)

Page 25: What To Monitor For Black Friday / Cyber Monday

Errors You Haven’t Yet Noticed

● Check your error logs! Was the last restart clean?

● Are there errors/warnings in the logs?

● Are there any crashes, failures, restarts, etc. you didn’t know about?

Page 26: What To Monitor For Black Friday / Cyber Monday

Servers with Reboot Risk

● Do you have any servers that have accumulated config drift?

● When’s the last time each server was rebooted?

● Are your servers immutable?

● Is this the first Black Friday with your current hardware, current DB version, etc.?

Page 27: What To Monitor For Black Friday / Cyber Monday

Replication Delay

● Replication works fine until it doesn’t. Then it can’t catch up.

● What’s the “catch-up slope” from small delays in replication?
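One way to put a number on the catch-up slope is to sample replication lag over a window and compute how fast it changes. The sketch below assumes you already have (elapsed seconds, lag seconds) samples from your monitoring. The function names are hypothetical, and the estimate uses only the first and last samples, so it presumes a roughly steady rate.

```python
def catch_up_slope(samples):
    """Change in lag per second of wall-clock time.

    samples -- list of (elapsed_seconds, lag_seconds), oldest first.
    Negative means the replica is gaining; positive means it is falling behind.
    """
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("samples must span a nonzero time window")
    return (lag1 - lag0) / (t1 - t0)


def seconds_to_catch_up(samples):
    """Estimated seconds until lag reaches zero, or None if not catching up."""
    slope = catch_up_slope(samples)
    if slope >= 0:
        return None
    return samples[-1][1] / -slope
```

For example, if lag drops from 120 seconds to 90 seconds over one minute, the slope is -0.5 and the remaining 90 seconds of lag would take about 180 seconds to clear at that rate. Measuring this during a normal week gives a baseline for how much delay a replica can absorb under peak load.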

Page 28: What To Monitor For Black Friday / Cyber Monday

Database-Specific Stuff

● Idle-in-trx sessions

● Locks/mutexes that escalate

○ Per-page locks

○ SELECT FOR UPDATE

○ Buffer pool mutexes

● Overhead of per-XYZ stuff (per-connection overhead…)

● Background worker tasks

○ VACUUM

○ InnoDB buffer pool purge and history list maintenance

Page 29: What To Monitor For Black Friday / Cyber Monday

Knowing Your Workload

Page 30: What To Monitor For Black Friday / Cyber Monday

Workload Analytics Is The Killer App

● Do you have “new” query types?

● Are queries gradually ramping?

● What’s different now versus last week or last month?
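The questions above can be approximated with a simple comparison of per-query execution counts between two periods. The sketch below assumes you can export counts keyed by a normalized query fingerprint. The function name, the input format, and the 2x ramp threshold are assumptions for illustration.

```python
def compare_workloads(previous, current, ramp_factor=2.0):
    """Compare two {query_fingerprint: execution_count} snapshots.

    Returns (new_queries, ramping_queries):
      new_queries     -- fingerprints present now but not in the earlier period
      ramping_queries -- fingerprints whose count grew by at least ramp_factor
    """
    new_queries = sorted(q for q in current if q not in previous)
    ramping_queries = sorted(
        q for q, count in current.items()
        if previous.get(q, 0) > 0 and count >= ramp_factor * previous[q]
    )
    return new_queries, ramping_queries


# Hypothetical fingerprints and counts for last week versus this week.
last_week = {"select_cart": 1000, "insert_order": 200}
this_week = {"select_cart": 1100, "insert_order": 500, "select_promo": 50}
new_queries, ramping_queries = compare_workloads(last_week, this_week)
```

In this made-up data, select_promo is a new query type and insert_order has ramped, having grown from 200 to 500 executions.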

Page 31: What To Monitor For Black Friday / Cyber Monday
Page 32: What To Monitor For Black Friday / Cyber Monday

In Conclusion...

● Try to understand/forecast your capacity requirements

● Try to understand/forecast your headroom

● Look for latent problems

● Sweep the floors so nobody trips on stuff

Page 33: What To Monitor For Black Friday / Cyber Monday

Thanks! Questions?

● Baron Schwartz

[email protected]

● @xaprb