Ship It!!! Coding Reliable Couchbase Applications for Production: Couchbase Connect 2015

Aug 09, 2015

SHIP IT!!! CODING RELIABLE COUCHBASE APPLICATIONS FOR PRODUCTION
Matt Ingenthron, Couchbase
Michael Nitschinger, Couchbase

©2015 Couchbase Inc.

Warning

In this session you will hear stories of lost packets, corrupted data, confused administrators sending terabytes of logs to even more confused developers and many other insanely scary things. If the thought of a bit flip frightens you because you have only parity checking and no error correction, this session may not be for you.

Computers were harmed while preparing this talk.

If what you typically type after “catch” involves only the word “log”, this session may help you. If you hope to learn how an HTTP 503 can be useful, this presentation is for you.

Game Show Time (war stories from the field)

Obligatory Raising of Hands

Who here has used Couchbase? Who has seen this?

Hundredaire!

Question One

System: Virtual machines at a public cloud provider. Node.js application.
Observation: Under load testing, saw high latencies (>100ms).

Causes?

A) Bugs in Couchbase.
B) The system software wasn’t well matched and tested.
C) Running too many node.js processes for the number of OS CPU cores.
D) It’s the “cosmic rays” man.

Root cause: The ethernet device driver in the Linux distro didn’t work that well with the virtualized hardware interface, causing high latencies.

Solution: Swap out the Linux OS distribution. Went from one that was less common but had better user tooling to one of the most common ones in production deployments.

Question Two

System: Private virtual machines on a private cloud. Strong monitoring and control of the environment.
Observation: As daily load would ramp, latencies would rise and failure to meet the SLA would ensue.

Causes?

A) Bugs in Couchbase.
B) JVM Garbage Collection Pauses.
C) Virtualization is overprovisioned.
D) The NSA wiretap program was slowing things down.

Root cause: Memory resources were overprovisioned on the private cloud.

Solution: Adjust the memory allocation within the environment. Also found that the number of Tomcat workers was set unusually high: thousands of worker processes for systems with 8 virtual cores.

Question Three

System: Database running on physical hardware, applications on VMs across the network. SLA need was 50ms or less.
Observation: Regular heartbeat of high latency in the 3-400ms range.

Causes?

A) Bugs in Couchbase.
B) Misconfigured load balancer sending all traffic to one app JVM.
C) Monitoring system interrogating the kernel causing lock contention.
D) Standing waves from running a 50Hz power supply under 60Hz.

Root cause: The monitoring system was inspecting kernel counters on a regular basis and was somehow hitting a hot lock.

Solution: Disable that one poller in the monitor. There were no other apps in that environment that had the same latency requirements, so it was assumed that the environment was clean.

Planning for Success

Define & Measure!

[Diagram: Develop, Test, Measure, Evaluate, Requirements]

If it’s not defined you can’t measure it.

SLAs
Throughput at max. latency

Ideally from the get-go:

Error Detection
Error Recovery
Error Mitigation

Not just unit testing.

Stress Tests
Load Tests
Failure Tests

You can’t manage what you don’t measure.

Evaluate, rinse, repeat.

Service Level Required

100% Uptime not easily achievable

For instance, is it 100% available if 50% of your users are leaving because it’s too slow?

The question must always be:

“At max latency, what throughput do I get?”
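
One way to answer that question from load-test data is sketched below. This is a minimal Python illustration, not something shown in the talk: the function names, the nearest-rank percentile method, the choice of the 99th percentile and the sample numbers are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def max_throughput_within_sla(runs, sla_ms, pct=99.0):
    """Highest offered throughput whose pct-th percentile latency meets the SLA.

    `runs` maps an offered load (ops/sec) to the latencies (ms) observed
    at that load. Returns None when no run stays within the SLA.
    """
    passing = [ops for ops, latencies in runs.items()
               if percentile(latencies, pct) <= sla_ms]
    return max(passing) if passing else None


# Hypothetical measurements: offered ops/sec -> observed latencies in ms.
runs = {
    1000: [2, 3, 3, 4, 5],
    5000: [3, 4, 6, 9, 20],
    10000: [5, 9, 40, 120, 300],
}
best = max_throughput_within_sla(runs, sla_ms=50)
```

With a 50ms limit, the 10000 ops/sec run is rejected because of its tail, even though most of its requests were fast.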

Avoid the Coffin Corner

[Figure: coffin corner chart, height vs. speed. Source: http://de.wikipedia.org/wiki/Coffin_Corner#/media/File:CoffinCorner.png]

Both airplanes and your applications do not like the extremes

Resource contention and overload conditions result in high latency

Keep some headroom to fly smoothly
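
The effect of running without headroom can be illustrated with the textbook M/M/1 queueing model, where mean time in the system is the service time divided by (1 - utilization). The talk does not use this model; it is assumed here only to show the shape of the curve, and real systems will behave differently in detail.

```python
def mean_latency_ms(service_ms, utilization):
    """Mean time in system for an M/M/1 queue (an assumed, simplified model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)


# Latency for a 1 ms operation as the system approaches saturation.
for load in (0.5, 0.8, 0.9, 0.99):
    print("utilization %.2f -> %.1f ms" % (load, mean_latency_ms(1.0, load)))
```

Under this model, going from 50% to 90% utilization multiplies mean latency by roughly five, and the last few percent before saturation cost far more than that.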

Prepare for bad weather

https://stephenthepilot.files.wordpress.com/2015/03/aircraft_deicing.jpg

Prepare for bad weather with Error Detection

System Monitors
Periodic Checking
Watchdogs
Voting
Auditing
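
As an illustration of the watchdog idea, here is a minimal heartbeat check in Python. The class name, the timeout value and the injectable clock are assumptions made for this sketch; a production watchdog would also need to act on the unhealthy state.

```python
import time


class Watchdog:
    """Reports a component as unhealthy if it has not sent a heartbeat
    within `timeout_s` seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so tests can control time
        self._last_beat = clock()

    def heartbeat(self):
        """Called by the monitored component whenever it makes progress."""
        self._last_beat = self._clock()

    def is_healthy(self):
        """Called periodically by the monitor."""
        return self._clock() - self._last_beat <= self._timeout_s
```

A monitor would call `is_healthy()` on a schedule (the periodic checking above) and raise an alert or trigger recovery when it returns False.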

Prepare for bad weather with Error Recovery

Timeouts
Failover
Retries
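
A retry policy could look roughly like the sketch below: a bounded number of attempts with capped exponential backoff, retrying only errors that are plausibly transient. The function name, default values and the choice of retryable exceptions are assumptions for illustration, not an SDK API.

```python
import time


def retry(operation, attempts=3, base_delay_s=0.05, max_delay_s=1.0,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Only exceptions listed in `retry_on` are retried; anything else is
    treated as a non-transient error and propagates immediately.
    """
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise                      # give up: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)   # capped exponential backoff
```

Bounding both the attempt count and the delay keeps retries from piling extra load onto a system that is already struggling.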

Prepare for bad weather with Error Mitigation

Intelligent Data Structures
Failing Fast
Circuit Breakers
Backpressure

Timeouts

Are your last resort when calling external resources.

so: Always use them
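
The code from the original slides did not survive in this transcript. As a stand-in, here is a minimal Python sketch of bounding a blocking call with a timeout. The helper name is hypothetical, and the code uses only the standard library rather than any Couchbase SDK call; where a client library accepts a timeout parameter directly, that is usually the better option.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args):
    """Run `func(*args)` on a worker thread and wait at most `timeout_s`.

    Raises concurrent.futures.TimeoutError if no result arrives in time.
    The worker is not interrupted; it finishes in the background.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)   # do not block the caller on slow work


def slow_lookup(key):
    """Stand-in for an external resource that is responding slowly."""
    time.sleep(0.3)
    return "value-for-" + key


try:
    call_with_timeout(slow_lookup, 0.01, "user::42")
except concurrent.futures.TimeoutError:
    print("lookup timed out, serving a fallback instead")
```

The caller gets control back after the deadline and can decide what to do next: retry, fall back, or fail fast.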

Circuit Breakers

monitor traffic

open if errors happen
  Latency
  Throughput
  Wrong results

close in a controlled fashion

expose metrics
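
The points above can be sketched as a small state machine. This Python version is an illustration only: it opens on consecutive exceptions (the talk also lists latency, throughput and wrong results as triggers), and the class names, thresholds and metric names are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0           # consecutive failures while closed
        self._opened_at = None       # None means the circuit is closed
        self.metrics = {"success": 0, "failure": 0, "rejected": 0}

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"       # allow one trial call through
        return "open"

    def call(self, operation):
        if self.state == "open":
            self.metrics["rejected"] += 1
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = operation()
        except Exception:
            self.metrics["failure"] += 1
            self._failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Success closes the circuit, including after a half-open trial.
        self.metrics["success"] += 1
        self._failures = 0
        self._opened_at = None
        return result
```

The half-open state is what closes the circuit in a controlled fashion: a single trial call decides whether traffic resumes, and the `metrics` dictionary is the hook for exposing what the breaker is doing.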

Backpressure

Allows for coordinated flow control under stress conditions

Is used to shed load and provide a partially good experience

Source: http://mechanical-sympathy.blogspot.co.at/2011/10/smart-batching.html
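
A simple form of backpressure is a bounded queue that refuses new work when it is full instead of growing without limit. The sketch below is a hypothetical Python illustration (the class and method names are not from the talk); the batch-wise `drain` loosely follows the smart-batching idea in the cited source.

```python
import queue


class BoundedInbox:
    """Accepts requests up to a fixed capacity and sheds the rest."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.accepted = 0
        self.shed = 0

    def offer(self, request):
        """Try to enqueue; return False immediately when the inbox is full."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            self.shed += 1
            return False
        self.accepted += 1
        return True

    def drain(self, max_items):
        """Take up to `max_items` queued requests for processing as one batch."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

A web tier could translate a False from `offer` into an HTTP 503, so that some users get a fast "try again" while the requests already accepted are still served within the SLA.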

Testing & Benchmarking

This is NOT a benchmark

Benchmarking

Benchmarks assert expectations while tests verify correctness

As with statistics, almost always wrong and biased

Two hard problems in computer science:
Cache Invalidation
Naming Things

Three hard problems in computer science:
Cache Invalidation
Naming Things
Benchmarking

The appropriate Workload
  Concurrency
  Think Time

The right Environment
  Hardware, OS, external effects

The proper Tool
  Measure NOOPs
  Be aware of GC, Coordinated Omission, ...
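
Coordinated omission is easy to see in a small simulation. The Python sketch below models a single-threaded load generator that is supposed to send one request every `interval_ms`, and compares the latency it would naively report (time spent in each call) with the latency measured from the intended send time. The function name and the numbers are assumptions for illustration.

```python
def simulate_latencies(service_times_ms, interval_ms):
    """Return (naive, corrected) latency lists for a blocking load generator.

    naive:     finish time minus the time the request was actually sent
    corrected: finish time minus the time it was supposed to be sent
    """
    naive, corrected = [], []
    finished_at = 0.0
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        # A blocked generator cannot send until the previous call returns.
        actual_start = max(intended_start, finished_at)
        finished_at = actual_start + service
        naive.append(finished_at - actual_start)
        corrected.append(finished_at - intended_start)
    return naive, corrected


# One 50 ms stall among 1 ms calls, with a request intended every 10 ms.
naive, corrected = simulate_latencies([1, 1, 50, 1, 1, 1], interval_ms=10)
# naive     -> [1, 1, 50, 1, 1, 1]
# corrected -> [1, 1, 50, 41, 32, 23]
```

The naive numbers record a single slow request, while the corrected ones show that three further requests were also delayed by the stall. A benchmark tool that reports only the naive view understates tail latency.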

And the industry?

Yahoo! Cloud Serving Benchmark (YCSB)
  Industry Standard
  Makes it easy to compare solutions
  Be aware of the (many) pitfalls!

Pioneering a new fork: https://github.com/YCSB/YCSB
  Maintained NoSQL versions
  Coordinated Omission fixes
  ...

Java Microbenchmark Harness (JMH) (http://openjdk.java.net/projects/code-tools/jmh/)

http://shipilev.net/talks/jvmls-July2013-benchmarking.pdf

Load & Stress Testing

Load Testing
  Determine behaviour during normal traffic

Stress Testing
  Traffic heavily increased (to the “Coffin Corner”)
  Explicitly test edge cases
  Knowing where and how it breaks is important

Failure Testing

Test specific failure cases
  Node failures
  Netsplits
  Firewall issues (dropped packets, closed sockets)

Failures will happen; better to prepare for them early.

http://www.bloomberg.com/ss/09/04/0427_mdea_awards/image/002_lifepak15monitorde_220a.jpg
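
At the application level, specific failure cases can be rehearsed with a test double that fails on demand. The Python sketch below is a hypothetical illustration: `FlakyBackend` and `get_or_default` are made-up names, and the injected `ConnectionError` merely stands in for a dropped connection. It complements, but does not replace, killing real nodes or dropping real packets.

```python
class FlakyBackend:
    """Test double that raises on the call numbers listed in `fail_on`."""

    def __init__(self, data, fail_on=(), error=ConnectionError):
        self._data = dict(data)
        self._fail_on = set(fail_on)   # 1-based call numbers that should fail
        self._error = error
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.calls in self._fail_on:
            raise self._error("injected failure on call %d" % self.calls)
        return self._data[key]


def get_or_default(backend, key, default):
    """Code under test: degrade to a default when the backend is unreachable."""
    try:
        return backend.get(key)
    except ConnectionError:
        return default


backend = FlakyBackend({"greeting": "hello"}, fail_on={2})
results = [get_or_default(backend, "greeting", "fallback") for _ in range(3)]
```

Because the failure schedule is explicit, the same scenario can be replayed in every test run rather than waiting for it to happen in production.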

Some Tools to Consider

Tools of the trade

Run tools to validate a setup with a reasonably known workload.
  libcouchbase’s cbc pillowfight
  Java’s RoadRunner
  .NET’s MeepMeep

Isolate performance statistics at different layers.
  libcouchbase and Java SDKs have performance profiling abilities
  Couchbase has cbstats timings

Questions?

Thank you.