Page 1: Reactive by example - at Reversim Summit 2015

Reactive By Example

Eran Harel - @eran_ha

Page 2: Reactive by example - at Reversim Summit 2015

source: http://www.reactivemanifesto.org/

The Reactive Manifesto

Page 3: Reactive by example - at Reversim Summit 2015

Responsive

The system responds in a timely manner if at all possible.

source: http://www.reactivemanifesto.org/

Page 4: Reactive by example - at Reversim Summit 2015

Resilient

The system stays responsive in the face of failure.

source: http://www.reactivemanifesto.org/

Page 5: Reactive by example - at Reversim Summit 2015

Elastic

The system stays responsive under varying workload.

source: http://www.reactivemanifesto.org/

Page 6: Reactive by example - at Reversim Summit 2015

Message Driven

Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages.

source: http://www.reactivemanifesto.org/

Page 7: Reactive by example - at Reversim Summit 2015

Case Study

Scaling our metric delivery system

Page 8: Reactive by example - at Reversim Summit 2015

Graphite

● Graphite is a highly scalable real-time graphing system.

● Graphite performs two pretty simple tasks: storing numbers that change over time and graphing them.

● Sources:
  ○ http://graphite.wikidot.com/faq
  ○ http://aosabook.org/en/graphite.html

Page 9: Reactive by example - at Reversim Summit 2015

Graphite

http://aosabook.org/en/graphite.html

Page 10: Reactive by example - at Reversim Summit 2015

Graphite plain-text Protocol

<dotted.metric.name> <value> <unix epoch>\n

For example:

servers.foo1.load.shortterm 4.5 1286269260\n
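A minimal sketch of producing and parsing one line of the plain-text protocol shown above. The function names `format_metric` and `parse_metric` are made up for illustration and are not part of Graphite or Gruffalo.

```python
import time
from typing import Optional, Tuple


def format_metric(name: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one metric as a newline-terminated plain-text protocol line."""
    if timestamp is None:
        timestamp = int(time.time())  # default to "now" as a unix epoch
    return f"{name} {value} {timestamp}\n"


def parse_metric(line: str) -> Tuple[str, float, int]:
    """Split a protocol line back into (name, value, unix epoch)."""
    name, value, timestamp = line.split()
    return name, float(value), int(timestamp)


line = format_metric("servers.foo1.load.shortterm", 4.5, 1286269260)
```

Each line is self-describing, which is what lets a proxy batch and forward lines without understanding more than the newline framing.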

Page 11: Reactive by example - at Reversim Summit 2015

Brief History - take I

App -> Graphite

This kept us going for a while…

The I/O interrupts were too much for Graphite.

Page 12: Reactive by example - at Reversim Summit 2015

Brief History - take II

App -> LogStash -> RabbitMQ -> LogStash -> Graphite

The LogStash on localhost couldn’t handle the load; it crashed and hung on a regular basis.

The horror...

Page 13: Reactive by example - at Reversim Summit 2015

Brief History - take III

App -> Gruffalo -> RabbitMQ -> LogStash -> Graphite

The queue-consuming LogStash was way too slow.

The queue build-up hung RabbitMQ and stopped the producers on Gruffalo.

Total failure.

Page 14: Reactive by example - at Reversim Summit 2015

Brief History - take IV

App -> Gruffalo -> Graphite (single carbon relay)

A single relay couldn’t take all the load, and losing it meant Graphite was 100% unavailable.

Page 15: Reactive by example - at Reversim Summit 2015

Brief History - take V

App -> Gruffalo -> Graphite (multi carbon relays)

Great success, but not for long.

As we grew our metric count we had to take additional measures to make it stable.

Page 16: Reactive by example - at Reversim Summit 2015

Introducing Gruffalo

● Gruffalo acts as a proxy to Graphite; it
  ○ Uses non-blocking IO (Netty)
  ○ Protects Graphite from the herd of clients, minimizing context switches and interrupts
  ○ Replicates metrics between Data Centers
  ○ Batches metrics
  ○ Increases the Graphite availability

https://github.com/outbrain/gruffalo

Page 17: Reactive by example - at Reversim Summit 2015

Metrics Delivery HL Design

[Diagram: Carbon Relay instances in DC1 and DC2]

Page 18: Reactive by example - at Reversim Summit 2015

Graphite (Gruffalo) Clients

● GraphiteReporter
● Collectd
● StatsD
● JmxTrans
● Bucky
● netcat
● Slingshot

Page 19: Reactive by example - at Reversim Summit 2015

Metrics Clients Behavior

● Most clients open up a fresh connection, once per minute, and publish ~1K - 5K metrics

● Each metric is flushed immediately

Page 20: Reactive by example - at Reversim Summit 2015

Scale (Metrics / Min)

More than 4M metrics per minute sent to graphite

Page 21: Reactive by example - at Reversim Summit 2015

Scale (Concurrent Connections)

Page 22: Reactive by example - at Reversim Summit 2015

Scale (bps)

Page 23: Reactive by example - at Reversim Summit 2015

Hardware

● We handle the load using 2 Gruffalo instances in each Data Center (4 cores each)

● A single instance can handle the load, but we need redundancy

Page 24: Reactive by example - at Reversim Summit 2015

The Gruffalo Pipeline (Inbound)

IdleStateHandler -> Line Framer -> StringDecoder -> BatchHandler -> PublishHandler -> Graphite Client

● IdleStateHandler: helps detect dropped / leaked connections

● BatchHandler: handling ends here unless the batch is full (4KB)
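The batching step can be sketched as follows. This is an illustration in plain Python rather than Netty code: the class name `Batcher`, the `publish` callback, and the choice to flush as soon as the accumulated size reaches the capacity are all assumptions; only the 4KB figure comes from the slide.

```python
class Batcher:
    """Accumulates metric lines and hands a batch downstream once it is full."""

    def __init__(self, publish, capacity_bytes=4096):
        self._publish = publish        # hypothetical downstream callback
        self._capacity = capacity_bytes
        self._lines = []
        self._size = 0

    def on_line(self, line: str) -> None:
        # Handling ends here unless the batch has reached its capacity.
        self._lines.append(line)
        self._size += len(line.encode("utf-8"))
        if self._size >= self._capacity:
            self.flush()

    def flush(self) -> None:
        """Publish whatever has been collected and start a new batch."""
        if self._lines:
            self._publish("".join(self._lines))
            self._lines, self._size = [], 0
```

Batching turns many tiny client writes into a few larger writes towards Graphite, which is the point of putting the proxy in front of it.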

Page 25: Reactive by example - at Reversim Summit 2015

The Graphite Client Pipeline (Outbound)

IdleStateHandler -> StringDecoder -> StringEncoder -> GraphiteHandler

● IdleStateHandler: helps detect dropped connections

● GraphiteHandler: handles reconnects, back-pressure, and dropped connections

Page 26: Reactive by example - at Reversim Summit 2015

Graphite Client Load Balancing

[Diagram: metric batches are spread across Carbon Relay 1, Carbon Relay 2, ... Carbon Relay n]

Page 27: Reactive by example - at Reversim Summit 2015

Graphite Client Retries

● A connection to a carbon relay may be down. But we have more than one relay.

● We make a noble attempt to find a target to publish metrics to, even if some relay connections are down.
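One way to picture load balancing with retries is a round-robin pool that skips relays whose connection is down. The slides do not say which selection policy Gruffalo uses, so round-robin is an assumption here, and `RelayPool` and `FakeRelay` are hypothetical names used only for this sketch.

```python
import itertools


class RelayPool:
    """Round-robins batches over relay clients, skipping ones that are down."""

    def __init__(self, clients):
        self._clients = list(clients)
        self._next = itertools.cycle(range(len(self._clients)))

    def publish(self, batch: str) -> bool:
        # Try each relay at most once per batch, starting from the next in turn.
        for _ in range(len(self._clients)):
            client = self._clients[next(self._next)]
            if client.is_connected():
                client.send(batch)
                return True
        return False  # every relay is down; the caller decides what to do


class FakeRelay:
    """Hypothetical stand-in for a relay connection, used only for the demo."""

    def __init__(self, up=True):
        self.up, self.received = up, []

    def is_connected(self):
        return self.up

    def send(self, batch):
        self.received.append(batch)
```

Returning `False` instead of raising leaves the policy for the all-relays-down case (buffer, drop, push back) to the caller.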

Page 28: Reactive by example - at Reversim Summit 2015

Graphite Client Reconnects

Processes crash, the network is *not* reliable, and timeouts do occur...

Page 29: Reactive by example - at Reversim Summit 2015

Graphite Client Metric Replication

● For DR purposes we replicate each metric to 2 Data Centers.

● ...Yes it can be done elsewhere…

● Sending millions of metrics across the WAN, to a remote data center is what brings most of the challenges

Page 30: Reactive by example - at Reversim Summit 2015

Handling Graceless Disconnections

● We came across an issue where an unreachable data center was not detected by the TCP stack.

● This renders the outbound channel unwritable

● Solution: Trigger reconnection when no writes are performed on a connection for 10 sec.
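The write-idle rule can be sketched as a small watchdog. This is plain Python, not the Netty handler the project uses; `WriteIdleWatchdog`, the `reconnect` callback, and the injectable `clock` are hypothetical, and only the 10 second threshold comes from the slide.

```python
import time


class WriteIdleWatchdog:
    """Triggers a reconnect when nothing was written for a while."""

    def __init__(self, reconnect, idle_seconds=10.0, clock=time.monotonic):
        self._reconnect = reconnect    # hypothetical callback that re-dials
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_write = clock()

    def on_write(self) -> None:
        """Record that a write just went out on the connection."""
        self._last_write = self._clock()

    def check(self) -> bool:
        """Call periodically; returns True if a reconnect was triggered."""
        if self._clock() - self._last_write >= self._idle_seconds:
            self._reconnect()
            self._last_write = self._clock()  # restart the idle window
            return True
        return False
```

The rule relies on traffic being steady: with metrics flowing continuously, ten seconds without a single successful write is a strong hint that the peer is gone even though TCP has not noticed.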

Page 31: Reactive by example - at Reversim Summit 2015

Queues Everywhere

● SO_Backlog - queue of incoming connections

● EventLoop queues (inbound and outbound)

● NIC driver queues - and on each device on the way 0_o

Page 32: Reactive by example - at Reversim Summit 2015

Why are queues bad?

● If queues grow unbounded, at some point, the process will exhaust all available RAM and crash, or become unresponsive.

● At this point you need to apply either
  ○ Back-Pressure
  ○ Drop requests: SLA--
  ○ Crash: is this an option?
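A minimal sketch of the first two options on a bounded queue, using Python's standard `queue` module. `BoundedInbox` is a hypothetical name; `offer` shows the drop policy and `put` shows back-pressure by making the producer wait.

```python
import queue


class BoundedInbox:
    """A bounded queue that either blocks the producer or drops the item."""

    def __init__(self, maxsize: int):
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, item) -> bool:
        """Drop policy: never block, count what was rejected."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def put(self, item, timeout=None) -> None:
        """Back-pressure policy: the producer waits until there is room."""
        self._queue.put(item, timeout=timeout)

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._queue.get_nowait()
```

Either way the queue has a hard upper bound, so memory use stays predictable and the cost of overload becomes an explicit decision.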

Page 33: Reactive by example - at Reversim Summit 2015

Why are queues bad?

● In the worst case, a queue adds latency proportional to its size: a new item has to wait for everything queued ahead of it.
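A back-of-the-envelope version of that claim, assuming items are served one at a time in FIFO order. The numbers are hypothetical and chosen only to make the arithmetic visible.

```python
def worst_case_wait_ms(queue_length: int, service_time_ms: int) -> int:
    """Wait for a new item if every item ahead of it is served first."""
    return queue_length * service_time_ms


# Hypothetical numbers: 10,000 queued batches, 2 ms to handle each.
wait_ms = worst_case_wait_ms(10_000, 2)
print(f"worst-case wait: {wait_ms / 1000:.0f} s")  # prints: worst-case wait: 20 s
```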

Page 34: Reactive by example - at Reversim Summit 2015

Back-Pressure

● When one component is struggling to keep-up, the system as a whole needs to respond in a sensible way.

● Back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it.

Page 35: Reactive by example - at Reversim Summit 2015

Back-Pressure (take I)

● Netty sends an event when the channel writability changes

● We use this to stop / resume reads from all inbound connections, and stop / resume accepting new connections

● This isn’t enough under high loads
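The mechanism can be sketched outside Netty as an outbound buffer with a high and a low watermark that reports writability changes to whoever controls inbound reads. `OutboundBuffer`, `InboundReads`, and the watermark values are hypothetical; in Netty the equivalent notification arrives through the channel's writability-changed event.

```python
class OutboundBuffer:
    """Tracks pending bytes and reports writability changes via a callback."""

    def __init__(self, on_writability_changed, low=32 * 1024, high=64 * 1024):
        self._notify = on_writability_changed
        self._low, self._high = low, high   # illustrative watermark values
        self._pending = 0
        self.writable = True

    def enqueue(self, nbytes: int) -> None:
        self._pending += nbytes
        if self.writable and self._pending > self._high:
            self.writable = False
            self._notify(False)             # ask upstream to stop reading

    def flushed(self, nbytes: int) -> None:
        self._pending = max(0, self._pending - nbytes)
        if not self.writable and self._pending < self._low:
            self.writable = True
            self._notify(True)              # safe to resume reading


class InboundReads:
    """Hypothetical switch for reading from, and accepting, inbound connections."""

    def __init__(self):
        self.enabled = True

    def set_enabled(self, enabled: bool) -> None:
        self.enabled = enabled
```

Using two watermarks rather than one threshold avoids flapping between paused and resumed when the buffer hovers around the limit.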

Page 36: Reactive by example - at Reversim Summit 2015

Back-Pressure (take II)

● We implemented throttling based on outstanding messages count

● Set up metrics and observe before applying this
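Throttling on outstanding messages can be sketched as a counter that goes up when a batch is handed to the outbound side and down when its write completes. `OutstandingThrottle` and the limit are hypothetical; the slides do not give the threshold Gruffalo uses.

```python
class OutstandingThrottle:
    """Pauses intake while too many published messages are still in flight."""

    def __init__(self, max_outstanding: int):
        self._max = max_outstanding
        self.outstanding = 0      # worth exporting as a gauge before enforcing

    def on_publish(self) -> None:
        """A batch was handed to the outbound side."""
        self.outstanding += 1

    def on_write_complete(self) -> None:
        """The outbound side finished (or failed) writing a batch."""
        self.outstanding -= 1

    def should_read(self) -> bool:
        """True while it is safe to keep reading from inbound connections."""
        return self.outstanding < self._max
```

Exposing `outstanding` as a metric first makes it possible to pick the limit from observed behavior rather than guessing it.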

Page 37: Reactive by example - at Reversim Summit 2015

Idle / Leaked Inbound Connections Detection

● Broken connections can’t be detected by the receiving side.

● Half-Open connections can be caused by crashes (process, host, routers), unplugging network cables, etc

● Solution: We close all idle inbound connections
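A minimal sketch of closing idle inbound connections by tracking the time of the last read per connection. `IdleReaper`, the `close` callback, and the injectable `clock` are hypothetical, and the idle timeout is left as a parameter because the slides do not state the value used.

```python
import time


class IdleReaper:
    """Closes inbound connections that have not delivered data recently."""

    def __init__(self, idle_seconds: float, clock=time.monotonic):
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_read = {}      # connection id -> time of last read

    def on_read(self, conn_id) -> None:
        """Record that data just arrived on this connection."""
        self._last_read[conn_id] = self._clock()

    def reap(self, close) -> list:
        """Close every connection idle for too long; returns the ids closed."""
        now = self._clock()
        idle = [c for c, t in self._last_read.items()
                if now - t >= self._idle_seconds]
        for conn_id in idle:
            close(conn_id)
            del self._last_read[conn_id]
        return idle
```

Since the receiver cannot tell a half-open connection from a quiet one, treating prolonged silence as failure is the only signal available to it.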

Page 38: Reactive by example - at Reversim Summit 2015

The Load Balancing Problem

● TCP Keep-alive?
● HAProxy?
● DNS?
● Something else?

Page 39: Reactive by example - at Reversim Summit 2015

Consul Client Side Load Balancing

● We register Gruffalo instances in Consul

● Clients use Consul DNS and resolve a random host on each metrics batch

● This makes scaling, maintenance, and deployments easy with zero client code changes :)
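On the client side, picking a random host per batch can be sketched with the standard resolver. The service name `gruffalo.service.consul` and port 2003 in the usage note are assumptions for illustration; `resolve_all` and `pick_target` are hypothetical helper names.

```python
import random
import socket


def resolve_all(hostname: str, port: int) -> list:
    """Return the distinct IP addresses the name currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def pick_target(hostname: str, port: int, resolve=resolve_all, rng=random):
    """Choose a random resolved address; meant to be called once per batch."""
    addresses = resolve(hostname, port)
    if not addresses:
        raise LookupError(f"no addresses for {hostname}")
    return rng.choice(addresses), port
```

A client would call something like `pick_target("gruffalo.service.consul", 2003)` before opening the connection for each batch, so instances added to or removed from Consul are picked up without any client change.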

Page 40: Reactive by example - at Reversim Summit 2015

Auto Scaling?

[What can be done to achieve auto-scaling?]

Page 41: Reactive by example - at Reversim Summit 2015

Questions?

“Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.”

source: http://www.reactivemanifesto.org/

Page 42: Reactive by example - at Reversim Summit 2015

Wouldn’t you want to do this daily?

We’re recruiting ;)