Financial Time Series Cassandra 1.2 Jake Luciani and Carl Yeksigian BlueMountain Capital
NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Dec 04, 2014

This talk will focus on our approach to building a scalable TimeSeries database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes as well as how we monitor and track performance of the system.
Transcript
Page 1: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Financial Time Series
Cassandra 1.2

Jake Luciani and Carl Yeksigian
BlueMountain Capital

Page 2: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your problem.

1000s of consumers
...creating and reading data as fast as possible
...consistent to all readers
...and handle ad-hoc user queries
...quickly
...across datacenters.

Page 3: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your data.

[Charts: AAPL price, MSFT price]

Page 4: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your queries.
Time Series Query

[Chart annotations: start (10am), end (2pm), 1 minute periods]

Start, End, and Periodicity define the query
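As a rough illustration of what "Start, End, Periodicity" pins down, the sketch below enumerates the sample times such a query asks for. The function name `sample_times` and the inclusive end are assumptions made for the example, not details from the talk.

```python
from datetime import datetime, timedelta

def sample_times(start, end, period):
    """List the timestamps start, start + period, ... up to and including end."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += period
    return times

# The slide's example: 10am to 2pm in 1 minute periods.
grid = sample_times(
    datetime(2013, 3, 19, 10, 0),
    datetime(2013, 3, 19, 14, 0),
    timedelta(minutes=1),
)
print(len(grid), grid[0], grid[-1])
```

A four-hour window at one-minute periodicity needs only a few hundred samples, however many raw ticks were stored in that window.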

Page 5: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your queries.
Cross Section Query

As Of time defines the query

[Chart annotation: As Of Time (11am)]

Page 6: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your queries.

● Cross sections are for random data
● Storing for Cross Sections means thousands of writes, inconsistent queries
● We also need bitemporality, but it's hard, so let's ignore it in the query

Page 7: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Know your users.

A million, billion writes per second
...and reads are fast and happen at the same time
...and we can answer everything consistently
...and it scales to new use cases quickly
...and it's all done yesterday

Page 8: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Since we can't optimize for everything.

Let's optimize for Time Series.

Page 9: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Data Model (in C* 1.1)

Row key: AAPL

lastPrice:2013-03-18:2013-03-19 = 0E-34-88-FF-26-E3-2C
lastPrice:2013-03-19:2013-03-19 = 0E-34-88-FF-26-E3-3D
lastPrice:2013-03-19:2013-03-20 = 0E-34-88-FF-26-E3-4E

Page 10: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

But we're using C* 1.2.

CQL3
V-nodes
JBOD
Pooled Decompression buffers
SSD Aware
Parallel Compaction
Off-Heap Bloom Filters
Metrics!
Concurrent Schema Creation

Page 11: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Data Model (CQL 3)

CREATE TABLE tsdata (
    id blob,
    property text,
    asof_ticks bigint,
    knowledge_ticks bigint,
    value blob,
    PRIMARY KEY (id, property, asof_ticks, knowledge_ticks)
)
WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (property ASC, asof_ticks DESC, knowledge_ticks DESC)

Page 12: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

CQL3 Queries: Time Series

SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks >= 1234567890
AND asof_ticks <= 2345678901

Page 13: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

CQL3 Queries: Cross Section

SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks = 1234567890
AND knowledge_ticks < 2345678901
LIMIT 1
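Why LIMIT 1 is enough for the cross section query can be shown with a small in-memory model. This is a sketch only: it assumes rows for a single id and property, held as hypothetical (asof_ticks, knowledge_ticks, value) tuples, and it emulates the descending clustering order with a sort rather than talking to Cassandra.

```python
def time_series(rows, start_ticks, end_ticks):
    """Emulate the time series query: every row with asof in [start, end]."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
    return [r for r in ordered if start_ticks <= r[0] <= end_ticks]

def cross_section(rows, asof_ticks, knowledge_before):
    """Emulate the cross section query.

    With rows ordered by (asof DESC, knowledge DESC), the first row that
    matches the as-of time and has knowledge < knowledge_before is the most
    recent knowledge available, so LIMIT 1 returns exactly that row.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
    for row in ordered:
        if row[0] == asof_ticks and row[1] < knowledge_before:
            return row
    return None

rows = [
    (100, 110, b"a"),
    (100, 120, b"b"),
    (100, 130, b"c"),
    (200, 210, b"d"),
]
print(cross_section(rows, asof_ticks=100, knowledge_before=125))
print(len(time_series(rows, 100, 200)))
```

Note that the time series form returns every knowledge version of every point in the range.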

Page 14: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Data Overload!

All points between start and end
Even though we have a periodicity

All knowledge times
Even though we only want latest

Page 15: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

A Service, not an app

[Diagram: many App clients, several Olympus service instances, and C*]

Page 16: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Filtration

Filter everything by knowledge time

Filter time series by periodicity

200k points filtered down to 300

[Diagram: Cassandra Reads → Filter → Service]

Cassandra reads (before the filter):
AAPL:lastPrice:2013-03-18:2013-03-19
AAPL:lastPrice:2013-03-19:2013-03-19
AAPL:lastPrice:2013-03-19:2013-03-20
AAPL:lastPrice:2013-03-20:2013-03-20
AAPL:lastPrice:2013-03-20:2013-03-21

After the filter (sent to the service):
AAPL:lastPrice:2013-03-18:2013-03-19
AAPL:lastPrice:2013-03-19:2013-03-20
AAPL:lastPrice:2013-03-20:2013-03-21
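The two filters on this slide can be sketched in a few lines. The `Point` type and both function names are hypothetical, and keeping the last point in each period is an assumption about what "filter by periodicity" means; the talk does not show the service's actual filter code.

```python
from collections import namedtuple

Point = namedtuple("Point", "asof knowledge value")

def latest_knowledge(points, knowledge_cutoff=None):
    """Keep one point per as-of time: the one with the greatest knowledge time.

    If knowledge_cutoff is given, points known after the cutoff are ignored.
    """
    best = {}
    for p in points:
        if knowledge_cutoff is not None and p.knowledge > knowledge_cutoff:
            continue
        current = best.get(p.asof)
        if current is None or p.knowledge > current.knowledge:
            best[p.asof] = p
    return [best[asof] for asof in sorted(best)]

def by_periodicity(points, start, period):
    """Keep the last point in each bucket [start + k*period, start + (k+1)*period)."""
    buckets = {}
    for p in sorted(points, key=lambda p: p.asof):
        if p.asof < start:
            continue
        buckets[(p.asof - start) // period] = p
    return [buckets[k] for k in sorted(buckets)]

# The five reads from the slide, as (asof, knowledge) date strings.
reads = [
    Point("2013-03-18", "2013-03-19", 1.0),
    Point("2013-03-19", "2013-03-19", 2.0),
    Point("2013-03-19", "2013-03-20", 3.0),
    Point("2013-03-20", "2013-03-20", 4.0),
    Point("2013-03-20", "2013-03-21", 5.0),
]
filtered = latest_knowledge(reads)
print([(p.asof, p.knowledge) for p in filtered])

# Periodicity filter on integer ticks: one point per 60-tick period.
ticks = [Point(t, 0, float(t)) for t in (0, 30, 59, 60, 61, 125)]
print([p.asof for p in by_periodicity(ticks, start=0, period=60)])
```

Applied to the slide's five reads, the knowledge-time filter leaves the same three points the slide shows reaching the service.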

Page 17: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Pushdown Filters

● To provide periodicity on raw data, downsample on write
● There are still cases where we don't know how to sample
● This filtering should be pushed to C*
● The coordinator node should apply a filter to the result set
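One way to read "downsample on write" is that each write also updates a per-period bucket, so a periodic read never has to scan raw ticks. The sketch below assumes a plain dict standing in for the downsampled table and a "last value in the period wins" rule; both are illustrative choices, not the authors' design.

```python
def downsample_on_write(store, asof_ticks, value, period_ticks):
    """Record a write in its period bucket, keeping only the latest point per bucket.

    store maps bucket start -> (asof_ticks, value). Returns the bucket start.
    """
    bucket = asof_ticks - (asof_ticks % period_ticks)  # start of the period
    previous = store.get(bucket)
    if previous is None or asof_ticks >= previous[0]:
        store[bucket] = (asof_ticks, value)
    return bucket

store = {}
for ticks, value in [(5, "a"), (59, "b"), (30, "c"), (60, "d")]:
    downsample_on_write(store, ticks, value, period_ticks=60)
print(store)
```

The out-of-order write at tick 30 does not replace the later point at tick 59, which is the kind of sampling decision the second bullet warns cannot always be made in advance.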

Page 18: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Complex Value Types

Not every value is a double
Some values belong together
Bid and Ask should come back together

Page 19: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Thrift

Thrift structures as values
Typed, extensible schema
Union types give us a way to deserialize any type
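The idea behind a union type can be shown without Thrift: a tag in front of the payload tells the reader which structure follows, so any stored blob can be deserialized. The sketch below is plain Python with an invented wire format (one tag byte plus big-endian doubles); the `Price` and `BidAsk` types and the tag numbers are hypothetical and are not taken from the authors' Thrift definitions.

```python
import struct
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Price:
    value: float

@dataclass(frozen=True)
class BidAsk:
    bid: float
    ask: float

Value = Union[Price, BidAsk]

_TAG_PRICE = 1    # hypothetical tag numbers
_TAG_BID_ASK = 2

def encode(v: Value) -> bytes:
    """Serialize a value as a one-byte type tag followed by its fields."""
    if isinstance(v, Price):
        return struct.pack(">Bd", _TAG_PRICE, v.value)
    if isinstance(v, BidAsk):
        return struct.pack(">Bdd", _TAG_BID_ASK, v.bid, v.ask)
    raise TypeError(f"unsupported value type: {type(v).__name__}")

def decode(blob: bytes) -> Value:
    """Read the tag, then deserialize the matching structure."""
    tag = blob[0]
    if tag == _TAG_PRICE:
        (value,) = struct.unpack(">d", blob[1:])
        return Price(value)
    if tag == _TAG_BID_ASK:
        bid, ask = struct.unpack(">dd", blob[1:])
        return BidAsk(bid, ask)
    raise ValueError(f"unknown type tag: {tag}")

quote = BidAsk(bid=101.5, ask=101.75)
print(decode(encode(quote)))
```

Because bid and ask travel in one blob, they are written and read together, which is the point of the "Complex Value Types" slide.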

Page 20: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Thrift: Union Types

https://gist.github.com/carlyeks/5199559

Page 21: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

But that's the easy part...

Page 22: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling...

The first rule of scaling is you do not just turn everything to 11.

Page 23: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling...

Step 1 - Fast Machines for your workload
Step 2 - Avoid Java GC for your workload
Step 3 - Tune Cassandra for your workload
Step 4 - Prefetch and cache for your workload

Page 24: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Can't fix what you can't measure

Riemann (http://riemann.io)
Easily push application and system metrics into a single system
We push 4k metrics per second to a single Riemann instance

Page 25: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Metrics: Riemann

Yammer Metrics with Riemann

https://gist.github.com/carlyeks/5199090

Page 26: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Metrics: Riemann

Push stream based metrics library
Riemann Dash for "Why is it slow?"
Graphite for "Why was it slow?"

Page 27: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

VisualVM: The greatest tool EVER

Many useful plugins...
Just start jstatd on each server and go!

Page 28: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Reads: Machines

SSDs for hot data
JBOD config
As many cores as possible (> 16)
10GbE network
Bonded network cards
Jumbo frames

Page 29: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

JBOD is a lifesaver

SSDs are great until they aren't anymore

JBOD allowed passive recovery in the face of simultaneous disk failures (the SSDs had bad firmware)

Page 30: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Reads: JVM

-Xmx12G
-Xmn1600M
-XX:SurvivorRatio=16
-XX:+UseCompressedOops

-XX:+UseTLAB yields ~15% Boost!
(Thread-local allocation buffers, good for SEDA architectures)

JVM

Magic!

Page 31: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Reads: Cassandra

Changes we've made:
● Configuration
● Compaction
● Compression
● Pushdown Filters

Page 32: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Cassandra: Configuration

Hinted Handoff
HHO single threaded, 100kb throttle

Page 33: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Cassandra: Configuration

memtable size
2048mb, instead of 1/3 heap

We're using a 12gb heap; leaves enough room for memtables while the majority is left for reads and compaction.

Page 34: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Cassandra: Configuration

Half-Sync Half-Async server
No thread dedicated to an idle connection
We have a lot of idle connections

Page 35: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Cassandra: Configuration

Multithreaded compaction, 4 cores
More threads to compact means faster compaction
Too many threads means resource contention

Page 36: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Scaling Cassandra: Configuration

Disabled internode compression
Caused too much GC and Latency

On a 10GbE network, who needs compression?
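Collected in one place, the configuration slides might translate into cassandra.yaml overrides along these lines. The option names are the 1.2-era settings as best recalled and should be checked against the cassandra.yaml shipped with your version; mapping "4 cores" to `concurrent_compactors: 4` and "single threaded" to `max_hints_delivery_threads: 1` are assumptions, since the slides do not name the settings.

```python
# Assumed mapping from the slides to cassandra.yaml keys (verify before use).
OVERRIDES = {
    "max_hints_delivery_threads": 1,       # "HHO single threaded"
    "hinted_handoff_throttle_in_kb": 100,  # "100kb throttle"
    "memtable_total_space_in_mb": 2048,    # "2048mb, instead of 1/3 heap"
    "rpc_server_type": "hsha",             # half-sync half-async server
    "multithreaded_compaction": True,      # "Multithreaded compaction"
    "concurrent_compactors": 4,            # assumption for "4 cores"
    "internode_compression": "none",       # "Disabled internode compression"
}

def to_yaml_lines(settings):
    """Render flat key/value settings as simple 'key: value' YAML lines."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return lines

print("\n".join(to_yaml_lines(OVERRIDES)))
```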

Page 37: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Leveled Compaction

Wide rows mean data can be spread across a huge number of SSTables
Leveled Compaction puts a bound on the worst case (*)
Fewer SSTables to read means lower latency, as shown below; orange SSTables get read

[Diagram: SSTable levels L0 to L5]

* In Theory

Page 38: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Leveled Compaction
Breaking Bad

[Diagram: SSTable levels L0 to L5]

Under high write load, forced to read all of the L0 files

Page 39: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Hybrid Compaction
Breaking Better

[Diagram: SSTable levels L0 to L5. Hybrid Compaction = Size Tiered (L0) + Leveled (L1 to L5)]

Size Tiering Level 0
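A back-of-the-envelope model shows why an L0 backlog hurts reads and why size tiering inside L0 helps. It is a deliberate simplification: flushed files are assumed to have identical sizes, merges happen only between equal-sized files, and `min_threshold=4` is an assumed parameter. It is not Cassandra's compaction code.

```python
from collections import Counter

def sstables_read(l0_count, populated_levels):
    """Worst-case SSTables touched by one read under leveled compaction.

    Any L0 file may contain the key; each higher level is non-overlapping,
    so it contributes at most one SSTable.
    """
    return l0_count + populated_levels

def size_tier_l0(file_sizes, min_threshold=4):
    """Repeatedly merge groups of min_threshold equal-sized L0 files into one."""
    counts = Counter(file_sizes)
    merged = True
    while merged:
        merged = False
        for size in sorted(counts):
            while counts[size] >= min_threshold:
                counts[size] -= min_threshold
                counts[size * min_threshold] += 1
                merged = True
    return sorted(counts.elements())

backlog = [1] * 32                     # 32 flushed files waiting in L0
tiered = size_tier_l0(backlog)         # two larger files remain
print(sstables_read(len(backlog), populated_levels=5))
print(sstables_read(len(tiered), populated_levels=5))
```

In this toy model a read drops from 37 candidate SSTables to 7 once the L0 backlog has been size tiered.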

Page 40: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Better Compression: New LZ4Compressor

LZ4 Compression is 40% faster than Google's Snappy...

[Chart series: LZ4 JNI, Snappy JNI, LZ4 Sun Unsafe]

Blocks in Cassandra are so small that we don't see the same gain in production, but the 95th-percentile latency is improved, and it works with Java 7

Page 41: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

CRC Check Chance

CRC check of each compressed block causes reads to be 2x SLOWER.
Lowered crc_check_chance to 10% of reads.

A move to JNI would cause a 30x boost
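The trade-off behind crc_check_chance is a probabilistic check: verify the checksum on only a fraction of reads and accept that some corrupt blocks slip through until a later read catches them. The sketch below illustrates the idea with Python's zlib.crc32; the function name and the in-memory "block" are hypothetical, and this is not Cassandra's read path.

```python
import random
import zlib

def read_block(data, stored_crc, crc_check_chance, rng=random):
    """Return a block's bytes, verifying its CRC32 on a random fraction of reads."""
    if rng.random() < crc_check_chance:
        if zlib.crc32(data) & 0xFFFFFFFF != stored_crc:
            raise IOError("checksum mismatch: block is corrupt")
    return data

block = b"compressed block bytes"
crc = zlib.crc32(block) & 0xFFFFFFFF

# With crc_check_chance=0.1, roughly one read in ten pays for the checksum.
rng = random.Random(0)
for _ in range(1000):
    read_block(block, crc, crc_check_chance=0.1, rng=rng)
```

Setting the chance to 1.0 restores a check on every read; 0.0 disables checking entirely.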

Page 42: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Current Stats

● 12 nodes
● 2 DataCenters
● RF=6
● 150k Writes/sec at EACH_QUORUM
● 100k Reads/sec at LOCAL_QUORUM
● > 6 Billion points (without replication)
● 2TB on disk (compressed)
● Read Latency 50%/95% is 1ms/10ms
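The consistency levels in these stats imply a certain number of replica acknowledgements per operation. The arithmetic below assumes the six replicas are split evenly, three per datacenter, which the slide does not state; the datacenter names are placeholders.

```python
def quorum(replicas):
    """Quorum size for a replica count: a strict majority."""
    return replicas // 2 + 1

# Assumption: RF=6 means 3 replicas in each of the 2 datacenters.
replicas_per_dc = {"dc1": 3, "dc2": 3}

# LOCAL_QUORUM reads wait for a majority in the coordinator's datacenter only.
local_quorum = quorum(replicas_per_dc["dc1"])

# EACH_QUORUM writes wait for a majority in every datacenter.
each_quorum_acks = sum(quorum(n) for n in replicas_per_dc.values())

# More than 6 billion logical points become 6 stored copies each.
stored_copies = 6_000_000_000 * sum(replicas_per_dc.values())

print(local_quorum, each_quorum_acks, stored_copies)
```

Under that assumption a read waits on 2 local replicas, while a write waits on 2 replicas in each datacenter, 4 in total.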

Page 43: NYC* Tech Day — BlueMountain Capital — Financial Time Series w/Cassandra 1.2

Questions?

Thank you!

@tjake and @carlyeks