SCYLLA: NoSQL at Ludicrous Speed Duarte Nunes @duarte_nunes
ScyllaDB: NoSQL at Ludicrous Speed

Jan 21, 2018

J On The Beach
Transcript
Page 1: ScyllaDB: NoSQL at Ludicrous Speed

SCYLLA: NoSQL at Ludicrous Speed

Duarte Nunes, @duarte_nunes

Page 2: ScyllaDB: NoSQL at Ludicrous Speed

❏ Introducing ScyllaDB
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

AGENDA

Page 3: ScyllaDB: NoSQL at Ludicrous Speed

ScyllaDB

● Clustered NoSQL database compatible with Apache Cassandra

● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self tuning
● Mechanically sympathetic C++14

Page 4: ScyllaDB: NoSQL at Ludicrous Speed

YCSB Benchmark: 3 node Scylla cluster vs 3, 9, 15, 30 Cassandra machines

[Charts comparing 3 Scylla, 30 Cassandra, and 3 Cassandra nodes]

Page 5: ScyllaDB: NoSQL at Ludicrous Speed

Scylla vs Cassandra - CL:LOCAL_QUORUM, Outbrain Case Study

Scylla and Cassandra handling the full load (peak of ~12M RPM)

[Chart labels: 200ms, 10ms]

20x Lower Latency

Page 6: ScyllaDB: NoSQL at Ludicrous Speed

Scylla benchmark by Samsung

op/s

Full report: http://tinyurl.com/msl-scylladb

Page 7: ScyllaDB: NoSQL at Ludicrous Speed

Dynamo-based system

Page 8: ScyllaDB: NoSQL at Ludicrous Speed

Data model

[Figure: partitions (Partition Key 1, ...) each hold rows identified by Clustering Key 1, Clustering Key 2, ...; sorted by primary key]

CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
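The layout on this slide can be pictured as a map of maps. The sketch below is a simplification and not how Scylla stores data: `Table`, `Partition`, `insert_row`, and `select_title` are hypothetical names, and it orders partitions by key value for brevity, which is an assumption of the sketch only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical model: partition key -> (clustering key -> title).
// std::map keeps the rows of each partition sorted by clustering key.
using Partition = std::map<int, std::string>;
using Table = std::map<int, Partition>;

// Upsert one row, similar in spirit to the INSERT statement on the slide.
void insert_row(Table& table, int id, int song_id, std::string title) {
    table[id][song_id] = std::move(title);
}

// Point lookup by full primary key; returns nullptr when the row is absent.
const std::string* select_title(const Table& table, int id, int song_id) {
    auto partition = table.find(id);
    if (partition == table.end()) return nullptr;
    auto row = partition->second.find(song_id);
    if (row == partition->second.end()) return nullptr;
    return &row->second;
}
```

Calling `insert_row(t, 62, 209466, "...")` mirrors the slide's INSERT; iterating `t[62]` then visits that partition's rows in `song_id` order.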

Page 9: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

Time

Page 10: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

Time

Page 11: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

SSTable 3

Time

Page 12: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

SSTable 3

SSTable 4

Time

Page 13: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

SSTable 3

SSTable 4

SSTable 1+2+3

Time

Page 14: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

SSTable 3

SSTable 4

SSTable 5

SSTable 1+2+3

Time

Page 15: ScyllaDB: NoSQL at Ludicrous Speed

Log-Structured Merge Tree

SSTable 1

SSTable 2

SSTable 3

SSTable 4

SSTable 5

SSTable 1+2+3

Time

Foreground Job

Background Job
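The sequence of slides above (new SSTables appearing over time, then SSTable 1+2+3) can be sketched as a toy log-structured merge tree. This is an illustration under assumptions, not Scylla's implementation: `LsmTree`, `memtable_limit`, and the merge-everything `compact()` are hypothetical simplifications. Reading `flush()` as the foreground job and `compact()` as the background job is this sketch's interpretation of the labels.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using SSTable = std::map<int, std::string>;  // sorted, never modified after flush

struct LsmTree {
    SSTable memtable;               // in-memory buffer for recent writes
    std::vector<SSTable> sstables;  // flushed tables, oldest first
    std::size_t memtable_limit;     // illustrative flush threshold

    explicit LsmTree(std::size_t limit) : memtable_limit(limit) {}

    void put(int key, std::string value) {
        memtable[key] = std::move(value);
        if (memtable.size() >= memtable_limit) flush();
    }

    // Write the memtable out as a new SSTable and start a fresh one.
    void flush() {
        if (memtable.empty()) return;
        sstables.push_back(memtable);
        memtable.clear();
    }

    // Merge every SSTable into one; going oldest to newest lets newer values overwrite.
    void compact() {
        SSTable merged;
        for (const SSTable& table : sstables)
            for (const auto& entry : table) merged[entry.first] = entry.second;
        sstables.clear();
        if (!merged.empty()) sstables.push_back(std::move(merged));
    }

    // Newest data wins: check the memtable, then SSTables from newest to oldest.
    const std::string* get(int key) const {
        auto hit = memtable.find(key);
        if (hit != memtable.end()) return &hit->second;
        for (auto table = sstables.rbegin(); table != sstables.rend(); ++table) {
            auto found = table->find(key);
            if (found != table->end()) return &found->second;
        }
        return nullptr;
    }
};
```

Without compaction a read may have to probe every SSTable; after `compact()` there is a single table to check, which is the trade the background job makes.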

Page 16: ScyllaDB: NoSQL at Ludicrous Speed

Request path

SSTable

Memtable

Page 17: ScyllaDB: NoSQL at Ludicrous Speed

Request path

SSTable

Memtable

Reads

Page 18: ScyllaDB: NoSQL at Ludicrous Speed

Request path

SSTable

Memtable

Reads

Commit Log

Writes
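As a rough sketch of the request path in this diagram: a write is appended to the commit log and applied to the memtable, while a read consults the memtable and then the SSTable. The `Node` type and its members are hypothetical, and the crash simulation is an assumption added to show why a commit log is kept at all.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::vector<std::pair<int, std::string>> commit_log;  // append-only; on disk in a real system
    std::map<int, std::string> memtable;                  // volatile, in memory
    std::map<int, std::string> sstable;                   // durable, on disk in a real system

    // Writes go to the commit log first, then to the memtable.
    void write(int key, const std::string& value) {
        commit_log.emplace_back(key, value);
        memtable[key] = value;
    }

    // Reads prefer the memtable (newest data) and fall back to the SSTable.
    const std::string* read(int key) const {
        auto hit = memtable.find(key);
        if (hit != memtable.end()) return &hit->second;
        auto found = sstable.find(key);
        return found == sstable.end() ? nullptr : &found->second;
    }

    // Once the memtable is persisted, the log entries covering it are no longer needed.
    void flush() {
        for (const auto& entry : memtable) sstable[entry.first] = entry.second;
        memtable.clear();
        commit_log.clear();
    }

    // Simulated crash: memory is lost, then the memtable is rebuilt by replaying the log.
    void crash_and_recover() {
        memtable.clear();
        for (const auto& entry : commit_log) memtable[entry.first] = entry.second;
    }
};
```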

Page 19: ScyllaDB: NoSQL at Ludicrous Speed

Implementation Goals

● Efficiency:
  ○ Make the most out of every cycle

● Utilization:
  ○ Squeeze every cycle from the machine

● Control:
  ○ Spend the cycles on what we want, when we want

Page 20: ScyllaDB: NoSQL at Ludicrous Speed

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

AGENDA

Page 21: ScyllaDB: NoSQL at Ludicrous Speed

● Thread-per-core design (shard)
  ○ No blocking. Ever.

Enter Seastar www.seastar-project.org

Page 22: ScyllaDB: NoSQL at Ludicrous Speed

Enter Seastar www.seastar-project.org

● Thread-per-core design (shard)
  ○ No blocking. Ever.

● Asynchronous networking, file I/O, multicore

Page 23: ScyllaDB: NoSQL at Ludicrous Speed

Enter Seastar www.seastar-project.org

● Thread-per-core design (shard)
  ○ No blocking. Ever.

● Asynchronous networking, file I/O, multicore

● Future/promise based APIs

Page 24: ScyllaDB: NoSQL at Ludicrous Speed

Enter Seastar www.seastar-project.org

● Thread-per-core design (shard)
  ○ No blocking. Ever.

● Asynchronous networking, file I/O, multicore

● Future/promise based APIs
● Usermode TCP/IP stack included in the box

Page 25: ScyllaDB: NoSQL at Ludicrous Speed

Seastar task scheduler

[Figure: traditional stack vs. Seastar stack, one column per CPU]

Traditional stack (threads with stacks, a scheduler per CPU):
● Thread is a function pointer
● Stack is a byte array from 64k to megabytes
● Context switch cost is high. Large stacks pollute the caches

Seastar stack (promises and tasks per CPU):
● Promise is a pointer to eventually computed value
● Task is a pointer to a lambda function
● No sharing, millions of parallel events
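To make "Task is a pointer to a lambda function" concrete, here is a minimal single-core run queue. It is only a sketch of the idea: `Reactor`, `schedule`, and `run` are hypothetical names and this is not Seastar's scheduler. Each task runs to completion on the caller's stack, so there are no per-task stacks and no context switches between tasks.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-core run queue: tasks are small callables run to completion.
class Reactor {
    std::deque<std::function<void()>> tasks_;
public:
    void schedule(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Runs tasks in FIFO order until the queue is empty; returns how many ran.
    int run() {
        int executed = 0;
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            task();  // a task may schedule follow-up tasks (continuations)
            ++executed;
        }
        return executed;
    }
};
```

A task that schedules another task from inside its body plays the role of a continuation: the follow-up work is queued rather than run on a nested stack.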

Page 26: ScyllaDB: NoSQL at Ludicrous Speed

Seastar memcached

Page 27: ScyllaDB: NoSQL at Ludicrous Speed

Pedis https://github.com/fastio/pedis

Page 28: ScyllaDB: NoSQL at Ludicrous Speed

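Futures

future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    // Decode the request and pick the core (shard) that owns this id.
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    // Run the lookup on the owning core, then write the result back.
    return smp::submit_to(core, [id] {
        return lookup(id);
    }).then([this] (sstring result) {
        return _conn->write(result);
    });
});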
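The routing rule in this snippet, `id % smp::count`, can be tried without Seastar. The sketch below is a plain C++ stand-in: `ShardedStore` and its methods are hypothetical, and each "shard" is just a private map rather than a core with its own memory.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-core state: one private map per shard.
class ShardedStore {
    std::vector<std::map<int, std::string>> shards_;
public:
    explicit ShardedStore(unsigned shard_count) : shards_(shard_count) {
        assert(shard_count > 0);
    }

    // Same routing rule as the slide: the id modulo the number of shards.
    unsigned shard_of(int id) const {
        return static_cast<unsigned>(id) % static_cast<unsigned>(shards_.size());
    }

    void store(int id, std::string value) {
        shards_[shard_of(id)][id] = std::move(value);
    }

    // Only the owning shard is consulted; returns nullptr when the id is absent.
    const std::string* lookup(int id) const {
        const std::map<int, std::string>& shard = shards_[shard_of(id)];
        auto it = shard.find(id);
        return it == shard.end() ? nullptr : &it->second;
    }

    std::size_t shard_size(unsigned shard) const { return shards_[shard].size(); }
};
```

Because every id has exactly one owning shard, no two shards ever touch the same entry, which is what lets the real design avoid locks.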

Page 34: ScyllaDB: NoSQL at Ludicrous Speed

No escaping the monad

future<> f = …;
f.get(); // not allowed

Page 35: ScyllaDB: NoSQL at Ludicrous Speed

Unless...

future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});

Page 37: ScyllaDB: NoSQL at Ludicrous Speed

Seastar memory allocator

● Non-Thread safe!
  ○ Each core gets a private memory pool

Page 38: ScyllaDB: NoSQL at Ludicrous Speed

Seastar memory allocator

● Non-Thread safe!
  ○ Each core gets a private memory pool

● Allocation back pressure
  ○ Allocator calls a callback when low on memory
  ○ Scylla evicts cache in response

Page 39: ScyllaDB: NoSQL at Ludicrous Speed

Seastar memory allocator

● Non-Thread safe!
  ○ Each core gets a private memory pool

● Allocation back pressure
  ○ Allocator calls a callback when low on memory
  ○ Scylla evicts cache in response

● Inter-core free() through message passing
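One way to picture "inter-core free() through message passing" is that a block remembers its owning core, and a free issued elsewhere is forwarded to the owner instead of touching the owner's pool. The sketch below simulates this on a single thread: `CoreAllocator`, `Block`, and the `inbox_` vector are hypothetical, and a real implementation would use a cross-core queue rather than a plain vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Block { unsigned owner; };  // remembers which core allocated it

// Hypothetical per-core allocator; only the owning core ever releases a block.
class CoreAllocator {
    unsigned id_;
    std::size_t live_ = 0;
    std::vector<Block*> inbox_;  // frees forwarded from other cores

    void release(Block* block) { delete block; --live_; }
public:
    explicit CoreAllocator(unsigned id) : id_(id) {}

    Block* allocate() { ++live_; return new Block{id_}; }

    // Called on the core doing the free. A foreign block is not released here;
    // it is sent as a message to the inbox of the core that owns it.
    void free(Block* block, std::vector<CoreAllocator>& cores) {
        if (block->owner == id_) release(block);
        else cores[block->owner].inbox_.push_back(block);
    }

    // The owning core processes forwarded frees when it gets around to it.
    void drain_inbox() {
        for (Block* block : inbox_) release(block);
        inbox_.clear();
    }

    std::size_t live() const { return live_; }
};
```

The private pool stays single-writer, so allocation and local frees need no locking; the only cross-core traffic is the forwarded pointer.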

Page 40: ScyllaDB: NoSQL at Ludicrous Speed

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

AGENDA

Page 41: ScyllaDB: NoSQL at Ludicrous Speed

Usermode I/O scheduler

Storage

Block Layer

Filesystem

Page 42: ScyllaDB: NoSQL at Ludicrous Speed

Usermode I/O scheduler

Storage

Block Layer

Filesystem

Disk I/O Scheduler

Class A Class B

Page 43: ScyllaDB: NoSQL at Ludicrous Speed

Usermode I/O scheduler

[Figure: Query, Commitlog, and Compaction each feed their own queue; the userspace I/O scheduler draws from the queues and issues requests to the disk]
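A scheduler that arbitrates between per-class queues can be sketched with a simple share-based policy. The share values, the `IoScheduler` class, and the "least virtual time first" rule are assumptions chosen for illustration; the slides do not state which policy the real scheduler applies.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

class IoScheduler {
    struct IoClass {
        std::string name;
        unsigned shares;        // relative weight of this class
        double consumed;        // virtual time: requests dispatched divided by shares
        std::deque<int> queue;  // pending request ids
    };
    std::vector<IoClass> classes_;
public:
    std::size_t add_class(std::string name, unsigned shares) {
        assert(shares > 0);
        classes_.push_back(IoClass{std::move(name), shares, 0.0, {}});
        return classes_.size() - 1;
    }

    void enqueue(std::size_t cls, int request_id) {
        classes_[cls].queue.push_back(request_id);
    }

    // Dispatches one request from the backlogged class with the least virtual time.
    // Returns that class's name, or an empty string when every queue is empty.
    std::string dispatch_next() {
        IoClass* best = nullptr;
        for (IoClass& c : classes_)
            if (!c.queue.empty() && (best == nullptr || c.consumed < best->consumed))
                best = &c;
        if (best == nullptr) return std::string();
        best->queue.pop_front();
        best->consumed += 1.0 / best->shares;
        return best->name;
    }
};
```

With a "query" class at 2 shares and a "compaction" class at 1 share, both backlogged, queries receive about two of every three dispatch slots, so background work keeps moving without starving the foreground.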

Page 44: ScyllaDB: NoSQL at Ludicrous Speed

Figuring out optimal disk concurrency

Max useful disk concurrency

Page 45: ScyllaDB: NoSQL at Ludicrous Speed

Cassandra cache

Linux page cache

SSTables

● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand
  ○ Exists
  ○ Hundreds of man-years
  ○ Handling lots of edge cases

Page 46: ScyllaDB: NoSQL at Ludicrous Speed

Cassandra cache

Linux page cache

SSTables

● Parasitic rows

SSTable page (4k)

Your data (300b)
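A quick back-of-the-envelope figure for the "parasitic rows" problem, assuming "4k" means a 4096-byte page and "300b" means a 300-byte row (both readings are assumptions about the slide's shorthand):

```latex
\[
\frac{300\ \text{B}}{4096\ \text{B}} \approx 0.073
\]
```

Under those assumptions only about 7% of a cached page is the row that was requested; the rest is neighboring data that came along with it.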

Page 47: ScyllaDB: NoSQL at Ludicrous Speed

Cassandra cache

Linux page cache

SSTables

● Page faults

[Figure: timeline across App thread, Kernel, and SSD lanes, labeled: page fault, suspend thread, initiate I/O, context switch, I/O completes, context switch, interrupt, map page, resume thread]

Page 48: ScyllaDB: NoSQL at Ludicrous Speed

Cassandra cache

Key cache

Row cache

On-heap / Off-heap

Linux page cache

SSTables

● Complex tuning

Page 49: ScyllaDB: NoSQL at Ludicrous Speed

Scylla cache

Unified cache

SSTables

Page 50: ScyllaDB: NoSQL at Ludicrous Speed

Probabilistic Cache Warmup

● A replica with a cold cache should be sent fewer requests
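One simple way to send fewer requests to a cold replica is to pick replicas with probability proportional to their cache hit rate. The sketch below shows that idea only: `replica_weights`, `pick_replica`, and the `floor` parameter are hypothetical, and the slides do not say how Scylla actually computes its probabilities.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Turn per-replica cache hit rates (0..1) into selection probabilities.
// `floor` keeps some traffic flowing to cold replicas so their caches can warm up.
std::vector<double> replica_weights(const std::vector<double>& hit_rates, double floor) {
    assert(!hit_rates.empty() && floor > 0.0);
    std::vector<double> weights;
    double total = 0.0;
    for (double rate : hit_rates) {
        double weight = rate > floor ? rate : floor;
        weights.push_back(weight);
        total += weight;
    }
    for (double& weight : weights) weight /= total;  // normalize to sum to 1
    return weights;
}

// Pick a replica index given a uniform random number u in [0, 1).
std::size_t pick_replica(const std::vector<double>& weights, double u) {
    double cumulative = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cumulative += weights[i];
        if (u < cumulative) return i;
    }
    return weights.size() - 1;  // guards against floating-point rounding
}
```

The floor matters: with a weight of zero a cold replica would never receive reads and so would never warm up.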

Page 51: ScyllaDB: NoSQL at Ludicrous Speed

Yet another allocator (Problems with malloc/free)

● Memory gets fragmented over time
  ○ If the workload changes sizes of allocated objects
  ○ Allocating a large contiguous block requires evicting most of cache

Page 52: ScyllaDB: NoSQL at Ludicrous Speed

Memory

Page 68: ScyllaDB: NoSQL at Ludicrous Speed

Memory

OOM :(

Page 76: ScyllaDB: NoSQL at Ludicrous Speed

Log-structured memory allocation

● Bump-pointer allocation to current segment
● Frees leave holes in segments
● Compaction will try to solve this

Page 77: ScyllaDB: NoSQL at Ludicrous Speed

Compacting LSA

● Teach allocator how to move objects around
  ○ Updating references
● Compact! (rather than garbage collect)
  ○ Starting with the most sparse segments
  ○ Lock to pin objects
● Used mostly for the cache
  ○ Large majority of memory allocated
  ○ Small subset of allocation sites
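The ideas on these two slides (bump-pointer allocation into segments, holes left by frees, compaction that moves objects and updates references) can be sketched in a few lines. This is a toy under stated assumptions: `LogStructuredAllocator` is a hypothetical class, objects are plain `int`s, callers hold handles instead of pointers so that objects can move, and `compact()` rewrites every segment rather than starting with the sparsest ones.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class LogStructuredAllocator {
    struct Slot { std::size_t handle; int value; bool live; };
    std::vector<std::vector<Slot>> segments_;
    std::vector<std::pair<std::size_t, std::size_t>> location_;  // handle -> (segment, slot)
    std::size_t segment_capacity_;

    // Bump-pointer allocation: append to the current segment, opening a new one when full.
    void place(std::size_t handle, int value) {
        if (segments_.empty() || segments_.back().size() == segment_capacity_)
            segments_.emplace_back();
        segments_.back().push_back(Slot{handle, value, true});
        location_[handle] = std::make_pair(segments_.size() - 1, segments_.back().size() - 1);
    }
public:
    explicit LogStructuredAllocator(std::size_t segment_capacity)
        : segment_capacity_(segment_capacity) { assert(segment_capacity > 0); }

    std::size_t alloc(int value) {
        location_.emplace_back(0, 0);
        std::size_t handle = location_.size() - 1;
        place(handle, value);
        return handle;
    }

    // Freeing only marks the slot dead, leaving a hole in its segment.
    void free(std::size_t handle) {
        segments_[location_[handle].first][location_[handle].second].live = false;
    }

    // Access goes through the handle table, so a moved object is still reachable.
    int& get(std::size_t handle) {
        return segments_[location_[handle].first][location_[handle].second].value;
    }

    // Copy live objects into fresh segments and update their recorded locations.
    void compact() {
        std::vector<std::vector<Slot>> old;
        old.swap(segments_);
        for (const std::vector<Slot>& segment : old)
            for (const Slot& slot : segment)
                if (slot.live) place(slot.handle, slot.value);
    }

    std::size_t segment_count() const { return segments_.size(); }
};
```

After half of the objects in two segments are freed, `compact()` packs the survivors into one segment and the old handles still resolve, which is the property that "updating references" provides.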

Page 78: ScyllaDB: NoSQL at Ludicrous Speed

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

AGENDA

Page 79: ScyllaDB: NoSQL at Ludicrous Speed

● Internal feedback loops to balance competing loads
  ○ Consume what you export

Workload Conditioning

Page 80: ScyllaDB: NoSQL at Ludicrous Speed

Workload Conditioning

[Figure: the Seastar Scheduler sits between the workloads (Memtable, Compaction, Query, Repair, Commitlog) and the resources (SSD, WAN, CPU)]

Page 81: ScyllaDB: NoSQL at Ludicrous Speed

Workload Conditioning

[Figure: the Seastar Scheduler sits between the workloads (Memtable, Compaction, Query, Repair, Commitlog) and the resources (SSD, WAN, CPU); a Compaction Backlog Monitor is added]

Page 82: ScyllaDB: NoSQL at Ludicrous Speed

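Workload Conditioning

[Figure: the Seastar Scheduler sits between the workloads (Memtable, Compaction, Query, Repair, Commitlog) and the resources (SSD, WAN, CPU); the Compaction Backlog Monitor feeds back into the scheduler: adjust priority]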
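The "adjust priority" feedback loop can be illustrated with a proportional rule: the larger the compaction backlog, the more scheduler shares compaction receives. The function name, the linear mapping, and every parameter here are assumptions made for the sketch; the slides do not describe the actual control law.

```cpp
#include <cassert>

// Hypothetical proportional controller: maps a measured compaction backlog to
// scheduler shares between min_shares (no backlog) and max_shares (at or above target).
unsigned compaction_shares(double backlog, double target_backlog,
                           unsigned min_shares, unsigned max_shares) {
    assert(target_backlog > 0.0 && min_shares <= max_shares);
    double pressure = backlog / target_backlog;
    if (pressure < 0.0) pressure = 0.0;  // clamp to [0, 1]
    if (pressure > 1.0) pressure = 1.0;
    return min_shares + static_cast<unsigned>((max_shares - min_shares) * pressure);
}
```

Re-evaluating this as the monitor reports new backlog values closes the loop: compaction speeds up when it falls behind and yields resources to queries once it has caught up.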

Page 83: ScyllaDB: NoSQL at Ludicrous Speed

Workload Conditioning

[Figure: the Seastar Scheduler sits between the workloads (Memtable, Compaction, Query, Repair, Commitlog) and the resources (SSD, WAN, CPU); a Memory Monitor is added]

Page 84: ScyllaDB: NoSQL at Ludicrous Speed

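Workload Conditioning

[Figure: the Seastar Scheduler sits between the workloads (Memtable, Compaction, Query, Repair, Commitlog) and the resources (SSD, WAN, CPU); the Memory Monitor feeds back into the scheduler: adjust priority]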

Page 85: ScyllaDB: NoSQL at Ludicrous Speed

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Workload Conditioning
❏ Closing

AGENDA

Page 86: ScyllaDB: NoSQL at Ludicrous Speed

● Careful system design and control of the software stack can maximize throughput
● Without sacrificing latency
● Without requiring complex end-user tuning
● While having a lot of fun

Conclusions

Page 87: ScyllaDB: NoSQL at Ludicrous Speed

● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: [email protected]

How to interact

Page 88: ScyllaDB: NoSQL at Ludicrous Speed

SCYLLA, NoSQL at Ludicrous Speed

Thank you.

@duarte_nunes