Being Closer to Cassandra
Oleg Anastasyev, lead platform developer, Odnoklassniki.ru
Jan 15, 2015
#CASSANDRAEU
Top 10 of world's social networks*
- 40M DAU, 80M MAU, 7M peak
- ~300 000 www req/sec, 20 ms render latency
- >240 Gbit out
- >5 800 iron servers in 5 DCs
- 99.9% Java
* Odnoklassniki means "classmates" in English
Cassandra @ Odnoklassniki
* Since 2010
- branched 0.6
- aiming at: full operation on DC failure, scalability, ease of operations
* Now
- 23 clusters, 418 nodes in total, 240 TB of stored data
- survived several DC failures
Case #1. The fast
Like! widget
* It's everywhere
- on every page, dozens per page
- on feeds (AKA timeline)
- on 3rd party websites elsewhere on the internet
* It's on everything
- pictures and albums
- videos
- posts and comments
- 3rd party shared URLs
Like! widget
* High load
- 1 000 000 reads/sec, 3 000 writes/sec
* Hard load profile
- read-mostly
- long tail (40% of reads are random)
- sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities
Classic solution: SQL table

RefId:long    RefType:byte   UserId:long    Created
9999999999    PICTURE(2)     11111111111    11:00

To render "You and 4256" on a page with N widgets:
SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?    = N >= 1   (98% are NONE)
SELECT COUNT(*) WHERE RefId,RefType=?,?          = M > N    (80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)     = N*140
Cassandra solution

LikeByRef (
  refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId), UserId )
)

LikeCount (
  refType byte, refId bigint, likers counter,
  PRIMARY KEY ( (RefType, RefId) )
)

so, to render:
SELECT FROM LikeCount WHERE RefId,RefType=?,?              (80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?   (98% are NONE)   = N*20%
* Quick workaround?

LikeByRef (
  refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId, UserId) )
)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)
- >11M iops
- forces order-preserving partitioner (random does not scale here)
- key range scans
- more network overhead
- partition count >10x, dataset size >2x
By-column bloom filter
* What it does
- includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db
* The good
- eliminated 98% of reads
- fewer false positives
* The bad
- they become too large
- GC promotion failures... but fixable (CASSANDRA-2466)
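The by-column filter idea can be sketched in plain Java. This is a toy filter, not Cassandra's actual Filter.db code; the class and method names are illustrative. The key point is that membership is tested on the (PartKey, ColumnKey) *pair*, so a point read like "does user U like entity E?" can skip SSTables that provably do not contain that exact column.

```java
import java.util.BitSet;

// Toy by-column bloom filter: bits are set per (partitionKey, columnKey)
// pair, so EXISTS()-style point reads can be answered "definitely not
// here" without touching the SSTable at all.
public class ColumnBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public ColumnBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from the combined (partKey, columnKey) pair.
    private int position(String partKey, String columnKey, int i) {
        long h = (partKey + "\u0000" + columnKey).hashCode();
        h = h * 31 + i * 0x9E3779B9L;  // mix in the hash index
        return (int) Math.floorMod(h, (long) size);
    }

    public void add(String partKey, String columnKey) {
        for (int i = 0; i < hashes; i++) bits.set(position(partKey, columnKey, i));
    }

    // false => definitely absent (the 98% fast path); true => must read SSTable
    public boolean mightContain(String partKey, String columnKey) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(partKey, columnKey, i))) return false;
        return true;
    }
}
```

The trade-off on the slide follows directly: one entry per column rather than per partition makes the filter much bigger, which is where the large-allocation GC trouble came from.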
Are we there yet?
- min 2 roundtrips per render (COUNT + RR)
- Thrift is slow, especially with lots of connections
- EXISTS() alone is 200 Gbit/sec of traffic (140*8*1Mps*20%)

[diagram: >400 application servers querying the cassandra ring: 1. COUNT(), 2. EXISTS()]
Co-locate!
- one-nio remoting (faster than java nio)
- topology-aware clients

[diagram: Remote Business Intf, get() : LikeSummary -> odnoklassniki-like node (Counters Cache, Social Graph Cache) co-located with cassandra]
Co-location wins
* Fast TOP N friend likers query
1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read some until N friends are found
* Custom caches
- tuned for the application
* Custom data merge logic
- ... so you can detect and resolve conflicts
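The three steps of the friend-likers query can be sketched like this. The graph cache, bloom filter, and storage read are modeled as plain collections and predicates; the names (`topFriendLikers`, `storageHasLike`, etc.) are illustrative, not Odnoklassniki's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the "fast TOP N friend likers" query: walk the friend list,
// use the in-memory bloom filter to skip friends who definitely did not
// like the entity, and stop as soon as N likers are found.
public class FriendLikers {
    public static List<Long> topFriendLikers(
            List<Long> friends,             // step 1: from the social graph cache
            Predicate<Long> bloomFilter,    // step 2: in-memory by-column bloom filter
            Predicate<Long> storageHasLike, // step 3: authoritative (disk) read
            int n) {
        List<Long> result = new ArrayList<>();
        for (long friend : friends) {
            if (result.size() >= n) break;           // found N: stop reading
            if (!bloomFilter.test(friend)) continue; // definitely no like: skip the read
            if (storageHasLike.test(friend)) result.add(friend);
        }
        return result;
    }
}
```

Because the filter eliminates the ~98% negative lookups in memory, the expensive step 3 runs only a handful of times per widget.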
Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key, ColumnFamily data);
}

* Register it between commit log replay and gossip
* Hooked into RowMutation.apply(), so it sees the original mutation
  + replica writes, hints, ReadRepairs

// and register with CFS
store = Table.open(..)
        .getColumnFamilyStore(..);
store.setListener(myListener);
Like! optimized counters

LikeCount (
  refType byte, refId bigint, ip inet, counter int,
  PRIMARY KEY ( (RefType, RefId), ip )
)

* Counters cache
- off-heap (sun.misc.Unsafe)
- compact (30M in 1 GB of RAM)
- reads cached on the local node only
* Replicated cache state
- solves the cold replica cache problem
- by making (NOP) mutations: fewer reads, long-tail aware
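The off-heap idea can be sketched as follows. The talk's cache uses `sun.misc.Unsafe`; here a direct `ByteBuffer` stands in as the off-heap slab, and the trivial open-addressing scheme (one slot per key hash, no collision handling) is purely illustrative. Either way the counters live outside the Java heap, so tens of millions of them add no GC pressure.

```java
import java.nio.ByteBuffer;

// Sketch of an off-heap counter cache: a fixed slab of native memory
// holding one 4-byte int counter per slot. A real cache would handle
// hash collisions and eviction; this only shows the off-heap layout.
public class OffHeapCounters {
    private final ByteBuffer slab; // native (off-heap) memory, zero-initialized
    private final int capacity;

    public OffHeapCounters(int capacity) {
        this.capacity = capacity;
        this.slab = ByteBuffer.allocateDirect(capacity * 4); // 4 bytes per counter
    }

    // Byte offset of the key's slot (illustrative mixing hash, no collision handling).
    private int slot(long key) {
        return (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) capacity) * 4;
    }

    public void increment(long key, int delta) {
        int off = slot(key);
        slab.putInt(off, slab.getInt(off) + delta);
    }

    public int get(long key) {
        return slab.getInt(slot(key));
    }
}
```

At 4 bytes per counter plus keys, packing ~30M counters into 1 GB as the slide claims is plausible only with a layout this compact, which is the point of going off-heap and hand-rolled.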
Read latency variations
* C* read behavior
1. Choose 1 node for data and N for digests
2. Wait for data and digests
3. Compare and return (or read-repair)
* Nodes suddenly slow down
- SEDA hiccup, commit log rotation, sudden IO saturation, network hiccup or partition, page cache miss
* The bad
- you have latency spikes
- you have to wait (and time out)
Read latency leveling
* "Parallel" read handler
1. Ask all replicas for data in parallel
2. Wait for CL responses and return
* The good
- minimal latency response
- constant load when a DC fails
* The (not so) bad
- "additional" work and traffic
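The parallel handler's waiting logic can be sketched with `java.util.concurrent` primitives standing in for Cassandra's messaging: fire the read at every replica at once, then return after the first CL responses, so one slow replica cannot stall the request. `Supplier` models a replica call; names are illustrative.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Sketch of a "parallel" read: submit the read to all replicas, take the
// first CL (consistency level) answers, ignore stragglers.
public class ParallelRead {
    public static <T> T read(List<Supplier<T>> replicas, int cl, ExecutorService pool)
            throws InterruptedException {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        for (Supplier<T> replica : replicas) cs.submit(replica::get);
        T last = null;
        for (int i = 0; i < cl; i++) {
            try {
                last = cs.take().get(); // blocks only until the NEXT replica answers
            } catch (ExecutionException e) {
                throw new RuntimeException("replica failed", e);
            }
        }
        return last; // newest-timestamp merge of the CL answers omitted for brevity
    }
}
```

This is exactly the "additional work" trade-off on the slide: all replicas do a full data read on every request, but tail latency is bounded by the CL-th fastest replica rather than the slowest digest.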
More tiny tricks
* On SSD IO
- deadline IO elevator
- 64k -> 4k read request size
* HintLog
- commit log for hints
- wait for all hints on startup
* Selective compaction
- compacts most-read CFs more often
Case #2. The fat
* Messages in chats
- last page is accessed on open
- long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- read-mostly (120k reads/sec, 8k writes/sec)
Messages have structure

Message (chatId, msgId,
  created, type, userIndex, deletedBy, ... text)

MessageCF (chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- all of a chat's messages in a single partition
- single blob for message data, to reduce overhead
* The bad
- conflicting modifications can happen (users, anti-spam, etc.)
LW conflict resolution

Messages (chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

Two writers read the same (version:ts1, data:d1) and update concurrently:
writer A: write(ts1, data2) -> delete(version:ts1); insert(version:ts2=now(), data2)
writer B: write(ts1, data3) -> delete(version:ts1); insert(version:ts3=now(), data3)
Both (ts2, data2) and (ts3, data3) survive - merged on read
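Merge-on-read for the versioned schema above can be sketched like this. The resolver shown is plain last-write-wins by version, the simplest possible policy; the talk's point is that because all surviving versions reach the reader, an application-specific merge (e.g. preferring anti-spam edits) can be plugged in instead. Class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of merge-on-read: after a concurrent delete(ts1)+insert, several
// (version, data) cells remain for one msgId; the reader resolves them.
public class VersionedMerge {
    public record Cell(long version, String data) {}

    // Last-write-wins placeholder for an application-specific merge.
    public static Cell mergeOnRead(List<Cell> versions) {
        return versions.stream()
                .max(Comparator.comparingLong(Cell::version))
                .orElseThrow();
    }
}
```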
Specialized cache
* Again. Because we can
- off-heap (Unsafe)
- caches only the freshest chat page
- saves its state to a local (AKA system) CF: keys AND values, sequential read, much faster startup
- in-memory compression: 2x more memory almost for free
Disk mgmt
* 4U, 24x HDD, up to 4 TB/node
- size-tiered compaction = one 4 TB sstable file
- RAID10? LCS?
* Split each CF into 256 pieces
* The good
- smaller, more frequent memtable flushes
- same compaction work, in smaller sets
- can distribute across disks
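The routing behind the 256-piece split can be sketched in a few lines: each partition key is hashed to one of 256 physical pieces, and each piece has its own memtable, sstables, and on-disk directory. The hash and naming scheme here are illustrative assumptions, not the actual implementation.

```java
// Sketch of splitting one logical CF into 256 physical pieces. Each piece
// flushes and compacts independently on ~1/256 of the data, and pieces can
// be assigned to different disks.
public class PiecedColumnFamily {
    public static final int PIECES = 256;

    // Route a partition key to a piece (illustrative mixing hash, top byte).
    public static int pieceOf(long partitionKey) {
        long h = partitionKey * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 56); // top 8 bits -> piece 0..255
    }

    // e.g. which physical CF (and thus disk directory) a key lives in
    public static String pieceName(String cf, long partitionKey) {
        return cf + "_" + pieceOf(partitionKey);
    }
}
```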
Disk allocation policies
* Default is
- "take the disk with most free space"
* So some disks get
- too many read iops
* Generational policy
- each disk holds the same # of same-generation files
- works better for HDD
Case #3. The ugly (feed my Frankenstein)
* Chats overview
- small dataset (230 GB)
- has a hot set, short tail (5%)
- list reorders often
- 130k reads/s, 21k writes/s
Conflicting updates
* List<Overview> is a single blob
- ... or you'll have a lot of tombstones
* Lots of conflicts
- updates of a single column
* Need conflict detection
* Have a merge algorithm
Vector clocks
* Voldemort
- byte[] key -> byte[] value + VC
- coordination logic on clients
- pluggable storage engines
* Plugged in
- C* 0.6 SSTables persistence
- fronted by a specialized cache (we love caches)
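A minimal vector-clock sketch shows the conflict detection the slide relies on: one counter per writing node, and when neither clock dominates the other the two values were written concurrently and must go to the merge algorithm. This is a generic textbook implementation, not Voldemort's actual class.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: per-node counters attached to each stored value.
public class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    // Record a write made by the given node.
    public void tick(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    /** true if every entry of this clock is >= the other's (this happened after). */
    public boolean dominates(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        return true;
    }

    /** Neither clock dominates: a genuine concurrent conflict to merge. */
    public boolean concurrentWith(VectorClock other) {
        return !dominates(other) && !other.dominates(this);
    }
}
```

Compared with the timestamp-only versions of case #2, this detects conflicts exactly (no clock-skew ambiguity), at the cost of carrying a clock per value.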
Performance
* 3 node cluster, RF = 3
- Intel Xeon CPU E5506 2.13GHz, RAM: 48 GB, 1x HDD, 1x SSD
* 8 byte key -> 1 KB value
* Results
- 75k/sec reads, 15k/sec writes
Why Cassandra?
* Reusable distributed DB components
- fast persistence, gossip, reliable async messaging, failure detectors, topology, seq scans, ...
* Has structure
- beyond byte[] key -> byte[] value
* Delivered its promises
* Implemented in Java
#CASSANDRAEU #CASSANDRASUMMITEU
THANK YOU

one-nio: RMI faster than java nio, with fast and compact automagic java serialization
shared-memory-cache: java off-heap cache using shared memory
github.com/odnoklassniki

Oleg [email protected]/oa @m0nstermind