Being Closer to Cassandra
Oleg Anastasyev, lead platform developer, Odnoklassniki.ru
Jan 15, 2015
#CASSANDRAEU
Top 10 of world's social networks*
- 40M DAU, 80M MAU, 7M peak
- ~300 000 www req/sec, 20 ms render latency
- >240 Gbit out
- >5 800 iron servers in 5 DCs
- 99.9% Java
* Odnoklassniki means "classmates" in English
Cassandra @ Odnoklassniki
* Since 2010
- branched 0.6
- aiming at: full operation on DC failure, scalability, ease of operations
* Now
- 23 clusters, 418 nodes in total, 240 TB of stored data
- survived several DC failures
Case #1. The fast
Like! widget
* It's everywhere
- on every page, dozens per page
- on feeds (AKA timeline)
- on 3rd party websites elsewhere on the internet
* It's on everything
- pictures and albums
- videos
- posts and comments
- 3rd party shared URLs
Like! widget
* High load
- 1 000 000 reads/sec, 3 000 writes/sec
* Hard load profile
- read-mostly
- long tail (40% of reads are random)
- sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities
Classic solution: SQL table

RefId:long    RefType:byte   UserId:long    Created
9999999999    PICTURE(2)     11111111111    11:00

To render "You and 4256" on a page with N widgets:
SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?    = N >= 1   (98% are NONE)
SELECT COUNT(*) WHERE RefId,RefType=?,?          = M > N    (80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)     = N*140
Cassandra solution

LikeByRef (
  refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId), UserId )
)

LikeCount (
  refType byte, refId bigint, likers counter,
  PRIMARY KEY ( (RefType, RefId) )
)

so, to render:
SELECT FROM LikeCount WHERE RefId,RefType=?,?              (80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?   (98% are NONE)   = N*20%
* Quick workaround?

LikeByRef (
  refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId, UserId) )
)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)
- >11M iops
- forces order-preserving partitioner (random does not scale here)
- key range scans
- more network overhead
- partition count >10x, dataset size >2x
By-column bloom filter
* What it does
- includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db
* The good
- eliminated 98% of reads
- fewer false positives
* The bad
- they become too large
- GC promotion failures... but fixable (CASSANDRA-2466)
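The by-column filter idea can be sketched in plain Java. This is a toy filter, not Cassandra's actual Filter.db code; the class and method names are illustrative. The key point is that membership is tested on the (PartKey, ColumnKey) *pair*, so a point read like "does user U like entity E?" can skip SSTables that provably do not contain that exact column.

```java
import java.util.BitSet;

// Toy by-column bloom filter: bits are set per (partitionKey, columnKey)
// pair, so EXISTS()-style point reads can be answered "definitely not
// here" without touching the SSTable at all.
public class ColumnBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public ColumnBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from the combined (partKey, columnKey) pair.
    private int position(String partKey, String columnKey, int i) {
        long h = (partKey + "\u0000" + columnKey).hashCode();
        h = h * 31 + i * 0x9E3779B9L;  // mix in the hash index
        return (int) Math.floorMod(h, (long) size);
    }

    public void add(String partKey, String columnKey) {
        for (int i = 0; i < hashes; i++) bits.set(position(partKey, columnKey, i));
    }

    // false => definitely absent (the 98% fast path); true => must read SSTable
    public boolean mightContain(String partKey, String columnKey) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(partKey, columnKey, i))) return false;
        return true;
    }
}
```

The trade-off on the slide follows directly: one entry per column rather than per partition makes the filter much bigger, which is where the large-allocation GC trouble came from.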
Are we there yet?
- min 2 roundtrips per render (COUNT + RR)
- Thrift is slow, especially with lots of connections
- EXISTS() alone is 200 Gbit/sec of traffic (140*8*1Mps*20%)

[diagram: >400 application servers querying the cassandra ring: 1. COUNT(), 2. EXISTS()]
Co-locate!
- one-nio remoting (faster than java nio)
- topology-aware clients

[diagram: Remote Business Intf, get() : LikeSummary -> odnoklassniki-like node (Counters Cache, Social Graph Cache) co-located with cassandra]
Co-location wins
* Fast TOP N friend likers query
1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read some until N friends are found
* Custom caches
- tuned for the application
* Custom data merge logic
- ... so you can detect and resolve conflicts
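The three steps of the friend-likers query can be sketched like this. The graph cache, bloom filter, and storage read are modeled as plain collections and predicates; the names (`topFriendLikers`, `storageHasLike`, etc.) are illustrative, not Odnoklassniki's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the "fast TOP N friend likers" query: walk the friend list,
// use the in-memory bloom filter to skip friends who definitely did not
// like the entity, and stop as soon as N likers are found.
public class FriendLikers {
    public static List<Long> topFriendLikers(
            List<Long> friends,             // step 1: from the social graph cache
            Predicate<Long> bloomFilter,    // step 2: in-memory by-column bloom filter
            Predicate<Long> storageHasLike, // step 3: authoritative (disk) read
            int n) {
        List<Long> result = new ArrayList<>();
        for (long friend : friends) {
            if (result.size() >= n) break;           // found N: stop reading
            if (!bloomFilter.test(friend)) continue; // definitely no like: skip the read
            if (storageHasLike.test(friend)) result.add(friend);
        }
        return result;
    }
}
```

Because the filter eliminates the ~98% negative lookups in memory, the expensive step 3 runs only a handful of times per widget.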
Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key, ColumnFamily data);
}

* Register it between commit log replay and gossip
* Hooked into RowMutation.apply(), so it sees the original mutation
  + replica writes, hints, ReadRepairs

// and register with CFS
store = Table.open(..)
        .getColumnFamilyStore(..);
store.setListener(myListener);
Like! optimized counters

LikeCount (
  refType byte, refId bigint, ip inet, counter int,
  PRIMARY KEY ( (RefType, RefId), ip )
)

* Counters cache
- off-heap (sun.misc.Unsafe)
- compact (30M in 1 GB of RAM)
- reads cached on the local node only
* Replicated cache state
- solves the cold replica cache problem
- by making (NOP) mutations: fewer reads, long-tail aware
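The off-heap idea can be sketched as follows. The talk's cache uses `sun.misc.Unsafe`; here a direct `ByteBuffer` stands in as the off-heap slab, and the trivial open-addressing scheme (one slot per key hash, no collision handling) is purely illustrative. Either way the counters live outside the Java heap, so tens of millions of them add no GC pressure.

```java
import java.nio.ByteBuffer;

// Sketch of an off-heap counter cache: a fixed slab of native memory
// holding one 4-byte int counter per slot. A real cache would handle
// hash collisions and eviction; this only shows the off-heap layout.
public class OffHeapCounters {
    private final ByteBuffer slab; // native (off-heap) memory, zero-initialized
    private final int capacity;

    public OffHeapCounters(int capacity) {
        this.capacity = capacity;
        this.slab = ByteBuffer.allocateDirect(capacity * 4); // 4 bytes per counter
    }

    // Byte offset of the key's slot (illustrative mixing hash, no collision handling).
    private int slot(long key) {
        return (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) capacity) * 4;
    }

    public void increment(long key, int delta) {
        int off = slot(key);
        slab.putInt(off, slab.getInt(off) + delta);
    }

    public int get(long key) {
        return slab.getInt(slot(key));
    }
}
```

At 4 bytes per counter plus keys, packing ~30M counters into 1 GB as the slide claims is plausible only with a layout this compact, which is the point of going off-heap and hand-rolled.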
Read latency variations
* C* read behavior
1. Choose 1 node for data and N for digests
2. Wait for data and digests
3. Compare and return (or read-repair)
* Nodes suddenly slow down
- SEDA hiccup, commit log rotation, sudden IO saturation, network hiccup or partition, page cache miss
* The bad
- you have latency spikes
- you have to wait (and time out)
Read latency leveling
* "Parallel" read handler
1. Ask all replicas for data in parallel
2. Wait for CL responses and return
* The good
- minimal latency response
- constant load when a DC fails
* The (not so) bad
- "additional" work and traffic
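The parallel handler's waiting logic can be sketched with `java.util.concurrent` primitives standing in for Cassandra's messaging: fire the read at every replica at once, then return after the first CL responses, so one slow replica cannot stall the request. `Supplier` models a replica call; names are illustrative.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Sketch of a "parallel" read: submit the read to all replicas, take the
// first CL (consistency level) answers, ignore stragglers.
public class ParallelRead {
    public static <T> T read(List<Supplier<T>> replicas, int cl, ExecutorService pool)
            throws InterruptedException {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        for (Supplier<T> replica : replicas) cs.submit(replica::get);
        T last = null;
        for (int i = 0; i < cl; i++) {
            try {
                last = cs.take().get(); // blocks only until the NEXT replica answers
            } catch (ExecutionException e) {
                throw new RuntimeException("replica failed", e);
            }
        }
        return last; // newest-timestamp merge of the CL answers omitted for brevity
    }
}
```

This is exactly the "additional work" trade-off on the slide: all replicas do a full data read on every request, but tail latency is bounded by the CL-th fastest replica rather than the slowest digest.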
More tiny tricks
* On SSD IO
- deadline IO elevator
- 64k -> 4k read request size
* HintLog
- commit log for hints
- wait for all hints on startup
* Selective compaction
- compacts most-read CFs more often
Case #2. The fat
* Messages in chats
- last page is accessed on open
- long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- read-mostly (120k reads/sec, 8k writes/sec)
Messages have structure

Message (chatId, msgId,
  created, type, userIndex, deletedBy, ... text)

MessageCF (chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- all of a chat's messages in a single partition
- single blob for message data, to reduce overhead
* The bad
- conflicting modifications can happen (users, anti-spam, etc.)
LW conflict resolution

Messages (chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

Two writers read the same (version:ts1, data:d1) and update concurrently:
writer A: write(ts1, data2) -> delete(version:ts1); insert(version:ts2=now(), data2)
writer B: write(ts1, data3) -> delete(version:ts1); insert(version:ts3=now(), data3)
Both (ts2, data2) and (ts3, data3) survive - merged on read
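Merge-on-read for the versioned schema above can be sketched like this. The resolver shown is plain last-write-wins by version, the simplest possible policy; the talk's point is that because all surviving versions reach the reader, an application-specific merge (e.g. preferring anti-spam edits) can be plugged in instead. Class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of merge-on-read: after a concurrent delete(ts1)+insert, several
// (version, data) cells remain for one msgId; the reader resolves them.
public class VersionedMerge {
    public record Cell(long version, String data) {}

    // Last-write-wins placeholder for an application-specific merge.
    public static Cell mergeOnRead(List<Cell> versions) {
        return versions.stream()
                .max(Comparator.comparingLong(Cell::version))
                .orElseThrow();
    }
}
```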
Specialized cache
* Again. Because we can
- off-heap (Unsafe)
- caches only the freshest chat page
- saves its state to a local (AKA system) CF: keys AND values, sequential read, much faster startup
- in-memory compression: 2x more memory almost for free
Disk mgmt
* 4U, 24x HDD, up to 4 TB/node
- size-tiered compaction = one 4 TB sstable file
- RAID10? LCS?
* Split each CF into 256 pieces
* The good
- smaller, more frequent memtable flushes
- same compaction work, in smaller sets
- can distribute across disks
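The routing behind the 256-piece split can be sketched in a few lines: each partition key is hashed to one of 256 physical pieces, and each piece has its own memtable, sstables, and on-disk directory. The hash and naming scheme here are illustrative assumptions, not the actual implementation.

```java
// Sketch of splitting one logical CF into 256 physical pieces. Each piece
// flushes and compacts independently on ~1/256 of the data, and pieces can
// be assigned to different disks.
public class PiecedColumnFamily {
    public static final int PIECES = 256;

    // Route a partition key to a piece (illustrative mixing hash, top byte).
    public static int pieceOf(long partitionKey) {
        long h = partitionKey * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 56); // top 8 bits -> piece 0..255
    }

    // e.g. which physical CF (and thus disk directory) a key lives in
    public static String pieceName(String cf, long partitionKey) {
        return cf + "_" + pieceOf(partitionKey);
    }
}
```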
Disk allocation policies
* Default is
- "take the disk with most free space"
* So some disks get
- too many read iops
* Generational policy
- each disk holds the same # of same-generation files
- works better for HDD
Case #3. The ugly (feed my Frankenstein)
* Chats overview
- small dataset (230 GB)
- has a hot set, short tail (5%)
- list reorders often
- 130k reads/s, 21k writes/s
Conflicting updates
* List<Overview> is a single blob
- ... or you'll have a lot of tombstones
* Lots of conflicts
- updates of a single column
* Need conflict detection
* Have a merge algorithm
Vector clocks
* Voldemort
- byte[] key -> byte[] value + VC
- coordination logic on clients
- pluggable storage engines
* Plugged in
- C* 0.6 SSTables persistence
- fronted by a specialized cache (we love caches)
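A minimal vector-clock sketch shows the conflict detection the slide relies on: one counter per writing node, and when neither clock dominates the other the two values were written concurrently and must go to the merge algorithm. This is a generic textbook implementation, not Voldemort's actual class.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: per-node counters attached to each stored value.
public class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    // Record a write made by the given node.
    public void tick(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    /** true if every entry of this clock is >= the other's (this happened after). */
    public boolean dominates(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        return true;
    }

    /** Neither clock dominates: a genuine concurrent conflict to merge. */
    public boolean concurrentWith(VectorClock other) {
        return !dominates(other) && !other.dominates(this);
    }
}
```

Compared with the timestamp-only versions of case #2, this detects conflicts exactly (no clock-skew ambiguity), at the cost of carrying a clock per value.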
Performance
* 3 node cluster, RF = 3
- Intel Xeon CPU E5506 2.13GHz, RAM: 48 GB, 1x HDD, 1x SSD
* 8 byte key -> 1 KB value
* Results
- 75k/sec reads, 15k/sec writes
Why Cassandra?
* Reusable distributed DB components
- fast persistence, gossip, reliable async messaging, failure detectors, topology, seq scans, ...
* Has structure
- beyond byte[] key -> byte[] value
* Delivered its promises
* Implemented in Java
#CASSANDRAEU #CASSANDRASUMMITEU
THANK YOU

one-nio: RMI faster than java nio, with fast and compact automagic java serialization
shared-memory-cache: java off-heap cache using shared memory
github.com/odnoklassniki

Oleg [email protected]/oa @m0nstermind