Cassandra 2.1 (mostly)
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra
©2013 DataStax Confidential. Do not distribute without consent.
Five Years of Cassandra
[Timeline figure. Dates: Jul-08, Jun-09, Mar-10, Jan-11, Nov-11, Sep-12, Jul-13. Releases: 0.1, 0.3, 0.6, 0.7, 1.0, 1.2, ..., 2.0; DSE]
Core values
• Massively scalable
• High performance
• Reliable/Available
[Figure labels: Cassandra, HBase, Redis, MySQL]
CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date int
);

CREATE INDEX ON users(state);

SELECT * FROM users
WHERE state = 'Texas' AND birth_date > 1950;
New Core Value
• Massively scalable
• High performance
• Reliable/Available
• Productivity + ease of use
Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date int
);

CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

X [the relational version above is crossed out on the slide]

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date int,
  email_addresses set<text>
);

UPDATE users
SET email_addresses = email_addresses + {'[email protected]', '[email protected]'};
Cassandra 2.0
Race condition

SELECT name
FROM users
WHERE username = 'pmcfadin';

(0 rows)

SELECT name
FROM users
WHERE username = 'pmcfadin';

(0 rows)

INSERT INTO users
  (username, name, email,
   password, created_date)
VALUES ('pmcfadin',
        'Patrick McFadin',
        ['[email protected]'],
        'ba27e03fd9...',
        '2011-06-20 13:50:00');

INSERT INTO users
  (username, name, email,
   password, created_date)
VALUES ('pmcfadin',
        'Patrick McFadin',
        ['[email protected]'],
        'ea24e13ad9...',
        '2011-06-20 13:50:01');

This one wins
Lightweight transactions

INSERT INTO users
  (username, name, email,
   password, created_date)
VALUES ('pmcfadin',
        'Patrick McFadin',
        ['[email protected]'],
        'ba27e03fd9...',
        '2011-06-20 13:50:00')
IF NOT EXISTS;

 [applied]
-----------
      True

INSERT INTO users
  (username, name, email,
   password, created_date)
VALUES ('pmcfadin',
        'Patrick McFadin',
        ['[email protected]'],
        'ea24e13ad9...',
        '2011-06-20 13:50:01')
IF NOT EXISTS;

 [applied] | username | created_date   | name
-----------+----------+----------------+-----------------
     False | pmcfadin | 2011-06-20 ... | Patrick McFadin
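The difference between the two slides above can be sketched with a toy in-memory model. This is not Cassandra and not how lightweight transactions are implemented (the deck attributes those to Paxos); the `ToyUsers` class and its method names are hypothetical, and a lock stands in for whatever makes the check-and-write atomic.

```python
# Toy in-memory model (NOT Cassandra): contrasts check-then-insert with
# an atomic insert-if-absent, mirroring the two competing signups.
import threading


class ToyUsers:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def select(self, username):
        return self._rows.get(username)

    def insert(self, username, row):
        # Plain write: the last writer silently overwrites.
        self._rows[username] = row

    def insert_if_not_exists(self, username, row):
        # The existence check and the write happen as one atomic step.
        with self._lock:
            existing = self._rows.get(username)
            if existing is not None:
                return False, existing
            self._rows[username] = row
            return True, None


first = {"password": "ba27e03fd9...", "created_date": "2011-06-20 13:50:00"}
second = {"password": "ea24e13ad9...", "created_date": "2011-06-20 13:50:01"}

# Race: both clients read (0 rows) before either one writes.
racy = ToyUsers()
seen_by_a = racy.select("pmcfadin")
seen_by_b = racy.select("pmcfadin")
if seen_by_a is None:
    racy.insert("pmcfadin", first)
if seen_by_b is None:
    racy.insert("pmcfadin", second)
race_winner = racy.select("pmcfadin")

# Conditional insert: the second attempt is rejected and sees the first row.
safe = ToyUsers()
applied_a, _ = safe.insert_if_not_exists("pmcfadin", first)
applied_b, current = safe.insert_if_not_exists("pmcfadin", second)
```

In the racy run the later write replaces the earlier one without either client noticing; in the conditional run the second caller gets back `False` plus the existing row, matching the `[applied]` output on the slide.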
Atomic log appends with LWT

CREATE TABLE log (
  log_name text,
  seq int static,
  logged_at timeuuid,
  entry text,
  primary key (log_name, logged_at)
);

INSERT INTO log (log_name, seq)
VALUES ('foo', 0);

BEGIN BATCH

UPDATE log SET seq = 1
WHERE log_name = 'foo'
IF seq = 0;

INSERT INTO log (log_name, logged_at, entry)
VALUES ('foo', now(), 'test');

APPLY BATCH;
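The idea behind the batch above, as a minimal sketch: the append only goes through if the writer's expected sequence number still matches, and the sequence bump and the new entry are applied together. `ToyLog` and `append_if_seq` are hypothetical names for a single-process illustration, not a Cassandra API.

```python
# Toy model (NOT Cassandra): a log whose "seq" value guards appends,
# in the spirit of the `UPDATE ... IF seq = 0` batch on the slide.
class ToyLog:
    def __init__(self):
        self.seq = 0
        self.entries = []

    def append_if_seq(self, expected_seq, entry):
        # The condition and both mutations succeed or fail together.
        if self.seq != expected_seq:
            return False
        self.seq = expected_seq + 1
        self.entries.append(entry)
        return True


log = ToyLog()
ok_first = log.append_if_seq(0, "test")
ok_stale = log.append_if_seq(0, "lost race")  # stale expectation, rejected
ok_next = log.append_if_seq(1, "second")
```

A writer whose condition fails would typically re-read `seq` and retry with the new value.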
Details
• http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
• Paxos state is durable + quorum based
• Paxos made Simple
• Immediate consistency with no leader election or failover
• ConsistencyLevel.SERIAL
• 4 round trips vs 1 for normal updates
• http://www.slideshare.net/planetcassandra/c-summit-2013-eventual-consistency-hopeful-consistency-by-christos-kalantzis
Reads in a cluster
[Diagram sequence: client, coordinator, and nodes labeled 40% busy, 90% busy, and 30% busy]
A failure
[Diagram sequence: client, coordinator, and nodes labeled 40% busy, 90% busy, and 30% busy; a failure (X) leads to a timeout]
Rapid read protection
[Diagram sequence: client, coordinator, and nodes labeled 40% busy, 90% busy, and 30% busy; despite a failure (X), the read ends in success]
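The two diagram sequences above can be compared with a small deterministic model. It assumes rapid read protection means the coordinator sends the read to a second replica when the first has not answered within some threshold. The function name, parameters, and millisecond figures are all hypothetical, and real coordinators are considerably more involved.

```python
# Toy timing model (NOT Cassandra's implementation) of a coordinator that
# may speculatively retry a read against a second replica.
def read_latency(replica_latencies_ms, timeout_ms, speculate_after_ms=None):
    """Return the completion time in ms, or None if the read times out.

    replica_latencies_ms lists each replica's response time in the order
    the coordinator would try them; None means the replica never answers.
    """
    first = replica_latencies_ms[0]
    candidates = []
    if first is not None:
        candidates.append(first)
    if speculate_after_ms is not None and len(replica_latencies_ms) > 1:
        # Speculate only if the first replica is still silent at the threshold.
        if first is None or first > speculate_after_ms:
            backup = replica_latencies_ms[1]
            if backup is not None:
                candidates.append(speculate_after_ms + backup)
    if not candidates:
        return None
    best = min(candidates)
    return best if best <= timeout_ms else None


no_protection = read_latency([None, 5], timeout_ms=1000)
with_protection = read_latency([None, 5], timeout_ms=1000, speculate_after_ms=20)
healthy = read_latency([3, 5], timeout_ms=1000, speculate_after_ms=20)
slow_first = read_latency([200, 5], timeout_ms=1000, speculate_after_ms=20)
```

With a dead first replica the unprotected read times out, while the protected read completes shortly after the threshold; a healthy first replica never triggers the extra request.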
Rapid Read Protection
[Chart: latency (mid-compaction); series label: NONE]
Cold data
[Diagram labels: 10,000 req/s, 5,000 req/s, 4,000 req/s, 10 req/s]
Cold data compaction
[Diagram labels: 10 req/s, 10,000 req/s]
Cassandra 2.1
User defined types

CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  addresses map<text, address>
);

SELECT id, name, addresses.city, addresses.phones FROM users;

 id       | name    | addresses.city | addresses.phones
----------+---------+----------------+--------------------------
 63bf691f | jbellis | Austin         | {'512-4567', '512-9999'}
Collection indexing

CREATE TABLE songs (
  id uuid PRIMARY KEY,
  artist text,
  album text,
  title text,
  data blob,
  tags set<text>
);

CREATE INDEX song_tags_idx ON songs(tags);

SELECT * FROM songs WHERE tags CONTAINS 'blues';

 id       | album         | artist            | tags                  | title
----------+---------------+-------------------+-----------------------+------------------
 5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind
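Conceptually, a `CONTAINS` lookup on an indexed set behaves like an inverted index from element to row id. The sketch below shows only that idea; it says nothing about how Cassandra stores its indexes. The second song row is made up for contrast and is not from the slide.

```python
# Toy inverted index (NOT Cassandra's index format): maps each tag to the
# ids of the songs whose tag set contains it.
from collections import defaultdict

songs = {
    "5027b27e": {"title": "Worrying My Mind", "tags": {"acoustic", "blues"}},
    # Hypothetical extra row, added only so the lookup has something to exclude.
    "hypothetical-id": {"title": "Hypothetical Song", "tags": {"rock"}},
}

tag_index = defaultdict(set)
for song_id, song in songs.items():
    for tag in song["tags"]:
        tag_index[tag].add(song_id)


def songs_with_tag(tag):
    # Equivalent in spirit to: SELECT * FROM songs WHERE tags CONTAINS <tag>
    return sorted(tag_index.get(tag, set()))


blues_ids = songs_with_tag("blues")
missing_ids = songs_with_tag("jazz")
```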
(UDT indexing?)
Counters++
• simpler implementation, no more edge cases
• possible to properly repair now
• significantly less garbage and internode traffic generated
• better performance for 99% of uses
  • RF>1, replicate_on_write=true
• topology changes not leading to data loss (#4071)
• commitlog now 100% safe to replay (#4417)
• Internal format overhaul still coming in 3.0 (#6506)
What hasn’t changed
• same API
• same average throughput
• same restrictions on mixing counter and non-counter columns
• same restrictions on mixing counter and non-counter updates
• same restrictions on counter deletes
• same retry limitations
Writes (low contention)
Writes (high contention)
Data directories (2.0)
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-CompressionInfo.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-Data.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-Filter.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-Index.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-Statistics.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-Summary.db
/var/lib/cassandra/data/foo/bar/foo-bar-jb-1-TOC.txt
Data directories (2.1)
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-CompressionInfo.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-Data.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-Filter.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-Index.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-Statistics.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-Summary.db
/var/lib/cassandra/flush/foo/bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4/foo-bar-ka-1-TOC.txt
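To make the naming pattern in these listings explicit, here is a small parser for the file and directory names exactly as they appear on the slides. It assumes names split cleanly on hyphens, which holds for `foo` and `bar` but would break if a keyspace or table name itself contained a hyphen. The field names and both helper functions are hypothetical labels for the slide's examples, not anything shipped with Cassandra.

```python
# Splits the example names shown on the slides into their visible parts.
# Assumption: keyspace and table names contain no hyphens.
def parse_sstable_name(filename):
    keyspace, table, version, generation, component = filename.split("-", 4)
    return {
        "keyspace": keyspace,
        "table": table,
        "version": version,            # "jb" on the 2.0 slide, "ka" on the 2.1 slide
        "generation": int(generation),
        "component": component,        # e.g. "Data.db", "TOC.txt"
    }


def parse_table_dir(dirname):
    # 2.1-style directory from the slide: table name, hyphen, hex suffix.
    table, _, suffix = dirname.rpartition("-")
    return table, suffix


old_style = parse_sstable_name("foo-bar-jb-1-Data.db")
new_style = parse_sstable_name("foo-bar-ka-1-CompressionInfo.db")
table_name, dir_suffix = parse_table_dir("bar-2fbb89709a6911e3b7dc4d7d4e3ca4b4")
```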
Inefficient bloom filters
[Diagram sequence: + = ?]
HyperLogLog applied
HLL and compaction
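The slides here survive only as titles, so the following is a general illustration of HyperLogLog, not Cassandra's code. It assumes the problem is estimating how many distinct keys result from combining two overlapping key sets (the "+ = ?" of the earlier slides). `ToyHLL`, the register count, and the key names are all hypothetical.

```python
# Minimal HyperLogLog sketch (illustrative only): estimates the number of
# distinct items, and two sketches can be merged without re-reading the data.
import hashlib
import math


class ToyHLL:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two overlapping key sets: 20,000 keys each, 10,000 shared, 30,000 distinct.
hll_a, hll_b, direct = ToyHLL(), ToyHLL(), ToyHLL()
for i in range(20000):
    hll_a.add(f"key-{i}")
for i in range(10000, 30000):
    hll_b.add(f"key-{i}")
for i in range(30000):
    direct.add(f"key-{i}")

merged = hll_a.merge(hll_b)
union_estimate = merged.estimate()
```

Merging takes the register-wise maximum, so the merged sketch is identical to one built from the union directly. Simply adding the two input counts (40,000) would overstate the result.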
More-efficient repair
Implications for LCS (and STCS)
The new query cache
The new row cache

CREATE TABLE notifications (
  target_user text,
  notification_id timeuuid,
  source_id uuid,
  source_type text,
  activity text,
  PRIMARY KEY (target_user, notification_id)
)
WITH CLUSTERING ORDER BY (notification_id DESC)
 AND caching = 'rows_only'
 AND rows_per_partition_to_cache = '3';
 target_user | notification_id | source_id | source_type | activity
-------------+-----------------+-----------+-------------+--------------------------
 nick        | e1bd2bcb-       | d972b679- | photo       | jbellis liked
 nick        | 321998c-        | d972b679- | photo       | rbranson commented
 nick        | ea1c5d35-       | 88a049d5- | user        | mbulman created account
 nick        | 5321998c-       | 64613f27- | photo       | jbellis commented
 nick        | 07581439-       | 076eab7e- | user        | thobbs created account
 rbranson    | 1c34467a-       | f04e309f- | user        | jbellis created account
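A rough illustration of what `rows_per_partition_to_cache = '3'` would mean for the sample rows. It assumes the cache keeps only the first N rows of each partition in clustering order, and that the order the rows are listed in on the slide is that order. Both are assumptions read off the slide, and `rows_to_cache` is a hypothetical helper, not a Cassandra function.

```python
# Toy selection of which rows a "first N rows per partition" cache would hold.
from collections import defaultdict


def rows_to_cache(rows, rows_per_partition):
    # rows: (partition_key, row) pairs, already in clustering order.
    cached = defaultdict(list)
    for partition_key, row in rows:
        if len(cached[partition_key]) < rows_per_partition:
            cached[partition_key].append(row)
    return dict(cached)


notifications = [
    ("nick", "jbellis liked"),
    ("nick", "rbranson commented"),
    ("nick", "mbulman created account"),
    ("nick", "jbellis commented"),
    ("nick", "thobbs created account"),
    ("rbranson", "jbellis created account"),
]

cached = rows_to_cache(notifications, rows_per_partition=3)
```

Under these assumptions, only the first three of nick's five notifications are held, while rbranson's single row is cached in full.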
Read performance
Reads post-compaction
Questions?