Introduction to Apache Cassandra (for Java developers!) Nate McCall [email protected] @zznate


May 11, 2015


zznate

A presentation for Data Day Austin on January 29th, 2011

Introduces how to effectively use Apache Cassandra for Java developers using the Hector Java client: http://github.com/rantav/hector
Transcript
Page 1: Introduction to apache_cassandra_for_develope

Introduction to Apache Cassandra

(for Java developers!)

Nate McCall [email protected] @zznate

Page 2: Introduction to apache_cassandra_for_develope

Brief Intro 

• NOT a "key/value store"
• Columns are dynamic inside a column family
• SSTables are immutable
• SSTables merged on reads
• All nodes share the same role (i.e. no single point of failure)

Trading ACID compliance for scalability is a fundamental design decision

Page 3: Introduction to apache_cassandra_for_develope

How does this impact development?

Substantially. 

For operations affecting the same data, that data will become consistent eventually as determined by the timestamps. 
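The timestamp rule can be pictured with a small in-memory model. This is a minimal sketch, not Cassandra or Hector code: the `Column` class and the `reconcile` method are hypothetical names, and the handling of equal timestamps is an assumption made to keep the example short.

```java
// Hypothetical in-memory model of timestamp-based reconciliation.
final class TimestampReconcile {

    // One version of a column as held by a replica: a value plus a client-supplied timestamp.
    static final class Column {
        final String value;
        final long timestamp;

        Column(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Return the version with the higher timestamp.
    // Assumption: on a tie this sketch keeps 'a'; a real cluster applies its own tie-break rule.
    static Column reconcile(Column a, Column b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        Column fromReplica1 = new Column("Austin", 1000L);
        Column fromReplica2 = new Column("Round Rock", 2000L);
        // The order in which the replicas answer does not change the winner.
        System.out.println(reconcile(fromReplica1, fromReplica2).value);
        System.out.println(reconcile(fromReplica2, fromReplica1).value);
    }
}
```

Because the winner depends only on the timestamps, replicas that have seen the same set of writes settle on the same value regardless of arrival order.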

But you can trade availability for consistency. (More on this later)

You can store whatever you want. It's all just bytes.
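"Just bytes" means the client decides how values are encoded and decoded. The sketch below uses only the JDK to show the round trip for a string and a long; the class and method names are hypothetical. Hector ships serializer classes that play this role, which this sketch does not try to reproduce.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers: turn typed values into bytes and back again.
final class JustBytes {

    static ByteBuffer fromString(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    static String asString(ByteBuffer bytes) {
        // duplicate() so that decoding does not move the caller's buffer position
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }

    static ByteBuffer fromLong(long value) {
        ByteBuffer bytes = ByteBuffer.allocate(Long.BYTES);
        bytes.putLong(value);
        bytes.flip(); // make the buffer readable from the start
        return bytes;
    }

    static long asLong(ByteBuffer bytes) {
        return bytes.duplicate().getLong();
    }

    public static void main(String[] args) {
        System.out.println(asString(fromString("Austin")));
        System.out.println(asLong(fromLong(512202L)));
    }
}
```

The store never checks that the reader decodes with the same type the writer used, so that agreement has to live in the application.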

You need to think about how you will query the data before you write it.

Page 4: Introduction to apache_cassandra_for_develope

Neat. So Now What?

Like any database, you need a client!

• Python:
o Telephus: http://github.com/driftx/Telephus (Twisted)
o Pycassa: http://github.com/pycassa/pycassa

• Java:
o Hector: http://github.com/rantav/hector (Examples https://github.com/zznate/hector-examples )
o Pelops: http://github.com/s7/scale7-pelops
o Kundera: http://code.google.com/p/kundera/
o Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin

• Grails:
o grails-cassandra: https://github.com/wolpert/grails-cassandra

• .NET:
o FluentCassandra: http://github.com/managedfusion/fluentcassandra
o Aquiles: http://aquiles.codeplex.com/

• Ruby:
o Cassandra: http://github.com/fauna/cassandra

• PHP:
o phpcassa: http://github.com/thobbs/phpcassa
o SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

Page 5: Introduction to apache_cassandra_for_develope

... but do not roll your own

Page 6: Introduction to apache_cassandra_for_develope

Thrift

• Fast, efficient serialization and network IO.
• Lots of clients available (you can probably use it in other places as well)

Why you don't want to work with the Thrift API directly:
• SuperColumn
• ColumnOrSuperColumn
• ColumnParent.super_column
• ColumnPath.super_column
• Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap
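To get a feel for the nested mutation map, here is a minimal sketch of building one by hand. The `Mutation` class below is a much-simplified, hypothetical stand-in for the Thrift type (the real one wraps further either/or structures), and the `add` helper is an invented name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MutationMapSketch {

    // Hypothetical stand-in for Thrift's Mutation; a null value models a deletion.
    static final class Mutation {
        final String columnName;
        final String value;

        Mutation(String columnName, String value) {
            this.columnName = columnName;
            this.value = value;
        }
    }

    // Layout: row key -> column family name -> list of mutations.
    static void add(Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap,
                    String rowKey, String columnFamily, Mutation mutation) {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(mutation);
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        add(mutationMap, "512202", "Npanxx", new Mutation("city", "Austin"));
        add(mutationMap, "512202", "Npanxx", new Mutation("state", "TX"));
        add(mutationMap, "512", "AreaCode", new Mutation("staleColumn", null));
        System.out.println(mutationMap.size() + " row keys in the batch");
    }
}
```

Every caller has to repeat this bookkeeping, which is the kind of work a higher-level client is meant to hide.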

Page 7: Introduction to apache_cassandra_for_develope

Higher Level Client

Hector
• JMX Counters
• Add/remove hosts:
o automatically
o programmatically
o via JMX
• Pluggable load balancing
• Complete encapsulation of Thrift API
• Type-safe approach to dealing with Apache Cassandra
• Lightweight ORM (supports JPA 1.0 annotations)
• Mavenized! http://repo2.maven.org/maven2/me/prettyprint/

Page 8: Introduction to apache_cassandra_for_develope

"CQL"

• Currently in Apache Cassandra trunk
• Experimental
• Lots of possibilities

from test/system/test_cql.py:

UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"

SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"

DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"

Page 9: Introduction to apache_cassandra_for_develope

Avro??

Gone. Added too much complexity after Thrift caught up.  

"None of the libraries distinguished themselves as being a particularly crappy choice for serialization." 

(See CASSANDRA-1765)

Page 10: Introduction to apache_cassandra_for_develope

Thrift API Methods

Retrieving

Writing/Removing

Meta Information

Schema Manipulation

Page 11: Introduction to apache_cassandra_for_develope

Thrift API Methods - Retrieving

get: retrieve a single column for a key

get_slice: retrieve a "slice" of columns for a key

multiget_slice: retrieve a "slice" of columns for a list of keys

get_count: counts the columns of a key (you have to deserialize the row to do it)

get_range_slices: retrieve a slice for a range of keys

get_indexed_slices (FTW!)

Page 12: Introduction to apache_cassandra_for_develope

Thrift API Methods - Writing/Removing

insert

batch_mutate (batch insertion AND deletion)

remove

truncate**

Page 13: Introduction to apache_cassandra_for_develope

Thrift API Methods - Meta Information

describe_cluster_name

describe_version

describe_keyspace

describe_keyspaces

Page 14: Introduction to apache_cassandra_for_develope

Thrift API Methods - Schema

system_add_keyspace

system_update_keyspace

system_drop_keyspace

system_add_column_family

system_update_column_family

system_drop_column_family

Page 15: Introduction to apache_cassandra_for_develope

vs. RDBMS - Consistency Level

Consistency is tunable per request!

Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
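The R + W > N rule is plain arithmetic, so it is easy to check combinations before choosing consistency levels. A minimal sketch with hypothetical class and method names; `quorum` uses the usual definition of a majority of replicas, N / 2 + 1 with integer division.

```java
final class ConsistencyMath {

    // A quorum is a majority of the replicas.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read set must overlap every write set in at least one replica.
    static boolean readOverlapsWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("QUORUM read + QUORUM write: " + readOverlapsWrite(q, q, n));
        System.out.println("ONE read + ONE write:       " + readOverlapsWrite(1, 1, n));
        System.out.println("ONE read + ALL write:       " + readOverlapsWrite(1, n, n));
    }
}
```

With N = 3, reading and writing at quorum (2 + 2 > 3) satisfies the rule, while reading and writing a single replica (1 + 1) does not.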

*** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***

Idempotent: an operation can be applied multiple times without changing the result
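Since a failed request is not rolled back, the usual response is to retry it, and that is only safe when the write is idempotent. The sketch below models a single row in memory to show why resending the same timestamped write is harmless; the `Row` class is hypothetical and is not how Cassandra stores data.

```java
import java.util.HashMap;
import java.util.Map;

final class IdempotentWrite {

    // Hypothetical single-row model: column name -> value, plus the timestamp of each value.
    static final class Row {
        final Map<String, String> values = new HashMap<>();
        final Map<String, Long> timestamps = new HashMap<>();

        // Apply a write only if it is at least as new as what is already stored.
        void write(String column, String value, long timestamp) {
            Long current = timestamps.get(column);
            if (current == null || timestamp >= current) {
                values.put(column, value);
                timestamps.put(column, timestamp);
            }
        }
    }

    public static void main(String[] args) {
        Row row = new Row();
        row.write("city", "Austin", 1000L);
        row.write("city", "Austin", 1000L);     // retry of the same write: no change
        row.write("city", "Round Rock", 2000L); // a newer write wins
        row.write("city", "Austin", 1000L);     // late retry of the old write: ignored
        System.out.println(row.values.get("city"));
    }
}
```

The retry must reuse the original timestamp for this to hold; a retry stamped with a fresh timestamp is a new write.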

Page 16: Introduction to apache_cassandra_for_develope

vs. RDBMS - Append Only

Proper data modelling will minimize seeks (Go to Tyler's presentation for more!)

Page 17: Introduction to apache_cassandra_for_develope

On to the Code...

https://github.com/zznate/cassandra-tutorial

Uses Maven. 

Really basic. 

Modify/abuse/alter as needed. 

Descriptions of what is going on and how to run each example are in the Javadoc comments. 

Sample data is based on the North American Numbering Plan: http://en.wikipedia.org/wiki/North_American_Numbering_Plan

Page 18: Introduction to apache_cassandra_for_develope

Data Shape

512 202 30.27 097.74 W TX Austin
512 203 30.27 097.74 L TX Austin
512 204 30.32 097.73 W TX Austin
512 205 30.32 097.73 W TX Austin
512 206 30.32 097.73 L TX Austin

Page 19: Introduction to apache_cassandra_for_develope

Get a Single Column for a Key

GetCityForNpanxx.java

Retrieve a single column with:
• Name
• Value
• Timestamp
• TTL

Page 20: Introduction to apache_cassandra_for_develope

Get the Contents of a Row

GetSliceForNpanxx.java

Retrieves a list of columns (Hector wraps these in a ColumnSlice)

"SlicePredicate" can be either an explicit set of columns OR a range (more on ranges soon)

Another messy either/or choice encapsulated by Hector
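The two forms of a slice can be modelled with a sorted map standing in for a row. This is a sketch of the idea only, not the Hector or Thrift API: the method names and the column names used in `main` are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SliceSketch {

    // Form 1: an explicit set of column names.
    static Map<String, String> sliceByNames(NavigableMap<String, String> row, List<String> names) {
        Map<String, String> slice = new LinkedHashMap<>();
        for (String name : names) {
            if (row.containsKey(name)) {
                slice.put(name, row.get(name));
            }
        }
        return slice;
    }

    // Form 2: a range of column names (both ends inclusive) with a maximum count.
    static Map<String, String> sliceByRange(NavigableMap<String, String> row,
                                            String start, String finish, int count) {
        Map<String, String> slice = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : row.subMap(start, true, finish, true).entrySet()) {
            if (slice.size() >= count) {
                break;
            }
            slice.put(column.getKey(), column.getValue());
        }
        return slice;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("city", "Austin");
        row.put("lat", "30.27");
        row.put("lng", "097.74");
        row.put("state", "TX");
        System.out.println(sliceByNames(row, Arrays.asList("city", "state")));
        System.out.println(sliceByRange(row, "lat", "lng", 10));
    }
}
```

A caller uses one form or the other for a given query, never both, which is the either/or choice mentioned above.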

Page 21: Introduction to apache_cassandra_for_develope

Get the (sorted!) Columns of a Row 

GetSliceForStateCity.java

Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)

Can be easily modified to return results in reverse order (but this is slightly slower)
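Why the comparator matters is easiest to see with numeric-looking column names. The sketch below sorts the same names as text and as longs, loosely mirroring a text comparator versus a long comparator; it is a plain JDK illustration with hypothetical method names, not Cassandra's comparator classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class ComparatorSketch {

    // Order column names the way a text comparator would.
    static List<String> sortAsText(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    // Order the same names the way a numeric comparator would.
    static List<String> sortAsLongs(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return sorted;
    }

    // A reversed slice is the same order walked backwards.
    static List<String> reversed(List<String> sorted) {
        List<String> copy = new ArrayList<>(sorted);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("9", "10", "100");
        System.out.println("as text:  " + sortAsText(names));
        System.out.println("as longs: " + sortAsLongs(names));
        System.out.println("reversed: " + reversed(sortAsLongs(names)));
    }
}
```

A range slice from "9" to "10" only makes sense under the numeric ordering; under the text ordering "9" sorts after "10".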

Page 22: Introduction to apache_cassandra_for_develope

Get the Same Slice from Several Rows

MultigetSliceForNpanxx.java

Very similar to get_slice examples, except we provide a list of keys

Page 23: Introduction to apache_cassandra_for_develope

Get Slices From a Range of Rows

GetRangeSlicesForStateCity.java

Like multiget_slice, except we can specify a KeyRange (encapsulated by RangeSlicesQuery#setKeys(start, end))

The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
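The partitioner decides the order in which rows are walked during a range query. The sketch below contrasts key order with the order produced by hashing each key, assuming a hash-based partitioner that derives its token from an MD5 digest of the key. The token derivation here is an approximation for illustration, and the keys are made up.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class PartitionerOrderSketch {

    // Assumption: token = absolute value of the MD5 digest read as a big integer.
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required by the Java platform", e);
        }
    }

    // Row order under an order-preserving scheme: sorted by the key itself.
    static List<String> keyOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Row order under a hash-based scheme: sorted by token, unrelated to the key text.
    static List<String> tokenOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing((String k) -> md5Token(k)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("TX Austin", "TX Dallas", "TX Houston", "CA Fresno");
        System.out.println("key order:   " + keyOrder(keys));
        System.out.println("token order: " + tokenOrder(keys));
    }
}
```

Under key order, a range from "TX A" to "TX Z" is a contiguous run of Texas rows; under token order the same start and end keys bound an arbitrary-looking set.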

Page 24: Introduction to apache_cassandra_for_develope

Get Slices From a Range of Rows - 2

GetSliceForAreaCodeCity.java

Compound column name for controlling ranges

Comparator at work on text field
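A compound column name lets a text comparator do prefix matching. The sketch below assumes a hypothetical layout, one row per area code with column names of the form `City__NXX`; the separator and the layout are illustrative and may differ from what the tutorial code uses.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

final class CompoundNameSketch {

    static final String SEPARATOR = "__"; // hypothetical separator

    static String columnName(String city, String nxx) {
        return city + SEPARATOR + nxx;
    }

    // Every column whose name starts with "<city>__", found with a single range slice.
    static NavigableMap<String, String> sliceForCity(NavigableMap<String, String> row, String city) {
        String start = city + SEPARATOR;
        String finish = start + Character.MAX_VALUE; // sorts after any real suffix
        return row.subMap(start, true, finish, true);
    }

    public static void main(String[] args) {
        // Columns of the hypothetical row for area code 512; values are lat,lng pairs.
        NavigableMap<String, String> row = new TreeMap<>();
        row.put(columnName("Austin", "202"), "30.27,097.74");
        row.put(columnName("Austin", "204"), "30.32,097.73");
        row.put(columnName("Pflugerville", "251"), "30.44,097.62");
        System.out.println(sliceForCity(row, "Austin").keySet());
    }
}
```

The range works because the comparator sorts all names sharing a prefix next to each other, so the start and finish values bracket exactly that run of columns.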

Page 25: Introduction to apache_cassandra_for_develope

Get Slices from Indexed Columns

GetIndexedSlicesForCityState.java

You only need to index a single column to apply clauses on other columns (BUT the indexed column must be present with an EQUALS clause!)

(It's just another ColumnFamily maintained automatically)
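The shape of an indexed query can be sketched in memory: the equality clause on the indexed column narrows the candidate rows, and any further clauses are applied as a filter over those candidates. Everything below is a hypothetical model (class, method, and column names included), not the Hector API, and it skips re-indexing when a row is overwritten.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

final class IndexedSliceSketch {

    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    // The "index": value of the state column -> keys of the rows holding that value.
    private final Map<String, Set<String>> stateIndex = new HashMap<>();

    static Map<String, String> columns(String state, String city) {
        Map<String, String> columns = new HashMap<>();
        columns.put("state", state);
        columns.put("city", city);
        return columns;
    }

    // Writing a row also maintains the index, so callers never touch it directly.
    void put(String rowKey, Map<String, String> columns) {
        rows.put(rowKey, columns);
        String state = columns.get("state");
        if (state != null) {
            stateIndex.computeIfAbsent(state, s -> new TreeSet<>()).add(rowKey);
        }
    }

    // The equality clause on the indexed column is mandatory; other clauses only filter.
    List<String> query(String stateEquals, Predicate<Map<String, String>> otherClauses) {
        List<String> matches = new ArrayList<>();
        for (String rowKey : stateIndex.getOrDefault(stateEquals, Collections.emptySet())) {
            if (otherClauses.test(rows.get(rowKey))) {
                matches.add(rowKey);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        IndexedSliceSketch store = new IndexedSliceSketch();
        store.put("512202", columns("TX", "Austin"));
        store.put("214200", columns("TX", "Dallas"));
        store.put("415200", columns("CA", "San Francisco"));
        System.out.println(store.query("TX", c -> "Austin".equals(c.get("city"))));
    }
}
```

Without the equality clause there is no starting set of candidates, which is why the indexed column has to appear in the query.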

Page 26: Introduction to apache_cassandra_for_develope

Insert, Update and Delete

... are effectively the same operation. 

InsertRowsForColumnFamilies.java
DeleteRowsForColumnFamily.java

Run each in succession (in whichever combination you like) and verify your results on the CLI

Hint: watch the timestamps

bin/cassandra-cli --host localhost
use Tutorial;
list AreaCode;
list Npanxx;
list StateCity;
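The claim that insert, update and delete are effectively one operation can be sketched by treating a delete as a write of a marker (a tombstone) carrying its own timestamp. The model below is hypothetical and in-memory only, and its handling of equal timestamps is a simplification rather than Cassandra's rule.

```java
import java.util.HashMap;
import java.util.Map;

final class TombstoneSketch {

    // A stored cell: a null value marks a tombstone.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    // The single code path shared by insert, update and delete.
    // Assumption: on equal timestamps the existing cell is kept, purely for simplicity.
    private void apply(String column, Cell incoming) {
        Cell current = cells.get(column);
        if (current == null || incoming.timestamp > current.timestamp) {
            cells.put(column, incoming);
        }
    }

    void insert(String column, String value, long timestamp) {
        apply(column, new Cell(value, timestamp));
    }

    void delete(String column, long timestamp) {
        apply(column, new Cell(null, timestamp)); // a delete writes data too
    }

    String read(String column) {
        Cell cell = cells.get(column);
        return cell == null ? null : cell.value;
    }

    int storedCells() {
        return cells.size();
    }

    public static void main(String[] args) {
        TombstoneSketch row = new TombstoneSketch();
        row.insert("city", "Austin", 1L);
        row.delete("city", 2L);
        row.insert("city", "Austin", 1L); // stale write: the newer tombstone still wins
        System.out.println(row.read("city") + " (cells stored: " + row.storedCells() + ")");
    }
}
```

This is why the timestamps are worth watching when running the insert and delete examples in different orders: the result depends on which timestamp is newer, not on which command ran last.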

Page 27: Introduction to apache_cassandra_for_develope

Stuff I Punted on for the Sake of Brevity

meta_* methods
CassandraClusterTest.java: L43-81 @hector

system_* methods
SchemaManipulation.java @ hector-examples
CassandraClusterTest.java: L84-157 @hector

ORM (it works and is in production)
ORM Documentation

multiple nodes

failure scenarios

Data modelling (go see Tyler's presentation)

Page 28: Introduction to apache_cassandra_for_develope

Things to Remember

• deletes and timestamp granularity
• "range ghosts"
• using the wrong column comparator and InvalidRequestException
• deletions actually write data
• use column-level TTL to automate deletion
• "how do I iterate over all the rows in a column family"?
o get_range_slices, but don't do that
o a good sign your data model is wrong

Page 29: Introduction to apache_cassandra_for_develope

Dealing with *Lots* of Data (Briefly)

Two biggest headaches have been addressed:
• Compaction pollutes the OS page cache (CASSANDRA-1470)
• More than 143 million keys in a single SSTable means more Bloom filter (BF) false positives (CASSANDRA-1555)
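The false-positive point can be made concrete with the standard Bloom filter approximation, (1 - e^(-kn/m))^k for m bits, n keys and k hash functions. The sketch below assumes a filter capped at about 2^31 bits and 10 hash functions; both numbers are assumptions for illustration. The 143 million figure is consistent with such a cap at 15 bits per key, but that reading is an inference, not something stated on the slide.

```java
final class BloomFilterMath {

    // Standard approximation of the false-positive rate of a Bloom filter.
    static double falsePositiveRate(double bits, double keys, int hashes) {
        return Math.pow(1.0 - Math.exp(-hashes * keys / bits), hashes);
    }

    // Number of keys that fit in a fixed number of bits at a given bits-per-key budget.
    static long keysThatFit(long bits, int bitsPerKey) {
        return bits / bitsPerKey;
    }

    public static void main(String[] args) {
        long cappedBits = Integer.MAX_VALUE; // assumption: filter size capped near 2^31 bits
        int hashes = 10;                     // assumption: illustrative hash count
        System.out.println("keys at 15 bits each: " + keysThatFit(cappedBits, 15));
        for (long keys : new long[] {100_000_000L, 143_000_000L, 500_000_000L}) {
            System.out.println(keys + " keys -> rate " + falsePositiveRate(cappedBits, keys, hashes));
        }
    }
}
```

Once the filter cannot grow, every additional key raises the false-positive rate, and each false positive costs a wasted SSTable lookup.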

Hadoop integration: Yes. (Go see Jeremy's presentation)

Bulk loading: Yes. CASSANDRA-1278

For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations