Introduction to apache_cassandra_for_develope

Post on 11-May-2015

DESCRIPTION

A presentation for Data Day Austin on January 29th, 2011. Introduces how Java developers can use Apache Cassandra effectively with the Hector Java client: http://github.com/rantav/hector

Transcript

Introduction to Apache Cassandra

(for Java developers!)

Nate McCall
nate@datastax.com
@zznate

Brief Intro 

• NOT a "key/value store"
• Columns are dynamic inside a column family
• SSTables are immutable
• SSTables merged on reads
• All nodes share the same role (i.e. no single point of failure)

Trading ACID compliance for scalability is a fundamental design decision

How does this impact development?

Substantially. 

For operations affecting the same data, that data will become consistent eventually as determined by the timestamps. 

But you can trade availability for consistency. (More on this later)

You can store whatever you want. It's all just bytes.

You need to think about how you will query the data before you write it.
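The timestamp rule above can be illustrated with a minimal sketch. The class and method names are hypothetical (they are not part of Cassandra or Hector), and it assumes the simplest form of reconciliation, where the version with the higher client-supplied timestamp wins.

```java
// Hypothetical model of one version of a column; not Cassandra or Hector API.
final class VersionedColumn {
    final String name;
    final String value;
    final long timestamp; // supplied by the client with every write

    VersionedColumn(String name, String value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    // Last write wins: keep whichever version carries the higher timestamp.
    // On a tie this sketch keeps the first argument, which is a simplification.
    static VersionedColumn reconcile(VersionedColumn a, VersionedColumn b) {
        return b.timestamp > a.timestamp ? b : a;
    }
}
```

Reconciling a version written at timestamp 100 with one written at timestamp 200 returns the latter, regardless of the order in which the two versions arrive.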

Neat. So Now What?

Like any database, you need a client!

• Python:
  o Telephus: http://github.com/driftx/Telephus (Twisted)
  o Pycassa: http://github.com/pycassa/pycassa

• Java:
  o Hector: http://github.com/rantav/hector (Examples https://github.com/zznate/hector-examples )
  o Pelops: http://github.com/s7/scale7-pelops
  o Kundera: http://code.google.com/p/kundera/
  o Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin

• Grails:
  o grails-cassandra: https://github.com/wolpert/grails-cassandra

• .NET:
  o FluentCassandra: http://github.com/managedfusion/fluentcassandra
  o Aquiles: http://aquiles.codeplex.com/

• Ruby:
  o Cassandra: http://github.com/fauna/cassandra

• PHP:
  o phpcassa: http://github.com/thobbs/phpcassa
  o SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

... but do not roll your own

Thrift

• Fast, efficient serialization and network IO.
• Lots of clients available (you can probably use it in other places as well)

Why you don't want to work with the Thrift API directly:
• SuperColumn
• ColumnOrSuperColumn
• ColumnParent.super_column
• ColumnPath.super_column
• Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap
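To give a feel for the nested mutationMap type, here is a minimal sketch of filling one by hand. The Mutation class below is a simplified, hypothetical stand-in for the Thrift-generated class, and the nesting is assumed to be row key, then column family name, then the list of mutations.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MutationMapSketch {
    // Simplified stand-in for the Thrift-generated Mutation (hypothetical).
    static final class Mutation {
        final String columnName;
        final String value;

        Mutation(String columnName, String value) {
            this.columnName = columnName;
            this.value = value;
        }
    }

    // Files one mutation under row key -> column family name -> mutations.
    static void add(Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap,
                    String rowKey, String columnFamily, Mutation mutation) {
        // Keys are raw bytes, so the String has to be encoded first.
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(mutation);
    }
}
```

Even this reduced version needs two levels of map bookkeeping per write, which is the kind of plumbing a higher-level client hides.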

Higher Level Client

Hector
• JMX Counters
• Add/remove hosts:
  o automatically
  o programmatically
  o via JMX
• Pluggable load balancing
• Complete encapsulation of Thrift API
• Type-safe approach to dealing with Apache Cassandra
• Lightweight ORM (supports JPA 1.0 annotations)
• Mavenized! http://repo2.maven.org/maven2/me/prettyprint/

"CQL"

• Currently in Apache Cassandra trunk
• Experimental
• Lots of possibilities

from test/system/test_cql.py:

UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"

SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"

DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"

Avro??

Gone. Added too much complexity after Thrift caught up.  

"None of the libraries distinguished themselves as being a particularly crappy choice for serialization." 

(See CASSANDRA-1765)

Thrift API Methods

Retrieving

Writing/Removing

Meta Information

Schema Manipulation

Thrift API Methods - Retrieving

get: retrieve a single column for a key

get_slice: retrieve a "slice" of columns for a key

multiget_slice: retrieve a "slice" of columns for a list of keys

get_count: counts the columns of a key (you have to deserialize the row to do it)

get_range_slices: retrieve a slice for a range of keys

get_indexed_slices (FTW!)
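The shape of the retrieval calls is easier to see against a toy model. The sketch below is a hypothetical in-memory stand-in (not the Thrift or Hector API) that treats a column family as a map of row keys to sorted columns; it covers get, get_slice, multiget_slice and get_count only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory model: row key -> (column name -> value),
// with column names kept in sorted order.
final class RowModel {
    private final Map<String, NavigableMap<String, String>> rows = new TreeMap<>();

    void insert(String key, String column, String value) {
        rows.computeIfAbsent(key, k -> new TreeMap<>()).put(column, value);
    }

    // Like get: a single column for a key, or null when absent.
    String get(String key, String column) {
        NavigableMap<String, String> row = rows.get(key);
        return row == null ? null : row.get(column);
    }

    // Like get_slice with a range: columns from start to end, both inclusive.
    SortedMap<String, String> getSlice(String key, String start, String end) {
        NavigableMap<String, String> row = rows.get(key);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(start, true, end, true);
    }

    // Like multiget_slice: the same slice for each key in the list.
    Map<String, SortedMap<String, String>> multigetSlice(List<String> keys, String start, String end) {
        Map<String, SortedMap<String, String>> result = new LinkedHashMap<>();
        for (String key : keys) {
            result.put(key, getSlice(key, start, end));
        }
        return result;
    }

    // Like get_count: the number of columns stored under a key.
    int getCount(String key) {
        NavigableMap<String, String> row = rows.get(key);
        return row == null ? 0 : row.size();
    }
}
```

With columns city, lat and state stored under one key, a slice from "city" to "lat" returns two columns, because the slice follows the sorted order of the column names.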

Thrift API Methods - Writing/Removing

insert

batch_mutate (batch insertion AND deletion)

remove

truncate**

Thrift API Methods - Meta Information

describe_cluster_name

describe_version

describe_keyspace

describe_keyspaces

Thrift API Methods - Schema

system_add_keyspace

system_update_keyspace

system_drop_keyspace

system_add_column_family

system_update_column_family

system_drop_column_family

vs. RDBMS - Consistency Level

Consistency is tunable per request!

Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).

*** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***

Idempotent: an operation can be applied multiple times without changing the result
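The R + W > N rule is simple arithmetic, sketched below. The class and method names are hypothetical; the quorum size is the usual floor(N / 2) + 1.

```java
final class ConsistencyCheck {
    // Quorum for a replication factor N: floor(N / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when any set of R read replicas must share at least one replica
    // with any set of W write replicas, i.e. when R + W > N.
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With N = 3, quorum reads and quorum writes give 2 + 2 > 3, so a read always touches a replica that saw the latest successful write. Reading and writing at a single replica gives 1 + 1 = 2, which does not exceed 3, so a read may miss the latest write until the replicas converge.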

vs. RDBMS - Append Only

Proper data modelling will minimize seeks (Go to Tyler's presentation for more!)

On to the Code...

https://github.com/zznate/cassandra-tutorial

Uses Maven. 

Really basic. 

Modify/abuse/alter as needed. 

Descriptions of what is going on and how to run each example are in the Javadoc comments. 

Sample data is based on the North American Numbering Plan: http://en.wikipedia.org/wiki/North_American_Numbering_Plan

Data Shape

512 202 30.27 097.74 W TX Austin
512 203 30.27 097.74 L TX Austin
512 204 30.32 097.73 W TX Austin
512 205 30.32 097.73 W TX Austin
512 206 30.32 097.73 L TX Austin
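A record in this shape could be parsed along the following lines. The field names are assumptions: the slide does not label the columns, so NPA, NXX, latitude, longitude, state and city are inferred from the values, and the fifth field is kept as an opaque flag. The row key built from NPA plus NXX is likewise an assumption based on the "Npanxx" naming.

```java
// Hypothetical parser for one line of the sample data; field names are assumed.
final class NpanxxRecord {
    final String npa;   // area code, e.g. 512
    final String nxx;   // exchange, e.g. 202
    final double lat;
    final double lng;
    final String flag;  // meaning not stated on the slide
    final String state;
    final String city;

    private NpanxxRecord(String[] p) {
        npa = p[0];
        nxx = p[1];
        lat = Double.parseDouble(p[2]);
        lng = Double.parseDouble(p[3]);
        flag = p[4];
        state = p[5];
        city = p[6];
    }

    // Splits on whitespace into at most 7 fields so that a city name
    // containing spaces stays whole in the last field.
    static NpanxxRecord parse(String line) {
        String[] parts = line.trim().split("\\s+", 7);
        if (parts.length != 7) {
            throw new IllegalArgumentException("expected 7 fields: " + line);
        }
        return new NpanxxRecord(parts);
    }

    // Assumed row key: NPA and NXX concatenated.
    String rowKey() {
        return npa + nxx;
    }
}
```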

Get a Single Column for a Key

GetCityForNpanxx.java

Retrieve a single column with:
• Name
• Value
• Timestamp
• TTL

Get the Contents of a Row

GetSliceForNpanxx.java

Retrieves a list of columns (Hector wraps these in a ColumnSlice)

"SlicePredicate" can be either an explicit set of columns OR a range (more on ranges soon)

Another messy either/or choice encapsulated by Hector

Get the (sorted!) Columns of a Row 

GetSliceForStateCity.java

Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)

Can be easily modified to return results in reverse order (but this is slightly slower)
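A small sketch of why the comparator matters: the same column names come back in a different order depending on whether they are compared as text or as numbers. This is plain Java sorting used as an analogy, not Cassandra's comparator classes, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class ComparatorSketch {
    // Orders names as text, roughly what a text-based comparator would do.
    static List<String> sortedAsText(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy);
        return copy;
    }

    // Orders the same names numerically, roughly what a long-based
    // comparator would do.
    static List<String> sortedAsLong(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return copy;
    }
}
```

For the names 9, 10 and 202, text ordering yields 10, 202, 9 while numeric ordering yields 9, 10, 202. A range slice from 9 to 10 therefore only means what you expect under the numeric ordering.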

Get the Same Slice from Several Rows

MultigetSliceForNpanxx.java

Very similar to get_slice examples, except we provide a list of keys

Get Slices From a Range of Rows

GetRangeSlicesForStateCity.java

Like multiget_slice, except we can specify a KeyRange (encapsulated by RangeSlicesQuery#setKeys(start, end))

The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
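The reason key ranges are less meaningful under a hash-based partitioner can be sketched as follows. This assumes rows are ordered by an MD5-derived token of the key rather than by the key itself; the class and method names are hypothetical and this is not Cassandra's partitioner code.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class PartitionerOrderSketch {
    // Token in the style of a hash-based partitioner: the MD5 digest of the
    // key read as a non-negative integer.
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }

    // Order in which rows would be walked when sorted by token.
    static List<String> tokenOrder(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparing((String k) -> md5Token(k)));
        return copy;
    }

    // Order in which rows would be walked when sorted by the key itself.
    static List<String> keyOrder(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }
}
```

Under key ordering, keys that share a prefix sit next to each other, so a start and end key select a sensible group. Under token ordering the same keys are scattered, so a range of keys returns whatever happens to hash between the two endpoints.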

Get Slices From a Range of Rows - 2

GetSliceForAreaCodeCity.java

Compound column name for controlling ranges

Comparator at work on text field
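One way to picture a compound column name is a text name built from two parts, so that a range over the sorted names selects everything sharing the first part. The "__" separator, the city-then-number layout and all names below are assumptions made for this sketch.

```java
import java.util.NavigableMap;
import java.util.SortedMap;

final class CompoundNameSketch {
    static final String SEPARATOR = "__"; // assumed separator for this sketch

    // Builds a compound column name such as "Austin__512202".
    static String name(String city, String npanxx) {
        return city + SEPARATOR + npanxx;
    }

    // Selects every column whose name starts with "<city>__" by taking a
    // range over the sorted names: from the bare prefix up to the prefix
    // followed by the highest possible char.
    static SortedMap<String, String> forCity(NavigableMap<String, String> row, String city) {
        String start = city + SEPARATOR;
        String end = start + Character.MAX_VALUE;
        return row.subMap(start, true, end, true);
    }
}
```

Because the names are compared as text, all columns for one city are contiguous in the row, and a single range slice retrieves them without touching the columns of other cities.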

Get Slices from Indexed Columns

GetIndexedSlicesForCityState.java

You only need to index a single column to apply clauses on other columns (BUT the indexed column must be present with an EQUALS clause!)

(It's just another ColumnFamily maintained automatically)
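The rule that the indexed column must appear with an EQUALS clause follows from how such an index is used: the equality lookup narrows the candidate rows, and the remaining clauses are applied as filters. Below is a toy in-memory sketch of that idea; the names are hypothetical, only the "state" column is indexed, and updates and deletes of indexed values are not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

final class IndexedSliceSketch {
    static final String INDEXED_COLUMN = "state"; // the single indexed column

    private final Map<String, Map<String, String>> rows = new HashMap<>();
    // Index: value of the indexed column -> keys of the rows holding it.
    private final Map<String, Set<String>> index = new HashMap<>();

    void insert(String key, String column, String value) {
        rows.computeIfAbsent(key, k -> new HashMap<>()).put(column, value);
        if (INDEXED_COLUMN.equals(column)) {
            index.computeIfAbsent(value, v -> new TreeSet<>()).add(key);
        }
    }

    // The EQUALS clause on the indexed column picks the candidates; any
    // other clauses are checked row by row against those candidates only.
    List<String> query(String indexedValue, Predicate<Map<String, String>> otherClauses) {
        List<String> matches = new ArrayList<>();
        Set<String> candidates = index.getOrDefault(indexedValue, Collections.<String>emptySet());
        for (String key : candidates) {
            if (otherClauses.test(rows.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }
}
```

A call such as query("TX", cols -> "Austin".equals(cols.get("city"))) filters on city even though city is not indexed, because the lookup on state has already reduced the rows to inspect.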

Insert, Update and Delete

... are effectively the same operation. 

InsertRowsForColumnFamilies.java
DeleteRowsForColumnFamily.java

Run each in succession (in whichever combination you like) and verify your results on the CLI

Hint: watch the timestamps

bin/cassandra-cli --host localhost
use Tutorial;
list AreaCode;
list Npanxx;
list StateCity;
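Why inserts, updates and deletes are "effectively the same operation" can be sketched with a toy model in which a delete is just another timestamped write, recorded as a marker with no value. The names are hypothetical and tie-breaking on equal timestamps is deliberately left out (the existing cell is kept).

```java
import java.util.HashMap;
import java.util.Map;

final class TombstoneSketch {
    // One stored cell; a null value marks a deletion (a "tombstone").
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> columns = new HashMap<>();

    // Insert and update are the same call: a write with a timestamp.
    void write(String name, String value, long timestamp) {
        apply(name, new Cell(value, timestamp));
    }

    // A delete is also a write: it stores a tombstone with a timestamp.
    void delete(String name, long timestamp) {
        apply(name, new Cell(null, timestamp));
    }

    // The incoming cell replaces the stored one only if it is newer.
    private void apply(String name, Cell incoming) {
        Cell current = columns.get(name);
        if (current == null || incoming.timestamp > current.timestamp) {
            columns.put(name, incoming);
        }
    }

    // Returns the live value, or null if the column is absent or deleted.
    String read(String name) {
        Cell cell = columns.get(name);
        return cell == null ? null : cell.value;
    }
}
```

In this model, a write at timestamp 150 that arrives after a delete at timestamp 200 is silently ignored, which is one reason to watch the timestamps when an insert appears to have no effect.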

Stuff I Punted on for the Sake of Brevity

meta_* methods
CassandraClusterTest.java: L43-81 @ hector

system_* methods
SchemaManipulation.java @ hector-examples
CassandraClusterTest.java: L84-157 @ hector

ORM (it works and is in production)
ORM Documentation

multiple nodes

failure scenarios

Data modelling (go see Tyler's presentation)

Things to Remember

• deletes and timestamp granularity
• "range ghosts"
• using the wrong column comparator and InvalidRequestException
• deletions actually write data
• use column-level TTL to automate deletion
• "how do I iterate over all the rows in a column family"?
  o get_range_slices, but don't do that
  o a good sign your data model is wrong

Dealing with *Lots* of Data (Briefly)

Two biggest headaches have been addressed:
• Compaction pollutes OS page cache (CASSANDRA-1470)
• Greater than 143mil keys on a single SSTable means more BF false positives (CASSANDRA-1555)

Hadoop integration: Yes. (Go see Jeremy's presentation)

Bulk loading: Yes. CASSANDRA-1278

For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations
