
@ebenhewitt
10.14.10

strange loop
st louis

adopting apache

• i wrote this

agenda
• context
• features
• data model
• api

“If I had asked the people what they wanted, they would have said ‘faster horses’”.

--Henry Ford

so it turns out, there’s a lot of data in the world…

• Google processes 8 EB of data every year
  – 24 PB every day
  – 1 PB is a quadrillion bytes
  – 1 EB is 1024 PB
• eBay
  – 50 TB of new data every day
• World of Warcraft
  – uses 1.3 PB to store the game
• Chevron
  – 2 TB of data every day
• WalMart’s Customer Database
  – 2004: .5 petabyte = 500 TB

The movie Avatar required 1PB storage

…or the equivalent of a single MP3

…if that MP3 was 32 years long

it ain’t getting any smaller
• 2006: 166 exabytes
• 2010: >1000 exabytes

how do you scale relational databases?

1. tune queries
2. indexes
3. vertical scaling
   – works for a time
   – eventually need to add boxes
4. shard
   – create a horizontal partition (how to join now?)
   – argh
5. denormalize
6. now you have new problems
   – data replication, consistency
   – master/slave (SPOF)
7. update configuration management
   – start doing undesirable things (turn off journaling)
   – caching

the no sql value proposition:

• sql sux
• rdbms sux
• throw out everything you know

• run around like a crazy person

“nosql” “big data”
• mongodb
• couchdb
• tokyo cabinet
• redis
• riak
• what about?
  – Poet, Lotus, Xindice
  – they’ve been around forever
  – rdbms was once the new kid…

what is

distributed
decentralized
fault tolerant
elastic
durable
database

cassandra.apache.org

daughter of Priam & Hecuba

innovation at scale

google bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: hbase, hypertable
• column family, sequential writes, bloom filters, linear insert performance
• CP

amazon dynamo (2007)
• consistency model: client tune-able
• data model: key-value
• O(1) dht
• clones: riak, voldemort
• symmetric p2p, gossip
• AP

proven
• SimpleGeo >50 Large EC2 instances

• Digg: 3TB of data

• The Facebook stores 150TB of data on 150 nodes

• US Government has 400 nodes for analytics in intelligence community in partnership with Digital Reasoning

• Used at Twitter, Rackspace, Mahalo, Reddit,

no free lunch
• no transactions
• no joins
• no ad hoc queries


cassandra properties
• tuneably consistent
• durable, fault tolerant
• very fast writes
• highly available
• linear, elastic scalability
• decentralized/symmetric
• ~12 client languages
  – Thrift RPC API
• ~automatic provisioning of new nodes
• O(1) dht
• big data

consistency

• consistency
  – all clients have same view of data
• availability
  – writeable in the face of node failure
• partition tolerance
  – processing can continue in the face of network failure (crashed router, broken

daniel abadi: pacelc

partition! trade-off A & C

normal condition: tradeoff latency & consistency

write consistency

Level     Description
ZERO      Good luck with that
ANY       1 replica (hints count)
ONE       1 replica. read repair in bkgnd
QUORUM    (N / 2) + 1
ALL       N = replication factor

read consistency

Level     Description
ZERO      Ummm…
ANY       Try ONE instead
ONE       1 replica
QUORUM    Return most recent TS after (N / 2) + 1 report
ALL       N = replication factor
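The QUORUM rows above boil down to a little integer arithmetic. A minimal sketch, assuming integer division for (N / 2) + 1; the class and method names are made up for illustration, and the R + W > N overlap rule is the general quorum rule of thumb rather than something stated on these slides.

```java
/** Sketch of the quorum arithmetic from the consistency-level tables. */
final class QuorumSketch {

    /** QUORUM = (N / 2) + 1 with integer division; N is the replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /**
     * A read of R replicas is guaranteed to touch at least one replica that
     * took the latest successful write of W replicas when R + W > N.
     */
    static boolean readOverlapsWrite(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        int q = quorum(n);
        System.out.println("N=" + n + " QUORUM=" + q);
        System.out.println("QUORUM read + QUORUM write overlap: " + readOverlapsWrite(q, q, n));
        System.out.println("ONE read + ONE write overlap: " + readOverlapsWrite(1, 1, n));
    }
}
```

With N = 3, QUORUM is 2, so one replica can be down and quorum reads and writes still succeed.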

durability

fast writes: staged eda
• A general-purpose framework for high concurrency & load conditioning
• Decomposes applications into stages separated by queues
• Adopt a structured approach to event-driven concurrency

highly


structure

keyspace
• settings (eg, partitioner)

column family…
• settings (eg, comparator, type [Std])

column…
• name
• value
• timestamp

keyspace
• ~= database
• typically one per application
• some settings are configurable only per keyspace
  – partitioner
• Configured in XML in YAML in API

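create a keyspace

// Imports this snippet relies on (Thrift plus Cassandra's generated Thrift classes)
import java.util.ArrayList;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Create Keyspace
KsDef k = new KsDef();
k.setName(keyspaceName);
k.setReplication_factor(1);
k.setStrategy_class("org.apache.cassandra.locator.RackUnawareStrategy");

// No column families defined yet
List<CfDef> cfDefs = new ArrayList<CfDef>();
k.setCf_defs(cfDefs);

// Connect to Server
TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr); // new default
TProtocol proto = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();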

partitioner smack-down

Random
• system will use MD5(key) to distribute data across nodes
• even distribution of keys from one CF across ranges/nodes

Order Preserving
• key distribution determined by token
• lexicographical ordering
• can specify the token for this node to use
• ‘scrabble’ distribution
• required for range queries
  – scan over rows like cursor in index
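To make the smack-down concrete, here is a minimal sketch of both ideas on a toy token ring. It is not Cassandra's partitioner code: the node names, the ring tokens, and the choice to read the MD5 digest as an unsigned integer are all assumptions for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring contrasting random (MD5) and order-preserving placement. */
final class PartitionerSketch {

    /** Random-style token: the MD5 digest of the key read as a non-negative integer. */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat the bytes as unsigned
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Ring lookup: the first node whose token is >= the key's token, wrapping around. */
    static <T extends Comparable<T>> String ownerOf(TreeMap<T, String> ring, T token) {
        Map.Entry<T, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Three hypothetical nodes with tokens spaced evenly over the 128-bit MD5 range. */
    static TreeMap<BigInteger, String> randomRing() {
        BigInteger third = BigInteger.ONE.shiftLeft(128).divide(BigInteger.valueOf(3));
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        ring.put(third, "node-a");
        ring.put(third.multiply(BigInteger.valueOf(2)), "node-b");
        ring.put(third.multiply(BigInteger.valueOf(3)), "node-c");
        return ring;
    }

    /** Order-preserving style: tokens are key values, so hand-picked letters split the alphabet. */
    static TreeMap<String, String> orderedRing() {
        TreeMap<String, String> ring = new TreeMap<>();
        ring.put("h", "node-a");
        ring.put("p", "node-b");
        ring.put("z", "node-c");
        return ring;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"alison", "eben", "mike"}) {
            System.out.println(key
                + " random=" + ownerOf(randomRing(), md5Token(key))
                + " ordered=" + ownerOf(orderedRing(), key));
        }
    }
}
```

On the ordered ring, neighbouring keys land on the same or adjacent nodes, which is what makes row range scans possible and also what produces the uneven ‘scrabble’ distribution. The MD5 ring scatters the same keys with no regard for their order.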

column family
• group records of similar kind
• CFs are sparse tables
• ex:
  – Tweet
  – Address
  – Customer
  – PointOfInterest

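column family
[diagram: a column family with row keys key123 and key456 down the side and each row's columns across; columns shown include user=eben, nickname=The Situation, user=alison, icon=, n=42]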

json-like notation

User {
  123 : { user: eben, nickname: The Situation },
  456 : { user: alison, icon: , : The Danger Zone }
}
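One way to picture the notation above is as a map of sorted maps: row key to (column name to value), where each row carries only the columns it actually has. This is a rough in-memory sketch with made-up class and method names, not how Cassandra stores data; the icon column is left out because its value does not survive on the slide.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory picture of a sparse column family: row key -> (column name -> value). */
final class SparseColumnFamily {

    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    /** Adds or overwrites one column in one row; rows need not share column names. */
    void set(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Returns the column value, or null when the row or the column is absent. */
    String get(String rowKey, String column) {
        return row(rowKey).get(column);
    }

    /** Returns the row's columns sorted by name (empty when the row is absent). */
    SortedMap<String, String> row(String rowKey) {
        SortedMap<String, String> r = rows.get(rowKey);
        return r == null ? Collections.emptySortedMap() : r;
    }

    /** Builds the User example with the values that are legible on the slide. */
    static SparseColumnFamily demo() {
        SparseColumnFamily user = new SparseColumnFamily();
        user.set("123", "user", "eben");
        user.set("123", "nickname", "The Situation");
        user.set("456", "user", "alison");
        return user;
    }

    public static void main(String[] args) {
        SparseColumnFamily user = demo();
        System.out.println("row 123: " + user.row("123"));
        System.out.println("row 456: " + user.row("456"));
        System.out.println("456 nickname: " + user.get("456", "nickname"));
    }
}
```

Row 456 simply has no nickname entry. There is no null placeholder and no schema change, which is what "sparse" means here.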

think of cassandra as

row-oriented
• each row is uniquely identifiable by key
• rows group columns and super

a column has 3 parts
1. name
   – byte[]
   – determines sort order
   – used in queries
   – indexed
2. value
   – byte[]
   – you don’t query on column values
3. timestamp
   – long (clock)
   – last-write-wins conflict resolution
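Last-write-wins can be sketched in a few lines: when two versions of the same column meet, keep the one with the higher timestamp. The Column class below is a stand-in with String fields rather than byte[], and keeping the first argument on a tie is an assumption of this sketch, not a statement about Cassandra's tie-breaking.

```java
/** Sketch of last-write-wins reconciliation between two versions of a column. */
final class LastWriteWinsSketch {

    /** Stand-in for a column: name, value, and the client-supplied timestamp. */
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Returns the version with the higher timestamp; on a tie, keeps the first argument. */
    static Column reconcile(Column a, Column b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        Column older = new Column("age", "29", 1000L);
        Column newer = new Column("age", "30", 2000L);
        // The order in which replicas report does not matter; the timestamp decides.
        System.out.println(reconcile(older, newer).value);
        System.out.println(reconcile(newer, older).value);
    }
}
```

Because the timestamp comes from the client, clients with skewed clocks can make an older write win.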

get started

$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['eben']['age']='29'
cassandra> set Keyspace1.Standard1['eben']['email']='[email protected]'
cassandra> get Keyspace1.Standard1['eben']['age']

=> (column=6e616d65, value=29,

column comparators
• byte
• utf8
• long
• timeuuid (version 1)
• lexicaluuid (any, usually version 4)
• <pluggable>
  – ex: lat/long

super

super columns group columns under a common name

<<SCF>> PointOfInterest
  key 10017
    <<SC>> Central Park
      desc=Fun to walk in.
    <<SC>> Empire State Bldg
      phone=212.555.11212
      desc=Great view from 102nd floor!
  key 63112
    <<SC>> The Loop
      phone=314.555.11212
      desc=Home of Strange Loop!

PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}


super column

super column family

flexible schema

key

column

super column

about super column families
• sub-column names in a SCF are not indexed
  – top level columns (SCF Name) are always indexed
• often used for denormalizing data from standard CFs

rdbms: domain-based model

what answers do I have?
big query language

cassandra: query-based model

what questions do I have?

replica/tion
• configurable replication factor
• replica placement strategy

rack unaware         Simple Strategy
rack aware           Old Network Topology Strategy
data center shard    Network Topology Strategy


slice predicate
• data structure describing columns to return
  – SliceRange
    • start column name (byte[])
    • finish column name (can be empty to stop on count)
    • reverse
    • count (like LIMIT)
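The four SliceRange fields can be sketched against a sorted map standing in for one row's columns. This is an illustration only: the method and class names are invented, column names are Strings instead of byte[], and treating both start and finish as inclusive is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sketch of SliceRange semantics over one row's sorted columns. */
final class SliceRangeSketch {

    /**
     * Returns up to count column names between start and finish (both inclusive).
     * An empty start or finish means "no bound on that side"; reversed walks the
     * columns in descending order, so start should then sort after finish.
     */
    static List<String> slice(NavigableMap<String, String> row,
                              String start, String finish, boolean reversed, int count) {
        NavigableMap<String, String> view = reversed ? row.descendingMap() : row;
        if (!start.isEmpty()) {
            view = view.tailMap(start, true);
        }
        if (!finish.isEmpty()) {
            view = view.headMap(finish, true);
        }
        List<String> names = new ArrayList<>();
        for (String name : view.keySet()) {
            if (names.size() == count) {
                break; // count behaves like LIMIT
            }
            names.add(name);
        }
        return names;
    }

    /** A demo row with columns a..e. */
    static NavigableMap<String, String> demoRow() {
        NavigableMap<String, String> row = new TreeMap<>();
        for (String name : new String[] {"a", "b", "c", "d", "e"}) {
            row.put(name, "value-" + name);
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(slice(demoRow(), "b", "d", false, 100)); // bounded slice
        System.out.println(slice(demoRow(), "", "", false, 2));     // first two columns
        System.out.println(slice(demoRow(), "d", "b", true, 100));  // reversed slice
    }
}
```

Leaving finish empty and setting count is the "stop on count" case from the slide: take the first count columns from start onwards.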

read api
• get() : Column
  – get the Col or SC at given ColPath
    COSC cosc = client.get(key, path, CL);
• get_slice() : List<ColumnOrSuperColumn>
  – get Cols in one row, specified by SlicePredicate:
    List<ColumnOrSuperColumn> results = client.get_slice(key, parent, predicate, CL);
• multiget_slice() : Map<key, List<CoSC>>
  – get slices for list of keys, based on SlicePredicate
    Map<byte[],List<ColumnOrSuperColumn>> results = client.multiget_slice(rowKeys, parent, predicate, CL);
• get_range_slices() : List<KeySlice>
  – returns multiple Cols according to a range
  – range is startkey, endkey, starttoken, endtoken:
    List<KeySlice> slices = client.get_range_slices(

insert

insert(userIDKey, cp,
    new Column("name".getBytes(UTF8), "George Clinton".getBytes(UTF8), clock),
    CL);

delete

String columnFamily = "Standard1";
byte[] key = "k2".getBytes(); // row key

Clock clock = new Clock(System.currentTimeMillis());

// Path to the single column to remove
ColumnPath colPath = new ColumnPath();
colPath.column_family = columnFamily;
colPath.column = "b".getBytes();

client.remove(key, colPath, clock, ConsistencyLevel.ALL);

batch_mutate

// row key -> (column family name -> mutations)
Map<byte[], Map<String, List<Mutation>>> mutationMap =
    new HashMap<byte[], Map<String, List<Mutation>>>();

List<Mutation> mutationList = new ArrayList<Mutation>();
mutationList.add(mutation);

Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
m.put(columnFamily, mutationList);

// just for this row key, though we could add more
mutationMap.put(key, m);
client.batch_mutate(mutationMap, ConsistencyLevel.ALL);

raw thrift: for masochists

• pycassa (python)
• Telephus (twisted python)
• fauna/cassandra gem (ruby)
• hector (java)
• pelops (java)
• kundera (JPA)
• hectorSharp (C#)

what about…

SELECT WHERE
ORDER BY
JOIN ON
GROUP?

SELECT WHERE
cassandra is an index factory

<<cf>>USER
Key: UserID
Cols: username, email, birth date, city, state

How to support this query?

SELECT * FROM User WHERE city = 'Scottsdale'

Create a new CF called UserCity:

<<cf>>USERCITY
Key: city

• Use an aggregate key state:city: { user1, user2}
• Get rows between AZ: & AZ; for all Arizona users
• Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users
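The aggregate-key trick works because ';' is the character right after ':' in ASCII, so every key that starts with "AZ:" sorts between "AZ:" and "AZ;". A minimal sketch, with a TreeMap standing in for the USERCITY column family under an order-preserving layout; the class, its methods, and the sample users are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch of a hand-built index: "state:city" row key -> user ids. */
final class UserCityIndexSketch {

    // Stand-in for the USERCITY column family, with rows kept in key order.
    private final TreeMap<String, Set<String>> userCity = new TreeMap<>();

    /** Called alongside every write to USER so the index stays in step. */
    void index(String state, String city, String userId) {
        userCity.computeIfAbsent(state + ":" + city, k -> new TreeSet<>()).add(userId);
    }

    /** Collects user ids from all rows with startKey <= key < endKey, in key order. */
    List<String> usersBetween(String startKey, String endKey) {
        List<String> users = new ArrayList<>();
        for (Set<String> ids : userCity.subMap(startKey, true, endKey, false).values()) {
            users.addAll(ids);
        }
        return users;
    }

    /** Sample data, made up for this sketch. */
    static UserCityIndexSketch demo() {
        UserCityIndexSketch idx = new UserCityIndexSketch();
        idx.index("AZ", "Scottsdale", "user1");
        idx.index("AZ", "Scottsdale", "user2");
        idx.index("AZ", "Phoenix", "user3");
        idx.index("CA", "San Diego", "user4");
        return idx;
    }

    public static void main(String[] args) {
        UserCityIndexSketch idx = demo();
        System.out.println("Arizona: " + idx.usersBetween("AZ:", "AZ;"));
        System.out.println("Scottsdale: " + idx.usersBetween("AZ:Scottsdale", "AZ:Scottsdale1"));
    }
}
```

The cost is the usual one for denormalizing: the application has to write to both column families, and the row range scan depends on the order-preserving partitioner.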

SELECT WHERE pt 2

ORDER BY

Rows
• are placed according to their Partitioner:
  – Random: MD5 of key
  – Order-Preserving: actual key
• are sorted by key, regardless of partitioner

Columns
• are sorted according to CompareWith or CompareSubcolumnsWith

data
• skinny rows, wide rows (billions of columns)
• denormalize known queries
  – secondary index support in 0.7
• client join others
• 2 caching layers: row, index

is cassandra a good fit?
• sub-millisecond writes
• you need durability
• you have lots of data
  – > GBs
  – >= three servers
• growing data over time
• your app is evolving
  – startup mode, fluid data structure
• loose domain data
  – “points of interest”
• multi data-center
• your programmers can deal
  – documentation
  – complexity
  – consistency model
  – change
  – visibility tools
• your operations can deal
  – hardware considerations
  – can move data
  – JMX monitoring

use cases
• jboss.org/infinispan
  – data grid cache
• log data stream
• hotelier
  – points of interest
  – guests
• geospatial
• travel
  – segment analytics

With Hadoop!
• BI w/o ETL
• raptr.com
  – storage & analytics for gaming stats
• imagini
  – visual quizzes for publishers
  – real time for 100s of millions of users

coming in 0.7
• secondary indexes
• hadoop improvements
• large row support ( > 2GB)
• dynamic routing around slow nodes

YOU ALREADY HAVE THE RIGHT DATABASE TODAY
FOR THE APPLICATION YOU HAVE TODAY

what would you do if scale wasn’t a problem?

@ebenhewitt
cassandra.apache.org

“An invention has to make sense in the world in which it is finished, not the world in which it is started”.

--Ray Kurzweil
