@ebenhewitt 10. 14. 10 strange loop st louis adopting apache
Transcript
Page 1: Untitled

@ebenhewitt 10.14.10

strange loop, st louis

adopting apache

Page 2: Untitled

• i wrote this

Page 3: Untitled

agenda
• context
• features
• data model
• api

“If I had asked the people what they wanted, they would have said ‘faster horses’”.

--Henry Ford

Page 4: Untitled

so it turns out, there’s a lot of data in the world…

• Google processes 8 EB of data every year
  – 24 PB every day
  – 1 PB is a quadrillion bytes
  – 1 EB is 1024 PB
• eBay
  – 50 TB of new data every day
• World of Warcraft
  – uses 1.3 PB to store the game
• Chevron
  – 2 TB of data every day
• WalMart’s Customer Database
  – 2004: 0.5 petabyte = 500 TB

Page 5: Untitled

The movie Avatar required 1PB storage

…or the equivalent of a single MP3

Page 6: Untitled

…if that MP3 was 32 years long

Page 7: Untitled

it ain’t getting any smaller
• 2006: 166 exabytes
• 2010: >1000 exabytes

Page 8: Untitled

how do you scale relational databases?

1. tune queries
2. indexes
3. vertical scaling
   – works for a time
   – eventually need to add boxes
4. shard
   – create a horizontal partition (how to join now?)
   – argh
5. denormalize
6. now you have new problems
   – data replication, consistency
   – master/slave (SPOF)
7. update configuration management
   – start doing undesirable things (turn off journaling)
   – caching

Page 9: Untitled

the no sql value proposition:

• sql sux
• rdbms sux
• throw out everything you know
• run around like a crazy person

Page 10: Untitled

“nosql” “big data”
• mongodb
• couchdb
• tokyo cabinet
• redis
• riak
• what about?
  – Poet, Lotus, Xindice
  – they’ve been around forever
  – rdbms was once the new kid…

Page 11: Untitled

what is

distributed, decentralized, fault tolerant, elastic, durable database

cassandra.apache.org

daughter of Priam & Hecuba

Page 12: Untitled

innovation at scale

google bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: hbase, hypertable
• column family, sequential writes, bloom filters, linear insert performance
• CP

amazon dynamo (2007)
• consistency model: client tune-able
• data model: key-value
• O(1) dht
• clones: riak, voldemort
• symmetric p2p, gossip
• AP

Page 13: Untitled

proven
• SimpleGeo: >50 Large EC2 instances

• Digg: 3TB of data

• The Facebook stores 150TB of data on 150 nodes

• US Government has 400 nodes for analytics in intelligence community in partnership with Digital Reasoning

• Used at Twitter, Rackspace, Mahalo, Reddit,

Page 14: Untitled

no free lunch
• no transactions
• no joins
• no ad hoc queries

Page 15: Untitled

agenda
• context
• features
• data model
• api

Page 16: Untitled

cassandra properties
• tuneably consistent
• durable, fault tolerant
• very fast writes
• highly available
• linear, elastic scalability
• decentralized/symmetric
• ~12 client languages
  – Thrift RPC API
• ~automatic provisioning of new nodes
• O(1) dht
• big data

Page 17: Untitled

consistency

• consistency
  – all clients have same view of data
• availability
  – writeable in the face of node failure
• partition tolerance
  – processing can continue in the face of network failure (crashed router, broken

Page 18: Untitled

daniel abadi: pacelc

partition! trade-off A & C

normal condition: tradeoff latency & consistency

Page 19: Untitled

write consistency

Level    Description
ZERO     Good luck with that
ANY      1 replica (hints count)
ONE      1 replica. read repair in bkgnd
QUORUM   (N / 2) + 1
ALL      N = replication factor

read consistency

Level    Description
ZERO     Ummm…
ANY      Try ONE instead
ONE      1 replica
QUORUM   Return most recent TS after (N / 2) + 1 report
ALL      N = replication factor
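The QUORUM rows reduce to a little arithmetic. The sketch below is an illustration only: it computes (N / 2) + 1 with integer division, and checks the usual Dynamo-style overlap rule that a read is guaranteed to touch at least one replica holding the latest write when R + W > N. The class and method names are made up for this example and are not part of the Cassandra API.

```java
final class QuorumMath {
    /** Replicas that must respond for QUORUM, given replication factor n. */
    static int quorum(int n) {
        return (n / 2) + 1; // integer division, as in the table
    }

    /** True when any read set of size r must overlap any write set of size w. */
    static boolean readOverlapsWrite(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        int q = quorum(n);
        System.out.println("N=" + n + " QUORUM=" + q);
        System.out.println("QUORUM write + QUORUM read overlap: " + readOverlapsWrite(q, q, n));
        System.out.println("ONE write + ONE read overlap: " + readOverlapsWrite(1, 1, n));
    }
}
```

With N = 3, writing and reading at QUORUM touches 2 replicas each, so the two sets always share a node; ONE/ONE trades that guarantee for lower latency.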

Page 20: Untitled

durability

Page 21: Untitled

fast writes: staged eda
• A general-purpose framework for high concurrency & load conditioning
• Decomposes applications into stages separated by queues
• Adopt a structured approach to event-driven concurrency
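A rough way to picture "stages separated by queues" using only the JDK: each stage below is an executor with its own thread pool and its own internal work queue, and a request is handed from one stage to the next. This is a toy sketch of the staged idea, not Cassandra's actual stage classes; the stage names and the string transformations are invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class StagedPipelineSketch {
    /** Runs one request through two stages, each backed by its own queue and threads. */
    static String process(String request) {
        // Each executor owns an internal queue; work waits there until a stage thread is free.
        ExecutorService parseStage = Executors.newSingleThreadExecutor();
        ExecutorService applyStage = Executors.newFixedThreadPool(2);
        try {
            return CompletableFuture
                    .supplyAsync(() -> request.trim().toLowerCase(), parseStage) // stage 1
                    .thenApplyAsync(parsed -> "applied:" + parsed, applyStage)   // stage 2
                    .join();
        } finally {
            parseStage.shutdown();
            applyStage.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(process("  SET age "));
    }
}
```

Because each stage has its own pool size, a slow stage can be given more threads (or a bounded queue) without touching the others, which is the load-conditioning point on the slide.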

Page 22: Untitled

highly

Page 23: Untitled

agenda
• context
• features
• data model
• api

Page 24: Untitled

structure

keyspace
• settings (eg, partitioner)

column family…
• settings (eg, comparator, type [Std])

column…
• name
• value
• timestamp

Page 25: Untitled

keyspace

• ~= database
• typically one per application
• some settings are configurable only per keyspace
  – partitioner

• Configured in XML in YAML in API

Page 26: Untitled

create a keyspace

    // Create Keyspace
    KsDef k = new KsDef();
    k.setName(keyspaceName);
    k.setReplication_factor(1);
    k.setStrategy_class("org.apache.cassandra.locator.RackUnawareStrategy");

    List<CfDef> cfDefs = new ArrayList<CfDef>();
    k.setCf_defs(cfDefs);

    // Connect to Server
    TTransport tr = new TSocket(HOST, PORT);
    TFramedTransport tf = new TFramedTransport(tr); // new default
    TProtocol proto = new TBinaryProtocol(tf);
    Cassandra.Client client = new Cassandra.Client(proto);
    tr.open();

Page 27: Untitled

partitioner smack-down

Random
• system will use MD5(key) to distribute data across nodes
• even distribution of keys from one CF across ranges/nodes

Order Preserving
• key distribution determined by token
• lexicographical ordering
• can specify the token for this node to use
• ‘scrabble’ distribution
• required for range queries
  – scan over rows like cursor in index
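To make the two partitioners concrete, here is a small sketch using only the JDK. It assumes a Random-style token is the MD5 digest of the key read as a non-negative integer, and an Order-Preserving token is simply the key itself; a node owns the range ending at its token, wrapping around the ring. The ring contents and all names are hypothetical and chosen for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class PartitionerSketch {
    /** Random-style token: MD5 of the key, interpreted as a non-negative integer. */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    /** Owner is the first node whose token is >= the given token, wrapping to the first node. */
    static <T extends Comparable<T>> String ownerOf(TreeMap<T, String> ring, T token) {
        Map.Entry<T, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Order-preserving ring: tokens are key values, so adjacent keys land on the same node.
        TreeMap<String, String> orderedRing = new TreeMap<>();
        orderedRing.put("g", "nodeA");
        orderedRing.put("n", "nodeB");
        orderedRing.put("t", "nodeC");
        System.out.println("eben -> " + ownerOf(orderedRing, "eben"));
        System.out.println("zed  -> " + ownerOf(orderedRing, "zed")); // past the last token: wraps

        // Random-style token: adjacent keys hash to unrelated positions on the ring.
        System.out.println("md5 token of eben: " + md5Token("eben"));
    }
}
```

The ordered ring is what makes range scans possible, and also why a skewed key distribution (the 'scrabble' problem) produces hot nodes; hashing spreads keys evenly but destroys key order.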

Page 28: Untitled

column family
• group records of similar kind
• CFs are sparse tables
• ex:
  – Tweet
  – Address
  – Customer
  – PointOfInterest

Page 29: Untitled

column family

[diagram: a column family drawn as a sparse table. keys: key123, key456. columns shown: user=eben, n=42, nickname=The Situation, user=alison, icon=(image)]

Page 30: Untitled

json-like notation

User {
  123 : { user: eben, nickname: The Situation },
  456 : { user: alison, icon: , : The Danger Zone }
}
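One way to read the notation above is as a map of row keys to sorted maps of columns, where each row carries only the columns it actually has. The sketch below models that with JDK collections; it uses Strings where Cassandra uses byte[], and the class and method names are invented for illustration.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class SparseColumnFamilySketch {
    // row key -> (column name -> value); columns within a row stay sorted by name
    private final Map<String, SortedMap<String, String>> rows = new TreeMap<>();

    void set(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    /** Returns the value, or null when the row or the column does not exist. */
    String get(String rowKey, String columnName) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(columnName);
    }

    public static void main(String[] args) {
        SparseColumnFamilySketch user = new SparseColumnFamilySketch();
        user.set("123", "user", "eben");
        user.set("123", "nickname", "The Situation");
        user.set("456", "user", "alison");
        // Row 456 simply has no nickname column; nothing is stored for it.
        System.out.println(user.get("123", "nickname"));
        System.out.println(user.get("456", "nickname"));
    }
}
```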

Page 31: Untitled

think of cassandra as row-oriented
• each row is uniquely identifiable by key
• rows group columns and super

Page 32: Untitled

a column has 3 parts
1. name
   – byte[]
   – determines sort order
   – used in queries
   – indexed
2. value
   – byte[]
   – you don’t query on column values
3. timestamp
   – long (clock)
   – last-write-wins conflict resolution
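Last-write-wins can be sketched in a few lines: when two versions of the same column disagree, keep the one with the larger timestamp. The class below is a simplified stand-in written for this example (String fields instead of byte[], and ties keep the first argument, which is an assumption made here rather than a statement about Cassandra's tie-breaking).

```java
final class TimestampedColumn {
    final String name;
    final String value;
    final long timestamp;

    TimestampedColumn(String name, String value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    /** Picks the version with the larger timestamp; on a tie, keeps the first argument. */
    static TimestampedColumn resolve(TimestampedColumn a, TimestampedColumn b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        TimestampedColumn older = new TimestampedColumn("email", "old@example.com", 100L);
        TimestampedColumn newer = new TimestampedColumn("email", "new@example.com", 200L);
        // Argument order does not matter; the clock value decides.
        System.out.println(resolve(older, newer).value);
        System.out.println(resolve(newer, older).value);
    }
}
```

This is also why client clocks matter: the timestamp is supplied by the writer, so a client with a clock running behind can lose to an older write.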

Page 33: Untitled

get started

$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['eben']['age']='29'
cassandra> set Keyspace1.Standard1['eben']['email']='[email protected]'
cassandra> get Keyspace1.Standard1['eben']['age']
=> (column=6e616d65, value=29,

Page 34: Untitled

column comparators
• byte
• utf8
• long
• timeuuid (version 1)
• lexicaluuid (any, usually version 4)
• <pluggable>
  – ex: lat/long

Page 35: Untitled

super

super columns group columns under a common name

Page 36: Untitled

[diagram: a super column family]

<<SCF>> PointOfInterest
  key 10017
    <<SC>> Central Park
      desc=Fun to walk in.
    <<SC>> Empire State Bldg
      phone=212.555.11212
      desc=Great view from 102nd floor!
  key 63112
    <<SC>> The Loop
      phone=314.555.11212
      desc=Home of Strange Loop!

Page 37: Untitled

PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx

  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}

[callouts on the slide: super column family, key, super column, column, flexible schema]

Page 38: Untitled

about super column families
• sub-column names in a SCF are not indexed
  – top level columns (SCF Name) are always indexed
• often used for denormalizing data from standard CFs

Page 39: Untitled

rdbms: domain-based model

what answers do I have?

big query language

cassandra: query-based model

what questions do I have?

Page 40: Untitled

replication
• configurable replication factor
• replica placement strategy

rack unaware        Simple Strategy
rack aware          Old Network Topology Strategy
data center shard   Network Topology Strategy
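The rack-unaware placement can be pictured as a walk around the token ring: find the node that owns the key's token, then take the next nodes clockwise until the replication factor is met. The sketch below assumes one integer token per node and invented node names; it illustrates the idea and is not Cassandra's strategy class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class ReplicaPlacementSketch {
    /** Returns the owner of the token followed by the next nodes clockwise on the ring. */
    static List<String> replicasFor(TreeMap<Integer, String> ring, int token, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        Integer position = ring.ceilingKey(token);
        if (position == null) {
            position = ring.firstKey(); // wrap around the ring
        }
        int wanted = Math.min(replicationFactor, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(ring.get(position));
            position = ring.higherKey(position);
            if (position == null) {
                position = ring.firstKey();
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(0, "A");
        ring.put(25, "B");
        ring.put(50, "C");
        ring.put(75, "D");
        System.out.println(replicasFor(ring, 60, 3));
    }
}
```

The topology-aware strategies differ in which nodes they skip during the walk (same rack, same data center), not in the ring walk itself.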

Page 41: Untitled

agenda
• context
• features
• data model
• api

Page 42: Untitled

slice predicate
• data structure describing columns to return
  – SliceRange
    • start column name (byte[])
    • finish column name (can be empty to stop on count)
    • reverse
    • count (like LIMIT)
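The four SliceRange fields behave much like a bounded scan over one row's sorted columns. The sketch below imitates that over a JDK NavigableMap: an empty start or finish means "unbounded", reversed walks the columns in descending order (so start is the larger name), and count caps the result. It uses Strings where the API uses byte[], and the method is a stand-in written for this example rather than the Thrift call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SliceSketch {
    /** Returns up to count column names between start and finish (inclusive), in scan order. */
    static List<String> slice(NavigableMap<String, String> row, String start, String finish,
                              boolean reversed, int count) {
        NavigableMap<String, String> view = reversed ? row.descendingMap() : row;
        Comparator<String> order = reversed
                ? Comparator.<String>reverseOrder()
                : Comparator.<String>naturalOrder();
        List<String> names = new ArrayList<>();
        for (String name : view.keySet()) {
            if (!start.isEmpty() && order.compare(name, start) < 0) {
                continue; // not yet at the start of the range
            }
            if (!finish.isEmpty() && order.compare(name, finish) > 0) {
                break; // past the end of the range
            }
            if (names.size() >= count) {
                break; // like LIMIT
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (String name : new String[] {"a", "b", "c", "d", "e"}) {
            row.put(name, "");
        }
        System.out.println(slice(row, "b", "d", false, 10));
        System.out.println(slice(row, "", "", true, 2));
    }
}
```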

Page 43: Untitled

read api
• get() : Column
  – get the Col or SC at given ColPath
    COSC cosc = client.get(key, path, CL);
• get_slice() : List<ColumnOrSuperColumn>
  – get Cols in one row, specified by SlicePredicate:
    List<ColumnOrSuperColumn> results = client.get_slice(key, parent, predicate, CL);
• multiget_slice() : Map<key, List<CoSC>>
  – get slices for list of keys, based on SlicePredicate
    Map<byte[],List<ColumnOrSuperColumn>> results = client.multiget_slice(rowKeys, parent, predicate, CL);
• get_range_slices() : List<KeySlice>
  – returns multiple Cols according to a range
  – range is startkey, endkey, starttoken, endtoken:
    List<KeySlice> slices = client.get_range_slices(

Page 44: Untitled

insert

    insert(userIDKey, cp,
        new Column("name".getBytes(UTF8), "George Clinton".getBytes(), clock),
        CL);

Page 45: Untitled

delete

    String columnFamily = "Standard1";
    byte[] key = "k2".getBytes(); // row key

    Clock clock = new Clock(System.currentTimeMillis());

    ColumnPath colPath = new ColumnPath();
    colPath.column_family = columnFamily;
    colPath.column = "b".getBytes();

    client.remove(key, colPath, clock, ConsistencyLevel.ALL);

Page 46: Untitled

batch_mutate

    Map<byte[], Map<String, List<Mutation>>> mutationMap =
        new HashMap<byte[], Map<String, List<Mutation>>>();

    List<Mutation> mutationList = new ArrayList<Mutation>();
    mutationList.add(mutation);

    Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
    m.put(columnFamily, mutationList);

    // just for this row key, though we could add more
    mutationMap.put(key, m);
    client.batch_mutate(mutationMap, ConsistencyLevel.ALL);

Page 47: Untitled

raw thrift: for masochists

• pycassa (python)
• Telephus (twisted python)
• fauna/cassandra gem (ruby)
• hector (java)
• pelops (java)
• kundera (JPA)
• hectorSharp (C#)

Page 48: Untitled

what about…

SELECT WHERE
ORDER BY
JOIN ON
GROUP?

Page 49: Untitled

SELECT WHERE

cassandra is an index factory

<<cf>> USER
Key: UserID
Cols: username, email, birth date, city, state

How to support this query?

SELECT * FROM User WHERE city = 'Scottsdale'

Create a new CF called UserCity:

<<cf>> USERCITY
Key: city
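The "index factory" idea is that the application writes the index itself: every insert into USER is paired with an insert into USERCITY, and the WHERE query becomes a single row lookup. The sketch below models both column families as in-memory maps. The slide only shows "Key: city" for USERCITY, so storing user IDs as that row's column names is an assumption of this example, as are all class and method names.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class UserCityIndexSketch {
    // USER: row key = user id, columns = attribute name -> value
    private final Map<String, Map<String, String>> user = new TreeMap<>();
    // USERCITY: row key = city, column names = user ids (assumed layout)
    private final Map<String, SortedSet<String>> userCity = new TreeMap<>();

    /** Writes the user row and the matching index entry together. */
    void insertUser(String userId, String username, String city) {
        Map<String, String> columns = new TreeMap<>();
        columns.put("username", username);
        columns.put("city", city);
        user.put(userId, columns);
        userCity.computeIfAbsent(city, c -> new TreeSet<>()).add(userId);
    }

    /** Equivalent of SELECT * FROM User WHERE city = ?, answered from the index row. */
    SortedSet<String> usersIn(String city) {
        return userCity.getOrDefault(city, Collections.emptySortedSet());
    }

    String usernameOf(String userId) {
        Map<String, String> columns = user.get(userId);
        return columns == null ? null : columns.get("username");
    }

    public static void main(String[] args) {
        UserCityIndexSketch store = new UserCityIndexSketch();
        store.insertUser("u1", "eben", "Scottsdale");
        store.insertUser("u2", "alison", "Scottsdale");
        store.insertUser("u3", "george", "Phoenix");
        System.out.println(store.usersIn("Scottsdale"));
    }
}
```

The sketch does not clean up the old index entry when a user changes city; in a real application that bookkeeping is the price of maintaining the index by hand.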

Page 50: Untitled

• Use an aggregate key state:city: { user1, user2}

• Get rows between AZ: & AZ; for all Arizona users

• Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users

SELECT WHERE pt 2
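The aggregate-key trick relies on plain string ordering: ';' is the character right after ':' in ASCII, so every key that starts with "AZ:" sorts between "AZ:" and "AZ;". The sketch below shows the range scan over a sorted map standing in for an order-preserving keyspace. Treating both bounds as inclusive, and the sample rows themselves, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

final class AggregateKeySketch {
    /** Sample rows keyed by state:city, each holding the users in that city. */
    static TreeMap<String, List<String>> sampleRows() {
        TreeMap<String, List<String>> rows = new TreeMap<>();
        rows.put("AZ:Phoenix", Arrays.asList("user3"));
        rows.put("AZ:Scottsdale", Arrays.asList("user1", "user2"));
        rows.put("AZ:Tempe", Arrays.asList("user5"));
        rows.put("CA:Fresno", Arrays.asList("user4"));
        return rows;
    }

    /** Collects the users of every row whose key falls between startKey and endKey (inclusive). */
    static List<String> usersBetween(TreeMap<String, List<String>> rows, String startKey, String endKey) {
        List<String> users = new ArrayList<>();
        for (List<String> rowUsers : rows.subMap(startKey, true, endKey, true).values()) {
            users.addAll(rowUsers);
        }
        return users;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> rows = sampleRows();
        System.out.println("Arizona:    " + usersBetween(rows, "AZ:", "AZ;"));
        System.out.println("Scottsdale: " + usersBetween(rows, "AZ:Scottsdale", "AZ:Scottsdale1"));
    }
}
```

This only works when rows are stored in key order, which is why the slide's approach depends on the order-preserving partitioner.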

Page 51: Untitled

ORDER BY

Rows
• are placed according to their Partitioner:
  – Random: MD5 of key
  – Order-Preserving: actual key
• are sorted by key, regardless of partitioner

Columns

are sorted according to CompareWith or CompareSubcolumnsWith
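Since columns come back in comparator order, the comparator is effectively the ORDER BY clause, chosen once per column family. The sketch below sorts the same column names under two orderings using a JDK TreeMap: a string ordering standing in for a UTF8-style comparator and a numeric ordering standing in for a long-style comparator. Both comparators are approximations written for this example, not Cassandra's comparator classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class ComparatorSketch {
    // Stand-in for a text comparator: plain String ordering.
    static final Comparator<String> TEXT_ORDER = Comparator.naturalOrder();
    // Stand-in for a long comparator: names are parsed and compared as numbers.
    static final Comparator<String> LONG_ORDER = Comparator.comparingLong(Long::parseLong);

    /** Inserts the names into a row sorted by the given comparator and returns them in row order. */
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        System.out.println("text order: " + sortedNames(TEXT_ORDER, "9", "100", "10"));
        System.out.println("long order: " + sortedNames(LONG_ORDER, "9", "100", "10"));
    }
}
```

Under text ordering "10" and "100" sort before "9"; picking the wrong comparator for numeric or time-based column names gives slices in an order the application did not intend.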

Page 52: Untitled

data
• skinny rows, wide rows (billions of columns)
• denormalize known queries
  – secondary index support in 0.7
• client join others
• 2 caching layers: row, index

Page 53: Untitled

is cassandra a good fit?
• sub-millisecond writes
• you need durability
• you have lots of data
  – > GBs
  – >= three servers
• growing data over time
• your app is evolving
  – startup mode, fluid data structure
• loose domain data
  – “points of interest”
• multi data-center
• your programmers can deal
  – documentation
  – complexity
  – consistency model
  – change
  – visibility tools
• your operations can deal
  – hardware considerations
  – can move data
  – JMX monitoring

Page 54: Untitled

use cases
• jboss.org/infinispan
  – data grid cache
• log data stream
• hotelier
  – points of interest
  – guests
• geospatial
• travel
  – segment analytics

With Hadoop!
• BI w/o ETL
• raptr.com
  – storage & analytics for gaming stats
• imagini
  – visual quizzes for publishers
  – real time for 100s of millions of users

Page 55: Untitled

coming in 0.7
• secondary indexes
• hadoop improvements
• large row support ( > 2GB)
• dynamic routing around slow nodes

Page 56: Untitled

YOU ALREADY HAVE THE RIGHT DATABASE TODAY

FOR THE APPLICATION YOU HAVE TODAY

Page 57: Untitled

what would you do if scale wasn’t a problem?

Page 58: Untitled

@ebenhewitt

cassandra.apache.org

"An invention has to make sense in the world in which it is finished, not the world in which it is started”.

--Ray Kurzweil