Hadoop and Cassandra at Rackspace

Making Massive Manageable:

Hadoop and Cassandra (at Rackspace)

Big Data Workshop

Stu Hood (@stuhood) – Technical Lead, Rackspace

April 23rd 2010

My, what a large dataset you have...

Processing 3 TB/day of logs

Using Hadoop/Pig

And the sticking points?

“How fast can we provision machines?”

“How do we get data on/off the cluster?”

“How do we add structure?”

MapReduce

Distributed processing methodology

Adapt a problem to MapReduce

Scale forever

Crunch almost anything

Typically adding structure to unstructured data

Logs

Also great for structured

Graph processing

Machine learning

“You want to use how many clients?”

Need to store structured inputs/outputs

Solution needs to

Support arbitrary number of clients

Preferably provide locality

Possibly provide 'web' latency

Solutions of varying quality

Sharding the RDBMS

shard n. - A horizontal partition in a databaseExample: Sharding by userid

Provided by ORM?Fixed partitions: manual rebalancing

Developing from scratch?Adding/removing nodes

Handling failover

As a library? As a middle tier?


Leaving data in Hadoop

Storage in Map/SequenceFile

Serialized with Thrift/Avro/ProtoBuffs

No random access

High latency


Storing in HBase/Hypertable

Column stores implemented on Hadoop

Modeled after Google's Bigtable

Multiple points of failure

Namenode

Master

High (almost non-web) latency

And the newest contender...

Standing on the shoulders of: Amazon Dynamo

No node in the cluster is special

No special roles

No scaling bottlenecks

No single point of failure

Techniques

Gossip

Eventual consistency

Standing on the shoulders of: Google Bigtable

“Column family” data model

Range queries for rows:

Scan rows in order

Memtable/SSTable structure

Always writes sequentially to disk

Bloom filters to minimize random reads

Trounces B-Trees for big dataLinear insert performance

Log growth for reads

Enter Cassandra

Hybrid of ancestors

Adopts listed features

And adds:

A sweet logo!

Pluggable partitioning

Multi datacenter supportPluggable locality

awareness

Datamodel improvements

Enter Cassandra

Project status

Open sourced by Facebook in 2008 (no longer active)

Apache License

Graduated to Apache TLP February 2010

Major releases: 0.3 through 0.6 (0.7 in two months)

cassandra.apache.org

Enter Cassandra

The code base

Java, Apache Ant, Git/SVN

5+ committers from 3+ companies

Known deployments at:

Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit

Performance

Like peanut butter with jelly

Apache Cassandra 0.6:

MapReduce input support out of the box

Locality information partially exposed

Hadoop InputFormat

Pig LoadFunc

Hadoop + Cassandra at RAX

Multiple Hadoop clusters deployed

Smaller Cassandra deployments

Preparing for large scale Cassandra deployment

In the pipeline

MapReduce output support

Adding an OutputFormat with locality information

Improving locality for Hadoop inputs

Getting started

http://cassandra.apache.org/

Read "Getting Started"... Roughly:

Start one node

Test/develop app, editing node config as necessary

Launch cluster by starting more nodes with chosen config

Thanks!

Big Data Workshop

Participants!

Questions?

References

Brandon William's perf tests

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

Hadoop/Cassandra Integration

http://issues.apache.org/jira/browse/CASSANDRA-342

Hadoop and Cassandra at Rackspace

Documents