Making Massive Manageable: Hadoop and Cassandra (at Rackspace) Big Data Workshop Stu Hood (@stuhood) – Technical Lead, Rackspace April 23rd 2010

Hadoop and Cassandra at Rackspace

Jan 29, 2018


Stu Hood
Transcript
Page 1: Hadoop and Cassandra at Rackspace

Making Massive Manageable:

Hadoop and Cassandra (at Rackspace)

Big Data Workshop

Stu Hood (@stuhood) – Technical Lead, Rackspace

April 23rd 2010

Page 2: Hadoop and Cassandra at Rackspace

My, what a large dataset you have...

Processing 3 TB/day of logs

Using Hadoop/Pig

And the sticking points?

“How fast can we provision machines?”

“How do we get data on/off the cluster?”

“How do we add structure?”

Page 3: Hadoop and Cassandra at Rackspace

MapReduce

Distributed processing methodology

Adapt a problem to MapReduce

Scale forever

Crunch almost anything

Typically adding structure to unstructured data

Logs

Also great for structured

Graph processing

Machine learning
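To make "adding structure to unstructured data" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop or Pig code; the three-field log format and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical log format (an assumption): "<client-ip> <path> <status>"
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "garbage",
]

def map_phase(line):
    """Emit (status, 1) for one log line; malformed lines emit nothing."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees one line and each reduce call sees one key, both phases can in principle be spread across as many machines as there are lines and keys.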

Page 4: Hadoop and Cassandra at Rackspace

“You want to use how many clients?”

Need to store structured inputs/outputs

Solution needs to

Support arbitrary number of clients

Preferably provide locality

Possibly provide 'web' latency

Page 5: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Sharding the RDBMS

shard n. - A horizontal partition in a database

Example: Sharding by userid

Provided by ORM?

Fixed partitions: manual rebalancing

Developing from scratch?

Adding/removing nodes

Handling failover

As a library? As a middle tier?
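A rough sketch of why fixed partitions lead to manual rebalancing, assuming the simplest possible scheme (userid modulo the shard count). The function names are hypothetical and not taken from any ORM.

```python
def shard_for(userid, num_shards):
    """Fixed partitioning: the shard is the userid modulo the shard count."""
    return userid % num_shards

def moved_fraction(userids, old_shards, new_shards):
    """Fraction of users whose shard changes when the shard count changes."""
    moved = sum(
        1 for u in userids
        if shard_for(u, old_shards) != shard_for(u, new_shards)
    )
    return moved / len(userids)
```

With this scheme, growing from 4 to 5 shards relocates most users, so adding a node means migrating a large share of the data by hand.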

Page 6: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Leaving data in Hadoop

Storage in MapFile/SequenceFile

Serialized with Thrift/Avro/Protocol Buffers

No random access

High latency

Page 7: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Storing in HBase/Hypertable

Column stores implemented on Hadoop

Modeled after Google's Bigtable

Multiple points of failure

Namenode

Master

High (almost non-web) latency

Page 8: Hadoop and Cassandra at Rackspace

And the newest contender...

Page 9: Hadoop and Cassandra at Rackspace

Standing on the shoulders of: Amazon Dynamo

No node in the cluster is special

No special roles

No scaling bottlenecks

No single point of failure

Techniques

Gossip

Eventual consistency
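The two techniques can be illustrated together with a toy simulation: every node starts out knowing only its own heartbeat, swaps state with one random peer per round, and all views eventually agree. This is a sketch under simplifying assumptions, not Dynamo's or Cassandra's actual gossip protocol; the names are hypothetical.

```python
import random

def merge(local, remote):
    """Keep the highest heartbeat version seen for each node."""
    for node, version in remote.items():
        if version > local.get(node, -1):
            local[node] = version

def gossip_round(views, rng):
    """Each node picks one random peer and the pair swap what they know."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merge(views[peer], views[node])
        merge(views[node], views[peer])

def rounds_to_converge(num_nodes, seed=0, max_rounds=100):
    """Count rounds until every node holds the same complete view."""
    rng = random.Random(seed)
    # Initially each node knows only its own heartbeat.
    views = {n: {n: 1} for n in range(num_nodes)}
    full = {n: 1 for n in range(num_nodes)}
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == full for view in views.values()):
            return round_no
    return None
```

No node coordinates the exchange, which is the point of "no node in the cluster is special": views are briefly inconsistent, then converge.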

Page 10: Hadoop and Cassandra at Rackspace

Standing on the shoulders of: Google Bigtable

“Column family” data model

Range queries for rows:

Scan rows in order

Memtable/SSTable structure

Always writes sequentially to disk

Bloom filters to minimize random reads

Trounces B-Trees for big data

Linear insert performance

Logarithmic growth for reads
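A minimal in-memory sketch of the memtable/SSTable idea: writes land in a memtable, full memtables are frozen into sorted immutable tables, and a Bloom filter per table lets reads skip tables that cannot hold the key. The commit log, compaction, and real disk I/O are left out, and the class names and sizes are assumptions for illustration only.

```python
import hashlib

class BloomFilter:
    """May report false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    """Immutable, sorted run of rows plus its Bloom filter."""
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the simulated disk read
        return self.rows.get(key)

class Store:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of a sorted, immutable table.
            self.sstables.append(SSTable(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            value = table.get(key)
            if value is not None:
                return value
        return None
```

Writes never modify an existing table, which is what keeps them sequential; the cost moves to reads, which may have to consult several tables.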

Page 11: Hadoop and Cassandra at Rackspace

Enter Cassandra

Hybrid of ancestors

Adopts listed features

And adds:

A sweet logo!

Pluggable partitioning

Multi datacenter support

Pluggable locality awareness

Datamodel improvements
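One way to picture pluggable partitioning is a token ring whose key-to-token function can be swapped. The sketch below is an illustration under assumed names, not Cassandra's partitioner API: a hash-based partitioner spreads keys evenly, while an order-preserving one keeps keys sorted around the ring so that range scans over rows stay contiguous.

```python
import bisect
import hashlib

def hash_partitioner(key):
    """Token is a hash of the key: even spread, no useful ordering."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

def order_preserving_partitioner(key):
    """Token is the key itself: ordering kept, load may be uneven."""
    return key

class Ring:
    """Each node owns the tokens from the previous node's token
    (exclusive) up to its own token (inclusive), wrapping around."""
    def __init__(self, partitioner, node_tokens):
        self.partitioner = partitioner
        self.entries = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]

    def node_for(self, key):
        token = self.partitioner(key)
        index = bisect.bisect_left(self.tokens, token) % len(self.entries)
        return self.entries[index][1]

# Hypothetical three-node rings, one per partitioner.
ordered_ring = Ring(order_preserving_partitioner, {"A": "h", "B": "p", "C": "z"})
hashed_ring = Ring(
    hash_partitioner,
    {"A": 2**256 // 3, "B": 2 * (2**256 // 3), "C": 2**256 - 1},
)
```

With the ordered ring, adjacent keys land on the same or neighbouring nodes; with the hashed ring, placement is deterministic but unrelated to key order.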

Page 12: Hadoop and Cassandra at Rackspace

Enter Cassandra

Project status

Open sourced by Facebook in 2008 (Facebook no longer active in the project)

Apache License

Graduated to Apache TLP February 2010

Major releases: 0.3 through 0.6 (0.7 in two months)

cassandra.apache.org

Page 13: Hadoop and Cassandra at Rackspace

Enter Cassandra

The code base

Java, Apache Ant, Git/SVN

5+ committers from 3+ companies

Known deployments at:

Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit

Page 14: Hadoop and Cassandra at Rackspace

Performance

Page 15: Hadoop and Cassandra at Rackspace

Like peanut butter with jelly

Apache Cassandra 0.6:

MapReduce input support out of the box

Locality information partially exposed

Hadoop InputFormat

Pig LoadFunc

Page 16: Hadoop and Cassandra at Rackspace

Hadoop + Cassandra at RAX

Multiple Hadoop clusters deployed

Smaller Cassandra deployments

Preparing for large scale Cassandra deployment

Page 17: Hadoop and Cassandra at Rackspace

In the pipeline

MapReduce output support

Adding an OutputFormat with locality information

Improving locality for Hadoop inputs

Page 18: Hadoop and Cassandra at Rackspace

Getting started

http://cassandra.apache.org/

Read "Getting Started"... Roughly:

Start one node

Test/develop app, editing node config as necessary

Launch cluster by starting more nodes with chosen config

Page 19: Hadoop and Cassandra at Rackspace

Thanks!

Big Data Workshop

Participants!

Page 20: Hadoop and Cassandra at Rackspace

Questions?

Page 21: Hadoop and Cassandra at Rackspace

References

Brandon Williams's perf tests

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

Hadoop/Cassandra Integration

http://issues.apache.org/jira/browse/CASSANDRA-342