Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Handling realtime and analytic workloads in a single cluster with Hadoop and Cassandra

Piotr Kołaczkowski

@pkolaczk

Basic Cassandra + Hadoop Integration

[Diagram: a Cassandra cluster of eight C* nodes next to a separate Hadoop cluster (NameNode & JobTracker plus six DataNodes). The two are connected through CFIF (ColumnFamilyInputFormat) and CFOF (ColumnFamilyOutputFormat).]

ColumnFamilyInputFormat

row key   columns (column name: value)
jim       age: 36   car: camaro   gender: M
carol     age: 37   car: subaru
johnny    age: 12   gender: M
suzy      age: 10   gender: F

Key: ByteBuffer (the row key)

Value: SortedMap<ByteBuffer, IColumn>, keyed by column name; each IColumn holds (column name, value, timestamp)






Input Key: jim      Input Value: age: 36, car: camaro, gender: M

Input Key: carol    Input Value: age: 37, car: subaru

Input Key: johnny   Input Value: age: 12, gender: M

Input Key: suzy     Input Value: age: 10, gender: F
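A mapper reading from CFIF sees each row as a key plus a sorted map of its columns, and rows need not share the same columns. The sketch below models that shape with JDK types only: the `IColumn` values are replaced by bare `ByteBuffer`s and all values are stored as UTF-8 text. The class and method names are made up for illustration; this is not the Cassandra API itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

// Stand-in for the (key, value) pair a mapper receives from CFIF.
// Assumption: column values are raw ByteBuffers instead of IColumn objects.
final class CfifRowSketch {

    // Encode a string as UTF-8 bytes.
    static ByteBuffer utf8(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a duplicate so the original buffer's position is left untouched.
    static String text(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Build the columns of one row; TreeMap keeps them sorted by column name.
    static SortedMap<ByteBuffer, ByteBuffer> row(String... nameValuePairs) {
        SortedMap<ByteBuffer, ByteBuffer> columns = new TreeMap<>();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            columns.put(utf8(nameValuePairs[i]), utf8(nameValuePairs[i + 1]));
        }
        return columns;
    }

    // What a map() body might do: look up one column and tolerate its absence.
    static String columnOrDefault(SortedMap<ByteBuffer, ByteBuffer> columns,
                                  String name, String fallback) {
        ByteBuffer value = columns.get(utf8(name));
        return value == null ? fallback : text(value);
    }

    public static void main(String[] args) {
        SortedMap<ByteBuffer, ByteBuffer> carol = row("age", "37", "car", "subaru");
        System.out.println(columnOrDefault(carol, "car", "none"));
        System.out.println(columnOrDefault(carol, "gender", "unknown"));
    }
}
```

The lookup with a fallback matters because rows are sparse: carol has no gender column, so a mapper cannot assume every column is present.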

CFIF – Wide Row Support

Input Key: jim      Input Value: age: 36

Input Key: jim      Input Value: car: camaro

Input Key: jim      Input Value: gender: M

Input Key: carol    Input Value: age: 37

Input Key: carol    Input Value: car: subaru
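In wide row mode the same row key is delivered several times, one column per record, as the jim and carol records above show. A minimal sketch of that flattening follows, using plain JDK collections and string columns as stand-ins. The names are illustrative and are not CFIF internals.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Models wide-row delivery: one (row key, single column) record per column.
final class WideRowSketch {

    // Helper: build a sorted column map from name/value pairs.
    static Map<String, String> columns(String... pairs) {
        Map<String, String> cols = new TreeMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            cols.put(pairs[i], pairs[i + 1]);
        }
        return cols;
    }

    // The first two sample rows from the slides.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", columns("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", columns("age", "37", "car", "subaru"));
        return rows;
    }

    // Turn whole rows into (rowKey, "name: value") records, one per column.
    static List<Map.Entry<String, String>> flatten(Map<String, Map<String, String>> rows) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                records.add(new SimpleEntry<>(row.getKey(), col.getKey() + ": " + col.getValue()));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> record : flatten(sampleRows())) {
            System.out.println("Input Key: " + record.getKey() + "  Input Value: " + record.getValue());
        }
    }
}
```

A consequence for job code is that a mapper can no longer assume it sees a whole row at once; anything that needs several columns of one row has to be assembled later, for example in a reducer keyed by row key.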





CFIF – Cassandra Secondary Index Support

IndexExpression expr = new IndexExpression(
    ByteBufferUtil.bytes("car"),
    IndexOperator.EQ,
    ByteBufferUtil.bytes("subaru"));

ConfigHelper.setInputRange(job.getConfiguration(), Arrays.asList(expr));
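The effect of the index expression is that only rows whose `car` column equals `subaru` reach the mappers. The JDK-only sketch below shows the same selection over the four sample rows from the earlier slides. It is an in-memory illustration with made-up names and says nothing about how Cassandra evaluates the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of an EQ index expression: keep rows where column == value.
final class IndexFilterSketch {

    // Helper: build a column map from name/value pairs.
    static Map<String, String> columns(String... pairs) {
        Map<String, String> cols = new HashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            cols.put(pairs[i], pairs[i + 1]);
        }
        return cols;
    }

    // The four sample rows used throughout the slides.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", columns("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", columns("age", "37", "car", "subaru"));
        rows.put("johnny", columns("age", "12", "gender", "M"));
        rows.put("suzy", columns("age", "10", "gender", "F"));
        return rows;
    }

    // Return the keys of rows whose column `name` equals `value`.
    // Rows that lack the column never match.
    static List<String> selectEq(Map<String, Map<String, String>> rows,
                                 String name, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            if (value.equals(row.getValue().get(name))) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(selectEq(sampleRows(), "car", "subaru"));
    }
}
```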





ColumnFamilyOutputFormat

● Key: ByteBuffer (row key)

● Value: List<Mutation>

– Mutation: insert or delete a column

[Diagram: ColumnFamilyRecordWriter puts mutations on a write queue; a client sends them over Thrift to the Cassandra cluster (eight C* nodes).]

CFOF – Creating Mutations

ByteBuffer rowkey = ByteBufferUtil.bytes("carol");

Column column = new Column();
column.name = ByteBufferUtil.bytes("age");
column.value = ByteBufferUtil.bytes(37);
column.setTimestamp(System.currentTimeMillis());  // an insert needs a timestamp

List<Mutation> mutations = new ArrayList<Mutation>();
Mutation mutation = new Mutation();
mutation.column_or_supercolumn = new ColumnOrSuperColumn();
mutation.column_or_supercolumn.column = column;
mutations.add(mutation);

context.write(rowkey, mutations);
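Note that `ByteBufferUtil.bytes("age")` and `ByteBufferUtil.bytes(37)` produce different kinds of payload: UTF-8 text for the string, and a 4-byte big-endian integer for the int. A reader therefore has to decode each value with the matching type. The JDK-only sketch below reproduces the two encodings to show the difference. The helper names are stand-ins written for this example, not the Cassandra class itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// JDK-only stand-ins for the two encodings used when building the column.
final class ColumnEncodingSketch {

    // Text is stored as its UTF-8 bytes.
    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // An int is stored as 4 bytes, big-endian (ByteBuffer's default order).
    static ByteBuffer bytes(int i) {
        return ByteBuffer.allocate(4).putInt(0, i);
    }

    // Read an int back without changing the buffer's position.
    static int toInt(ByteBuffer b) {
        return b.getInt(b.position());
    }

    // Read text back from a duplicate so the original buffer stays reusable.
    static String toText(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer asInt = bytes(37);
        ByteBuffer asText = bytes("37");
        System.out.println("int encoding:  " + asInt.remaining() + " bytes, value " + toInt(asInt));
        System.out.println("text encoding: " + asText.remaining() + " bytes, value " + toText(asText));
    }
}
```

If a job writes ages as ints, as the slide does, any job or Hive table reading them back should treat the column as an integer type rather than as a string.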

BulkOutputFormat

[Diagram: BulkRecordWriter writes into a memory buffer; the buffer is flushed as SSTable 1, SSTable 2, ... SSTable N into a Hadoop temporary directory.]

DataStax Enterprise: Cassandra and Hadoop in a Single Cluster

Basic Features

● Single, simplified component

● Workload separation

● No SPOF

● Peer to peer

● JobTracker failover

● No additional Cassandra config

System Administrator's View

Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                    148873535527910577765226390751398592512
101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78,96 GB  12,50%  0
101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82,65 GB  12,50%  21267647932558653966460912964485513216
101.202.204.103  Analytics  rack1  Analytics(TT)  Up      Normal  74,96 GB  12,50%  42535295865117307932921825928971026432
101.202.204.104  Analytics  rack1  Analytics(TT)  Up      Normal  78,79 GB  12,50%  63802943797675961899382738893456539648
101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67,42 GB  12,50%  85070591730234615865843651857942052864
101.202.204.106  Cassandra  rack1  Cassandra      Up      Normal  60,86 GB  12,50%  106338239662793269832304564822427566080
101.202.204.107  Cassandra  rack1  Cassandra      Up      Normal  81,27 GB  12,50%  127605887595351923798765477786913079296
101.202.204.108  Cassandra  rack1  Cassandra      Up      Normal  77,17 GB  12,50%  148873535527910577765226390751398592512

Easy monitoring of your nodes, regardless of their workload type
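The tokens in the listing are evenly spaced: with eight nodes, node i holds i × 2^127 / 8, which is why each node owns 12,50% of the ring. The sketch below recomputes that column. It assumes a token space of 2^127, the usual range for RandomPartitioner; the slide does not name the partitioner, so treat that as an assumption.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Recomputes evenly spaced tokens for a ring of N nodes.
final class TokenRingSketch {

    // Assumption: tokens range over [0, 2^127), as with RandomPartitioner.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    // Token of node i (0-based) is i * RING_SIZE / nodeCount.
    static List<BigInteger> evenTokens(int nodeCount) {
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            tokens.add(RING_SIZE.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(nodeCount)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<BigInteger> tokens = evenTokens(8);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println("101.202.204." + (101 + i) + "  " + tokens.get(i));
        }
    }
}
```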

Wait, but where are my files?

[Diagram: two stacks side by side. Stock Hadoop: Hadoop M/R on top of HDFS. DSE: Hadoop M/R on top of CFS, which sits on the Cassandra server.]

Cassandra File System Properties

● Decentralized

● Replicated

● HDFS compatible

– compatible with Hadoop filesystem utilities

– allows for running M/R programs on DSE without any change

● Compressed

CFS Architecture

CFS Compaction

● Keeps track of deleted rows (blocks)

● When all blocks in an SSTable have been removed, deletes the whole SSTable

[Diagram: Cassandra storage shown as SSTables labelled ts 1 to ts 4 and holding blocks 1 to 10. Block 6 appears in more than one SSTable, and one SSTable is marked with an X.]
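The compaction rule above can be modeled as simple bookkeeping: each SSTable remembers which of its blocks are still live, and once the last one is deleted the whole SSTable can be dropped without rewriting anything. The sketch below is an in-memory illustration only. The class and method names are hypothetical and do not reflect the actual CFS code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Tracks live blocks per SSTable and drops an SSTable once it has none left.
final class CfsCompactionSketch {

    // SSTable name -> ids of the blocks in it that have not been deleted yet.
    private final Map<String, Set<Integer>> liveBlocks = new HashMap<>();

    void addSSTable(String name, int... blockIds) {
        Set<Integer> blocks = new HashSet<>();
        for (int id : blockIds) {
            blocks.add(id);
        }
        liveBlocks.put(name, blocks);
    }

    // Mark a block as deleted in every SSTable that holds it.
    // Returns the names of SSTables that became empty and were dropped.
    Set<String> deleteBlock(int blockId) {
        Set<String> dropped = new HashSet<>();
        Iterator<Map.Entry<String, Set<Integer>>> it = liveBlocks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<Integer>> entry = it.next();
            boolean removed = entry.getValue().remove(blockId);
            if (removed && entry.getValue().isEmpty()) {
                dropped.add(entry.getKey());
                it.remove();
            }
        }
        return dropped;
    }

    int sstableCount() {
        return liveBlocks.size();
    }

    public static void main(String[] args) {
        CfsCompactionSketch cfs = new CfsCompactionSketch();
        cfs.addSSTable("sstable-1", 1, 2, 3);
        cfs.addSSTable("sstable-2", 4, 5, 6);
        cfs.deleteBlock(1);
        cfs.deleteBlock(2);
        System.out.println("dropped: " + cfs.deleteBlock(3));
        System.out.println("remaining SSTables: " + cfs.sstableCount());
    }
}
```

The design point is that file blocks are large and immutable, so dropping a fully deleted SSTable is much cheaper than merging its contents into a new one.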

Hive Integration

● CassandraHiveMetaStore

– stores Hive database metadata in Cassandra

– no need to run a separate RDBMS

● CassandraStorageHandler

– allows for direct access to C* tables with CFIF and CFOF

Hive Integration – Example

CREATE EXTERNAL TABLE MyHiveTable(row_key string, col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");

SELECT count(*) FROM MyHiveTable;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
2013-06-04 15:11:59,691 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
...
2013-06-04 15:12:28,288 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:29,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:30,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:31,339 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
MapReduce Total cumulative CPU time: 31 seconds 910 msec
Ended Job = job_201306041030_0001
MapReduce Jobs Launched:
Job 0: Map: 9  Reduce: 1  Cumulative CPU: 31.91 sec  HDFS Read: 0  HDFS Write: 0  SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 910 msec
OK
1000000
Time taken: 46.246 seconds

Custom Column Mapping

CREATE EXTERNAL TABLE Users(
  userid string, name string, email string, phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");

Cassandra:  row key   user_name   primary_email   home_phone

Hive:       userid    name        email           phone
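The mapping is positional: the i-th entry of `cassandra.columns.mapping` is paired with the i-th Hive column, and `:key` stands for the Cassandra row key. A small sketch of that pairing in plain Java follows. The parsing code is written for illustration and is not the storage handler's implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Pairs Hive column names with Cassandra column names by position.
final class ColumnMappingSketch {

    // hiveColumns: the columns declared in CREATE EXTERNAL TABLE, in order.
    // mapping: the comma-separated value of "cassandra.columns.mapping".
    static Map<String, String> pair(List<String> hiveColumns, String mapping) {
        String[] cassandraColumns = mapping.split(",");
        if (cassandraColumns.length != hiveColumns.size()) {
            throw new IllegalArgumentException(
                "mapping has " + cassandraColumns.length + " entries but the table has "
                + hiveColumns.size() + " columns");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < cassandraColumns.length; i++) {
            result.put(hiveColumns.get(i), cassandraColumns[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> m = pair(
            Arrays.asList("userid", "name", "email", "phone"),
            ":key,user_name,primary_email,home_phone");
        for (Map.Entry<String, String> e : m.entrySet()) {
            System.out.println("Hive " + e.getKey() + " <- Cassandra " + e.getValue());
        }
    }
}
```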
