Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Dec 04, 2014

Page 1: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Handling realtime and analytic workloads in a single cluster with Hadoop and Cassandra

Piotr Kołaczkowski

[email protected] @pkolaczk


Page 2: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Basic Cassandra + Hadoop Integration

[Diagram: a Cassandra cluster (eight C* nodes) next to a separate Hadoop cluster (NameNode & JobTracker plus six DataNodes). Data flows between the two clusters through CFIF (ColumnFamilyInputFormat) and CFOF (ColumnFamilyOutputFormat).]

Page 3: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyInputFormat

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Key: ByteBuffer (row key)

Value: SortedMap<ByteBuffer, IColumn>, keyed by column name

Each IColumn: (column name, value, timestamp)
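To make the key/value shape concrete, here is a minimal sketch in plain Java that models one CFIF record without depending on any Cassandra classes. `SimpleColumn` is a hypothetical stand-in for `IColumn`, the timestamp is a made-up value, and the map is ordered by `ByteBuffer`'s natural byte order (an assumption; the slide does not say which comparator is used). Only the `ByteBuffer` key and the sorted map of columns come from the slide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

final class CfifRecordSketch {
    // Hypothetical stand-in for IColumn: (column name, value, timestamp).
    static final class SimpleColumn {
        final ByteBuffer name;
        final ByteBuffer value;
        final long timestamp;

        SimpleColumn(ByteBuffer name, ByteBuffer value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes a duplicate so the original buffer's position is left untouched.
    static String string(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Builds the value side of a record: columns sorted by column name.
    static SortedMap<ByteBuffer, SimpleColumn> row(long timestamp, String... namesAndValues) {
        SortedMap<ByteBuffer, SimpleColumn> columns = new TreeMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            ByteBuffer name = bytes(namesAndValues[i]);
            columns.put(name, new SimpleColumn(name, bytes(namesAndValues[i + 1]), timestamp));
        }
        return columns;
    }

    // Renders a record the way the slides draw a row: key, then "name: value" pairs.
    static String describe(ByteBuffer key, SortedMap<ByteBuffer, SimpleColumn> columns) {
        StringBuilder sb = new StringBuilder(string(key));
        for (SimpleColumn c : columns.values()) {
            sb.append(' ').append(string(c.name)).append(": ").append(string(c.value));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Columns are inserted out of order; the sorted map returns them by name.
        System.out.println(describe(bytes("jim"), row(1L, "gender", "M", "age", "36", "car", "camaro")));
    }
}
```

A mapper reading from CFIF would receive one such (key, value) pair per row, which is what the next slides step through.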

Page 4: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyInputFormat

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Input Key: jim

Input Value: age: 36 car: camaro gender: M

Page 5: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyInputFormat

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Input Key: carol

Input Value: age: 37 car: subaru

Page 6: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyInputFormat

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Input Key: johnny

Input Value: age: 12 gender: M

Page 7: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyInputFormat

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Input Key: suzy

Input Value: age: 10 gender: F

Page 8: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Wide Row Support

Input Key: jim

Input Value: age: 36

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Page 9: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Wide Row Support

Input Key: jim

Input Value: car: camaro

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Page 10: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Wide Row Support

Input Key: jim

Input Value: gender: M

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Page 11: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Wide Row Support

Input Key: carol

Input Value: age: 37

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Page 12: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Wide Row Support

Input Key: carol

Input Value: car: subaru

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F
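The wide-row slides above show each column arriving as its own record, with the row key repeated. The following plain-Java sketch reproduces that record sequence for the sample data. It is only an illustration of the iteration order, not the CFIF implementation; `flatten` and `sampleRows` are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WideRowSketch {
    // Emits one (row key, "name: value") record per column instead of one record per row.
    static List<String[]> flatten(Map<String, Map<String, String>> rows) {
        List<String[]> records = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> column : row.getValue().entrySet()) {
                records.add(new String[] { row.getKey(), column.getKey() + ": " + column.getValue() });
            }
        }
        return records;
    }

    // Builds one row; the TreeMap keeps columns sorted by name.
    static Map<String, String> row(String... namesAndValues) {
        Map<String, String> columns = new TreeMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            columns.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return columns;
    }

    // The four sample rows from the slides, in slide order.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", row("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", row("age", "37", "car", "subaru"));
        rows.put("johnny", row("age", "12", "gender", "M"));
        rows.put("suzy", row("age", "10", "gender", "F"));
        return rows;
    }

    public static void main(String[] args) {
        for (String[] record : flatten(sampleRows())) {
            System.out.println("Input Key: " + record[0] + "  Input Value: " + record[1]);
        }
    }
}
```

The point of this mode is that a mapper never has to hold a whole row in memory, which matters when a row has a very large number of columns.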

Page 13: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFIF – Cassandra Secondary Index Support

IndexExpression expr = new IndexExpression(
    ByteBufferUtil.bytes("car"),
    IndexOperator.EQ,
    ByteBufferUtil.bytes("subaru"));

ConfigHelper.setInputRange(job.getConfiguration(), Arrays.asList(expr));

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F
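To show what the `car EQ subaru` expression selects from the sample rows, here is a small plain-Java simulation. It filters an in-memory map on the client side purely for illustration; with a real secondary index the selection is done by Cassandra, and `rowsWhere` and `sampleRows` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexFilterSketch {
    // Returns the keys of rows whose given column equals the expected value.
    static List<String> rowsWhere(Map<String, Map<String, String>> rows, String column, String expected) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            // Rows that lack the column (johnny, suzy) simply do not match.
            if (expected.equals(row.getValue().get(column))) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    static Map<String, String> row(String... namesAndValues) {
        Map<String, String> columns = new TreeMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            columns.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return columns;
    }

    // The four sample rows from the slides.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", row("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", row("age", "37", "car", "subaru"));
        rows.put("johnny", row("age", "12", "gender", "M"));
        rows.put("suzy", row("age", "10", "gender", "F"));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(rowsWhere(sampleRows(), "car", "subaru"));
    }
}
```

With this filter in place, the job's mappers would only see the row for carol.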

Page 14: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

ColumnFamilyOutputFormat

● Key: ByteBuffer (row key)

● Value: List<Mutation>

– Mutation: insert or delete a column

[Diagram: ColumnFamilyRecordWriter places mutations on a write queue; a client takes them from the queue and sends them over Thrift to the Cassandra cluster (eight C* nodes).]

Page 15: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFOF – Creating Mutations

ByteBuffer rowkey = ByteBufferUtil.bytes("carol");

Column column = new Column();
column.name = ByteBufferUtil.bytes("age");
column.value = ByteBufferUtil.bytes(37);
// Cassandra rejects an inserted column that has no timestamp
column.setTimestamp(System.currentTimeMillis());

List<Mutation> mutations = new ArrayList<Mutation>();
Mutation mutation = new Mutation();
mutation.column_or_supercolumn = new ColumnOrSuperColumn();
mutation.column_or_supercolumn.column = column;
mutations.add(mutation);

context.write(rowkey, mutations);

Page 16: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

BulkOutputFormat

[Diagram: BulkRecordWriter writes records into a memory buffer; the buffer is flushed as SSTable 1, SSTable 2, ... SSTable N into the Hadoop temporary directory.]
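The write-then-flush pattern in the diagram can be sketched in plain Java as follows. This is a toy model under stated assumptions, not BulkRecordWriter itself: the class name, the row-count threshold, and the tab-separated text files standing in for SSTables are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BulkWriterSketch {
    private final Path dir;
    private final int maxBufferedRows; // hypothetical flush threshold
    private final TreeMap<String, String> buffer = new TreeMap<>(); // memory buffer, sorted by key
    private final List<Path> flushed = new ArrayList<>();

    BulkWriterSketch(Path dir, int maxBufferedRows) {
        this.dir = dir;
        this.maxBufferedRows = maxBufferedRows;
    }

    // Buffers a row in memory and flushes once the buffer is full.
    void write(String key, String value) throws IOException {
        buffer.put(key, value);
        if (buffer.size() >= maxBufferedRows) {
            flush();
        }
    }

    // Flushes whatever is still buffered.
    void close() throws IOException {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    // Writes the buffered rows, in key order, to the next numbered file.
    private void flush() throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : buffer.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Path file = dir.resolve("sstable-" + (flushed.size() + 1) + ".txt");
        Files.write(file, lines, StandardCharsets.UTF_8);
        flushed.add(file);
        buffer.clear();
    }

    List<Path> flushedFiles() {
        return flushed;
    }

    // Writes the four sample row keys with a threshold of three rows per file.
    static int demo() {
        try {
            BulkWriterSketch writer = new BulkWriterSketch(Files.createTempDirectory("bulk-sketch"), 3);
            for (String key : new String[] { "jim", "carol", "johnny", "suzy" }) {
                writer.write(key, "placeholder");
            }
            writer.close();
            return writer.flushedFiles().size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files written: " + demo());
    }
}
```

Buffering and sorting in memory before each flush is what lets every output file be written sequentially in key order.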

Page 17: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

DataStax Enterprise: Cassandra and Hadoop in a Single Cluster

Page 18: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Basic Features

● Single, simplified component

● Workload separation

● No SPOF

● Peer to peer

● JobTracker failover

● No additional Cassandra config

Page 19: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

System Administrator's View

Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                    148873535527910577765226390751398592512
101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78,96 GB  12,50%  0
101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82,65 GB  12,50%  21267647932558653966460912964485513216
101.202.204.103  Analytics  rack1  Analytics(TT)  Up      Normal  74,96 GB  12,50%  42535295865117307932921825928971026432
101.202.204.104  Analytics  rack1  Analytics(TT)  Up      Normal  78,79 GB  12,50%  63802943797675961899382738893456539648
101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67,42 GB  12,50%  85070591730234615865843651857942052864
101.202.204.106  Cassandra  rack1  Cassandra      Up      Normal  60,86 GB  12,50%  106338239662793269832304564822427566080
101.202.204.107  Cassandra  rack1  Cassandra      Up      Normal  81,27 GB  12,50%  127605887595351923798765477786913079296
101.202.204.108  Cassandra  rack1  Cassandra      Up      Normal  77,17 GB  12,50%  148873535527910577765226390751398592512

Easy monitoring of your nodes, regardless of their workload type
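The tokens in the listing are evenly spaced: each one is a multiple of 2^127 / 8, which is why every node owns 12,50% of the ring. The sketch below recomputes them with `BigInteger`. It assumes a token range of 0 to 2^127, as used by Cassandra's RandomPartitioner; the listing itself does not name the partitioner, and `token` is a hypothetical helper.

```java
import java.math.BigInteger;

final class TokenRingSketch {
    // Assumed size of the token range: 2^127.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    // Token of the node at position nodeIndex (0-based) in an evenly balanced ring.
    static BigInteger token(int nodeIndex, int nodeCount) {
        return RING_SIZE.multiply(BigInteger.valueOf(nodeIndex)).divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {
            System.out.println("node " + (i + 1) + ": " + token(i, 8));
        }
    }
}
```

For eight nodes this yields 0 for the first node and 148873535527910577765226390751398592512 for the last, matching the Token column above.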

Page 20: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Wait, but where are my files?

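[Diagram: two stacks side by side. In stock Hadoop, Hadoop M/R sits on top of HDFS. In DSE, Hadoop M/R sits on top of CFS, which sits on top of the Cassandra server.]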

Page 21: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Cassandra File System Properties

● Decentralized

● Replicated

● HDFS compatible

– compatible with Hadoop filesystem utilities

– allows for running M/R programs on DSE without any change

● Compressed

Page 22: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFS Architecture

Page 23: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

CFS Compaction

● Keeps track of deleted rows (blocks)

● When all blocks in SSTable removed, deletes the whole SSTable

[Diagram: Cassandra storage holding SSTables of CFS blocks (block 1 to block 10), labelled ts 1 to ts 4; one entry is marked X.]
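The two bullets above describe a simple bookkeeping rule, which the following plain-Java sketch illustrates. It is a toy model, not the CFS compaction code: the class, its method names, and the SSTable names and block numbers used in `main` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CfsCompactionSketch {
    // For each SSTable, the blocks it holds that have not been deleted yet.
    private final Map<String, Set<Integer>> liveBlocks = new LinkedHashMap<>();

    void addSSTable(String name, Integer... blocks) {
        liveBlocks.put(name, new HashSet<>(Arrays.asList(blocks)));
    }

    // Records a block deletion and returns the SSTables dropped as a result.
    List<String> deleteBlock(int block) {
        List<String> dropped = new ArrayList<>();
        Iterator<Map.Entry<String, Set<Integer>>> it = liveBlocks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<Integer>> sstable = it.next();
            boolean removed = sstable.getValue().remove(Integer.valueOf(block));
            // Once every block in an SSTable is deleted, drop the whole SSTable.
            if (removed && sstable.getValue().isEmpty()) {
                it.remove();
                dropped.add(sstable.getKey());
            }
        }
        return dropped;
    }

    Set<String> sstables() {
        return liveBlocks.keySet();
    }

    public static void main(String[] args) {
        CfsCompactionSketch storage = new CfsCompactionSketch();
        storage.addSSTable("sstable-a", 1, 2, 3);
        storage.addSSTable("sstable-b", 4, 5, 6);
        System.out.println(storage.deleteBlock(1)); // sstable-a still has live blocks
        System.out.println(storage.deleteBlock(2));
        System.out.println(storage.deleteBlock(3)); // last live block gone, sstable-a is dropped
        System.out.println(storage.sstables());
    }
}
```

Dropping a fully deleted SSTable as a unit avoids rewriting data just to discard it.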

Page 24: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Hive Integration

● CassandraHiveMetaStore

– stores Hive database metadata in Cassandra

– no need to run a separate RDBMS

● CassandraStorageHandler

– allows for direct access to C* tables with CFIF and CFOF

Page 25: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Hive Integration – Example

CREATE EXTERNAL TABLE MyHiveTable(row_key string, col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");

SELECT count(*) FROM MyHiveTable;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
2013-06-04 15:11:59,691 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
...
2013-06-04 15:12:28,288 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:29,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:30,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:31,339 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
MapReduce Total cumulative CPU time: 31 seconds 910 msec
Ended Job = job_201306041030_0001
MapReduce Jobs Launched:
Job 0: Map: 9 Reduce: 1 Cumulative CPU: 31.91 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 910 msec
OK
1000000
Time taken: 46.246 seconds

Page 26: Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

Custom Column Mapping

CREATE EXTERNAL TABLE Users(
  userid string, name string, email string, phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");

Cassandra:  row key  user_name  primary_email  home_phone

Hive:       userid   name       email          phone
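Judging from the table above, the mapping string is positional: the n-th entry of `cassandra.columns.mapping` names the Cassandra column behind the n-th Hive column, with `:key` standing for the row key. The plain-Java sketch below pairs the two lists under that assumption. It does not use the storage handler's own parsing code, and `hiveToCassandra` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnMappingSketch {
    // Pairs each Hive column with the Cassandra column at the same position in the mapping.
    static Map<String, String> hiveToCassandra(List<String> hiveColumns, String mapping) {
        String[] cassandraColumns = mapping.split(",");
        if (cassandraColumns.length != hiveColumns.size()) {
            throw new IllegalArgumentException("mapping must have one entry per Hive column");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < cassandraColumns.length; i++) {
            result.put(hiveColumns.get(i), cassandraColumns[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = hiveToCassandra(
            Arrays.asList("userid", "name", "email", "phone"),
            ":key,user_name,primary_email,home_phone");
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            System.out.println("Hive " + e.getKey() + " <- Cassandra " + e.getValue());
        }
    }
}
```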