Mar 06, 2018

#strataconf + #hw2013

Real-Time Analytics with Cassandra and Hadoop

Patricia Gorla

Download code: bit.ly/1aB8Jy8 (12KB)

About Me
• Solr
• Cassandra
• Datastax MVP

Outline

• Introduction to Cassandra + 2 labs
  15m break ~ 14:30

• Analytics + 1 lab
  15m break ~ 16:30

• Extra Credit

Introduction

Getting Started
Architecture
Data Modeling

History
• Powered inbox search at Facebook
• Open-sourced in 2008

Why Cassandra?
• Linear scalability
• Availability
• Set it and forget it

Many companies use Cassandra.

...

What is Cassandra?
• Dynamo distributed cluster (no vector clocks)
• Bigtable data model
• No SPOF
• Tunably consistent
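"Tunably consistent" is commonly summarized by the overlap rule R + W > N: if the number of replicas read plus the number of replicas written exceeds the replication factor, every read set intersects every write set. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and nothing here talks to a real cluster.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority quorum."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when any read set must overlap any write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print("QUORUM/QUORUM overlaps:", read_sees_latest_write(q, q, rf))
print("ONE/ONE overlaps:", read_sees_latest_write(1, 1, rf))
```

With a replication factor of 3, quorum reads and quorum writes overlap, while reading and writing at a single replica does not, which is the trade-off between consistency and latency that the bullet refers to.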

Architecture

Cluster

Keyspace

Keyspace
• Column Family 1
• Column Family 2

row1: {col1:val1,time,TTL; … }
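The nesting on these slides (keyspace, column family, row, column with a value, a timestamp and an optional TTL) can be pictured as nested maps. This is only a mental model sketched in Python; the `Column` class and `put` helper are hypothetical names, not part of any Cassandra driver.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Column:
    value: str
    timestamp: float            # write time, used to pick the newest value
    ttl: Optional[int] = None   # seconds until expiry; None means no expiry


# column family -> row key -> column name -> Column
Keyspace = Dict[str, Dict[str, Dict[str, Column]]]


def put(keyspace: Keyspace, cf: str, row_key: str, name: str,
        value: str, ttl: Optional[int] = None) -> None:
    """Store one column, creating the column family and row on demand."""
    row = keyspace.setdefault(cf, {}).setdefault(row_key, {})
    row[name] = Column(value, time.time(), ttl)


keyspace: Keyspace = {}
put(keyspace, "Column Family 1", "row1", "col1", "val1")
put(keyspace, "Column Family 1", "row1", "col2", "val2", ttl=3600)
print(keyspace["Column Family 1"]["row1"])
```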

Lab
introduction/1-getting-started.md

Getting Started
Architecture
Data Modeling

Writes
Commit Log -> Memtable -> SSTables

Source: datastax.com
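A toy, single-node model of the write path named above (commit log, then memtable, then SSTables) may help fix the order of operations. It is a sketch under simplifying assumptions: the `Node` class, its `memtable_limit` parameter and the in-memory lists are hypothetical, and real SSTables live on disk and are later merged by compaction.

```python
class Node:
    """Toy write path: commit log -> memtable -> SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []        # append-only log, replayed after a crash
        self.memtable = {}          # in-memory, mutable
        self.sstables = []          # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. append for durability
        self.memtable[key] = value               # 2. update in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                         # 3. spill to an SSTable

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()     # flushed writes no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):    # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)      # reaches the limit and triggers a flush
node.write("a", 3)      # newer value stays in the memtable
print(node.read("a"), node.read("b"))
```

Writes never modify an SSTable in place, which is why the newer value of "a" shadows the flushed one at read time.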

Incoming write to cluster.

Data replicated to replicas.

Data partitioning by token ranges.

Data partitioning by virtual nodes.
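The two partitioning slides can be sketched with one small ring model: each node owns one token (classic token ranges) or many tokens (virtual nodes), and a key is stored on the first node clockwise from its token plus the next distinct nodes. This is an illustration only; the helper names are hypothetical, MD5 is used purely as a convenient hash, and real token assignment and replica placement depend on the configured partitioner and replication strategy.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string onto the ring (MD5 here, purely for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def build_ring(nodes, vnodes_per_node=1):
    """Sorted list of (token, node); each node owns vnodes_per_node tokens."""
    return sorted((token(f"{node}-{i}"), node)
                  for node in nodes for i in range(vnodes_per_node))


def replicas_for(key, ring, replication_factor):
    """Walk clockwise from the key's token, collecting distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners


nodes = ["node-a", "node-b", "node-c", "node-d"]
single_token_ring = build_ring(nodes)                  # one range per node
vnode_ring = build_ring(nodes, vnodes_per_node=8)      # many small ranges per node
print(replicas_for("ylaurent", single_token_ring, replication_factor=3))
print(replicas_for("ylaurent", vnode_ring, replication_factor=3))
```

With virtual nodes each machine holds many small ranges scattered around the ring, so adding or removing a machine moves a little data from many peers instead of a large range from one neighbor.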

Reads

Source: fusionio.com

High-level overview of reads.

Source: datastax.com

Reading from cluster.

Fault tolerance

Reading from cluster.

Deletes
• Distributed deletes are tricky
• Tombstones may not be propagated
• Don’t rely on a delete-heavy system
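Why deletes are tricky can be shown with a small last-write-wins model: a delete is written as a tombstone, and if a replica misses that tombstone before it is purged, the old value can reappear. This is a simplified sketch; the dictionaries standing in for replicas and the helper names are hypothetical, and real clusters guard against this with repair and a grace period before tombstones are dropped.

```python
TOMBSTONE = object()    # marker written in place of a deleted value


def write(replica, key, value, ts):
    """Last-write-wins: keep the cell with the highest timestamp."""
    current = replica.get(key)
    if current is None or ts > current[1]:
        replica[key] = (value, ts)


def delete(replica, key, ts):
    write(replica, key, TOMBSTONE, ts)    # a delete is just another write


def read(replicas, key):
    """Merge the copies from all replicas and return the newest live value."""
    cells = [r[key] for r in replicas if key in r]
    if not cells:
        return None
    value, _ = max(cells, key=lambda cell: cell[1])
    return None if value is TOMBSTONE else value


def purge_tombstones(replica):
    """Drop tombstones, as compaction eventually does."""
    for key in [k for k, (v, _) in replica.items() if v is TOMBSTONE]:
        del replica[key]


a, b = {}, {}
write(a, "order", "croissant", ts=1)
write(b, "order", "croissant", ts=1)
delete(a, "order", ts=2)                 # replica b was down and missed the delete
before_purge = read([a, b], "order")     # tombstone on a wins
purge_tombstones(a)                      # tombstone dropped before b learned of it
after_purge = read([a, b], "order")      # the stale copy on b is all that is left
print(before_purge, after_purge)
```

The deleted value comes back after the purge, which is the failure mode behind "tombstones may not be propagated".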

Getting Started
Architecture
Data Modeling

Protocols

Thrift
• Thrift, CQL
• Synchronous

Binary
• CQL
• Asynchronous

Why CQL?
• Familiar syntax
• Flexible data model over Cassandra

CQL: Creating a Keyspace

CREATE KEYSPACE "Patisserie"
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE "Patisserie";

CQL: Creating a Column Family

CREATE TABLE "customers" (
  customer text,
  age int,
  PRIMARY KEY (customer)
);

CQL Schema

customer        age
Yves Laurent    77
Coco Chanel     130
Pierre Cardin

CQL: Creating a Column Family

CREATE TABLE "customers" (
  customer text,
  age int,
  PRIMARY KEY (customer)
);

Physical Representation

"Yves Laurent": {"age": 77}
"Coco Chanel": {"age": 130}
"Pierre Cardin": {}

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

customer    day    item
ylaurent    M      rivoli
ylaurent    T      mille feuille
cchanel     M      pain au chocolat
pcardin     W      mille feuille
pcardin     F      croissant

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

"ylaurent": { "M:item": "rivoli", "T:item": "mille feuille" }
"cchanel": { "M:item": "pain au chocolat" }
"pcardin": { "W:item": "mille feuille", "F:item": "croissant" }

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (
  day text,
  customer text,
  hour timestamp,
  item text,
  PRIMARY KEY ((day, customer), hour)
);

day    customer    hour    item
M      cchanel     13      rivoli
M      cchanel     15      mille feuille
M      ylaurent    4       rivoli
T      cchanel     17      mille feuille
W      pcardin     20      croissant

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (
  day text,
  customer text,
  hour timestamp,
  item text,
  PRIMARY KEY ((day, customer), hour)
);

"M:cchanel": { "13:item": "rivoli", "15:item": "mille feuille" }
"M:ylaurent": { "4:item": "rivoli" }
"T:cchanel": { "17:item": "mille feuille" }
"W:pcardin": { "20:item": "croissant" }
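The mapping from CQL rows to the physical rows shown on these slides can be reproduced mechanically: the partition key columns form the row key, and the clustering column values prefix each remaining column name. The function below is a sketch of that bookkeeping only; `physical_layout` is a hypothetical helper, the ":" separator mirrors the slide notation rather than the on-disk encoding, and columns are kept in insertion order rather than sorted as the storage engine would.

```python
def physical_layout(rows, partition_key, clustering_key):
    """Group CQL rows into wide rows keyed by the partition key."""
    key_columns = set(partition_key) | set(clustering_key)
    storage = {}
    for row in rows:
        row_key = ":".join(str(row[c]) for c in partition_key)
        prefix = ":".join(str(row[c]) for c in clustering_key)
        wide_row = storage.setdefault(row_key, {})
        for name, value in row.items():
            if name not in key_columns:
                wide_row[f"{prefix}:{name}"] = value
    return storage


sales = [
    {"day": "M", "customer": "cchanel", "hour": 13, "item": "rivoli"},
    {"day": "M", "customer": "cchanel", "hour": 15, "item": "mille feuille"},
    {"day": "M", "customer": "ylaurent", "hour": 4, "item": "rivoli"},
]

# PRIMARY KEY ((day, customer), hour): composite partition key
by_day_and_customer = physical_layout(sales, ["day", "customer"], ["hour"])

# PRIMARY KEY (customer, day): single partition key, day as clustering column
purchases = [
    {"customer": "ylaurent", "day": "M", "item": "rivoli"},
    {"customer": "ylaurent", "day": "T", "item": "mille feuille"},
]
by_customer = physical_layout(purchases, ["customer"], ["day"])

print(by_day_and_customer)
print(by_customer)
```

Choosing which columns go in the inner parentheses therefore decides which CQL rows share one physical row, and with it which rows live on the same replicas.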

CQL: Collections

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item list<text>,
  PRIMARY KEY (customer, day)
);

customer    day    item
ylaurent    M      ['rivoli', 'rivoli', 'javanais']
cchanel     M      ['pain au chocolat']
pcardin     W      ['mille feuille', 'croissant']
pcardin     F      ['croissant']

Data Modeling Lab
introduction/2-data-modeling.md

Analytics

Cassandra and Analytics

Adapting the Data Model
MapReduce Paradigms

An Unlikely Union

• Batch processing analytics and real-time data store

• MapReduce, Hive, Pig, Sqoop, Mahout

Why Cassandra and Hadoop?

• Unified workload
• Availability
• Simpler deployment

Datastax Enterprise

Data Locality

Datastax Enterprise

Task Trackers

Job Tracker

CFS

MapReduce

Reads and writes are passed through the CassandraFS layer.

Starting Analytics Node

$ bin/dse cassandra -t -j

# Starts task tracker and job tracker on node

Hello, Wordcount

$ bin/dse hadoop fs -put wikipedia /

$ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
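The contents of wordcount.jar are not shown on the slide. As a rough illustration of what a word-count job computes, the map, shuffle and reduce phases can be sketched in plain Python; every function name here is hypothetical, and the tokenization rule is an assumption rather than the jar's behavior.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word in a line of input."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def wordcount(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = wordcount(["the cat", "The hat"])
print(counts)
```

In the real job the map and reduce calls run in parallel on the task trackers, and input and output go through the file system layer rather than Python lists.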

Cassandra and Hadoop
Adapting the Data Model
MapReduce Paradigms

Hive

• SQL-like MapReduce abstraction
• Data types
• Efficient JOINs, GROUP BY

Cassandra and Hive

• Hive still has to have separate tables.
• DSE stores them in a separate keyspace.
• 1:1 mapping to Cassandra CFs
• Schemas must match or columns will be inaccessible.

CFS

MapReduce

Hive Metastore is persisted in the Cassandra layer.

Hive

Hive: Creating a DB

hive> CREATE EXTERNAL TABLE customers (
        id string,
        name string,
        age int)
      STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
      TBLPROPERTIES (
        "cassandra.ks.name" = "Oberweis",
        "cassandra.ks.repfactor" = "2",
        "cassandra.ks.strategy" = "o.a.c.l.SimpleStrategy");

Hive: Multiple Data Centers

hive> CREATE EXTERNAL TABLE customers (
        id string,
        name string,
        age int)
      STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
      TBLPROPERTIES (
        "cassandra.ks.name" = "Oberweis",
        "cassandra.ks.stratOptions" = "DC1:3, DC2:1",
        "cassandra.ks.strategy" = "o.a.c.l.NTStrategy");

Hive
• What about composite columns?
• They must be retrieved as binary data and then deserialized with a UDF.
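To make the "binary data" point concrete, here is a sketch of splitting a composite column name into its components. It assumes the bytes follow Cassandra's CompositeType layout, in which each component is a 2-byte big-endian length, the value, and a 1-byte end-of-component marker. A real Hive UDF would be written in Java; the Python helpers below are hypothetical and only illustrate the byte layout.

```python
import struct


def encode_composite(components):
    """Pack byte strings as <2-byte length><value><end-of-component byte>."""
    out = b""
    for value in components:
        out += struct.pack(">H", len(value)) + value + b"\x00"
    return out


def decode_composite(raw):
    """Split a composite column name back into its components."""
    components, offset = [], 0
    while offset < len(raw):
        (length,) = struct.unpack_from(">H", raw, offset)
        offset += 2
        components.append(raw[offset:offset + length])
        offset += length + 1          # skip the end-of-component byte
    return components


raw_name = encode_composite([b"M", b"item"])
print(raw_name)
print(decode_composite(raw_name))
```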

Hive: Lab
• For each person, calculate how many pastries (and of what kind) they purchased.

Hive: Multiple Data Centers

hive> SELECT b.name, a.item, sum(a.amount)
      FROM Oberweis.daily_purchases a
      JOIN Oberweis.person b ON (a.person = b.id)
      GROUP BY b.name, a.item;
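For readers less used to SQL, the same join and aggregation can be sketched over in-memory rows. The sample rows are made up for illustration and are not the lab's data; the column names follow the query, and `pastries_per_person` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up sample rows, shaped like the two tables in the query.
person = [
    {"id": 1, "name": "Coco Chanel"},
    {"id": 2, "name": "Pierre Cardin"},
]
daily_purchases = [
    {"person": 1, "item": "rivoli", "amount": 2},
    {"person": 1, "item": "rivoli", "amount": 1},
    {"person": 2, "item": "croissant", "amount": 4},
    {"person": 9, "item": "croissant", "amount": 1},   # no matching person
]


def pastries_per_person(purchases, people):
    names = {p["id"]: p["name"] for p in people}     # lookup side of the JOIN
    totals = defaultdict(int)
    for row in purchases:
        if row["person"] in names:                   # inner join drops unmatched rows
            group = (names[row["person"]], row["item"])   # GROUP BY b.name, a.item
            totals[group] += row["amount"]                # sum(a.amount)
    return dict(totals)


result = pastries_per_person(daily_purchases, person)
print(result)
```

Hive compiles the query into MapReduce jobs that do the same grouping across the cluster instead of in one process.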

Extra Credit

Real Time Considerations
• What about real time?
• Neither Hadoop nor Hive is built for real-time
• Cassandra provides you with data locality

Cassandra 2.0
• Transactions
• Triggers
• Prepared Statements

#strataconf + #hw2013

Q&A

@[email protected] on IRC (#cassandra, #python)