South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Post on 15-Jan-2015

DESCRIPTION

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:

• Support for evolvable schemas of complex data types
• Batch training of machine learning models with Hadoop
• Real-time scoring with trained models
• Integration with Hive and R
• A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:

• The Kiji architecture and data model
• Implementing the Kiji data model in Cassandra using the Java driver and CQL3
• Integrating Cassandra with Hadoop 2.x
• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users

Transcript

Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji

Clint Kelly, Member of Technical Staff, WibiData

Cassandra Meetup, 23 April 2014

Agenda

The problem

How Kiji works

Kiji on Cassandra

Open source software

[Diagram, built up over several slides. Labels in order of appearance: Data in (REST); Inspect; Train (“Trained model”); Model; Apply (Batch); Data out (REST)]

Experiments / Deployment

3

Data in / out (REST)

Inspect and train

Apply (real-time)

Kiji

How Kiji works

Kiji History

In production now

Fortune 500 retailer: Personalized recommendations

Opower: Energy usage and analytics reporting

How does it work?

[Architecture diagram, built up one element per slide. Labels recoverable from the final build, grouped here by role:

Engineering: Channels; Write (Logs and DBs via KijiMR, Stream via KijiREST); Read (KijiREST)
Storage: KijiSchema (Cassandra), holding rows for User 1, User 2, User 3
Data Science: Query (KijiHive); KijiExpress; KijiMR; Data
Real-time scoring: Scorer; Kiji Model Repository; KijiScoring; Freshness Policy]
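The diagram only names the Freshness Policy next to KijiScoring; it does not show how it behaves. As a rough illustration of the general idea (serve a stored score if it is recent enough, otherwise recompute it at read time), here is a minimal Python sketch. It is not Kiji's API: `is_fresh`, `read_score`, the dictionary cache, and the age-based rule are all hypothetical names and assumptions chosen for this example.

```python
import time


def is_fresh(written_at, now, max_age_seconds):
    """Assumed policy: a stored score is fresh if it is at most max_age_seconds old."""
    return (now - written_at) <= max_age_seconds


def read_score(cache, key, score_fn, max_age_seconds, now=None):
    """Return a stored score if fresh; otherwise recompute and store it.

    cache maps key -> (written_at, value). score_fn stands in for a scorer.
    """
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is not None and is_fresh(entry[0], now, max_age_seconds):
        return entry[1]
    value = score_fn(key)       # stale or missing: score at read time
    cache[key] = (now, value)   # write the new score back with its timestamp
    return value
```

Passing `now` explicitly keeps the sketch deterministic; a caller would normally omit it and let the current time be used.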

3

Data in / out: KijiREST, KijiMR

Inspect and train: KijiHive, KijiMR, KijiExpress

Apply (real-time): KijiModelRepository, KijiScoring

Modular

Kiji on Cassandra

Kiji ~ BigTable

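[Diagram: a table made up of many rows]

Row key = entity ID

[Diagram: a row, shown as an entity ID followed by its data]

Composite entity IDs

[Diagram: an entity ID with components 0xfa and “bob”, followed by the row's data]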

Column families

[Diagram: row 0xfa “bob” with column families payment, interactions, recommendations]

Columns

[Diagram: row 0xfa “bob” with columns inter:clicks, inter:search, payment:cardnum, payment:address, rec:scorer1, rec:scorer2]

Timestamped versions

[Diagram: row 0xfa “bob” with columns songs:let it be, inter:search, inter:clicks, payment:cardnum, payment:address, rec:scorer1, rec:scorer2, rec:scorer3. Several columns hold stacked versions, labeled with timestamps such as 1396560123 and 1395650231]
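The slides above describe a BigTable-style model: an entity ID selects a row, a family and qualifier select a column, and each column keeps timestamped versions. A minimal in-memory sketch of that shape follows. The class `KijiLikeTable` and its methods are hypothetical names for illustration only, not part of Kiji.

```python
from collections import defaultdict


class KijiLikeTable:
    """Toy model: entity ID -> (family, qualifier) -> {version: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, entity_id, family, qualifier, version, value):
        # entity_id is a tuple of components, mirroring composite entity IDs.
        self._rows[entity_id][(family, qualifier)][version] = value

    def get(self, entity_id, family, qualifier, max_versions=1):
        # Look up without creating empty rows for unknown entity IDs.
        cells = self._rows.get(entity_id, {}).get((family, qualifier), {})
        newest_first = sorted(cells.items(), key=lambda kv: kv[0], reverse=True)
        return newest_first[:max_versions]


table = KijiLikeTable()
table.put(("bob",), "inter", "clicks", 1395650231, "older click")
table.put(("bob",), "inter", "clicks", 1396560123, "newer click")
latest = table.get(("bob",), "inter", "clicks")  # newest version only
```

Reads return the newest versions first, which is the access pattern the later CQL schema supports with a descending clustering order on the version column.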

Complex data types

record Search {
  string search_term;
  long session_id;
  device_type device;
}
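To make the idea of a complex cell value concrete, the sketch below mirrors the Search record as a Python dataclass and round-trips it through bytes, since the Cassandra schema later in the talk stores column values as blobs. Two assumptions are loud here: JSON is used only as a stand-in for a real binary record encoding, and the `DeviceType` symbols are invented because the slide does not list the values of `device_type`.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class DeviceType(Enum):
    # Hypothetical symbols; the slide does not define device_type.
    DESKTOP = "DESKTOP"
    MOBILE = "MOBILE"


@dataclass(frozen=True)
class Search:
    search_term: str
    session_id: int
    device: DeviceType


def encode(record: Search) -> bytes:
    """Serialize a record to bytes (JSON as a stand-in encoding)."""
    payload = asdict(record)
    payload["device"] = record.device.value  # enums are not JSON-serializable
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode(blob: bytes) -> Search:
    """Rebuild the record from the stored bytes."""
    payload = json.loads(blob.decode("utf-8"))
    return Search(
        search_term=payload["search_term"],
        session_id=payload["session_id"],
        device=DeviceType(payload["device"]),
    )
```

The storage layer only ever sees opaque bytes; the record structure lives entirely in the encode and decode steps.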

Locality group

[Diagram, built up over several slides: the table's column families, some labeled Batch and some Real-time, are regrouped into two locality groups, locality_group_batch and locality_group_real_time. Storage annotations shown: On disk. Compressed. In memory.]

Row ➔ transactional consistency

Locality group ➔ Column family

CREATE TABLE loc_grp

Entity ID ➔ Primary key

CREATE TABLE loc_grp (
  city text,
  user text,
  PRIMARY KEY (city, user)
) WITH CLUSTERING ORDER BY (user ASC);

Family, Qualifier, Version ➔ Clustering Columns

CREATE TABLE loc_grp (
  city text,
  user text,
  family text,
  qualifier text,
  version bigint,
  PRIMARY KEY (city, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);

Column values ➔ Blobs

CREATE TABLE loc_grp (
  city text,
  user text,
  family text,
  qualifier text,
  version bigint,
  value blob,
  PRIMARY KEY (city, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
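The effect of this primary key can be simulated without a cluster. In the schema above, city is the partition key and user, family, qualifier, version are clustering columns, so cells within one partition are kept sorted with the newest version of each column first. The sketch below reproduces that ordering in plain Python; `Cell`, `partition`, and `latest` are hypothetical helper names, and the sample values are made up.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    city: str
    user: str
    family: str
    qualifier: str
    version: int
    value: bytes


def clustering_key(cell):
    # user ASC, family ASC, qualifier ASC, version DESC (negated integer).
    return (cell.user, cell.family, cell.qualifier, -cell.version)


def partition(cells, city):
    """All cells of one partition, in clustering order."""
    return sorted((c for c in cells if c.city == city), key=clustering_key)


def latest(cells, city, user, family, qualifier):
    """First matching cell in clustering order, i.e. the newest version."""
    for c in partition(cells, city):
        if (c.user, c.family, c.qualifier) == (user, family, qualifier):
            return c
    return None


sample = [
    Cell("SF", "bob", "inter", "clicks", 6, b"click-6"),
    Cell("SF", "bob", "inter", "clicks", 9, b"click-9"),
    Cell("SF", "bob", "pay", "addr", 5, b"1234 Main St, SF"),
    Cell("SF", "alice", "inter", "clicks", 7, b"click-7"),
]
newest = latest(sample, "SF", "bob", "inter", "clicks")  # the version-9 cell
```

Because the newest version sorts first, reading the most recent value of a column means taking the first row of a slice rather than scanning all versions.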

[Diagram: the resulting Cassandra wide row for partition 0xfa, one cell per clustering key]

bob:pay:cardnum:t ➔ AMEX1234...
bob:pay:addr:t5 ➔ 1234 Main St, SF
bob:inter:clicks:t9 ➔ ...
bob:inter:clicks:t7 ➔ ...
bob:inter:clicks:t6 ➔ ...

Implementation notes

DataStax Java driver

Cassandra 2.0.6

Async API

New MapReduce InputFormat

Issues

Operations across locality groups

Kiji locality group ➔ C* column family

Read across locality groups ➔ multiple C* reads (async API!)

Compare-and-set across locality groups ➔ not allowed in C* Kiji

Lose transactional consistency
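Because each locality group maps to its own Cassandra column family, reading a whole Kiji row means issuing one read per locality group and merging the results. The talk does this with the Java driver's async API. The sketch below shows only the same fan-out-and-merge pattern, using Python's asyncio. `FAKE_TABLES`, `read_locality_group`, and `read_row` are hypothetical stand-ins, not driver calls, and the stored values are invented.

```python
import asyncio

# Hypothetical stand-in for two Cassandra tables, one per locality group.
FAKE_TABLES = {
    "locality_group_batch": {"bob": {"rec:scorer1": b"0.7"}},
    "locality_group_real_time": {"bob": {"inter:clicks": b"..."}},
}


async def read_locality_group(table, entity_id):
    # A real client would await a network round trip here.
    await asyncio.sleep(0)
    return dict(FAKE_TABLES[table].get(entity_id, {}))


async def read_row(entity_id, tables):
    # Issue all reads concurrently, then merge the partial rows.
    partials = await asyncio.gather(
        *(read_locality_group(t, entity_id) for t in tables)
    )
    row = {}
    for partial in partials:
        row.update(partial)
    return row


def get_row(entity_id):
    return asyncio.run(read_row(entity_id, list(FAKE_TABLES)))
```

The merged row is assembled from independent reads, so nothing guarantees that the parts reflect the same instant. That is the loss of row-level transactional consistency the slide points out.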

Filters

HBase ➔ Rich server-side filters

Cassandra ➔ WHERE clauses

Client-side filtering
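The split implied here is that only restrictions a WHERE clause can express run on the server, and any richer predicate runs in the client after the data arrives. A minimal sketch of that two-stage pattern follows. `fetch_cells` merely imitates a server-side equality restriction on a key column, and both function names and the sample cells are assumptions for illustration.

```python
def fetch_cells(cells, family):
    """Stand-in for a server-side restriction, e.g. equality on a key column."""
    return [c for c in cells if c["family"] == family]


def client_side_filter(cells, predicate):
    """Apply an arbitrary predicate after the cells reach the client."""
    return [c for c in cells if predicate(c)]


stored = [
    {"family": "inter", "qualifier": "search", "value": "let it be"},
    {"family": "inter", "qualifier": "search", "value": "hey jude"},
    {"family": "payment", "qualifier": "address", "value": "1234 Main St, SF"},
]

# Stage 1: narrow by key column. Stage 2: filter on cell contents locally.
candidates = fetch_cells(stored, "inter")
matches = client_side_filter(candidates, lambda c: "let" in c["value"])
```

The cost of this design is that every candidate cell crosses the network before it can be rejected, so the key-based restriction should narrow the read as much as possible.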

Entity IDs with unhashed components

EntityId(state, city, username)

[Diagram: the entity ID components are annotated “hashed” and “unhashed”. In the row keys below, a hash prefix is followed by the plain username.]

HBase (one row per entity ID):

0x235af-alice
0x235af-bob
0x235af-cathy
0x235af-dave
0x38e0a-andy
0x38e0a-jane
0x38e0a-lucy
0x38e0a-nancy

Cassandra (one wide row per hash prefix):

0x235af | alice | bob | cathy | dave
0x38e0a | andy | jane | lucy | nancy

Limited to width of C* wide row!
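The two layouts can be compared with a small sketch. It assumes that state and city are the hashed components and username the unhashed one, which matches the row keys on the slide but is not stated outright. The MD5-based `hash_prefix` is purely illustrative and is not how Kiji, HBase, or Cassandra compute hashes. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def hash_prefix(*components, length=5):
    """Illustrative hash of the hashed components, rendered like 0x235af."""
    digest = hashlib.md5("|".join(components).encode("utf-8")).hexdigest()
    return "0x" + digest[:length]


def hbase_row_key(state, city, username):
    """HBase-style layout: one flat row key per entity ID."""
    return f"{hash_prefix(state, city)}-{username}"


def cassandra_layout(entity_ids):
    """Cassandra-style layout: one partition per hash prefix, users inside it."""
    partitions = defaultdict(list)
    for state, city, username in entity_ids:
        partitions[hash_prefix(state, city)].append(username)
    return {prefix: sorted(users) for prefix, users in partitions.items()}


ids = [
    ("CA", "San Francisco", "bob"),
    ("CA", "San Francisco", "alice"),
    ("CA", "San Jose", "jane"),
    ("CA", "San Jose", "andy"),
]
flat_keys = sorted(hbase_row_key(*e) for e in ids)
wide_rows = cassandra_layout(ids)
```

In the flat layout, entities sharing a prefix are merely adjacent in sort order. In the wide-row layout they all live in one partition, which is why the number of entities per hashed prefix is bounded by how wide a single row can grow.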

Project status

Next quarter

Cassandra in all Kiji components

Run MapReduce jobs with KijiExpress

Expose Cassandra-specific features
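One Cassandra-specific feature named in the talk description is variable consistency. The arithmetic behind it is short enough to sketch: with a replication factor N, a read is guaranteed to overlap the most recent acknowledged write when the number of replicas written plus the number of replicas read exceeds N. The function names below are hypothetical, and the sketch says nothing about how Kiji would expose these settings.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, as used by QUORUM-style levels."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must intersect every write set (R + W > N)."""
    return write_replicas + read_replicas > replication_factor


n = 3
strong = reads_see_latest_write(n, quorum(n), quorum(n))  # quorum writes and reads
weak = reads_see_latest_write(n, 1, 1)                    # single-replica writes and reads
```

Lower levels trade that overlap guarantee for latency and availability, which is the knob a real-time application might want to turn per operation.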

Thanks to Cassandra community

Mailing lists

Meetups, webinars, conferences
