Overview of Cassandra and The Doradus OSS Project Randy Guck Principal Engineer, Dell Software

Overview of Cassandra and Doradus

Jun 04, 2015


randyguck

Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.
Transcript
Page 1: Overview of Cassandra and Doradus

Overview of Cassandra and The Doradus OSS Project

Randy Guck

Principal Engineer, Dell Software

Page 2: Overview of Cassandra and Doradus

Overview

•  What is NoSQL?
   – Common RDB roadblocks
   – NoSQL database types

•  Overview of Cassandra
   – What's unique
   – Limitations

•  Doradus
   – Architecture
   – Features
   – The OLAP and Spider storage managers
   – What each is good for
   – Where to get Doradus

Page 3: Overview of Cassandra and Doradus

Why RDB Apps Look for Something Else

•  Performance
   – B-trees
   – Locking
   – One writable copy of each record

•  Scaling costs
   – RDBs scale "up"
   – Big boxes, SANs, Fibre Channel, etc.

•  What if you want...
   – Distributed access
   – No single points of failure
   – Instant failover
   – Sharding
   – Replication

Page 4: Overview of Cassandra and Doradus

NoSQL Data Models

Data Model             Examples                             Elastic?  Queries?  Relationships?
Key–Value              LevelDB, Kyoto Cabinet, Redis        No        No        No
Distributed Key–Value  Dynamo, MemcacheDB, Riak, Voldemort  Yes       No        No
Column-Oriented        Accumulo, Cassandra, HBase           Yes       Some      No
Document-Oriented      Couchbase, Elasticsearch, MongoDB    Yes       Yes       Some
Graph                  Neo4J, OrientDB, Titan               No        Yes       Yes

Elastic? = Sharding + replication
Queries? = AND/OR/ranges/etc.
Relationships? = Built-in support

Page 5: Overview of Cassandra and Doradus

NoSQL Data Models

(Same table as on the previous page, with an added callout: "Doradus goals")

Page 6: Overview of Cassandra and Doradus

NoSQL Common Traits

•  Distributed cluster of nodes
   – Commodity, shared-nothing servers
   – Scales horizontally
   – Expands elastically

•  Replication
   – Performant local access
   – Automatic failover

•  De-normalized data model

•  Schemaless/dynamic columns

•  Eventual consistency

Diagram: example cluster with N=5, RF=3
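
The "N=5, RF=3" label can be read as five nodes with a replication factor of three: each row is stored on three of the five nodes. The sketch below illustrates that idea with a simplified hash ring. It is not Cassandra's actual partitioner or replication strategy: the one-token-per-node layout, the use of MD5 for tokens, and the names token and replicas are assumptions made only for this illustration.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is used here only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas(row_key: str, nodes: list, rf: int) -> list:
    """Return the rf nodes that hold row_key: the first node clockwise
    from the key's token, followed by the next rf - 1 nodes on the ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(row_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


nodes = [f"node{i}" for i in range(1, 6)]   # N=5
owners = replicas("62c36", nodes, rf=3)     # RF=3
print(owners)
```

With RF=3, any single node can fail and two copies of every row remain reachable, which is what makes automatic failover possible.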

Page 7: Overview of Cassandra and Doradus

Is NoSQL Catching On?

Source: db-engines.com

Page 8: Overview of Cassandra and Doradus

Overview of Cassandra

•  Wide column NoSQL database

•  Open sourced by Facebook

•  Apache Project with active community

•  Commercially supported by DataStax, Acunu, others

•  Used by 1,500+ companies

•  "Pure peer" architecture

•  Largest known Cassandra cluster: 300+ TB data and 400+ machines.

Page 9: Overview of Cassandra and Doradus

What is Cassandra best for?

•  Continuous data streams
   – Logs, events, audit records, measurements, ...
   – Fast data ingestion
   – Predictable read performance

•  Partitionable data
   – "1,000's of little databases in one"

•  Elastic scalability
   – Expand/upgrade/repair without downtime

•  Not good for:
   – Blob store
   – Persistent queue
   – OLTP transactions

Page 10: Overview of Cassandra and Doradus

CQL Static Table

CREATE TABLE songs (
    id      uuid PRIMARY KEY,
    title   text,
    album   text,
    artist  text,
    data    blob
);
CREATE INDEX ON songs (artist);

Row Key    Columns: "<column name>"="<column value>"
62c36...   "album"="90125"         "artist"="Yes"      "data"=<audio>  "title"="Changes"
837a2...   "album"="Crystal Ball"  "artist"="Styx"     "data"=<audio>  "title"="Put Me On"
2de83...   "album"="Nevermind"     "artist"="Nirvana"  "data"=<audio>  "title"="Breed"
...

Page 11: Overview of Cassandra and Doradus

CQL Clustered Table

CREATE TABLE playlists (
    id          uuid,
    song_order  int,
    song_id     uuid,    // copied from songs.id
    title       text,    // copied from songs.title
    album       text,    // copied from songs.album
    artist      text,    // copied from songs.artist
    PRIMARY KEY (id, song_order)    // compound key
);

Row Key    Columns: "<song_order>:<column name>"="<column value>"
28d23...   "1:"=""  "1:album"="90125"  "1:artist"="Yes"  "1:song_id"="62c36..."  "1:title"="Changes"
           "2:"=""  "2:album"="Nevermind"  "2:artist"="Nirvana"  "2:song_id"="2de83..."  "2:title"="Breed"
           "3:"=""  ...
2ed91...   "1:"=""  "1:album"="Crystal Ball"  "1:artist"="Styx"  "1:song_id"="837a2..."  "1:title"="Put Me On"
           "2:"=""  ...
...
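
The storage layout of the clustered table can be imitated in a few lines. The sketch below is an illustration of the "<song_order>:<column name>" naming shown on the slide, not Cassandra's real storage engine: the function name to_storage_columns and the dictionary-based representation are assumptions, and real Cassandra orders columns by the typed clustering value rather than by string.

```python
def to_storage_columns(cql_rows):
    """Fold CQL rows of the playlists table into one wide row per
    partition key (id), prefixing each column name with song_order."""
    partitions = {}
    for row in cql_rows:
        columns = partitions.setdefault(row["id"], {})
        prefix = f'{row["song_order"]}:'
        columns[prefix] = ""  # empty marker column, shown as "1:"="" on the slide
        for name, value in row.items():
            if name not in ("id", "song_order"):
                columns[prefix + name] = value
    return partitions


cql_rows = [
    {"id": "28d23...", "song_order": 1, "song_id": "62c36...",
     "title": "Changes", "album": "90125", "artist": "Yes"},
    {"id": "28d23...", "song_order": 2, "song_id": "2de83...",
     "title": "Breed", "album": "Nevermind", "artist": "Nirvana"},
    {"id": "2ed91...", "song_order": 1, "song_id": "837a2...",
     "title": "Put Me On", "album": "Crystal Ball", "artist": "Styx"},
]
partitions = to_storage_columns(cql_rows)
for row_key, columns in partitions.items():
    print(row_key, columns)
```

Two CQL rows that share the same id end up in the same physical row, which is why a whole playlist can be read with a single partition lookup.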

Page 12: Overview of Cassandra and Doradus

CQL Clustered Table (cont.)

(Same table definition and storage layout as on the previous page, with an added callout: CQL "Rows")

Page 13: Overview of Cassandra and Doradus

Can we make Cassandra more appealing?

•  Data Model
   – No direct support for relationships

•  Indexing
   – Secondary indexes: single column only
   – Hash table only: no range searching

•  Searching
   – No joins, embedded queries
   – No aggregate queries
   – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>))
   – No full text search
   – No OR clauses
   – ...
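
Why a hash-table index rules out range searching can be seen in a small sketch. This is generic Python, not Cassandra code: the songs data reuses the slide's sample rows, and the sorted list stands in for any ordered index structure.

```python
from bisect import bisect_left, bisect_right

# song id -> artist, taken from the sample rows of the songs table
songs = {"62c36": "Yes", "837a2": "Styx", "2de83": "Nirvana"}

# Hash-style index: artist -> set of song ids. Equality lookups are direct,
# but the keys have no order, so a range query would need to scan every key.
hash_index = {}
for song_id, artist in songs.items():
    hash_index.setdefault(artist, set()).add(song_id)
equal = hash_index.get("Styx", set())

# Ordered index: (artist, song id) pairs kept sorted, so a range of artist
# names maps to one contiguous slice found by binary search.
sorted_index = sorted((artist, song_id) for song_id, artist in songs.items())
artists = [artist for artist, _ in sorted_index]
low, high = bisect_left(artists, "N"), bisect_right(artists, "T")
in_range = [song_id for _, song_id in sorted_index[low:high]]

print(equal)     # songs whose artist equals "Styx"
print(in_range)  # songs whose artist sorts between "N" and "T"
```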

Page 14: Overview of Cassandra and Doradus

What is Doradus?

•  Java service that enhances Cassandra

•  Adds features:
   – REST API (JSON and XML)
   – Multi-tenancy
   – Graph model
   – Multi-field/full text query language
   – Automatic data aging
   – OLAP and Spider storage services

•  Compatible with NoSQL tenets such as idempotent updates

•  Under development for ~3 years

•  Open source: Apache 2.0 License

Page 15: Overview of Cassandra and Doradus

Doradus Graph Model

•  A cluster hosts one or more applications
•  An application owns tables, which store objects
•  An object consists of single- and multi-valued fields
•  A pair of link fields forms a bi-directional relationship

Diagram tables:
   Message {Size, SendDate}
   Participant {ReceiptDate}
   Address {Name}
   Person {Name, Department}
   Attachment {Size, Extension}

Diagram link pairs (each arrow pair is one bi-directional relationship):
   Manager → / ← Employees (Person to Person)
   Person ↓ / ↑ Address (Address to Person)
   Attachments ↓ / ↑ Message (Message to Attachment)
   Recipients → / ← MessageAsRecipient (Message to Participant)
   Address → / ← Participants (Participant to Address)
   Sender → / ← MessageAsSender (Message to Participant)
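
The idea that a pair of link fields forms one bi-directional relationship can be sketched as follows. This is not Doradus code: the LinkStore class, its method names, and the table-qualified link names are hypothetical and only show how setting one side of a link can keep the inverse side consistent.

```python
from collections import defaultdict


class LinkStore:
    """Toy store that keeps both sides of each link pair in sync."""

    def __init__(self, link_pairs):
        self.inverse = {}
        for link, inverse_link in link_pairs:
            self.inverse[link] = inverse_link
            self.inverse[inverse_link] = link
        self.values = defaultdict(set)  # (object id, link name) -> linked ids

    def add_link(self, source_id, link, target_id):
        # Sets make the update idempotent: repeating it changes nothing.
        self.values[(source_id, link)].add(target_id)
        self.values[(target_id, self.inverse[link])].add(source_id)

    def get(self, object_id, link):
        return set(self.values.get((object_id, link), set()))


store = LinkStore([
    ("Message.Sender", "Participant.MessageAsSender"),
    ("Person.Manager", "Person.Employees"),
])
store.add_link("msg1", "Message.Sender", "part1")
store.add_link("msg1", "Message.Sender", "part1")  # repeated update, no effect
store.add_link("alice", "Person.Manager", "bob")

print(store.get("part1", "Participant.MessageAsSender"))
print(store.get("bob", "Person.Employees"))
```

Because both directions are written together, a query can start from either table and follow the link without a join.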

Page 16: Overview of Cassandra and Doradus

Example Object and Aggregate Queries

•  Lucene full text query
   GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]

•  Link path with filtering
   GET /Email/Message/_query?q=Sender.WHERE(ReceiptDate>'2010-06-01').Address.Name="*.com"

•  Quantifiers
   GET /Email/Message/_aggregate?m=COUNT(*)
       &q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales
       &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))

•  Transitive links
   GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
       &f=DirectReports(Name,DirectReports(Name))
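
The queries above are written unencoded for readability; over HTTP, spaces, brackets, and similar characters in the parameter values have to be percent-encoded. The sketch below builds such URIs with Python's standard library. The base address (host and port) and the helper name query_url are assumptions for illustration; only the /Email/... paths and the q, m, and f parameters come from the slide.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:1123"  # assumed address of a Doradus server


def query_url(application, table, command, **params):
    """Build a GET URI of the form /<application>/<table>/<command>?<params>,
    percent-encoding every parameter value."""
    path = "/".join(quote(part, safe="") for part in (application, table, command))
    return f"{BASE_URL}/{path}?{urlencode(params, quote_via=quote)}"


full_text = "FirstName:j* AND NOT Office:[q TO z]"
url = query_url("Email", "Person", "_query", q=full_text)
print(url)

aggregate_url = query_url(
    "Email", "Message", "_aggregate",
    m="COUNT(*)",
    q="ANY(Recipients).ALL(Address).NONE(Person).Department:sales",
    f="Tags,TOP(3,TRUNCATE(SendDate,DAY))",
)
print(aggregate_url)

# Round trip: decoding the query string gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["q"][0]
```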

Page 17: Overview of Cassandra and Doradus

Doradus: Architecture

Diagram (top to bottom):
   Application
      | REST API
   Doradus
      | Thrift or CQL
   Cassandra
      |
   Data and Log files

Page 18: Overview of Cassandra and Doradus

Doradus: Multi-Data Center Clusters

Diagram: DC=2, N=6, RF=3
   Rack 1, Data Center 1 and Rack 1, Data Center 2
   Node 1 to Node 6: six Cassandra instances and four Doradus instances
   Applications in each data center

Page 19: Overview of Cassandra and Doradus

Doradus: Internal Architecture

Diagram components:
   – Clients: App, App, App, App, Monitor (connections labeled REST and JMX)
   – REST: Embedded Jetty Server
   – Spider Storage Service, OLAP Storage Service
   – Cassandra Interface
   – doradus.yaml
   – Cassandra Cluster

Page 20: Overview of Cassandra and Doradus

Doradus OLAP Service

•  Borrows from online analytical processing
   – Sharding as data "cubes"
   – Columnar storage

•  Very dense storage
   – No indexes!
   – Value arrays are compressed

•  Fast load time
   – Up to 500,000 objects/second/node
   – Small "data lag" time

•  Very fast queries
   – Searches millions of objects/second
   – Full DQL object and aggregate query support

Page 21: Overview of Cassandra and Doradus

OLAP Data Loading

Diagram: Sources (stacks labeled Events, People, Computers, Domains)

Page 22: Overview of Cassandra and Doradus

OLAP Data Loading

Diagram: Sources -> Segments
   Sources: stacks labeled Events, People, Computers, Domains
   Segments: labeled T1 to T4 (changes in last n minutes)

Page 23: Overview of Cassandra and Doradus

OLAP Data Loading

Diagram: Sources -> Segments -> Shards
   Sources: stacks labeled Events, People, Computers, Domains
   Segments: labeled T1 to T4 (changes in last n minutes)
   Shards: 2013-03-01, 2013-02-28, 2013-02-27, … (date-based shards)

Page 24: Overview of Cassandra and Doradus

OLAP Data Loading

Diagram: Sources -> Segments -> Shards -> OLAP Store
   Sources: stacks labeled Events, People, Computers, Domains
   Segments: labeled T1 to T4 (changes in last n minutes)
   Shards: 2013-03-01, 2013-02-28, 2013-02-27, … (date-based shards)
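
The loading flow in the diagram (sources, then segments, then date-based shards) can be approximated in a short sketch. This is a loose illustration, not the Doradus OLAP implementation: the field name Timestamp, the functions shard_name, load_batch, and merge_shard, and the in-memory store layout are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(obj):
    """Date-based shard: one shard per calendar day of the object's timestamp."""
    return obj["Timestamp"].strftime("%Y-%m-%d")


def load_batch(store, batch, batch_id):
    """Store a batch of recent changes as one segment in each shard it touches."""
    by_shard = defaultdict(list)
    for obj in batch:
        by_shard[shard_name(obj)].append(obj)
    for shard, objects in by_shard.items():
        entry = store.setdefault(shard, {"segments": {}, "merged": []})
        entry["segments"][batch_id] = objects


def merge_shard(store, shard):
    """Fold all pending segments of a shard into its merged data."""
    entry = store[shard]
    for batch_id in sorted(entry["segments"]):
        entry["merged"].extend(entry["segments"][batch_id])
    entry["segments"].clear()


store = {}
load_batch(store, [
    {"Timestamp": datetime(2013, 3, 1, 8, 0), "Name": "logon"},
    {"Timestamp": datetime(2013, 2, 28, 23, 59), "Name": "logoff"},
], "T1")
load_batch(store, [
    {"Timestamp": datetime(2013, 3, 1, 9, 30), "Name": "logon"},
], "T2")
merge_shard(store, "2013-03-01")
print(sorted(store))
```

Loading in batches and merging per shard is what allows immutable, time-partitioned data to be stored densely without per-object indexes.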

Page 25: Overview of Cassandra and Doradus

OLAP Use Case

•  Data: Windows Events
   – 115M events

•  Test parameters
   – Server: Quad Xeon CPUs, 32GB memory, 3 disks
   – Cassandra memory: 1GB
   – Load app/embedded Doradus memory: 4GB
   – Load threads: 5
   – Batch size: 5,000 events
   – Shard size: 1 day (860 shards total)

•  Test results
   – Total objects loaded: ~1 billion
   – Total time: 32 minutes, 56 seconds
   – Load rate: 502,991 objects/second
   – Final database size: ~2GB
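
The reported figures are consistent with each other, as a quick calculation shows. The only inputs are the numbers on the slide; treating "~2GB" as 2 x 10^9 bytes is an assumption, so the bytes-per-object value is a rough estimate.

```python
# Figures taken from the test results on the slide
total_seconds = 32 * 60 + 56          # 32 minutes, 56 seconds
load_rate = 502_991                   # objects per second

objects_loaded = load_rate * total_seconds
print(total_seconds)                  # 1976
print(objects_loaded)                 # 993910216, i.e. roughly 1 billion

# Rough storage density, assuming "~2GB" means about 2e9 bytes
database_bytes = 2e9
bytes_per_object = database_bytes / objects_loaded
print(round(bytes_per_object, 2))     # about 2 bytes per object
```

About two bytes per stored object is what the "very dense storage" claim for the OLAP service amounts to in this test.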

Page 26: Overview of Cassandra and Doradus

Doradus Spider Service

•  Analogous to Lucene + NoSQL

•  Fully inverted field indexing
   – Configurable analyzers
   – Stored-only (non-indexed) fields

•  Unique features:
   – Automatic table-level sharding
   – Statistics
      – Pre-computed aggregate queries
      – Refreshed in background
   – Object-level data aging

•  Use case example:
   – Indexing a massive number of documents
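
"Fully inverted field indexing" means that every term of every indexed field points back to the objects containing it. The sketch below shows the general technique only; it is not the Spider service's storage format. The analyze function is a stand-in for a configurable analyzer, and the sample objects are made up for the example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Stand-in analyzer: lowercase the text and split it into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(objects):
    """Map each (field, term) pair to the ids of the objects containing it."""
    index = defaultdict(set)
    for object_id, fields in objects.items():
        for field, value in fields.items():
            for term in analyze(value):
                index[(field, term)].add(object_id)
    return index


# Hypothetical sample objects
objects = {
    "doc1": {"Subject": "Quarterly sales report", "Body": "Sales rose in June"},
    "doc2": {"Subject": "Team lunch", "Body": "Report to the lobby at noon"},
}
index = build_index(objects)

subject_hits = index.get(("Subject", "report"), set())
# AND of two terms in the same field is a set intersection
both = index.get(("Body", "sales"), set()) & index.get(("Body", "june"), set())
print(subject_hits, both)
```

A stored-only field would simply be skipped in build_index: its value is kept with the object but contributes no index entries.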

Page 27: Overview of Cassandra and Doradus

OLAP and Spider: When to Use

•  Spider is best for:
   – Unstructured/variable-structure data
   – Configurable indexing
   – Fine-grained updates with immediate indexing
   – Document storage and searching
   – Emphasis on full-text/multi-field searching

•  OLAP is best for:
   – High-volume data streams
   – High performance analytic queries
   – Dense data storage
   – Immutable/semi-mutable data
   – Data that can be loaded in batches
   – Data that can be partitioned (e.g., time-sharded)

Page 28: Overview of Cassandra and Doradus

Summary

•  What's cool about Doradus?
   – Bi-directional links with referential integrity
   – Link paths: simpler than joins
   – Idempotent updates
   – Partial object updates
   – Simple transitive searching
   – OLAP: dense storage and fast queries
   – It's free!

Page 29: Overview of Cassandra and Doradus

Thank you!

Doradus is available at: https://github.com/dell-oss/Doradus

Contact me: [email protected]