Apache Cassandra: A Brief History: Dive into the Dynamo whitepaper

Apache cassandra an introduction

Apr 12, 2017


Shehaaz Saif
Transcript
Page 1: Apache cassandra  an introduction

Apache Cassandra: A Brief History: Dive into the Dynamo whitepaper

Page 2: Apache cassandra  an introduction

About me
@Shehaaz

I love hacking on wearable/IoT devices.

Page 3: Apache cassandra  an introduction

Topics Today
● History and Dynamo
● Time series data modeling
● Example App

Page 4: Apache cassandra  an introduction

History
● Peer-to-Peer (all nodes are EQUAL)
○ Centralized peer-to-peer networks
■ Node connects to a “Directory” server.
● e.g.: Napster
○ Unstructured networks
■ Nodes randomly connect to each other
● e.g.: Kazaa, Gossip
○ Structured networks
■ Nodes organized into a specific topology (consistent hashing)
● e.g.: Cassandra Ring

Page 5: Apache cassandra  an introduction

Napster: Centralized P2P

Page 6: Apache cassandra  an introduction

Road to Cassandra
● 1999: Napster and other “questionable” P2P services
● 2006: Google Bigtable
○ C* has a similar data storage model.
● 2007: Amazon Dynamo (Avinash Lakshman)
○ C* has a similar architecture.
● 2008: Facebook open-sourced C* (Avinash Lakshman)

Page 7: Apache cassandra  an introduction

CAP Theorem
● Consistency
○ All nodes see the same data at the same time
● Availability
○ A guarantee that every request receives a response about whether it succeeded or failed
● Partition Tolerance
○ The system continues to operate despite arbitrary message loss or failure of part of the system

e.g.: Increasing availability (by increasing the replication factor) reduces consistency. You can only have two out of the three at the same time!

Page 8: Apache cassandra  an introduction

Dynamo
The motivation:
● You must ALWAYS be able to add to your shopping cart! (High Availability)
● Conflict resolution is done at the application:
○ merge conflicting shopping carts.
● Primary-key access to the data store (RDB limitations)
○ e.g.: best seller list, customer preferences, etc.

Page 9: Apache cassandra  an introduction

Dynamo Architecture
Key principles:
1. Incremental scalability
○ Add nodes w/o disrupting the system
2. Symmetry
○ Every node has the same responsibility
3. Decentralization
○ Peer-to-peer over centralized control
4. Heterogeneity
○ The work distribution must be proportional to the capabilities of the individual servers.

Page 10: Apache cassandra  an introduction

Distributed Hash Table
Data Organization
Distributed Hash Table (DHT) using Consistent Hashing:

The keys are mapped to form a ring. The output range of the hash function is treated as a fixed circular “ring” (i.e., the largest hash value wraps around to the smallest hash value).

Page 11: Apache cassandra  an introduction

Inserting data: High Level
Hash(RowKey) = 4500

Circle clockwise and insert into Node 5.
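
The lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the six-node ring and its token positions (Node k ends its range at k*1000 - 1) are assumptions chosen so that a hash of 4500 lands on Node 5, as in the slide. They are not taken from the original diagram.

```python
from bisect import bisect_left

# Hypothetical six-node ring: node k owns the token range ending at k*1000 - 1.
RING = [(999, "Node 1"), (1999, "Node 2"), (2999, "Node 3"),
        (3999, "Node 4"), (4999, "Node 5"), (5999, "Node 6")]
TOKENS = [token for token, _ in RING]


def owner(token: int) -> str:
    """Walk clockwise from `token` to the first node at or after it."""
    index = bisect_left(TOKENS, token)
    # Past the largest token, wrap around to the first node on the ring.
    return RING[index % len(RING)][1]


print(owner(4500))  # Node 5
print(owner(6000))  # wraps around to Node 1
```

Keeping the tokens sorted lets the "circle clockwise" step be a binary search instead of a walk around the ring.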

Page 12: Apache cassandra  an introduction

Row Level Hashing?

Patient ID (Partition Key)    Event Time (Clustering Column)
1                             T:22:00:02, HR:71    T:22:00:01, HR:72
2                             T:22:00:05, HR:90    T:22:00:02, HR:95

Page 13: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Advantage:
○ Departure or arrival of a node only affects its immediate neighbors. Every node is in charge of the range between the previous node and itself, going clockwise.
○ Only about K/N keys need to be remapped when a node drops. K = #keys, N = #nodes
● Disadvantage:
○ ?
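
A small simulation can make the advantage concrete. The sketch below is an assumption-laden toy, not Dynamo's or Cassandra's code: it uses MD5 as the hash, made-up node and key names, and one position per node. It removes one node and checks that the only keys that change owner are the ones the departed node used to hold.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def build_ring(nodes):
    """Place each node at one position and return the sorted (token, node) list."""
    return sorted((token(node), node) for node in nodes)


def owner(ring, key: str) -> str:
    """First node clockwise from the key's token, wrapping past the end."""
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token(key)) % len(ring)][1]


nodes = [f"node-{i}" for i in range(10)]      # hypothetical node names
keys = [f"key-{i}" for i in range(10_000)]    # hypothetical row keys

before = build_ring(nodes)
after = build_ring([n for n in nodes if n != "node-3"])

moved = [k for k in keys if owner(before, k) != owner(after, k)]
lost = [k for k in keys if owner(before, k) == "node-3"]

print(f"{len(moved)} of {len(keys)} keys moved")
print(moved == lost)  # True: only the departed node's keys are remapped
```

K/N is the average share per node. With a single random position per node, the actual number of keys that move can be well above or below that average.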

Page 14: Apache cassandra  an introduction

Dynamo Architecture

Page 15: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Disadvantage?

Page 16: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Disadvantage
○ Random node position assignment leads to non-uniform data and load distribution
○ Some nodes could simply suck (the basic ring ignores differences in server capacity)

Page 17: Apache cassandra  an introduction

Disadvantage Diagram

Page 18: Apache cassandra  an introduction

Virtual nodes to the rescue!
● Instead of mapping a node to a single point in the ring, each node gets assigned to multiple locations in the ring... (what does that mean?)

Virtual Nodes!
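
One way to see what "multiple locations in the ring" buys is to measure how much of the ring each server owns. The sketch below is a toy under stated assumptions: MD5 positions, made-up server names, and 256 positions per server picked only for illustration. It compares one position per server with many.

```python
import hashlib

RING_SIZE = 2 ** 128  # MD5 output range, treated as the circular token space


def token(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def ownership(nodes, vnodes_per_node: int) -> dict:
    """Fraction of the ring owned by each node for a given number of positions."""
    ring = sorted(
        (token(f"{node}#{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )
    shares = {node: 0.0 for node in nodes}
    for index, (position, node) in enumerate(ring):
        # Each position owns the arc from the previous position up to itself.
        previous = ring[index - 1][0]
        arc = (position - previous) % RING_SIZE
        shares[node] += arc / RING_SIZE
    return shares


nodes = ["server-1", "server-2", "server-3"]  # hypothetical names
single = ownership(nodes, vnodes_per_node=1)
many = ownership(nodes, vnodes_per_node=256)

for label, shares in (("1 position per node", single),
                      ("256 positions per node", many)):
    print(label, {node: round(share, 3) for node, share in shares.items()})
```

With many positions per server, each share is the sum of many small arcs, so the shares tend to even out toward one third each.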

Page 19: Apache cassandra  an introduction

Virtual Nodes
Three-node cluster with zero V-nodes

p = Position

Page 20: Apache cassandra  an introduction

Virtual Nodes

● V-Nodes look like nodes in the system
● A regular node can be responsible for more than one V-Node

Page 21: Apache cassandra  an introduction

Virtual Nodes: Add Node
Adding a new node:
● This will evenly balance the data in the cluster. Server #4 will get data from all the servers.
○ How?
■ Server 4's virtual nodes sit next to those of 1, 2 and 3 on the ring

Page 22: Apache cassandra  an introduction

V-Nodes: Remove Node
When a node goes down, its data is evenly redistributed.

When #1 went down, #2 and #3 took over the data.

If we didn’t have virtual nodes, #2 alone would have been overloaded.
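
The difference can be traced on two tiny rings. The token positions below are made up for illustration and are not the ones from the slide's diagram. The function assumes the ring is sorted by token and simply asks who is next clockwise after each position of the departed node.

```python
def inheritors(ring, departed):
    """Nodes that take over the departed node's ranges (next owner clockwise)."""
    survivors = [(t, n) for t, n in ring if n != departed]
    takers = set()
    for position, node in ring:
        if node != departed:
            continue
        # The range ending at `position` falls to the next surviving position.
        following = [n for t, n in survivors if t > position]
        takers.add(following[0] if following else survivors[0][1])
    return takers


# Hypothetical token layouts, sorted by token.
without_vnodes = [(100, "#1"), (200, "#2"), (300, "#3")]
with_vnodes = [(100, "#1"), (200, "#2"), (300, "#3"),
               (400, "#1"), (500, "#3"), (600, "#2")]

print(inheritors(without_vnodes, "#1"))       # {'#2'}
print(sorted(inheritors(with_vnodes, "#1")))  # ['#2', '#3']
```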

Page 23: Apache cassandra  an introduction

Replication
Why? To achieve high availability.
e.g.: Replication Factor: 3
Hash(KEY1) = 500
Node #1 is the coordinator node for values 0 to 999.
Its job is to replicate the data to TWO other nodes.
In modern C* this is the job of the node that received the write.

Page 24: Apache cassandra  an introduction

Replication
Server 1 copies the data to the next TWO nodes clockwise to satisfy Replication Factor: 3

If 1 goes down, 2 will make sure to keep R.F. = 3
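
Replica selection by walking clockwise can be sketched as follows. The six-node ring and its tokens are assumptions (Node 1 covers 0 to 999, as in the example above), and the `removed` parameter is a hypothetical way to model a node that has left the ring. Real Cassandra replication strategies also take racks and data centers into account, which this toy ignores.

```python
from bisect import bisect_left

# Hypothetical ring: node k owns tokens (k-1)*1000 .. k*1000 - 1.
RING = [(999, "Node 1"), (1999, "Node 2"), (2999, "Node 3"),
        (3999, "Node 4"), (4999, "Node 5"), (5999, "Node 6")]


def replicas(token: int, replication_factor: int = 3, removed=frozenset()):
    """Owner of `token` plus the next nodes clockwise, up to RF distinct nodes."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token)
    chosen = []
    for step in range(len(RING)):
        node = RING[(start + step) % len(RING)][1]
        if node not in removed and node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen


print(replicas(500))                      # ['Node 1', 'Node 2', 'Node 3']
print(replicas(500, removed={"Node 1"}))  # ['Node 2', 'Node 3', 'Node 4']
```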

Page 25: Apache cassandra  an introduction

Example Application
● Patient in critical care. Needs a vital sign dashboard
● Arduino-based heart rate and SpO2 measuring device.
● Pretty graphs, and gaining insight from the data

Page 26: Apache cassandra  an introduction

Arduino + e-Health PCB

Page 27: Apache cassandra  an introduction

System Diagram

Page 28: Apache cassandra  an introduction

Setup GCloud C* cluster

Page 29: Apache cassandra  an introduction

Requirements.txt

Page 30: Apache cassandra  an introduction

Example Code

Page 31: Apache cassandra  an introduction

Create Tables

Page 32: Apache cassandra  an introduction

What’s Wrong?
1. We will eventually run out of columns. Cassandra allows 2 billion columns per row.

63.3 years
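
The figure can be reproduced with a quick back-of-the-envelope calculation, assuming one reading is stored per second. That sampling rate is an assumption inferred from the one-second timestamps in the example rows; the slide does not state it.

```python
MAX_COLUMNS_PER_ROW = 2_000_000_000   # limit quoted on the slide
READINGS_PER_SECOND = 1               # assumption: one heart-rate sample per second
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

years_until_full = MAX_COLUMNS_PER_ROW / (READINGS_PER_SECOND * SECONDS_PER_YEAR)
print(f"{years_until_full:.1f} years")
```

Under that assumption a single patient row lasts a little over 63 years, in line with the slide. A higher sampling rate or more than one column per reading would shrink that proportionally.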

Page 33: Apache cassandra  an introduction

What’s Wrong?
2. RowKey hashing will create a hotspot in the cluster. (Remember Row Level Hashing?)
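
The hotspot can be illustrated with a toy placement function. Everything here is a simplification: the node names, the MD5 hash, the modulo placement (rather than a real token ring), and the "patient:day" key format are all assumptions made for the sketch.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster


def node_for(partition_key: str) -> str:
    """Pick a node from the hash of the partition key only (toy placement)."""
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]


# Key = patient ID: every reading for patient 1 hashes to the same node,
# no matter what the event time is.
readings = [("1", f"22:00:{second:02d}") for second in range(60)]
by_patient = {node_for(patient_id) for patient_id, _ in readings}

# Key = patient ID + day (hypothetical bucketing): partitions spread out.
days = [f"2015-02-{day:02d}" for day in range(1, 29)]
by_patient_and_day = {node_for(f"1:{day}") for day in days}

print(len(by_patient))          # 1
print(len(by_patient_and_day))  # more than one node shares the load
```

Because only the partition key is hashed, the event time never influences where a reading is stored. Adding the day to the key is one way to break a patient's history into partitions that can land on different nodes.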

Page 34: Apache cassandra  an introduction

Data modeling in C*

Time Series data modeling.

Page 35: Apache cassandra  an introduction

Create Tables

A.K.A: Compound Row Key

Page 36: Apache cassandra  an introduction

Table

Patient ID (Partition Key)    Event Time (Clustering Column)
1,2015-02-17                  T:22:00:01, HR:71    T:22:00:00, HR:72
2,2015-02-17                  T:22:00:05, HR:90    T:22:00:02, HR:95

Data is SORTED and stored sequentially on disk
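
The layout in this table can be mimicked in plain Python to show how the compound row key groups and orders the readings. The CQL string is a hypothetical schema that would produce such a layout; the table and column names are made up, it is not executed here, and it is not the schema from the original slide, which did not survive in this transcript.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CQL for the layout above (not executed; names are made up).
CREATE_TABLE = """
CREATE TABLE heart_rate (
    patient_id int,
    day text,
    event_time timestamp,
    heart_rate int,
    PRIMARY KEY ((patient_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
"""

readings = [
    (1, datetime(2015, 2, 17, 22, 0, 0), 72),
    (2, datetime(2015, 2, 17, 22, 0, 2), 95),
    (1, datetime(2015, 2, 17, 22, 0, 1), 71),
    (2, datetime(2015, 2, 17, 22, 0, 5), 90),
]

# Partition key = (patient ID, day); clustering column = event time.
partitions = defaultdict(list)
for patient_id, event_time, heart_rate in readings:
    key = (patient_id, event_time.date().isoformat())
    partitions[key].append((event_time, heart_rate))

for key, rows in sorted(partitions.items()):
    rows.sort(reverse=True)  # newest first, mirroring the table above
    print(key, [(t.strftime("%H:%M:%S"), hr) for t, hr in rows])
```

Each (patient, day) pair becomes its own partition, so no single row grows without bound, and the readings inside a partition stay ordered by event time.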

Page 37: Apache cassandra  an introduction

Insert Data

Page 38: Apache cassandra  an introduction

Query Data

Page 40: Apache cassandra  an introduction

Resources
Amazon Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Cassandra High Availability by Robbie Strickland: http://www.amazon.com/gp/product/1783989122/ref=cm_cr_ryp_prd_ttl_sol_0