Page 1: CS 525 Advanced Distributed Systems  Spring 2014

1

CS 525 Advanced Distributed Systems

Spring 2014

Indranil Gupta (Indy)

Feb 11, 2014

Key-value/NoSQL Stores

Lecture 7

© 2014, I. Gupta

Based mostly on:
• Cassandra NoSQL presentation
• Cassandra 1.0 documentation at datastax.com
• Cassandra Apache project wiki
• HBase

Page 2: CS 525 Advanced Distributed Systems  Spring 2014

2

Cassandra

• Originally designed at Facebook

• Open-sourced

• Some of its myriad users:

• With this many users, one would think its design is very complex
– Let’s find out!

Page 3: CS 525 Advanced Distributed Systems  Spring 2014

3

Why Key-value Store?

• (Business) Key -> Value

• (twitter.com) tweet id -> information about tweet

• (amazon.com) item number -> information about it

• (kayak.com) Flight number -> information about flight, e.g., availability

• (yourbank.com) Account number -> information about it

• Search is usually built on top of a key-value store

Page 4: CS 525 Advanced Distributed Systems  Spring 2014

4

Isn’t that just a database?

• Yes, sort of

• Relational Databases (RDBMSs) have been around for ages

• MySQL is the most popular among them

• Data stored in tables

• Schema-based, i.e., structured tables

• Queried using SQL (Structured Query Language)

SQL query: SELECT user_id FROM users WHERE username = 'jbellis'


Page 5: CS 525 Advanced Distributed Systems  Spring 2014

5

Mismatch with today’s workloads

• Data: Large and unstructured

• Lots of random reads and writes

• Foreign keys rarely needed

• Need:
– Speed

– Avoid Single point of Failure (SPoF)

– Low TCO (Total cost of operation) and fewer sysadmins

– Incremental Scalability

– Scale out, not up: use more machines that are off the shelf (COTS), not more powerful machines

Page 6: CS 525 Advanced Distributed Systems  Spring 2014

6

CAP Theorem

• Proposed by Eric Brewer (Berkeley)

• Subsequently proved by Gilbert and Lynch

• In a distributed system you can satisfy at most 2 out of the 3 guarantees:

1. Consistency: all nodes see the same data at any time, or reads return the latest written value

2. Availability: the system allows operations all the time

3. Partition-tolerance: the system continues to work in spite of network partitions

• Cassandra
– Eventual (weak) consistency, Availability, Partition-tolerance

• Traditional RDBMSs
– Strong consistency over availability under a partition

Page 7: CS 525 Advanced Distributed Systems  Spring 2014

7

CAP Tradeoff

• Starting point for NoSQL Revolution

• Conjectured by Eric Brewer in 2000, proved by Gilbert and Lynch in 2002

• A distributed storage system can achieve at most two of C, A, and P.

• When partition-tolerance is important, you have to choose between consistency and availability

[CAP triangle: Consistency, Availability, and Partition-tolerance at the corners.
– RDBMSs: Consistency + Availability
– Cassandra, RIAK, Dynamo, Voldemort: Availability + Partition-tolerance
– HBase, HyperTable, BigTable, Spanner: Consistency + Partition-tolerance]

Page 8: CS 525 Advanced Distributed Systems  Spring 2014

8

Eventual Consistency

• If all writers stop (to a key), then all its values (replicas) will converge eventually.

• If writes continue, then system always tries to keep converging.

– Moving “wave” of updated values lagging behind the latest values sent by clients, but always trying to catch up.

• May return stale values to clients (e.g., if many back-to-back writes).

• But works well when there are periods of low writes; the system then converges quickly.

Page 9: CS 525 Advanced Distributed Systems  Spring 2014

9

Cassandra – Data Model

• Column Families:

– Like SQL tables

– but may be unstructured (client-specified)

– Can have index tables

• “Column-oriented databases”/ “NoSQL”

– Columns stored together, rather than rows

– No schemas

» Some columns missing from some entries

– NoSQL = “Not Only SQL”

– Supports get(key) and put(key, value) operations

– Often write-heavy workloads
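
A minimal sketch of what a get/put interface over schemaless column families can look like; this is illustrative only, not Cassandra's actual API, and the class and method names are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a column family as a map from row key to a sparse
// set of (column name -> value) pairs. Rows need not share columns.
class ColumnFamilySketch {
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    // put(key, column, value): upsert one column of one row
    void put(String key, String column, String value) {
        rows.computeIfAbsent(key, k -> new ConcurrentHashMap<>()).put(column, value);
    }

    // get(key): return whatever columns this row happens to have
    Map<String, String> get(String key) {
        return rows.getOrDefault(key, Map.of());
    }

    public static void main(String[] args) {
        ColumnFamilySketch users = new ColumnFamilySketch();
        users.put("jbellis", "user_id", "42");
        users.put("jbellis", "location", "TX");   // column present only in some rows
        System.out.println(users.get("jbellis")); // e.g., {user_id=42, location=TX}
    }
}
```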

Page 10: CS 525 Advanced Distributed Systems  Spring 2014

10

Let’s go Inside Cassandra: Key -> Server Mapping

• How do you decide which server(s) a key-value pair resides on?

Page 11: CS 525 Advanced Distributed Systems  Spring 2014

11

[Ring figure ("Remember this?"): with m = 7, ring positions run 0..127 and hold nodes N16, N32, N45, N80, N96, N112. Key K13 maps onto the ring; one node holds its primary replica and other nodes hold its backup replicas. A coordinator (typically one per DC) issues the client’s read/write for K13 to those replicas.]

Cassandra uses a Ring-based DHT, but without routing.
The key -> server mapping is the “Partitioning Function”.
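
A sketch of the ring-based key -> server mapping described above, using a sorted map as the ring. The hash function and replica count are placeholders, not Cassandra's actual partitioner:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative partitioning function: hash the key onto a ring of 2^m points,
// pick the first node clockwise as the primary, and the next distinct nodes
// clockwise as backup replicas.
class RingSketch {
    private final int ringSize;                       // 2^m positions, e.g., m=7 -> 128
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    RingSketch(int m) { this.ringSize = 1 << m; }

    void addNode(String name, int position) { ring.put(position % ringSize, name); }

    private int hash(String key) {                    // placeholder hash, not Cassandra's
        return Math.floorMod(key.hashCode(), ringSize);
    }

    List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // walk clockwise from the key's position, wrapping around the ring
        for (String node : tail.values()) {
            if (replicas.size() < n) replicas.add(node);
        }
        for (String node : ring.values()) {
            if (replicas.size() < n && !replicas.contains(node)) replicas.add(node);
        }
        return replicas;                              // first entry = primary replica
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch(7);          // positions 0..127, as in the figure
        int[] pos = {16, 32, 45, 80, 96, 112};
        for (int p : pos) ring.addNode("N" + p, p);
        System.out.println(ring.replicasFor("K13", 3)); // primary plus two backups
    }
}
```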

Page 12: CS 525 Advanced Distributed Systems  Spring 2014

12

Writes

• Need to be lock-free and fast (no reads or disk seeks)

• Client sends write to one front-end node in Cassandra cluster

– Front-end = Coordinator, assigned per key

• Which (via Partitioning function) sends it to all replica nodes responsible for key

– Always writable: Hinted Handoff mechanism

» If any replica is down, the coordinator writes to all other replicas, and keeps the write locally until down replica comes back up.

» When all replicas are down, the Coordinator (front end) buffers writes (for up to a few hours).

– Provides Atomicity for a given key (i.e., within a ColumnFamily)

• One ring per datacenter
– Per-DC coordinator elected to coordinate with other DCs

– Election done via Zookeeper, which runs a Paxos variant
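
A sketch, based only on the description above, of how a coordinator could stay always-writable with hinted handoff: apply the write at live replicas, and keep a hint locally for each replica that is down. This is not Cassandra's actual code; Replica is a stand-in for a remote node:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

interface Replica {
    boolean isUp();
    void apply(String key, String value);
    String name();
}

class CoordinatorSketch {
    // hints buffered locally, keyed by the name of the down replica
    private final Map<String, Queue<String[]>> hints = new HashMap<>();

    void write(String key, String value, List<Replica> replicas) {
        for (Replica r : replicas) {
            if (r.isUp()) {
                r.apply(key, value);                       // normal path
            } else {
                // hinted handoff: remember the write until the replica comes back
                hints.computeIfAbsent(r.name(), n -> new ArrayDeque<>())
                     .add(new String[] {key, value});
            }
        }
    }

    // called when a replica is seen alive again (e.g., via gossip)
    void replayHints(Replica recovered) {
        Queue<String[]> pending = hints.remove(recovered.name());
        if (pending == null) return;
        for (String[] kv : pending) recovered.apply(kv[0], kv[1]);
    }
}
```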

Page 13: CS 525 Advanced Distributed Systems  Spring 2014

13

Writes at a replica node

On receiving a write

• 1. Log it in the on-disk commit log

• 2. Make changes to the appropriate memtables
– In-memory representation of multiple key-value pairs

• Later, when a memtable is full or old, flush it to disk
– Data file: an SSTable (Sorted String Table), a list of key-value pairs sorted by key
– Index file: an SSTable of (key, position in data SSTable) pairs
– And a Bloom filter (for efficient search) – next slide

• Compaction: data updates accumulate over time, and SSTables and logs need to be compacted
– Merge SSTables, e.g., by merging updates for a key
– Run periodically and locally at each server

• Reads need to touch the log and multiple SSTables
– A row may be split over multiple SSTables
– Reads may be slower than writes
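
A sketch of the per-replica write path just described (commit log append, memtable update, flush to a key-sorted "SSTable"); the file formats and flush threshold are simplified placeholders, not Cassandra's:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative write path at one replica: durability first (commit log),
// then the in-memory memtable, flushed as a key-sorted data file when full.
class ReplicaWriteSketch {
    private final TreeMap<String, String> memtable = new TreeMap<>(); // sorted by key
    private final int flushThreshold;
    private int sstableCount = 0;

    ReplicaWriteSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) throws IOException {
        appendToCommitLog(key, value);        // 1. log it in the on-disk commit log
        memtable.put(key, value);             // 2. update the in-memory memtable
        if (memtable.size() >= flushThreshold) flushMemtable();
    }

    private void appendToCommitLog(String key, String value) throws IOException {
        try (FileWriter log = new FileWriter("commitlog.txt", true)) {
            log.write(key + "\t" + value + "\n");
        }
    }

    private void flushMemtable() throws IOException {
        // write key-sorted pairs to a new data file; a real system would also
        // write an index file and a Bloom filter alongside it
        String name = "sstable-" + (sstableCount++) + ".txt";
        try (FileWriter out = new FileWriter(name)) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
        memtable.clear();
    }
}
```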

Page 14: CS 525 Advanced Distributed Systems  Spring 2014

14

Bloom Filter

• Compact way of representing a set of items

• Checking for existence in set is cheap

• Some probability of false positives: an item not in set may check true as being in set

• Never false negatives

[Figure: a large bit map (bits 0..127 shown); key K is hashed by Hash1, Hash2, ..., Hashk to a few bit positions (e.g., 69 and 111), which are the bits set for K.]

• On insert, set all hashed bits.

• On check-if-present, return true if all hashed bits are set.
– May yield false positives

• False positive rate is low
– k = 4 hash functions, 100 items, 3200 bits => FP rate = 0.02%
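
A minimal Bloom filter sketch matching the description above (k hash functions over one bit map, no false negatives, small false-positive rate). The double-hashing trick used to derive the k functions is an implementation choice of this sketch, not something stated on the slide:

```java
import java.util.BitSet;

// Illustrative Bloom filter: m bits, k hash functions derived by double hashing.
class BloomFilterSketch {
    private final BitSet bits;
    private final int m, k;

    BloomFilterSketch(int m, int k) { this.m = m; this.k = k; this.bits = new BitSet(m); }

    private int bitIndex(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;                      // cheap second hash (odd)
        return Math.floorMod(h1 + i * h2, m);
    }

    void insert(String key) {                          // set all k hashed bits
        for (int i = 0; i < k; i++) bits.set(bitIndex(key, i));
    }

    boolean mightContain(String key) {                 // true iff all k hashed bits are set
        for (int i = 0; i < k; i++) if (!bits.get(bitIndex(key, i))) return false;
        return true;                                   // may be a false positive
    }

    public static void main(String[] args) {
        // 3200 bits, k = 4, 100 items, as in the slide's false-positive example
        BloomFilterSketch bf = new BloomFilterSketch(3200, 4);
        for (int i = 0; i < 100; i++) bf.insert("key-" + i);
        System.out.println(bf.mightContain("key-7"));      // always true (no false negatives)
        System.out.println(bf.mightContain("other-key"));  // usually false (~0.02% false positives)
    }
}
```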

Page 15: CS 525 Advanced Distributed Systems  Spring 2014

15

Deletes and Reads

• Delete: don’t delete the item right away
– Add a tombstone to the log
– Compaction will eventually remove the tombstone and delete the item

• Read: similar to writes, except
– Coordinator contacts a number of replicas (e.g., in the same rack) specified by the consistency level

» Forwards read to replicas that have responded quickest in past

» Returns latest timestamp value

– Coordinator also fetches value from multiple replicas

» check consistency in the background, initiating a read-repair if any two values are different

» Brings all replicas up to date

– A row may be split across multiple SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast)
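
A sketch of the read path just described: return the value with the latest timestamp among the replica responses, and flag a read repair if any two values differ. The types are illustrative, not Cassandra's:

```java
import java.util.List;

// Illustrative read: each replica response carries (value, write timestamp).
record Versioned(String value, long timestamp) {}

class ReadSketch {
    // Pick the freshest value; if replicas disagree, a read repair would be
    // initiated in the background to bring them up to date.
    static Versioned read(List<Versioned> responses) {
        Versioned latest = responses.get(0);
        boolean mismatch = false;
        for (Versioned v : responses) {
            if (v.timestamp() > latest.timestamp()) latest = v;
            if (!v.value().equals(responses.get(0).value())) mismatch = true;
        }
        if (mismatch) scheduleReadRepair(latest);
        return latest;
    }

    static void scheduleReadRepair(Versioned latest) {
        // placeholder: push 'latest' to the stale replicas asynchronously
        System.out.println("read repair needed, repairing to value=" + latest.value());
    }

    public static void main(String[] args) {
        System.out.println(read(List.of(
                new Versioned("old", 10), new Versioned("new", 12), new Versioned("new", 12))));
    }
}
```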

Page 16: CS 525 Advanced Distributed Systems  Spring 2014

16

Cassandra uses Quorums

• Quorum = a way of selecting sets so that any pair of sets intersects
– E.g., any arbitrary set with at least Q = N/2 + 1 nodes
– Where N = total number of replicas for this key

• Reads
– Wait for R replicas (R specified by clients)
– In the background, check for consistency of remaining N-R replicas, and initiate read repair if needed

• Writes come in two default flavors
– Block until quorum is reached
– Async: write to any node

• R = read replica count, W = write replica count

• If W+R > N and W > N/2, you have consistency, i.e., each read returns the latest written value

• Reasonable: (W=1, R=N) or (W=N, R=1) or (W=Q, R=Q)
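
A small sketch of the quorum arithmetic on this slide: Q = N/2 + 1 (integer division), and a check of whether a given (W, R) choice satisfies the consistency condition W + R > N and W > N/2:

```java
// Quorum arithmetic from the slide, for N replicas of a key.
class QuorumSketch {
    static int quorumSize(int n) {               // Q = N/2 + 1 (integer division)
        return n / 2 + 1;
    }

    // true iff every read quorum intersects every write quorum (W + R > N),
    // and any two write quorums intersect (W > N/2), so the latest write wins
    static boolean isConsistent(int w, int r, int n) {
        return w + r > n && 2 * w > n;
    }

    public static void main(String[] args) {
        int n = 3;                                       // e.g., 3 replicas per key
        int q = quorumSize(n);
        System.out.println("Q = " + q);                  // 2
        System.out.println(isConsistent(q, q, n));       // W=Q, R=Q -> true
        System.out.println(isConsistent(1, 1, n));       // W=1, R=1 -> false (may read stale)
    }
}
```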

Page 17: CS 525 Advanced Distributed Systems  Spring 2014

17

Cassandra Consistency Levels

• In reality, a client can choose one of these levels for a read/write operation:

– ANY: any node (may not be replica)

– ONE: at least one replica

– Similarly, TWO, THREE

– QUORUM: quorum across all replicas in all datacenters

– LOCAL_QUORUM: in coordinator’s DC

– EACH_QUORUM: quorum in every DC

– ALL: all replicas all DCs

• For writes, you also have
– SERIAL: claims to implement linearizability for transactions

Page 18: CS 525 Advanced Distributed Systems  Spring 2014

18

Cluster Membership – Gossip-Style

Protocol:

• Nodes periodically gossip their membership list
• On receipt, the local membership list is updated, as shown
• If any heartbeat is older than Tfail, the node is marked as failed

[Figure/animation: each membership list has columns Address, Heartbeat Counter, Time (local); clocks are asynchronous, and the current time is 70 at node 2.

List gossiped by node 1:
1  10120  66
2  10103  62
3  10098  63
4  10111  65

Node 2's list before the merge:
1  10118  64
2  10110  64
3  10090  58
4  10111  65

Node 2's list after the merge (higher heartbeats adopted, stamped with local time 70):
1  10120  70
2  10110  64
3  10098  70
4  10111  65]

Fig and animation by: Dongyun Jin and Thuy Ngyuen

Cassandra uses gossip-based cluster membership
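
A sketch of the membership-list merge shown in the figure: for each address, keep the higher heartbeat counter and stamp it with the local time (Tfail handling is discussed on the next slides). The names and types are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative gossip merge: the membership list maps address -> (heartbeat, local update time).
class GossipSketch {
    record Entry(long heartbeat, long localTime) {}

    final Map<Integer, Entry> members = new HashMap<>();

    // Merge a gossiped list into the local list, as in the figure:
    // take the higher heartbeat and record the local time of the update.
    void mergeGossip(Map<Integer, Long> received, long now) {
        for (Map.Entry<Integer, Long> e : received.entrySet()) {
            Entry local = members.get(e.getKey());
            if (local == null || e.getValue() > local.heartbeat()) {
                members.put(e.getKey(), new Entry(e.getValue(), now));
            }
        }
    }

    public static void main(String[] args) {
        GossipSketch node2 = new GossipSketch();
        node2.members.put(1, new Entry(10118, 64));
        node2.members.put(2, new Entry(10110, 64));
        node2.members.put(3, new Entry(10090, 58));
        node2.members.put(4, new Entry(10111, 65));
        // list gossiped by node 1, merged at local time 70
        node2.mergeGossip(Map.of(1, 10120L, 2, 10103L, 3, 10098L, 4, 10111L), 70);
        System.out.println(node2.members);
        // entries 1 and 3 become (10120, 70) and (10098, 70); 2 and 4 are unchanged
    }
}
```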

Page 19: CS 525 Advanced Distributed Systems  Spring 2014

19

Gossip-Style Failure Detection

• If the heartbeat has not increased for more than Tfail seconds (according to local time), the member is considered failed

• But don’t delete it right away

• Wait an additional Tfail seconds, then delete the member from the list

– Why?

Page 20: CS 525 Advanced Distributed Systems  Spring 2014

20

Gossip-Style Failure Detection

• What if an entry pointing to a failed process is deleted right after Tfail (= 24) seconds?

• Fix: remember for another Tfail

• Ignore gossips for failed members
– Don’t include failed members in gossip messages

[Figure (current time 75 at process 2): node 2 has already deleted the entry for failed node 3 from its list:
1  10120  66
2  10110  64
4  10111  65
Another node still gossips a list containing node 3 with heartbeat 10098. If node 2 merged it naively, node 3 would reappear in node 2's list as a seemingly fresh entry:
1  10120  66
2  10110  64
3  10098  75
4  10111  65]

Page 21: CS 525 Advanced Distributed Systems  Spring 2014

21

Cluster Membership, contd.

• Suspicion mechanisms to adaptively set the timeout

• Accrual detector: Failure Detector outputs a value (PHI) representing suspicion

• Apps set an appropriate threshold

• PHI = 5 => 10-15 sec detection time

• PHI calculation for a member
– Based on inter-arrival times of gossip messages
– PHI(t) = -log10( P(t_now – t_last) ), where P is obtained from the CDF of past inter-arrival times

– PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats
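
A sketch of the accrual (PHI) calculation under a simplifying assumption that gossip inter-arrival times are exponentially distributed with the observed mean; real implementations fit other distributions, so treat this only as an illustration of the -log10 idea:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative accrual failure detector: PHI(t) = -log10( P(t_now - t_last) ),
// with P modeled here as exp(-delta / mean) from a window of past inter-arrival times.
class PhiAccrualSketch {
    private final Deque<Double> interArrivals = new ArrayDeque<>();
    private final int window;
    private double lastHeartbeatTime = Double.NaN;

    PhiAccrualSketch(int window) { this.window = window; }

    void heartbeat(double now) {
        if (!Double.isNaN(lastHeartbeatTime)) {
            interArrivals.add(now - lastHeartbeatTime);
            if (interArrivals.size() > window) interArrivals.removeFirst();
        }
        lastHeartbeatTime = now;
    }

    double phi(double now) {
        double mean = interArrivals.stream().mapToDouble(Double::doubleValue).average().orElse(1.0);
        double delta = now - lastHeartbeatTime;
        double pLater = Math.exp(-delta / mean);     // chance a heartbeat is still "on its way"
        return -Math.log10(pLater);                  // large PHI => strong suspicion of failure
    }

    public static void main(String[] args) {
        PhiAccrualSketch fd = new PhiAccrualSketch(100);
        for (double t = 0; t <= 10; t += 1.0) fd.heartbeat(t);   // heartbeats every 1 s
        System.out.println(fd.phi(11.0));   // ~0.43: just on time, low suspicion
        System.out.println(fd.phi(21.0));   // ~4.8: very late, high suspicion
    }
}
```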

Page 22: CS 525 Advanced Distributed Systems  Spring 2014

22

Data Placement Strategies

• Replication Strategy: two options:
1. SimpleStrategy
2. NetworkTopologyStrategy

• SimpleStrategy: uses the Partitioner
1. RandomPartitioner: Chord-like hash partitioning
2. ByteOrderedPartitioner: assigns ranges of keys to servers
» Easier for range queries (e.g., get me all twitter users starting with [a-b])

• NetworkTopologyStrategy: for multi-DC deployments
– Two replicas per DC: allows a consistency level of ONE
– Three replicas per DC: allows a consistency level of LOCAL_QUORUM
– Per DC:
» First replica placed according to Partitioner
» Then go clockwise around ring until you hit a different rack
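
A sketch of the per-DC placement rule just described for NetworkTopologyStrategy: place the first replica where the partitioner says, then walk clockwise and skip nodes until a different rack is reached. The data structures are illustrative, not Cassandra's:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative per-DC replica placement: nodes listed in ring (clockwise) order.
class PlacementSketch {
    record Node(String name, String rack) {}

    // primaryIndex = position chosen by the partitioner for the first replica
    static List<Node> placeReplicas(List<Node> ringOrder, int primaryIndex, int replicasWanted) {
        List<Node> chosen = new ArrayList<>();
        Set<String> racksUsed = new LinkedHashSet<>();
        Node primary = ringOrder.get(primaryIndex);
        chosen.add(primary);
        racksUsed.add(primary.rack());
        // walk clockwise; take only nodes on racks not used yet
        for (int step = 1; step < ringOrder.size() && chosen.size() < replicasWanted; step++) {
            Node candidate = ringOrder.get((primaryIndex + step) % ringOrder.size());
            if (!racksUsed.contains(candidate.rack())) {
                chosen.add(candidate);
                racksUsed.add(candidate.rack());
            }
        }
        return chosen;   // a real strategy also falls back to reused racks if needed
    }

    public static void main(String[] args) {
        List<Node> ring = List.of(
                new Node("N16", "rack1"), new Node("N32", "rack1"),
                new Node("N45", "rack2"), new Node("N80", "rack2"),
                new Node("N96", "rack3"), new Node("N112", "rack3"));
        System.out.println(placeReplicas(ring, 0, 3)); // N16 (rack1), N45 (rack2), N96 (rack3)
    }
}
```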

Page 23: CS 525 Advanced Distributed Systems  Spring 2014

23

Snitches

• Maps: IPs to racks and DCs. Configured in cassandra.yaml config file

• Some options:
– SimpleSnitch: unaware of topology (rack-unaware)

– RackInferring: Assumes topology of network by octet of server’s IP address

» 101.201.301.401 = x.<DC octet>.<rack octet>.<node octet>

– PropertyFileSnitch: uses a config file

– EC2Snitch: uses EC2.

» EC2 Region = DC

» Availability zone = rack

• Other snitch options available

Page 24: CS 525 Advanced Distributed Systems  Spring 2014

24

Vs. SQL

• MySQL is one of the most popular (and has been for a while)

• On > 50 GB data

• MySQL
– Writes: 300 ms avg
– Reads: 350 ms avg

• Cassandra
– Writes: 0.12 ms avg
– Reads: 15 ms avg

Page 25: CS 525 Advanced Distributed Systems  Spring 2014

25

Cassandra Summary

• While RDBMS provide ACID (Atomicity Consistency Isolation Durability)

• Cassandra provides BASE
– Basically Available Soft-state Eventual consistency
– Prefers Availability over consistency

• Other NoSQL products
– MongoDB, Riak (look them up!)

• Next: HBase
– Prefers (strong) Consistency over Availability

Page 26: CS 525 Advanced Distributed Systems  Spring 2014

26

HBase

• Google’s BigTable was first “blob-based” storage system

• Yahoo! Open-sourced it -> HBase

• Major Apache project today

• Facebook uses HBase internally

• API
– Get/Put(row)
– Scan(row range, filter) – range queries
– MultiPut
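
A sketch of the Get/Put/Scan API against the HBase Java client (1.x-era classes such as ConnectionFactory and Table; exact method names vary across HBase versions, so treat this as illustrative). The table and column names are made up, echoing the census example used later:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table census = conn.getTable(TableName.valueOf("census"))) {   // hypothetical table

            // Put(row): write one cell into column family "Demographic", column "Ethnicity"
            Put put = new Put(Bytes.toBytes("SSN:000-01-2345"));
            put.addColumn(Bytes.toBytes("Demographic"), Bytes.toBytes("Ethnicity"), Bytes.toBytes("..."));
            census.put(put);

            // Get(row): read the cell back
            Result row = census.get(new Get(Bytes.toBytes("SSN:000-01-2345")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("Demographic"), Bytes.toBytes("Ethnicity"))));

            // Scan(row range): range query over a span of row keys
            Scan scan = new Scan(Bytes.toBytes("SSN:000-00-0000"), Bytes.toBytes("SSN:000-99-9999"));
            try (ResultScanner results = census.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```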

Page 27: CS 525 Advanced Distributed Systems  Spring 2014

27

HBase Architecture

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Small group of servers running Zab, a consensus protocol (Paxos-like)

HDFS

Page 28: CS 525 Advanced Distributed Systems  Spring 2014

28

HBase Storage hierarchy

• HBase Table
– Split into multiple regions: replicated across servers
» One Store per combination of ColumnFamily (subset of columns with similar query patterns) + region

• Memstore for each Store: in-memory updates to the Store; flushed to disk when full
– StoreFiles for each Store for each region: where the data lives
» Blocks

• HFile
– SSTable from Google’s BigTable

Page 29: CS 525 Advanced Distributed Systems  Spring 2014

29

HFile

Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/

[HFile layout figure, for a census table example: row key SSN:000-01-2345, column family Demographic, column Ethnicity.]

Page 30: CS 525 Advanced Distributed Systems  Spring 2014

30

Strong Consistency: HBase Write-Ahead Log

Write to HLog before writing to MemStore. Thus, the system can recover from failure.

Source: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

Page 31: CS 525 Advanced Distributed Systems  Spring 2014

31

Log Replay

• After recovery from failure, or upon bootup (HRegionServer/HMaster)

– Replay any stale logs (use timestamps to find out where the database is w.r.t. the logs)

– Replay: add edits to the MemStore

• Keeps one HLog per HRegionServer rather than per region

– Avoids many concurrent writes, which on the local file system may involve many disk seeks

Page 32: CS 525 Advanced Distributed Systems  Spring 2014

32

Cross-data center replication

HLog

Zookeeper is actually a file system for control information:
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>

Page 33: CS 525 Advanced Distributed Systems  Spring 2014

33

Wrap Up

• Key-value stores and NoSQL trade off between Availability and Consistency, while providing partition-tolerance

• Presentations and reviews start this Thursday
– Presenters: please email me slides at least 24 hours before your presentation time
– Everyone: please read the instructions on the website
– If you have not signed up for a presentation slot, you’ll need to do all reviews
– Project meetings starting soon