Transcript
Slide 1
CASSANDRA: A Decentralized Structured Storage System - Avinash Lakshman
& Prashant Malik - Presented by Srinidhi Katla
Slide 2
Topics covered: What is Cassandra; Motive; Data Model; Architecture; The After Story; Applications
Slide 3
Features of Cassandra:
- Distributed storage system
- Manages very large amounts of data
- Highly available, with no single point of failure
- Simple data model
- Dynamic control over data layout and format
- Designed to run on cheap commodity hardware
- Handles high write throughput without sacrificing read efficiency
Slide 4
Motives behind Cassandra: the storage needs of the Inbox Search problem.
o High write throughput
o Increasing number of users
o High search latencies due to data distribution
Operational requirements:
o Scalability
o Handling hardware failure
Inbox Search was launched in 2008 for 100 million users; Cassandra is deployed as the backend storage system for multiple services within Facebook.
Slide 5
Data Model Based on Amazon's Dynamo and Google's Bigtable.
- Table: a distributed multi-dimensional map indexed by a key. Consists of: row key, column, column family, super column family.
- Row key: roughly equivalent to the primary index of an RDBMS table.
- Column: a (name, value, timestamp) triple (e.g., color=red).
- Column family: a set of columns grouped together; comes in two kinds, the simple column family and the super column family.
- Super column family: a column family within a column family.
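The multi-dimensional map described above can be sketched as nested dictionaries. This is an illustrative simplification, not Cassandra's storage code; the function and variable names are assumptions:

```python
# Sketch of the Cassandra data model as nested Python dicts:
# row key -> column family -> column name -> (value, timestamp).
# A super column family would add one more level of nesting.
import time

table = {}  # row key -> column family -> column -> (value, timestamp)

def insert_column(table, row_key, column_family, column, value):
    """Store a (value, timestamp) pair under row_key/column_family/column."""
    cf = table.setdefault(row_key, {}).setdefault(column_family, {})
    cf[column] = (value, time.time())

insert_column(table, "user42", "profile", "color", "red")
value, timestamp = table["user42"]["profile"]["color"]
```

The timestamp attached to every column value is what lets replicas later reconcile concurrent writes (newest timestamp wins).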
Slide 6
Column Family
Image courtesy: http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
Slide 7
Column Family (contd.) Columns are accessed using the convention column_family:column; a super column is accessed as column_family:supercolumn:column.
Slide 8
Facebook's use of the super column abstraction:
- Term search: user ID = row key; term searched = super column; message identifiers of messages containing the word = columns.
- Interactions: user ID = row key; recipients' IDs = super columns; individual message identifiers = columns.
Slide 9
API Cassandra exposes a simple Thrift API with three operations: insert(table, key, rowMutation), get(table, key, columnName), delete(table, key, columnName).
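The three-call surface can be mimicked with a tiny in-memory stand-in. This is a hypothetical illustration of the API shape from the paper, not the real Thrift bindings:

```python
# Hypothetical in-memory stand-in mirroring Cassandra's three-call API.
class TinyCassandra:
    def __init__(self):
        self._data = {}  # (table, key) -> {column name: value}

    def insert(self, table, key, row_mutation):
        """row_mutation: dict of column name -> value to merge into the row."""
        self._data.setdefault((table, key), {}).update(row_mutation)

    def get(self, table, key, column_name):
        """Return the column's value, or None if absent."""
        return self._data.get((table, key), {}).get(column_name)

    def delete(self, table, key, column_name):
        """Remove one column from the row, if present."""
        self._data.get((table, key), {}).pop(column_name, None)

db = TinyCassandra()
db.insert("Inbox", "user42", {"term:hello": "msg-1"})
assert db.get("Inbox", "user42", "term:hello") == "msg-1"
db.delete("Inbox", "user42", "term:hello")
assert db.get("Inbox", "user42", "term:hello") is None
```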
Slide 10
Architecture: Partitioning, Replication, Membership and Failure Detection, Bootstrapping, Scaling the Cluster, Local Persistence
Slide 11
Partitioning Data is partitioned dynamically over the nodes to aid scaling. Cassandra implements order-preserving consistent hashing (CH); consistent hashing determines the coordinator node for each data key.
- Advantage of CH: the departure or arrival of a node affects only its immediate neighbours.
- Disadvantages of CH: non-uniform data distribution, and hashing is unaware of the heterogeneity in node performance.
- Cassandra's solution: lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
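The ring lookup above can be sketched in a few lines. This is a minimal illustration of consistent hashing, not Cassandra's implementation; names and the choice of MD5 are assumptions:

```python
# Minimal consistent-hashing ring: each node sits at a hash position, and a
# key is served by the first node clockwise from the key's own hash. Adding
# or removing a node therefore only moves keys on the arc next to it.
import bisect
import hashlib

def _position(name):
    """Map a node name or key onto the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._ring = sorted((_position(n), n) for n in nodes)

    def coordinator(self, key):
        """First node clockwise from the key's position (wrapping around)."""
        positions = [p for p, _ in self._ring]
        i = bisect.bisect_right(positions, _position(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["nodeA", "nodeB", "nodeC"])
node = ring.coordinator("some-key")
```

Note this plain scheme hashes keys uniformly; the order-preserving variant the paper uses keeps keys in sorted order on the ring so that range scans stay local.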
Slide 12
Replication Required for high availability and durability, governed by a replication factor N. The coordinator node is responsible for replicating its data on N-1 other nodes.
Replication policies:
- Rack Unaware: data is replicated to the N-1 successors of the coordinator on the ring.
- Rack Aware and Datacenter Aware: a leader elected via ZooKeeper informs the nodes which ranges they are replicas for.
Metadata about the ranges a node is responsible for is cached locally at the node and also stored in ZooKeeper. The nodes responsible for a key form its preference list.
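The Rack Unaware policy is simple enough to sketch directly. An illustrative helper (names are assumptions), assuming the ring's nodes are already in ring order:

```python
# Rack Unaware placement: with replication factor n, the coordinator plus its
# n-1 clockwise successors on the ring form the key's preference list.
def preference_list(ring_nodes, coordinator_index, n):
    """ring_nodes: node names in ring order. Returns up to n distinct replicas,
    starting at the coordinator and walking clockwise with wrap-around."""
    size = len(ring_nodes)
    return [ring_nodes[(coordinator_index + i) % size] for i in range(min(n, size))]

nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
replicas = preference_list(nodes, 2, 3)   # coordinator nodeC, N = 3
assert replicas == ["nodeC", "nodeD", "nodeA"]
```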
Slide 13
Membership and Failure Detection Membership is based on Scuttlebutt, a gossip-based mechanism with efficient CPU utilization and efficient use of the gossip channel. Gossip is used both for membership and to disseminate system-related control state.
Failure detection checks whether a node is reachable, so that attempts to communicate with unreachable nodes can be avoided. Cassandra uses a modified Accrual Failure Detector, which emits a continuous suspicion level Φ for each node instead of a Boolean up/down value.
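The idea behind the suspicion level can be sketched as follows. This is a simplified illustration of the accrual approach, assuming (as the modified detector in the paper does) exponentially distributed heartbeat inter-arrival times; the real detector maintains a sliding window of observed intervals:

```python
# Accrual failure detection sketch: instead of a yes/no verdict, emit a
# suspicion level phi that grows the longer a heartbeat is overdue.
# With exponentially distributed inter-arrivals of observed mean m,
# P(next heartbeat arrives after a silence of t) = exp(-t / m),
# and phi = -log10 of that probability.
import math

def phi(seconds_since_last_heartbeat, mean_heartbeat_interval):
    p_later = math.exp(-seconds_since_last_heartbeat / mean_heartbeat_interval)
    return -math.log10(p_later)

# With a 1 s mean interval: a 1 s silence is barely suspicious (~0.43),
# a 10 s silence is very suspicious (~4.34). The application picks its own
# threshold on phi, trading detection speed against false positives.
assert phi(1.0, 1.0) < phi(10.0, 1.0)
```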
Slide 14
Bootstrapping & Scaling The token assigned to a new node is gossiped among all the nodes. A new node is assigned a token so as to alleviate a heavily loaded node; that node splits part of its data and responsibility with the newcomer. The new node reads its configuration from ZooKeeper. Node outages are usually transient, so rebalancing of partition assignment or repair of unreachable replicas should be avoided; changes of node membership are therefore manual. Operational experience shows that data can be transferred at about 40 MB/s from a single node; this could be improved by having multiple replicas take part in bootstrapping.
Slide 15
Local Persistence Cassandra relies on the local file system, with a dedicated disk on each machine for the commit log to maximize disk throughput.
Write path:
- Data is first written to the commit log and then to an in-memory data structure.
- Once the in-memory structure crosses a size threshold, it is dumped to disk, and an index is created for efficient lookup.
- Many such files accumulate on disk over time; a merge process collates them into one file, similar to the compaction process in Bigtable.
- A block index is generated for every 256 KB chunk for efficient lookup within columns.
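The write path above can be sketched as a tiny storage engine. This is an illustrative model, not Cassandra's code; class names and the threshold are assumptions:

```python
# Write-path sketch: append to the commit log first (durability), then update
# the in-memory structure; once it crosses a size threshold, dump it as an
# immutable on-disk file (modeled here as a frozen dict in a list).
class StorageEngine:
    def __init__(self, threshold=3):
        self.commit_log = []      # sequential append-only log
        self.memtable = {}        # in-memory data structure
        self.sstables = []        # immutable dumped files, newest first
        self.threshold = threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1) log sequentially
        self.memtable[key] = value             # 2) update in memory
        if len(self.memtable) >= self.threshold:
            self._flush()                      # 3) dump when full

    def _flush(self):
        # Dumped files are never mutated afterwards, so reads need no locks.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable.clear()

engine = StorageEngine(threshold=2)
engine.write("a", 1)
engine.write("b", 2)   # crosses the threshold and triggers a flush
```

Because the log and the dumps are both sequential appends, every disk write is sequential, which is exactly the throughput argument the slide makes.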
Slide 16
Local Persistence (contd.) Read path: query the in-memory data structure first, then look up on disk, examining files in order from newest to oldest. A Bloom filter per file is checked to see whether the key could exist in that file, and column indices speed up the lookup within a file.
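The read path can be sketched to mirror the write-path structure. For brevity a plain set of keys stands in for the Bloom filter; a real Bloom filter answers "maybe present / definitely absent" probabilistically in far less space:

```python
# Read-path sketch: check the in-memory table first, then on-disk files from
# newest to oldest, skipping any file whose key filter rules the key out.
def read(memtable, sstables, key):
    """sstables: list of (key_filter, data_dict) pairs, newest first.
    key_filter is a set standing in for the per-file Bloom filter."""
    if key in memtable:
        return memtable[key]
    for key_filter, data in sstables:
        if key in key_filter:        # skip files that cannot hold the key
            return data.get(key)
    return None

sstables = [({"a"}, {"a": 1}), ({"a", "b"}, {"a": 0, "b": 2})]
assert read({}, sstables, "a") == 1   # newest file wins over the older one
assert read({}, sstables, "b") == 2
```

Checking newest-to-oldest is what makes the most recent write for a key authoritative without ever rewriting older files.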
Slide 17
Reads and Writes A request for a key is routed to some node in the cluster, which determines the replicas for the key and routes the request to them; the request fails if replies are not received within a time bound.
- Writes: the request is routed to the replicas, and the system waits for a quorum of replicas to acknowledge completion of the write.
- Reads: depending on the consistency guarantee set by the client, the request is routed either to the closest replica, or to all replicas while waiting for a quorum of responses.
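The quorum wait can be sketched as below. This is an illustrative model; replica transport is faked with plain callables, and in reality the coordinator waits on asynchronous responses with a timeout:

```python
# Coordinator sketch: forward the write to every replica and report success
# once a majority (the quorum) has acknowledged it.
def quorum_write(replicas, key, value):
    """replicas: callables that return True when they acknowledge the write."""
    needed = len(replicas) // 2 + 1            # majority quorum
    acks = sum(1 for replica in replicas if replica(key, value))
    return acks >= needed

healthy = lambda k, v: True    # replica that acknowledges
down = lambda k, v: False      # replica that never answers

assert quorum_write([healthy, healthy, down], "k", "v")        # 2 of 3 ack
assert not quorum_write([healthy, down, down], "k", "v")       # only 1 of 3
```

With N = 3 this tolerates one unreachable replica per write while still guaranteeing that any two quorums intersect.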
Slide 18
Implementation On each machine, Cassandra comprises a partitioning module, cluster membership and failure detection, and a storage engine, all implemented from the ground up in Java. Commit log entries are purged using a rolling commit log mechanism with 128 MB segments. There is an in-memory data structure and a data file for every column family. All writes to disk are sequential to maximize throughput, and no locks are needed because the files dumped to disk are never mutated.
Slide 19
The After Story Cassandra was released as an open-source project on Google Code in July 2008 and is now developed by the Apache Software Foundation as Apache Cassandra (henceforth referred to as Cassandra in these slides).
- In Apache Cassandra, super columns were removed due to performance issues; composite columns were introduced instead.
- The Cassandra Query Language (CQL) presents a data model familiar to relational database users.
- Partitioning is still based on consistent hashing, but the project moved away from ring-position load balancing in favor of virtual nodes, and the order-preserving hash function was removed in favor of a true OrderedPartitioner (later superseded by ByteOrderedPartitioner).
Slide 20
The After Story (contd.) In modern Cassandra terminology, the coordinator is the node that processes a given client's request and routes it to the appropriate replicas; it is not necessarily itself a replica. ZooKeeper usage was restricted to Facebook's in-house Cassandra branch; modern Cassandra management tools include DataStax's OpsCenter and Netflix's Priam.
Slide 21
Big Players
- Facebook's Inbox Search feature was implemented on Cassandra, where every user is an index and the recipients and messages are stored as columns. The system currently stores more than 50 TB of data on a 150-node cluster with a median search latency of approximately 15 ms.
- Netflix, a video streaming firm, stores 95% of its data in Cassandra.
- eBay implemented Cassandra for features such as the "own", "want", and "like" counts on its web pages.
- Coursera, an online training service, uses Cassandra for its mobile applications.