Transcript
Slide 1
CASSANDRA: A Decentralized Structured Storage System - Avinash Lakshman
& Prashant Malik - Presented by Srinidhi Katla
Slide 2
Topics covered: What is Cassandra; Motive; Data Model; Architecture; The After Story; Applications
Slide 3
Features of Cassandra:
- Distributed storage system
- Manages very large amounts of data
- Highly available, with no single point of failure
- Simple data model
- Dynamic control over data layout and format
- Designed to run on cheap commodity hardware
- Handles high write throughput without sacrificing read efficiency
Slide 4
Motives behind Cassandra: the storage needs of the Inbox Search problem.
o High write throughput
o Increasing number of users
o High search latencies due to data distribution
Operational requirements:
o Scalability
o Handling hardware failure
Inbox Search was launched in 2008 for 100 million users; Cassandra is deployed as the backend storage system for multiple services within Facebook.
Slide 5
Data Model Based on Amazon's Dynamo and Google's Bigtable.
- Table: a distributed multi-dimensional map indexed by a key. Consists of: row key, column, column family, super column family.
- Row key: roughly equivalent to the primary index of an RDBMS table.
- Column: a (name, value, timestamp) triple (e.g., color=red).
- Column family: a set of columns grouped together; comes in two kinds, the simple column family and the super column family.
- Super column family: a column family within a column family.
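The multi-dimensional map described above can be sketched as nested dictionaries. This is an illustrative simplification, not Cassandra's storage code; the function and variable names are assumptions:

```python
# Sketch of the Cassandra data model as nested Python dicts:
# row key -> column family -> column name -> (value, timestamp).
# A super column family would add one more level of nesting.
import time

table = {}  # row key -> column family -> column -> (value, timestamp)

def insert_column(table, row_key, column_family, column, value):
    """Store a (value, timestamp) pair under row_key/column_family/column."""
    cf = table.setdefault(row_key, {}).setdefault(column_family, {})
    cf[column] = (value, time.time())

insert_column(table, "user42", "profile", "color", "red")
value, timestamp = table["user42"]["profile"]["color"]
```

The timestamp attached to every column value is what lets replicas later reconcile concurrent writes (newest timestamp wins).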
Slide 6
Column Family
Image courtesy: http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
Slide 7
Column Family (contd.) Columns are accessed using the convention column_family:column; a super column is accessed as column_family:supercolumn:column.
Slide 8
Facebook's use of the super column abstraction:
- Term search: user ID = row key; term searched = super column; message identifiers of messages containing the word = columns.
- Interactions: user ID = row key; recipients' IDs = super columns; individual message identifiers = columns.
Slide 9
API Cassandra exposes a simple Thrift API with three operations: insert(table, key, rowMutation), get(table, key, columnName), delete(table, key, columnName).
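The three-call surface can be mimicked with a tiny in-memory stand-in. This is a hypothetical illustration of the API shape from the paper, not the real Thrift bindings:

```python
# Hypothetical in-memory stand-in mirroring Cassandra's three-call API.
class TinyCassandra:
    def __init__(self):
        self._data = {}  # (table, key) -> {column name: value}

    def insert(self, table, key, row_mutation):
        """row_mutation: dict of column name -> value to merge into the row."""
        self._data.setdefault((table, key), {}).update(row_mutation)

    def get(self, table, key, column_name):
        """Return the column's value, or None if absent."""
        return self._data.get((table, key), {}).get(column_name)

    def delete(self, table, key, column_name):
        """Remove one column from the row, if present."""
        self._data.get((table, key), {}).pop(column_name, None)

db = TinyCassandra()
db.insert("Inbox", "user42", {"term:hello": "msg-1"})
assert db.get("Inbox", "user42", "term:hello") == "msg-1"
db.delete("Inbox", "user42", "term:hello")
assert db.get("Inbox", "user42", "term:hello") is None
```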
Slide 10
Architecture: Partitioning, Replication, Membership and Failure Detection, Bootstrapping, Scaling the Cluster, Local Persistence
Slide 11
Partitioning Data is partitioned dynamically over the nodes to aid scaling. Cassandra implements order-preserving consistent hashing (CH); consistent hashing determines the coordinator node for each data key.
- Advantage of CH: the departure or arrival of a node affects only its immediate neighbours.
- Disadvantages of CH: non-uniform data distribution, and hashing is unaware of the heterogeneity in node performance.
- Cassandra's solution: lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
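The ring lookup above can be sketched in a few lines. This is a minimal illustration of consistent hashing, not Cassandra's implementation; names and the choice of MD5 are assumptions:

```python
# Minimal consistent-hashing ring: each node sits at a hash position, and a
# key is served by the first node clockwise from the key's own hash. Adding
# or removing a node therefore only moves keys on the arc next to it.
import bisect
import hashlib

def _position(name):
    """Map a node name or key onto the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._ring = sorted((_position(n), n) for n in nodes)

    def coordinator(self, key):
        """First node clockwise from the key's position (wrapping around)."""
        positions = [p for p, _ in self._ring]
        i = bisect.bisect_right(positions, _position(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["nodeA", "nodeB", "nodeC"])
node = ring.coordinator("some-key")
```

Note this plain scheme hashes keys uniformly; the order-preserving variant the paper uses keeps keys in sorted order on the ring so that range scans stay local.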
Slide 12
Replication Required for high availability and durability, governed by a replication factor N. The coordinator node is responsible for replicating its data on N-1 other nodes.
Replication policies:
- Rack Unaware: data is replicated to the N-1 successors of the coordinator on the ring.
- Rack Aware and Datacenter Aware: a leader elected via ZooKeeper informs the nodes which ranges they are replicas for.
Metadata about the ranges a node is responsible for is cached locally at the node and also stored in ZooKeeper. The nodes responsible for a key form its preference list.
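The Rack Unaware policy is simple enough to sketch directly. An illustrative helper (names are assumptions), assuming the ring's nodes are already in ring order:

```python
# Rack Unaware placement: with replication factor n, the coordinator plus its
# n-1 clockwise successors on the ring form the key's preference list.
def preference_list(ring_nodes, coordinator_index, n):
    """ring_nodes: node names in ring order. Returns up to n distinct replicas,
    starting at the coordinator and walking clockwise with wrap-around."""
    size = len(ring_nodes)
    return [ring_nodes[(coordinator_index + i) % size] for i in range(min(n, size))]

nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
replicas = preference_list(nodes, 2, 3)   # coordinator nodeC, N = 3
assert replicas == ["nodeC", "nodeD", "nodeA"]
```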
Slide 13
Membership and Failure Detection Membership is based on Scuttlebutt, a gossip-based mechanism with efficient CPU utilization and efficient use of the gossip channel. Gossip is used both for membership and to disseminate system-related control state.
Failure detection checks whether a node is reachable, so that attempts to communicate with unreachable nodes can be avoided. Cassandra uses a modified Accrual Failure Detector, which emits a continuous suspicion level Φ for each node instead of a Boolean up/down value.
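The idea behind the suspicion level can be sketched as follows. This is a simplified illustration of the accrual approach, assuming (as the modified detector in the paper does) exponentially distributed heartbeat inter-arrival times; the real detector maintains a sliding window of observed intervals:

```python
# Accrual failure detection sketch: instead of a yes/no verdict, emit a
# suspicion level phi that grows the longer a heartbeat is overdue.
# With exponentially distributed inter-arrivals of observed mean m,
# P(next heartbeat arrives after a silence of t) = exp(-t / m),
# and phi = -log10 of that probability.
import math

def phi(seconds_since_last_heartbeat, mean_heartbeat_interval):
    p_later = math.exp(-seconds_since_last_heartbeat / mean_heartbeat_interval)
    return -math.log10(p_later)

# With a 1 s mean interval: a 1 s silence is barely suspicious (~0.43),
# a 10 s silence is very suspicious (~4.34). The application picks its own
# threshold on phi, trading detection speed against false positives.
assert phi(1.0, 1.0) < phi(10.0, 1.0)
```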
Slide 14
Bootstrapping & Scaling The token assigned to a new node is gossiped among all the nodes. A new node is assigned a token so as to alleviate a heavily loaded node; that node splits part of its data and responsibility with the newcomer. The new node reads its configuration from ZooKeeper. Node outages are usually transient, so rebalancing of partition assignment or repair of unreachable replicas should be avoided; changes of node membership are therefore manual. Operational experience shows that data can be transferred at about 40 MB/s from a single node; this could be improved by having multiple replicas take part in bootstrapping.
Slide 15
Local Persistence Cassandra relies on the local file system, with a dedicated disk on each machine for the commit log to maximize disk throughput.
Write path:
- Data is first written to the commit log and then to an in-memory data structure.
- Once the in-memory structure crosses a size threshold, it is dumped to disk, and an index is created for efficient lookup.
- Many such files accumulate on disk over time; a merge process collates them into one file, similar to the compaction process in Bigtable.
- A block index is generated for every 256 KB chunk for efficient lookup within columns.
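The write path above can be sketched as a tiny storage engine. This is an illustrative model, not Cassandra's code; class names and the threshold are assumptions:

```python
# Write-path sketch: append to the commit log first (durability), then update
# the in-memory structure; once it crosses a size threshold, dump it as an
# immutable on-disk file (modeled here as a frozen dict in a list).
class StorageEngine:
    def __init__(self, threshold=3):
        self.commit_log = []      # sequential append-only log
        self.memtable = {}        # in-memory data structure
        self.sstables = []        # immutable dumped files, newest first
        self.threshold = threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1) log sequentially
        self.memtable[key] = value             # 2) update in memory
        if len(self.memtable) >= self.threshold:
            self._flush()                      # 3) dump when full

    def _flush(self):
        # Dumped files are never mutated afterwards, so reads need no locks.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable.clear()

engine = StorageEngine(threshold=2)
engine.write("a", 1)
engine.write("b", 2)   # crosses the threshold and triggers a flush
```

Because the log and the dumps are both sequential appends, every disk write is sequential, which is exactly the throughput argument the slide makes.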
Slide 16
Local Persistence (contd.) Read path: query the in-memory data structure first, then look up on disk, examining files in order from newest to oldest. A Bloom filter per file is checked to see whether the key could exist in that file, and column indices speed up the lookup within a file.
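The read path can be sketched to mirror the write-path structure. For brevity a plain set of keys stands in for the Bloom filter; a real Bloom filter answers "maybe present / definitely absent" probabilistically in far less space:

```python
# Read-path sketch: check the in-memory table first, then on-disk files from
# newest to oldest, skipping any file whose key filter rules the key out.
def read(memtable, sstables, key):
    """sstables: list of (key_filter, data_dict) pairs, newest first.
    key_filter is a set standing in for the per-file Bloom filter."""
    if key in memtable:
        return memtable[key]
    for key_filter, data in sstables:
        if key in key_filter:        # skip files that cannot hold the key
            return data.get(key)
    return None

sstables = [({"a"}, {"a": 1}), ({"a", "b"}, {"a": 0, "b": 2})]
assert read({}, sstables, "a") == 1   # newest file wins over the older one
assert read({}, sstables, "b") == 2
```

Checking newest-to-oldest is what makes the most recent write for a key authoritative without ever rewriting older files.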
Slide 17
Reads and Writes A request for a key is routed to some node in the cluster, which determines the replicas for the key and routes the request to them; the request fails if replies are not received within a time bound.
- Writes: the request is routed to the replicas, and the system waits for a quorum of replicas to acknowledge completion of the write.
- Reads: depending on the consistency guarantee set by the client, the request is routed either to the closest replica, or to all replicas while waiting for a quorum of responses.
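The quorum wait can be sketched as below. This is an illustrative model; replica transport is faked with plain callables, and in reality the coordinator waits on asynchronous responses with a timeout:

```python
# Coordinator sketch: forward the write to every replica and report success
# once a majority (the quorum) has acknowledged it.
def quorum_write(replicas, key, value):
    """replicas: callables that return True when they acknowledge the write."""
    needed = len(replicas) // 2 + 1            # majority quorum
    acks = sum(1 for replica in replicas if replica(key, value))
    return acks >= needed

healthy = lambda k, v: True    # replica that acknowledges
down = lambda k, v: False      # replica that never answers

assert quorum_write([healthy, healthy, down], "k", "v")        # 2 of 3 ack
assert not quorum_write([healthy, down, down], "k", "v")       # only 1 of 3
```

With N = 3 this tolerates one unreachable replica per write while still guaranteeing that any two quorums intersect.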
Slide 18
Implementation On each machine, Cassandra comprises a partitioning module, cluster membership and failure detection, and a storage engine, all implemented from the ground up in Java. Commit log entries are purged using a rolling commit log mechanism with 128 MB segments. There is an in-memory data structure and a data file for every column family. All writes to disk are sequential to maximize throughput, and no locks are needed because the files dumped to disk are never mutated.
Slide 19
The After Story Cassandra was released as an open-source project on Google Code in July 2008 and is now developed by the Apache Software Foundation as Apache Cassandra (henceforth referred to as Cassandra in these slides).
- In Apache Cassandra, super columns were removed due to performance issues; composite columns were introduced instead.
- The Cassandra Query Language (CQL) presents a data model familiar to relational database users.
- Partitioning is still based on consistent hashing, but the project moved away from ring-position load balancing in favor of virtual nodes, and the order-preserving hash function was removed in favor of a true OrderedPartitioner (later superseded by ByteOrderedPartitioner).
Slide 20
The After Story (contd.) In modern Cassandra terminology, the coordinator is the node that processes a given client's request and routes it to the appropriate replicas; it is not necessarily itself a replica. ZooKeeper usage was restricted to Facebook's in-house Cassandra branch; modern Cassandra management tools include DataStax's OpsCenter and Netflix's Priam.
Slide 21
Big Players
- Facebook's Inbox Search feature was implemented on Cassandra, where every user is an index and the recipients and messages are stored as columns. The system currently stores more than 50 TB of data on a 150-node cluster with a median search latency of approximately 15 ms.
- Netflix, a video streaming firm, stores 95% of its data in Cassandra.
- eBay implemented Cassandra for features such as the "own", "want", and "like" counts on its web pages.
- Coursera, an online training service, uses Cassandra for its mobile applications.