Page 1:

Big Data Storage Technologies

James Lee

The George Washington University

April 11, 2012

Page 2:

What is Big Data?

- When the size of the data grows to become as big a problem to store and process as the problem you are trying to solve with the data.

Page 3:

Why are traditional filesystems insufficient?

- Upper limit on filesystem size

- Limited redundancy

- Limited bandwidth (see the rough numbers below)
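
For a rough sense of why a single machine's bandwidth becomes the limiting factor, assume (purely illustratively) a 100 TB dataset and one disk streaming at about 100 MB/s: a single full scan takes roughly 10^14 / 10^8 = 10^6 seconds, on the order of 12 days. Spread across 1,000 disks read in parallel, the same scan takes roughly 1,000 seconds, i.e. under 20 minutes.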

Page 4:

So what are the options for scaling out?

- Depends on business needs.

- Scale within a rack, within a datacenter, or across wide-area networks.

- Several different technologies available for achieving those goals.

- May have to make compromises in places.

Page 5:

Andrew File System

- Distributed filesystem developed in the 1980s.

- Used primarily by universities.

- Has traditional filesystem semantics.

- Scales to hundreds of terabytes.

Page 6:

Source: http://caligari.dartmouth.edu/classes/afs/print_pages.shtml

Page 7:

Source: http://caligari.dartmouth.edu/classes/afs/print_pages.shtml

Page 8:

What does Google do?

Look at Google’s requirements:

- hundreds of millions of huge files

- have to be read very quickly

- writes less important

- have to be redundant, but not synchronous

- concurrent access to files should have low overhead

Google built GFS (the Google File System) around these requirements; the same design ideas have since been implemented in the open-source Apache Hadoop project.

Page 9:

Hadoop

- Written in Java (no traditional filesystem semantics)

- Stores files in large blocks (64 MB) that get lazily replicated

- Rack-aware replication

- Master ‘NameNode’ tracks the location of blocks

- Writes optimized only for appending data (see the API sketch below)

- Scales to tens of thousands of nodes; > 100 PB
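
To make the NameNode / block model concrete, below is a minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The cluster address, file path, and replication factor are hypothetical placeholders; this is an illustrative client sketch, not a complete program for any particular cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        conf.set("dfs.replication", "3");   // replicas per block

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/logs/events.txt");   // hypothetical path

        // Write: the client asks the NameNode where each block should go,
        // then streams data to the chosen DataNodes block by block.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("event-1\nevent-2\n");
        }

        // Read: the NameNode returns the block locations and the client
        // reads each block from a DataNode that holds a replica.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, "UTF-8"));
        }
        fs.close();
    }
}

Appending to an existing file (fs.append(path)) follows the same pattern; random in-place updates are not supported, which matches the append-oriented write path noted above.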

Page 10:

Source: http://arst.ch/s9l

Page 11:

Amazon has very different requirements than a search engine:

- Willing to compromise on data consistency across the system for high availability (HA)

- Deal with more general-purpose data access

- Handle random access to smaller components

Amazon developed its own distributed storage system, Dynamo, a highly available key-value store.

Page 12:

Dynamo

- Decentralized, peer-to-peer architecture.

- System determines which node stores a key by MD5 hash (see the sketch below).

- Nodes always query neighbors for the latest version.

- Design implemented in the Apache Cassandra project.

Source: http://arst.ch/s9l
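
To illustrate the MD5-based placement, here is a minimal consistent-hashing sketch in plain Java: node names and keys are hashed onto the same 128-bit ring, and a key lives on the first node clockwise from its hash. The node names are made up, and real Dynamo and Cassandra go further than this single-copy sketch (virtual nodes, and each key replicated to N successive nodes).

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    // Ring position -> node name, kept sorted so we can walk clockwise.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    private static BigInteger md5(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Treat the 128-bit digest as a non-negative position on the ring.
        return new BigInteger(1, md.digest(s.getBytes(StandardCharsets.UTF_8)));
    }

    public void addNode(String node) throws Exception {
        ring.put(md5(node), node);
    }

    // First node at or after the key's hash; wrap around to the start if needed.
    public String nodeFor(String key) throws Exception {
        BigInteger h = md5(key);
        SortedMap<BigInteger, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        HashRing ring = new HashRing();
        ring.addNode("node-a");          // hypothetical node names
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("cart:1234 is stored on " + ring.nodeFor("cart:1234"));
    }
}

The appeal for a decentralized system is that adding or removing a node only remaps the keys between that node and its predecessor on the ring, rather than reshuffling everything.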
