Big table

Presented By: Riddhi Tandon, Akshay Gupta

Vasu Ragan Lohia

Outline

1) Introduction
2) Google Services
3) GFS
4) Chubby
5) Map Reduce
6) Big Table
7) Structure Of BigTable
8) Log Files and Compaction
9) Load Balancing
10) LookUp
11) Compression: Snappy
12) Conclusion

Introduction

The Google.com domain was registered on September 15, 1997.

Google services are highly efficient, robust and trustworthy.

To name a few: Google Search, Docs, App Engine, Maps, Gmail and many more.

Google is best known for its reliable and fast services, but what is working behind the scenes?

Let’s start with a short introduction to Google.

What is Google?

About Google:

Google is an Internet Information Provider company (according to NASDAQ).

It makes money from its advertising businesses: AdWords and AdSense.

Google lets your business grow by advertising, and you pay it per click (CPC, Cost Per Click) or per thousand impressions (CPM, Cost Per Mille).

Google has set up a revolutionary advertising model.

With the earnings from these businesses, Google builds amazing products that are costly to maintain, yet we get them for free.

How come Google’s services are so fast?

Undoubtedly, a number of aspects matter here (hardware, software, operating system, some of the best staff in the world, etc.). What I am going to explain here is the software part:

GFS

Chubby

Map Reduce

Bigtable

What is GFS?

GFS stands for Google File System.

It is a proprietary (meaning for Google’s internal use, not open source) distributed file system developed by Google for its services.

It is specially designed to provide efficient, reliable access to data using large clusters of commodity hardware, that is, low-cost machines rather than state-of-the-art computers. Google uses relatively inexpensive computers running the Linux operating system, and GFS works just fine with them!

What is Chubby?

Chubby is a lock service.

It is used to synchronize access to shared resources.

Within Google it is now also used as a name service, in place of the Domain Name System.

What is Map Reduce?

MapReduce is a software framework that processes massive amounts of unstructured data.

It allows developers to write programs that process data in parallel across a distributed cluster of processors or stand-alone computers.

Google has used it since 2004, mainly for its Web Indexing service.

The Map() procedure performs the filtering and sorting.

The Reduce() procedure performs the summary operations.
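The Map() and Reduce() roles described on this slide can be sketched as a tiny single-process word count. This is only an illustration under stated assumptions: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and nothing here is distributed the way Google's framework is.

```python
from collections import defaultdict


def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce(): summarise all values that share one key."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document would be mapped on a different machine;
    # here everything runs in one process (an assumption of this sketch).
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big table big data", "big cluster"])
print(counts)
```

The map step only looks at one document at a time and the reduce step only looks at one key at a time, which is what makes both steps easy to spread over many machines.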

What is Google BigTable?

BigTable is a compressed, high-performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB) and a few other Google technologies.

It is a proprietary data storage system (that is, for Google’s internal use only).

Most importantly, it is a non-relational database.

It uses a load-balancing structure that lets it run on commodity hardware.

It uses the Snappy compression utility to compact the data.

This means:

It is a database that uses compression utilities to store and retrieve data efficiently.

It uses a special structure for storing data (the Load Balancing Structure), which gives it high performance.

It is proprietary: it is for Google’s internal use only and is not open source.

It is built upon several other Google technologies.

Requirements?

BigTable is designed to run on commodity hardware (low-cost computers).

Thus the design does not need anything more exotic than PCs like ours.

Very low incremental cost for new services and for expanding computing power.

Special Features

It is a robust database: it keeps working in the same way even under adverse conditions.

BigTable gives the highest importance to read and query performance.

Higher Data Availability: a write is immediately replicated to multiple data centers.

Automatic Scaling: BigTable uses a distributed architecture to automatically manage scaling to very large data sets.

Structure of BigTable

Each table is a multi-dimensional sparse map (a memory-efficient sorted map that stores only the cells that actually hold data).

Each cell is addressed by (1) a row, (2) a column and (3) a time version (timestamp).

Time versions result in multiple copies of each cell with different timestamps. This produces a great deal of redundancy, but it is a requirement for Google’s services, so do not think of it as a drawback of the system.
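A minimal sketch of this (row, column, timestamp) addressing is shown below. The class name `SparseTable`, its methods and the sample row key are assumptions made for illustration; the real system is distributed and persistent, while this is just an in-memory dictionary.

```python
import bisect


class SparseTable:
    """Toy map from (row, column, timestamp) to value; empty cells take no space."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._cells.get((row, column), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None


table = SparseTable()
table.put("codeplaza.example/index", "title", 1, "Old title")
table.put("codeplaza.example/index", "title", 3, "New title")

print(table.get("codeplaza.example/index", "title"))               # newest version
print(table.get("codeplaza.example/index", "title", timestamp=2))  # version as of time 2
print(table.get("codeplaza.example/about", "title"))               # empty cell
```

Keeping every timestamped version is what lets a reader ask for the state of a cell "as of" an earlier time, at the cost of the redundancy mentioned above.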

Google does Web Indexing (indexing the contents of websites) to get the data of all websites. It stores all the URLs, their titles, timestamps and many more required fields.

Load Balancing Structure

(dummy sitemap of my website Codeplaza, where 5 fields are shown)

Consider this as one huge table with millions of entries. To manage such tables, they are split at row boundaries and saved as Tablets. Each Tablet is 100-200 MB in size and each machine stores about 100 of them. 100-200 MB of data can hold thousands of rows (or even more).

Example showing 4 rows = 1 tablet.
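The "4 rows = 1 tablet" example can be sketched as follows. The helper names `split_into_tablets` and `find_tablet` and the row keys are hypothetical, and a fixed row count stands in for the 100-200 MB size limit.

```python
import bisect


def split_into_tablets(sorted_rows, rows_per_tablet):
    """Cut a sorted list of row keys into consecutive tablets at row boundaries."""
    return [
        sorted_rows[i:i + rows_per_tablet]
        for i in range(0, len(sorted_rows), rows_per_tablet)
    ]


def find_tablet(tablets, row_key):
    """Return the index of the tablet whose row range covers row_key, or None."""
    last_keys = [tablet[-1] for tablet in tablets]
    index = bisect.bisect_left(last_keys, row_key)
    return index if index < len(tablets) else None


rows = ["a.com", "b.com", "c.com", "d.com", "e.com", "f.com", "g.com", "h.com"]
tablets = split_into_tablets(rows, rows_per_tablet=4)

print(tablets)
print(find_tablet(tablets, "c.com"), find_tablet(tablets, "f.com"))
```

Because rows are kept in sorted order, each tablet is fully described by its last row key, so finding the right tablet is a binary search rather than a scan.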

This setup allows fine-grained load balancing. (If one tablet is receiving lots of queries, it can be split so the data is shared with other tablets, or the busy tablet can be moved to another, not-so-busy machine.)

This setup also allows fast rebuilding. (When a machine goes down, other machines each take one tablet from the downed machine. So 100 machines each get one new tablet, and the load on each machine to pick up its new tablet is fairly small.)
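The fast-rebuilding arithmetic can be checked with a short sketch. Round-robin assignment is an assumption made here for simplicity; the slides do not say how the real system chooses which machine picks up which tablet, and the names below are hypothetical.

```python
def reassign_tablets(failed_tablets, surviving_machines):
    """Spread the tablets of a failed machine over the survivors, round-robin."""
    assignment = {machine: [] for machine in surviving_machines}
    for i, tablet in enumerate(failed_tablets):
        machine = surviving_machines[i % len(surviving_machines)]
        assignment[machine].append(tablet)
    return assignment


failed = [f"tablet-{n}" for n in range(100)]      # about 100 tablets per machine
survivors = [f"machine-{n}" for n in range(100)]  # 100 machines still running

assignment = reassign_tablets(failed, survivors)
extra_load = max(len(tablets) for tablets in assignment.values())
print(extra_load)  # largest number of new tablets any single machine receives
```

With 100 tablets and 100 survivors, no machine has to recover more than one tablet, which is why recovery is quick compared with one machine reloading everything.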

Tablets are stored as immutable SSTables plus a tail of logs (one log per machine).

SSTable stands for ‘Sorted String Table’. Some also call it ‘Static and Sorted Table’. (Figure: a dummy structure of an SSTable.)

Log Files and Compaction

When system memory fills up, it compacts some tablets.

There are two kinds of compaction: minor and major.

Minor compactions involve only a few tablets, while major compactions involve the whole system and reclaim hard disk space. The locations of the tablets are themselves stored in special BigTable cells.

Immutable SSTable: mutation means change or update over time. Remember the mutants from X-Men and Krrish 3 (mutants are a special kind of species whose DNA has changed over time). SSTables, being immutable, are never changed or updated; that is, they are static!

Now the question is: how are entries stored in, or modifications made to, an immutable SSTable?

The answer is: remove the old one and make a new SSTable. Sounds weird? It is a great idea, because it saves a lot of the time that would go into searching and sorting to update data in place in a single (large) table.
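One way to picture "remove the old one, make a new SSTable" is the sketch below. It assumes writes are first collected in memory and then written out as a new sorted, read-only table, and that old tables are later merged into a single fresh one. The class `Tablet` and the method names `flush_memtable` and `compact` are hypothetical, and Python tuples stand in for files on disk.

```python
class Tablet:
    """Toy tablet: an in-memory buffer plus a list of immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, still mutable
        self.sstables = []   # oldest first; each is a tuple of sorted (key, value) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush_memtable()

    def flush_memtable(self):
        """Freeze the buffer into a NEW immutable table; existing tables are untouched."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def compact(self):
        """Replace all old tables with one new table; newer values win."""
        self.flush_memtable()
        merged = {}
        for table in self.sstables:  # oldest first, so later tables overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


tablet = Tablet(memtable_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)   # buffer is full: first SSTable is written
tablet.write("a", 3)   # an "update" is just a newer entry
tablet.write("c", 4)   # second SSTable is written
tables_before = len(tablet.sstables)
tablet.compact()
print(tables_before, tablet.sstables, tablet.read("a"))
```

No table is ever edited after it is written: an update simply shadows the older entry until compaction drops the stale copy and reclaims its space.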

LookUp

Lookup is a three-level system.

Benefit: no big bottleneck in the system. It also makes heavy use of pre-fetching and caching.

Tablet Location Hierarchy

A Chubby file contains the location of the root tablet.

The root tablet contains the locations of all tablets of the Metadata table.

The Metadata table stores the locations of the actual tablets.

The client moves up the hierarchy (Metadata -> Root -> Chubby) if the location of a tablet is unknown or incorrect.
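The three levels and the client-side cache can be sketched with plain dictionaries. All server names, tablet names and end-row keys below are invented for illustration, and each level is modelled as a map from the last row key of a range to the thing that serves that range.

```python
# Level 1: a Chubby file that names the root tablet (hypothetical contents).
CHUBBY_FILE = {"root_tablet": "root"}

# Level 2: the root tablet maps row ranges to Metadata tablets.
ROOT_TABLETS = {"root": {"m": "meta-1", "z": "meta-2"}}

# Level 3: each Metadata tablet maps row ranges to the servers holding the data.
METADATA_TABLETS = {
    "meta-1": {"f": "server-A", "m": "server-B"},
    "meta-2": {"z": "server-C"},
}


def first_covering(ranges, row_key):
    """Return the entry whose end key is the first one >= row_key."""
    for end_key in sorted(ranges):
        if row_key <= end_key:
            return ranges[end_key]
    raise KeyError(row_key)


def locate(row_key, cache):
    """Return (server, levels_consulted); a cache hit consults no level at all."""
    if row_key in cache:
        return cache[row_key], 0
    root = CHUBBY_FILE["root_tablet"]                               # level 1
    meta_name = first_covering(ROOT_TABLETS[root], row_key)         # level 2
    server = first_covering(METADATA_TABLETS[meta_name], row_key)   # level 3
    cache[row_key] = server
    return server, 3


cache = {}
first = locate("google.com", cache)
second = locate("google.com", cache)
print(first, second)
```

A client that evicts a stale cache entry and calls `locate` again is, in effect, moving back up the hierarchy as the last bullet describes.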

Compression : Snappy

There is a lot of redundant data in the system (especially through time), so heavy use is made of compression.

Compression looks for similar values along the rows, columns and times. (This is where priority comes in: lower priority means less data fetching and more compression.)

Google used variations of BMDiff and Zippy to develop its compression software. BMDiff gives high write speeds (~100 MB/s) and even faster read speeds (~1000 MB/s). Zippy compresses very fast. After further research, they built a library named “Snappy”.

Snappy is a compression/decompression library that does not aim for maximum compression; instead, it aims for very high speeds and reasonable compression. (On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.)
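Why redundancy "through time" compresses so well can be seen with a small experiment. Note the substitution: this sketch uses Python's standard `zlib` module as a stand-in, not Snappy, BMDiff or Zippy, so the sizes it prints say nothing about those tools. The page content is made up.

```python
import zlib

# Ten crawled versions of the same (hypothetical) page, differing only slightly.
page = b"<html><title>Codeplaza</title><body>" + b"Welcome to Codeplaza. " * 20 + b"</body></html>"
versions = [page + b"<!-- crawl %d -->" % i for i in range(10)]

# Compress each version on its own, then all versions stored next to each other.
separately = sum(len(zlib.compress(v)) for v in versions)
together = b"".join(versions)
jointly = len(zlib.compress(together))

print(len(together), separately, jointly)

# Compression is lossless: the original bytes come back unchanged.
restored = zlib.decompress(zlib.compress(together))
```

Storing similar values close together lets the compressor reuse what it has already seen, which is the effect the slide describes for values that repeat along rows, columns and times.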

Actual Hierarchical Load Balancing Structure

This is how it works:

A request arrives at ROOT (Master Computer).

ROOT checks its master record and sends the request to the right PC.

SSTable contains the records of tablets.

Via Meta Tablets, the request is sent to the tablet containing the original data, and the data is then fetched.

Conclusion

Bigtable has achieved its goals of high performance, data availability and scalability. It has been successfully deployed in real applications (Personalized Search, Orkut, Google Maps, …).

Building its own storage system gave Google significant advantages: flexibility in designing the data model, and control over the implementation and over the other infrastructure on which Bigtable relies.

Thank You
