By: Andrew B. Osmond
Sep 24, 2015
About Me
FB : http://facebook.com/ab.osmond
Office: Ged. N202, Fakultas Teknik Elektro, Universitas Telkom
Gmail : [email protected]
Tel-U : [email protected]
Why Is Data So Big?
Data Anywhere
Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional data management tools.
Big Data can take up terabytes or petabytes of storage space in diverse formats, including text, video, sound, and images.
Traditional relational database management systems cannot deal with such large masses of data.
Data Anywhere
What can we do with the big data?
Big Data Architecture
Nature of Data
Working With Data
Datasource
Data Scrubbing
Data Formats
Datasource
Open Data
Open data is data that can be used, re-used, and redistributed freely by anyone for any purpose.
Examples:
World Health Organization data: http://www.who.int/research/en/
Machine Learning datasets: http://bitly.com/bundles/bigmlcom/2
The World Bank: http://data.worldbank.org/
Hilary Mason's research-quality datasets: https://bitly.com/bundles/hmason/1
Text Files
Text files are commonly used for data storage because they are easy to transform into different formats, and it is often easier to recover and continue processing the remaining contents than with other formats.
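As a small illustration of why text formats are easy to work with, the sketch below parses a hypothetical CSV snippet with Python's standard csv module; the field names and values are invented for the example.

```python
import csv
import io

# A hypothetical CSV snippet: text data stays human-readable and is
# simple to transform into other structures (here, a list of dicts).
raw = """name,country,visits
Alice,ID,120
Bob,MY,87
"""

def parse_records(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

records = parse_records(raw)
print(records[0]["name"], records[1]["visits"])
```

Even if part of such a file were corrupted, the remaining lines could still be parsed row by row, which is the recovery property noted above.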
SQL Database
NoSQL Database
Document Store
MongoDB (http://www.mongodb.com), CouchDB (http://couchdb.apache.org/)
Key value store
Apache Cassandra, Dynamo, HBase, Amazon SimpleDB
Graph-based store
Neo4j, InfoGrid, Horton
Document Store
Key Value Store
Graph Store
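The three NoSQL data models above can be sketched with plain Python structures; these are illustrations of the models only, not the actual MongoDB, Cassandra, or Neo4j APIs.

```python
# Document store: each record is a self-contained, schema-free document
# (compare MongoDB or CouchDB, which store JSON-like documents).
document = {"_id": 1, "name": "Alice", "tags": ["admin", "author"]}

# Key-value store: opaque values looked up by a single key
# (compare Cassandra, Dynamo, or SimpleDB access patterns).
kv_store = {}
kv_store["session:42"] = "alice"

# Graph store: nodes plus edges that model relationships directly
# (compare Neo4j, where relationships are first-class data).
graph = {
    "nodes": {"alice", "bob"},
    "edges": [("alice", "follows", "bob")],
}
```

The choice between them follows from the queries needed: documents for nested records, key-value for fast lookups, graphs for relationship traversal.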
Leading Technologies
Relational databases fail to store and process Big Data at this scale.
As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
These technologies include Hadoop, MapReduce, and NoSQL.
Hadoop
Open-source framework
Java-based programming framework
Processes and stores large datasets
Distributed computing environment
Components: HDFS, MapReduce
Hadoop vs. SQL
Hadoop: data is stored as compressed files spread across n commodity servers.
SQL: data is stored as tables and columns with relations between them.
Hadoop: fault tolerant; if one node fails, the system still works.
SQL: if any node crashes, it raises an error so as to maintain consistency.
Map Reduce
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
Hadoop provides a physical implementation of MapReduce.
A job is a combination of two Java functions: Mapper() and Reducer().
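The model can be sketched in memory as three phases: a mapper emits (key, value) pairs, the framework shuffles pairs by key, and a reducer folds each key's values. This is a minimal single-process sketch of the word-count example commonly used to teach MapReduce, not a real Hadoop job, which would distribute these phases across a cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: fold all values emitted for one key."""
    return word, sum(counts)

def map_reduce(lines):
    """Run map, shuffle (group by key), and reduce over the input."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

counts = map_reduce(["big data", "Big Data is big"])
# counts == {"big": 3, "data": 2, "is": 1}
```

Because each mapper call is independent, the map phase parallelizes trivially; the shuffle is the only step that requires coordination, which is why Hadoop's implementation centers on it.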
Map Reduce Algorithm