By: Andrew B. Osmond
Sep 24, 2015
About Me
FB : http://facebook.com/ab.osmond
Office: Ged. N202, Fakultas Teknik Elektro, Universitas Telkom
Gmail : [email protected]
Tel-U : [email protected]
Why Is Data So Big?
Data Anywhere
Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional data management tools.
Big Data can take up terabytes or petabytes of storage space in diverse formats, including text, video, sound, and images.
Traditional relational database management systems cannot deal with such large masses of data.
Data Anywhere
What can we do with the big data?
Big Data Architecture
Nature of Data
Working With Data
Datasource
Data Scrubbing
Data Formats
Datasource
Open Data
Open data is data that can be used, re-used, and redistributed freely by anyone for any purpose.
Examples:
World Health Organization data: http://www.who.int/research/en/
Machine Learning datasets: http://bitly.com/bundles/bigmlcom/2
The World Bank: http://data.worldbank.org/
Hilary Mason's research-quality datasets: https://bitly.com/bundles/hmason/1
Text Files
Text files are commonly used for data storage because they are easy to transform into different formats, and it is often easier to recover and continue processing the remaining contents than with other formats.
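As a small illustration of why text formats are easy to work with, the sketch below parses a hypothetical CSV snippet with Python's standard csv module; the field names and values are invented for the example.

```python
import csv
import io

# A hypothetical CSV snippet: text data stays human-readable and is
# simple to transform into other structures (here, a list of dicts).
raw = """name,country,visits
Alice,ID,120
Bob,MY,87
"""

def parse_records(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

records = parse_records(raw)
print(records[0]["name"], records[1]["visits"])
```

Even if part of such a file were corrupted, the remaining lines could still be parsed row by row, which is the recovery property noted above.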
SQL Database
NoSQL Database
Document Store
MongoDB (http://www.mongodb.com), CouchDB (http://couchdb.apache.org/)
Key value store
Apache Cassandra, Dynamo, HBase, Amazon SimpleDB
Graph-based store
Neo4j, InfoGrid, Horton
Document Store
Key Value Store
Graph Store
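The three NoSQL data models above can be sketched with plain Python structures; these are illustrations of the models only, not the actual MongoDB, Cassandra, or Neo4j APIs.

```python
# Document store: each record is a self-contained, schema-free document
# (compare MongoDB or CouchDB, which store JSON-like documents).
document = {"_id": 1, "name": "Alice", "tags": ["admin", "author"]}

# Key-value store: opaque values looked up by a single key
# (compare Cassandra, Dynamo, or SimpleDB access patterns).
kv_store = {}
kv_store["session:42"] = "alice"

# Graph store: nodes plus edges that model relationships directly
# (compare Neo4j, where relationships are first-class data).
graph = {
    "nodes": {"alice", "bob"},
    "edges": [("alice", "follows", "bob")],
}
```

The choice between them follows from the queries needed: documents for nested records, key-value for fast lookups, graphs for relationship traversal.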
Leading Technologies
Relational databases fail to store and process Big Data at this scale.
As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
These technologies include Hadoop, MapReduce, and NoSQL.
Hadoop
Open-source framework
Java-based programming framework
Processes and stores large datasets
Distributed computing environment
Components: HDFS, MapReduce
Hadoop vs. SQL
Hadoop: data is stored as compressed files spread across n commodity servers.
SQL: data is stored as tables and columns with relations between them.
Hadoop: fault tolerant; if one node fails, the system still works.
SQL: if any node crashes, it raises an error so as to maintain consistency.
Map Reduce
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
Hadoop provides a physical implementation of MapReduce.
A job is a combination of two Java functions: Mapper() and Reducer().
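The model can be sketched in memory as three phases: a mapper emits (key, value) pairs, the framework shuffles pairs by key, and a reducer folds each key's values. This is a minimal single-process sketch of the word-count example commonly used to teach MapReduce, not a real Hadoop job, which would distribute these phases across a cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: fold all values emitted for one key."""
    return word, sum(counts)

def map_reduce(lines):
    """Run map, shuffle (group by key), and reduce over the input."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

counts = map_reduce(["big data", "Big Data is big"])
# counts == {"big": 3, "data": 2, "is": 1}
```

Because each mapper call is independent, the map phase parallelizes trivially; the shuffle is the only step that requires coordination, which is why Hadoop's implementation centers on it.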
Map Reduce Algorithm