Accelerating Startup Growth with Big Data
Dwika Sudrajat, IT Consultant, Florida, Hong Kong & Jakarta. November 23rd, 2016
▐ email: [email protected] ▐ Florida: +1-407-2502812 ▐ Hong Kong: +852-54152971 ▐ Jakarta: +62-8161108571 ▐ FB: dwika.sudrajat ▐ TW: @dwikasudrajat ▐ managingconsultant.blogspot.com ▐ dwikasudrajat.blogspot.com ▐ dwikasudrajat.wordpress.com
What technologies do you think they are running on?
Conventional Startup Development Team
Today's Startup Development Team
From LAMP to MEAN
Modern web development stack
MEAN.JS: a full-stack JavaScript framework using MongoDB, Express, AngularJS, and NodeJS
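To make the stack concrete, here is a minimal sketch of a MEAN-style REST API back end in TypeScript, assuming the express and mongodb npm packages; the connection string, database, collection and routes (demo, startups, /api/startups) are made-up examples, not part of the original slides.

```typescript
import express from "express";
import { MongoClient } from "mongodb";

const app = express();
app.use(express.json()); // parse JSON request bodies

async function main() {
  // Connect to a local MongoDB instance (connection string is an assumption).
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const startups = client.db("demo").collection("startups");

  // REST endpoints that an AngularJS front end could consume.
  app.get("/api/startups", async (_req, res) => {
    res.json(await startups.find().toArray());
  });

  app.post("/api/startups", async (req, res) => {
    const result = await startups.insertOne(req.body);
    res.status(201).json({ id: result.insertedId });
  });

  app.listen(3000, () => console.log("API listening on port 3000"));
}

main().catch(console.error);
```

Every layer, from the database driver to the HTTP routes, is written in one language, which is the main appeal of the MEAN stack for small startup teams.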
What is Big Data?
Data
Hadoop, Why?
Hadoop, Volume, Velocity, Variety
Data Growth
Real Applications of Big Data Today
▐ Short lifespan of the data
▐ Fast-moving data
▐ Fast data processing
▐ High variety of data
Challenges
Data Volume and Variety
Four V’s and a C
It is not just volume that makes big data big; it is about the four V's: high Volume, Variety, Velocity, and high Value!
In addition, the C is the Challenge: the data is very complex in nature and often unstructured: text documents, emails, images and videos, clickstream data, social media feed data, etc.
Eliminate a Single Point of Failure
▐ Ensure the load balancer itself does not become a single point of failure
▐ Load balancers must be implemented in a high-availability cluster
A Typical Hadoop Cluster
▐ Master Node: Name Node and Job Tracker (metadata for block info, job assignment)
▐ Slave Nodes on Rack 1, Rack 2 and Rack 3: Data Nodes and Task Trackers running Map/Reduce (data assignment to nodes, task assignment, data read and write)
▐ Write path:
1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks
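As an illustration, the following toy TypeScript simulation walks through the same four write-path steps; the node names, block size and round-robin allocation are invented for the sketch and are not Hadoop's actual implementation.

```typescript
// Toy model: the Name Node picks a pipeline of Data Nodes for each block,
// the client writes the block, and the block is replicated down the pipeline.
const REPLICATION = 3;
const BLOCK_SIZE = 8; // characters per block here; real HDFS blocks are 64-128 MB

// Each Data Node simply remembers the blocks it stores.
const dataNodes = new Map<string, string[]>(
  ["dn1", "dn2", "dn3", "dn4"].map((name): [string, string[]] => [name, []])
);

// 1. Client consults the Name Node, which allocates a pipeline of Data Nodes.
let nextNode = 0;
function nameNodeAllocate(): string[] {
  const names = [...dataNodes.keys()];
  const pipeline = Array.from(
    { length: REPLICATION },
    (_, i) => names[(nextNode + i) % names.length]
  );
  nextNode += 1;
  return pipeline;
}

// 2-3. Client writes the block to the first Data Node; each node in the
// pipeline forwards (replicates) the block to the next one.
function writeBlock(block: string, pipeline: string[]): void {
  for (const node of pipeline) {
    dataNodes.get(node)!.push(block);
  }
}

// 4. The cycle repeats for the next blocks of the file.
const file = "accelerating startup growth with big data";
for (let i = 0; i < file.length; i += BLOCK_SIZE) {
  writeBlock(file.slice(i, i + BLOCK_SIZE), nameNodeAllocate());
}
console.log(dataNodes); // each block appears on REPLICATION different nodes
```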
Hadoop File System (HDFS)
▐ Name Node keeps the metadata for block info; the Client performs data reads and writes against the Data Nodes
▐ Data Nodes 1-9 are spread across Rack 1, Rack 2 and Rack 3 (Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 4, …)
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
▐ Optimized for Batch Processing
– Provides very high aggregate bandwidth
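To show why replication protects against hardware failure, here is a toy TypeScript sketch of rack-aware replica placement; the node names and rack layout are made up, and this is only a simplified model of the default HDFS policy (one replica on the writer's rack, two more on a single remote rack), not Hadoop's actual code.

```typescript
interface DataNode { name: string; rack: string; }

// Hypothetical cluster layout used only for this sketch.
const nodes: DataNode[] = [
  { name: "dn1", rack: "rack1" }, { name: "dn2", rack: "rack1" },
  { name: "dn3", rack: "rack2" }, { name: "dn4", rack: "rack2" },
  { name: "dn5", rack: "rack3" },
];

// Pick three replica locations for one block: one on the local rack and two
// on another rack, so losing any single rack cannot lose the block.
function placeReplicas(localRack: string): DataNode[] {
  const first = nodes.find((n) => n.rack === localRack)!;   // local rack
  const remote = nodes.filter((n) => n.rack !== localRack);
  const second = remote[0];                                  // remote rack
  const third =
    remote.find((n) => n.rack === second.rack && n !== second) ?? remote[1];
  return [first, second, third];
}

console.log(placeReplicas("rack1").map((n) => `${n.name}@${n.rack}`));
// e.g. [ 'dn1@rack1', 'dn3@rack2', 'dn4@rack2' ]
```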
Hadoop, Why?
▐ Need to process Multi-Petabyte Datasets
▐ Need common infrastructure
– Efficient, reliable, Open Source (Apache License)
▐ The above goals are the same as Condor, but workloads are IO-bound and not CPU-bound
Hive, Why?
▐ Need a Multi-Petabyte Warehouse
▐ Hive is a Hadoop subproject!
What is MapReduce?
▐ Data-parallel programming model for clusters of commodity machines
▐ Pioneered by Google: processes 20 PB of data per day
▐ Popularized by the open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, …
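To illustrate the programming model itself (Hadoop's real API is Java; this is only a sketch of the idea), the following TypeScript word count runs explicit map, shuffle and reduce phases over an in-memory list of lines.

```typescript
type KeyValue = [string, number];

// Map phase: each input line becomes a list of (word, 1) pairs.
function mapper(line: string): KeyValue[] {
  return line
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean)
    .map((word): KeyValue => [word, 1]);
}

// Shuffle phase: group intermediate pairs by key (word).
function shuffle(pairs: KeyValue[]): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (const [word, count] of pairs) {
    groups.set(word, [...(groups.get(word) ?? []), count]);
  }
  return groups;
}

// Reduce phase: sum the counts for each word.
function reducer(word: string, counts: number[]): KeyValue {
  return [word, counts.reduce((a, b) => a + b, 0)];
}

const input = ["big data needs big clusters", "hadoop processes big data"];
const output = [...shuffle(input.flatMap(mapper))].map(([w, c]) => reducer(w, c));
console.log(output); // e.g. [ [ 'big', 3 ], [ 'data', 2 ], ... ]
```

In a real cluster the map and reduce functions run in parallel on many Task Trackers and the shuffle moves intermediate data between machines; the toy version keeps everything in one process to show only the data flow.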
Hadoop at Facebook
▐ Production cluster
– 4800 cores, 600 machines, 16 GB per machine (April 2009)
– 8000 cores, 1000 machines, 32 GB per machine (July 2009)
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
▐ Test cluster
– 800 cores, 16 GB each
2016 - Hadoop clusters
▐ ~20,000 machines running Hadoop
▐ Largest clusters are currently 2000 nodes
▐ Several petabytes of user data (compressed, unreplicated)
▐ Run hundreds of thousands of jobs every month
2016 - Big Data Server Farm
Conclusions
The Digital Age brings many opportunities but also challenges.
Big Data and Analytics can help meet the challenges and realize the opportunities.
It is within anyone's grasp: do it incrementally and iteratively.
Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).
Good Data Scientists are needed, in a mixed team of competences, to make the right choices.