Evolution of Big Data Architecture Gang Tao

Big Data Computing Architecture

Jan 15, 2017



Gang Tao
Transcript
Page 1: Big Data Computing Architecture

Evolution of Big Data Architecture

Gang Tao

Page 2: Big Data Computing Architecture

Computing Trend

Page 3: Big Data Computing Architecture
Page 4: Big Data Computing Architecture

Non Functional Requirements

• Latency

• Throughput

• Fault Tolerance

• Scalability

• Exactly-Once Semantics

Page 5: Big Data Computing Architecture

Hadoop

Page 6: Big Data Computing Architecture

Hadoop Ecosystem

Page 7: Big Data Computing Architecture

MapReduce

Move computation to the data
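
To make the MapReduce model concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the map tasks are scheduled on the nodes that already hold the input blocks, which is what "move computation to the data" refers to.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "big compute"])
```

Only the map and reduce functions are user logic; the shuffle in between is the part a framework would normally provide.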

Page 8: Big Data Computing Architecture

MapReduce

Page 9: Big Data Computing Architecture

Hadoop: Limitations

• MapReduce is hard to use

• Latency is high

• Data movement is inevitable

Page 10: Big Data Computing Architecture

Lambda

Page 11: Big Data Computing Architecture

Lambda Architecture

query = function(all data)

Page 12: Big Data Computing Architecture

Design Principle

human fault-tolerance – the system must not be susceptible to data loss or data corruption, because at scale the damage could be irreparable.

data immutability – store data in its rawest form, immutably and in perpetuity. (INSERT/SELECT/DELETE, but no UPDATE!)

recomputation – with the two principles above, it is always possible to (re)compute results by running a function on the raw data.
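
The immutability and recomputation principles can be sketched with an append-only log and a pure function over it. This is a minimal sketch under stated assumptions: Event, MasterDataset and total_per_user are hypothetical names, and an in-memory list stands in for durable storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored fact cannot be modified
class Event:
    user: str
    amount: int


class MasterDataset:
    """Append-only store: facts can be added and read, never updated."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def scan(self):
        # Return a read-only copy of all raw facts.
        return tuple(self._events)


def total_per_user(events):
    """A view is a pure function of the raw data, so it can be rebuilt."""
    totals = {}
    for e in events:
        totals[e.user] = totals.get(e.user, 0) + e.amount
    return totals


log = MasterDataset()
log.append(Event("alice", 10))
log.append(Event("bob", 5))
log.append(Event("alice", -3))  # a correction is a new fact, not an UPDATE

view = total_per_user(log.scan())
```

If the view logic turns out to be buggy, the fix is to correct total_per_user and run it again over the same raw events.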

Page 13: Big Data Computing Architecture

Lambda Architecture

Page 14: Big Data Computing Architecture

Batch Processing

Page 15: Big Data Computing Architecture

Batch Processing

Using only batch processing always leaves you with a portion of unprocessed data.

Page 16: Big Data Computing Architecture

Realtime Stream Processing

Page 17: Big Data Computing Architecture

query = function(all data) + function(new data)
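
The formula can be illustrated with a small Python sketch that merges a batch view with a realtime view at query time. The event data, the BATCH_CUTOFF value and the function names are assumptions made up for this example; a count is used because counts can simply be added, which is not true of every aggregate.

```python
from collections import Counter


def count_clicks(events):
    """The same aggregation function, applied to any slice of the data."""
    return Counter(e["page"] for e in events)


# Hypothetical click events; "ts" is a logical timestamp.
all_events = [
    {"page": "/home", "ts": 1},
    {"page": "/docs", "ts": 2},
    {"page": "/home", "ts": 3},
    {"page": "/home", "ts": 4},
]

BATCH_CUTOFF = 2  # assumption: the last batch run covered ts <= 2

# Batch layer: function(all data seen by the last batch run).
batch_view = count_clicks(e for e in all_events if e["ts"] <= BATCH_CUTOFF)

# Speed layer: function(new data that arrived since that run).
realtime_view = count_clicks(e for e in all_events if e["ts"] > BATCH_CUTOFF)


def query(page):
    """Serving layer: merge both views to answer over all data."""
    return batch_view[page] + realtime_view[page]
```

Here query("/home") combines one click from the batch view with two from the realtime view, matching a count over the full event list.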

Page 18: Big Data Computing Architecture

Lambda Architecture

Page 19: Big Data Computing Architecture

A Reference Case

Page 20: Big Data Computing Architecture

Lambda the Bad?

What is Good? Immutable data, reprocessing

What is Bad? Keeping code written in two different systems perfectly in sync

Page 21: Big Data Computing Architecture

Overcomplicated Lambda

Page 22: Big Data Computing Architecture

Batch vs. Stream

Page 23: Big Data Computing Architecture

Stream & Realtime

Page 24: Big Data Computing Architecture

Storm Logical View

Bolts

Spouts

Move data to the computation
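
The spout and bolt roles can be sketched with plain Python generators. This is not the Storm API: sentence_spout, split_bolt and count_bolt are hypothetical names, and everything runs in one process, whereas Storm distributes spouts and bolts across workers and streams tuples between them.

```python
def sentence_spout():
    """Spout: the source of the stream; emits one tuple at a time."""
    for sentence in ["the cow jumped", "the moon"]:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes sentences and emits one word per tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) updates."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


# Wire the topology: the data flows through the long-running operators.
topology = count_bolt(split_bolt(sentence_spout()))
emitted = list(topology)
```

Each tuple travels through the whole chain as soon as it is produced, which is the opposite of shipping code to where a stored dataset lives.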

Page 25: Big Data Computing Architecture

Storm Deployment View

Page 26: Big Data Computing Architecture

Code Compare

Page 27: Big Data Computing Architecture

DAG

• DAG: Directed Acyclic Graph

• Used in Spark, Storm, Flink, etc.
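
As a small illustration of why a job graph must be acyclic, the sketch below uses Python's standard graphlib module (Python 3.9+) to derive a valid execution order for a hypothetical pipeline. The stage names are made up for this example and do not come from any of the engines listed.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a stage; its value is the set of stages it depends on.
dag = {
    "parse": {"source"},
    "filter": {"parse"},
    "count": {"filter"},
    "sink": {"count", "filter"},
}

# A topological order runs every stage after all of its inputs.
order = list(TopologicalSorter(dag).static_order())


def is_acyclic(graph):
    """Return True if the graph has no cycle and can be scheduled."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False
```

A graph with a cycle, such as two stages that depend on each other, has no such order, so no stage could ever be scheduled first.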

Page 28: Big Data Computing Architecture

Out of Control

Build Complexity with Simplicity

Page 29: Big Data Computing Architecture

Stream Processing Model

One at a time vs. micro batch

Page 30: Big Data Computing Architecture

Stream Processing Model

                           One at a time    Micro batch
Low latency                Y                N
High throughput            N                Y
At least once              Y                Y
Exactly once               Sometimes        Y
Simple programming model   Y                N
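
The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration with hypothetical function names: batches are cut by a fixed element count here, whereas micro-batch engines such as Spark Streaming cut them by time interval.

```python
from itertools import islice


def one_at_a_time(stream, handle):
    """Invoke the handler once per event: lowest latency per event."""
    for event in stream:
        handle([event])


def micro_batch(stream, handle, batch_size):
    """Buffer events and invoke the handler once per small batch,
    amortizing per-call overhead at the cost of added latency."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        handle(batch)


calls_single, calls_batched = [], []
one_at_a_time(range(5), calls_single.append)
micro_batch(range(5), calls_batched.append, batch_size=2)
```

The same five events cause five handler calls in the first model and three in the second, which is the latency versus throughput trade-off summarized on this slide.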

Page 31: Big Data Computing Architecture

Stream Computing: Limitations

• Queries must be written before the data arrives

• There should be another way to query past data

• Queries cannot be run twice

• All results are lost when any error occurs; the data is already gone by the time a bug is found

• Out-of-order events break results

• Recorded-time-based queries? Or arrival-time-based queries?
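
The last two points can be made concrete with a small Python sketch that counts the same events per window, once by recorded (event) time and once by arrival time. The timestamps and the WINDOW size are made up for illustration.

```python
from collections import Counter

WINDOW = 10  # assumption: tumbling windows of 10 seconds

# (recorded_time, arrival_time); the third event arrives late,
# after an event that was recorded later than it.
events = [(1, 2), (12, 13), (8, 14), (15, 16)]


def window_counts(timestamps):
    """Count events per window, keyed by the window start time."""
    return dict(Counter((t // WINDOW) * WINDOW for t in timestamps))


by_recorded_time = window_counts(recorded for recorded, _ in events)
by_arrival_time = window_counts(arrival for _, arrival in events)
```

The late event recorded at second 8 belongs to the first window, but by arrival time it is counted in the second one, so the two queries give different answers over identical data.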

Page 32: Big Data Computing Architecture

Batch/Stream Unification

Page 33: Big Data Computing Architecture

Stream with Spark

Page 34: Big Data Computing Architecture

Apache Flink

Page 35: Big Data Computing Architecture

Flink Distributed Snapshot

Page 36: Big Data Computing Architecture

Fault Tolerance in Stream

• At least once: ensure all operators see all events

  • Stream -> replay on failure

• Exactly once:

  • Flink: distributed snapshot

  • Spark: micro batch
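
The gap between the two guarantees can be sketched with a running sum over a replayable log. This is a toy, single-operator illustration and not how Flink or Spark implement recovery; LOG, the function names and the crash and snapshot positions are all assumptions for the example.

```python
LOG = [5, 3, 7, 1]  # assumption: a durable source that can be replayed


def replay_without_snapshot(crash_at):
    """At least once: after a crash the whole log is replayed into
    state that already holds the first attempt's updates."""
    total = 0
    for value in LOG[:crash_at]:  # first attempt, fails at crash_at
        total += value
    for value in LOG:  # recovery: replay everything
        total += value
    return total


def replay_from_snapshot(crash_at, snapshot_at):
    """Exactly-once state: the read position and the state are
    snapshotted together, and recovery restores both."""
    total = 0
    snapshot = (0, 0)  # (offset, total) before any event
    for offset, value in enumerate(LOG[:crash_at]):
        total += value
        if offset + 1 == snapshot_at:
            snapshot = (offset + 1, total)
    offset, total = snapshot  # crash: roll back to the snapshot
    for value in LOG[offset:]:  # replay only what came after it
        total += value
    return total
```

With a crash after three events, the first function counts those three events twice, while the second rolls back to the snapshot taken after two events and ends at the true sum of the log.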

Page 37: Big Data Computing Architecture

Jay Kreps

Page 38: Big Data Computing Architecture

Kappa Architecture