Transcript
Page 1: Big Data Computing Architecture

Evolution of Big Data Architecture

Gang Tao

Page 2: Big Data Computing Architecture

Computing Trend

Page 3: Big Data Computing Architecture
Page 4: Big Data Computing Architecture

Non Functional Requirements

• Latency

• Throughput

• Fault Tolerance

• Scalability

• Exactly-Once Semantics

Page 5: Big Data Computing Architecture

Hadoop

Page 6: Big Data Computing Architecture

Hadoop Ecosystem

Page 7: Big Data Computing Architecture

Map Reduce

Move computation to Data
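The MapReduce model the slide refers to can be sketched in a single process. This is a hypothetical word-count example (the canonical MapReduce demo), not Hadoop's actual Java API; the map, shuffle, and reduce phases stand in for what the framework distributes across the cluster:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data computing", "data moves to computation", "big big data"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
result = reduce_phase(shuffle(pairs))
print(result["big"])   # 3
print(result["data"])  # 3
```

In a real cluster each map task runs on the node that holds its input split, which is what "move computation to data" means.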

Page 8: Big Data Computing Architecture

Map Reduce

Page 9: Big Data Computing Architecture

Hadoop the Limitation

• Map/Reduce is hard to use

• Latency is high

• Inevitable data movement

Page 10: Big Data Computing Architecture

Lambda

Page 11: Big Data Computing Architecture

Lambda Architecture

query = function(all data)

Page 12: Big Data Computing Architecture

Design Principle

human fault-tolerance – the system must be unsusceptible to data loss or data corruption, because at scale the damage could be irreparable.

data immutability – store data in its rawest form, immutable and in perpetuity (INSERT/SELECT/DELETE but no UPDATE!).

recomputation – with the two principles above, it is always possible to (re)compute results by running a function over the raw data.
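The three principles can be sketched together: an append-only master dataset plus `query = function(all data)`. This is a minimal illustration, not a real storage layer; the names `insert` and `query` are assumptions made for the example:

```python
# Append-only master dataset: records are inserted, never updated in place.
master = []

def insert(record):
    # INSERT only; there is no UPDATE, so no human error can corrupt history.
    master.append(record)

def query(fn):
    # query = function(all data): always recomputable from the raw records.
    return fn(master)

insert({"user": "a", "clicks": 1})
insert({"user": "a", "clicks": 2})  # a new fact, not an overwrite
total = query(lambda data: sum(r["clicks"] for r in data))
print(total)  # 3
```

Because the raw records survive untouched, a buggy query function can simply be fixed and re-run over the same data.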

Page 13: Big Data Computing Architecture

Lambda Architecture

Page 14: Big Data Computing Architecture

Batch Processing

Page 15: Big Data Computing Architecture

Batch Processing

Using only batch processing always leaves you with a portion of unprocessed data.

Page 16: Big Data Computing Architecture

Realtime Stream Processing

Page 17: Big Data Computing Architecture

query = function(all data) + function(new data)
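The split formula above is the essence of Lambda: a batch view over all absorbed data, merged at query time with a realtime view over the data the batch layer has not seen yet. A minimal sketch (the view and merge functions are assumptions for illustration, not any particular framework's API):

```python
def build_view(events):
    # Both layers compute the same aggregation, here an event count.
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
    return counts

def query(batch_view, realtime_view):
    # Serving layer: merge the precomputed batch view with the
    # speed layer's view of events that arrived after the last batch run.
    merged = dict(batch_view)
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged

all_data = ["a", "b", "a"]   # already absorbed by the batch layer
new_data = ["a", "c"]        # only visible to the speed layer so far
result = query(build_view(all_data), build_view(new_data))
print(result)  # {'a': 3, 'b': 1, 'c': 1}
```

Note that the same aggregation logic appears in both layers; in practice the two layers run on different systems, which is exactly the synchronization problem raised later in the deck.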

Page 18: Big Data Computing Architecture

Lambda Architecture

Page 19: Big Data Computing Architecture

A Reference Case

Page 20: Big Data Computing Architecture

Lambda the Bad?

What is Good? Immutable Data, Reprocessing

What is Bad? Keeping code written in two different systems perfectly in sync

Page 21: Big Data Computing Architecture

Overcomplicated Lambda

Page 22: Big Data Computing Architecture

Batch vs. Stream

Page 23: Big Data Computing Architecture

Stream & Realtime

Page 24: Big Data Computing Architecture

Storm Logic View

Bolts

Spouts

Move data to Computation
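The spout/bolt dataflow can be simulated in plain Python. This is a hypothetical sketch of the concepts, not Storm's actual (Java-based) API: a spout emits tuples, and bolts transform or aggregate them as the data flows to the computation:

```python
class SentenceSpout:
    # Spout: the data source; emits tuples into the topology.
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        yield from self.sentences

class SplitBolt:
    # Bolt: a processing node; splits each sentence into words.
    def process(self, sentence):
        yield from sentence.split()

class CountBolt:
    # Bolt: terminal node keeping a running word count.
    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

# Wire spout -> split bolt -> count bolt: data moves to the computation.
spout = SentenceSpout(["storm moves data", "data to computation"])
split, count = SplitBolt(), CountBolt()
for sentence in spout.emit():
    for word in split.process(sentence):
        count.process(word)
print(count.counts["data"])  # 2
```

In a real topology the spout and each bolt run as parallel tasks on different workers, and tuples are routed between them over the network.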

Page 25: Big Data Computing Architecture

Storm Deployment View

Page 26: Big Data Computing Architecture

Code Compare

Page 27: Big Data Computing Architecture

DAG

• DAG : Directed Acyclic Graph

• Used in Spark, Storm, Flink etc.
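A DAG of operators determines a valid execution order: every operator runs only after all of its inputs. A small sketch using the standard library's `graphlib` (Python 3.9+); the four-stage job here is a made-up example, not taken from any of the engines named above:

```python
from graphlib import TopologicalSorter

# Hypothetical job: one source feeding two map operators,
# joined and written to a sink. Each node maps to its predecessors.
dag = {
    "sink":   {"join"},
    "join":   {"map_a", "map_b"},
    "map_a":  {"source"},
    "map_b":  {"source"},
    "source": set(),
}

# Topological order = a legal schedule: predecessors always come first.
order = list(TopologicalSorter(dag).static_order())
print(order[0], order[-1])  # source sink
```

Because the graph is acyclic, such an order always exists; engines like Spark and Flink additionally use the DAG to pipeline stages and to know what to recompute on failure.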

Page 28: Big Data Computing Architecture

Out of Control

Build Complexity with Simplicity

Page 29: Big Data Computing Architecture

Stream Processing Model

• One at a time

• Micro batch

Page 30: Big Data Computing Architecture

Stream Processing Model

Property                  One at a time   Micro Batch
Low Latency               Y               N
High Throughput           N               Y
At Least Once             Y               Y
Exactly Once              Sometimes       Y
Simple Programming Model  Y               N
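The latency/throughput trade-off in the table comes down to buffering. A minimal sketch of the two models (the function names and batch size are assumptions for illustration):

```python
def one_at_a_time(events, handle):
    # Each event is handled the moment it arrives: low latency,
    # but per-event overhead limits throughput.
    for event in events:
        handle([event])

def micro_batch(events, handle, batch_size):
    # Events are buffered into small batches before processing:
    # better throughput per call, at the cost of batching latency.
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:
        handle(batch)  # flush the final partial batch

single_calls, batched_calls = [], []
one_at_a_time(range(3), single_calls.append)
micro_batch(range(7), batched_calls.append, batch_size=3)
print([len(b) for b in single_calls])   # [1, 1, 1]
print([len(b) for b in batched_calls])  # [3, 3, 1]
```

Micro-batch engines such as Spark Streaming trigger the flush on a time interval rather than a fixed count, but the buffering trade-off is the same.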

Page 31: Big Data Computing Architecture

Stream Computing the Limitation

• Queries must be written before the data arrives; there should be another way to query past data

• Queries cannot be run twice; all results are lost when an error occurs, and the data is gone by the time a bug is found

• Disorder of events breaks results: recorded-time based queries, or arrival-time based queries?

Page 32: Big Data Computing Architecture

Batch/Stream Unification

Page 33: Big Data Computing Architecture

Stream with Spark

Page 34: Big Data Computing Architecture

Apache Flink

Page 35: Big Data Computing Architecture

Flink Distributed Snapshot

Page 36: Big Data Computing Architecture

Fault Tolerance in Stream

• At Least Once : ensure all operators see all events

• Stream -> Replay on failure

• Exactly Once :

• Flink : distributed Snapshot

• Spark : Micro Batch
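The replay-on-failure idea can be sketched with a durable log and a checkpointed offset. This is a simplified illustration of the principle, not Flink's actual barrier-snapshot algorithm: if state and offset are committed together, replay after a crash counts each event exactly once:

```python
class ReplayableStream:
    # A durable log: on failure, a consumer re-reads from its last
    # committed offset, so every operator sees all events.
    def __init__(self, events):
        self.events = events

    def read_from(self, offset):
        yield from enumerate(self.events[offset:], start=offset)

class Consumer:
    def __init__(self):
        self.offset = 0   # checkpointed position, committed with the state
        self.total = 0    # operator state

    def run(self, stream, crash_at=None):
        for offset, value in stream.read_from(self.offset):
            if crash_at is not None and offset == crash_at:
                raise RuntimeError("simulated failure")
            self.total += value
            self.offset = offset + 1  # state and offset advance together

stream = ReplayableStream([1, 2, 3, 4])
consumer = Consumer()
try:
    consumer.run(stream, crash_at=2)  # crash mid-stream
except RuntimeError:
    pass
consumer.run(stream)                  # replay from the last checkpoint
print(consumer.total)  # 10: each event counted exactly once
```

Without the committed offset, the replay would re-apply events 1 and 2 and give at-least-once semantics instead; Flink's distributed snapshots and Spark's micro-batches are two ways of making the "state and offset together" commit consistent across a whole cluster.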

Page 37: Big Data Computing Architecture

Jay Kreps

Page 38: Big Data Computing Architecture

Kappa Architecture
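Kappa's core claim is that one stream-processing code path suffices: to reprocess (after a bug fix or logic change), you simply replay the immutable log through the same function and swap in the new output. A minimal sketch, with hypothetical names:

```python
def process(log, from_offset=0):
    # The single code path: the same streaming function handles both
    # live traffic and full reprocessing of historical events.
    counts = {}
    for event in log[from_offset:]:
        counts[event] = counts.get(event, 0) + 1
    return counts

log = ["view", "click", "view"]  # append-only event log (e.g. a Kafka topic)
serving_view = process(log)      # job version 1 builds the serving view
reprocessed = process(log)       # "redeploy": replay the log from offset 0
print(reprocessed["view"])       # 2
```

Compared with Lambda, there is no separate batch implementation to keep in sync; the log retention period bounds how far back a replay can reach.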

