Building a Stream Processing System for Playable Ads Data at VMFive Gordon Tai Data Engineer @ HadoopCon 2015

HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apr 21, 2017

Transcript
Page 1: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Building a Stream Processing System for Playable Ads Data at VMFive

Gordon Tai Data Engineer

@ HadoopCon 2015

Page 2: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

TechCrunch Beijing, Aug. 2014 Champion

Page 3: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive


Page 4: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Gordon Tai 戴資力

Interests: Cluster computing, open-source

Research
• FedRDD: Federated RDDs for Multicluster Computing
• Criteria-Based Cluster Scheduling on Hadoop YARN
• FedLoop: Looping on Federated MapReduce
• FedMR: Federated MapReduce to Transparently Run Applications

Hobbies: Cooking and Basketball

Page 5: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

How did we start from here …


Page 6: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Try Apps Before You Install

Page 7: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Playable Ads Data


• Allows for interactive ads

• Play without install / download

• Preference “intensity” info

• Ordered event stream

• Real-time data


Page 8: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Playable Ads Data

[Figure: several rows of discrete events (E) laid out along a time axis t]

Page 9: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Playable Ads Data

[Figure: several rows of discrete events (E) laid out along a time axis t; each row is labeled as one of the "Sessions"]

Page 11: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Can we inspect the situation of a single session?

Can we query on how every AdPlay session ended?

Can we query on the platform types of error sessions?

What about by device? By time?

Page 13: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

1. Merge events into sessions

Requirements

[WHEN] Need to query merged sessions immediately.

[WHAT] Output sessions to a separate dataset.

[HOW] Can’t interfere with current flow.

Page 14: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm vs. Spark Streaming

• Benchmark:*
  - Storm: 10,000 records / sec / node
  - Spark Streaming: 400,000 records / sec / node

• Storm => not really that popular
  - version 0.9.x
  - Spark was released later and is already in 1.5.x

• Isn’t Spark Streaming the obvious choice?

* http://www.cs.duke.edu/~kmoses/cps516/dstream.html

Page 15: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm vs. Spark Streaming

• Storm: essentially a stream processing framework. Can also do micro-batch processing (with Trident API).

• Spark: essentially a batch processing framework that does stream processing using micro-batch.

[Figure: a spectrum from Batch to Stream, with Micro-batch in between. Micro-batch: groups of events (EEEEEEE…) are processed together. Streaming: each event (E) is processed on its own.]
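The difference between the two models can be illustrated with a minimal sketch in plain Python. This is not Storm or Spark code, and `process_streaming` and `micro_batches` are hypothetical names used only for illustration. The sketch also batches by count to stay short, whereas Spark Streaming forms its micro-batches by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def process_streaming(events: Iterable[str], handle: Callable[[str], None]) -> None:
    """Streaming: hand each event to the handler as soon as it arrives."""
    for event in events:
        handle(event)


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Micro-batch: buffer events and release them in small groups."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh buffer for the next group
    if batch:  # flush the final, possibly smaller, batch
        yield batch
```

In the streaming version the handler sees an event immediately; in the micro-batch version nothing is visible downstream until a whole group has been collected, which is the latency trade-off the slide points at.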

Page 16: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm vs. Spark Streaming

AdPlay’s use case:
stream processing framework: real-time merging before DB landing

[Figure: interleaved events (E) from several sessions, some not yet arrived (?), are merged in real time into ordered per-session sequences (1 2 3 4 … End) before database landing]

Why Storm?

• Streaming fits our use case better.

• Programming model provides general primitives to fit our application logic.

……

Page 17: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apache Storm Quick Recap

A processing framework for streams of data. React to data as it happens.

Core concepts - Data Model

Tuple
• Immutable set of K-V pairs
• “Events”

Stream
• Unbounded sequence of tuples

[Figure: a stream drawn as a sequence of tuples, T T T T T T T T T…]

Page 18: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apache Storm Quick Recap

A processing framework for streams of data. React to data as it happens.

Core concepts - Programming Model

Spout
• Source of data streams (tuples)
• All kinds of sources …

Bolt
• Consumes streams and potentially produces new streams

Topology
• DAG (directed acyclic graph) formed by wiring spouts and bolts

[Figure: an example topology in which two spouts feed a graph of four bolts]

Page 19: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apache Storm Quick Recap

Parallelism
• Spout: task #1 … task #N
• Bolt: task #1 … task #M

Grouping
• Fields Grouping: same field value goes to same task
• Shuffle Grouping: random assignment
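The two groupings can be sketched as functions that pick a task index for a tuple. This is plain Python that illustrates the idea only; it is not Storm's internal implementation, and the function names and the choice of `crc32` as the hash are assumptions made for the example.

```python
import random
import zlib


def fields_grouping(field_value: str, num_tasks: int) -> int:
    """Same field value always maps to the same task index."""
    # crc32 is stable across processes, unlike Python's salted hash() for str.
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Pick a task at random, ignoring the tuple's contents."""
    return rng.randrange(num_tasks)
```

For session merging, grouping on the session id matters: every event of one session has to reach the same bolt task, otherwise no single task ever sees the whole session.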

Page 20: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: Discrete Events Collection (capped) → stream by tail query → Events Puller Spout → (sid, event) → Session Merger Bolt (task #1, task #2, … task #N) → (sid, session) → MongoDB Insert Bolt (task #1, task #2, … task #M) → Merged Sessions Collection]

DB landing after merging
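The merging step of this topology can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the actual bolt: `SessionMerger` is a hypothetical class, events are simplified to strings, and a session is assumed to be complete when an `End` event arrives, as the session diagrams suggest.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

END = "End"  # assumed marker for the last event of a session


class SessionMerger:
    """Buffers events per session id and emits the session when it ends."""

    def __init__(self) -> None:
        self._open: Dict[str, List[str]] = defaultdict(list)

    def on_event(self, sid: str, event: str) -> Optional[Tuple[str, List[str]]]:
        self._open[sid].append(event)
        if event == END:
            # Session is complete: hand it downstream and free the buffer.
            return sid, self._open.pop(sid)
        return None
```

A returned `(sid, session)` pair corresponds to the tuple sent on to the MongoDB Insert Bolt; `None` means the session is still open.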

Page 21: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

2. Reliability

• Cluster computing frameworks are prone to failures.

• Two main considerations for reliability:
  - Event process guarantee
  - Failure of stateful computation

Page 22: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Event process guarantee

3 types of guarantees

• At most once [0,1]:
  Each event is delivered only once and never retried, regardless of success/failure.

• At least once [1 … n]:
  Each event can be redelivered multiple times to ensure success.

• Exactly once [1]:
  Events are never lost and are never redelivered. Perfect delivery.

Page 23: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm’s built-in fault tolerance?

Tuple Acknowledgment

[Figure: a topology of spouts and bolts; a spout emits tuple T. Label: anchor with a tuple ID]

Page 24: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm’s built-in fault tolerance?

Tuple Acknowledgment

[Figure: the same topology; tuple T is acknowledged with ack(tuple)]

Page 25: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Storm’s built-in fault tolerance?

Tuple Acknowledgment

[Figure: the same topology; tuple T is failed with fail(tuple)]

Still dependent on the data source for processing guarantees.
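The ack / fail cycle can be sketched with a toy spout in plain Python. `ReplayingSpout` is a hypothetical class, not Storm's API, although its three methods mirror the `nextTuple`, `ack` and `fail` callbacks of a Storm spout. The toy keeps unacknowledged tuples in memory; a real spout can only replay what its data source is able to deliver again, which is the dependency noted above.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple


class ReplayingSpout:
    """Toy spout: remembers emitted tuples until they are acked."""

    def __init__(self, messages: Iterable[str]) -> None:
        self._queue: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self._pending: Dict[int, str] = {}

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self._queue:
            return None
        tuple_id, payload = self._queue.popleft()
        self._pending[tuple_id] = payload  # keep until acked
        return tuple_id, payload

    def ack(self, tuple_id: int) -> None:
        self._pending.pop(tuple_id, None)  # fully processed, forget it

    def fail(self, tuple_id: int) -> None:
        # Re-queue for another attempt: this is what makes it at-least-once.
        self._queue.append((tuple_id, self._pending.pop(tuple_id)))
```

Because a failed tuple is emitted again, downstream bolts may see the same event more than once, so their processing has to tolerate duplicates.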

Page 26: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Event process guarantee

3 types of event sources

• Unreliable:

No means to replay a previously-received message.

• Reliable:

Can somehow replay a message if processing fails at any point.

• Durable:

Can replay any message or set of messages given selection criteria.


Page 27: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: Discrete Events Collection (capped) → stream by tail query → Events Puller Spout → (sid, event) → Session Merger Bolt (task #1, task #2, … task #N) → (sid, session) → MongoDB Insert Bolt (task #1, task #2, … task #M) → Merged Sessions Collection. The event source is annotated "unreliable".]

DB landing after merging

Page 28: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Kafka Quick Overview

• Publish-subscribe distributed messaging queue.

• Configurable message retention.

• Highly fault tolerant: tolerates N-1 node failures with a replication factor of N.

• High throughput: Producer - 2M msgs / sec. Consumer - 100M/sec.*

* http://kafka.apache.org/07/performance.html

Page 29: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: Kafka topics, including discrete-events-topic (E E E E…) → consume topic → Events Puller Spout → (sid, event) → Session Merger Bolt (task #1, task #2, … task #N) → (sid, session) → MongoDB Insert Bolt (task #1, task #2, … task #M) → Merged Sessions Collection]

DB landing after merging

Page 30: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Failure of stateful computation

• Bolts hold “state” information.
  - In-memory Map for merging AdPlay sessions

• Although Kafka is a durable data source, deciding the selection criteria on failure can still be very hard.

• Solution: store all stateful info in an external in-memory storage.

Page 31: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: discrete-events-topic (E E E E…) → Events Puller Spout → (sid, event) → Session Merger Bolt (task #1 … task #N) → (sid, session) → MongoDB Insert Bolt (task #1 … task #M) → Merged Sessions Collection. The Session Merger Bolt keeps its state in an External Object State Storage (key: sid, value: sorted set of events), writing with "ZADD sid time event" and reading with "ZRANGE sid 0 -1". The storage also serves online monitoring.]
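The two commands in the diagram are Redis sorted-set commands: `ZADD key score member` and `ZRANGE key 0 -1`. To show why a sorted set suits this state without requiring a running server, the following is an in-memory stand-in written in plain Python. `SortedSetStore` and its method names are hypothetical; a real deployment would issue the commands through a Redis client instead.

```python
import bisect
from typing import Dict, List, Tuple


class SortedSetStore:
    """In-memory stand-in for the two sorted-set commands on the slide."""

    def __init__(self) -> None:
        self._sets: Dict[str, List[Tuple[float, str]]] = {}

    def zadd(self, key: str, score: float, member: str) -> None:
        entries = self._sets.setdefault(key, [])
        # A sorted set holds each member once; re-adding updates its score.
        entries[:] = [e for e in entries if e[1] != member]
        bisect.insort(entries, (score, member))

    def zrange_all(self, key: str) -> List[str]:
        """Equivalent of ZRANGE key 0 -1: all members, ordered by score."""
        return [member for _, member in self._sets.get(key, [])]
```

With the event time as the score, events that arrive out of order come back ordered, and an event that is redelivered after a failure does not appear twice. Since the state lives outside the bolt, a restarted bolt task can pick up a half-merged session where it left off.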

Page 32: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

3. Security

• Exposing Kafka to the public network can be dangerous.

• Pull input to the internal pipeline.
  - Separate read / write permissions.
  - Ingest only recognizable data.
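"Ingest only recognizable data" can be sketched as a filter applied to every record pulled into the internal pipeline. This is a minimal illustration in plain Python: `parse_recognized` is a hypothetical helper, and the required field names (`sid`, `time`, `event`) are assumptions based on the fields used elsewhere in this talk, not the actual event schema.

```python
import json
from typing import Optional

REQUIRED_FIELDS = {"sid", "time", "event"}  # assumed field names


def parse_recognized(raw: bytes) -> Optional[dict]:
    """Return the decoded event, or None if the record is not recognizable."""
    try:
        record = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # not valid UTF-8 JSON
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return None  # valid JSON, but not shaped like an event
    return record
```

Records for which the helper returns `None` are simply dropped before they reach the internal topics.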

Page 33: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

AWS Kinesis

• PaaS version of Apache Kafka.

• Fixed data retention for 24 hours.

• Producers and consumers using AWS SDK / AWS KCL.

• Fully managed.

• Usage unit: Kinesis Stream
  - Charged per shard per hour.
  - Also charged per 1M PUTs.

Page 34: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: the pipeline with Kinesis in front. Labels: producer user; consumer user; Kafka Topic Dispatcher (Kinesis consumer); discrete events topics A, B, C, D; Events Puller Spout; Session Merger Bolt; External Object State Storage (key: sid, value: sorted set of events); MongoDB Insert Bolt]

Page 35: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

4. Schema Adaptability

• We soon realized that our event log schema changed very frequently.

• Problem: log schema was hardcoded into our stream processing logic.
  - required a separate topology for every different schema

• Stupid, I know ;)

• Solution: need a central serialization system to manage schema.
  - replace JSON after event logs enter the stream pipeline.

Page 36: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apache Avro

• A data serialization system.

• Rich data structures.
  - String
  - Numbers (int, long, float)
  - Bytes
  - Boolean
  - null
  - Nested objects
  - …

• When Avro data is read, the schema used when writing it is always present.

Page 37: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Apache Avro

[Figure: side by side, the .avsc schema definition and the actual event JSON]
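The schema and event shown on this slide did not survive in the transcript. As a stand-in, here is a hypothetical `.avsc` record schema and a matching JSON event, with a small check that the event fits the schema. The record name and the field names are assumptions made for illustration, and `conforms` is a toy check for two primitive types only; in practice an Avro library would handle validation and serialization.

```python
import json

# Hypothetical schema; names are illustrative, not the actual event schema.
SCHEMA_AVSC = """
{
  "type": "record",
  "name": "AdPlayEvent",
  "fields": [
    {"name": "sid",   "type": "string"},
    {"name": "time",  "type": "long"},
    {"name": "event", "type": "string"}
  ]
}
"""

# Python types accepted for the Avro primitive types used above.
PY_TYPES = {"string": str, "long": int}


def conforms(event: dict, schema: dict) -> bool:
    """Check that every schema field is present with a matching type."""
    for field in schema["fields"]:
        value = event.get(field["name"])
        if not isinstance(value, PY_TYPES[field["type"]]):
            return False
    return True


schema = json.loads(SCHEMA_AVSC)
event = json.loads('{"sid": "s1", "time": 1442620800000, "event": "tap"}')
```

Keeping the schema in a file such as this, rather than in the processing code, is what lets the event format change without rewriting the topology.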

Page 38: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: Kafka Topic Dispatcher → discrete events topics A, B, C → topologies A, B, C (one topology per topic) → External object state storage]

Page 39: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: Kafka Topic Dispatcher, with schema_1.avsc, schema_2.avsc, …, turns each event (E) into Avro (E.avro) → a single discrete events topic → a single session merge topology → External Object State Storage]

Page 40: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

[Figure: the same pipeline after a schema change: schema_1_updated.avsc, schema_2.avsc, … → Kafka Topic Dispatcher → discrete events topic → single session merge topology → External Object State Storage. In addition, a backup events topic feeds a backup session merge topology.]

Page 41: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Kappa Architecture*

•Everything is a stream.

•Backup your data into a durable buffer (Kafka).

•Resubmit an updated job that consumes the backup.

* http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
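The reprocessing idea can be sketched in plain Python, with a list standing in for the durable Kafka backup topic. Everything here is hypothetical and for illustration only: `log`, `run_job` and the two job versions are made-up names, and the "job" is a trivial per-session event count rather than the real session merge.

```python
from typing import Callable, Dict, List

# Stand-in for a durable, replayable log such as a Kafka topic.
log: List[dict] = [
    {"sid": "s1", "event": "open"},
    {"sid": "s1", "event": "End"},
    {"sid": "s2", "event": "open"},
]

Job = Callable[[Dict[str, int], dict], None]


def run_job(job: Job, from_offset: int = 0) -> Dict[str, int]:
    """Replay the log from an offset through a job and return its output."""
    output: Dict[str, int] = {}
    for record in log[from_offset:]:
        job(output, record)
    return output


def count_events_v1(output: Dict[str, int], record: dict) -> None:
    output[record["sid"]] = output.get(record["sid"], 0) + 1


def count_events_v2(output: Dict[str, int], record: dict) -> None:
    # Updated logic: the End marker no longer counts as an event.
    if record["event"] != "End":
        count_events_v1(output, record)
```

Running `run_job(count_events_v2)` rebuilds the output from the start of the log with the new logic; no separate batch system is needed, only the retained stream and a resubmitted job.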

Page 42: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

So, what did we go through?

• Building a stream processing system for ordered event streams.

• Reliability: Kafka + Redis

• Security: AWS Kinesis

• Schema Adaptability: Avro + Kappa architecture

Every use case is unique. Look closely at your needs ;)