HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive

Building a Stream Processing System for Playable Ads Data at VMFive

Gordon Tai Data Engineer

@ HadoopCon 2015

TechCrunch Beijing, Aug. 2014 Champion

Gordon Tai (戴資力)

Interests: Cluster computing, open-source

Research:

• FedRDD: Federated RDDs for Multicluster Computing

• Criteria-Based Cluster Scheduling on Hadoop YARN

• FedLoop: Looping on Federated MapReduce

• FedMR: Federated MapReduce to Transparently Run Applications

Hobbies: Cooking and Basketball

How did we start from here …


Try Apps Before You Install

Playable Ads Data


• Allows for interactive ads

• Play without install / download

• Preference “intensity” info

• Ordered event stream

• Real-time data

[Figure: rows of events (E) along a time axis (t); one row of events is highlighted and labeled "Sessions"]

Can we inspect the situation of a single session?

Can we query on how every AdPlay session ended?

Can we query on the platform types of error sessions?

What about by device? By time?


1. Merge events into sessions

Requirements

[WHEN] Need to query merged sessions immediately.

[WHAT] Output sessions to a separate dataset.

[HOW] Can’t interfere with current flow.

Storm vs. Spark Streaming

• Benchmark:*

- Storm: 10,000 records / sec / node

- Spark Streaming: 400,000 records / sec / node

• Storm => not really that popular

- version 0.9.x

- Spark was released later and is already in 1.5.x

• Isn’t Spark Streaming the obvious choice?

* http://www.cs.duke.edu/~kmoses/cps516/dstream.html

Storm vs. Spark Streaming

• Storm: essentially a stream processing framework. Can also do micro-batch processing (with Trident API).

• Spark: essentially a batch processing framework that does stream processing using micro-batch.
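To make the difference between the two models concrete, here is a minimal sketch in plain Python. It is not Storm or Spark code: the function names are hypothetical, and the micro-batch is cut by event count, whereas a real micro-batch engine would typically cut by time interval.

```python
from typing import Callable, Iterable, List


def process_streaming(events: Iterable[str], handle: Callable[[str], None]) -> None:
    """Streaming: hand each event to the handler as soon as it arrives."""
    for event in events:
        handle(event)


def process_micro_batch(events: Iterable[str], batch_size: int,
                        handle_batch: Callable[[List[str]], None]) -> None:
    """Micro-batch: buffer events and hand them over one small batch at a time."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        handle_batch(batch)


incoming = ["E1", "E2", "E3", "E4", "E5"]
streamed: List[str] = []
batched: List[List[str]] = []

process_streaming(incoming, streamed.append)
process_micro_batch(incoming, batch_size=2, handle_batch=batched.append)
```

In the streaming case the handler sees five separate calls; in the micro-batch case it sees three calls, and an event waits until its batch is full before anything happens to it.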

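[Figure: a spectrum from Batch to Stream with Micro-batch in between. In micro-batch, groups of events (EEEEEEE…) are processed together; in streaming, each event (E) is processed as it arrives]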

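Storm vs. Spark Streaming

AdPlay’s use case:

stream processing framework: real-time merging before DB landing

[Figure: interleaved events (E) from different sessions, some not yet arrived (?), are merged into ordered per-session sequences (1234567, 123456, 1234End) and land in the database once a session reaches End]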

Why Storm?

• Streaming fits our use case better.

• Programming model provides general primitives to fit our application logic.


Apache Storm Quick Recap

A processing framework for streams of data. React to data as it happens.

Core concepts - Data Model

Tuple

• Immutable set of K-V pairs

• “Events”

Stream

• Unbounded sequence of tuples

[Figure: a stream drawn as a sequence of tuples, T T T T T …]

Core concepts - Programming Model

Spout

• Source of data streams (tuples)

• All kinds of sources …

Bolt

• Consumes streams and potentially produces new streams

Topology

• DAG (directed acyclic graph) formed by wiring spouts and bolts

[Figure: a topology in which two spouts feed four bolts]

Apache Storm Quick Recap

Parallelism

[Figure: a spout running as task #1 … task #N, connected to a bolt running as task #1 … task #M]

Grouping

• Fields Grouping: same field value goes to same task

• Shuffle Grouping: random assignment
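The two groupings can be sketched as two ways of choosing a task index. This is an illustration in plain Python, not Storm's implementation: the function names are hypothetical, and the CRC32 hash is an assumption standing in for whatever hash Storm uses internally.

```python
import random
import zlib
from typing import List


def fields_grouping(field_value: str, num_tasks: int) -> int:
    """Pick a task from a stable hash of the field, so equal values share a task."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Pick a task at random, which spreads tuples roughly evenly."""
    return rng.randrange(num_tasks)


NUM_TASKS = 4
sids: List[str] = ["s1", "s2", "s1", "s3", "s1"]

# Every tuple carrying "s1" is routed to the same task index.
fields_assignment = [fields_grouping(sid, NUM_TASKS) for sid in sids]

# With shuffle grouping the task does not depend on the tuple's content.
rng = random.Random(0)
shuffle_assignment = [shuffle_grouping(NUM_TASKS, rng) for _ in sids]
```

Fields grouping on the session ID is what lets a single bolt task see all events of one session, which matters when that task merges them.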

[Figure: the session merge topology. The Events Puller Spout streams from the Discrete Events Collection (capped) by tail query and emits (sid, event) tuples to the Session Merger Bolt (task #1, task #2, … task #N). The Session Merger Bolt emits (sid, session) tuples to the MongoDB Insert Bolt (task #1, task #2, … task #M), which writes to the Merged Sessions Collection. DB landing after merging.]
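The core of the Session Merger Bolt can be sketched as a map from session ID to the events seen so far. This is a plain-Python illustration under stated assumptions, not the actual bolt: the `(sequence number, name)` event shape, the `End` marker name, and the class name are all hypothetical, and it assumes the End event of a session arrives after its other events.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, str]  # (sequence number, event name); a hypothetical shape


class SessionMerger:
    """Buffers events per session ID and releases the session when it ends."""

    END = "End"  # assumed name of the event that closes a session

    def __init__(self) -> None:
        self._open: Dict[str, List[Event]] = defaultdict(list)

    def on_event(self, sid: str, event: Event) -> Optional[List[Event]]:
        """Add one event; return the ordered session once its End event arrives."""
        self._open[sid].append(event)
        if event[1] != self.END:
            return None
        return sorted(self._open.pop(sid))


merger = SessionMerger()
stream = [
    ("a", (2, "Tap")),    # arrives before the session's first event
    ("b", (1, "Start")),
    ("a", (1, "Start")),
    ("a", (3, "End")),
    ("b", (2, "End")),
]

merged: Dict[str, List[Event]] = {}
for sid, event in stream:
    session = merger.on_event(sid, event)
    if session is not None:
        merged[sid] = session  # this is where the DB landing would happen
```

Because the map lives inside one task, every event of a session must reach the same task, and the buffered state is lost if that task dies.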

2. Reliability

• Cluster computing frameworks are prone to failures.

• Two main considerations for reliability:

- Event process guarantee

- Failure of stateful computation

Event process guarantee

3 types of guarantees

• At most once [0,1]:

Each event is processed only once, regardless of success/failure.

• At least once [1 … n]:

Each event can be redelivered multiple times to ensure success.

• Exactly once [1]:

Events are never lost and are never redelivered. Perfect delivery.


Storm’s built-in fault tolerance?

Tuple Acknowledgment

[Figure: three steps over the same topology of spouts and bolts. (1) A spout emits a tuple T, anchored with a tuple ID. (2) On success, a bolt calls ack(tuple). (3) On failure, a bolt calls fail(tuple).]

Still dependent on the data source for processing guarantees.
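The ack / fail cycle can be sketched as a spout that remembers every emitted tuple until it is acknowledged. This is a simplified single-process model in Python, not Storm's API: the method names only mirror the `nextTuple`, `ack` and `fail` callbacks of a Storm spout, and the `run` loop and `flaky_bolt` are hypothetical helpers.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class ReplayingSpout:
    """Keeps every emitted tuple until it is acked; a failed tuple is queued again."""

    def __init__(self, messages: List[str]) -> None:
        self._queue: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self._pending: Dict[int, str] = {}

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self._queue:
            return None
        tuple_id, message = self._queue.popleft()
        self._pending[tuple_id] = message  # remember it until acked
        return tuple_id, message

    def ack(self, tuple_id: int) -> None:
        self._pending.pop(tuple_id)

    def fail(self, tuple_id: int) -> None:
        # Replay is only possible because the message was kept (or can be re-read).
        self._queue.append((tuple_id, self._pending.pop(tuple_id)))


def run(spout: ReplayingSpout, bolt: Callable[[str], bool]) -> None:
    """Drive the spout until nothing is left, acking or failing each tuple."""
    emitted = spout.next_tuple()
    while emitted is not None:
        tuple_id, message = emitted
        if bolt(message):
            spout.ack(tuple_id)
        else:
            spout.fail(tuple_id)
        emitted = spout.next_tuple()


attempts: Dict[str, int] = {}


def flaky_bolt(message: str) -> bool:
    """Fails the first time it sees "E2", succeeds otherwise."""
    attempts[message] = attempts.get(message, 0) + 1
    return not (message == "E2" and attempts[message] == 1)


spout = ReplayingSpout(["E1", "E2", "E3"])
run(spout, flaky_bolt)
```

"E2" ends up processed twice, which is the at-least-once behaviour. The sketch also shows the dependency on the source: if the spout cannot get the failed message back from somewhere, `fail` has nothing to replay.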

Event process guarantee

3 types of event sources

• Unreliable:

No means to replay a previously-received message.

• Reliable:

Can somehow replay a message if processing fails at any point.

• Durable:

Can replay any message or set of messages given selection criteria.


[Figure: the session merge topology again (Events Puller Spout, Session Merger Bolt, MongoDB Insert Bolt), with the source, the Discrete Events Collection (capped) streamed by tail query, marked as unreliable]

Kafka Quick Overview

• Publish-subscribe distributed messaging queue.

• Configurable message retention.

• Highly fault tolerant. With a replication factor of N, tolerates N-1 node failures.

• High throughput: Producer - 2M msgs / sec. Consumer - 100M/sec.*

* http://kafka.apache.org/07/performance.html

[Figure: the session merge topology with Kafka as the source. The Events Puller Spout consumes the topic discrete-events-topic and emits (sid, event) tuples to the Session Merger Bolt (task #1, task #2, … task #N), which emits (sid, session) tuples to the MongoDB Insert Bolt (task #1, task #2, … task #M), which writes to the Merged Sessions Collection. DB landing after merging.]

Failure of stateful computation

• Bolts hold “state” information.

- In-memory Map for merging AdPlay sessions

• Although Kafka is a durable data source, deciding the selection criteria on failure can still be very hard.

• Solution: store all stateful info in an external in-memory store.

[Figure: the session merge topology with an External Object State Storage. The Events Puller Spout consumes discrete-events-topic; the Session Merger Bolt (task #1 … task #N) stores each event with ZADD sid time event and reads a session back with ZRANGE sid 0 -1 (key: sid, value: sorted set of events); the MongoDB Insert Bolt (task #1 … task #M) writes to the Merged Sessions Collection. The external store also allows online monitoring.]
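The two commands on the slide, ZADD and ZRANGE, are enough to keep a session's events ordered outside the bolt. The sketch below is an in-process Python stand-in that imitates only those two commands so it can run without a server; the class name and the sample session are hypothetical, and a real deployment would issue the same commands to the external store instead.

```python
from typing import Dict, List


class SortedSetStore:
    """In-process stand-in for the two sorted-set commands named on the slide."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zadd(self, key: str, score: float, member: str) -> int:
        """Set the member's score; return 1 if the member is new, else 0."""
        members = self._sets.setdefault(key, {})
        is_new = member not in members
        members[member] = score
        return int(is_new)

    def zrange(self, key: str, start: int, stop: int) -> List[str]:
        """Return members ordered by score; stop is inclusive and may be negative."""
        members = self._sets.get(key, {})
        ordered = sorted(members, key=lambda m: (members[m], m))
        n = len(ordered)
        if start < 0:
            start = max(n + start, 0)
        if stop < 0:
            stop = n + stop
        if stop < 0:
            return []
        return ordered[start:stop + 1]


store = SortedSetStore()

# ZADD sid time event: events may arrive out of order or be redelivered.
store.zadd("sid-1", 3, "End")
store.zadd("sid-1", 1, "Start")
store.zadd("sid-1", 2, "Tap")
redelivered = store.zadd("sid-1", 2, "Tap")  # a replayed event adds nothing new

# ZRANGE sid 0 -1: read the whole session back in time order.
session = store.zrange("sid-1", 0, -1)
```

Two properties make this a good fit here: ordering by time comes for free, and a replayed event overwrites itself instead of being duplicated, so at-least-once delivery does not corrupt the merged session.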

3. Security

• Exposing Kafka to the public network can be dangerous.

• Pull input to the internal pipeline.

- Separate read / write permissions.

- Ingest only recognizable data.

AWS Kinesis

• PaaS version of Apache Kafka.

• Fixed data retention for 24 hours.

• Producers and consumers using AWS SDK / AWS KCL.

• Fully managed service.

• Usage unit: Kinesis Stream

- Hourly charge per shard.

- Also charged per 1M PUTs.

[Figure: the pipeline with a Kafka Topic Dispatcher (a Kinesis consumer) in front of Kafka. A producer user and a consumer user are shown as separate roles. The dispatcher feeds discrete events topics A, B, C, D, which are read by the Events Puller Spout, the Session Merger Bolt (with External Object State Storage; key: sid, value: sorted set of events) and the MongoDB Insert Bolt.]

4. Schema Adaptability

• We soon realized that our event log schema changed very frequently.

• Problem: log schema was hardcoded into our stream processing logic.

- required a separate topology for every different schema

• Stupid, I know ;)

• Solution: need a central serialization system to manage schema.

- replace JSON once event logs enter the stream pipeline.

Apache Avro

• A data serialization system.

• Rich data structures.

- String

- Numbers (int, long, float)

- Bytes

- Boolean

- null

- Nested objects

- …

• When Avro data is read, the schema used when writing it is always present.
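As an illustration of keeping the schema outside the processing logic, here is a sketch of an .avsc-style record schema and a small function that fits a JSON event to it. Everything specific is an assumption: the record name, the field names and the `conform` helper are hypothetical, and the helper is a plain-Python stand-in that handles only primitive types and unions. It is not the Avro library and does no binary encoding.

```python
import json
from typing import Any, Dict

# A hypothetical .avsc definition; the real event schema is not shown in the slides.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "AdPlayEvent",
  "fields": [
    {"name": "sid", "type": "string"},
    {"name": "time", "type": "long"},
    {"name": "event", "type": "string"},
    {"name": "platform", "type": ["null", "string"], "default": null}
  ]
}
"""

PRIMITIVES = {
    "string": str, "int": int, "long": int, "float": float,
    "bytes": bytes, "boolean": bool, "null": type(None),
}


def matches(value: Any, avro_type: Any) -> bool:
    """Check a value against a primitive type name or a union (list) of them."""
    if isinstance(avro_type, list):
        return any(matches(value, branch) for branch in avro_type)
    expected = PRIMITIVES[avro_type]
    if expected is int and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def conform(event: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only schema fields, fill defaults, and reject missing or mistyped ones."""
    record: Dict[str, Any] = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in event:
            value = event[name]
        elif "default" in field:
            value = field["default"]
        else:
            raise ValueError(f"missing field: {name}")
        if not matches(value, field["type"]):
            raise ValueError(f"wrong type for field: {name}")
        record[name] = value
    return record


schema = json.loads(SCHEMA_JSON)
raw_event = json.loads('{"sid": "s1", "time": 1442000000, "event": "Start", "extra": 1}')
record = conform(raw_event, schema)
```

The processing code only ever sees records shaped by the schema, so adding an optional field with a default becomes a schema change rather than a code change.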

[Figure: an .avsc schema definition shown beside the actual event JSON]

[Figure: before Avro. The Kafka Topic Dispatcher feeds discrete events topics A, B, C, …, each consumed by its own topology (topology A, topology B, topology C, …), all sharing the External object state storage.]

[Figure: with Avro. Events (E) are converted to E.avro using schema_1.avsc, schema_2.avsc, …; the Kafka Topic Dispatcher feeds one discrete events topic, read by a single session merge topology with External Object State Storage.]

[Figure: after a schema update (schema_1_updated.avsc, schema_2.avsc, …). A backup events topic is read by a backup session merge topology, next to the discrete events topic and the single session merge topology.]

Kappa Architecture*

• Everything is a stream.

• Backup your data into a durable buffer (Kafka).

• Resubmit an updated job that consumes the backup.

* http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
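The replay idea can be sketched with a list standing in for the retained Kafka log. This is a toy model in Python: `run_job`, `job_v1`, `job_v2` and the sample events are hypothetical names and data, chosen only to show that an updated job can rebuild its output by reading the backup from the start.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
Job = Callable[[Event], Event]


def run_job(log: List[Event], job: Job, from_offset: int = 0) -> List[Event]:
    """Consume the retained log from an offset and return the job's output."""
    return [job(event) for event in log[from_offset:]]


def job_v1(event: Event) -> Event:
    """Original job: only knows about the session ID."""
    return {"sid": event["sid"]}


def job_v2(event: Event) -> Event:
    """Updated job: also understands a field added by a newer schema."""
    return {"sid": event["sid"], "platform": event.get("platform", "unknown")}


# The durable buffer keeps the raw events, so nothing has to be re-collected.
retained_log: List[Event] = [
    {"sid": "s1", "platform": "ios"},
    {"sid": "s2"},
]

output_v1 = run_job(retained_log, job_v1)
# Resubmit the updated job and let it consume the backup from offset 0.
output_v2 = run_job(retained_log, job_v2)
```

There is no separate batch path to keep in sync: the same streaming code produces both the live results and the recomputed ones.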

So, what did we go through?

• Building a stream processing system for ordered event streams.

• Reliability: Kafka + Redis

• Security: AWS Kinesis

• Schema Adaptability: Avro + Kappa architecture

Every use case is unique. Look closely at your needs ;)

Email: [email protected]

Site: vmfive.com
