Building a Stream Processing System for Playable Ads Data at VMFive Gordon Tai Data Engineer @ HadoopCon 2015
Apr 21, 2017
TechCrunch Beijing, Aug. 2014: Champion
Gordon Tai 戴資力
Interests: Cluster computing, open-source
Research:
• FedRDD: Federated RDDs for Multicluster Computing
• Criteria-Based Cluster Scheduling on Hadoop YARN
• FedLoop: Looping on Federated MapReduce
• FedMR: Federated MapReduce to Transparently Run Applications
Hobbies: Cooking and Basketball
How did we start from here …
Try Apps Before You Install
Playable Ads Data
• Allows for interactive ads
• Play without install / download
• Preference “intensity” info
• Ordered event stream
• Real-time data
Playable Ads Data
[Figure: several rows of discrete events (E) laid out along a time axis t]
[Figure: the same rows of events (E) along the time axis t, now labeled as Sessions]
Can we inspect the situation of a single session?
Can we query on how every AdPlay session ended?
Can we query on the platform types of error sessions?
What about by device? By time?
1. Merge events into sessions
Requirements
[WHEN] Need to query merged sessions immediately.
[WHAT] Output sessions to a separate dataset.
[HOW] Can’t interfere with current flow.
Storm vs. Spark Streaming
• Benchmark:*
  Storm: 10,000 records / sec / node
  Spark Streaming: 400,000 records / sec / node
• Storm => not really that popular
  - Storm is at version 0.9.x
  - Spark was released later and is already at 1.5.x
• Isn’t Spark Streaming the obvious choice?
* http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Storm vs. Spark Streaming
• Storm: essentially a stream processing framework. Can also do micro-batch processing (with Trident API).
• Spark: essentially a batch processing framework that does stream processing using micro-batches.
[Figure: a spectrum from Batch to Stream, with Micro-batch in between. Micro-batch collects events (E) into small groups before each process step; streaming runs a process step on each event (E) as it arrives.]
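The difference between the two models can be sketched in a few lines of plain Python. This is only an illustration, not Storm or Spark code: the function names are hypothetical, and batches are cut by event count here, whereas real micro-batch engines usually cut them by time interval.

```python
from typing import Callable, Iterable, List

Event = str


def process_streaming(events: Iterable[Event],
                      handle: Callable[[Event], None]) -> None:
    """Streaming: call the handler once per event, as soon as it arrives."""
    for event in events:
        handle(event)


def process_micro_batch(events: Iterable[Event], batch_size: int,
                        handle_batch: Callable[[List[Event]], None]) -> None:
    """Micro-batch: buffer events, then hand each small batch to the handler."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []  # start a fresh buffer for the next batch
    if batch:  # flush the final, possibly smaller, batch
        handle_batch(batch)


stream_calls: List[Event] = []
batch_calls: List[List[Event]] = []
process_streaming(["e1", "e2", "e3", "e4", "e5"], stream_calls.append)
process_micro_batch(["e1", "e2", "e3", "e4", "e5"], 2, batch_calls.append)
```

With five events and a batch size of 2, the streaming handler runs five times, while the micro-batch handler runs three times and only sees an event once its batch is full.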
Storm vs. Spark Streaming
AdPlay’s use case:
stream processing framework: real-time merging before DB landing
[Figure: interleaved events (E), some still missing (?), are merged in real time into ordered sessions (1234567, 123456, 1234End, …) that close with an End event before database landing]
Why Storm?
• Streaming fits our use case better.
• Programming model provides general primitives to fit our application logic.
……
Apache Storm Quick Recap
A processing framework for streams of data. React to data as it happens.
Core concepts - Data Model
Tuple
• Immutable set of K-V pairs
• “Events”
Stream
• Unbounded sequence of tuples
[Figure: a stream drawn as a sequence of tuples: T T T T T T T T T…]
Apache Storm Quick Recap
A processing framework for streams of data. React to data as it happens.
Core concepts - Programming Model
Spout
• Source of data streams (tuples)
• All kinds of sources …
Bolt
• Consumes streams and potentially produces new streams
Topology
• DAG (directed acyclic graph) formed by wiring spouts and bolts
[Figure: an example topology wiring two spouts to four bolts]
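As a rough mental model only (this is plain Python with generators, not the Storm API), a spout can be pictured as a generator of tuples and a bolt as a function that consumes one stream and yields another. All names below are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

StormTuple = Dict[str, str]  # stand-in for a Storm tuple: a set of K-V pairs


def events_spout(raw_events: Iterable[Tuple[str, str]]) -> Iterator[StormTuple]:
    """Spout: turns some external source into a stream of tuples."""
    for sid, name in raw_events:
        yield {"sid": sid, "event": name}


def normalize_bolt(stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: consumes a stream and produces a new one (tuples are not mutated)."""
    for tup in stream:
        yield {**tup, "event": tup["event"].lower()}


def count_bolt(stream: Iterable[StormTuple]) -> Dict[str, int]:
    """Terminal bolt: counts events per session id."""
    counts: Dict[str, int] = {}
    for tup in stream:
        counts[tup["sid"]] = counts.get(tup["sid"], 0) + 1
    return counts


# "Topology": spout -> normalize_bolt -> count_bolt
raw = [("s1", "START"), ("s2", "START"), ("s1", "END")]
normalized = list(normalize_bolt(events_spout(raw)))
events_per_session = count_bolt(normalize_bolt(events_spout(raw)))
```

A real topology runs each spout and bolt as parallel tasks across a cluster; the chain of generators above only shows how the pieces are wired.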
Apache Storm Quick Recap
Parallelism
• Spout: task #1 … task #N
• Bolt: task #1 … task #M
Grouping
• Fields Grouping: same field value goes to same task
• Shuffle Grouping: random assignment
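The two groupings can be sketched as functions that pick a task index for each tuple. This is a simplified illustration: the function names are hypothetical, and CRC32 is used only as a convenient stable hash, not as the hash Storm itself uses.

```python
import random
import zlib


def fields_grouping_task(field_value: str, num_tasks: int) -> int:
    """Fields grouping: the same field value always maps to the same task."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def shuffle_grouping_task(num_tasks: int, rng: random.Random) -> int:
    """Shuffle grouping: each tuple is assigned to a randomly chosen task."""
    return rng.randrange(num_tasks)


NUM_TASKS = 4
# Every event of session "sid-42" lands on one task, so that task sees the
# whole session and can merge it.
sid_tasks = [fields_grouping_task("sid-42", NUM_TASKS) for _ in range(3)]

rng = random.Random(0)
shuffle_tasks = [shuffle_grouping_task(NUM_TASKS, rng) for _ in range(3)]
```

For session merging, grouping on the session id matters: with shuffle grouping, events of one session could be scattered over several tasks.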
[Figure: session-merging topology. Discrete Events Collection (capped) -> stream by tail query -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging.]
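A minimal sketch of what the session-merging step might look like, assuming each event carries a `time` field for ordering and a `type` field whose value `"end"` closes the session. Both field names and the class itself are hypothetical, not taken from the talk.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Event = Dict[str, Any]


class SessionMerger:
    """Buffers events per session id and emits the session when it ends."""

    def __init__(self) -> None:
        # In-memory state: sid -> events received so far
        self._pending: Dict[str, List[Event]] = defaultdict(list)

    def on_event(self, sid: str, event: Event) -> Optional[Tuple[str, List[Event]]]:
        """Handle one (sid, event); return (sid, session) once the session ends."""
        self._pending[sid].append(event)
        if event["type"] != "end":
            return None
        # Events may arrive out of order, so sort by time before emitting.
        session = sorted(self._pending.pop(sid), key=lambda e: e["time"])
        return sid, session


merger = SessionMerger()
outputs = [
    merger.on_event("s1", {"time": 2, "type": "play"}),
    merger.on_event("s1", {"time": 1, "type": "start"}),
    merger.on_event("s1", {"time": 3, "type": "end"}),
]
```

The sketch assumes the end event is the last one to arrive for its session; a production merger would also need a timeout or grace period for late events.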
2. Reliability
• Cluster computing frameworks are prone to failures.
• Two main considerations for reliability:
  - Event process guarantee
  - Failure of stateful computation
Event process guarantee
3 types of guarantees
• At most once [0,1]:
Each event is delivered for processing only once, whether processing succeeds or fails, so a failed event is lost.
• At least once [1 … n]:
Each event can be redelivered multiple times to ensure success.
• Exactly once [1]:
Events are never lost and are never redelivered. Perfect delivery.
Storm’s built-in fault tolerance?
Tuple Acknowledgment
[Figure: a topology of spouts and bolts; a spout emits a tuple T, and downstream tuples are anchored with a tuple ID]
[Figure: the same topology; when processing succeeds, ack(tuple) is called]
[Figure: the same topology; when processing fails, fail(tuple) is called]
Still dependent on the data source for processing guarantees.
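The ack/fail cycle can be sketched with a toy spout that keeps every emitted message until it is acknowledged and re-queues it on failure. This is a plain-Python stand-in loosely modeled on the idea of a spout's next-tuple, ack and fail callbacks; the class and its internals are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional, Set, Tuple


class ReplayingSpout:
    """Toy spout: remembers in-flight messages so failed ones can be replayed."""

    def __init__(self, messages: Iterable[str]) -> None:
        self._queue: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self._pending: Dict[int, str] = {}  # tuple ID -> message in flight

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self._queue:
            return None
        tuple_id, message = self._queue.popleft()
        self._pending[tuple_id] = message
        return tuple_id, message

    def ack(self, tuple_id: int) -> None:
        """Fully processed: the message can be forgotten."""
        self._pending.pop(tuple_id, None)

    def fail(self, tuple_id: int) -> None:
        """Processing failed: put the message back so it is delivered again."""
        self._queue.append((tuple_id, self._pending.pop(tuple_id)))

    def pending_count(self) -> int:
        return len(self._pending)


spout = ReplayingSpout(["a", "b", "c"])
delivered: List[str] = []
failed_once: Set[str] = set()
while (item := spout.next_tuple()) is not None:
    tuple_id, message = item
    delivered.append(message)
    if message == "b" and message not in failed_once:
        failed_once.add(message)  # simulate one processing failure for "b"
        spout.fail(tuple_id)
    else:
        spout.ack(tuple_id)
```

Message "b" is delivered twice, which is the at-least-once guarantee. The replay only works because this toy spout still holds the message; if the underlying source cannot hand a message out again, a failed tuple cannot be retried.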
Event process guarantee
3 types of event sources
• Unreliable:
No means to replay a previously-received message.
• Reliable:
Can somehow replay a message if processing fails at any point.
• Durable:
Can replay any message or set of messages given selection criteria.
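A durable source can be pictured as an append-only log that consumers read by position. The sketch below is a toy in-memory model with hypothetical names, not a Kafka client; here the "selection criteria" is simply an offset.

```python
from typing import List


class DurableLog:
    """Append-only log: messages stay available and can be re-read by offset."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def append(self, message: str) -> int:
        """Store a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int) -> List[str]:
        """Replay every message at or after the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._messages[offset:]


log = DurableLog()
offsets = [log.append(m) for m in ["a", "b", "c"]]
replayed = log.read_from(1)  # consumer failed after "a": resume from offset 1
```

An unreliable source, by contrast, would drop each message once it has been read, leaving nothing to replay.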
[Figure: session-merging topology. Discrete Events Collection (capped) -> stream by tail query -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging. The capped collection source is marked as unreliable.]
Kafka Quick Overview
• Publish-subscribe distributed messaging queue.
• Configurable message retention.
• Highly fault tolerant. Tolerates N-1 node failures with a replication factor of N.
• High throughput: Producer - 2M msgs / sec. Consumer - 100M/sec.*
* http://kafka.apache.org/07/performance.html
[Figure: session-merging topology with Kafka as the source. discrete-events-topic (E E E E…), alongside other topics -> consume topic -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging.]
Failure of stateful computation
• Bolts hold “state” information.
  - In-memory Map for merging AdPlay sessions
• Although Kafka is a durable data source, deciding the selection criteria on failure can still be very hard.
• Solution: store all stateful info in an external in-memory storage.
[Figure: session-merging topology with external state. discrete-events-topic (E E E E…) -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1 … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1 … task #M) -> Merged Sessions Collection. The Session Merger Bolt keeps its state in the External Object State Storage (key: sid, value: sorted set of events) using ZADD sid time event and ZRANGE sid 0 -1. The storage also allows online monitoring.]
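To show why a sorted set suits this state, here is a small in-memory stand-in that mimics the two commands named on the slide: ZADD (add a member with a score) and ZRANGE 0 -1 (read all members in score order). It is an illustrative sketch with hypothetical names, not a Redis client, and it covers only this subset of sorted-set behaviour.

```python
from typing import Dict, List


class SortedSetStore:
    """In-memory imitation of per-key sorted sets (member -> score)."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zadd(self, key: str, score: float, member: str) -> None:
        """Like ZADD key score member: members are unique within a key."""
        self._sets.setdefault(key, {})[member] = score

    def zrange_all(self, key: str) -> List[str]:
        """Like ZRANGE key 0 -1: all members, ordered by score then member."""
        members = self._sets.get(key, {})
        return sorted(members, key=lambda m: (members[m], m))


store = SortedSetStore()
# Events for session "sid-1" arrive out of order; "play" is redelivered.
store.zadd("sid-1", 3, "end")
store.zadd("sid-1", 1, "start")
store.zadd("sid-1", 2, "play")
store.zadd("sid-1", 2, "play")  # duplicate from an at-least-once replay
session = store.zrange_all("sid-1")
```

Because members are unique and ordered by score, out-of-order arrival and redelivered events both resolve to the same ordered session, and the state survives the failure of a bolt task.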
3. Security
• Exposing Kafka to the public network can be dangerous.
• Pull input to the internal pipeline.
  - Separate read / write permissions.
  - Ingest only recognizable data.
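"Ingest only recognizable data" could be implemented as a filter at the pipeline entrance. The sketch below assumes JSON events and a made-up set of required fields; the field names and the function are hypothetical.

```python
import json
from typing import Optional

# Hypothetical minimum shape of a recognizable event
REQUIRED_FIELDS = {"sid": str, "time": int, "type": str}


def parse_recognized_event(raw: bytes) -> Optional[dict]:
    """Return the decoded event, or None if the payload is not recognizable."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(event, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        # Exact type check, so that e.g. True is not accepted as an int
        if type(event.get(name)) is not expected_type:
            return None
    return event


accepted = parse_recognized_event(b'{"sid": "s1", "time": 5, "type": "play"}')
rejected_garbage = parse_recognized_event(b"not json at all")
rejected_missing = parse_recognized_event(b'{"sid": "s1"}')
```

Rejected payloads are simply dropped here; a real dispatcher might log or count them instead.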
AWS Kinesis
• A managed (PaaS) counterpart to Apache Kafka.
• Fixed data retention for 24 hours.
• Producers and consumers using AWS SDK / AWS KCL.
• Fully managed by AWS.
• Usage unit: Kinesis Stream
  - Hourly charge per shard.
  - Also charged per 1M PUTs.
[Figure: pipeline with a Kafka Topic Dispatcher (Kinesis consumer) at the entrance, with separate producer user and consumer user. The dispatcher feeds discrete events topics A, B, C, D -> Events Puller Spout -> Session Merger Bolt (External Object State Storage; key: sid, value: sorted set of events) -> MongoDB Insert Bolt.]
4. Schema Adaptability
• We soon realized that our event log schema changed very frequently.
• Problem: log schema was hardcoded into our stream processing logic.
  - required a separate topology for every different schema
• Stupid, I know ;)
• Solution: need a central serialization system to manage schema.
  - replace JSON after event logs enter stream pipeline.
Apache Avro
• A data serialization system.
• Rich data structures.
  - String
  - Numbers (int, long, float)
  - Bytes
  - Boolean
  - null
  - Nested objects
  - …
• When Avro data is read, the schema used to write it is always present.
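One reason a schema system helps with fast-changing logs is that a reader can fill in defaults for fields an older writer did not know about. The sketch below imitates that one rule in plain Python on already-decoded records. It does not use the Avro library, and the schema and field names are hypothetical.

```python
from typing import Any, Dict

# Hypothetical reader schema in .avsc (JSON) form
READER_SCHEMA: Dict[str, Any] = {
    "type": "record",
    "name": "AdPlayEvent",
    "fields": [
        {"name": "sid", "type": "string"},
        {"name": "time", "type": "long"},
        # Field added in a later schema version, with a default
        {"name": "platform", "type": "string", "default": "unknown"},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a decoded record onto the reader schema.

    Missing fields take the reader's default; fields the reader does not
    declare are ignored.
    """
    resolved: Dict[str, Any] = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved


# Record written with an older schema: no "platform", plus a dropped field
old_record = {"sid": "s1", "time": 1, "legacy_flag": True}
event = resolve(old_record, READER_SCHEMA)
```

With this kind of resolution the processing logic only depends on the reader schema, so a schema change no longer requires a separate code path per log version.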
Apache Avro
[Figure: an .avsc schema definition shown next to the actual event JSON]
[Figure: one topology per schema. The Kafka Topic Dispatcher feeds discrete events topics A, B, C, …, each consumed by its own topology (topology A, topology B, topology C, …), all using the External object state storage.]
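[Figure: a single topology with Avro. Events (E) are serialized as E.avro using schema_1.avsc, schema_2.avsc, … and flow from the Kafka Topic Dispatcher through one discrete events topic into a single session merge topology backed by the External Object State Storage.]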
[Figure: handling a schema update. With schema_1_updated.avsc and schema_2.avsc, …, the Kafka Topic Dispatcher, discrete events topic, single session merge topology and External Object State Storage stay in place, while a backup events topic feeds a backup session merge topology.]
Kappa Architecture*
• Everything is a stream.
• Back up your data into a durable buffer (Kafka).
• Resubmit an updated job that consumes the backup.
* http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
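The reprocessing idea can be sketched as running a new version of a job over the same retained log. This is a toy model: the log is a Python list standing in for the durable buffer, and the job functions and field names are hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def run_job(log: List[Event], job: Callable[[Event], Event]) -> List[Event]:
    """Consume the whole retained log from the beginning with the given job."""
    return [job(event) for event in log]


def job_v1(event: Event) -> Event:
    """Original logic: only knows about the session id."""
    return {"sid": event["sid"]}


def job_v2(event: Event) -> Event:
    """Updated logic: also extracts a field added by a newer schema."""
    return {"sid": event["sid"], "platform": event.get("platform", "unknown")}


# Durable backup of raw events (stand-in for a retained topic)
backup_log: List[Event] = [
    {"sid": "s1"},
    {"sid": "s2", "platform": "ios"},
]

view_v1 = run_job(backup_log, job_v1)
# After the logic changes, resubmit the updated job against the same backup.
view_v2 = run_job(backup_log, job_v2)
```

No separate batch layer is involved: the updated output comes from replaying the stream through the new job, after which the old job and its output can be retired.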
So, what did we go through?
• Building a stream processing system for an ordered event stream.
• Reliability: Kafka + Redis
• Security: AWS Kinesis
• Schema Adaptability: Avro + Kappa architecture
Every use case is unique. Look closely at your needs ;)