REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018, OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
TABLE OF CONTENTS
Agenda:
- our RTB platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-DC architecture
- the current iteration: Kafka Workers
- summary
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT
Bid requests:
- 2M/s (peak)
- ~30 SSP networks
- <50-100 ms response time

User events:
- 1.5B tags/day
- 350M impressions/day
- 3.5M clicks/day
- 1.5M conversions/day

Other events:
- bid logs, access logs, domain events, etc.
OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ of data processed every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions

Aerospike (processing only):
- 80TB of data, up to 8K events/s
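A back-of-envelope check of the numbers above, assuming the peak load is spread evenly across partitions and the 50TB is processed uniformly over the day (both simplifications):

```python
# Illustrative arithmetic only; real load is bursty and skewed per topic.
PEAK_MSGS_PER_S = 250_000
PARTITIONS = 5_000
DAILY_BYTES = 50 * 10**12          # 50 TB processed per day

msgs_per_partition = PEAK_MSGS_PER_S / PARTITIONS      # avg msgs/s per partition at peak
avg_throughput_mb_s = DAILY_BYTES / 86_400 / 10**6     # avg MB/s over a day

print(msgs_per_partition)          # 50.0
print(round(avg_throughput_mb_s))  # ~579 MB/s
```

Even at peak, the per-partition rate stays modest, which is what makes the high partition count pay off for parallel consumers.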
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 1ST ITERATION: DRAWBACKS
Issues:
- long, overloading data migrations (30 days back)
- complex servlet logic, no ability to reprocess
- inflexible, varied schemas
- single-DC only
- data inconsistencies
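The core problem was updating impression records in place: once overwritten, past states cannot be replayed. A minimal sketch of the contrast with the later append-only approach (field names are hypothetical):

```python
from collections import defaultdict

# Mutable approach (1st iteration): the stored record is overwritten,
# so reprocessing and auditing earlier states is impossible.
impression = {"id": "imp-1", "clicks": 0}
impression["clicks"] += 1          # the previous state is lost

# Immutable approach (3rd iteration): every change is a new event in a
# stream; the current state is a fold over the log and can be rebuilt
# from scratch at any time.
events = [
    {"type": "impression", "id": "imp-1"},
    {"type": "click", "id": "imp-1"},
    {"type": "click", "id": "imp-1"},
]

def rebuild(events):
    state = defaultdict(lambda: {"clicks": 0})
    for e in events:
        if e["type"] == "click":
            state[e["id"]]["clicks"] += 1
    return dict(state)

print(rebuild(events))   # state recomputed entirely from the event log
```

With the event log as the source of truth, reprocessing becomes re-running the fold, which removes the need for the long backfill migrations listed above.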
THE SECOND ITERATION: DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
- a few better components:
  > merger
  > new stats-counter, new data-flow
  > dispatcher & loader
  > Logstash
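The merger's job is to join the separate user-event streams (impressions, clicks, conversions) into one record per impression. A toy sketch of the idea, keyed by a hypothetical `imp_id` field and emitting when the funnel completes:

```python
# Illustrative merger, not the production component: buffer events per
# impression id and emit the merged record once a conversion arrives.
def merge(events):
    pending = {}
    merged = []
    for e in events:
        rec = pending.setdefault(e["imp_id"], {"imp_id": e["imp_id"], "events": []})
        rec["events"].append(e["type"])
        if e["type"] == "conversion":      # funnel complete: emit and clear
            merged.append(pending.pop(e["imp_id"]))
    return merged

out = merge([
    {"imp_id": "a", "type": "impression"},
    {"imp_id": "a", "type": "click"},
    {"imp_id": "a", "type": "conversion"},
])
```

A real merger additionally needs timeouts for funnels that never complete and persistent state (the deck mentions Aerospike for processing state) so a restart does not drop pending impressions.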
THE 4TH ITERATION: MULTI-DC ARCHITECTURE
THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library, no separate stream-processing cluster
- no external dependencies
- Kafka's parallelism model and group-membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (though at-least-once was good enough)
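The at-least-once guarantee mentioned above comes from the commit order: process a record first, commit its offset after. A minimal simulation of that loop (the crash hook is hypothetical, added only to show the replay behavior):

```python
# Event-at-a-time, at-least-once processing: a crash between processing
# and committing replays the record on restart, so duplicates are
# possible but loss is not.
def run(records, committed_offset, crash_after=None):
    processed = []
    for offset, record in enumerate(records):
        if offset < committed_offset:
            continue                      # already committed; skipped on restart
        processed.append(record)          # 1. process one event at a time
        if crash_after is not None and offset == crash_after:
            return processed, committed_offset   # crash before the commit
        committed_offset = offset + 1     # 2. only then commit the offset
    return processed, committed_offset

records = ["e0", "e1", "e2"]
first, off = run(records, 0, crash_after=1)   # e1 processed but not committed
second, off = run(records, off)               # restart: e1 is replayed
```

Here `e1` is processed twice across the two runs and nothing is lost; exactly-once semantics would additionally require the processing side effects and the offset commit to be atomic.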
THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API
THE CURRENT ITERATION: KAFKA WORKERS
THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- ability to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- failure handling
- multiple consumers (in progress)
- Kafka-to-Kafka, HDFS, BigQuery, Elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
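The pause/resume and backpressure features can be sketched as a bounded per-partition queue: when a partition's queue fills, the consumer stops polling that partition and resumes once worker threads drain it. This is a conceptual simulation, not the kafka-workers API:

```python
from collections import deque

# Toy model of per-partition backpressure: a bounded queue gates
# consumption, mirroring Kafka's consumer pause()/resume() mechanism.
QUEUE_LIMIT = 3

class PartitionWorker:
    def __init__(self):
        self.queue = deque()
        self.paused = False

    def offer(self, record):
        """Called by the polling thread; rejects and pauses when full."""
        if len(self.queue) >= QUEUE_LIMIT:
            self.paused = True            # stop polling this partition
            return False
        self.queue.append(record)
        return True

    def drain_one(self):
        """Called by a worker thread; resumes polling once there is room."""
        record = self.queue.popleft()
        if len(self.queue) < QUEUE_LIMIT:
            self.paused = False           # safe to poll again
        return record

w = PartitionWorker()
accepted = [w.offer(i) for i in range(5)]   # only QUEUE_LIMIT records fit
```

Because the queue is per partition, a slow partition applies backpressure only to itself, and offsets can be committed in order as each queue drains.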
THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE
SUMMARY
What we have achieved:
- platform monitoring
- a much more stable platform
- higher-quality data processing
- HDFS, BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018, OCTOBER 10-11, 2018