REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018, OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
TABLE OF CONTENTS
Agenda:
- our RTB platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-DC architecture
- the current iteration: Kafka Workers
- summary
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT
Bid requests:
- 2M/s (peak)
- ~30 SSP networks
- <50-100 ms response time

User events:
- 1.5B tags/day
- 350M impressions/day
- 3.5M clicks/day
- 1.5M conversions/day

Other events:
- bid logs, access logs, domain events, etc.
OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ of data processed every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions

Aerospike (processing only):
- 80TB of data, up to 8K events/s
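A back-of-envelope check of the numbers above, assuming the peak load is spread evenly across partitions and the 50TB is processed uniformly over the day (both simplifications):

```python
# Illustrative arithmetic only; real load is bursty and skewed per topic.
PEAK_MSGS_PER_S = 250_000
PARTITIONS = 5_000
DAILY_BYTES = 50 * 10**12          # 50 TB processed per day

msgs_per_partition = PEAK_MSGS_PER_S / PARTITIONS      # avg msgs/s per partition at peak
avg_throughput_mb_s = DAILY_BYTES / 86_400 / 10**6     # avg MB/s over a day

print(msgs_per_partition)          # 50.0
print(round(avg_throughput_mb_s))  # ~579 MB/s
```

Even at peak, the per-partition rate stays modest, which is what makes the high partition count pay off for parallel consumers.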
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 1ST ITERATION: DRAWBACKS
Issues:
- long, overloading data migrations (30 days back)
- complex servlet logic, no ability to reprocess
- inflexible, varied schemas
- single-DC only
- data inconsistencies
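The core problem was updating impression records in place: once overwritten, past states cannot be replayed. A minimal sketch of the contrast with the later append-only approach (field names are hypothetical):

```python
from collections import defaultdict

# Mutable approach (1st iteration): the stored record is overwritten,
# so reprocessing and auditing earlier states is impossible.
impression = {"id": "imp-1", "clicks": 0}
impression["clicks"] += 1          # the previous state is lost

# Immutable approach (3rd iteration): every change is a new event in a
# stream; the current state is a fold over the log and can be rebuilt
# from scratch at any time.
events = [
    {"type": "impression", "id": "imp-1"},
    {"type": "click", "id": "imp-1"},
    {"type": "click", "id": "imp-1"},
]

def rebuild(events):
    state = defaultdict(lambda: {"clicks": 0})
    for e in events:
        if e["type"] == "click":
            state[e["id"]]["clicks"] += 1
    return dict(state)

print(rebuild(events))   # state recomputed entirely from the event log
```

With the event log as the source of truth, reprocessing becomes re-running the fold, which removes the need for the long backfill migrations listed above.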
THE SECOND ITERATION: DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
- a few better components:
  > merger
  > new stats-counter, new data-flow
  > dispatcher & loader
  > Logstash
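The merger's job is to join the separate user-event streams (impressions, clicks, conversions) into one record per impression. A toy sketch of the idea, keyed by a hypothetical `imp_id` field and emitting when the funnel completes:

```python
# Illustrative merger, not the production component: buffer events per
# impression id and emit the merged record once a conversion arrives.
def merge(events):
    pending = {}
    merged = []
    for e in events:
        rec = pending.setdefault(e["imp_id"], {"imp_id": e["imp_id"], "events": []})
        rec["events"].append(e["type"])
        if e["type"] == "conversion":      # funnel complete: emit and clear
            merged.append(pending.pop(e["imp_id"]))
    return merged

out = merge([
    {"imp_id": "a", "type": "impression"},
    {"imp_id": "a", "type": "click"},
    {"imp_id": "a", "type": "conversion"},
])
```

A real merger additionally needs timeouts for funnels that never complete and persistent state (the deck mentions Aerospike for processing state) so a restart does not drop pending impressions.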
THE 4TH ITERATION: MULTI-DC ARCHITECTURE
THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library, no separate stream-processing cluster
- no external dependencies
- Kafka's parallelism model and group-membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (though at-least-once was good enough)
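The at-least-once guarantee mentioned above comes from the commit order: process a record first, commit its offset after. A minimal simulation of that loop (the crash hook is hypothetical, added only to show the replay behavior):

```python
# Event-at-a-time, at-least-once processing: a crash between processing
# and committing replays the record on restart, so duplicates are
# possible but loss is not.
def run(records, committed_offset, crash_after=None):
    processed = []
    for offset, record in enumerate(records):
        if offset < committed_offset:
            continue                      # already committed; skipped on restart
        processed.append(record)          # 1. process one event at a time
        if crash_after is not None and offset == crash_after:
            return processed, committed_offset   # crash before the commit
        committed_offset = offset + 1     # 2. only then commit the offset
    return processed, committed_offset

records = ["e0", "e1", "e2"]
first, off = run(records, 0, crash_after=1)   # e1 processed but not committed
second, off = run(records, off)               # restart: e1 is replayed
```

Here `e1` is processed twice across the two runs and nothing is lost; exactly-once semantics would additionally require the processing side effects and the offset commit to be atomic.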
THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API
THE CURRENT ITERATION: KAFKA WORKERS
THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- ability to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- failure handling
- multiple consumers (in progress)
- Kafka-to-Kafka, HDFS, BigQuery, Elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
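The pause/resume and backpressure features can be sketched as a bounded per-partition queue: when a partition's queue fills, the consumer stops polling that partition and resumes once worker threads drain it. This is a conceptual simulation, not the kafka-workers API:

```python
from collections import deque

# Toy model of per-partition backpressure: a bounded queue gates
# consumption, mirroring Kafka's consumer pause()/resume() mechanism.
QUEUE_LIMIT = 3

class PartitionWorker:
    def __init__(self):
        self.queue = deque()
        self.paused = False

    def offer(self, record):
        """Called by the polling thread; rejects and pauses when full."""
        if len(self.queue) >= QUEUE_LIMIT:
            self.paused = True            # stop polling this partition
            return False
        self.queue.append(record)
        return True

    def drain_one(self):
        """Called by a worker thread; resumes polling once there is room."""
        record = self.queue.popleft()
        if len(self.queue) < QUEUE_LIMIT:
            self.paused = False           # safe to poll again
        return record

w = PartitionWorker()
accepted = [w.offer(i) for i in range(5)]   # only QUEUE_LIMIT records fit
```

Because the queue is per partition, a slow partition applies backpressure only to itself, and offsets can be committed in order as each queue drains.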
THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE
SUMMARY
What we have achieved:
- platform monitoring
- a much more stable platform
- higher-quality data processing
- HDFS, BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018, OCTOBER 10-11, 2018