Apache Kafka at trivago 2017-01-25, Munich, Germany Clemens Valiente

Kafka at trivago

Feb 08, 2017

Transcript
Page 1: Kafka at trivago

Apache Kafka at trivago

2017-01-25, Munich, Germany

Clemens Valiente

Page 2: Kafka at trivago

Clemens Valiente

Senior Data Engineer, trivago Düsseldorf

- Originally a mathematician
- Studied at Uni Erlangen
- At trivago for 5 years

Email: [email protected]
de.linkedin.com/in/clemensvaliente

Page 3: Kafka at trivago

Collecting price information for BI

As a hotel price comparison engine, hotel prices are our most valuable information.

They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.

With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.

Page 4: Kafka at trivago

The past: Data pipeline 2010 – 2015

[Pipeline diagram; labelled components: Java Software Engineering, BI Warehouse]

Page 10: Kafka at trivago

The past: Data pipeline 2010 – 2015
Facts & Figures

Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: The first price per key wins

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
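
The "insert ignore" restriction above can be pictured with a minimal sketch. It assumes the key is hotel, website, arrival date and collection day, as the restrictions list suggests; the class and field names (`PriceKey`, `InsertIgnoreStore`, `price_cents`) are hypothetical and not taken from the original system, which is not shown in the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceKey:
    """Hypothetical key: one price per hotel, website and arrival date per day."""
    hotel_id: int
    website_id: int
    arrival_date: date
    collected_on: date  # the day on which the price was observed


class InsertIgnoreStore:
    """First-write-wins store: a later price for an existing key is dropped."""

    def __init__(self):
        self._prices = {}

    def insert_ignore(self, key, price_cents):
        """Store the price only if the key is new. Returns True if it was stored."""
        if key in self._prices:
            return False  # key already present: ignore the new price
        self._prices[key] = price_cents
        return True

    def get(self, key):
        return self._prices.get(key)


store = InsertIgnoreStore()
key = PriceKey(
    hotel_id=1,
    website_id=7,
    arrival_date=date(2015, 3, 1),
    collected_on=date(2015, 1, 10),
)
first = store.insert_ignore(key, 9900)   # first price of the day is kept
second = store.insert_ignore(key, 8900)  # a later, cheaper price is ignored
```

Under this scheme a price change later on the same day is never recorded, which is one reason the restriction limits what can be analyzed afterwards.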

Page 16: Kafka at trivago

Refactoring the pipeline: Requirements

• Scales with an arbitrary amount of data (future proof)
• Reliable and resilient
• Low performance impact on Java backend
• Long term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices
    • Length of stay, room type, breakfast info, room category, domain
  • With more information
    • Net & gross price, city tax, resort fee, affiliate fee, VAT

Page 17: Kafka at trivago

Present data pipeline 2017 – ingestion

[Ingestion diagram; location labels: San Francisco, Düsseldorf, Hong Kong]

Page 20: Kafka at trivago

Present data pipeline 2017 – processing

[Processing diagram; labelled component: Camus]

Page 21: Kafka at trivago

Present data pipeline 2017 – results after two years in production

• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests & less workload for BI

Page 24: Kafka at trivago

Present data pipeline 2017 – facts & figures

Kafka Cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing

Data Size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics

Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously, in 10 minute intervals
- To be replaced by Gobblin/Kafka Connect
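
To put the daily volumes into perspective, the following back-of-the-envelope sketch converts them into average per-second rates. It uses the 10 billion messages/day stated above and the roughly 100 million prices/day of the old pipeline, and it assumes load spread evenly over the day, which real traffic will not match; the function name `per_second` is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

OLD_PRICES_PER_DAY = 100_000_000       # old pipeline, early 2015
NEW_MESSAGES_PER_DAY = 10_000_000_000  # price log in 2017


def per_second(per_day):
    """Average rate per second, assuming an even spread over the day."""
    return per_day / SECONDS_PER_DAY


old_rate = per_second(OLD_PRICES_PER_DAY)
new_rate = per_second(NEW_MESSAGES_PER_DAY)
growth = NEW_MESSAGES_PER_DAY / OLD_PRICES_PER_DAY

print(f"old pipeline: ~{old_rate:,.0f} prices/s")
print(f"new pipeline: ~{new_rate:,.0f} messages/s")
print(f"growth factor: {growth:.0f}x")
```

On average this is roughly 1,157 prices per second for the old pipeline against about 115,741 messages per second for the new one, a factor of 100.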

Page 27: Kafka at trivago

Present data pipeline 2017 – use cases & status quo

Uses for price information
- Monitoring price parity in hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every Euro of revenue at some point was a message in Kafka

Status quo
- Our entire BI business logic runs on and through the Kafka – Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data

Page 28: Kafka at trivago

Ongoing Projects: Breaking up the Monolith

[Diagram; location labels: Düsseldorf, Leipzig, Palma]

Page 30: Kafka at trivago

Key challenges and learnings

● Settle on a common message format (Avro/Protobuf, not CSV or JSON)

● A common message envelope is helpful (e.g. header with timestamp and sender)

● For stream processing, repeat your key in your message value

● Monitor your consumer offsets with an audit log, especially across data centres

● Turn off auto creation of topics, but have a process in place for topic creation
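
Two of the learnings above, the common envelope and repeating the key in the value, can be illustrated with a small sketch. Plain Python objects stand in for Avro/Protobuf records here purely to keep the example self-contained; all names (`Envelope`, `PriceMessage`, `cheapest_per_hotel`) and fields are hypothetical and not trivago's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Common header shared by all messages (assumed fields)."""
    timestamp_ms: int
    sender: str


@dataclass(frozen=True)
class PriceMessage:
    header: Envelope
    hotel_id: int    # key fields are repeated inside the value
    website_id: int
    price_cents: int


def message_key(msg):
    """The record key is derived from fields that also live in the value."""
    return (msg.hotel_id, msg.website_id)


def cheapest_per_hotel(records):
    """Re-key (key, value) records by hotel and keep the cheapest message.

    Only the value is used, so the result still says which website
    offered the price even though the original key is discarded.
    """
    best = {}
    for _key, msg in records:
        current = best.get(msg.hotel_id)
        if current is None or msg.price_cents < current.price_cents:
            best[msg.hotel_id] = msg
    return best


header = Envelope(timestamp_ms=1485302400000, sender="price-logger-1")
messages = [
    PriceMessage(header, hotel_id=1, website_id=7, price_cents=9900),
    PriceMessage(header, hotel_id=1, website_id=8, price_cents=8900),
    PriceMessage(header, hotel_id=2, website_id=7, price_cents=12000),
]
records = [(message_key(m), m) for m in messages]
cheapest = cheapest_per_hotel(records)
```

Because the key fields travel inside the value, a processing step that changes or drops the record key loses no information, and the header lets any consumer see when a message was produced and by whom.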

Page 31: Kafka at trivago

Thank you!

Questions and comments?

Page 32: Kafka at trivago

● Thanks to Jan Filipiak for his brainpower behind most projects

● Additional resources:

  ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)

  ● https://github.com/trivago/triava (TriavaCache, a JSR107 compliant cache)