Data Pipeline with Kafka by Peerapat A. is licensed under a Creative Commons Attribution 4.0 International License.

Data Pipeline with Kafka

Jan 07, 2017

Transcript
Page 1: Data Pipeline with Kafka

Data Pipeline with Kafka

Peerapat Asoktummarungsri

AGODA

Page 2: Data Pipeline with Kafka

Senior Software Engineer, Agoda.com

Contributor, Thai Java User Group (THJUG.com)

Contributor, Agile66

Page 3: Data Pipeline with Kafka

AGENDA

Big Data & Data Pipeline

Kafka Introduction

Quick Start

Monitoring

Data Pipeline for Search API

Hadoop integration with Camus

Page 4: Data Pipeline with Kafka

[Diagram labels: Big Data, Hadoop (HDFS + MapReduce), Information]

Page 5: Data Pipeline with Kafka

Pipeline

[Diagram labels: Website, log, hadoop]

Page 6: Data Pipeline with Kafka

Growth

[Diagram labels: Website, Mobile, log, hadoop]

Page 7: Data Pipeline with Kafka

Complex

[Diagram labels: Website, Mobile, log, message, hadoop, realtime monitoring]

Page 8: Data Pipeline with Kafka

Features become the problem

[Diagram labels: Website, Mobile, API, hadoop, realtime monitoring, Data Warehouse; three of the components are marked "New"]

Page 9: Data Pipeline with Kafka

Data Pipeline

[Diagram labels: Website, Mobile, API, Produce, Data Pipeline, Consume, hadoop, realtime monitoring, Warehouse]

Page 10: Data Pipeline with Kafka

Compare: Queue vs. Topic

[Diagram: with a Queue, three consumers receive messages 1, 2, and 3, one message each; with a Topic, three consumers each receive a copy of message 1]
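The queue/topic comparison on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Kafka code: `deliver_queue` and `deliver_topic` are hypothetical helper names, the consumer names are made up, and round-robin is only an assumed dispatch policy for the queue.

```python
from collections import defaultdict


def deliver_queue(messages, consumers):
    """Queue semantics: each message goes to exactly one consumer (round-robin assumed)."""
    received = defaultdict(list)
    for i, msg in enumerate(messages):
        received[consumers[i % len(consumers)]].append(msg)
    return dict(received)


def deliver_topic(messages, consumers):
    """Topic semantics: every subscribed consumer gets its own copy of each message."""
    return {consumer: list(messages) for consumer in consumers}


consumers = ["consumer-a", "consumer-b", "consumer-c"]  # hypothetical names
queue_result = deliver_queue([1, 2, 3], consumers)
topic_result = deliver_topic([1], consumers)
print(queue_result)
print(topic_result)
```

With a queue the three messages are split across the consumers; with a topic the single message is copied to all of them.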

Page 11: Data Pipeline with Kafka

General Topic Implementation

[Diagram: a Topic publishes message 2 to Consumer 1, Consumer 2, and Consumer 3; only two copies of the message are shown]

This consumer will lose a message.
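One way to read the lost-message case is a topic that pushes each message only to the consumers connected at that moment and stores nothing for later. The sketch below assumes exactly that behaviour; `PushTopic` and its methods are hypothetical names for illustration, and real brokers may offer durable subscriptions that avoid the loss.

```python
class PushTopic:
    """A push-style topic that keeps no history: only connected consumers get a message."""

    def __init__(self):
        self.inboxes = {}       # consumer name -> list of delivered messages
        self.connected = set()  # consumers currently online

    def subscribe(self, name):
        self.inboxes[name] = []
        self.connected.add(name)

    def disconnect(self, name):
        self.connected.discard(name)

    def reconnect(self, name):
        self.connected.add(name)

    def publish(self, msg):
        # Delivered once, to whoever is online right now; nothing is stored for later.
        for name in self.connected:
            self.inboxes[name].append(msg)


topic = PushTopic()
for name in ("consumer-1", "consumer-2", "consumer-3"):
    topic.subscribe(name)

topic.publish(1)
topic.disconnect("consumer-3")
topic.publish(2)               # consumer-3 is offline and never sees this message
topic.reconnect("consumer-3")
topic.publish(3)
print(topic.inboxes)
```

After reconnecting, `consumer-3` has no way to ask for message 2 again, which is the gap a persisted log is meant to close.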

Page 12: Data Pipeline with Kafka

Distributed by Design

Fast

Scalable - It can be elastically and transparently expanded without downtime.

Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.

Page 13: Data Pipeline with Kafka

[Diagram: a Topic delivers messages (msg) numbered 1 to 7 to Consumer 1, Consumer 2, and Consumer 3 under gid = hadoop]

gid = Group ID
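Consumers that share a group ID split the topic's messages between them, so each message is handled by one member of the group. A rough sketch of that idea follows, assuming a topic with three partitions and round-robin placement; `assign_partitions` and `consume` are hypothetical helpers, and Kafka's real partition assignment is more involved than this.

```python
def assign_partitions(num_partitions, group_members):
    """Give each partition to exactly one member of the group (simple round-robin)."""
    return {p: group_members[p % len(group_members)] for p in range(num_partitions)}


def consume(messages, num_partitions, group_members):
    """Return what each group member reads when messages are spread over partitions."""
    owner = assign_partitions(num_partitions, group_members)
    received = {member: [] for member in group_members}
    for i, msg in enumerate(messages):
        partition = i % num_partitions  # assumed: producer spreads messages round-robin
        received[owner[partition]].append(msg)
    return received


hadoop_group = ["consumer-1", "consumer-2", "consumer-3"]  # all share gid = hadoop
result = consume(list(range(1, 8)), num_partitions=3, group_members=hadoop_group)
print(result)
```

Every message from 1 to 7 is read exactly once across the group, which is how a group scales out processing of one topic.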

Page 14: Data Pipeline with Kafka

[Diagram: a Topic with messages (msg) 1, 2, 3 feeds three consumers: hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse); the consumers are shown receiving messages 1, 2, 3]

gid = Group ID

Page 15: Data Pipeline with Kafka

[Diagram: the Topic now delivers message (msg) 9 to hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse); a New Consumer (gid = newconsumer) is shown with messages 1, 2, 3]

gid = Group ID
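The behaviour on the last two slides, where every group sees all messages and a new group can start from the oldest ones, can be modelled as an append-only log with one read position per group ID. The sketch below is an in-memory illustration only: `TopicLog` is a hypothetical class, and it assumes a new group starts at the earliest retained message, which in a real deployment depends on the consumer's offset reset setting and on retention.

```python
class TopicLog:
    """An append-only log; each consumer group tracks its own read position (offset)."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group id -> index of the next message to read

    def append(self, msg):
        self.messages.append(msg)

    def poll(self, gid):
        # A group never seen before starts at offset 0 (assumed "earliest" behaviour).
        start = self.offsets.get(gid, 0)
        batch = self.messages[start:]
        self.offsets[gid] = len(self.messages)
        return batch


log = TopicLog()
for msg in (1, 2, 3):
    log.append(msg)

groups = ("hadoop", "rtmon", "warehouse")
first_reads = {gid: log.poll(gid) for gid in groups}

for msg in range(4, 10):
    log.append(msg)

later_reads = {gid: log.poll(gid) for gid in groups}
new_group_read = log.poll("newconsumer")  # joins late, still reads from the start
print(first_reads, later_reads, new_group_read)
```

Because reading only moves a group's own offset, adding `newconsumer` does not affect the three existing groups.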

Page 16: Data Pipeline with Kafka

Page 17: Data Pipeline with Kafka

Vagrant

Install Vagrant

Install VirtualBox

Clone https://github.com/stealthly/scala-kafka

vagrant up

Page 18: Data Pipeline with Kafka

BREW

brew update

brew install zookeeper kafka

Page 19: Data Pipeline with Kafka

Some Kafka Config

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

# The port the socket server listens on
port=9092

# Zookeeper connection string (see zookeeper docs for details).
zookeeper.connect=localhost:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
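To make the units in these settings concrete, the sketch below parses the same key=value lines and converts the retention and timeout values. `parse_properties` is a hypothetical helper written for this example, not part of Kafka, and it handles only the simple syntax shown on the slide.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


sample = """
# The id of the broker.
broker.id=0
port=9092
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
log.retention.hours=168
"""

config = parse_properties(sample)
retention_days = int(config["log.retention.hours"]) / 24
zk_timeout_seconds = int(config["zookeeper.connection.timeout.ms"]) / 1000
print(retention_days, zk_timeout_seconds)
```

A retention of 168 hours works out to 7 days, and the ZooKeeper connection timeout of 6000 ms is 6 seconds.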

Page 20: Data Pipeline with Kafka

Kafka @ LinkedIn (2013)

10 billion message writes per day

55 billion messages delivered to real-time consumers

367 topics that cover both user activity topics and operational data, the largest of which adds an average of 92 GB per day of batch-compressed messages

Messages are kept for 7 days, and these average about 9.5 TB of compressed messages across all topics.
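A little arithmetic puts these figures in per-second and per-day terms. The inputs come from the slide; treating the 55 billion deliveries as a per-day figure, like the writes, is an assumption, and the results are averages that say nothing about peak load.

```python
SECONDS_PER_DAY = 24 * 60 * 60

writes_per_day = 10_000_000_000      # from the slide
deliveries_per_day = 55_000_000_000  # from the slide; assumed to also be per day
retained_tb = 9.5                    # from the slide, compressed, across all topics
retention_days = 7                   # from the slide

avg_writes_per_second = writes_per_day / SECONDS_PER_DAY
fan_out = deliveries_per_day / writes_per_day   # deliveries per written message
avg_tb_per_day = retained_tb / retention_days

print(round(avg_writes_per_second), fan_out, round(avg_tb_per_day, 2))
```

Under those assumptions the cluster averages roughly 116 thousand writes per second, each message is delivered about 5.5 times, and about 1.36 TB of compressed data arrives per day.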

Page 21: Data Pipeline with Kafka

KafkaOffsetMonitor

Download KafkaOffsetMonitor from GitHub: https://github.com/quantifind/KafkaOffsetMonitor

It ships as a single JAR file, KafkaOffsetMonitor-assembly-0.2.1.jar

java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --zk localhost \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days
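The main number an offset monitor reports is consumer lag: how far a group's committed offset trails the newest offset in each partition. The sketch below shows only that calculation; `consumer_lag` is a hypothetical helper, the offsets are made-up values, and it does not reflect how KafkaOffsetMonitor itself is implemented.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the group's committed offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }


# Hypothetical numbers for a three-partition topic.
log_end = {0: 1200, 1: 950, 2: 1010}
committed = {0: 1200, 1: 900, 2: 1000}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag that keeps growing suggests the group is consuming more slowly than producers are writing.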

Page 22: Data Pipeline with Kafka

Page 23: Data Pipeline with Kafka

Page 24: Data Pipeline with Kafka

Page 25: Data Pipeline with Kafka

Page 26: Data Pipeline with Kafka

Page 27: Data Pipeline with Kafka

Page 28: Data Pipeline with Kafka

CHANGE

[Diagram labels: Hotels, Hotel Manager, API, HTTP, Produce Change, Kafka, Price & Inventory Consumer, Cassandra, Calculate Price, Search API]

Page 29: Data Pipeline with Kafka

CHANGE

[Diagram labels: Hotels, Hotel Manager, API, Kafka, Price & Inventory Consumer, A Consumer, B Consumer]

Page 30: Data Pipeline with Kafka

Camus

Page 31: Data Pipeline with Kafka

Slides available here: http://www.slideshare.net/nuboat

Source code available here: https://github.com/nuboat/akkakafkaexam

Page 32: Data Pipeline with Kafka

REFERENCES

http://www.slideshare.net/charmalloc/developingwithapachekafka-29910685

http://www.infoq.com/articles/apache-kafka

http://kafka.apache.org/

https://github.com/stealthly/scala-kafka

https://github.com/quantifind/KafkaOffsetMonitor

Page 33: Data Pipeline with Kafka

Q & A