Backday Xebia : Retour d’expérience sur l’ingestion d’événements dans HDFS

#backdaybyxebia

Sylvain Lequeux @slequeux

Event Ingestion in HDFS

#backdaybyxebia

Back To Basics

#backdaybyxebia

Basics : Event ?

Asynchronysm …

… to message systems …

… to event systems

#backdaybyxebia

Basics : KafkaA messaging system

#backdaybyxebia

Basics : KafkaA distributed messaging system

#backdaybyxebia


… multi-queues (called “topics”) splitted in partitions

#backdaybyxebia


… multi-queues (called “topics”) splitted in partitions

… multi-clients

#backdaybyxebia

#backdaybyxebia

#backdaybyxebia

Basics : Hadoop Distributed FileSystem

Distributed & scalable

Highly fault-tolerant

Standard support for BigData jobs to run

“Moving computation is cheaper than moving data”

#backdaybyxebia

#backdaybyxebia

VS

#backdaybyxebia

Flumehttp://flume.apache.org/

#backdaybyxebia

FlumeConcepts

#backdaybyxebia

Flume

➔ Top level Apache project

➔ “Item” streaming based on data flow

#backdaybyxebia

Flume

1. An “item” exists somewhereInitially, “items” were log files

#backdaybyxebia

Flume


2. A source is a way to transform this data into Flume events

#backdaybyxebia

Flume


2. A source is a way to transform this data into Flume events

3. A channel is a way to transport data (memory, file)

#backdaybyxebia

Flume


2. A source is a way to pull and transform this data into Flume events

3. A channel is a way to transport data (memory, file)

4. A sink is a way to put a Flume event somewhere

#backdaybyxebia

Flume + Kafka = Flafka

#backdaybyxebia

FlumeHow it works

#backdaybyxebia

Simple flume configuration file# Name the components on this agenta1.sources = r1a1.sinks = k1a1.channels = c1

# Describe/configure the sourcea1.sources.r1.type = netcata1.sources.r1.bind = localhosta1.sources.r1.port = 44444

# Describe the sinka1.sinks.k1.type = logger

# Use a channel which buffers events in memorya1.channels.c1.type = memorya1.channels.c1.capacity = 1000a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channela1.sources.r1.channels = c1a1.sinks.k1.channel = c1

1 - Define an agent

2 - Explicit the sources

3 - Explicit the sinks

4 - Explicit the channels

5 - Connect the pieces

#backdaybyxebia

Flafka source configuration

# Mandatory configa1.sources.r1.type = org.apache.flume.source.kafka.KafkaSourcea1.sources.r1.zookeeperConnect = localhost:2181a1.sources.r1.topic = MyTopic

# Optional configa1.sources.r1.batchSize = 1000a1.sources.r1.batchDurationMillis = 1000a1.sources.r1.consumer.timeout.ms = 10a1.sources.r1.auto.commit.enabled = falsea1.sources.r1.groupId = flume

#backdaybyxebia

HDFS Sink configuration

# Mandatory configa1.sinks.k1.type = hdfsa1.sinks.k1.hdfs.path = hdfs://localhost:54310/data/flume/ \

%{topic}/%y-%m-%d

# A log of optional configs inluding :# Compression# File types : SequenceFile, Avro, etc.# Possibility to use a custom Writable# Kerberos configuration

#backdaybyxebia

Flume : data transformation

#backdaybyxebia

Flume Interceptors

# Mandatory configa1.sources.r1.interceptors = i1a1.sources.r1.interceptors.i1.type = .....

➔ Transformation executed ◆ After event is generated◆ Before sending it to channel

➔ Some predefined interceptors◆ Timestamp◆ UUID◆ Filtering◆ Morphline◆ ...

➔ Could write your own (pure Java)

#backdaybyxebia

Flume : how to run it ?Command line : Included in distribs :

flume-ng agent \

-n a1 \

-c /usr/lib/flume-ng/conf/ \

-f /usr/lib/flume-ng/conf/flume-kafka.conf &

#backdaybyxebia

Camuslinkedin/camus

#backdaybyxebia

CamusConcepts

#backdaybyxebia

Camus

➔ OpenSource project developped by LinkedIn

➔ Based entirely on MapReduce

#backdaybyxebia

Camus

A batch consists in three steps :

- P1 : Gets metadata : topic & partitions, latest offsets

- P2 : Pulls new events

- P3 : Updates local metadatas

#backdaybyxebia

CamusHow it works

#backdaybyxebia

Time to write some codeJust explain how to transform

INTO

#backdaybyxebia

Time to write some code

#backdaybyxebia

Time to write some code

#backdaybyxebia

#backdaybyxebia

Round 1 : getting started

Flume Camus

Just a simple configuration file make it works

Need a complete dev environment (included maven) to use it

Morphline interceptor’syntax is quite complex

Dev should understand MapReduce concepts

#backdaybyxebia

Round 2 : running time

Flume Camus

Flume events are ingested with no delay MapReduce setup adds uncompressable time

31 sec uncompressable+

~ 1 sec / 500 messages / node

=> 111 sec for 1 message=> 117 sec for 50 messages=> 116 sec for 1.000 mesages=> 127 sec for 10.000 messages

#backdaybyxebia

Round 3 : maintainability

Flume Camus

When used by CM, server is easy to maintain, but config is not

Full Maven project. Just use a version control system (Git, SVN, aso.)

#backdaybyxebia

Round 4 : customization

Flume Camus

Interceptors are fully customizable Morphing data could be done easily

Event headers make HDFS path highly modulable

#backdaybyxebia

Round 5 : deployment

Flume Camus

When used by CM, just include your conf, that’s it

MapReduce jobs may be included in any MR orchestrator (Oozie for instance)

Without a manager, everything needs to be done manually

#backdaybyxebia

Round 6 : state of the project

Flume Camus

Released 1.0.0 in 2012 Currently in v 0.1.0-SNAPSHOT

Included by default with Hadoop distributions

Highly used by LinkedIn in production

Almost no documentation

#backdaybyxebia

SummaryFlume Camus

Getting Started

Running time

Maintainability

Customization

Deployment

State of the project

#backdaybyxebia

Global Feedback

#backdaybyxebia

Debugging

Debugging on Flume is quite complex

Some really critical bugs like [FLUME-2578]

#backdaybyxebia

Documentation

Flume has really good quality doc

Camus only has a readme file and not up to date !

#backdaybyxebia

Camus & M/R

Camus suffers the use of Map/Reduce.

Maybe using some other concept like Spark may result in better perfs.

#backdaybyxebia

Flume quantity of files

Flume needs a very precise configuration not to generate a bunch of file.

It is easy to get it generate a lot of little files, which is problematic in term of BigData.

#backdaybyxebia

Thank youQuestions ?

Backday Xebia : Retour d’expérience sur l’ingestion d’événements dans HDFS

Technology