March 2012 Apache Flume (NG) Alexander Lorenz | Customer Operations Engineer

Apache Flume (NG)

Dec 12, 2014

Apache Flume NG presentation, held in Stuttgart. Includes coding and configuration examples.
Transcript
Page 1: Apache Flume (NG)

March 2012

Apache Flume (NG)
Alexander Lorenz | Customer Operations Engineer

Page 2: Apache Flume (NG)

©2012 Cloudera, Inc. All Rights Reserved.

Overview

• Stream data (events, not files) from clients to sinks
• Clients: files, syslog, avro, …
• Sinks: HDFS files, HBase, …
• Configurable reliability levels
  – Best effort: “Fast and loose”
  – Guaranteed delivery: “Deliver no matter what”
• Configurable routing / topology

Page 3: Apache Flume (NG)

Architecture

Component   Function
Agent       The JVM running Flume. One per machine. Runs many sources and sinks.
Client      Produces data in the form of events. Runs in a separate thread.
Sink        Receives events from a channel. Runs in a separate thread.
Channel     Connects sources to sinks (like a queue). Implements the reliability semantics.
Event       A single datum: a log record, an Avro object, etc. Normally around 4 KB.

Page 4: Apache Flume (NG)

Agent

• Runs many clients and sinks
• Java properties-based configuration
• Low overhead (-Xmx20m)
  – But adding RAM increases performance

Page 5: Apache Flume (NG)

Sources

• Plugin interface
• Managed by a SourceRunner that controls threading and execution model (e.g. polling vs. event-based)
• Included: exec, avro, syslog, …

Page 6: Apache Flume (NG)

Sources

public class MySource implements PollableSource {
  public Status process() {
    // Do something to create an Event.
    // EventBuilder.withBody(…) already returns an Event, so no build() call is needed.
    Event e = EventBuilder.withBody(…);
    // A channel instance is injected by Flume.
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      channel.put(e);
      tx.commit();
    } catch (ChannelException ex) {
      // The put failed: undo the transaction and ask the runner to back off.
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}
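The READY and BACKOFF statuses returned by process() are what a polling runner reacts to. The following is a minimal JDK-only sketch of such a loop, not Flume's SourceRunner: PollingRunnerSketch, its run method, and the doubling backoff policy are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical polling loop; names and backoff policy are illustrative assumptions.
class PollingRunnerSketch {
    enum Status { READY, BACKOFF }

    // Calls process up to maxCalls times. After each BACKOFF it sleeps, doubling the
    // delay up to maxMs; a READY result resets the delay to baseMs.
    // Returns the sleep durations that were applied, in order.
    static List<Long> run(Supplier<Status> process, int maxCalls, long baseMs, long maxMs) {
        List<Long> sleeps = new ArrayList<>();
        long delay = baseMs;
        for (int i = 0; i < maxCalls; i++) {
            if (process.get() == Status.READY) {
                delay = baseMs;
                continue;
            }
            sleeps.add(delay);
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ex) {
                // Preserve the interrupt flag and stop polling.
                Thread.currentThread().interrupt();
                break;
            }
            delay = Math.min(delay * 2, maxMs);
        }
        return sleeps;
    }
}
```

An event-based source would not need such a loop, since it pushes events into the channel whenever its input produces them.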

Page 7: Apache Flume (NG)

Channel

• Plugin interface
• Transactional
• Provide queuing between source / sink
• Provide reliability semantics
  – MemoryChannel: Basically a Java BlockingQueue.
  – JDBC
  – WAL
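To illustrate what "transactional" adds on top of a plain queue, here is a minimal JDK-only sketch. SketchChannel and its Tx class are hypothetical names, not Flume's MemoryChannel. The sketch is single-threaded and assumes that puts become visible only on commit and that a rollback returns taken events to the head of the queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch: a bounded queue plus transactions that stage puts
// and remember takes until commit() or rollback().
class SketchChannel {
    // LinkedBlockingDeque is a BlockingQueue that also allows re-inserting at the head.
    private final LinkedBlockingDeque<String> queue;

    SketchChannel(int capacity) {
        this.queue = new LinkedBlockingDeque<>(capacity);
    }

    Tx newTransaction() {
        return new Tx();
    }

    int size() {
        return queue.size();
    }

    class Tx {
        private final List<String> staged = new ArrayList<>();  // puts not yet visible
        private final Deque<String> taken = new ArrayDeque<>(); // takes not yet final

        void put(String event) {
            staged.add(event);
        }

        // Returns null when the channel is empty.
        String take() {
            String event = queue.pollFirst();
            if (event != null) {
                taken.addLast(event);
            }
            return event;
        }

        // Publishes staged puts; taken events are now removed for good.
        void commit() {
            if (queue.remainingCapacity() < staged.size()) {
                throw new IllegalStateException("channel full");
            }
            queue.addAll(staged);
            staged.clear();
            taken.clear();
        }

        // Drops staged puts and returns taken events to the head in their original order.
        void rollback() {
            staged.clear();
            while (!taken.isEmpty()) {
                queue.offerFirst(taken.removeLast());
            }
        }
    }
}
```

A durable channel (the JDBC or WAL variants listed above) would persist the staged state instead of keeping it in memory; the commit and rollback contract stays the same in this model.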

Page 8: Apache Flume (NG)

Sinks

• Plugin interface
• Managed by a SinkRunner that controls threading and execution model
• Included: HDFS files (various formats)

Page 9: Apache Flume (NG)

Sinks Code

public class MySink implements PollableSink {
  public Status process() {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event e = channel.take();
      if (e != null) {
        // …
        tx.commit();
      } else {
        // Nothing to take: end the transaction before backing off,
        // so it is not still open when close() runs.
        tx.commit();
        return Status.BACKOFF;
      }
    } catch (ChannelException ex) {
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}

Page 10: Apache Flume (NG)

Tiered collection

• Send events from agents to another tier of agents to aggregate
• Use an Avro sink (really just a client) to send events to an Avro source (really just a server) on another machine
• Failover supported
• Load balancing (soon)
• Transactions guarantee handoff

Page 11: Apache Flume (NG)

Tiered collection – The handoff

• Agent 1: Tx begin
• Agent 1: Channel take event
• Agent 1: Sink send
• Agent 2: Tx begin
• Agent 2: Channel put
• Agent 2: Tx commit, respond OK
• Agent 1: Tx commit (or rollback)
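The handoff steps can be modelled in a few lines of plain Java. This is a simplified sketch under stated assumptions: each agent's channel is a Deque, the network call is a direct method call, and HandoffSketch, receive, handOff, and the failBeforeCommit flag are hypothetical names used only to simulate a failure on agent 2.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the two-agent handoff; not Flume's Avro sink or source.
class HandoffSketch {
    // Stands in for the Avro source on agent 2. Returns true ("OK") only after
    // the event has been committed to agent 2's channel.
    static boolean receive(Deque<String> channel2, String event, boolean failBeforeCommit) {
        // Agent 2: Tx begin, channel put (staged).
        if (failBeforeCommit) {
            return false;            // Agent 2: Tx rollback, nothing is stored
        }
        channel2.addLast(event);     // Agent 2: Tx commit
        return true;                 // Agent 2: respond OK
    }

    // Stands in for the Avro sink on agent 1. Returns true if the event was handed off.
    static boolean handOff(Deque<String> channel1, Deque<String> channel2,
                           boolean failBeforeCommit) {
        String event = channel1.pollFirst();   // Agent 1: Tx begin, channel take
        if (event == null) {
            return false;                      // nothing to send
        }
        boolean ok = receive(channel2, event, failBeforeCommit); // Agent 1: sink send
        if (!ok) {
            channel1.addFirst(event);          // Agent 1: Tx rollback, event stays queued
        }
        return ok;                             // Agent 1: Tx commit when ok
    }

    public static void main(String[] args) {
        Deque<String> agent1 = new ArrayDeque<>();
        Deque<String> agent2 = new ArrayDeque<>();
        agent1.add("e1");
        System.out.println(handOff(agent1, agent2, true));   // simulated failure on agent 2
        System.out.println(handOff(agent1, agent2, false));  // retry succeeds
    }
}
```

In this model the event is removed from agent 1 only after agent 2 has committed it, so a failure at any earlier step leaves the event on agent 1 for a retry.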

Page 12: Apache Flume (NG)

Configuration

• Done in a single file
• Identifier for flows
• Keys take the form <identifier>.type.subtype.parameter.config, where <identifier> is the name of the agent
• A flow has a client, type, channel, sink
• Can have multiple channels in one flow

Page 13: Apache Flume (NG)

Simple Configuration Example

syslog-agent.sources = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks = Console

syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140
syslog-agent.sources.Syslog.channels = MemoryChannel-1

syslog-agent.sinks.Console.channel = MemoryChannel-1
syslog-agent.sinks.Console.type = logger

syslog-agent.channels.MemoryChannel-1.type = memory
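To make the key structure concrete, here is a hedged Java sketch that reads the simple configuration above with java.util.Properties and looks up components by agent name. AgentConfigSketch and its load, names, and setting helpers are hypothetical; this is not how Flume itself parses its configuration, and it assumes component lists are separated by whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for exploring an agent configuration; illustrative only.
class AgentConfigSketch {
    // The simple configuration example from the slide, one key per line.
    static final String CONFIG = String.join("\n",
        "syslog-agent.sources = Syslog",
        "syslog-agent.channels = MemoryChannel-1",
        "syslog-agent.sinks = Console",
        "syslog-agent.sources.Syslog.type = syslogTcp",
        "syslog-agent.sources.Syslog.port = 5140",
        "syslog-agent.sources.Syslog.channels = MemoryChannel-1",
        "syslog-agent.sinks.Console.channel = MemoryChannel-1",
        "syslog-agent.sinks.Console.type = logger",
        "syslog-agent.channels.MemoryChannel-1.type = memory");

    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return props;
    }

    // "<agent>.<kind>" holds the list of component names, e.g. "syslog-agent.sinks".
    static List<String> names(Properties props, String agent, String kind) {
        String value = props.getProperty(agent + "." + kind, "").trim();
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split("\\s+"));
    }

    // "<agent>.<kind>.<name>.<key>" holds one setting of one component.
    static String setting(Properties props, String agent, String kind, String name, String key) {
        return props.getProperty(agent + "." + kind + "." + name + "." + key);
    }

    public static void main(String[] args) {
        Properties props = load(CONFIG);
        for (String sink : names(props, "syslog-agent", "sinks")) {
            System.out.println(sink + " reads from "
                + setting(props, "syslog-agent", "sinks", sink, "channel"));
        }
    }
}
```

Note that the source uses the plural key channels while the sink uses the singular channel, which matches the fan-out idea: a source may feed several channels, while a sink reads from one.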

Page 14: Apache Flume (NG)

HDFS Configuration Example

syslog-agent.sources = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks = HDFS-LAB

syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140
syslog-agent.sources.Syslog.channels = MemoryChannel-1

syslog-agent.sinks.HDFS-LAB.channel = MemoryChannel-1
syslog-agent.sinks.HDFS-LAB.type = hdfs
syslog-agent.sinks.HDFS-LAB.hdfs.path = hdfs://NN.URI:PORT/flumetest/'%{host}'
syslog-agent.sinks.HDFS-LAB.hdfs.file.Prefix = syslogfiles
syslog-agent.sinks.HDFS-LAB.hdfs.file.rollInterval = 60
syslog-agent.sinks.HDFS-LAB.hdfs.file.Type = SequenceFile

syslog-agent.channels.MemoryChannel-1.type = memory
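The %{host} token in hdfs.path is an escape sequence: the HDFS sink replaces it with the value of the event's host header, so events from different hosts land in different directories. The sketch below shows that substitution with a regular expression. PathEscapeSketch and expand are hypothetical names, this is not Flume's own implementation, and replacing a missing header with an empty string is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that expands %{header} tokens in a path template.
class PathEscapeSketch {
    private static final Pattern HEADER = Pattern.compile("%\\{([^}]+)\\}");

    // Replaces every %{name} with headers.get(name); missing headers become "".
    static String expand(String path, Map<String, String> headers) {
        Matcher m = HEADER.matcher(path);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = headers.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in header values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "web01"); // example header value, made up for the demo
        System.out.println(expand("/flumetest/%{host}", headers));
    }
}
```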

Page 15: Apache Flume (NG)

Features

• Fan out: One source, many channels
• Fan in: Many sources, one channel
• Processors (a.k.a. decorators)
• Auto-batching of events in RPCs…
• Multiplexing channels for data mining
• Avro implementation in both directions (source and sink)
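Fan out and multiplexing differ only in how a source picks its target channels. A minimal sketch, assuming channels are modelled as plain lists and that multiplexing routes on the value of one event header; SelectorSketch, replicate, and multiplex are hypothetical names, not Flume classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical channel selection logic; channels are modelled as lists of event bodies.
class SelectorSketch {
    // Fan out: every channel receives a copy of the event.
    static void replicate(String body, List<List<String>> channels) {
        for (List<String> channel : channels) {
            channel.add(body);
        }
    }

    // Multiplexing: the value of one header selects the channel;
    // events with a missing or unmapped value go to the default channel.
    static void multiplex(Map<String, String> headers, String body, String headerName,
                          Map<String, List<String>> mapping, List<String> defaultChannel) {
        String key = headers.get(headerName);
        List<String> target = (key != null && mapping.containsKey(key))
            ? mapping.get(key)
            : defaultChannel;
        target.add(body);
    }

    public static void main(String[] args) {
        List<String> web = new ArrayList<>();
        List<String> other = new ArrayList<>();
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("web", web);

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "web"); // header name and value are made up for the demo
        multiplex(headers, "GET /index.html", "type", mapping, other);
        System.out.println(web.size() + " " + other.size());
    }
}
```

Fan in needs no selector at all in this model: several sources simply add to the same list.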