Apache Hadoop Ingestion Patterns & Apache Flume

1

Apache Hadoop Ingestion Patterns & Apache Flume

Ted Malaska

2

Agenda

• Selecting an Ingestion Strategy
• Apache Flume
  • High Level Components
  • Flume’s Guarantees
  • Common Architectures
  • Detailed Configurations
  • Performance Tuning
• Example

3

Selecting an Ingestion Strategy

• Timeliness
• Append or Delta
• Access Patterns
• Original Source System
• Network Concerns
• Transformation, Partitioning, and Bifurcation

4

Timeliness

• Macro Batch: 15 minutes to hours

• Micro Batch: 4 minutes to 15 minutes

• Mini Micro Batch: Under 4 minutes but greater than 30 seconds

• Near Real Time Decision Support: Under 30 seconds but over 2 seconds

• Near Real Time Event Processing: Down to about 100 to 200 milliseconds

• Real Time:
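As a rough illustration, the tiers above can be expressed as a lookup on end-to-end latency. This is only a sketch: the function name classify_timeliness is hypothetical, the boundaries are read directly from the bullets, and because the slide gives no figure for Real Time, the 0.1 second cutoff is an assumption.

```python
def classify_timeliness(latency_seconds):
    """Map an end-to-end latency in seconds to one of the slide's tiers."""
    if latency_seconds >= 15 * 60:
        return "Macro Batch"
    if latency_seconds >= 4 * 60:
        return "Micro Batch"
    if latency_seconds > 30:
        return "Mini Micro Batch"
    if latency_seconds > 2:
        return "Near Real Time Decision Support"
    # Assumption: the slide leaves Real Time undefined, so 0.1 s is a
    # placeholder lower bound for Near Real Time Event Processing.
    if latency_seconds >= 0.1:
        return "Near Real Time Event Processing"
    return "Real Time"
```

For example, classify_timeliness(60) falls in the Mini Micro Batch tier and classify_timeliness(10) in Near Real Time Decision Support.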

5

Append or Delta

• Existing Data is Immutable
• Existing Data is Mutable for a Fixed Window
• Existing Data is Always Mutable

6

Access Patterns

• Batch
  • MR
  • Hive
  • Pig
  • Crunch
  • Graph

• Time of Thought or NRT
  • Impala
  • Search
  • Get, Put, Scan

7

Original Source System

• File System
• RDBMS
• Stream
• Log Files

8

Network Concerns

• Security
• Bandwidth and Compression

9

Transformation, Partitioning, and Bifurcation

• Transformation: Converting XML or JSON to delimited data.

• Partitioning: Incoming data is stock trade data, and partitioning by ticker is required.

• Bifurcation: The data needs to land in both HDFS and HBase to serve different access patterns.
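The three ideas above can be sketched in a few lines of plain Python. Nothing here is Flume code: the trade schema in FIELDS, the function names, and the two lists standing in for HDFS and HBase are all hypothetical, chosen only to make each step concrete.

```python
import json
from collections import defaultdict

FIELDS = ("ticker", "price", "volume")  # hypothetical trade schema


def transform(raw_json, delimiter="|"):
    """Transformation: turn one JSON record into a delimited row."""
    record = json.loads(raw_json)
    return delimiter.join(str(record[name]) for name in FIELDS)


def partition(rows, delimiter="|"):
    """Partitioning: group rows by ticker, the first delimited column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row.split(delimiter, 1)[0]].append(row)
    return dict(parts)


def bifurcate(rows, *destinations):
    """Bifurcation: deliver the same rows to every destination."""
    for destination in destinations:
        destination.extend(rows)


events = [
    '{"ticker": "ABC", "price": 10.5, "volume": 100}',
    '{"ticker": "XYZ", "price": 7.25, "volume": 40}',
    '{"ticker": "ABC", "price": 10.75, "volume": 20}',
]
rows = [transform(event) for event in events]
by_ticker = partition(rows)

hdfs_stand_in, hbase_stand_in = [], []  # placeholders for the real stores
bifurcate(rows, hdfs_stand_in, hbase_stand_in)
```

In a real pipeline each step would more likely live in an interceptor, a channel selector, or a sink path rather than in one script.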

10

Apache Flume

• History
  • Scribe
  • Flume
  • Flume NG

11

High Level Components

[Diagram: Point A (Avro Client, JMS) -> Sources -> Interceptors -> Selectors -> Channels -> Sinks -> Point B (HDFS, HBase)]

12

Sources

• AvroSource
• HTTPSource
• NetcatSource
• SpoolDirectorySource
• ExecSource
• JMSSource
• ThriftSource
• SyslogTcpSource
• SyslogUDPSource

13

Interceptors

• RegexExtractorInterceptor
• TimestampInterceptor
• StaticInterceptor
• HostInterceptor
• Custom
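Conceptually, each interceptor inspects an event between the source and the channel and adds or changes headers. The sketch below imitates that idea in plain Python; it is not Flume's Java Interceptor API, the dictionary event shape and function names are hypothetical, and real interceptors have more options than shown.

```python
import re
import socket
import time


def timestamp_interceptor(event, now_ms=None):
    """Add a 'timestamp' header holding epoch milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event["headers"]["timestamp"] = str(now_ms)
    return event


def host_interceptor(event, hostname=None):
    """Add a 'host' header naming the machine that handled the event."""
    event["headers"]["host"] = hostname or socket.gethostname()
    return event


def static_interceptor(event, key, value):
    """Add one fixed key/value header to every event."""
    event["headers"][key] = value
    return event


def regex_extractor_interceptor(event, pattern, header_names):
    """Copy regex capture groups from the body into named headers."""
    match = re.search(pattern, event["body"])
    if match:
        for name, value in zip(header_names, match.groups()):
            event["headers"][name] = value
    return event


# Interceptors run as a chain, each receiving the previous one's output.
event = {"headers": {}, "body": "ERROR disk full on /data"}
event = timestamp_interceptor(event, now_ms=1400000000000)
event = host_interceptor(event, hostname="web01")
event = static_interceptor(event, "datacenter", "east")
event = regex_extractor_interceptor(event, r"^(\w+) ", ["severity"])
```

Headers set this way are what selectors and sink path escapes can later key on.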

14

Selectors

• MultiplexingChannelSelector
• ReplicatingChannelSelector
• Custom
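The difference between the two built-in selectors can be sketched as follows: a replicating selector hands every event to every channel, while a multiplexing selector picks channels from the value of one header. This is an illustrative model only; the function signatures, channel names, and the "type" header are hypothetical, and channels are modelled as plain lists.

```python
def replicating_selector(event, channels):
    """Return every channel, so each one receives a copy of the event."""
    return list(channels.values())


def multiplexing_selector(event, channels, header, mapping, default):
    """Return the channels mapped to the event's header value, else the default."""
    value = event["headers"].get(header)
    names = mapping.get(value, default)
    return [channels[name] for name in names]


channels = {"hdfs_channel": [], "hbase_channel": []}
mapping = {"alert": ["hbase_channel"]}  # alerts go to the HBase-bound channel
default = ["hdfs_channel"]              # everything else goes to HDFS

alert = {"headers": {"type": "alert"}, "body": "cpu high"}
routine = {"headers": {"type": "metric"}, "body": "cpu 12%"}

for event in (alert, routine):
    for channel in multiplexing_selector(event, channels, "type", mapping, default):
        channel.append(event)
```

Replication suits bifurcation to two stores; multiplexing suits alerting or partitioning, where only some events should take a given path.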

15

Channel

• FileChannel
• MemoryChannel

16

Sinks

• HDFSEventSink
• HBaseSink
• AsyncHBaseSink
• NullSink
• RollingFileSink
• AvroSink
• ThriftSink
• MorphlineSink
• ElasticSearchSink

17

Flume’s Guarantees

• There is no such thing as a 100% guarantee
• Flume offers several levels of configurable guarantees
• This is done through transactions

18

Flume’s Guarantees (Transactions 1 of 3)

[Diagram: the Avro Client submits a batch to the Flume Agent; the Flume Agent confirms the batch with guarantees]
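The submit-and-confirm exchange can be modelled as a client that resends a batch until the agent acknowledges it. The sketch below is a simplified stand-in, not Flume's RPC client: FlakyAgent, send_with_retry, and the use of ConnectionError to signal a missing confirmation are all hypothetical.

```python
class FlakyAgent:
    """Stand-in for an agent that rejects the first `failures` batches."""

    def __init__(self, failures):
        self.failures = failures
        self.stored = []

    def submit(self, batch):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("batch not confirmed")
        # All or nothing: events are stored only when the call succeeds.
        self.stored.extend(batch)


def send_with_retry(agent, batch, max_attempts=3):
    """Resend the whole batch until it is confirmed; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            agent.submit(batch)
            return attempt
        except ConnectionError:
            continue  # no confirmation, so the same batch is sent again
    raise RuntimeError("batch was never confirmed")


agent = FlakyAgent(failures=2)
attempts = send_with_retry(agent, ["e1", "e2", "e3"])
```

Because the client resends whenever a confirmation is missing, a batch that was stored but whose confirmation was lost would arrive twice, so downstream consumers should tolerate duplicates.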

19


Flume’s Guarantees (Transactions 2 of 3)

[Diagram: Point A (Avro Client, JMS) -> Sources -> Interceptors -> Selectors -> Channels -> Sinks -> Point B (HDFS, HBase)]


20

Flume’s Guarantees (Transactions 3 of 3)

• Memory Channel: Best Effort
• File Channel: JBOD
• File Channel: RAID
• File Channel: NAS or SAN

21

Common Architectures (Fan In)

[Diagram: fan-in topology ending at HDFS]

22

Common Architectures (Bifurcation)

[Diagram: flow bifurcated to HDFS and HDFS DR]

23

Common Architectures (Alerting or Partitioning)

[Diagram: flow routed to HDFS, HBase, Partition 1, and Partition 2]

24

Detailed Configurations: Avro Source & Client

• Bind and port
• Threads
• Batch Size
• Compression
• SSL Encryption
• IP Filtering

25

Detailed Configurations: JMS Source

• Connection Factory
• Provider URL
• Destination Name
• Destination Type (queue or topic)
• Message Selector
• User Name
• Password File
• Batch Size

26

Detailed Configurations: FileChannel

• User home
• Data Directories
• Capacity
• Keep alive
• Transaction Capacity
• Checkpointing
  • Directory
  • Use Dual Checkpoints
  • Backup checkpoint directory
  • Checkpoint Interval
• Max file size
• Minimum required space
• useFastReplay
• encryptionActiveKey & encryptionCipherProvider

27

Detailed Configurations: MemoryChannel

• Capacity
• transactionCapacity
• byteCapacity
• byteCapacityBufferPercentage
• Keep-Alive
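How capacity and transactionCapacity interact can be shown with a small in-memory model: capacity bounds the total number of events held, and transactionCapacity bounds how many events a single put or take may move. This is a sketch of the idea, not Flume's MemoryChannel; the class and method names are hypothetical, and byteCapacity and keep-alive are left out (this model rejects a full channel immediately, whereas a keep-alive setting implies waiting for space first).

```python
class BoundedChannel:
    """Toy channel bounded by total capacity and per-transaction size."""

    def __init__(self, capacity, transaction_capacity):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.events = []

    def put_batch(self, batch):
        """Commit the whole batch or nothing at all."""
        if len(batch) > self.transaction_capacity:
            raise ValueError("batch exceeds transactionCapacity")
        if len(self.events) + len(batch) > self.capacity:
            raise OverflowError("channel is full")
        self.events.extend(batch)

    def take_batch(self, size):
        """Remove and return up to `size` events, capped per transaction."""
        size = min(size, self.transaction_capacity)
        batch, self.events = self.events[:size], self.events[size:]
        return batch


channel = BoundedChannel(capacity=5, transaction_capacity=3)
channel.put_batch(["a", "b", "c"])
try:
    channel.put_batch(["d", "e", "f"])  # would exceed capacity of 5
    rejected = False
except OverflowError:
    rejected = True
taken = channel.take_batch(10)  # capped at transaction_capacity
```

The rejected batch leaves the channel untouched, which is the behavior a source relies on when it retries.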

28

Example of Configuration: HDFSEventSink (1 of 3)

• hdfs.path
• hdfs.filePrefix
• hdfs.inUsePrefix
• hdfs.inUseSuffix
• hdfs.rollInterval
• hdfs.rollCount
• hdfs.rollSize
• hdfs.codeC
• hdfs.fileType
• hdfs.idleTimeout
• hdfs.batchSize
• hdfs.threadsPoolSize
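The three roll settings work together: the sink closes the current file and starts a new one when any enabled threshold is reached, and setting a threshold to 0 disables it. The function below sketches that decision in Python. It is not the sink's implementation; the name should_roll is hypothetical, and the default values are those commonly listed in the Flume User Guide, which should be checked against the version in use.

```python
def should_roll(age_seconds, event_count, size_bytes,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Return True when any enabled threshold is reached; 0 disables a check."""
    return (
        (roll_interval > 0 and age_seconds >= roll_interval)
        or (roll_count > 0 and event_count >= roll_count)
        or (roll_size > 0 and size_bytes >= roll_size)
    )


# Roll on size only (128 MB), a common choice to avoid many small files.
size_only = dict(roll_interval=0, roll_count=0, roll_size=128 * 1024 * 1024)
small_file = should_roll(3600, 500000, 1024, **size_only)
large_file = should_roll(5, 10, 128 * 1024 * 1024, **size_only)
```

With size-only rolling, a quiet source can leave a file open for a long time, which is the case hdfs.idleTimeout is meant to cover.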

29


Example of Configuration: HDFSEventSink (2 of 3)

• Path Escaping
• Using Headers to partition data

Alias     Description
%{host}   Substitute value of event header named “host”. Arbitrary header names are supported.
%t        Unix time in milliseconds
%a        locale’s short weekday name (Mon, Tue, ...)
%A        locale’s full weekday name (Monday, Tuesday, ...)
%b        locale’s short month name (Jan, Feb, ...)
%B        locale’s long month name (January, February, ...)
%c        locale’s date and time (Thu Mar 3 23:05:25 2005)
%d        day of month (01)
%D        date; same as %m/%d/%y
%H        hour (00..23)
%I        hour (01..12)
%j        day of year (001..366)
%k        hour ( 0..23)
%m        month (01..12)
%M        minute (00..59)
%p        locale’s equivalent of am or pm
%s        seconds since 1970-01-01 00:00:00 UTC
%S        second (00..59)
%y        last two digits of year (00..99)
%Y        year (2010)
%z        +hhmm numeric timezone (for example, -0400)
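To make the escape sequences concrete, the sketch below expands a path template from an event's headers. It is an illustration rather than Flume's own code: expand_path is a hypothetical name, only %{header}, %t, and the %Y %m %d %H %M %S aliases are handled, the time is taken from a "timestamp" header in epoch milliseconds, and UTC is assumed so the result is deterministic.

```python
import re
from datetime import datetime, timezone

_DATE_CODES = set("YmdHMS")  # subset of the aliases supported by this sketch


def expand_path(template, headers):
    """Expand %{header}, %t and a few date aliases using the event headers."""
    ts_ms = int(headers["timestamp"])
    # Assumption: interpret the timestamp in UTC.
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)

    def replace(match):
        header_name, code = match.group(1), match.group(2)
        if header_name is not None:
            return headers[header_name]
        if code == "t":
            return str(ts_ms)
        if code in _DATE_CODES:
            return moment.strftime("%" + code)
        return match.group(0)  # leave unsupported aliases untouched

    return re.sub(r"%\{(\w+)\}|%([A-Za-z])", replace, template)


headers = {"timestamp": "1400000000000", "host": "web01"}
path = expand_path("/flume/events/%{host}/%Y/%m/%d/%H", headers)
# path is "/flume/events/web01/2014/05/13/16"
```

Partitioning by header and date in this way yields one directory per host per hour, which downstream batch jobs can select without scanning everything.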

30


Example of Configuration: HDFSEventSink (3 of 3)

• File Formats and Compression
  • Text Files
  • Sequence Files
  • Avro Files

• Can’t Use Columnar File Types
  • RC
  • Parquet

31

Example of Configuration: HBaseSink

• Table name
• Column Family
• Batch size
• HBase user
• kerberosPrincipal & kerberosKeytab
• enableWal
• Serializer

32

Example of Configuration: AsyncHBaseSink

• Table name
• Column Family
• Batch size
• HBase user
• enableWal
• Serializer

Thank you!
