Page 1: Apache Hadoop  Ingestion Patterns & Apache Flume

Apache Hadoop Ingestion Patterns & Apache Flume
Ted Malaska

Page 2: Apache Hadoop  Ingestion Patterns & Apache Flume

Agenda

• Selecting an Ingestion Strategy
• Apache Flume
  • High Level Components
  • Flume’s Guarantees
  • Common Architectures
  • Detailed Configurations
  • Performance Tuning
• Example

Page 3: Apache Hadoop  Ingestion Patterns & Apache Flume

Selecting an Ingestion Strategy

• Timeliness
• Append or Delta
• Access Patterns
• Original Source System
• Network Concerns
• Transformation, Partitioning, and Bifurcation

Page 4: Apache Hadoop  Ingestion Patterns & Apache Flume

Timeliness

• Macro Batch: 15 minutes to hours

• Micro Batch: 4 minutes to 15 minutes

• Mini Micro Batch: Under 4 minutes but greater than 30 seconds

• Near Real Time Decision Support: Under 30 seconds but over 2 seconds

• Near Real Time Event Processing: Down to about 100 to 200 milliseconds

• Real Time:
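
The tiers above can be read as a lookup from required latency to a tier name. The following is a minimal sketch, not part of the original deck: the function name, the treatment of exact boundary values, and the 0.1 second floor for event processing (the slide says "about 100 to 200 milliseconds") are assumptions, and the slide gives no bound for Real Time.

```python
# Lower bounds in seconds, taken from the slide. How exact boundary values
# are classified is an assumption made for this illustration.
TIERS = [
    (15 * 60, "Macro Batch"),
    (4 * 60, "Micro Batch"),
    (30, "Mini Micro Batch"),
    (2, "Near Real Time Decision Support"),
    (0.1, "Near Real Time Event Processing"),  # assumed floor of ~100 ms
]


def timeliness_tier(latency_seconds):
    """Return the tier name for a required end-to-end latency in seconds."""
    for lower_bound, name in TIERS:
        if latency_seconds >= lower_bound:
            return name
    # The slide leaves this tier undefined; anything faster lands here.
    return "Real Time"
```

For example, a feed that must be queryable within ten minutes falls under Micro Batch, while a 10 second requirement falls under Near Real Time Decision Support.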

Page 5: Apache Hadoop  Ingestion Patterns & Apache Flume

Append or Delta

• Existing Data is Immutable
• Existing Data is Mutable for a Fixed Window
• Existing Data is Always Mutable

Page 6: Apache Hadoop  Ingestion Patterns & Apache Flume

Access Patterns

• Batch
  • MR
  • Hive
  • Pig
  • Crunch
  • Graph
• Time of Thought or NRT
  • Impala
  • Search
  • Get, Put, Scan

Page 7: Apache Hadoop  Ingestion Patterns & Apache Flume

Original Source System

• File System
• RDBMS
• Stream
• Log Files

Page 8: Apache Hadoop  Ingestion Patterns & Apache Flume

Network Concerns

• Security
• Bandwidth and Compression

Page 9: Apache Hadoop  Ingestion Patterns & Apache Flume

Transformation, Partitioning, and Bifurcation

• Transformation: Converting XML or JSON to delimited data.

• Partitioning: Incoming data is stock trade data and partitioning by ticker is required

• Bifurcation: The data needs to land in HDFS and HBase for different access patterns
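
The three steps above can be sketched end to end in a few lines. This is an illustration only: the field names (`ticker`, `price`, `volume`), the `|` delimiter, and the in-memory stand-ins for HDFS and HBase are all assumptions, not anything the deck specifies.

```python
import json
from collections import defaultdict


def transform(raw_json, fields=("ticker", "price", "volume"), delimiter="|"):
    """Transformation: turn one JSON record into a delimited line."""
    record = json.loads(raw_json)
    return delimiter.join(str(record[name]) for name in fields)


def partition_key(line, delimiter="|"):
    """Partitioning: the ticker is the first delimited field."""
    return line.split(delimiter)[0]


def ingest(raw_events):
    """Bifurcation: land every event in both destinations."""
    hdfs_partitions = defaultdict(list)  # stand-in for per-ticker HDFS paths
    hbase_rows = []                      # stand-in for HBase puts
    for raw in raw_events:
        line = transform(raw)
        hdfs_partitions[partition_key(line)].append(line)
        hbase_rows.append(line)
    return dict(hdfs_partitions), hbase_rows
```

In a Flume deployment these roles would typically be played by an interceptor or serializer (transformation), header-based path escaping or a channel selector (partitioning), and a replicating selector feeding two sinks (bifurcation).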

Page 10: Apache Hadoop  Ingestion Patterns & Apache Flume

Apache Flume

• History
  • Scribe
  • Flume
  • Flume NG

Page 11: Apache Hadoop  Ingestion Patterns & Apache Flume

High Level Components

[Diagram: Point A → Sources → Interceptors → Selectors → Channels → Sinks → Point B; labeled endpoints: Avro Client, JMS, HDFS, HBase]

Page 12: Apache Hadoop  Ingestion Patterns & Apache Flume

Sources

• AvroSource
• HTTPSource
• NetcatSource
• SpoolDirectorySource
• ExecSource
• JMSSource
• ThriftSource
• SyslogTcpSource
• SyslogUDPSource

Page 13: Apache Hadoop  Ingestion Patterns & Apache Flume

Interceptors

• RegexExtractorInterceptor
• TimestampInterceptor
• StaticInterceptor
• HostInterceptor
• Custom

Page 14: Apache Hadoop  Ingestion Patterns & Apache Flume

Selectors

• MultiplexingChannelSelector
• ReplicatingChannelSelector
• Custom
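
The difference between the two built-in selectors can be shown with a small simulation. This is a sketch in Python rather than Flume's Java API: the event shape (a dict with `headers` and `body`) and the function names are assumptions made for illustration.

```python
def replicating_select(event, channels):
    """Replicating: every configured channel receives a copy of the event."""
    return list(channels)


def multiplexing_select(event, header, mapping, default):
    """Multiplexing: the value of one event header picks the channels.

    `mapping` maps header values to channel lists; events whose header is
    missing or unmapped go to the `default` channels.
    """
    value = event["headers"].get(header)
    return mapping.get(value, default)
```

A replicating selector suits bifurcation (the same event to HDFS and HBase), while a multiplexing selector suits partitioning or alerting, where a header such as a ticker or severity decides the route.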

Page 15: Apache Hadoop  Ingestion Patterns & Apache Flume

Channel

• FileChannel
• MemoryChannel

Page 16: Apache Hadoop  Ingestion Patterns & Apache Flume

Sinks

• HDFSEventSink
• HBaseSink
• AsyncHBaseSink
• NullSink
• RollingFileSink
• AvroSink
• ThriftSink
• MorphlineSink
• ElasticSearchSink

Page 17: Apache Hadoop  Ingestion Patterns & Apache Flume

Flume’s Guarantees

• There is no such thing as a 100% guarantee
• Flume offers several levels of configurable guarantees
• This is done through transactions
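
The transactional idea can be illustrated with a toy channel: a batch is either accepted whole or rejected whole on the way in, and events leave the channel only after the downstream delivery succeeds. This is a simplified sketch, not Flume's actual `Channel` or `Transaction` API; the class and method names are hypothetical.

```python
class ChannelFullError(Exception):
    """Raised when a batch does not fit; the sender is expected to retry."""


class ToyChannel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def put_batch(self, batch):
        """Source side: commit the whole batch or none of it."""
        if len(self.events) + len(batch) > self.capacity:
            raise ChannelFullError("batch rejected, nothing was stored")
        self.events.extend(batch)

    def take_batch(self, max_events, deliver):
        """Sink side: remove events only if `deliver` succeeds."""
        batch = self.events[:max_events]
        try:
            deliver(batch)
        except Exception:
            return 0  # rollback: events stay in the channel for a retry
        del self.events[:len(batch)]  # commit
        return len(batch)
```

Because a failed delivery is retried, this style of guarantee can deliver an event more than once, which is one reason a 100% exactly-once guarantee is not on offer.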

Page 18: Apache Hadoop  Ingestion Patterns & Apache Flume

Flume’s Guarantees (Transactions 1 of 3)

[Diagram: Avro Client → Flume Agent: “Submit a Batch”; Flume Agent → Avro Client: “Confirm Batch With Guarantees”]

Page 19: Apache Hadoop  Ingestion Patterns & Apache Flume

Flume’s Guarantees (Transactions 2 of 3)

[Diagram: Point A → Sources → Interceptors → Selectors → Channels → Sinks → Point B; labeled endpoints: Avro Client, JMS, HDFS, HBase]

Page 20: Apache Hadoop  Ingestion Patterns & Apache Flume

Flume’s Guarantees (Transactions 3 of 3)

• Memory Channel: Best Effort
• File Channel: JBOD
• File Channel: RAID
• File Channel: NAS or SAN

Page 21: Apache Hadoop  Ingestion Patterns & Apache Flume

Common Architectures (Fan In)

[Diagram: fan-in architecture; labeled endpoint: HDFS]

Page 22: Apache Hadoop  Ingestion Patterns & Apache Flume

Common Architectures (Bifurcation)

[Diagram: bifurcation architecture; labeled endpoints: HDFS, HDFS DR]

Page 23: Apache Hadoop  Ingestion Patterns & Apache Flume

Common Architectures (Alerting or Partitioning)

[Diagram: alerting or partitioning architecture; labeled endpoints: HDFS, HBase, Partition 1, Partition 2]

Page 24: Apache Hadoop  Ingestion Patterns & Apache Flume

Detailed Configurations: Avro Source & Client

• Bind and port
• Threads
• Batch Size
• Compression
• SSL Encryption
• IP Filtering

Page 25: Apache Hadoop  Ingestion Patterns & Apache Flume

Detailed Configurations: JMS Source

• Connection Factory
• Provider URL
• Destination Name
• Destination Type (queue or topic)
• Message Selector
• User Name
• Password File
• Batch Size

Page 26: Apache Hadoop  Ingestion Patterns & Apache Flume

Detailed Configurations: FileChannel

• User home
• Data Directories
• Capacity
• Keep alive
• Transaction Capacity
• Checkpointing
  • Directory
  • Use Dual Checkpoints
  • Backup checkpoint directory
  • Checkpoint Interval
• Max file size
• Minimum required space
• useFastReplay
• encryptionActiveKey & encryptionCipherProvider

Page 27: Apache Hadoop  Ingestion Patterns & Apache Flume

Detailed Configurations: MemoryChannel

• Capacity
• transactionCapacity
• byteCapacity
• byteCapacityBufferPercentage
• Keep-Alive

Page 28: Apache Hadoop  Ingestion Patterns & Apache Flume

Example of Configuration: HDFSEventSink (1 of 3)

• hdfs.path
• hdfs.filePrefix
• hdfs.inUsePrefix
• hdfs.inUseSuffix
• hdfs.rollInterval
• hdfs.rollCount
• hdfs.rollSize
• hdfs.codeC
• hdfs.fileType
• hdfs.idleTimeout
• hdfs.batchSize
• hdfs.threadsPoolSize
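
The three roll settings interact as independent triggers: the sink closes the current file when any enabled one fires, and setting a trigger to 0 disables it. The sketch below models that decision; the function name is hypothetical, the default values shown are the ones I understand the Flume user guide to list (30 seconds, 10 events, 1024 bytes), and whether the comparison is `>=` or `>` is an assumption.

```python
def should_roll(age_seconds, event_count, size_bytes,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Return True when any enabled roll trigger has been reached.

    A limit of 0 disables that trigger, mirroring hdfs.rollInterval,
    hdfs.rollCount, and hdfs.rollSize.
    """
    triggers = (
        (roll_interval, age_seconds),
        (roll_count, event_count),
        (roll_size, size_bytes),
    )
    return any(limit > 0 and current >= limit for limit, current in triggers)
```

A common tuning step is to disable the count and size triggers or raise them substantially, since very small limits produce many small HDFS files.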

Page 29: Apache Hadoop  Ingestion Patterns & Apache Flume

Example of Configuration: HDFSEventSink (2 of 3)

• Path Escaping
• Using Headers to partition data

Alias     Description
%{host}   Substitute value of event header named “host”. Arbitrary header names are supported.
%t        Unix time in milliseconds
%a        locale’s short weekday name (Mon, Tue, ...)
%A        locale’s full weekday name (Monday, Tuesday, ...)
%b        locale’s short month name (Jan, Feb, ...)
%B        locale’s long month name (January, February, ...)
%c        locale’s date and time (Thu Mar 3 23:05:25 2005)
%d        day of month (01)
%D        date; same as %m/%d/%y
%H        hour (00..23)
%I        hour (01..12)
%j        day of year (001..366)
%k        hour ( 0..23)
%m        month (01..12)
%M        minute (00..59)
%p        locale’s equivalent of am or pm
%s        seconds since 1970-01-01 00:00:00 UTC
%S        second (00..59)
%y        last two digits of year (00..99)
%Y        year (2010)
%z        +hhmm numeric timezone (for example, -0400)
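
To make the escaping concrete, here is a small Python model of how a path such as `/flume/%{host}/%Y/%m/%d` might be expanded from event headers. It is a sketch, not Flume's implementation: it supports only a subset of the aliases, reads the event time from a `timestamp` header in milliseconds, and resolves dates in UTC for reproducibility (an assumption; a real sink's time zone is configurable).

```python
import re
from datetime import datetime, timezone

# Subset of the table's aliases that map directly onto strftime codes.
STRFTIME_ALIASES = set("YmdHMSyj")
TOKEN = re.compile(r"%\{([^}]+)\}|%([A-Za-z])")


def expand_path(path, headers):
    """Expand %{header} and date aliases using the event's headers."""
    millis = int(headers["timestamp"])
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

    def replace(match):
        header_name, alias = match.group(1), match.group(2)
        if header_name is not None:
            return headers[header_name]
        if alias == "t":
            return str(millis)
        if alias == "s":
            return str(millis // 1000)
        if alias in STRFTIME_ALIASES:
            return when.strftime("%" + alias)
        return match.group(0)  # leave aliases outside the subset untouched

    return TOKEN.sub(replace, path)
```

With a `host` header of `web01` and a timestamp of 2005-03-03 23:05:25 UTC, the path `/flume/%{host}/%Y/%m/%d/%H` expands to `/flume/web01/2005/03/03/23`, so each host and hour lands in its own directory.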

Page 30: Apache Hadoop  Ingestion Patterns & Apache Flume

Example of Configuration: HDFSEventSink (3 of 3)

• File Formats and Compression
  • Text Files
  • Sequence Files
  • Avro Files
• Can’t Use Columnar File Types
  • RC
  • Parquet

Page 31: Apache Hadoop  Ingestion Patterns & Apache Flume

Example of Configuration: HBaseSink

• Table name
• Column Family
• Batch size
• HBase user
• kerberosPrincipal & kerberosKeytab
• enableWal
• Serializer

Page 32: Apache Hadoop  Ingestion Patterns & Apache Flume

Example of Configuration: AsyncHBaseSink

• Table name
• Column Family
• Batch size
• HBase user
• enableWal
• Serializer

Page 33: Apache Hadoop  Ingestion Patterns & Apache Flume

Thank you!

