Apache Hadoop Ingestion Patterns & Apache Flume

Feb 23, 2016


Ted Malaska. PowerPoint presentation.

Slide 1: Apache Hadoop Ingestion Patterns & Apache Flume
Ted Malaska

Slide 2: Agenda
- Selecting an Ingestion Strategy
- Apache Flume
- High Level Components
- Flume's Guarantees
- Common Architectures
- Detailed Configurations
- Performance Tuning
- Example

Slide 3: Selecting an Ingestion Strategy
- Timeliness
- Append or Delta
- Access Patterns
- Original Source System
- Network Concerns
- Transformation, Partitioning, and Bifurcation

Slide 4: Timeliness
- Macro Batch: 15 minutes to hours
- Micro Batch: 4 minutes to 15 minutes
- Mini Micro Batch: under 4 minutes but greater than 30 seconds
- Near Real Time Decision Support: under 30 seconds but over 2 seconds
- Near Real Time Event Processing: down to about 100 to 200 milliseconds
- Real Time:
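The tiers above can be read as a simple lookup from a latency requirement to an ingestion style. The sketch below is only an illustration: the function name is hypothetical, the 0.1-second lower bound takes the low end of the slide's "100 to 200 milliseconds", and because the slide's Real Time threshold did not survive, anything faster than the event-processing tier is simply assumed to be Real Time. The slide does not say which tier owns the exact boundaries of 30 seconds and 2 seconds, so the sketch assigns each to the faster tier.

```python
def timeliness_tier(latency_seconds: float) -> str:
    """Map an end-to-end latency requirement (in seconds) to a tier name.

    Boundaries follow the Timeliness slide; the exact 30 s and 2 s
    boundary cases and the Real Time cutoff are assumptions.
    """
    if latency_seconds < 0:
        raise ValueError("latency must be non-negative")
    if latency_seconds >= 15 * 60:
        return "Macro Batch"
    if latency_seconds >= 4 * 60:
        return "Micro Batch"
    if latency_seconds > 30:
        return "Mini Micro Batch"
    if latency_seconds > 2:
        return "Near Real Time Decision Support"
    if latency_seconds >= 0.1:  # assumed lower bound (100 ms)
        return "Near Real Time Event Processing"
    return "Real Time"  # assumed: anything faster than the tier above


if __name__ == "__main__":
    for seconds in (3600, 300, 60, 10, 0.5):
        print(seconds, "->", timeliness_tier(seconds))
```

A requirement of "data visible within 10 seconds" would land in Near Real Time Decision Support under these assumptions.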

Slide 5: Append or Delta
- Existing Data is Immutable
- Existing Data is Mutable for a Fixed Window
- Existing Data is Always Mutable

Slide 6: Access Patterns
- Batch
  - MR
  - Hive
  - Pig
  - Crunch
  - Graph
- Time of Thought or NRT
  - Impala
  - Search
- Get, Put, Scan

Slide 7: Original Source System
- File System
- RDBMS
- Stream
- Log Files

Slide 8: Network Concerns
- Security
- Bandwidth and Compression

Slide 9: Transformation, Partitioning, and Bifurcation
- Transformation: converting XML or JSON to delimited data.
- Partitioning: the incoming data is stock trade data, and partitioning by ticker is required.
- Bifurcation: the data needs to land in both HDFS and HBase to serve different access patterns.
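The three operations above can be sketched in a few lines. This is a minimal illustration, not Flume code: the field names (`ticker`, `price`, `volume`), the pipe delimiter, and the in-memory lists standing in for HDFS and HBase are all assumptions made for the example.

```python
import json
from collections import defaultdict


def transform(raw_json, fields=("ticker", "price", "volume"), delimiter="|"):
    """Transformation: turn one JSON record into a delimited line."""
    record = json.loads(raw_json)
    return delimiter.join(str(record[name]) for name in fields)


def partition_by_ticker(raw_events):
    """Partitioning: group transformed lines by their ticker symbol."""
    partitions = defaultdict(list)
    for raw in raw_events:
        ticker = json.loads(raw)["ticker"]
        partitions[ticker].append(transform(raw))
    return dict(partitions)


def bifurcate(line, hdfs_store, hbase_store):
    """Bifurcation: deliver the same line to two destinations."""
    hdfs_store.append(line)   # stand-in for an HDFS write
    hbase_store.append(line)  # stand-in for an HBase put


if __name__ == "__main__":
    events = [
        '{"ticker": "ABC", "price": 10.5, "volume": 100}',
        '{"ticker": "XYZ", "price": 7.25, "volume": 40}',
        '{"ticker": "ABC", "price": 10.75, "volume": 60}',
    ]
    print(partition_by_ticker(events))
    hdfs, hbase = [], []
    for event in events:
        bifurcate(transform(event), hdfs, hbase)
    print(len(hdfs), len(hbase))
```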

Slide 10: Apache Flume
- History
  - Scribe
  - Flume
  - Flume NG

Slide 11: High Level Components
[Diagram labels: Point A, Avro Client, JMS, Sources, Interceptors, Selectors, Channels, Sinks, HDFS, HBase, Point B]

Slide 12: Sources
- AvroSource
- HTTPSource
- NetcatSource
- SpoolDirectorySource
- ExecSource
- JMSSource
- ThriftSource
- SyslogTcpSource
- SyslogUDPSource

Slide 13: Interceptors
- RegexExtractorInterceptor
- TimestampInterceptor
- StaticInterceptor
- HostInterceptor
- Custom

Slide 14: Selectors
- MultiplexingChannelSelector
- ReplicatingChannelSelector
- Custom
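The difference between the two built-in selectors is easiest to see side by side. The sketch below models the idea only and is not Flume's implementation: events are plain dictionaries with a `headers` map, channels are represented by their names, and the function names are hypothetical.

```python
def replicating_select(event, channels):
    """Replicating: every event is copied to every configured channel."""
    return list(channels)


def multiplexing_select(event, header, mapping, default):
    """Multiplexing: route by the value of one event header.

    `mapping` maps header values to lists of channel names; events whose
    header value is missing or unmapped go to the `default` channels.
    """
    value = event["headers"].get(header)
    return mapping.get(value, default)


if __name__ == "__main__":
    event = {"headers": {"type": "alert"}, "body": b"disk full"}
    print(replicating_select(event, ["hdfs-ch", "hbase-ch"]))
    print(multiplexing_select(
        event,
        header="type",
        mapping={"alert": ["hbase-ch"], "log": ["hdfs-ch"]},
        default=["hdfs-ch"],
    ))
```

Replication suits bifurcation (the same data landing in two stores), while multiplexing suits partitioning or alerting, where a header decides the destination.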

Slide 15: Channel
- FileChannel
- MemoryChannel

Slide 16: Sinks
- HDFSEventSink
- HBaseSink
- AsyncHBaseSink
- NullSink
- RollingFileSink
- AvroSink
- ThriftSink
- MorphlineSink
- ElasticSearchSink

Slide 17: Flume's Guarantees
- There is no such thing as a 100% guarantee
- Flume offers several levels of configurable guarantees
- This is done through transactions
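The transactional idea can be sketched as a channel that accepts or hands off events only in whole batches. This is a simplified model under stated assumptions, not Flume's channel API: the class, its method names, and the in-memory list are hypothetical, and `capacity` and `transaction_capacity` merely mirror the channel settings named later in the deck.

```python
class ChannelFullError(Exception):
    """Raised when a batch cannot be committed to the channel."""


class SketchChannel:
    def __init__(self, capacity, transaction_capacity):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.queue = []

    def put_batch(self, events):
        """Commit a whole batch or nothing (source-side transaction)."""
        if len(events) > self.transaction_capacity:
            raise ChannelFullError("batch exceeds transaction capacity")
        if len(self.queue) + len(events) > self.capacity:
            raise ChannelFullError("channel is full")
        self.queue.extend(events)

    def take_batch(self, max_events, deliver):
        """Remove events only after `deliver` succeeds (sink-side transaction)."""
        batch = self.queue[:max_events]
        try:
            deliver(batch)
        except Exception:
            return 0  # rollback: the events stay in the channel for a retry
        del self.queue[:len(batch)]  # commit
        return len(batch)


if __name__ == "__main__":
    channel = SketchChannel(capacity=100, transaction_capacity=10)
    channel.put_batch(["e1", "e2", "e3"])
    print(channel.take_batch(10, deliver=print))
```

Because a failed delivery leaves the batch in the channel, a retry can resend events the destination already received in part, which is one reason the guarantee is at-least-once rather than exactly-once.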

Slide 18: Flume's Guarantees (Transactions 1 of 3)
[Diagram labels: Avro Client, Flume Agent, Submit a Batch, Confirm Batch, With Guarantees]

Slide 19: Flume's Guarantees (Transactions 2 of 3)
[Diagram labels: Point A, Avro Client, JMS, Sources, Interceptors, Selectors, Channels, Sinks, HDFS, HBase, Point B]

Slide 20: Flume's Guarantees (Transactions 3 of 3)
- Memory Channel: Best Effort
- File Channel: JBOD
- File Channel: RAID
- File Channel: NAS or SAN

Slide 21: Common Architectures (Fan In)
[Diagram labels: HDFS]

Slide 22: Common Architectures (Bifurcation)
[Diagram labels: HDFS, HDFS DR]

Slide 23: Common Architectures (Alerting or Partitioning)
[Diagram labels: HDFS, HBase, Partition 1, Partition 2]

Slide 24: Detailed Configurations: Avro Source & Client
- Bind and port
- Threads
- Batch Size
- Compression
- SSL Encryption
- IP Filtering

Slide 25: Detailed Configurations: JMS Source
- Connection Factory
- Provider URL
- Destination Name
- Destination Type (queue or topic)
- Message Selector
- User Name
- Password File
- Batch Size

Slide 26: Detailed Configurations: FileChannel
- User home
- Data Directories
- Capacity
- Keep alive
- Transaction Capacity
- Checkpointing
  - Directory
  - Use Dual Checkpoints
  - Backup checkpoint directory
  - Checkpoint Interval
- Max file size
- Minimum required space
- useFastReplay
- encryptionActiveKey & encryptionCipherProvider

Slide 27: Detailed Configurations: MemoryChannel
- Capacity
- transactionCapacity
- byteCapacity
- byteCapacityBufferPercentage
- Keep-Alive

Slide 28: Example of Configuration: HDFSEventSink (1 of 3)
- hdfs.path
- hdfs.filePrefix
- hdfs.inUsePrefix
- hdfs.inUseSuffix
- hdfs.rollInterval
- hdfs.rollCount
- hdfs.rollSize
- hdfs.codeC
- hdfs.fileType
- hdfs.idleTimeout
- hdfs.batchSize
- hdfs.threadsPoolSize
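The three roll settings interact: a file is closed when any enabled trigger fires, and a value of 0 turns that trigger off. The sketch below models that decision only. The function name is hypothetical, the default values follow the Flume user guide (30 seconds, 10 events, 1024 bytes), and the exact comparison used at the boundary is an assumption.

```python
def should_roll(open_seconds, event_count, bytes_written,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Return True if the current HDFS file should be closed and a new one opened.

    A setting of 0 disables the corresponding trigger.
    """
    by_time = roll_interval > 0 and open_seconds >= roll_interval
    by_count = roll_count > 0 and event_count >= roll_count
    by_size = roll_size > 0 and bytes_written >= roll_size
    return by_time or by_count or by_size


if __name__ == "__main__":
    # With the defaults, ten small events are enough to close a file.
    print(should_roll(open_seconds=5, event_count=10, bytes_written=100))
    # Size-only rolling at 128 MB: time and count triggers are disabled.
    print(should_roll(600, 1_000_000, 64 * 1024 * 1024,
                      roll_interval=0, roll_count=0,
                      roll_size=128 * 1024 * 1024))
```

The defaults produce many small files, so deployments commonly disable the count trigger and raise the size or interval.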

Slide 29: Example of Configuration: HDFSEventSink (2 of 3)
Path Escaping: Using Headers to partition data

Alias      Description
%{host}    Substitute value of event header named "host". Arbitrary header names are supported.
%t         Unix time in milliseconds
%a         locale's short weekday name (Mon, Tue, ...)
%A         locale's full weekday name (Monday, Tuesday, ...)
%b         locale's short month name (Jan, Feb, ...)
%B         locale's long month name (January, February, ...)
%c         locale's date and time (Thu Mar 3 23:05:25 2005)
%d         day of month (01)
%D         date; same as %m/%d/%y
%H         hour (00..23)
%I         hour (01..12)
%j         day of year (001..366)
%k         hour ( 0..23)
%m         month (01..12)
%M         minute (00..59)
%p         locale's equivalent of am or pm
%s         seconds since 1970-01-01 00:00:00 UTC
%S         second (00..59)
%y         last two digits of year (00..99)
%Y         year (2010)
%z         +hhmm numeric timezone (for example, -0400)
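To make the escape sequences concrete, the sketch below expands a path template against an event's headers. It is an illustration only: it handles a subset of the aliases, reads the time from a `timestamp` header holding epoch milliseconds, and formats dates in UTC, which is an assumption of this example rather than a statement about the sink's default time zone. The function name and sample path are hypothetical.

```python
import re
from datetime import datetime, timezone

# Subset of the time aliases, mapped to their strftime equivalents.
_TIME_FORMATS = {"Y": "%Y", "y": "%y", "m": "%m", "d": "%d",
                 "H": "%H", "M": "%M", "S": "%S", "j": "%j"}

# Matches either %{headerName} or a single-letter alias such as %Y.
_PATTERN = re.compile(r"%\{([^}]+)\}|%([A-Za-z])")


def expand_path(template, headers):
    """Expand %{header} and a few time aliases in a path template."""
    millis = int(headers["timestamp"])  # epoch milliseconds
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

    def replace(match):
        header_name, alias = match.group(1), match.group(2)
        if header_name is not None:
            return headers[header_name]
        if alias == "t":
            return str(millis)
        if alias == "s":
            return str(millis // 1000)
        if alias in _TIME_FORMATS:
            return when.strftime(_TIME_FORMATS[alias])
        raise ValueError(f"alias %{alias} is not handled in this sketch")

    return _PATTERN.sub(replace, template)


if __name__ == "__main__":
    headers = {"host": "web01", "timestamp": "1456185600000"}
    print(expand_path("/flume/%{host}/%Y/%m/%d/%H", headers))
```

Partitioning directories by header and date in this way lets downstream tools prune by path instead of scanning every file.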

Slide 30: Example of Configuration: HDFSEventSink (3 of 3)
File Formats and Compression
- Text Files
- Sequence Files
- Avro Files
- Can't Use Columnar File Types
  - RC
  - Parquet

Slide 31: Example of Configuration: HBaseSink
- Table name
- Column Family
- Batch size
- HBase user
- kerberosPrincipal & kerberosKeytab
- enableWal
- Serializer

Slide 32: Example of Configuration: AsyncHBaseSink
- Table name
- Column Family
- Batch size
- HBase user
- enableWal
- Serializer

Slide 33: Thank you!
