1
Apache Hadoop Ingestion Patterns & Apache Flume
Ted Malaska
2
Agenda
• Selecting an Ingestion Strategy
• Apache Flume
  • High Level Components
  • Flume’s Guarantees
  • Common Architectures
  • Detailed Configurations
  • Performance Tuning
• Example
3
Selecting an Ingestion Strategy
• Timeliness
• Append or Delta
• Access Patterns
• Original Source System
• Network Concerns
• Transformation, Partitioning, and Bifurcation
4
Timeliness
• Macro Batch: 15 minutes to hours
• Micro Batch: 4 minutes to 15 minutes
• Mini Micro Batch: Under 4 minutes but greater than 30 seconds
• Near Real Time Decision Support: Under 30 seconds but over 2 seconds
• Near Real Time Event Processing: Down to about 100 to 200 milliseconds
• Real Time:
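The tiers above can be read as a simple lookup from a latency target to a category. The following is a minimal sketch, not part of the original deck. The function name `timeliness_tier` is hypothetical, the handling of exact boundary values (such as exactly 30 seconds) is an assumption, and so is treating anything under roughly 100 milliseconds as "Real Time", since the slide leaves that tier undefined.

```python
def timeliness_tier(latency_seconds: float) -> str:
    """Map an end-to-end latency target to one of the tiers named on the slide."""
    if latency_seconds >= 15 * 60:
        return "Macro Batch"
    if latency_seconds >= 4 * 60:
        return "Micro Batch"
    if latency_seconds > 30:
        return "Mini Micro Batch"
    if latency_seconds > 2:
        return "Near Real Time Decision Support"
    if latency_seconds >= 0.1:
        return "Near Real Time Event Processing"
    # Assumption: the slide gives no bound for this tier.
    return "Real Time"


if __name__ == "__main__":
    for target in (3600, 300, 60, 10, 0.15):
        print(target, "->", timeliness_tier(target))
```

Picking the tier first helps narrow the tool choice, since batch tiers and sub-second tiers rarely share the same ingestion path.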
5
Append or Delta
• Existing Data is Immutable
• Existing Data is Mutable for a Fixed Window
• Existing Data is Always Mutable
6
Access Patterns
• Batch
  • MR
  • Hive
  • Pig
  • Crunch
  • Graph
• Time of Thought or NRT
  • Impala
  • Search
  • Get, Put, Scan
7
Original Source System
• File System
• RDBMS
• Stream
• Log Files
8
Network Concerns
• Security
• Bandwidth and Compression
9
Transformation, Partitioning, and Bifurcation
• Transformation: Converting XML or JSON to delimited data.
• Partitioning: Incoming data is stock trade data, and partitioning by ticker is required.
• Bifurcation: The data needs to land in both HDFS and HBase to serve different access patterns.
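The three ideas above can be sketched together on a toy stock-trade feed. This is an illustrative sketch only: the field names (`ticker`, `price`, `volume`), the pipe delimiter, the row-key scheme, and the in-memory stand-ins for HDFS and HBase are all assumptions and do not come from the deck.

```python
import json
from collections import defaultdict


def transform(raw_json, fields=("ticker", "price", "volume"), delimiter="|"):
    """Transformation: turn one JSON record into a delimited line."""
    record = json.loads(raw_json)
    return delimiter.join(str(record[name]) for name in fields)


def ingest(raw_events):
    """Partition by ticker for the batch store and copy every record to a keyed store."""
    hdfs_partitions = defaultdict(list)  # stand-in for per-ticker HDFS directories
    hbase_rows = {}                      # stand-in for an HBase table keyed by row key
    for index, raw in enumerate(raw_events):
        ticker = json.loads(raw)["ticker"]
        line = transform(raw)
        hdfs_partitions[ticker].append(line)     # partitioning
        hbase_rows[f"{ticker}#{index}"] = line   # bifurcation: second destination
    return dict(hdfs_partitions), hbase_rows


if __name__ == "__main__":
    events = [
        '{"ticker": "AAPL", "price": 10.5, "volume": 100}',
        '{"ticker": "MSFT", "price": 20.0, "volume": 50}',
        '{"ticker": "AAPL", "price": 11.0, "volume": 75}',
    ]
    partitions, rows = ingest(events)
    print(partitions)
    print(rows)
```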
10
Apache Flume
• History
  • Scribe
  • Flume
  • Flume NG
11
High Level Components
[Diagram: events travel from Point A to Point B through Sources, Interceptors, Selectors, Channels, and Sinks. Endpoints shown: Avro Client, JMS, HDFS, HBase.]
12
Sources
• AvroSource
• HTTPSource
• NetcatSource
• SpoolDirectorySource
• ExecSource
• JMSSource
• ThriftSource
• SyslogTcpSource
• SyslogUDPSource
13
Interceptors
• RegexExtractorInterceptor
• TimestampInterceptor
• StaticInterceptor
• HostInterceptor
• Custom
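Interceptors decorate events on their way from a source to a channel, typically by adding headers. The sketch below models that idea in plain Python and is not Flume's Java API: the event layout (a dict with `headers` and `body`) and all function names are hypothetical. The header names `timestamp` and `host` mirror what the built-in timestamp and host interceptors write by default.

```python
import time
from functools import partial


def timestamp_interceptor(event, now_ms=None):
    """Add a 'timestamp' header holding epoch milliseconds."""
    headers = dict(event["headers"])
    stamp = int(time.time() * 1000) if now_ms is None else now_ms
    headers["timestamp"] = str(stamp)
    return {"headers": headers, "body": event["body"]}


def host_interceptor(event, host):
    """Add a 'host' header naming the machine that handled the event."""
    headers = dict(event["headers"])
    headers["host"] = host
    return {"headers": headers, "body": event["body"]}


def static_interceptor(event, key, value):
    """Add one fixed key/value header to every event."""
    headers = dict(event["headers"])
    headers[key] = value
    return {"headers": headers, "body": event["body"]}


def apply_chain(event, interceptors):
    """Run the interceptors in order, as an agent would for each source."""
    for interceptor in interceptors:
        event = interceptor(event)
    return event


if __name__ == "__main__":
    chain = [
        partial(timestamp_interceptor, now_ms=0),
        partial(host_interceptor, host="web01"),
        partial(static_interceptor, key="datacenter", value="east"),
    ]
    print(apply_chain({"headers": {}, "body": b"hello"}, chain))
```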
14
Selectors
• MultiplexingChannelSelector
• ReplicatingChannelSelector
• Custom
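The difference between the two built-in selectors can be shown in a few lines. This is a conceptual sketch rather than Flume code: the function names, the event layout, and the channel names are hypothetical. A replicating selector sends every event to all channels, while a multiplexing selector picks channels from the value of one header.

```python
def replicating_selector(event, channels):
    """Every event goes to every configured channel."""
    return list(channels)


def multiplexing_selector(event, header, mapping, default):
    """Route on one header value; fall back to the default channels when unmapped."""
    value = event["headers"].get(header)
    return list(mapping.get(value, default))


if __name__ == "__main__":
    event = {"headers": {"type": "alert"}, "body": b"disk full"}
    print(replicating_selector(event, ["hdfs_channel", "hbase_channel"]))
    print(multiplexing_selector(
        event,
        header="type",
        mapping={"alert": ["hbase_channel"], "log": ["hdfs_channel"]},
        default=["hdfs_channel"],
    ))
```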
15
Channel
• FileChannel
• MemoryChannel
16
Sinks
• HDFSEventSink
• HBaseSink
• AsyncHBaseSink
• NullSink
• RollingFileSink
• AvroSink
• ThriftSink
• MorphlineSink
• ElasticSearchSink
17
Flume’s Guarantees
• There is no such thing as a 100% guarantee
• Flume offers several levels of configurable guarantees
• This is done through transactions
18
Flume’s Guarantees (Transactions 1 of 3)
[Diagram: the Avro Client submits a batch to the Flume Agent, and the Flume Agent confirms the batch with guarantees.]
19
Flume’s Guarantees (Transactions 2 of 3)
[Diagram: events travel from Point A to Point B through Sources, Interceptors, Selectors, Channels, and Sinks. Endpoints shown: Avro Client, JMS, HDFS, HBase.]
Flume’s Guarantees (Transactions 3 of 3)
• Memory Channel: Best Effort
• File Channel: JBOD
• File Channel: RAID
• File Channel: NAS or SAN
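The transactional hand-off described across these three slides can be sketched with a toy channel. This is a simplified model under stated assumptions, not Flume's transaction API: the class `ToyChannel`, its methods, and `unavailable_sink` are hypothetical. It keeps everything in memory, so it corresponds to the best-effort end of the list above; a file-backed channel would persist the queue to disk.

```python
from collections import deque


class ToyChannel:
    """In-memory channel where each batch is committed whole or not at all."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        """Source side: accept the whole batch or none of it."""
        if len(self.queue) + len(events) > self.capacity:
            return False  # rollback: the sender should retry the batch later
        self.queue.extend(events)
        return True       # commit: the sender may now confirm the batch

    def take_batch(self, max_events, deliver):
        """Sink side: remove events only after `deliver` succeeds."""
        size = min(max_events, len(self.queue))
        batch = [self.queue[i] for i in range(size)]
        try:
            deliver(batch)
        except Exception:
            return 0      # rollback: events stay in the channel for a retry
        for _ in batch:
            self.queue.popleft()  # commit
        return len(batch)


def unavailable_sink(batch):
    """Hypothetical destination that is currently down."""
    raise IOError("sink unavailable")


if __name__ == "__main__":
    channel = ToyChannel(capacity=3)
    print(channel.put_batch(["e1", "e2"]))           # accepted
    print(channel.put_batch(["e3", "e4"]))           # rejected as a whole
    print(channel.take_batch(10, unavailable_sink))  # nothing removed
    print(channel.take_batch(10, print))             # delivered and removed
```

Because a failed delivery leaves the batch in place, a retry can deliver the same events again, which is one reason the first slide in this group says there is no such thing as a 100% guarantee.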
21
Common Architectures (Fan In)
[Diagram: fan-in topology ending at HDFS]
22
Common Architectures (Bifurcation)
[Diagram: destinations shown are HDFS and HDFS DR]
23
Common Architectures (Alerting or Partitioning)
[Diagram: destinations shown are HDFS, HBase, Partition 1, and Partition 2]
24
Detailed Configurations: Avro Source & Client
• Bind and port
• Threads
• Batch Size
• Compression
• SSL Encryption
• IP Filtering
25
Detailed Configurations: JMS Source
• Connection Factory
• Provider URL
• Destination Name
• Destination Type (queue or topic)
• Message Selector
• User Name
• Password File
• Batch Size
26
Detailed Configurations: FileChannel
• User home
• Data Directories
• Capacity
• Keep alive
• Transaction Capacity
• Checkpointing
  • Directory
  • Use Dual Checkpoints
  • Backup checkpoint directory
  • Checkpoint Interval
• Max file size
• Minimum required space
• useFastReplay
• encryptionActiveKey & encryptionCipherProvider
27
Detailed Configurations: MemoryChannel
• Capacity
• transactionCapacity
• byteCapacity
• byteCapacityBufferPercentage
• Keep-Alive
28
Example of Configuration: HDFSEventSink (1 of 3)
• hdfs.path
• hdfs.filePrefix
• hdfs.inUsePrefix
• hdfs.inUseSuffix
• hdfs.rollInterval
• hdfs.rollCount
• hdfs.rollSize
• hdfs.codeC
• hdfs.fileType
• hdfs.idleTimeout
• hdfs.batchSize
• hdfs.threadsPoolSize
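The three roll settings interact as independent triggers: the sink closes the current file and opens a new one when any enabled threshold is reached, and a value of 0 disables that trigger. The sketch below models this rule. The function name `should_roll` is hypothetical, the default values are the ones I understand the Flume user guide to list (30 seconds, 10 events, 1024 bytes), and the exact comparison (`>=` rather than `>`) is an assumption.

```python
def should_roll(age_seconds, event_count, size_bytes,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Return True when any enabled roll threshold has been reached.

    A threshold of 0 turns that trigger off, mirroring hdfs.rollInterval,
    hdfs.rollCount, and hdfs.rollSize.
    """
    return (
        (roll_interval > 0 and age_seconds >= roll_interval)
        or (roll_count > 0 and event_count >= roll_count)
        or (roll_size > 0 and size_bytes >= roll_size)
    )


if __name__ == "__main__":
    # With small default-like thresholds, files roll often and stay tiny.
    print(should_roll(age_seconds=31, event_count=1, size_bytes=10))
    # Size-only rolling: disable the time and count triggers.
    print(should_roll(600, 50000, 64 * 1024 * 1024,
                      roll_interval=0, roll_count=0, roll_size=128 * 1024 * 1024))
```

Small thresholds produce many small files, which HDFS handles poorly, so size-based rolling with the other triggers disabled is a common starting point.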
29
Example of Configuration: HDFSEventSink (2 of 3)
• Path Escaping
• Using Headers to partition data

Alias     Description
%{host}   Substitute value of event header named “host”. Arbitrary header names are supported.
%t        Unix time in milliseconds
%a        locale’s short weekday name (Mon, Tue, ...)
%A        locale’s full weekday name (Monday, Tuesday, ...)
%b        locale’s short month name (Jan, Feb, ...)
%B        locale’s long month name (January, February, ...)
%c        locale’s date and time (Thu Mar 3 23:05:25 2005)
%d        day of month (01)
%D        date; same as %m/%d/%y
%H        hour (00..23)
%I        hour (01..12)
%j        day of year (001..366)
%k        hour ( 0..23)
%m        month (01..12)
%M        minute (00..59)
%p        locale’s equivalent of am or pm
%s        seconds since 1970-01-01 00:00:00 UTC
%S        second (00..59)
%y        last two digits of year (00..99)
%Y        year (2010)
%z        +hhmm numeric timezone (for example, -0400)
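To make the escape table concrete, here is a minimal sketch of how a path template could be expanded from event headers. It is not Flume's implementation: the function `expand_path` is hypothetical, it covers only a subset of the aliases, it renders times in UTC (an assumption), and it substitutes an empty string for a missing header (also an assumption). It reads the event time from a `timestamp` header in epoch milliseconds.

```python
import re
from datetime import datetime, timezone

# Subset of the aliases in the table, mapped to strftime directives.
_TIME_ESCAPES = {"Y": "%Y", "y": "%y", "m": "%m", "d": "%d",
                 "H": "%H", "M": "%M", "S": "%S", "j": "%j"}

_PATTERN = re.compile(r"%\{([^}]+)\}|%([YymdHMSjts])")


def expand_path(template, headers):
    """Expand %{header} and a few time escapes using the event's headers."""
    ts_ms = int(headers["timestamp"])
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)

    def replace(match):
        if match.group(1) is not None:   # %{header}
            return headers.get(match.group(1), "")
        code = match.group(2)
        if code == "t":                  # Unix time in milliseconds
            return str(ts_ms)
        if code == "s":                  # seconds since the epoch
            return str(ts_ms // 1000)
        return when.strftime(_TIME_ESCAPES[code])

    return _PATTERN.sub(replace, template)


if __name__ == "__main__":
    headers = {"host": "web01", "timestamp": "86400000"}
    print(expand_path("/flume/%{host}/%Y/%m/%d/%H", headers))
```

A template like the one in the usage example yields one directory per host and hour, which is how headers end up partitioning the data.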
30
Example of Configuration: HDFSEventSink (3 of 3)
• File Formats and Compression
  • Text Files
  • Sequence Files
  • Avro Files
• Can’t Use Columnar File Types
  • RC
  • Parquet
31
Example of Configuration: HBaseSink
• Table name
• Column Family
• Batch size
• HBase user
• kerberosPrincipal & kerberosKeytab
• enableWal
• Serializer
32
Example of Configuration: AsyncHBaseSink
• Table name
• Column Family
• Batch size
• HBase user
• enableWal
• Serializer
Thank you!