Page 1

Moving Data

CMSC 491 Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Page 2

Agenda

• Sqoop
• Flume
• NiFi

Page 3

APACHE SQOOP

Page 4

Sqoop - SQL to Hadoop

• Sqoop is a tool designed to transfer data between Hadoop and relational databases

• Top-level Apache project developed by Cloudera
• Use Sqoop to move data between an RDBMS and HDFS
• Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported
• Sqoop uses MapReduce to import and export the data
• Uses a JDBC or custom interface

Page 5

What Can You Do With Sqoop . . .

• The input to the import process is a database table
• Sqoop will read the table row-by-row into HDFS
• The import process is performed in parallel; the output of this process is a set of files containing a copy of the imported table
• These files may be delimited text files, or binary Avro or SequenceFiles
• Generates a Java class which can encapsulate one row of the imported data and can be reused in subsequent MapReduce processing of the data
  – Used to serialize / deserialize SequenceFile formats
  – Used to parse the delimited-text form of a record

Page 6

What Else Can You Do With Sqoop . . .

• You can export an HDFS file to an RDBMS
• Sqoop's export process reads a set of delimited text files from HDFS in parallel
  – Parses them into records
  – Inserts them as rows into an RDBMS table
• Incremental imports are supported
• Sqoop includes commands which allow you to inspect the RDBMS you are connected to

Page 7

Sqoop Is Flexible

• Most aspects of the import, code generation, and export processes can be customized
  – You can control the specific row range or columns imported
  – You can specify particular delimiters and escape characters for the file-based representation of the data
  – You can specify the file format used
• Sqoop provides connectors for MySQL, PostgreSQL, Netezza, Oracle, SQL Server, and DB2
• There is also a generic JDBC connector

Page 8

To Use Sqoop

• To use Sqoop, specify the tool you want to use and the arguments that control the tool
• Standard syntax:
  – sqoop tool-name [tool-arguments]
• Help is available:
  – sqoop help (tool-name), or
  – sqoop import --help

Page 9

Sqoop Tools

• Tools to import / export data:
  – sqoop import
  – sqoop import-all-tables
  – sqoop create-hive-table
  – sqoop export
• Tools to inspect a database (see the examples below):
  – sqoop list-databases
  – sqoop list-tables
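The inspection tools are handy before running an import; the commands below are standard Sqoop tools, with the hostname and credentials borrowed from the examples on page 11:

$ sqoop list-databases --connect jdbc:mysql://database.example.com \
    --username abc --password 123
$ sqoop list-tables --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123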

Page 10

Sqoop Arguments

• Common tool arguments:
  --connect      JDBC connect string
  --username     Username for authentication
  --password     Password for authentication

• Import control arguments:
  --append       Append data to an existing dataset in HDFS
  --as-textfile  Imports data as plain text (default)
  --table        Table to read
  --target-dir   HDFS target directory
  --where        WHERE clause used for filtering

• Sqoop also provides arguments and options for output line formatting, input parsing, Hive, code generation, HBase, and many others (a combined example follows)
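A sketch combining several of the import-control arguments above against the same HR database used on the next page (the hire_date column is hypothetical):

$ sqoop import --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 \
    --table employees --where "hire_date >= '2015-01-01'" \
    --target-dir /data/hr/employees --append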

Page 11

Sqoop Examples

• Import an employees table from the HR database
  – $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees
• Import an employees table from the HR database, but only employees whose salary exceeds $70,000
  – $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees \
      --where "salary > 70000"
• Export new employee data into the employees table in the HR database
  – $ sqoop export --connect jdbc:mysql://database.example.com/hr \
      --table employees --export-dir /new_employees
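Since the tool list on page 9 includes create-hive-table, it is worth noting that an import can also land directly in a Hive table; a sketch using Sqoop's standard --hive-import flag against the same placeholder database:

$ sqoop import --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees --hive-import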

Page 12

APACHE FLUME

Page 13

What Is Flume?

• Apache Flume is a distributed, reliable system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into HDFS
• Supports complex multi-hop flows where events may travel through multiple agents before reaching HDFS
  – Allows fan-in and fan-out flows
  – Supports contextual routing and backup routes (fail-over)
• Events are staged in a channel on each agent and are delivered to the next agent or terminal repository (like HDFS) in the flow
• Events are removed from a channel only after they are stored in the channel of the next agent or in HDFS

Page 14

Flume Components

• Event – the data being collected
• Flume Agent – source, channel, sink
  – Source – where the data comes from
  – Channel – repository for the data
  – Sink – next destination for the data

Page 15

How Does It Work?

• A Flume event is a unit of data flow
• A Flume agent is a (JVM) process that hosts the components (source, channel, sink) through which events flow from an external source to the next destination (hop)
• A Flume source receives events sent to it by an external source like a web server
  – Format specific
• When a Flume source receives an event, it stores it into one or more channels
  – A channel is a passive store
  – Can be a memory channel
  – Can be a durable channel backed by the local file system (see the config sketch below)
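The memory vs. durable distinction maps directly to channel types in an agent's configuration; a minimal sketch using Flume's built-in memory and file channel types (the directory paths are illustrative):

# memory channel: fast, but events are lost if the agent dies
a1.channels.c1.type = memory

# file channel: durable, events survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data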

Page 16

How Does It Work?

• The sink removes the event from the channel and puts it into an external repository (HDFS) or forwards it to the source of the next Flume agent (next hop) in the flow

• The source and sink within the given agent run asynchronously with the events staged in the channel.

Page 17

Single Agent

Page 18

Multi-Agent

Page 19

Multiplexing Channels

Page 20

Consolidation

Page 21

Configuration

• To define the flow within a single agent:
  – list the sources / channels / sinks
  – point the source and sink to a channel
• Basic syntax:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

Page 22

Flume Example

• This agent lets a user generate events and displays them on the console
• Defines a single agent named a1
  – a1 has a source that listens for data on port 44444
  – a1 has a channel that buffers event data in memory
  – a1 has a sink that logs event data to the console

# example.conf: a single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

Page 23

Flume Example . . .

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
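To try the example with a standard Flume installation, start the agent with the flume-ng launcher and send it some text through the netcat source:

$ bin/flume-ng agent --conf conf --conf-file example.conf \
    --name a1 -Dflume.root.logger=INFO,console
$ nc localhost 44444
hello flume

Each line typed into nc should appear as a logged event in the agent's console output.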

Page 24

Built-In Flume Sources

• Avro
• Thrift
• Exec
• JMS
• Spooling Directory
• NetCat
• Sequence Generator
• Syslog
• HTTP

Page 25

Built-In Flume Sinks

• HDFS
• Logger
• Avro
• Thrift
• IRC
• File Roll
• Null
• HBase
• Solr
• ElasticSearch

Page 26

Built-In Flume Channels

• Memory
• JDBC
• File
• Pseudo Transaction

Page 27

Flume Interceptors

• Attach functions to sources for some type of transformation
  – Convert an event to a new format
  – Add a timestamp (see the example below)
  – Change your car's oil
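For instance, Flume's built-in timestamp interceptor can be attached to the netcat source from the earlier example with two lines of configuration:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

This adds a timestamp header to each event, which a sink such as HDFS can use for time-based bucketing of output paths.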

Page 29

APACHE NIFI

Page 30

Moving Data: So, what do you use?

• Bash, Python, Perl, PHP
  – Do you have a project folder called "Loaders"?
• Database replication
• Apache Falcon
• Sqoop (relational database systems)
• Flume (web server log ingestion)

Page 31

NiFi: History

Photo Cred: wikipedia.org

Page 32

NiFi: Why?

Photo Cred: www.niagarafallslive.com

Page 33

What does this NiFi look like?

Page 34

Features!

• This thing moves files!
• Visual representation of data flow and ETL processing
• Guaranteed delivery of data
• Manages data delivery, flow, age-off, etc.
• Prioritization on multiple levels
• Extensible
• Tracking and logs

Page 35

Overall Architecture

• Java, Java, and more Java
• A single JVM hosts the core components:
  – Web Server
  – Flow Controller
  – FlowFile Repository
  – Content Repository
  – Provenance Repository
• There is also a notion of a NiFi cluster

Page 36

How do I install?

• Requirements
  – Java 7+
• Download
  – http://nifi.incubator.apache.org/downloads/
• Create user (optional)
  – useradd nifi
• Move to the destination directory and extract the tar
  – I like /opt/nifi
• Edit configs (optional)
• Start (see the sketch below)
  – Linux: $(NifiHome)/bin/nifi.sh start
  – Windows: bin\run-nifi.bat
• Logs
  – logs/
• You can install it as a service as well
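A minimal end-to-end sketch of the steps above on Linux, assuming the 0.0.1-incubating tarball referenced later in this deck (file and directory names are illustrative; adjust for your version):

$ useradd nifi
$ mkdir -p /opt/nifi
$ tar xzf nifi-0.0.1-incubating-bin.tar.gz -C /opt/nifi
$ /opt/nifi/nifi-0.0.1-incubating/bin/nifi.sh start
$ tail -f /opt/nifi/nifi-0.0.1-incubating/logs/nifi-app.log

Once started, the web UI listens on the port set in conf/nifi.properties.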

Page 37

And Demo Time

Page 38

NiFi User Interface: Terminology and Overview

• FlowFile
• Components
  – Processors (this is where we will spend a bunch of our time today)
  – Process Groups
  – Remote Process Groups
  – Input Ports
  – Output Ports
  – Funnels
  – Templates
  – Labels
• Relationships
• Actions
• Management
• Navigation
• Status

Page 39

NiFi User Interface: Summary Page

• Extremely useful when diagnosing a problem or when a flow contains a large number of processors

Page 40

Data Provenance

• Ability to drill down into FlowFile details
• Useful for searches, troubleshooting, and optimization
• Ability to search for specific events
• Ability to replay a FlowFile
• Graph of FlowFile lineage

Page 41

NiFi Extensions

• Developers who have a basic knowledge of Java can extend components of NiFi
• Able to extend:
  – Processors
  – Reporting Tasks
  – Controller Services
  – FlowFile Prioritizers
  – Authority Providers

Page 42

OMG, what is a NAR?

• NiFi Archive
• Allows for dependency separation from other components
• Defeats the dreaded "NoClassDefFoundError" (and hours of trying to figure out which library is causing the problem) via ClassLoader isolation
• Use the nifi-nar-maven-plugin
• Great instructions at the end of the NiFi Developer's Guide

Page 43

Extending Processors

• The best place to look is at already-written processors in the source code

• Example GetFile.java: nifi-0.0.1-incubating-source-release.zip\nifi-0.0.1-incubating\nifi-nar-bundles\nifi-standard-bundle\nifi-standard-processors\src\main\java\org\apache\nifi\processors\standard\GetFile.java

Page 44

Resources

• http://sqoop.apache.org
• http://flume.apache.org
• http://nifi.incubator.apache.org