Everything you (freaking) need to know about Hadoop Now Andrew C. Oliver @acoliver #ATO2014 {All Things Open | Raleigh} {Open Software Integrators} { www.osintegrators.com} {@osintegrators}
Page 1: Everything You Need to Know About Hadoop Right Now

Everything you (freaking) need to know about Hadoop Now

Andrew C. Oliver
@acoliver
#ATO2014

{All Things Open | Raleigh}

{Open Software Integrators} { www.osintegrators.com} {@osintegrators}

Page 2: Everything You Need to Know About Hadoop Right Now

Andrew C. Oliver

● Programming since I was about 8
● Java since ~1997
● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
  ○ Former member Jakarta PMC
  ○ Emeritus member of Apache Software Foundation
● Joined JBoss ~2002
● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
  ○ I make fanboys cry.


Page 3: Everything You Need to Know About Hadoop Right Now

Open Software Integrators

● Founded Nov 2007 by Andrew C. Oliver (me)
  ○ in Durham, NC
● Pivoted from Java/Linux consulting to full on Hadoop/NoSQL this year
● We’re Hiring
  ○ mid to senior level (Java/Linux and Database background)
  ○ devopsy type people (Puppet, Chef, Salt, etc, Linux background, database understanding, Ruby/Python/etc)
  ○ up to 50% travel, salary + bonus, 401k, health, etc etc
  ○ preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, JQuery
  ○ nice to have: Hadoop, Neo4j, MongoDB, Cassandra, Ruby, at least one Cloud platform


Page 4: Everything You Need to Know About Hadoop Right Now

Overview

● What is Hadoop anyhow?
● What is Hadoop good for?
● What isn’t it good for?
● How do you get data into Hadoop?
● How do you get data out of Hadoop?
● How do you process data in Hadoop?
● How do you analyze data in Hadoop?
● How do you secure Hadoop?


Page 5: Everything You Need to Know About Hadoop Right Now

But first...

● This is an overview talk intended as a roadmap to point you at the most important bits to learn on the way…
● It is not comprehensive training…
● It is not an in-depth look at any part of Hadoop
● It is a rather high level selective overview of the Hadoop ecosystem


Page 6: Everything You Need to Know About Hadoop Right Now

What is Hadoop Anyhow?


Page 7: Everything You Need to Know About Hadoop Right Now

Hadoop is

● A platform for distributed computing
● 2011
  ○ HDFS
  ○ Hive
● 2012
  ○ HDFS
  ○ YARN
  ○ Hive
  ○ HBase
● 2014
  ○ HDFS
  ○ Hive
  ○ YARN
  ○ HBase
  ○ Spark
  ○ Storm
  ○ Kafka
  ○ Mahout
  ○ Sqoop
  ○ Oozie
  ○ ...


Page 8: Everything You Need to Know About Hadoop Right Now

Hadoop is

● HDFS
  ○ Distributed filesystem similar to Gluster, Ceph, etc.
  ○ You can use other distributed filesystems in place of HDFS
  ○ Blocks are distributed and replicated to other nodes (3 copies with the default replication factor)
  ○ 128 MB default block size
  ○ RESTful API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?)
● DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
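
A rough sketch of the arithmetic behind the block-size and replication bullets. It assumes the 128 MB block size from the slide and a replication factor of 3 (the stock `dfs.replication` default; your cluster may be configured differently). The function names are made up for illustration and are not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size from the slide
REPLICATION = 3                 # assumption: default dfs.replication


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks needed to hold file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def raw_bytes(file_size: int, replication: int = REPLICATION) -> int:
    """Disk consumed across the cluster once every block is replicated."""
    return file_size * replication


one_gib = 1024 ** 3
print(block_count(one_gib))           # 8
print(block_count(one_gib + 1))       # 9
print(raw_bytes(one_gib) // one_gib)  # 3
```

The point of the exercise: a 1 GiB file is only 8 blocks to track, but it costs about 3 GiB of raw disk, so size your data nodes accordingly.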


Page 9: Everything You Need to Know About Hadoop Right Now

Hadoop is

● YARN
  ○ Yet Another Resource Negotiator
  ○ schedules “work” among nodes, distributes the “processing”
● MapReduce is
  ○ an API
  ○ an algorithm: data is mapped to nodes, and the answers are “reduced” to a single answer
● Hive is
  ○ HDFS/Hadoop based data warehousing
  ○ SQL, JDBC, ODBC
  ○ Tables map to files on HDFS
  ○ No updates, deletes, transactions (but coming in “Stinger.next”)
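
To make the "mapped, then reduced" idea concrete, here is a single-process word count in plain Python. It is only an illustration of the algorithm's shape (map, group by key, reduce); it does not use the Hadoop MapReduce Java API, and all function names are invented for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single answer."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["hadoop is big", "big data is big"]))
```

On a real cluster the map calls run on the nodes that hold the data and the shuffle moves pairs across the network; here everything happens in one process.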


Page 10: Everything You Need to Know About Hadoop Right Now

Hadoop is

● HBase
  ○ a column family database
  ○ ACID (at the level of a single row)
  ○ relatively low-latency
● And a whole lot more


Page 11: Everything You Need to Know About Hadoop Right Now

Hadoop is

● An ecosystem of tools for distributed processing and storage of data.


Page 12: Everything You Need to Know About Hadoop Right Now

What is Hadoop Good For?


Page 13: Everything You Need to Know About Hadoop Right Now

What is Hadoop good for?

● Working with large amounts of data in batch
  ○ ETL processing / Data Transformation
  ○ Analytics / BI
  ○ Integration (Data Lake, Enterprise Data Hub)
● Working with streams of data
  ○ Events
    ■ Log data
● Time series or similar data (HBase)


Page 14: Everything You Need to Know About Hadoop Right Now

What is Hadoop bad at?

● Quick jobs, i.e. Hive/MapReduce setup time is measured in seconds to minutes.
● Lots of small files (HDFS is built around 128 MB blocks; every tiny file still costs its own NameNode metadata and, typically, its own map task)
● General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.
● High Availability
  ○ WHA???
  ○ Knox, Oozie, etc all have shaky support, if any, for HA NameNodes.


Page 15: Everything You Need to Know About Hadoop Right Now

How do you get data into/out of Hadoop?


Page 16: Everything You Need to Know About Hadoop Right Now

How do you get data into Hadoop?

● Sqoop it from an RDBMS
● Use JDBC or ODBC and push into Hive from an external DB
● Push data into Hive with the RESTful API
● Put an extract file onto HDFS with the REST API
  ○ process it into Hive directly with a LOAD DATA statement
  ○ transform/process it into Hive using Pig
  ○ use Java
● Message it in there with Kafka, RabbitMQ or a similar MQ and a custom “spout” for Storm
● Use any of a multitude of APIs that write data into HDFS, HBase, Hive, etc.
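
For the "put an extract file onto HDFS with the REST API" route, a minimal sketch of what the request URL looks like, assuming the REST API in question is WebHDFS. The host name is hypothetical, and port 50070 is an assumption (the usual NameNode HTTP port on Hadoop 2.x); check your own cluster. The helper only builds the URL, it does not talk to a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS URL of the form /webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode host and target path, for illustration only.
url = webhdfs_url("namenode.example.com", "/data/extract.csv",
                  "CREATE", overwrite="true")
print(url)
# http://namenode.example.com:50070/webhdfs/v1/data/extract.csv?op=CREATE&overwrite=true
```

With WebHDFS, creating a file is a two-step affair: an HTTP PUT to a URL like this one gets answered with a redirect to a data node, and the file contents go in a second PUT to that redirected location.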


Page 17: Everything You Need to Know About Hadoop Right Now

How do you get data out of Hadoop?

● Should you be getting it out or should you process it there?
● JDBC/ODBC to Hive
● HBase can be mounted into Hive
● REST APIs for Hive/HDFS
● APIs for Kafka, Spark, Storm, etc (subscribe)
● DistCp to another HDFS
● Mount it with FUSE and use your favorite Linux tool
● hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile
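
If you would rather script the `hadoop fs -cat ... | grep ...` one-liner than type it, the following is one possible Python equivalent. It assumes the `hadoop` CLI is on the PATH; the function names are invented here, and the filter is a plain substring match rather than grep's regular expressions.

```python
import subprocess


def hdfs_cat_command(path):
    """Argument list for streaming one HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", path]


def grep(lines, needle):
    """Keep only the lines containing needle (substring match, not regex)."""
    return (line for line in lines if needle in line)


def export_matches(hdfs_path, needle, local_path):
    """Stream a file out of HDFS, filter it, and write the result locally.

    Requires a working hadoop client; nothing here runs at import time.
    """
    cmd = hdfs_cat_command(hdfs_path)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(local_path, "w") as out:
        out.writelines(grep(proc.stdout, needle))


# The filtering step on its own, with in-memory lines:
print(list(grep(["some stuff\n", "nothing here\n"], "stuff")))
```

Calling `export_matches("/path/to/file/on/hdfs", "stuff", "mynewlocalfile")` on a machine with a configured Hadoop client would mirror the shell pipeline from the slide.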


Page 18: Everything You Need to Know About Hadoop Right Now

How do you process data in Hadoop?


Page 19: Everything You Need to Know About Hadoop Right Now

How do you process data in Hadoop?

● MapReduce Java API
● Hive supports SQL (a subset for now, soon to be more than that)
● Pig can munge files on HDFS and can work with Hive
● Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
● There are numerous toolkits
  ○ Mahout - common machine learning algorithms (many not very parallelizable/etc)
  ○ MLlib - machine learning built on Spark
  ○ GraphX
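
"Micro-batches" may be the least self-explanatory term on this slide, so here is a toy illustration in plain Python: events are grouped into fixed time windows and each window is then handled as one small batch. This is a sketch of the idea only. It does not use the Spark or Storm APIs, and the event format (timestamp in seconds plus a payload) is an assumption.

```python
from itertools import groupby


def micro_batches(events, interval):
    """Group (timestamp, payload) events into windows of `interval` seconds.

    Assumes events arrive sorted by timestamp. Yields (window_start, payloads).
    """
    def window_of(event):
        return int(event[0] // interval)

    for window, group in groupby(events, key=window_of):
        yield window * interval, [payload for _, payload in group]


events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (4.0, "d")]
for start, batch in micro_batches(events, 1):
    print(start, batch)
# 0 ['a']
# 1 ['b', 'c']
# 4 ['d']
```

Processing one event at a time is the other end of the spectrum; batching by window trades a little latency for being able to treat each window like a small batch job.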


Page 20: Everything You Need to Know About Hadoop Right Now

How do you analyze data in Hadoop?

● Most major BI tools now support Hadoop
  ○ Tableau
  ○ Pentaho
  ○ Datameer
  ○ Your favorite probably here
● All that stuff is for l4m3rs, use the command line interface :-)
  ○ hive -e 'select * from sometable'
  ○ pig hdfs://some/dir/myscript.pig
● Use RStudio and write some R to predict what sales will be next month (you will be sort of wrong probably)
● Use your favorite SQL tool that supports JDBC/ODBC
● Use Hue


Page 21: Everything You Need to Know About Hadoop Right Now

How do you secure Hadoop?


Page 22: Everything You Need to Know About Hadoop Right Now

How do you secure Hadoop?

● HDFS supports POSIX-style (that means Linux-style) filesystem permissions
● The most complete authentication throughout Hadoop is based on Kerberos (yeah I know).
● You can do it with just straight LDAP too, but it isn’t integrated.
● Knox supplies “perimeter-based security” for (only):
  ○ Hive
  ○ HDFS
  ○ Oozie
  ○ HBase
  ○ HCatalog
● Supposedly Argus will save us from all of this!
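
For anyone rusty on what "Linux-style" filesystem security means, here is a small Python sketch of the owner/group/other permission check. It models the general POSIX scheme only; superuser bypass and ACLs are left out, and the users, group, and mode below are made up.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Check one permission ('r', 'w' or 'x') against an octal mode.

    The owner bits apply to the owner, the group bits to members of the
    file's group, and the "other" bits to everyone else.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6
    elif group in user_groups:
        shift = 3
    else:
        shift = 0
    return bool((mode >> shift) & bit)


# Hypothetical file: rw-r----- owned by alice, group analysts
mode, owner, group = 0o640, "alice", "analysts"
print(can_access(mode, owner, group, "alice", {"analysts"}, "w"))  # True
print(can_access(mode, owner, group, "bob", {"analysts"}, "r"))    # True
print(can_access(mode, owner, group, "bob", {"analysts"}, "w"))    # False
print(can_access(mode, owner, group, "carol", {"sales"}, "r"))     # False
```

Note that these checks are only as trustworthy as the identity behind them, which is where the authentication story (Kerberos and friends) comes in.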


Page 23: Everything You Need to Know About Hadoop Right Now

Other Considerations


Page 24: Everything You Need to Know About Hadoop Right Now

Cacophony

● Disaster Recovery
  ○ Falcon (alpha quality)
● Workflow
  ○ Flume
● Schedule/trigger/orchestrate those ETL jobs
  ○ Oozie
● Install, configure, monitor Hadoop
  ○ Ambari
● Use tables in both Pig and Hive
  ○ HCatalog


Page 25: Everything You Need to Know About Hadoop Right Now

Ambari


Page 26: Everything You Need to Know About Hadoop Right Now

Hue


Page 27: Everything You Need to Know About Hadoop Right Now

Hue editing Oozie


Page 28: Everything You Need to Know About Hadoop Right Now

Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

-- load the extract file; fields are separated by the \u001a character
rows = load '$FILEPATH' using org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a') as (a0:chararray, a1:chararray, a2:chararray, a3:chararray, a4:chararray, a5:chararray, a6:chararray, a7:chararray, a8:chararray, a9:chararray);

-- trim every field and blank out literal NULL markers
row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as store;

-- write the cleaned rows into the Hive table through HCatalog, into the load-date partition
store row into 'stage.orders' using org.apache.hcatalog.pig.HCatStorer('loaddate=$LOADDATE');
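
If Pig Latin is new to you, the per-row cleaning in that script can be read as the following plain Python. This is a single-machine illustration of the transformation only (trim each field, blank out literal `NULL`), not a replacement for the Pig job; the sample record is made up.

```python
DELIMITER = "\x1a"  # the \u001a field separator used in the Pig load

FIELDS = ["orderid", "customerid", "customername", "address", "city",
          "state", "zip", "status", "store"]


def clean_row(raw_fields):
    """Trim each field, then remove the literal text NULL, as the script does.

    Only the first nine columns are kept, matching $0..$8 in the script.
    """
    return {name: value.strip().replace("NULL", "")
            for name, value in zip(FIELDS, raw_fields)}


# Hypothetical input record with ten columns, like a0..a9 in the load
line = DELIMITER.join([" 1001 ", "C7", "NULL", "12 Main St", "Durham",
                       "NC", "27701", "open", "S1", "extra"])
print(clean_row(line.split(DELIMITER)))
```

The Pig version does the same thing per record, except the work is spread across the cluster and the result lands in a Hive table via HCatalog.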


Page 29: Everything You Need to Know About Hadoop Right Now

Thank you for attending!
