Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop with Map Reduce, Pig, and Hive

Jul 28, 2015

DataKitchen
Page 1: Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop with Map Reduce, Pig, and Hive

BIG DATA INFRASTRUCTURE – INTRODUCTION TO HADOOP WITH MAP REDUCE, PIG, AND HIVE

Gil Benghiat, Eric Estabrooks, Chris Bergh

OPEN DATA SCIENCE CONFERENCE

BOSTON 2015

@opendatasci

Page 2

Agenda

Presentation
• Introductions
• Hadoop Overview & Comparisons
• What do I use when?

Doing
• AWS EMR
• Hive
• Pig
• Impala

Page 3

Introductions

Page 4

Meet DataKitchen

Chris Bergh (Head Chef)

Gil Benghiat (VP Product)

Eric Estabrooks (VP Cloud and Data Services)

Software development and executive experience delivering enterprise software focused on Marketing and Health Care sectors.

Deep Analytic Experience: Spent past decade solving analytic challenges

New Approach To Data Preparation and Production: focused on the Data Analysts and Data Scientists

Page 5

Analysts And Their Teams Are Spending

60-80% Of Their Time On Data Preparation And Production

Page 6

This creates an expectation gap

[Figure: how time is split among Prepare Data, Analyze, and Communicate, comparing Business Customer Expectation with Analyst Reality]

The business does not think that Analysts are preparing data

Analysts don’t want to prepare data

Page 7

DataKitchen is on a mission to integrate and organize data to make analysts and data scientists super-powered.

Page 8

Meet the Audience: A few questions

• Who considers themselves

• Data scientist

• Data analyst

• Programmer / Scripter

• On the Business side

• Who knows SQL – can write a select statement?

• Who has used AWS before today?

Page 9

Hadoop Overview

Page 10

What Is Apache Hadoop?

• Software framework

• Distributed processing of large scale datasets

• Cluster of commodity hardware

• Promise of lower cost

• Has many frameworks, modules and projects

http://hadoop.apache.org/

Page 11

Hadoop ecosystem frameworks

[Figure: Hadoop ecosystem frameworks, with storage options (HDFS, Cassandra, HBase, S3); asterisks mark items covered in the talk or in the hands-on session]

Source: Mark Grover, http://radar.oreilly.com/2015/02/processing-frameworks-for-hadoop.html

Page 12

Hadoop has been evolving

[Figure: Google Trends for “Big Data”, 2005 to 2015, with labels for Map Reduce, Hadoop, Pig, and Impala]

Page 13

What is Hadoop good for?

• Problems that are huge, and can be run in parallel over immutable data

• NOT OLTP (e.g. backend to e-commerce site)

• Providing frameworks to build software

• Map Reduce

• Spark

• Tez

• A backend for visualization tools

Page 14

Map Reduce

http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf

Page 15

Page 16

Test your system in the small

1. Make a small data set

2. Test like this:

$ cat data.txt | map | sort | reduce
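The same small-scale test can be rehearsed in a single Python process before anything runs on a cluster. The sketch below is only an illustration: word count is chosen as the example job, and the function names `map_words`, `reduce_counts`, and `run_local` are made up for this note, not part of the workshop scripts.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer: sum the values of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Mimic `cat data.txt | map | sort | reduce` in one process."""
    mapped = map_words(lines)
    shuffled = sorted(mapped)  # stands in for the `sort` stage
    return list(reduce_counts(shuffled))


if __name__ == "__main__":
    # Tiny made-up data set, in the spirit of "make a small data set".
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, total in run_local(sample):
        print(f"{word}\t{total}")
```

The reducer relies on its input being sorted by key, which is exactly what the `sort` stage in the shell pipeline provides.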

Page 17

You can write map reduce jobs in your favorite language

Streaming Interface

• Lets you specify mappers and reducers

• Supports

• Java

• Python

• Ruby

• Unix Shell

• R

• Any executable

Map Reduce “generators”

• Results in map reduce jobs

• PIG

• Hive
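A streaming mapper or reducer is just an executable that reads lines on standard input and writes tab-separated key/value lines on standard output. The sketch below shows that contract in Python under some assumptions: the job (counting lines by their first token), the script name `count_by_key.py`, and the `map`/`reduce` argument convention are all invented for illustration.

```python
import sys
from itertools import groupby


def mapper(stdin, stdout):
    """Emit "<first token>\t1" for each non-empty input line."""
    for line in stdin:
        tokens = line.split()
        if tokens:
            stdout.write(f"{tokens[0]}\t1\n")


def reducer(stdin, stdout):
    """Sum counts per key; the input must already be sorted by key."""
    rows = (line.rstrip("\n").split("\t", 1) for line in stdin if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


def main(argv, stdin, stdout):
    """Dispatch on a hypothetical `map` / `reduce` command-line argument."""
    roles = {"map": mapper, "reduce": reducer}
    if len(argv) != 2 or argv[1] not in roles:
        print("usage: count_by_key.py map|reduce", file=sys.stderr)
        return
    roles[argv[1]](stdin, stdout)


if __name__ == "__main__":
    main(sys.argv, sys.stdin, sys.stdout)
```

Locally this can be exercised the same way as any other pair of executables, for example `cat data.txt | python count_by_key.py map | sort | python count_by_key.py reduce`.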

Page 18

Applications that lend themselves to map reduce

• Word Count

• PDF Generation (NY Times 11,000,000 articles)

• Analysis of stock market historical data (ROI and standard deviation)

• Geographical Data (Finding intersections, rendering map files)

• Log file querying and analysis

• Statistical machine translation

• Analyzing Tweets

Page 19

Pig

• Pig Latin - the scripting language

• Grunt – Shell for executing Pig Commands

http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

Page 20

This is what it would be in Java

http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

Page 21

Hive

You write SQL! Well, almost, it is HiveQL

SELECT * FROM user WHERE active = 1;

[Figure labels: JDBC, SQL Workbench, HUE, AWS S3]

Page 22

Impala

• Uses SQL very similar to HiveQL

• Runs 10-100x faster than Hive Map Reduce

• Runs in memory so it may not scale up as well

• Some batch jobs may run faster on Impala than Hive

• Great for developing your code on a small data set

• Can use interactively with Tableau and other BI tools

Page 23

Spark

• Had a version of SQL called Shark

• Shark has been replaced by Spark SQL

• Hive on Spark is under development

• Spark SQL is faster than Shark

• Runs 100x faster than Hive Map Reduce

• Can use interactively with Tableau and other BI tools

Page 24

Performance Comparisons

Page 25

Performance comparison (3. Join Query Feb 2014)

Source: https://amplab.cs.berkeley.edu/benchmark/

[Figure: query times (in seconds); slide callout: “What’s this?”]

Page 26

Performance comparison (TPC-DS April 2015)

Source:

Page 27

Performance comparison (Single User Sep 2014)

Source:

Page 28

Amazon EMR

Page 29

Today, we will use EMR to run Hadoop

• EMR = Elastic Map Reduce

• Amazon does almost all of the work to create a cluster

• Offers a subset of modules and projects

OR

Page 30

m3.xlarge

Page 31

What to use when

Page 32

What Type of Database to Use?

• Capturing Transactions? Use RDBMS
• Capturing Logs? Use File System
• Back End To Website?
  • NoSQL Database (MongoDB)
  • Cache (Redis)
• Doing Analytics?
  • Small Data? Desktop Tools (Excel, Tableau)
  • Building Models? R, Python, SAS Miner
  • Big-ish Data? Columnar Database (Redshift)
  • ‘Big Data’ Database (like Hadoop)

Page 33

Which Tool Should I Use?

Project Goal
• Want Experience In Coolest Tech? Spark is Hot Tech now
• Just Want To Get Job Done? Choose Hadoop Distributions
  • Mainly Structured Data?
    • Want Fast Response? SQL / Impala, SQL / Redshift
  • Mainly Unstructured Data?
    • Developer? Write Map-Reduce Job
    • Not Developer? SQL/HIVE

Page 34

How Should I Use It?

Use Case
• Development
  • Use Cloud
  • Use Virtual Machine
• Production
  • Fixed Workload
    • Do ROI on buying up front
    • Use Cloud
  • Variable Workload
    • Use Cloud

Page 35

Hands on

Page 36

Form groups of 3

Page 37

Let’s Do This!

What do we need?

• AWS Account

• Key (.pem file)

• The data file in the S3 bucket

What will we do?

• Start Cluster

• MR Hive

• MR Pig

• Impala

• Sum county level census data by state.

Prerequisites and scripts are located at http://www.datakitchen.io/blog
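To have something to check the cluster results against, the "sum county level census data by state" step can be sketched in plain Python on a tiny sample. The column names (`state`, `county`, `population`) and the numbers below are made up for illustration; the workshop's actual file layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with invented values, not real census figures.
SAMPLE_CSV = """state,county,population
MA,County A,100
MA,County B,250
RI,County C,75
"""


def sum_by_state(csv_file, key="state", value="population"):
    """Add up the `value` column for every distinct `key`."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row[key]] += int(row[value])
    return dict(totals)


if __name__ == "__main__":
    totals = sum_by_state(io.StringIO(SAMPLE_CSV))
    for state, total in sorted(totals.items()):
        print(f"{state},{total}")
```

Running the same small sample through Hive, Pig, or Impala should give matching per-state totals.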

Page 38

AWS Console

• Just google “aws console”

• Log in

Page 39

Click Here

Where’s EMR?

Page 40

Create Cluster

OR

Page 41

Cluster Options

(mod = modify the settings; defaults = keep the defaults)

• Cluster Configuration: mod
• Tags: defaults
• Software Configuration: mod
• File System Configuration: defaults
• Hardware Configuration: mod
• Security and Access: mod
• IAM Roles: defaults
• Bootstrap Actions: defaults
• Steps: defaults

Page 42

Cluster Configuration

mod

Page 43

Tags

defaults

Page 44

Software Configuration

Pick Impala here! Hopefully we’ll have time to get to this.

mod

Don’t forget to click Add!

Page 45

File System Configuration

defaults

Page 46

Hardware Configuration

$ 0.35 / hour

Set Core and Task to 0

mod
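The hourly price on this slide makes it easy to estimate what a training session costs. The sketch below is rough arithmetic under stated assumptions: it treats $0.35 as the rate per node, and it rounds partial hours up, which may not match how AWS actually bills. The function name `estimate_cluster_cost` is invented for this note.

```python
import math
from decimal import Decimal


def estimate_cluster_cost(hourly_rate, hours, nodes=1, round_up_hours=True):
    """Estimate cost as rate x billed hours x node count.

    Assumption: partial hours are billed as full hours when
    `round_up_hours` is True; check current AWS pricing rules.
    """
    billed_hours = math.ceil(hours) if round_up_hours else hours
    return Decimal(str(hourly_rate)) * Decimal(str(billed_hours)) * nodes


if __name__ == "__main__":
    # One master node (Core and Task set to 0) running for 2.5 hours.
    print(f"${estimate_cluster_cost('0.35', 2.5)}")
```

Forgetting to terminate the cluster simply keeps `hours` growing, which is the reason for the reminder at the end of the session.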

Page 47

Security and Access

Finally we get to use our keys!

mod

Page 48

IAM Roles

Just defaults, please

More JSON in here

defaults

Page 49

Bootstrap Actions

defaults

• Tweak configuration

• Install custom application (Apache Drill, Mahout, etc.)

• Shell scripts

Can use this to set up Spark

Page 50

Steps

defaults

Page 51

Steps

Page 52

Steps: Hive Program

Page 53

Provisioning

Page 54

Bootstrapping

Page 55

Monitor Startup Progress

Page 56

Instructions to Connect

Here’s your hostname

SSH Info

We’ll follow these instructions

Page 57

Post ODSC Update: An easier way to access Hue (foxyproxy slowed us down)

For Windows, Unix, and Mac, use ssh to establish a tunnel

$ ssh -i datakitchen-training.pem -L 8888:localhost:8888 [email protected]

From the browser, go to

http://localhost:8888

You may need to fix the permissions on the .pem file:

$ chmod 400 datakitchen-training.pem

With the cygwin version of ssh, you may have to fix the group of the .pem file before the chmod command.

$ chgrp Users datakitchen-training.pem
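If the tunnel has to be opened from a script, the ssh command above can be assembled as an argument list. This is only a sketch: `build_tunnel_command` is a made-up helper, and `USER@MASTER_PUBLIC_DNS` is a placeholder because the transcript does not preserve the real login and host name.

```python
import shlex


def build_tunnel_command(key_path, destination, local_port=8888, remote_port=8888):
    """Return the argv list for `ssh -i KEY -L LOCAL:localhost:REMOTE DEST`."""
    for port in (local_port, remote_port):
        if not 0 < int(port) < 65536:
            raise ValueError(f"invalid port: {port}")
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-i", key_path, "-L", forward, destination]


if __name__ == "__main__":
    # "USER@MASTER_PUBLIC_DNS" is a placeholder, not a real address.
    command = build_tunnel_command("datakitchen-training.pem",
                                   "USER@MASTER_PUBLIC_DNS")
    print(shlex.join(command))  # paste the printed line into a terminal
```

Keeping the command as a list means it can also be handed to `subprocess.run(command)` without any shell quoting issues.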

Page 58

Post ODSC Update: On Windows, you can use PuTTY to establish a tunnel

1. Download PuTTY.exe to your computer from:

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

2. Start PuTTY.

3. In the Category list, click Session

4. In the Host Name field, type [email protected]

5. In the Category list, expand Connection > SSH > Auth

6. For Private key file for authentication, click Browse and select the private key file (datakitchen-training.ppk) used to launch the cluster.

7. In the Category list, expand Connection > SSH, and then click Tunnels.

8. In the Source port field, type 8888.

9. In the Destination field, type localhost:8888

10. Verify the Local and Auto options are selected.

11. Click Add.

12. Click Open.

13. Click Yes to dismiss the security alert.

Now this will work

http://localhost:8888

Page 59

Setup Web Connection – Linux/Mac

Page 60

Port Forwarding (Mac/Linux)

ssh -i ~/.ec2/emr-training.pem -L 8888:localhost:8888 [email protected]

Page 61

Setup Web Connection – Windows

Page 62

Setup Web Connection - Chrome (Windows and Mac are Identical)

Page 63

Setup Web Connection - Firefox (Windows and Mac are Identical)

Page 64

Start Hue, in browser type

http://master public DNS:8888

http://ec2-52-5-91-114.compute-1.amazonaws.com:8888

Note: no hadoop@

Page 65

Sign in

First time Other times

Page 66

Page 67

HIVE: Load Data from S3

Familiar SQL

Describe file format

Pull from S3 bucket

UPDATE with your bucket name

Page 68

HIVE: Run the summary interactively

Page 69

HIVE: Export Our Data

Define CSV output

Write out data

You can look at the data in s3

UPDATE with your bucket name
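The Hive slides showed their script as screenshots, so the statements themselves are not in this transcript. As a stand-in, here is a sketch of what the load, summarize, and export flow could look like, wrapped in a small Python helper that fills in the bucket name. Everything specific is an assumption: the table names, the column layout (`state`, `county`, `population`), and the `input/` and `output/hive/` prefixes are hypothetical and are not the workshop's actual script.

```python
HIVE_TEMPLATE = """\
-- Describe the file format and pull the data from the S3 bucket.
CREATE EXTERNAL TABLE IF NOT EXISTS census (
  state STRING,
  county STRING,
  population BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://{bucket}/input/';

-- Define CSV output as a second table backed by S3.
CREATE EXTERNAL TABLE IF NOT EXISTS census_by_state (
  state STRING,
  population BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://{bucket}/output/hive/';

-- Write out the state-level sums.
INSERT OVERWRITE TABLE census_by_state
SELECT state, SUM(population)
FROM census
GROUP BY state;
"""


def render_hive_script(bucket):
    """Fill the bucket name into the (hypothetical) HiveQL template."""
    if not bucket or "/" in bucket:
        raise ValueError("expected a bare bucket name, e.g. 'my-bucket'")
    return HIVE_TEMPLATE.format(bucket=bucket)


if __name__ == "__main__":
    print(render_hive_script("your-bucket-name"))
```

The printed script can be pasted into the Hue query editor or saved to a file and run with the `hive` command line.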

Page 70

PIG: Load Data from S3

Readable syntax

Describe file format

Pull from S3 bucket

UPDATE with your bucket name

Page 71

PIG: Transform the data

Page 72

PIG: Export Our Data

UPDATE with your bucket name
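The Pig slides were also screenshots, so the following is a stand-in rather than the workshop's script. It sketches the three Pig steps (load, transform, export) as a Pig Latin template rendered from Python. The relation names, the assumed columns (`state`, `county`, `population`), and the S3 prefixes are hypothetical.

```python
PIG_TEMPLATE = """\
-- Load: describe the file format and pull from the S3 bucket.
census = LOAD 's3://{bucket}/input/' USING PigStorage(',')
         AS (state:chararray, county:chararray, population:long);

-- Transform: sum county populations within each state.
by_state = GROUP census BY state;
totals = FOREACH by_state GENERATE group AS state,
                                   SUM(census.population) AS population;

-- Export: write comma-separated results back to S3.
STORE totals INTO 's3://{bucket}/output/pig/' USING PigStorage(',');
"""


def render_pig_script(bucket):
    """Fill the bucket name into the (hypothetical) Pig Latin template."""
    if not bucket or "/" in bucket:
        raise ValueError("expected a bare bucket name, e.g. 'my-bucket'")
    return PIG_TEMPLATE.format(bucket=bucket)


if __name__ == "__main__":
    print(render_pig_script("your-bucket-name"))
```

Each statement can be typed one at a time in the Grunt shell, which makes it easy to inspect intermediate relations on a small data set.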

Page 73

IMPALA: From the shell window

Type: impala-shell

> invalidate metadata;

> show tables;

>

> quit;

You can type “pig” or “hive” at the command line and run the scripts here, without Hue.

Page 74

Terminate!

Page 75

Remember to shut down your clusters

Page 76

Recap

Presentation

• Hadoop is an evolving ecosystem of projects

• It is well suited for big data

• Use something else for medium or small data

Doing

• Started a Hadoop cluster via the AWS Console (Web UI)

• Loaded Data

• Wrote some queries

Page 77

Thank you!

To continue the discussion, contact us at

[email protected] [email protected]

[email protected] [email protected]