
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift

Jul 15, 2015


DataKitchen
Page 1: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Big Data Infrastructure workshop A hands-on introduction

Saturday, December 6, 2014

Page 2: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Agenda

08:30 AM Breakfast

09:00 AM Introduction and Strengths of Technologies

10:00 AM Start an EMR Cluster

10:15 AM break + set up query tool

10:30 AM Hadoop hands-on

10:55 AM break

11:10 AM Redshift hands-on

11:40 AM Operationalizing your code

12:00 PM adjourn


Page 3: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Background on your presenters

Page 4: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

DataKitchen Leadership

Chris Bergh (Executive Chef)

Gil Benghiat (VP Product)

Eric Estabrooks (VP Cloud and Data Services)

Software development origins and executive experience delivering enterprise software focused on Marketing and Health Care sectors.

Deep Analytic Experience: Spent past decade solving the analytic data preparation problem

New Approach To Data Preparation and Production: focused on the Analysts

Page 5: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production

Page 6: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

This creates an expectation gap

[Figure: how time is split among Analyze, Prepare Data, and Communicate, comparing Business Customer Expectation with Analyst Reality]

The business does not think that Analysts are preparing data

(Analysts don’t want to prepare data)

Page 7: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

What Analysts Really Want: An Integrated Data Set Ready For Analysis

With: Autonomy & Agility

Without: All the Work & Anxiety

Page 8: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

DataKitchen solves this problem.

We are on a mission to prepare data to make analysts successful.


Page 10: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Experience of Audience

• Who considers themselves a(n):
  • Analyst
  • Data scientist
  • Programmer / Scripter
  • Person on the business side

• Who knows SQL – can write a simple select?

• Who had an AWS account before today?


Page 11: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Hadoop & Redshift

Page 12: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

What Is Apache Hadoop?

• Software framework

• Large scale processing

• Network of commodity hardware

• Handles hardware failures


http://hadoop.apache.org/

Page 13: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

What is Hadoop good for?

• Problems that are huge (batch), but not hard, and can be run in parallel over immutable data

• NOT OLTP (e.g. backend to e-commerce site)

• Providing a Map Reduce framework


Page 14: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Map Reduce


http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf


Page 16: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

You can write map reduce jobs in your favorite language

Streaming Interface

• Lets you specify mappers and reducers

• Supports:
  • Java
  • Python
  • Ruby
  • Unix Shell
  • R
  • Any executable

Map Reduce “generators”

• Results in map reduce jobs

• Pig

• Hive
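As a rough illustration of the streaming interface, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of tab-separated key/value text records, with the reducer receiving its input sorted by key. The function names (mapper, reducer, run_locally) are made up for this example, and run_locally only imitates the sort step that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; the input must already be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for the shuffle: sort the mapper output, then reduce it."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["the quick brown fox", "the lazy dog", "The end"]):
        print(record)
```

On a real cluster, each function would typically live in its own script that reads standard input and prints to standard output, and the two scripts would be handed to the streaming job as the mapper and the reducer.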


Page 17: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Applications that lend themselves to map reduce

• Word Count

• PDF Generation (NY Times 11,000,000 articles)

• Analysis of stock market historical data (ROI and standard deviation)

• Geographical Data (Finding intersections, rendering map files)

• Log file querying and analysis

• Statistical machine translation

• Spam detection

• Analyzing Tweets


Page 18: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Would you use an excavator to plant a tomato?


Page 19: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Another use …

Some people use a Hadoop cluster for a “data lake”

• Store all your raw data

• Cook it on demand


Page 20: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

[Figure: Hadoop ecosystem diagram, with Impala added]

http://pixgood.com/hadoop-ecosystem-diagram.html

Page 21: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Pig

• Pig Latin - the scripting language

• Grunt – Shell for executing Pig Commands


http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

Page 22: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

This is what it would be in Java


http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

Page 23: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Hive

You write SQL! Well, almost, it is HiveQL


SELECT user.*
FROM user
WHERE user.active = 1;

[Figure: SQL Workbench connecting over JDBC]

The first hands-on session will focus on this.
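To try the shape of that query without a cluster, the sketch below runs it against an in-memory SQLite table from Python. SQLite is only a stand-in here, not Hive. The table layout (id, name, active) and the function name active_users are assumptions made for this example, and HiveQL differs from SQLite's SQL in other respects.

```python
import sqlite3


def active_users(rows):
    """Load (id, name, active) rows into a scratch table and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout; the slide does not show the real columns.
    conn.execute("CREATE TABLE user (id INTEGER, name TEXT, active INTEGER)")
    conn.executemany("INSERT INTO user VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT user.* FROM user WHERE user.active = 1"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [(1, "ann", 1), (2, "bob", 0), (3, "cy", 1)]
    for row in active_users(sample):
        print(row)
```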

Page 24: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

On AWS, the common workflow for batch processing starts and ends with S3.

[Figure: S3 → Hive script → S3]

Page 25: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Impala

• Uses SQL very similar to HiveQL

• Runs 10-100x faster than Hive

• Runs in memory so it does not scale up as well

• Great for developing your code on a small data set

• Can use interactively with Tableau and other BI tools

• Some batch jobs run faster on Impala than Hive


Page 26: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

What is EMR?

• Hadoop offered by Amazon

• EMR = Elastic Map Reduce

• Amazon does almost all of the work to create a cluster


Page 27: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Three ways to pay for EMR

• On Demand - highest price, by the hour, no commitment

  • m1.small: $0.055 per hour

  • i2.8xlarge: $7.09 per hour

  • (29 different machine options)

• Reservation - 1 and 3 year terms (No, All, & Partial Upfront)

• Spot - lowest price, machine can be taken away

Do I leave my cluster up all the time?
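One way to approach that question is simple arithmetic on the on-demand prices quoted above. The sketch below is a rough estimate. It assumes billing by whole hours with partial hours rounded up, a hypothetical 10-machine cluster, and a hypothetical nightly job of 2 hours. Current AWS prices and billing rules may differ.

```python
import math

# Hourly on-demand prices as quoted on the slide (December 2014).
PRICE_PER_HOUR = {"m1.small": 0.055, "i2.8xlarge": 7.09}


def cluster_cost(instance_type, machines, hours):
    """Cost of one cluster run; partial hours are rounded up (assumption)."""
    return PRICE_PER_HOUR[instance_type] * machines * math.ceil(hours)


if __name__ == "__main__":
    # Hypothetical comparison: always-on for 30 days vs. a 2-hour job each night.
    always_on = cluster_cost("m1.small", machines=10, hours=24 * 30)
    nightly = cluster_cost("m1.small", machines=10, hours=2) * 30
    print(f"always on:   ${always_on:.2f} per month")
    print(f"nightly job: ${nightly:.2f} per month")
```

Under these assumptions the always-on cluster costs roughly twelve times as much as starting a cluster for each nightly run, which is why shutting clusters down matters.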


Page 28: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Adding machines: Time down, Cost up

[Figure: run time and cost as machines are added; cost axis in ECU]
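A toy model shows one reason cost can rise as machines are added. It assumes the work divides perfectly across machines and that every machine is billed for whole hours. Both are simplifications, and real jobs also carry startup and coordination overhead. The function name and the numbers are illustrative only.

```python
import math


def run_time_and_cost(machine_hours, machines, price_per_hour):
    """Ideal-scaling model: the work splits evenly, billing is per whole hour."""
    run_time = machine_hours / machines
    cost = machines * math.ceil(run_time) * price_per_hour
    return run_time, cost


if __name__ == "__main__":
    # Hypothetical job needing 10 machine-hours at a unit price of 1.0 per hour.
    for machines in (1, 4, 20):
        hours, cost = run_time_and_cost(10, machines, 1.0)
        print(f"{machines:>2} machines: {hours:.1f} h, cost {cost:.1f}")
```

In this model the job finishes sooner with more machines, but the unused part of each billed hour is paid for on every machine.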

Page 29: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

What Is Redshift?

• Columnar database

• Great for reads

• Scale by adding machines

• Two ways to pay

• On Demand

• Reservation

• Good for SQL-based ETL too
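To see why a columnar layout suits read-heavy queries, compare the two layouts in plain Python. This is only an analogy for how the data is arranged, not how Redshift is implemented. The sample rows and function names are invented for the example.

```python
# Row-oriented records, as a transactional database would store them.
ROWS = [
    {"user_id": 1, "country": "US", "spend": 20.0},
    {"user_id": 2, "country": "CA", "spend": 5.5},
    {"user_id": 3, "country": "US", "spend": 12.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def total(columns, name):
    """Aggregate a single column without touching the others."""
    return sum(columns[name])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total(columns, "spend"))
```

An aggregate over one column only has to read that column's values, while the row layout forces a scan over every field of every record.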


http://hadoop.apache.org/

Page 30: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Redshift Machine Options (on demand prices)


Petabyte scale

Remember: Amazon charges for S3 storage too

Page 31: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Redshift usage pattern

• Load data to S3 first

• Use BI tools to send in SQL

• Amazon Redshift is based on PostgreSQL

The second hands-on session will focus on this.

[Figure: SQL Workbench connecting over JDBC]


Page 33: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Should I use Redshift or EMR?

Redshift for

• Structured data

• Interactive queries

• Speed

Hadoop for

• Data format flexibility

• Computation flexibility

• Super Big Data


• Try both

• Compare costs

• If it works in Redshift, start there

Page 34: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Performance comparison (3. Join Query)

https://amplab.cs.berkeley.edu/benchmark/

Page 35: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Recap

• Started a Hadoop cluster via the AWS Console (Web UI)

• Loaded Data

• Wrote some queries

• Same for Redshift

Eventually, you will do this for real and have a script that has value.

Now what?


Page 36: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

To run your data job you need to …

• Wait for the new data to arrive

• Move it to S3

• Start a cluster

• Load the data

• Run your SQL scripts

• Wait for it to finish

• Shut down your cluster
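The checklist above can be sketched as a small Python runner. Every step here is a stub that only records its name, and the cluster id is a made-up string. Real steps would call your own transfer, cluster, and SQL tooling. The point of the sketch is the try/finally, which shuts the cluster down even when a step fails.

```python
def run_data_job(before_cluster, start_cluster, on_cluster, shut_down_cluster):
    """Run the checklist in order; always shut the cluster down once it has started."""
    for step in before_cluster:
        step()
    cluster = start_cluster()
    try:
        for step in on_cluster:
            step(cluster)
    finally:
        shut_down_cluster(cluster)


def demo():
    """Run the checklist with stub steps, one of which fails on purpose."""
    events = []

    def record(message):
        def step(cluster=None):
            events.append(message)
        return step

    def start_cluster():
        events.append("start cluster")
        return "hypothetical-cluster-id"

    def run_sql_scripts(cluster):
        raise RuntimeError("SQL script failed")

    try:
        run_data_job(
            before_cluster=[record("wait for new data"), record("move to S3")],
            start_cluster=start_cluster,
            on_cluster=[record("load data"), run_sql_scripts, record("wait for finish")],
            shut_down_cluster=record("shut down cluster"),
        )
    except RuntimeError:
        events.append("job failed")
    return events


if __name__ == "__main__":
    print(demo())
```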


Page 37: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

And hope …

• The new data is in the right format

• Assumptions you made during development are still true

• Someone did not mess up your code with an “easy change”

• The new data transfers run successfully

• A table you depend on has been updated correctly

• The new data has not been truncated by the source

• No data quality issues with the source data

Wouldn’t it be great to turn your hopes into tests?
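As a minimal sketch of what such tests might look like, the function below checks a newly arrived CSV file for three of the hopes listed above: the expected format, no truncation, and no obvious quality problem. The column names, the 50% row-count threshold, and the function name are assumptions chosen for illustration, not part of any DataKitchen product.

```python
import csv
import io

# Hypothetical schema for the incoming file.
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount"]


def check_new_data(csv_text, previous_row_count, min_ratio=0.5):
    """Return the failed checks; an empty list means the data looks safe to load."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    failures = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        failures.append("format: unexpected columns")
    if len(rows) < previous_row_count * min_ratio:
        failures.append("truncation: far fewer rows than the last load")
    if any(not row.get("amount") for row in rows):
        failures.append("quality: missing amount")
    return failures


if __name__ == "__main__":
    sample = "order_id,customer_id,amount\n1,10,9.99\n2,11,\n"
    print(check_new_data(sample, previous_row_count=10))
```

Running checks like these before the load step lets a job stop early with a clear message instead of producing a wrong result.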


Page 38: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

DataKitchen: We produce the data

SQL, tests, and the checklist go into a Recipe

Your data are Ingredients

The results are Servings

Page 39: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

DataKitchen brings reality in line with expectations

[Figure: how time is split among Analyze, Prepare Data, and Communicate for Business Customer Expectation, Analyst Reality, and With DataKitchen]

Page 40: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

The story of our first Recipe


Page 41: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

The story of our first Recipe

With DataKitchen, we got 75% of our time back!

… and we don’t have to remember to shut down our cluster.


Page 42: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift

Remember to shut down your clusters

Page 43: Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift


Thank you!

Send us an email to receive our newsletter or to give us feedback.

[email protected]