Amazon EMR

Amazon Elastic MapReduce (EMR)

Saturday, December 6, 2014

Agenda

• 08:30 AM Breakfast

• 09:00 AM Introduction and Strengths of Technologies

• 10:00 AM Start an EMR Cluster

• 10:15 AM Break + set up query tool

• 10:30 AM Hadoop hands-on

• 10:55 AM Break

• 11:10 AM Redshift hands-on

• 11:40 AM Operationalizing your code

• 12:00 PM Adjourn

Session Goals

• Understand:

• When to use EMR?

• Do:

• Start Cluster

• Load Data from S3

• Transform Data

• Unload Data to S3

Draw elements from Gil’s deck

Pattern

When to use EMR?

• Some Boolean combination of the following:

• Ephemeral clusters

• Batch processing: daily, weekly, etc.

• User Defined Functions (UDF)

• File formats

• TB, PB data sets in S3

• Instant gratification

Let’s Do This!

What do we need?

• Key (.pem file)

• SQL Workbench

What will we do?

• Start Cluster

• Load stock market data from S3

• Calculate Sharpe ratio

• Unload Sharpe ratio results to S3

The Sharpe Ratio characterizes how well the return of an asset compensates the investor for the risk taken. Roughly, the higher the better.

AWS Console

• Just google “aws console”

Click Here

Where’s EMR?

Create Cluster

Cluster Options

• Lots of them!

• Cluster Configuration

• Tags - Skip

• Software Configuration

• File System Configuration

• Hardware Configuration

• Security and Access

• IAM Roles

• Bootstrap Actions

• Steps

Cluster Configuration

Software Configuration

More fun stuff in here

File System Configuration

Hardware Configuration

$ 0.28 / hour

Set Core and Task to 0

Security and Access

Finally we get to use our keys!

IAM Roles

Just defaults, please

More JSON in here

Bootstrap Actions

• Tweak configuration

• Install custom application (Apache Drill, Mahout, etc.)

• Shell scripts

Steps

Steps: Hive Program

Provisioning

Bootstrapping

Here’s your hostname

SSH Info

Monitor Startup Progress

SSH – Linux/Mac

SSH - Windows

Port Forwarding (Mac/Linux)

ssh -i ~/.ec2/emr-training.pem -L 10000:localhost:10000 <user>@<cluster-hostname>
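
Before pointing SQL Workbench at the tunnel, it can help to confirm that something is actually listening on the forwarded local port. The sketch below is only a convenience check, not part of the workshop material; the helper name `port_is_open` and the 2-second timeout are illustrative assumptions, while port 10000 comes from the ssh command above.

```python
import socket


def port_is_open(host="localhost", port=10000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only shows that the port accepts connections,
    not that the service behind the tunnel is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("tunnel reachable:", port_is_open())
```

If this reports False while the ssh session is still open, the most likely causes are a typo in the -L argument or a cluster that has not finished bootstrapping.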

Connect with SQL Workbench:

• Localhost

• Autocommit

• Default URL

Load Data from S3

Familiar SQL

Describe file format

Pull from DK bucket

Calculate Daily Returns

Copy data into our new table

Create a table in HDFS

Hive has Windowing and Analytic Features

Daily Return = (adjclose[n] / adjclose[n-1]) - 1
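
The Hive query from this slide is not reproduced here, so the following is only a small Python sketch of the same arithmetic: each day's return is the ratio of that day's adjusted close to the previous day's, minus 1. In Hive, the previous row would typically be reached with a windowing function such as LAG. The function name and the sample prices are made up for illustration.

```python
def daily_returns(adjclose):
    """Return simple daily returns for a date-ordered list of adjusted closes.

    The first day has no previous close, so the result is one element
    shorter than the input.
    """
    returns = []
    for prev, curr in zip(adjclose, adjclose[1:]):
        returns.append(curr / prev - 1)
    return returns


if __name__ == "__main__":
    prices = [100.0, 102.0, 99.96]  # made-up adjusted closes
    print(daily_returns(prices))
```

The prices must already be sorted by date for a single ticker; mixing tickers or dates would pair the wrong rows.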

Calculate Sharpe Ratio
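
The exact formula used in the workshop query is not shown in these slides. As a rough sketch under stated assumptions, one common form divides the mean excess daily return by its standard deviation and annualizes by the square root of the number of trading periods. The zero risk-free rate, the 252 trading days per year, and the use of the sample standard deviation are all assumptions here, not values taken from the deck.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a list of per-period returns.

    Assumptions (not from the slides): risk_free_rate is per period,
    252 trading days per year, sample standard deviation.
    """
    excess = [r - risk_free_rate for r in returns]
    mean = statistics.mean(excess)
    stdev = statistics.stdev(excess)  # needs at least two returns
    if stdev == 0:
        raise ValueError("zero volatility; Sharpe ratio is undefined")
    return math.sqrt(periods_per_year) * mean / stdev


if __name__ == "__main__":
    sample = [0.01, -0.005, 0.007, 0.002]  # made-up daily returns
    print(sharpe_ratio(sample))
```

Feeding this the daily returns for each ticker gives one number per ticker, which is the shape of result the workshop then unloads to S3.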

Export Our Data

Define CSV output

Write out data

Terminate!

Links and Resources

• SQLWorkbench/J

• AWS EMR Documentation: http://aws.amazon.com/documentation/elastic-mapreduce/

• Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
