Amazon Elastic MapReduce (EMR)
Saturday, December 6, 2014
Agenda
08:30 AM Breakfast
09:00 AM Introduction and Strengths of Technologies
10:00 AM Start an EMR Cluster
10:15 AM break + set up query tool
10:30 AM Hadoop hands-on
10:55 AM break
11:10 AM Redshift hands-on
11:40 AM Operationalizing your code
12:00 PM adjourn
Session Goals
• Understand:
• When to use EMR?
• Do:
• Start Cluster
• Load Data from S3
• Transform Data
• Unload Data to S3
Draw elements from Gil’s deck
Pattern
When to use EMR?
• Some Boolean combination of the following:
• Ephemeral clusters
• Batch processing: daily, weekly, etc.
• User Defined Functions (UDF)
• File formats
• TB, PB data sets in S3
• Instant gratification
Let’s Do This!
What do we need?
• Key (.pem file)
• SQL Workbench
What will we do?
• Start Cluster
• Load stock market data from S3
• Calculate Sharpe ratio
• Unload Sharpe ratio results to S3
The Sharpe Ratio characterizes how well the return of an asset compensates the investor for the risk taken. Roughly, the higher the better.
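As a rough illustration of the definition above, here is a minimal Python sketch that computes a Sharpe ratio from a list of periodic returns. The function name `sharpe_ratio`, the default risk-free rate of zero, the use of the sample standard deviation, and the 252-trading-day annualization factor are all assumptions made for this example; the workshop itself computes the ratio in Hive.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Sharpe ratio of periodic returns, scaled by sqrt(periods_per_year).

    Assumptions (not from the slides): risk_free is a per-period rate,
    volatility is the sample standard deviation, and 252 periods per year
    approximates the number of trading days.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    # Excess return over the risk-free rate for each period.
    excess = [r - risk_free for r in returns]
    mean_excess = statistics.mean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("zero volatility: ratio is undefined")
    return math.sqrt(periods_per_year) * mean_excess / volatility


# Two hypothetical assets with the same average return but different risk.
steady = sharpe_ratio([0.01, 0.03])
volatile = sharpe_ratio([-0.02, 0.06])
print(steady > volatile)
```

With equal average returns, the less volatile series scores higher, which matches the "higher is better" reading.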
Cluster Options
• Lots of them!
• Cluster Configuration
• Tags - Skip
• Software Configuration
• File System Configuration
• Hardware Configuration
• Security and Access
• IAM Roles
• Bootstrap Actions
• Steps
Bootstrap Actions
• Tweak configuration
• Install custom application (Apache Drill, Mahout, etc.)
• Shell scripts
Port Forwarding (Mac/Linux)
ssh -i ~/.ec2/emr-training.pem -L 10000:localhost:10000 hadoop@<master-public-dns>
Calculate Daily Returns
Copy data into our new table
Create a table in HDFS
Hive has Windowing and Analytic Features
Daily Return = (adjclose[n] / adjclose[n-1]) - 1
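To make the daily return calculation concrete, the following Python sketch computes the return as the ratio of each adjusted close to the previous one, minus 1. The function name `daily_returns` and the sample prices are hypothetical, and the input is assumed to be ordered oldest first. In Hive, the previous row's value would typically come from a windowing function such as `LAG`; this sketch only mirrors that logic on an in-memory list.

```python
def daily_returns(adj_closes):
    """Return adj_closes[n] / adj_closes[n-1] - 1 for each n >= 1.

    Assumes adj_closes is ordered oldest first. The first day has no
    previous close, so the result is one element shorter than the input.
    """
    returns = []
    # Pair each close with the one before it, like LAG(adjclose, 1).
    for prev, curr in zip(adj_closes, adj_closes[1:]):
        if prev == 0:
            raise ValueError("previous adjusted close is zero")
        returns.append(curr / prev - 1)
    return returns


# Hypothetical adjusted closes: up 10%, then down 10%.
print(daily_returns([100.0, 110.0, 99.0]))
```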
Links and Resources
• SQLWorkbench/J
• AWS EMR Documentation
• Hive Language Manual