© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amo Abeyaratne, Big Data and Analytics Consultant ANZ
12.04.2016
Best Practices and Performance Tuning on Amazon Elastic MapReduce
Michael Hanisch, Solutions Architect
Challenge: Data is Everywhere
- Sensors
- Devices
- Logs
- Operations
- …
Size: Growing in Petabytes
Migrate & Collect
Store &Transform
Process & Analyze
Visualize & Predict
“If every byte was a word, and you took a second to read each word, it would take you 32 million years to read a whole petabyte.”
How do we deal with it?
Strategy: Divide and Conquer
Hadoop Amazon EMR
A Managed Hadoop Framework in the Cloud
Hadoop on EC2
Managing on your own can be a LOT of work
Hadoop has to scale?
Tool: Amazon EMR
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR
Why EMR?
Why EMR?
Automated Decoupled Elastic
Integrated Low-cost Current
Why EMR? : Automation
- EC2 provisioning
- Cluster setup
- Hadoop configuration
- Installing applications
- Job submission
- Monitoring and failure handling
Why EMR? : Decoupled Architecture
- Separate compute and storage
- Resize and shut down with no data loss
- Point multiple clusters at the same data on Amazon S3
- Easily evolve infrastructure as technology evolves
- HDFS for iterative and disk-I/O-intensive workloads
- Save with Spot and Reserved Instances
Why EMR?: Elastic
- Intelligent resize: wait for work to finish before stopping instances
- Core nodes for scaling HDFS
- Task nodes for scaling processing power
- Use instance groups to manage different instance types in the same cluster
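Resizing can be scripted with the AWS CLI; a sketch (the instance group ID and count below are placeholders):

```
# Illustrative sketch: grow the task instance group of a running
# cluster to 20 nodes. "ig-TASKGROUP" is a placeholder ID.
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-TASKGROUP,InstanceCount=20
```

Because task nodes hold no HDFS blocks, they can be added or removed like this without risking data loss.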
Why EMR?: Current
Application   Open source release   EMR release
Spark 1.5     September 9, 2015     September 2015
Spark 1.5.2   November 9, 2015      November 2015
Spark 1.6     January 4, 2016       January 2016
Spark 1.6.1   March 9, 2016         April 2016
Why EMR?: Integration with other services
Amazon Kinesis
Amazon S3
AWS Data Pipeline
Amazon DynamoDB
Amazon Redshift
Amazon KMS
IAM for Authentication
Why EMR?: Low-cost
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, use EC2 Reserved Instances to save up to 50%
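A transient cluster can be launched from the AWS CLI so it tears itself down after the work is done; a sketch (cluster name, release label, sizes, and S3 paths below are illustrative placeholders):

```
# Illustrative sketch: a transient cluster that runs one Spark step,
# then terminates itself. Names, paths, and sizes are placeholders.
aws emr create-cluster \
  --name "nightly-etl" \
  --release-label emr-4.5.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 10 \
  --use-default-roles \
  --steps Type=Spark,Name="ETL",Args=[s3://my-bucket/jobs/etl.py] \
  --auto-terminate
```

With `--auto-terminate` you pay only for the minutes the job actually runs, which pairs well with Spot pricing for the task nodes.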
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR

- Automated
- Decoupled
- Elastic
- Integrated
- Current
- Cost-efficient
Well-Architected Amazon EMR
Well Architected EMR: Design for Production
- Security
- Reliability
- Performance
- Cost efficiency
Performance Efficiency
- Choice of instance type
- Choice of storage
- Choice of application or framework
Performance: Choice of instance type - Master
- Fewer than 50 nodes? Use m3.xlarge.
- 50 nodes or more? Use m3.2xlarge or larger.
- Heavy network I/O? Use the C3 family, or R3 with enhanced networking.
Performance: Choice of instance type - Workers
Workload               Instance class   Suggested families
Machine learning       Compute          C1 family, C3 family, cc1.4xlarge, cc2.8xlarge
Interactive analysis   Memory           M2 family, R3 family, cr1.8xlarge
Large HDFS             Storage          D2 family, I2 family
Batch processing       General          M3 family, M1 family
How many nodes do I need?
Performance: Sizing
- How many tasks will you have to execute?
  - Based on data size and the number of files/splits
  - Based on the job you are evaluating
- When does the job need to be done?
  - How much parallelism do you need?
  - How many tasks can each node process in parallel?
Performance: Sizing
- Which other jobs are you running?
- How much data do you need to store locally?
- Which files are reused 3x or more?
Performance: Sizing
Guidelines:
- Size based on HDFS storage first, if needed
- Add enough (task) nodes to handle processing
- Do not add more than 5 task nodes per core node
- Prefer smaller clusters of larger machines
Performance: Tune it – e.g. for Spark
- Enable dynamic allocation of executors
- Executor memory, executor cores, YARN containers
- Driver memory
- Deployment mode on YARN (cluster or client?)
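These Spark knobs map onto EMR configuration classifications; a sketch (the property values are illustrative, not recommendations for any particular workload):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4g",
      "spark.driver.memory": "2g"
    }
  }
]
```

Passed via the `--configurations` option at cluster creation, this avoids editing spark-defaults.conf on the nodes by hand.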
Performance: Choice of storage

EMRFS (Amazon S3):
- Ability to decouple compute and storage
- Reliable and durable
- Cost efficient
- Works well for jobs that read a dataset once per run

HDFS (instance storage or Amazon EBS):
- Needs a persistent cluster
- Reliability is configurable, but you need multiple nodes to achieve the replication factor
- Great for jobs with iterative reads on the same dataset, such as machine learning

Combine with S3DistCp: move data from S3 to HDFS once for iterative workloads
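Staging a hot dataset into HDFS is one S3DistCp invocation on the cluster; a sketch (bucket and paths are placeholders):

```
# Illustrative sketch: copy an iteratively-read dataset from S3 into
# HDFS once, then point the job at hdfs:///dataset/ instead of S3.
s3-dist-cp --src s3://my-bucket/dataset/ --dest hdfs:///dataset/
```

Subsequent passes over the data then read from local disks instead of repeating S3 GETs on every iteration.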
Performance: S3 vs HDFS at Netflix
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
S3: Range GET – data ‘locality’ doesn’t matter?

EMR worker nodes read one large S3 object in parallel, each issuing a ranged GET for its own slice (bytes 0–64 MB, 64–128 MB, 128–192 MB, …, (n-64)–n MB). Use larger files so each GET covers a full split.
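The split-into-ranges idea can be sketched in a few lines; the 64 MB chunk size is an illustrative assumption, and each tuple maps onto an HTTP `Range: bytes=start-end` header:

```python
def byte_ranges(object_size, chunk=64 * 1024 * 1024):
    """Split one S3 object into ranged-GET byte ranges, one per task.

    Returns inclusive (start, end) tuples, as used in HTTP Range headers.
    """
    return [(start, min(start + chunk, object_size) - 1)
            for start in range(0, object_size, chunk)]
```

A 200 MB object yields four ranges, so four tasks can pull it from S3 concurrently with no notion of data locality.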
S3: Avoid sequential key names
S3 partitions keys by the UTF-8 binary ordering of the key name, and each partition sustains on the order of 100 requests per second. Sequential key prefixes concentrate all requests on one partition; varied prefixes spread the load across partitions.
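One common way to break up sequential prefixes is to prepend a short hash of the natural key. A minimal sketch; the layout (4 hex characters, then the original key) is an illustrative assumption, not an AWS API:

```python
import hashlib

def hashed_key(natural_key, prefix_len=4):
    """Prepend a short, deterministic hash prefix to an S3 key so
    writes spread across S3 partitions instead of one hot partition."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{natural_key}"
```

The prefix is deterministic, so readers can recompute it from the natural key; listing by date then requires enumerating the hash prefixes, which is the usual trade-off of this pattern.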
S3: Real-world heavy EMRFS users

- 1.2 TB/day of logs, 30 TB/day of data, 250 Hadoop jobs, 75 billion transactions/day
- 5 petabytes of data
- 10+ PB data warehouse on Amazon S3, with more than 1 PB read each day
Performance: Compression
Always compress data on S3 to reduce bandwidth & cost.
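The effect is easy to demonstrate on repetitive, text-heavy data like logs; a minimal sketch using Python's standard gzip module (the sample record is made up):

```python
import gzip

# Illustrative only: compress a chunk of repetitive log data, as you
# would before uploading it to S3 with any S3 client.
raw = b"timestamp=2016-04-12 level=INFO msg=hello\n" * 1000
compressed = gzip.compress(raw)

# Text-heavy data typically shrinks dramatically, cutting both S3
# storage cost and the bandwidth each GET consumes.
ratio = len(compressed) / len(raw)
```

Note that gzip is not splittable, so prefer many moderately sized compressed files over one huge one, or use a splittable codec.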
Performance: Choice of Framework
Example: word count — “Count these words” → Count = 1, These = 1, Words = 1

- Is the job embarrassingly parallel?
- Can it be optimized with a DAG (stages A → B, C → D → E)?
Reliability
Externalize Hive Metastore
- Store your metadata outside the cluster on RDS
- Multi-AZ RDS cluster will give you HA
Data and Applications on S3
- Source of truth for data
- Store config & applications on S3, too
- Consistent View
Automate
- Bootstrap actions
- Config options
- CloudFormation
Reliability: Hive Metastore on RDS MySQL
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://emr-hive-metastore.cauttwbz9zri.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionUserName": "admin",
      "javax.jdo.option.ConnectionPassword": "Passw0rd!"
    },
    "configurations": []
  }
]
Config file lives on s3://<bucketname>/hive-meta-config.json
Security
Encryption
- Server-side
- Client-side
- HDFS transparent encryption
- RPC with SSL
- File system with LUKS
IAM Roles
- Secure Integration with AWS services
VPC
- Private Subnets- S3 endpoints- NAT
Security
S3 server-side encryption (SSE):
- Using S3-managed keys, or
- Using KMS-managed keys (SSE-KMS)
- Configured at cluster start time
Security: Server-Side Encryption (SSE-KMS)
[Diagram: output written via EMRFS to an Amazon S3 bucket; S3 encrypts objects server-side with keys from AWS KMS]
Security
S3 server-side encryption (SSE-KMS):
1. Define a Customer Master Key in KMS
2. Create a key policy that allows EMR’s instance profile/role to use this key
3. Configure the cluster with the CMK KeyID or alias
Security: Sample Key Policy
{
  "Sid": "Allow use of the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": ["arn:aws:iam::XXXXXXXXXXXX:role/EMR_DefaultRole"]
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*",
  "Condition": {
    "ForAnyValue:StringLike": {
      "kms:EncryptionContext:aws:s3:arn": [
        "arn:aws:s3:::bucket1/*",
        "arn:aws:s3:::bucket2/*"
      ]
    }
  }
}
[…]
Security: Enable SSE-KMS with selected CMK
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true",
      "fs.s3.serverSideEncryption.kms.keyId": "a4567b8-9900-12ab-1234-123a45678901"
    }
  }
]
Security
Client-side encryption:
- Using a custom key materials provider
- Configured at cluster start
Security: End-to-End Encryption

[Diagram: output writes via EMRFS with client-side encryption enabled (AWS S3 SDK, AmazonS3EncryptionClient()) to an Amazon S3 bucket, with keys from AWS KMS]

- HDFS transparent encryption with Hadoop KMS
- In-flight encryption: spark.ssl.enabled, hadoop.rpc.protection, hadoop.ssl.enabled, mapreduce.shuffle.ssl.enabled
- LUKS with a bootstrap action for local file systems
Cost Efficiency
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, use EC2 Reserved Instances to save up to 50%
Matching supply and demand:
- Is the cluster big enough?
- Can we make it transient?
- Monitor the usage with Ganglia and Amazon CloudWatch alarms

Using cost-effective resources:
- S3 instead of HDFS for larger datasets?
- Taking advantage of Spot and Reserved Instances?

Optimize over time:
- Monitor and watch out for new instance types and features that may reduce cost
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR

- Automated
- Decoupled
- Elastic
- Integrated
- Current
- Cost-efficient

- Performance tuning
- Reliability measures
- Security facts
- Cost efficiency
Thank you!