© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amo Abeyaratne, Big Data and Analytics Consultant ANZ
12.04.2016
Best Practices and Performance Tuning on Amazon Elastic MapReduce
Michael Hanisch, Solutions Architect
Challenge: Data is Everywhere
- Sensors
- Devices
- Logs
- Operations
- …
Size: Growing in Petabytes
Migrate & Collect
Store &Transform
Process & Analyze
Visualize & Predict
“If every byte was a word, and you took a second to read each word, it would take you 32 million years to read a whole petabyte.”
How do we deal with it?
Strategy: Divide and Conquer
Hadoop Amazon EMR
A Managed Hadoop Framework in the Cloud
Hadoop on EC2
Managing on your own can be a LOT of work
Hadoop has to scale?
Tool: Amazon EMR
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR
Why EMR?
Why EMR?
Automated Decoupled Elastic
Integrated Low-cost Current
Why EMR? : Automation
- EC2 provisioning
- Cluster setup
- Hadoop configuration
- Installing applications
- Job submission
- Monitoring and failure handling
Why EMR? : Decoupled Architecture
- Separate compute and storage
- Resize and shut down with no data loss
- Point multiple clusters at the same data on Amazon S3
- Easily evolve infrastructure as technology evolves
- HDFS for iterative and disk-I/O-intensive workloads
- Save with Spot and Reserved Instances
Why EMR?: Elastic
- Intelligent resize: wait for work to finish before stopping instances
- Core nodes for scaling HDFS
- Task nodes for scaling processing power
- Use instance groups to manage different instance types in the same cluster
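Resizing can be scripted with the AWS CLI; a sketch (the instance group ID and count below are placeholders):

```
# Illustrative sketch: grow the task instance group of a running
# cluster to 20 nodes. "ig-TASKGROUP" is a placeholder ID.
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-TASKGROUP,InstanceCount=20
```

Because task nodes hold no HDFS blocks, they can be added or removed like this without risking data loss.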
Why EMR?: Current
Application   Open source release   EMR release
Spark 1.5     September 9, 2015     September 2015
Spark 1.5.2   November 9, 2015      November 2015
Spark 1.6     January 4, 2016       January 2016
Spark 1.6.1   March 9, 2016         April 2016
Why EMR?: Integration with other services
Amazon Kinesis
Amazon S3
AWS Data Pipeline
Amazon DynamoDB
Amazon Redshift
Amazon KMS
IAM for Authentication
Why EMR?: Low-cost
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, use EC2 Reserved Instances to save up to 50%
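A transient cluster can be launched from the AWS CLI so it tears itself down after the work is done; a sketch (cluster name, release label, sizes, and S3 paths below are illustrative placeholders):

```
# Illustrative sketch: a transient cluster that runs one Spark step,
# then terminates itself. Names, paths, and sizes are placeholders.
aws emr create-cluster \
  --name "nightly-etl" \
  --release-label emr-4.5.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 10 \
  --use-default-roles \
  --steps Type=Spark,Name="ETL",Args=[s3://my-bucket/jobs/etl.py] \
  --auto-terminate
```

With `--auto-terminate` you pay only for the minutes the job actually runs, which pairs well with Spot pricing for the task nodes.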
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR

- Automated
- Decoupled
- Elastic
- Integrated
- Current
- Cost-efficient
Well-Architected Amazon EMR
Well Architected EMR: Design for Production
- Security
- Reliability
- Performance
- Cost efficiency
Performance Efficiency
- Choice of instance type
- Choice of storage
- Choice of application or framework
Performance: Choice of instance type - Master
- Fewer than 50 nodes? Use m3.xlarge.
- 50 nodes or more? Use m3.2xlarge or larger.
- Heavy network I/O? Use the C3 family, or R3 with enhanced networking.
Performance: Choice of instance type - Workers
Workload               Instance class   Suggested families
Machine learning       Compute          C1 family, C3 family, cc1.4xlarge, cc2.8xlarge
Interactive analysis   Memory           M2 family, R3 family, cr1.8xlarge
Large HDFS             Storage          D2 family, I2 family
Batch processing       General          M3 family, M1 family
How many nodes do I need?
Performance: Sizing
- How many tasks will you have to execute?
  - Based on data size and the number of files/splits
  - Based on the job you are evaluating
- When does the job need to be done?
  - How much parallelism do you need?
  - How many tasks can each node process in parallel?
Performance: Sizing
- Which other jobs are you running?
- How much data do you need to store locally?
- Which files are reused 3x or more?
Performance: Sizing
Guidelines:
- Size based on HDFS storage first, if needed
- Add enough (task) nodes to handle processing
- Do not add more than 5 task nodes per core node
- Prefer smaller clusters of larger machines
Performance: Tune it – e.g. for Spark
- Enable dynamic allocation of executors
- Executor memory, executor cores, YARN containers
- Driver memory
- Deployment mode on YARN (cluster or client?)
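These Spark knobs map onto EMR configuration classifications; a sketch (the property values are illustrative, not recommendations for any particular workload):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4g",
      "spark.driver.memory": "2g"
    }
  }
]
```

Passed via the `--configurations` option at cluster creation, this avoids editing spark-defaults.conf on the nodes by hand.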
Performance: Choice of storage

EMRFS (Amazon S3):
- Ability to decouple compute and storage
- Reliable and durable
- Cost efficient
- Works well for jobs that read a dataset once per run

HDFS (instance storage or Amazon EBS):
- Needs a persistent cluster
- Reliability is configurable, but you need multiple nodes to achieve the replication factor
- Great for jobs with iterative reads on the same dataset, such as machine learning

Combine with S3DistCp: move data from S3 to HDFS once for iterative workloads
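Staging a hot dataset into HDFS is one S3DistCp invocation on the cluster; a sketch (bucket and paths are placeholders):

```
# Illustrative sketch: copy an iteratively-read dataset from S3 into
# HDFS once, then point the job at hdfs:///dataset/ instead of S3.
s3-dist-cp --src s3://my-bucket/dataset/ --dest hdfs:///dataset/
```

Subsequent passes over the data then read from local disks instead of repeating S3 GETs on every iteration.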
Performance: S3 vs HDFS at Netflix
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
S3: Range GET – data ‘locality’ doesn’t matter?

EMR worker nodes read one large S3 object in parallel, each issuing a ranged GET for its own slice (bytes 0–64 MB, 64–128 MB, 128–192 MB, …, (n-64)–n MB). Use larger files so each GET covers a full split.
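The split-into-ranges idea can be sketched in a few lines; the 64 MB chunk size is an illustrative assumption, and each tuple maps onto an HTTP `Range: bytes=start-end` header:

```python
def byte_ranges(object_size, chunk=64 * 1024 * 1024):
    """Split one S3 object into ranged-GET byte ranges, one per task.

    Returns inclusive (start, end) tuples, as used in HTTP Range headers.
    """
    return [(start, min(start + chunk, object_size) - 1)
            for start in range(0, object_size, chunk)]
```

A 200 MB object yields four ranges, so four tasks can pull it from S3 concurrently with no notion of data locality.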
S3: Avoid sequential key names
S3 partitions keys by the UTF-8 binary ordering of the key name, and each partition sustains on the order of 100 requests per second. Sequential key prefixes concentrate all requests on one partition; varied prefixes spread the load across partitions.
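One common way to break up sequential prefixes is to prepend a short hash of the natural key. A minimal sketch; the layout (4 hex characters, then the original key) is an illustrative assumption, not an AWS API:

```python
import hashlib

def hashed_key(natural_key, prefix_len=4):
    """Prepend a short, deterministic hash prefix to an S3 key so
    writes spread across S3 partitions instead of one hot partition."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{natural_key}"
```

The prefix is deterministic, so readers can recompute it from the natural key; listing by date then requires enumerating the hash prefixes, which is the usual trade-off of this pattern.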
S3: Real-world heavy EMRFS users

- 1.2 TB/day of logs, 30 TB/day of data, 250 Hadoop jobs, 75 billion transactions/day
- 5 petabytes of data
- 10+ PB data warehouse on Amazon S3, with more than 1 PB read each day
Performance: Compression
Always compress data on S3 to reduce bandwidth & cost.
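The effect is easy to demonstrate on repetitive, text-heavy data like logs; a minimal sketch using Python's standard gzip module (the sample record is made up):

```python
import gzip

# Illustrative only: compress a chunk of repetitive log data, as you
# would before uploading it to S3 with any S3 client.
raw = b"timestamp=2016-04-12 level=INFO msg=hello\n" * 1000
compressed = gzip.compress(raw)

# Text-heavy data typically shrinks dramatically, cutting both S3
# storage cost and the bandwidth each GET consumes.
ratio = len(compressed) / len(raw)
```

Note that gzip is not splittable, so prefer many moderately sized compressed files over one huge one, or use a splittable codec.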
Performance: Choice of Framework
Example: word count — “Count these words” → Count = 1, These = 1, Words = 1

- Is the job embarrassingly parallel?
- Can it be optimized with a DAG (stages A → B, C → D → E)?
Reliability
Externalize Hive Metastore
- Store your metadata outside the cluster on RDS
- Multi-AZ RDS cluster will give you HA
Data and Applications on S3
- Source of truth for data
- Store config & applications on S3, too
- Consistent View
Automate
- Bootstrap actions
- Config options
- CloudFormation
Reliability: Hive Metastore on RDS MySQL
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://emr-hive-metastore.cauttwbz9zri.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionUserName": "admin",
      "javax.jdo.option.ConnectionPassword": "Passw0rd!"
    },
    "configurations": []
  }
]
Config file lives on s3://<bucketname>/hive-meta-config.json
Security
Encryption
- Server-side
- Client-side
- HDFS transparent encryption
- RPC with SSL
- File system with LUKS
IAM Roles
- Secure Integration with AWS services
VPC
- Private Subnets- S3 endpoints- NAT
Security
S3 server-side encryption (SSE):
- Using S3-managed keys, or
- Using KMS-managed keys (SSE-KMS)
- Configured at cluster start time
Security: Server-Side Encryption (SSE-KMS)
[Diagram: output written via EMRFS to an Amazon S3 bucket; S3 encrypts objects server-side with keys from AWS KMS]
Security
S3 server-side encryption (SSE-KMS):
1. Define a Customer Master Key in KMS
2. Create a key policy that allows EMR’s instance profile/role to use this key
3. Configure the cluster with the CMK KeyID or alias
Security: Sample Key Policy
{
  "Sid": "Allow use of the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": ["arn:aws:iam::XXXXXXXXXXXX:role/EMR_DefaultRole"]
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*",
  "Condition": {
    "ForAnyValue:StringLike": {
      "kms:EncryptionContext:aws:s3:arn": [
        "arn:aws:s3:::bucket1/*",
        "arn:aws:s3:::bucket2/*"
      ]
    }
  }
}
[…]
Security: Enable SSE-KMS with selected CMK
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true",
      "fs.s3.serverSideEncryption.kms.keyId": "a4567b8-9900-12ab-1234-123a45678901"
    }
  }
]
Security
Client-side encryption:
- Using a custom key materials provider
- Configured at cluster start
Security: End-to-End Encryption

[Diagram: output writes via EMRFS with client-side encryption enabled (AWS S3 SDK, AmazonS3EncryptionClient()) to an Amazon S3 bucket, with keys from AWS KMS]

- HDFS transparent encryption with Hadoop KMS
- In-flight encryption: spark.ssl.enabled, hadoop.rpc.protection, hadoop.ssl.enabled, mapreduce.shuffle.ssl.enabled
- LUKS with a bootstrap action for local file systems
Cost Efficiency
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, use EC2 Reserved Instances to save up to 50%
Matching supply and demand:
- Is the cluster big enough?
- Can we make it transient?
- Monitor the usage with Ganglia and Amazon CloudWatch alarms

Using cost-effective resources:
- S3 instead of HDFS for larger datasets?
- Taking advantage of Spot and Reserved Instances?

Optimize over time:
- Monitor and watch out for new instance types and features that may reduce cost
Agenda
Why EMR?
Well Architected EMR Design for Production
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR

- Automated
- Decoupled
- Elastic
- Integrated
- Current
- Cost-efficient

- Performance tuning
- Reliability measures
- Security facts
- Cost efficiency
Thank you!