From Apache Hadoop/Spark to Amazon EMR: Best Migration Practices and Cost Optimization Strategies

Executive Summary

As more businesses look to technologies like machine learning and predictive analytics to drive business outcomes more efficiently, they seek to migrate on-premises Apache Spark and Hadoop clusters to the cloud. Wrestling with rising costs, maintenance uncertainties, and administrative headaches, they turn to Amazon Elastic MapReduce (EMR) to support Big Data, Spark, and Hadoop environments.

Amazon EMR is a cost-efficient, high-performance service that improves business agility and keeps down resource utilization costs. As a foundation for building data streaming and machine learning infrastructure, EMR enables organizations to transform into AI-driven enterprises, to become better at targeting prospects, supporting customers, building effective products, and responding to market needs.

This whitepaper will walk you through Amazon EMR’s key features, look into its business and technology benefits, and explore core scenarios for migrating Apache Spark & Hadoop clusters to Amazon EMR:

- Hadoop on-prem to EMR on AWS, through Lift & Shift or Re-Architect
- Next-Gen architecture on AWS — Containers, non-HDFS, or Streaming
- Existing Hadoop on-prem to Hadoop on AWS, through Rehost, Replatform, or Redistribution

We will also look into major cost optimization strategies with Amazon EMR for businesses running workloads on Apache Hadoop and Apache Spark.


Amazon EMR Overview

Amazon EMR is the industry-leading cloud-native Big Data platform for improving resource utilization by Apache Hadoop, Hive, Spark, Map/Reduce, and machine learning workloads. Designed to process vast amounts of data quickly and cost-effectively at scale, it is powered by open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto, coupled with Amazon EC2 and Amazon S3.

Amazon EMR gives analytical teams the capability to run PB-scale analysis for a fraction of the cost of traditional on-premises clusters. It eliminates the complexity involved in manual provisioning and setup of data lake resources, environment tuning and fine-tuning, and all other operational challenges.

Amazon EMR works with services for data analytics, data lake management, and machine learning, including Amazon Redshift, Amazon Athena, AWS Glue, Amazon Kinesis, and Amazon SageMaker (as well as Jupyter notebooks, Spark ML, and TensorFlow).

Amazon EMR helps organizations resolve complex technical and business challenges, from clickstream and log analysis to real-time and predictive analytics.

Key Features

1. Efficient Data Storage. Amazon S3 provides eleven 9s of durability for a wide variety of data types. It separates storage from compute, which makes it possible to manage multi-tenancy for both performance and chargeback to different business units.

2. Reduced Operational Cost. EMR's automated cluster provisioning (cluster setup, Hadoop configuration, and cluster tuning) reduces overall operational cost and improves your operations team's productivity (see the provisioning sketch after this list).

3. High Performance. Amazon EMR's built-in Auto Scaling increases the performance of various types of workloads while keeping overall costs low, improving the price-performance ratio.

4. Cost-Efficiency. Scale the worker nodes of separate, purpose-managed clusters out or back in for ephemeral, long-running, and smaller workloads. This enables pay-per-use instead of large, often idle clusters.
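To make the provisioning and pay-per-use points above concrete, here is a minimal, illustrative boto3 sketch of launching a transient EMR cluster that terminates itself once its steps finish. The release label, instance types, S3 paths, and IAM role names are placeholders, not recommendations from this paper.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

response = emr.run_job_flow(
    Name="transient-etl-cluster",                    # placeholder cluster name
    ReleaseLabel="emr-6.3.0",                        # pick a current EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-emr-logs/",                 # placeholder log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False = the cluster terminates automatically when all steps complete (transient cluster)
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],  # placeholder job script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",   # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
)

print("Started cluster:", response["JobFlowId"])

Because the cluster exists only for the duration of the job, compute is billed only while the steps run, while the data itself stays in Amazon S3.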


Major Use Cases

- Create new products
- Real-Time Streaming
- Extract, Transform, Load (ETL)
- Interactive Analytics
- Clickstream Analysis
- Genomics

Apache Hadoop/Spark and Amazon EMR

Over 10 years ago, the advent of Hadoop gave a fresh start to data lake technology. By offering a large-scale data collection environment with massive, cluster-based parallel processing, Hadoop enabled organizations that once relied on expensive high-end systems to work with their data for a fraction of the cost.

As data lakes started to evolve beyond simple data collection and organization towards advanced data processing and analytics, clusters backfired. IT teams struggled to efficiently manage data among multiple clusters, and businesses were forced to buy and deploy systems that were rarely used. Rising costs, maintenance uncertainties, and administrative headaches became ordinary issues for organizations running Apache Hadoop/Spark workloads. Eventually, they started to consider the cloud as an alternative to Hadoop clusters.

Industry Trends

The appeal of a cloud-based data lake has grown massively in recent years. The growth is primarily driven by a combination of four factors:

1. Large and growing Hadoop market. By 2024, the Hadoop market is projected to reach $9.4B, with annual revenue growth reaching 33%. The trend coincides with growth in such technology application areas as Data Lakes, Intelligent Systems of Engagement, and Self-Tuning Systems of Intelligence.

2. Rapid growth of cloud adoption in the Big Data space. According to Forrester Research, global spending on Big Data solutions via cloud subscriptions will grow almost 7.5x faster than spending on on-prem subscriptions.


3. Uncertainty with leading Hadoop vendors. Cloudera, Hortonworks, and MapR, as well as other vendors offering Big Data Hadoop solutions, are at a crossroads, as their clients are exploring the cloud to take advantage of benefits such as cost, flexibility, scalability, and performance.

4. Availability of resources in the cloud. Both data lake engineers and Big Data engineers prefer the cloud to on-premises, since cloud resources are easily accessible, can be scaled up or down, and, in most cases, are more cost-effective for irregular workloads.

Business Benefits

Issues of efficiency, scale, and management have always been on the agenda of organizations that deploy data lakes using Apache Hadoop/Spark. Amazon EMR is one of the managed services that help address these issues while:

- Reducing infrastructure costs
- Increasing the productivity of engineers and IT staff
- Improving the availability of Big Data/Hadoop/Spark environments

In 2018, IDC released comprehensive research on the business benefits of using Amazon EMR for Apache Hadoop/Spark, titled "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR." Overall, it concluded that organizations migrating from Apache Hadoop/Spark to Amazon EMR could reduce their total cost of ownership by 57%.

Other business benefits include:

- 342% increase in ROI over the course of five years
- 60% reduction in IT infrastructure costs over the course of five years
- 8 months from migration to breakeven, on average
- 33% more efficient Big Data teams
- $2.9 million of additional new revenue gained per year
- 46% more efficient Big Data/Hadoop environment management staff
- $18.1 million of total annual benefits per organization
- 99% reduction in unplanned downtime

It's time to move to the cloud, especially if Big Data is in the picture. Organizations that migrate their Big Data/Hadoop/Spark environments to Amazon EMR reduce total cost of ownership by 57% through improved business agility and lower costs.


Overall, IDC found that users of Amazon EMR increased the number of useful applications and capacity, realized better performance, and reduced staff time required for routine operations, all while realizing considerable cost savings.

Technology Benefits

Managing data lake technologies, including data collection, data processing, and data analytics systems, in an on-premises data center poses certain challenges:

1. Tightly coupled compute and storage. Storage is bound to grow with compute, and vice versa. Compute requirements usually vary a lot, and it is highly inefficient to bind the two together.

2. Underutilized or scarce resources. It is not easy to scale a large monolithic cluster up or down. There is contention for resources at peak times, and massive underutilization of resources at steady state.

3. Down-the-line apps may not receive updates in time. With a monolithic cluster, downstream applications may have dependencies that do not allow for upgrades. This lack of flexibility limits innovation.

4. Data gets replicated. Not only does replication add to cost, but it also pushes the limits of a data center. A 3x replication factor is a common occurrence for HDFS clusters.

5. Contention for the same resources. Apache Spark is compute-bound, and Apache Hive is memory-bound. They cannot be separated on-premises and are forced to contend for the same resources.

6. Separation of data creates data silos. Though separating data helps resolve some of the above challenges, it creates a challenge of its own: data silos between teams using Hive, Spark, and other frameworks.

The challenges of on-premises Apache Hadoop/Spark can be summarized as follows:

(Figure: summary of on-premises challenges, labeled Fixed Cost; Static, Not Scalable; Storage and Compute; Outages Impact; Always On vs. Self Serve; Production Upgrade; Slow Deployment Cycle.)


Moving Apache Hadoop/Spark to the cloud using a managed service such as Amazon EMR helps organizations avoid these drawbacks and enables them to maximize their potential.

Amazon EMR allows organizations to take maximum advantage of the cloud environment while continuing to benefit from Apache Hadoop/Spark. It brings major benefits such as the ability to decouple compute and storage, flexible pricing, availability of resources for various environments, and reliable performance.

Migration from on-premises Apache Hadoop/Spark to Amazon EMR may present its own difficulties. Some of the most commonly cited challenges are deterioration of performance, decline in the price-to-performance ratio, provisioning pipelines that are hard to build and manage, and disaster recovery issues. Fortunately, all it takes to migrate successfully is to (a) select an appropriate migration scenario, both technology-wise and business-wise, and (b) stick to the industry's best risk mitigation strategies.

The technology benefits of migrating to Amazon EMR include:

- Decoupled compute and storage
- Leverage Spot pricing for unused EC2 capacity
- Autoscale nodes with Spot Instances
- The industry's broadest analytics ecosystem surrounding EMR
- Turn off your clusters through transient clusters
- Diversify instance types using instance fleets (see the sketch after this list)
- Agility in auto-scaling persistent clusters
- Logical separation of jobs and applications
- Fully managed EMR Notebooks
- Self-service with AWS Service Catalog
- Spark performance improvements
- Built-in disaster recovery
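As an illustration of the Spot pricing and instance fleet benefits above, the following is a sketch of an Instances configuration one might pass to boto3's run_job_flow to diversify instance types and fall back to On-Demand capacity when Spot capacity is unavailable. Instance types, capacities, timeouts, and the subnet ID are placeholder values.

# Sketch of an EMR instance-fleets configuration mixing Spot and On-Demand capacity.
# Pass this dict as the Instances= argument of run_job_flow (instance fleets cannot be
# combined with instance groups in the same cluster).
instance_fleets_config = {
    "InstanceFleets": [
        {
            "Name": "primary-fleet",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
        },
        {
            "Name": "core-fleet",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,    # baseline capacity on On-Demand
            "TargetSpotCapacity": 6,        # burst capacity on cheaper Spot Instances
            # Diversify across several instance types so Spot interruptions are less disruptive
            "InstanceTypeConfigs": [
                {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
                {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 20,
                    # Fall back to On-Demand if Spot capacity cannot be provisioned in time
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ],
    "KeepJobFlowAliveWhenNoSteps": False,
    "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
}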

Migration Scenarios and Risk Mitigation Strategies

Migration Scenarios

Amazon Web Services distinguishes six distinct Hadoop migration scenarios among three major patterns.

#1 Pattern A

Re-purchase: Hadoop distro on-prem to EMR on AWS (also referred to as "Lift & Shift").


Re-architect: Hadoop distro on-prem to EMR on AWS with a completely new architecture and complementary services, to provide additional functionality, scalability, flexibility, cost savings, etc.

#2 Pattern B

Next Gen architecture: Moving a Hadoop workload from on-prem to AWS, but with a new architecture that may include Containers, non-HDFS storage, Streaming, etc. The workload remains the same or gains added functionality.

#3 Pattern C

Re-host / Lift & Shift: Any Hadoop distro on-prem to the same Hadoop distro on AWS. For example, Cloudera on-prem to Cloudera on AWS.

Re-platform: Hadoop distro on-prem to the same Hadoop distro on AWS with additional optimizations, such as separation of compute and storage and the use of complementary services such as Glue and Athena to optimize the environment. For example, Cloudera on-prem to Cloudera on AWS with optimizations.

Re-distro: Hadoop distro A on-prem to Hadoop distro B on AWS, with a completely new architecture and complementary services, to provide additional functionality, scalability, flexibility, cost savings, etc. For example, MapR to Cloudera on AWS.

While all of these patterns and scenarios are viable, practice has shown that only three scenarios enable organizations to take maximum advantage of migration.

EMR Migration Scenarios pros and cons:

1. Lift & Shift — Migrate to Amazon EMR by lifting and shifting your on-premises Hadoop distro to the AWS cloud.
- Low risk and lowest migration cost
- Very high ongoing cost
- Low business value addition
- Quickest time to market

2. Re-Architect — Migrate to Amazon EMR with a new architecture, with complementary services to optimize cost and to provide additional functionality, scalability, flexibility, etc.
- Medium risk, medium migration cost
- Medium ongoing cost
- High business value addition
- Medium time to market

3. Next Gen Architecture — Migrate to Amazon EMR with a completely new architecture, which may include Streaming and Containers, with added functionality, scalability, flexibility, etc.
- High risk, highest migration cost
- Lowest ongoing cost
- Highest business value addition
- Longest time to market

Each of these scenarios has its advantages and disadvantages, but bear in mind that only the latter two, which imply architecture rework, deliver tangible results in terms of total cost of ownership, as summarized in the image below.

(Figure: cost reduction levers on the path from Lift & Shift to EMR Optimized: on-premises instance right-sizing, true TCO comparison, S3 vs. HDFS, transient clusters, auto-scaling, Spot pricing, and automated orchestration.)

(We will talk more about the cost reduction opportunities of the above-mentioned scenarios further down the line.)

Risk Mitigation Strategies

Correctly balancing risk and reward is of substantial importance to any organization that needs to migrate its Big Data, Apache Hadoop, or Apache Spark workloads to Amazon EMR. Be it the simplest Lift & Shift or a radically more complex re-architecture, the following strategies will help:

- Analyze all applications and workloads to ascertain the exact amount of compute, memory, and storage they require. Also identify the run time of day, week, or month, and any other infrastructure needs.
- Develop a business value model and an implementation complexity model for all applications and workloads. Create a business value vs. complexity prioritization matrix to be aware of all potential pain points.
- Ensure an organized mirroring of data loads between the on-premises Apache Hadoop cluster and the Amazon EMR cluster.
- Create and share a detailed migration plan among all stakeholders. Move workloads to Amazon EMR in an orderly fashion.
- Identify excited innovators within each business unit to spread the word about the potential of Amazon EMR. Use their help to move from on-prem to Amazon EMR.

By following the tips above (and by choosing the correct migration scenario right off the bat), you safeguard your organization from critical mistakes that can cost you time and money. In the next section, we will learn about best practices to optimize Amazon EMR migration costs.

Cost Optimization for Hadoop/Spark Workloads with Amazon EMR

Companies that migrate their Apache Hadoop/Spark workloads to Amazon EMR can expect to reduce IT infrastructure costs while benefiting from better team productivity and performance.

Given the variety of migration patterns and scenarios, cost optimization opportunities also vary significantly from company to company. Let's look at the cost optimization factors that impact how much an organization can save across the three major migration scenarios.

1. Lift & Shift — 10-20% cost reduction
- CapEx to OpEx
- On-prem license fees
- Maintenance overhead
- Uncertainty in Hadoop vendors

2. Re-Architect — 10-40% cost reduction
- Decoupled storage and compute
- Transient clusters
- Spot pricing
- Optimized hardware
- Autoscaling
- Amazon S3 lifecycle
- Proprietary Spark EMR engine

3. Next Gen Architecture — 20-90% cost reduction
- Data pipelines optimization
- Workload decomposition (EMR, Redshift, Athena, SageMaker)
- Serverless ETL
- Serverless Data Catalog
- Serverless ad-hoc queries
- Streaming processing
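To illustrate the serverless ad-hoc query factor listed above, here is a small boto3 sketch that runs a SQL query against data in S3 with Amazon Athena, so no long-running cluster is needed for occasional analysis. The database, table, and output bucket names are hypothetical.

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is a placeholder

# Run an ad-hoc SQL query directly against data in S3; you pay per query, not per cluster.
query = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS hits "
        "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "analytics_db"},                       # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)

print("Query started:", query["QueryExecutionId"])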


Cost Reduction & Performance Impact by migration scenario

At this point, it is clear that an organization's potential to save by migrating to Amazon EMR primarily depends on the migration scenario and on its willingness to invest time and resources in a full-scale rework of architecture, from Lift & Shift to EMR Optimized.

However, no matter which scenario your organization chooses to pursue, you will be able to optimize costs simply by using Amazon S3 as your persistent storage, by taking advantage of autoscaling, or by turning your clusters on and off.
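As a concrete example of the autoscaling lever, the sketch below attaches an EMR managed scaling policy to an existing cluster so instance counts shrink when the cluster is idle and grow under load; a minimal, illustrative sketch in which the cluster ID and capacity limits are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

# Attach a managed scaling policy so the cluster scales between 2 and 20 instances
# based on load, instead of running at fixed peak capacity around the clock.
emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLE12345",  # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 20,
            "MaximumOnDemandCapacityUnits": 5,   # cap On-Demand; the rest can come from Spot
            "MaximumCoreCapacityUnits": 10,
        }
    },
)

Because the data itself lives in Amazon S3 rather than HDFS, a cluster like this can be scaled in, or even terminated, without losing any data.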

The key advantage of Amazon EMR from a cost optimization standpoint is that it exists within a next-generation ecosystem that is designed to support each and every task pertaining to AI/ML, Big Data, Analytics, and more.

Just consider two of the most common Apache Hadoop pipelines below:

(Figure: Apache Hadoop pipeline #1 — Heavy Processing dependent. Stages: Sources > Ingestion > Heavy Processing > Lightweight Processing > Query. Annotations: full scans, data re-shuffles, from-scratch processing, big joins, MPP queries, fast selects, and flat SQL transforms; the heavy-processing stages account for about 90% of your Hadoop costs.)


(Figure: Apache Hadoop pipeline #2 — Query dependent. Stages: Sources > Ingestion > Heavy Processing > Lightweight Processing > Query. Annotations: full scans, data re-shuffles, big joins, and re-partitioning, which account for about 90% of your Hadoop costs.)

Both pipelines do their job but are costly when it comes to full scans, big joins, data re-shuffling, from-scratch processing, and re-partitioning — because all of these operations are performed on-premises.

Compare those to an EMR-based pipeline for Hadoop:

(Figure: Hadoop pipeline on Amazon EMR. Stages: Sources > Ingestion (CDC) > Stream & Batch Processing (incremental processing, flat SQL transforms) > Query (reports serving), delivering a 2-3x cost reduction.)

As seen in the image above, opting for incremental processing and flat SQL transforms at the Stream & Batch Processing stage of your pipeline enables organizations to reduce costs by 2-3x. This is possible because of the ecosystem built around data lakes on AWS.
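A minimal PySpark sketch of the incremental-processing idea: instead of re-scanning the whole dataset, the job reads only the partition for the current processing date, applies a flat SQL transform, and appends the result to S3. Paths, partition names, and the schema are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-flat-transform").getOrCreate()

process_date = "2020-06-01"  # in practice, passed in by the orchestrator

# Read only the newly arrived partition instead of performing a full scan.
events = spark.read.parquet(f"s3://example-data-lake/raw/events/dt={process_date}/")
events.createOrReplaceTempView("events")

# A flat SQL transform: no wide joins or re-shuffles of historical data.
daily_summary = spark.sql("""
    SELECT user_id,
           COUNT(*)     AS event_count,
           SUM(revenue) AS revenue
    FROM events
    GROUP BY user_id
""")

# Append the increment; previously processed partitions are left untouched.
(daily_summary
    .write
    .mode("append")
    .parquet(f"s3://example-data-lake/curated/daily_summary/dt={process_date}/"))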


The ecosystem around the Data Lake on AWS:

Analytics: Amazon Athena, Amazon Redshift, Amazon Kinesis, Amazon EMR, Amazon Elasticsearch, Amazon QuickSight
Machine Learning: Amazon SageMaker, Amazon Textract, Amazon Personalize, Amazon Rekognition, Amazon Comprehend, Amazon Forecast
On-prem Data Movement: AWS Direct Connect, AWS Snowball, AWS Storage Gateway, AWS Snowmobile
Real-time Data Movement: AWS IoT Core, AWS Kinesis Data Streams, AWS Kinesis Firehose, AWS Kinesis Video Streams

Overall, AWS and Amazon EMR offer a wide selection of means and methods to realize cost savings, from the cloud's innate agility and flexibility, to architecture improvements, to "big picture" ecosystem advantages.


Summary

Amazon EMR is the industry-leading cloud Big Data platform. Envisioned by Amazon Web Services as the ultimate tool for Apache Hadoop, Hive, Spark, Map/Reduce, and machine learning workloads, Amazon EMR is coupled with Amazon EC2 and Amazon S3 within the AWS ecosystem to process vast amounts of data quickly and cost-effectively, at scale.

The service is ideal for organizations willing to run large-scale analysis for a fraction of the cost of traditional on-prem clusters, while avoiding the complexities associated with manual provisioning and setup of data lake resources, tuning and fine-tuning of environments, and other operational challenges.

Amazon EMR is a golden opportunity for businesses and IT teams looking to migrate to the cloud, reduce costs, increase team productivity, and eliminate administrative uncertainty. Many think tanks, from Gartner and Forrester to IDC, have proven the advantages of Amazon EMR over on-premises Apache Hadoop and Apache Spark.


Amazon Web Services distinguishes six distinct Hadoop migration scenarios among three major patterns, including Lift & Shift, Re-Architect, and Next Gen Architecture.

Every migration scenario has its pros and cons: Lift & Shift is the quickest and easiest but only reduces costs by up to 20%; Re-Architect is the middle ground; Next Gen Architecture is the most complex option, but it can reduce costs by up to 90%. Specific strategies can help organizations mitigate migration risks and move more quickly from on-premises to the cloud.

Amazon EMR offers organizations an arsenal of features to realize cost savings. These include decoupled compute and storage, built-in disaster recovery, "flexible" transient clusters, autoscaling of persistent clusters, auto-scalable nodes with Spot Instances, a variety of instance types and instance fleets, EMR Notebooks, and more.

At Provectus, we are dedicated to helping businesses reduce the cost and increase the performance of their Big Data solutions on Apache Hadoop/Spark by migrating to Amazon EMR. We use industry best practices to assess the scope of migration and craft a smart migration strategy around architecture improvements and cost optimization opportunities.

Amazon EMR Migration Workshop

Learn more about the Amazon EMR Migration Program and request the workshop.


About Provectus

Provectus is an Artificial Intelligence consultancy and solutions provider helping businesses achieve their objectives through AI. Provectus is a Premier Consulting Partner with AWS competencies in Machine Learning, Data & Analytics, and DevOps.

Contact Us

125 University Avenue, Suite 290
Palo Alto, California, 94301
+1 (800) 950-9840
[email protected]
provectus.com


Index

1. aws.amazon.com/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
2. aws.amazon.com/aws-cost-management/aws-cost-optimization/
3. go.forrester.com/blogs/insight-paas-accelerate-big-data-cloud/
4. aws.amazon.com/getting-started/hands-on/optimize-amazon-emr-clusters-with-ec2-spot/
5. wikibon.com/2016-2026-worldwide-big-data-market-forecast/
6. aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/
7. d1.awsstatic.com/analyst-reports/IDC%20Economic%20Benefits%20of%20Migrating%20to%20EMR%20White%20Paper.pdf
8. d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf