FEDERATION WHITE PAPER
VIRTUALIZING HADOOP IN LARGE-SCALE INFRASTRUCTURES
How Adobe Systems achieved breakthrough results in Big Data analytics with Hadoop-as-a-Service
ABSTRACT
Large-scale Apache Hadoop analytics have long eluded the
industry, especially in virtualized environments. In a
ground-breaking proof of concept (POC), Adobe Systems demonstrated that Hadoop-as-a-Service (HDaaS) running on a virtualized and centralized infrastructure could handle large-scale data analytics workloads. This white paper documents the POC's infrastructure design, initial obstacles, and successful completion, along with sizing and configuration details and best practices. Importantly,
the paper also underscores how HDaaS built on an integrated and
virtualized infrastructure delivers outstanding performance,
scalability, and efficiency, paving the path toward larger-scale
Big Data analytics in Hadoop environments.
December, 2014
To learn more about how EMC products, services, and solutions
can help solve your business and IT challenges, contact your local
representative or authorized reseller, visit www.emc.com, or
explore and compare products in the EMC Store
Copyright 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.
The information in this publication is provided as is. EMC
Corporation makes no representations or warranties of any kind with
respect to the information in this publication, and specifically
disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC
Corporation Trademarks on EMC.com.
VMware and vSphere are registered trademarks or trademarks of
VMware, Inc. in the United States and/or other jurisdictions. All
other trademarks used herein are the property of their respective
owners.
Part Number H13856
TABLE OF CONTENTS
EXECUTIVE SUMMARY 4
INTRODUCTION 5
BOLD ARCHITECTURE FOR HDAAS 7
NAVIGATING TOWARD LARGE-SCALE HDAAS 8
A Few Surprises 8
Diving in Deeper 8
Relooking at Memory Settings 8
Modifying Settings Properly with BDE 9
Bigger is Not Always Better 9
Storage Sizing Proved Successful 9
BREAKTHROUGH IN HADOOP ANALYTICS 10
Impressive Performance Results 10
Breaking with Tradition Adds Efficiency 11
Stronger Data Protection 11
Freeing the Infrastructure 11
BEST PRACTICE RECOMMENDATIONS 12
Memory Settings are Key 12
Understand Sizing and Configuration 12
Acquire or Develop Hadoop Expertise 12
NEXT STEPS: LIVE WITH HDAAS 12
EXECUTIVE SUMMARY
Apache Hadoop has become a prime tool for analyzing Big Data and
achieving greater insights that help organizations improve
strategic decision making. Traditional Hadoop clusters have proved
inefficient for handling large-scale analytics jobs sized at
hundreds of terabytes or even petabytes. Adobe's Digital Marketing organization, which runs data analytics jobs on this scale, was seeing increased internal demand to use Hadoop for analysis of the company's existing eight-petabyte data repository.
To address this need, Adobe explored an innovative approach to
Hadoop. Rather than running traditional Hadoop clusters on
commodity servers with locally attached storage, Adobe virtualized the Hadoop computing environment and used its existing EMC Isilon storage, where the eight-petabyte data repository resides, as the central location for Hadoop data.
Adobe enlisted the resources, technologies, and expertise of EMC,
VMware, and Cisco to build a reference architecture for virtualized
Hadoop-as-a-Service (HDaaS) and perform a comprehensive proof of
concept. While the five-month POC encountered some challenges, the
project also yielded a wealth of insights and understanding
relating to how Hadoop operates and its infrastructure
requirements.
After meticulous configuring, refining, and testing, Adobe
successfully ran a 65-terabyte Hadoop job, one of the industry's largest to date in a virtualized environment. This white paper
details the process that Adobe and the POC team followed that led
to this accomplishment.
The paper includes specific configurations of the virtual HDaaS
environment used in the POC. It covers the initial obstacles and how the POC team overcame them, and documents how the team adjusted settings, sized systems, and reconfigured the environment to support large-scale Hadoop analytics in a virtual environment with centralized storage.
Most importantly, the paper presents the POC's results, along with valuable best practices for other organizations interested in pursuing similar projects. The last section describes Adobe's plans
to bring virtual HDaaS to production for its business users and
data scientists.
INTRODUCTION
Organizations across the world increasingly view Big Data as a
prime source of competitive differentiation, and analytics as the
means to tap this source. Specifically, Hadoop enables data
scientists to perform sophisticated queries against massive volumes
of data to gain insights, discover trends, and predict outcomes. In
fact, a GE and Accenture study reported that 84 percent of survey
respondents believe that using Big Data analytics has the power to
shift the competitive landscape for my industry" in the next
year.1
Apache Hadoop, an increasingly popular environment for running
analytics jobs, is an open source framework for storing and
processing large data sets. Traditionally running on clusters of
commodity servers with local storage, Hadoop comprises multiple
components, primarily the Hadoop Distributed File System (HDFS) for
data storage, Yet Another Resource Negotiator (YARN) for managing
system resources like memory and CPUs, and MapReduce for processing
massive jobs by splitting up input data into small subtasks and
collating results.
At Adobe, a global leader in digital marketing and digital media solutions, the Technical Operations team uses traditional Hadoop clusters to deliver Hadoop-as-a-Service (HDaaS) in a private cloud
for several application teams. These teams run Big Data jobs such
as log and statistical analysis of application layers to uncover
trends that help guide product enhancements.
Elsewhere, Adobe's Digital Marketing organization tracks and analyzes customers' website statistics, which are stored in an eight-petabyte data repository on EMC Isilon storage. Adobe Digital Marketing would like to use HDaaS for more in-depth analysis that would help its clients improve website effectiveness, correlate site visits to revenue, and guide strategic business decisions. Rather than moving data from a large data repository to the Hadoop clusters, a time-consuming task, Technical Operations determined it would be most efficient to simply use Hadoop to access data sets on the existing Isilon-based data repository.
Adobe has a goal of running analytics jobs against data sets
that are hundreds of terabytes in size. Simply adding commodity
servers to Hadoop clusters would become highly inefficient,
especially since traditional Hadoop clusters require three copies
of the data to ensure availability. Adobe also was concerned that
current Hadoop versions lack high-availability features. For example, Hadoop supports only two NameNodes, which track where data resides in the environment. If both NameNodes fail, the entire Hadoop cluster would collapse.
Technical Operations proposed separating the Hadoop elements and
placing them where they can scale more efficiently and reliably.
This meant using Isilon, where Adobe's file-based data repository is
stored, for centralized Hadoop storage and virtualizing the Hadoop
cluster nodes to enable more flexible scalability and lower compute
costs. (Figures 1 and 2)
Figure 1. Traditional Hadoop Architecture
1 "Industrial Internet Insights Report for 2015." GE, Accenture.
2014.
Figure 2. Virtual Hadoop Architecture with Isilon
Despite internal skepticism about a virtualized infrastructure handling Hadoop's complexity, Technical Operations recognized a compelling upside: improving efficiency and increasing scalability to a level that had not been achieved for single-job data sets in a virtualized Hadoop environment with Isilon. This is enticing,
especially as data analytics jobs continue to grow in size across
all environments.
"People think by that virtualizing Hadoop, you're going to take
a performance hit. But we showed that's not the case. Instead you
get added flexibility that actually unencumbers your
infrastructure."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
To explore the possibilities, Adobe Technical Operations
embarked on a virtual HDaaS POC for Adobe's Digital Marketing organization. The infrastructure comprised EMC, VMware, and Cisco
solutions and was designed to test the outer limits of Big Data
analytics on Isilon and VMware using Hadoop.
Key objectives of the POC included:
• Building a virtualized HDaaS environment to deliver analytics through a self-service catalog to internal Adobe customers
• Decoupling storage from compute by using EMC Isilon to provide HDFS, ultimately enabling access to the entire data repository for analytics
• Understanding sizing and security requirements of the integrated EMC Isilon, EMC VNX, VMware, and Cisco UCS infrastructure to support larger-scale HDaaS
• Proving an attractive return on investment and total cost of ownership in virtualized HDaaS environments compared to physical in-house solutions or public cloud services such as Amazon Web Services
• Documenting key learnings and best practices
The results were impressive. While the POC uncovered some surprises, Adobe gained valuable knowledge for future HDaaS projects. Ultimately, Adobe ran some of the largest Hadoop data analytics jobs to date in a virtualized HDaaS environment. It was a groundbreaking achievement and heralds a new era of scale and efficiency for Big Data analytics.
BOLD ARCHITECTURE FOR HDAAS
The POC's physical topology is built on Cisco Unified Computing System (UCS), Cisco Nexus networking, EMC VNX block storage, and
EMC Isilon scale-out storage. (Figure 3)
Figure 3. HDaaS Hardware Topology
At the compute layer, Adobe was particularly interested in Cisco
UCS for its firmware management and centralized configuration
capabilities. Plus, UCS provides a converged compute and network
environment when deployed with Nexus.
VNX provides block storage for VMware ESX hosts and virtual
machines (VMs) that comprise the Hadoop cluster. Adobe's focus was
learning the VNX sizing and performance requirements to support
virtualized HDaaS.
An existing Isilon customer, Adobe especially liked Isilon's data lake concept, which enables access to one source of data through multiple protocols, such as NFS, FTP, Object, and HDFS. In the POC,
data was loaded onto Isilon via NFS and accessed via HDFS by
virtual machines in the Hadoop compute cluster. The goal was to
prove that Isilon delivered sufficient performance to support large
Hadoop workloads.
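As an illustration of this decoupling, the fragment below is a minimal core-site.xml sketch, not the POC's actual configuration: a compute-only Hadoop cluster points its default file system at the Isilon cluster rather than at a local NameNode. The SmartConnect zone name shown is invented, and 8020 (the usual HDFS port on OneFS) may differ in a given deployment.
<!-- Minimal sketch (not the POC's actual settings): a compute-only Hadoop
     cluster pointing HDFS at Isilon. The host name is invented; the port
     (8020, the usual HDFS port on OneFS) may vary per deployment. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://isilon-smartconnect.example.com:8020</value>
  </property>
</configuration>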
To deploy, run, and manage Hadoop on a common virtual
infrastructure, Adobe relied on VMware Big Data Extensions (BDE), an essential software component of the overall environment. Adobe
already used BDE in its private cloud HDaaS deployment and wanted
to apply it to the new infrastructure.
BDE enabled Adobe to automate and simplify deployment of
hundreds of virtualized Hadoop compute nodes that were tied
directly to Isilon for HDFS. During testing, Adobe also used BDE to
deploy, reclaim, and redeploy the Hadoop cluster more than 30 times
to evaluate different cluster configurations. Without the
automation and flexibility of BDE, Adobe would not have been able
to conduct such a wide range and high volume of tests within such a
short timeframe.
In this POC, Adobe used Pivotal HD as an enhanced Hadoop
distribution framework but designed the infrastructure to run any
Hadoop distribution.
The following tools assisted Adobe with monitoring, collecting, and reporting on metrics generated by the POC:
• EMC VNX Monitoring and Reporting Suite (M&R)
• EMC Isilon InsightIQ (IIQ)
• VMware vCenter Operations Manager (vCOps)
• Cisco UCS Director (UCSD)
NAVIGATING TOWARD LARGE-SCALE HDAAS
The POC spanned five months from hardware delivery through final
testing. Adobe expected the infrastructure components to integrate
well, provide a stable environment, and perform satisfactorily.
In fact, the POC team implemented the infrastructure in about
one and a half weeks. Then it put Isilon to the test as the HDFS
data store, and evaluated how well Hadoop ran in a virtualized
environment.
A FEW SURPRISES
Adobe ran its first Hadoop MapReduce job in the virtual HDaaS
environment within three days of initial set-up. Smaller data sets
of 60 to 450 gigabytes performed well, but the team hit a wall
beyond 450 gigabytes.
The team focused on the Hadoop job definition and configuration to determine whether it was written correctly and using memory efficiently. In researching the industry at large, Adobe
learned that most enterprise Hadoop environments were testing data
on a small scale. In fact, Adobe did not find another Hadoop POC or
implementation that exceeded 10 terabytes for single-job data sets
in a virtualized Hadoop environment with Isilon.
"When we talked to other people in the industry, we realized we
were on the forefront of scaling Hadoop at levels possibly never
seen before."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
After four weeks of tweaking the Hadoop job definition and
adjusting memory settings, the team successfully ran a six-terabyte
job. Pushing beyond six terabytes, the team sought to run larger
data sets upwards of 60 terabytes. The larger jobs again proved
difficult to complete successfully.
DIVING IN DEEPER
The next phase involved Adobe Technical Operations enlisting
help from storage services, compute platforms, research scientists,
data center operations, and network engineering. Technical
Operations also reached out to the POC's key partners: EMC (including Isilon and Pivotal), VMware, Cisco, and Trace3, an EMC value-added reseller and IT systems integrator.
The team, which included several Hadoop experts, dissected
nearly every element of the HDaaS environment. This included Hadoop
job definitions, memory settings, Java memory allocations, command
line options, physical and virtual infrastructure configurations,
and HDFS options.
"We had several excellent meetings with Hadoop experts from EMC
and VMware. We learned an enormous amount that helped us solve our
initial problems and tweak the infrastructure to scale the way we
wanted."
Jason Farnsworth, Senior Storage Engineer, Adobe Systems
Relooking at Memory Settings
Close inspection revealed that Hadoop lacked the maturity to perform well in virtualized environments. For example, some operations
launched through VMware BDE did not function properly on Hadoop,
requiring significant tweaking. Complicating matters, the team
learned that Hadoop error messages did not clearly describe the
problem or indicate the origin.
Most notably, the team discovered that Hadoop lacked sufficient
intelligence to analyze memory requirements for large analytics
jobs. This necessitated manually adjusting memory settings.
The POC team recommends the following memory settings as a good
starting point for organizations to diagnose scaling and
job-related issues when testing Hadoop in larger-scale
environments:
YARN Settings
• Amount of physical memory, in megabytes, that can be allocated for containers: yarn.nodemanager.resource.memory-mb=x, where x = memory in megabytes. BDE has a base calculation for this value according to how much RAM is allocated to the workers on deployment. Default value is 8192.
• Minimum container memory for YARN, the minimum allocation for every container request at the ResourceManager, in megabytes: yarn.scheduler.minimum-allocation-mb=x, where x = memory in megabytes. Default value is 1024.
• Application Master memory: yarn.app.mapreduce.am.resource.mb=x, where x = memory in megabytes. Default value is 1536.
• Java options for the Application Master (JVM heap size): yarn.app.mapreduce.am.command-opts=x, where x = memory in megabytes passed as a Java option (e.g., -Xmx7000m). Default value is -Xmx1024m.
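The fragment below is a minimal yarn-site.xml sketch showing how these four properties might be expressed. The values are placeholders only; appropriate numbers depend on worker RAM, and in a BDE-managed cluster the settings should be applied through BDE (see the following section) so they persist.
<!-- Illustrative yarn-site.xml fragment; placeholder values only. -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>7000</value>   <!-- memory per NodeManager available to containers -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>   <!-- smallest container the ResourceManager will grant -->
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1536</value>   <!-- container size for the MapReduce Application Master -->
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx1024m</value>   <!-- JVM heap for the Application Master -->
  </property>
</configuration>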
Mapred Settings
• Mapper memory: mapreduce.map.memory.mb=x, where x = memory in megabytes. Default value is 1536.
• Reducer memory: mapreduce.reduce.memory.mb=x, where x = memory in megabytes. Default value is 3072.
• Mapper Java options (JVM heap size), the heap size for child JVMs of maps: mapreduce.map.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx2000m). Default value is -Xmx1024m.
• Reducer Java options (JVM heap size), the heap size for child JVMs of reduces: mapreduce.reduce.java.opts=x, where x = memory passed as a Java option (e.g., -Xmx4000m). Default value is -Xmx2560m.
• Maximum size of the split metainfo file: mapreduce.jobtracker.split.metainfo.maxsize=x, where x = 10000000 by default. The POC team set this to -1, which disables the limit so any size is allowed.
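A corresponding mapred-site.xml sketch, again with placeholder values, might look like the following. Heap sizes (the java.opts values) are commonly set to roughly 80 percent of the matching container size, and the split metainfo limit is set to -1 as the POC team did.
<!-- Illustrative mapred-site.xml fragment; placeholder values only. -->
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>   <!-- container size for each map task -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1638m</value>   <!-- map JVM heap, ~80% of the map container -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>   <!-- container size for each reduce task -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3276m</value>   <!-- reduce JVM heap, ~80% of the reduce container -->
  </property>
  <property>
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>-1</value>   <!-- -1 removes the split metainfo size limit -->
  </property>
</configuration>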
For guidance on baseline values to use in these memory settings,
the POC team recommends the following documents:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
https://support.pivotal.io/hc/en-us/articles/201462036-Mapreduce-YARN-Memory-Parameters
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/r2.5.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Modifying Settings Properly with BDE
Both the virtual and physical infrastructure required
configuration adjustments. Since VMware BDE acts as a management
service layer on top of Hadoop, the team relied on BDE to modify
Hadoop settings to ensure they were properly applied to the virtual
clusters and remained persistent. Changing settings directly on individual servers would not apply modifications consistently across all the virtual clusters. The team also kept in mind that
stopping, restarting, or redeploying a cluster through BDE would
automatically reset all the node settings to their default
values.
Bigger is Not Always Better
The POC revealed that the configuration of physical servers
(hosts) and virtual servers (Hadoop workers or guests) affected
Hadoop performance and cost efficiency.
For example, a greater number of physical cores (CPUs) at lower clock speeds delivered better performance than fewer cores at higher clock speeds. At a higher cost, the same high core count at higher clock speeds delivered even better performance.
In a virtual environment, fewer virtual CPUs (vCPUs) with a greater number of Hadoop workers performed and scaled better than a greater number of vCPUs supporting fewer workers.
The team also learned to keep all physical hosts in the VMware cluster configured identically, with no variations among hosts. This way, VMware Distributed Resource Scheduler would not be invoked to spend time and resources rebalancing the cluster, and resources instead would be immediately available to Hadoop. BDE also was especially valuable
in ensuring that memory settings and the alignment between cores
and VMs were consistent.
Storage Sizing Proved Successful
Both VNX and Isilon performed perfectly in the POC. The team
sized VNX to hold both the VMware environment and the Hadoop intermediate space, the temporary space used by Hadoop jobs such as MapReduce. Intermediate space can also be configured to be stored
directly on the Isilon cluster, but this setting was not tested
during the POC.
Technical Operations also tested various HDFS block sizes,
resulting in performance optimizations. Depending on job and
workload, the team found that block sizes of 64 megabytes to 1024
megabytes drove optimal throughput. The 12 Isilon X-Series nodes
with two-terabyte drives provided more than enough capacity and
performance for tested workloads, and could easily scale to support
Hadoop workloads hundreds of terabytes in size.
While the POC's Isilon did not incorporate flash technology, the
team noted that adding flash drives would provide a measurable
performance increase.
BREAKTHROUGH IN HADOOP ANALYTICS
After eight weeks of fine-tuning the virtual HDaaS
infrastructure, Adobe succeeded in running a 65-terabyte Hadoop workload, significantly larger than the largest known virtual Hadoop workloads. In addition, this was the largest workload ever tested
by EMC in a virtual Hadoop environment on Isilon.
Fundamentally, these results proved that Isilon as the HDFS
layer worked. In fact, the POC refutes claims by some in the
industry that suggest shared storage will cause problems with
Hadoop. On the contrary, Isilon had no adverse effects and even contributed to superior results in a virtualized HDaaS environment compared to traditional Hadoop clusters. These advantages apply to
many aspects of Hadoop, including performance, storage efficiency,
data protection, and flexibility.
"Our results proved that having Isilon act as the HDFS layer was
not adverse. In fact, we got better results with Isilon than we
would have in a traditional cluster."
Chris Mutchler, Compute Platform Engineer, Adobe Systems
IMPRESSIVE PERFORMANCE RESULTS
With compute resources allocated in small quantities to a large
number of VMs, job run time improved significantly. (Figures 4 and
5) Furthermore, the test demonstrated that Isilon performed well
without flash drives.
Figure 4. TeraSort Job Run Time by Worker Count
Figure 5. Adobe Pig Job Run Time by Worker Count
The team concluded that Hadoop performs better in a scale-out
rather than scale-up configuration. That is, jobs complete more
quickly when run on a greater number of compute nodes, so having
more cores is more important than having faster processors. In
fact, performance improved as the number of workers increased.
Tests were run with the following cluster configurations:
• 256 workers: 1 vCPU, 7.25 GB RAM, 30 GB intermediate space
• 128 workers: 2 vCPUs, 14.5 GB RAM, 90 GB intermediate space
• 64 workers: 4 vCPUs, 29 GB RAM, 210 GB intermediate space
• 32 workers: 8 vCPUs, 58 GB RAM, 450 GB intermediate space
BREAKING WITH TRADITION ADDS EFFICIENCY
Traditional Hadoop clusters require three copies of the data in
case servers fail. Isilon eliminates the need to triple storage capacity thanks to the built-in data protection capabilities of the Isilon OneFS operating system.
For example, in a traditional Hadoop cluster running jobs against eight petabytes of data, the infrastructure would require 24 petabytes of raw disk capacity (a 200 percent overhead) to accommodate three copies. The same eight petabytes of Hadoop data stored on Isilon requires only 9.6 petabytes of raw disk capacity, roughly 20 percent protection overhead and a nearly 60 percent reduction in raw capacity. Not only does Isilon save on storage, but it also streamlines storage administration by eliminating the need to oversee numerous islands of storage. Using Adobe's eight-petabyte data set in a traditional environment would require 24 petabytes of local disk capacity, necessitating thousands of Hadoop nodes when hundreds of compute nodes would be adequate.
Enabling a data lake, Isilon OneFS provides enterprises with one central repository of data accessible through multiple protocols. Rather than requiring a separate, purpose-built HDFS device, Isilon supports HDFS along with NFS, FTP, SMB, HTTP, NDMP, Swift, and Object. (Figure 6) This allows organizations to bring Hadoop to the data, a more streamlined approach than moving data to Hadoop.
Figure 6. Isilon Data Lake Concept with Multi-protocol Support
STRONGER DATA PROTECTION
Isilon provides secure control over data access by supporting
POSIX for granular file access permissions. Isilon stores data in a
POSIX-compliant file system with SMB and NFS workflows that users
can also access through HDFS for MapReduce. Isilon protects
partitioned subsets of data with access zones that prevent
unauthorized access.
In addition, Isilon offers rich data services that are not
available in traditional Hadoop environments. For example, Isilon
enables users to create snapshots of the Hadoop environment for
point-in-time data protection or to create duplicate environments.
Isilon replication also can synchronize Hadoop to a remote site,
providing even greater protection. This allows organizations to
keep Hadoop data secure on premises, rather than moving data to a
public cloud.
FREEING THE INFRASTRUCTURE
Virtualizing HDaaS introduces greater opportunities for
flexibility, unencumbering the infrastructure from physical
limitations. Instead of traditional bare-metal clusters with rigid
configurations, virtualization allows organizations to tailor
Hadoop VMs to their individual workloads and even use existing
compute infrastructure. This is key to optimizing performance and
efficiency. Plus, virtualization facilitates multi-tenancy and
offers additional high-availability advantages through fluid
movement of VMs from one physical host to another.
BEST PRACTICE RECOMMENDATIONS
Several important lessons learned and best practices were
documented from this breakthrough POC, as follows.
MEMORY SETTINGS ARE KEY
It's important to recognize that Hadoop is still a maturing product and does not automatically determine optimal memory requirements. Memory settings are crucial to achieving sufficient
performance to run Hadoop jobs against large data sets. EMC
recommends methodically adjusting memory settings and repeatedly
testing configurations until the optimal environment is
achieved.
UNDERSTAND SIZING AND CONFIGURATION
Operating at Adobe's scale, hundreds of terabytes to tens of petabytes, demands close attention to sizing and configuration of virtualized infrastructure components. Since no two Hadoop jobs are
alike, IT organizations must thoroughly understand the data sets
and jobs their customers plan to run. Key sizing and configuration
insights from this POC include:
• Devote ample time upfront to sizing storage layers based on workload and scalability requirements. Sizing for Hadoop intermediate space also deserves careful consideration.
• Consider setting larger HDFS block sizes of 256 to 1024 megabytes to ensure sufficient performance. On Isilon, HDFS block size is configured as a protocol setting in the OneFS operating system.
• In the compute environment, deploy a large number of hosts using processors with as many cores as possible and align the VMs to those cores. In general, having more cores is more important than having faster processors and results in better performance and scalability.
• Configure all physical hosts in the VMware cluster identically. For example, mixing eight-core and ten-core systems will make CPU alignment challenging when using BDE. Different RAM amounts also will cause unwanted overhead while VMware's Distributed Resource Scheduler moves virtual guests.
ACQUIRE OR DEVELOP HADOOP EXPERTISE
Hadoop is complex, with numerous moving parts that must operate
in concert. For example, MapReduce settings may affect Java, which
may, in turn, impact YARN. EMC recommends that organizations wishing to use Hadoop ramp up gradually and review the many resources available to help simplify Hadoop implementation with Isilon.
Hadoop insights also may be achieved through "tribal" sharing of
experiences among industry colleagues, as well as formal
documentation and training. The POC team recommends these resources
as a starting place:
• EMC Isilon Free Hadoop website
• EMC Hadoop Starter Kit
• EMC Isilon Best Practices for Hadoop Data Storage white paper
• EMC Big Data website
When building and configuring the virtual HDaaS infrastructure,
companies should select vendors with extensive expertise in Hadoop
and especially in large-scale Hadoop environments. EMC, VMware, and
solution integrators with Big Data experience can help accelerate a
Hadoop deployment and ensure success.
Because of the interdependencies among the many components in a
virtual HDaaS infrastructure, internal and external team members
will need broad knowledge of the technology stack, including
compute, storage, virtualization, and networking, with deep
understanding of how each performs separately and together. While IT as a whole is still evolving toward integrated skill sets, EMC has been at the forefront of this trend and can provide insights and guidance.
NEXT STEPS: LIVE WITH HDAAS
With the breakthrough results of this POC, Adobe plans to take
the HDaaS reference architecture using Isilon into production and
test even larger Hadoop jobs. To generate additional results, Adobe will also run a variety of Hadoop jobs on the virtual HDaaS platform repeatedly, as many as hundreds of times. The goal is to
demonstrate that virtual HDaaS can deliver and is ready for large
production applications.
While the POC pointed one Hadoop cluster to Isilon, additional
testing will focus on multiple Hadoop clusters accessing data sets
on Isilon to further prove scalability. This multi-tenancy
capability is crucial for supporting multiple analytics teams with
separate projects. Adobe Technical Operations plans to run Hadoop
jobs through Isilon access zones to ensure isolation is preserved
without impacting performance or scalability.
In addition, the team plans to move intermediate space from VNX
block storage to Isilon and evaluate the impact of additional I/O
on Isilon. Adobe also expects that an all-flash array such as EMC
XtremIO would provide an excellent option for block storage in
place of VNX.
Additional configuration adjustments and testing are well worth
the effort to Adobe and present tremendous opportunities for the
analytics community as a whole. Using centralized storage, such as
Isilon, provides a common data source rather than creating numerous
storage locations for multiple Hadoop projects. The flexibility and scalability of the virtual HDaaS environment are also of great value as Hadoop jobs continue to grow in size.
Most important, moving virtual HDaaS into production will enable Adobe's data scientists to query the entire data set residing on Isilon. By doing so, they will have a powerful way to gain more insight and intelligence that can be presented to Adobe's customers, providing both Adobe and its customers with a strong competitive advantage.