DELL EMC ISILON DATA LAKE WITH POWEREDGE SERVERS
Recommended Configurations
Kris Applegate, Solution Architect, Dell EMC Customer Solution Centers
Boni Bruno, Principal Solution Architect, Dell EMC Emerging Technology Team
Armando Acosta, Product Manager, Dell EMC Converged Platform Division
Sai Devulapalli, Data Analytics Practice Lead, Dell EMC Emerging Technology Team
ABSTRACT
This white paper details the validated configuration for connecting Dell EMC Isilon to Dell EMC PowerEdge servers. We will also detail some recommended configurations as well as provide guidance on optional modifications for tailoring to each customer’s use case.
December 2016
As our most popular server platform, the R630 has been battle-tested across almost every possible use case. In this configuration, we take
advantage of its many drive slots, for either rotational media or solid-state drives, as well as its ample network bandwidth (both data and
client-facing).
Figure 4. Dell EMC PowerEdge R630
Hadoop Roles
Compute Node(s)
With all shared filesystem responsibilities taken care of by the Isilon, these nodes’ primary role is to provide the computational
horsepower to comb through all the data. However, they do still need some local storage to help cache or accelerate those operations.
With the drastic cost reductions in flash over the last few years, some customers choose to build this local space from Solid State
Drives (SSDs). The use of SSDs isn’t a hard and fast requirement, but it is becoming a common request as SSD prices continue to
fall.
Function                           Disks                    Type
Operating System                   2                        RAID 1 (Mirror)
Spark Scratch / Map Reduce Spill   2-10 (Optionally SSD)    Non-RAID or RAID 0
Table 2. Data Node Disk Layout
Dell - Internal Use - Confidential
Infrastructure Nodes
The number of infrastructure servers will vary from customer to customer. In our recommended configuration we allocate 4 nodes, but
fewer may suffice depending on your high-availability requirements for services.
Manager Node(s)
The manager nodes in the cluster are responsible for running things like Cloudera Manager (Cloudera Hadoop), Ambari
(Hortonworks Hadoop), and the principal roles for services like Hive, Oozie, and ZooKeeper. We need 3 of them in order to provide a
quorum for high availability in case of a node failure. These boxes don’t need high-end configurations and are a ripe area for
cost optimization. For the sake of our recommended configurations we’ll use the same chassis and server types as our compute nodes
in order to keep a common platform, but this is by no means required. Additionally, if your requirements allow it, you can co-locate
these roles on compute or edge nodes.
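The choice of three manager nodes can be checked with ordinary majority-quorum arithmetic. The following is a minimal sketch, assuming the quorum services involved (ZooKeeper, for example) need a strict majority of members to stay available; the function names are illustrative and are not part of any Hadoop or ZooKeeper API.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a strict majority can still be formed."""
    return members - majority(members)


if __name__ == "__main__":
    # Print quorum size and failure tolerance for small ensembles.
    for n in (1, 2, 3, 4, 5):
        print(f"{n} node(s): quorum={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this assumption, two nodes tolerate no failures and four tolerate no more than three do, which is why three is the smallest count that survives a single node failure.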
Edge Node(s)
The role of the edge nodes is to be the primary interface for funneling data into the cluster as well as for pushing result data out of the
cluster. They are most often multi-homed to the Isilon network as well as the datacenter network. The configuration of these nodes can
vary drastically depending on the customer’s use case. For example, if you are staging batch jobs into a cluster, you’ll need a larger
amount of local storage for that data to land on before you copy it into HDFS. If you are streaming data into the cluster, you won’t
need a large amount of space but rather faster storage (like SSDs) to keep that data moving quickly. Much like the Manager Node(s),
this is an area ripe for optimization depending on use case. Our recommendations keep the same configuration as the Manager Nodes
just to keep some platform commonality. Lastly, as with the Manager Node(s), you can co-locate this role onto compute or manager
nodes.
Sizing Compute Nodes and Isilon Nodes
Many different factors go into sizing your cluster. It’s important to work with your Dell EMC Account teams and Dell EMC Customer
Solution Centers Solution Architects to make sure you’re appropriately accounting for as many variables as possible. Variables that
may need to be accounted for include:
Amount of Initial Data
Number of Replicas
Rate of Ingest
Duration of retention
Scratch Space
Compression
Read/Write I/O Mix
Our initial guidance for the ratio of compute nodes to Isilon nodes is 2:1. However, this is only a starting point,
and we strongly recommend a more formal discussion with your Customer Solution Centers Solution Architect to come up with
a more specifically tailored recommendation given your requirements for capacity, performance, and any additional functions that the
Isilon may be serving.
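The initial guidance can be turned into a quick starting estimate. This is only a sketch under stated assumptions: it applies the 2:1 ratio and the 4 infrastructure nodes from the recommended configuration, and ignores every other sizing variable listed above. The function and constant names are hypothetical.

```python
import math

COMPUTE_PER_ISILON = 2    # initial 2:1 guidance from this paper
INFRASTRUCTURE_NODES = 4  # nodes allocated in the recommended configuration


def initial_compute_nodes(isilon_nodes: int, ratio: float = COMPUTE_PER_ISILON) -> int:
    """Starting compute-node count for a given number of Isilon nodes."""
    if isilon_nodes < 1:
        raise ValueError("isilon_nodes must be at least 1")
    return math.ceil(isilon_nodes * ratio)


def initial_server_count(isilon_nodes: int,
                         infrastructure_nodes: int = INFRASTRUCTURE_NODES) -> int:
    """Compute nodes plus infrastructure (manager and edge) nodes."""
    return initial_compute_nodes(isilon_nodes) + infrastructure_nodes


if __name__ == "__main__":
    for isilon in (4, 6, 10):
        print(f"{isilon} Isilon nodes -> {initial_compute_nodes(isilon)} compute nodes, "
              f"{initial_server_count(isilon)} servers in total")
```

Treat the result as an opening figure for the sizing discussion, not as a validated configuration.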
Isilon Platform
Isilon clusters simplify storage by combining the file system, volume manager, and data protection into the EMC Isilon OneFS® operating system. Through the clustered use of EMC Isilon high-performance X-Series nodes, high-capacity NL-Series nodes, and high-density HD-Series nodes, a single Isilon cluster can contain a mix of tiers that provide the best economics, throughput, or I/Os per second into the petabyte range. With over 80 percent storage utilization, Isilon clusters need less raw capacity than most storage systems. Compared to traditional direct-attached storage (DAS) Hadoop, Isilon can store the same data in a third of the storage capacity while providing more protection. Consolidating your unstructured data on Isilon results in greater efficiency, simplified management, and cost savings.
Server Platform
There are plenty of options when it comes to compute and infrastructure nodes inside the Dell EMC PowerEdge portfolio. We’ve
detailed two possible recommended configurations above, but there are many others as well that can be discussed with your Dell EMC
PowerEdge Rack / Tower Servers - These R- and T-series servers are among the most popular options for customers looking for
traditional 1U and 2U form factors. Either the PowerEdge R630 at 1U for density or the PowerEdge R730/XD for drive options is a great
choice.
Modular Servers - Customers looking for robust manageability and integrated networking can look to the Dell EMC modular
infrastructure portfolio. The Dell EMC PowerEdge M1000 Blade chassis and the Dell EMC PowerEdge FX families are great choices.
Just make sure that you have enough drive slots or disk capacity to accommodate the local storage/scratch space that is needed.
These are also great for cases where high datacenter density is required (co-location / hosting).
Server CPU
The server core and frequency requirements for each customer can vary wildly. We recommend working closely with your Dell EMC
Customer Solution Centers Solution Architects to identify the right processor given your unique workload. You can also execute a
proof-of-concept in the Customer Solution Centers at no charge to you in order to get an accurate characterization of your
expected performance.
Server Memory
As with the Server CPUs, this can vary from customer to customer and use case to use case. Generally, we recommend starting at
256GB and going up from there as your utilization of in-memory technologies (Spark, Impala, Alluxio, etc.) increases.
Server Local Storage
You’ll need some host-side cache / scratch space for your compute nodes. Approximately 5-8TB per node is common, on either rotational
or flash media. The total scratch space across your compute nodes should equal approximately 25% of your usable Hadoop
capacity. With the rapidly falling prices of flash memory, it can make sense to use it to get fast local storage in ever-increasing
amounts. If you do opt for SSDs, this local scratch space can be SSDs either in drive bays or in PCI-E form factors.
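The 25% rule of thumb can be checked against the common 5-8TB figure with a few lines of arithmetic. This is a minimal sketch: it assumes the 5-8TB figure applies per compute node and that scratch space is spread evenly across nodes. The function name and the example capacities are hypothetical.

```python
def scratch_per_node_tb(usable_hadoop_tb: float, compute_nodes: int,
                        scratch_fraction: float = 0.25) -> float:
    """Scratch space each compute node should hold, in TB.

    scratch_fraction defaults to the roughly 25% guidance in this paper.
    """
    if compute_nodes < 1:
        raise ValueError("compute_nodes must be at least 1")
    total_scratch_tb = usable_hadoop_tb * scratch_fraction
    return total_scratch_tb / compute_nodes


if __name__ == "__main__":
    # Hypothetical cluster: 200 TB usable Hadoop capacity, 8 compute nodes.
    per_node = scratch_per_node_tb(200, 8)
    in_common_range = 5 <= per_node <= 8
    print(f"{per_node:.2f} TB per node, within the common 5-8TB range: {in_common_range}")
```

If the per-node figure falls well outside the common range, that is a signal to revisit either the compute node count or the drive configuration.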
Network
At a minimum, you’ll want dual 10GbE from each host to the Isilon data nodes. As your bandwidth needs increase, you’ll want to
consider either moving front-side traffic (client to compute nodes) onto its own network cards, or increasing the number and/or speed
of the links to each node. Prices on 25GbE and 40GbE cards are becoming very affordable, and you may want to consider investing in
those early, both to reduce complexity (no need for complicated bonding) and to prepare for the ever-increasing bandwidth
needs of emerging workloads. The Dell EMC Networking S6100 switch is an excellent switch for high-bandwidth needs either at the
host level or at the aggregation tier linking multiple racks together.
It’s also worth noting that as the Dell EMC Isilon product evolves, investing in 40GbE networking will be very wise for both compute-
node connectivity as well as datanode-to-datanode connectivity.
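The trade-off between bonding more links and buying faster ones comes down to simple arithmetic. This is a rough sketch that assumes nominal line rates and ignores bonding overhead, traffic hashing, and protocol overhead, so real throughput will be lower. The function names are illustrative.

```python
import math


def aggregate_gbps(links: int, link_speed_gbps: float) -> float:
    """Nominal aggregate bandwidth of a set of identical links."""
    return links * link_speed_gbps


def links_needed(target_gbps: float, link_speed_gbps: float) -> int:
    """Identical links that must be bonded to reach a nominal target."""
    return math.ceil(target_gbps / link_speed_gbps)


if __name__ == "__main__":
    baseline = aggregate_gbps(2, 10)  # the dual 10GbE minimum
    print(f"Dual 10GbE baseline: {baseline} Gb/s nominal")
    for speed in (10, 25, 40):
        print(f"{speed}GbE links needed for 40 Gb/s: {links_needed(40, speed)}")
```

A single faster link that meets the target removes the need for bonding altogether, which is the complexity reduction described above.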
Dell EMC Customer Solution Centers
The Dell EMC Customer Solution Centers are a global network of connected labs that allow Dell to help customers architect, validate
and build solutions. With multiple footprints in every region, they can help you understand anything from simple hardware platforms, to
more complex solutions. These engagements range from informal 30-60 minute briefings, through longer half-day workshops, to
proofs-of-concept that allow customers to kick the tires of their solution prior to signing on the dotted line. Customers may engage
with their account team and have them submit a request to take advantage of these services at no charge.
Links
Dell Customer Solution Centers - http://www.dell.com/customersolutioncenter
Dell EMC FX PowerEdge Server FX Architecture - http://www.dell.com/en-us/work/learn/fx-server-solutions
Dell EMC Isilon Info Hub For Hadoop - https://community.emc.com/docs/DOC-39529
Isilon Hadoop Tools - https://github.com/Isilon/isilon_hadoop_tools
Cloudera - http://cloudera.com
Hortonworks - http://hortonworks.com