§ The network must provide high bandwidth and low latency, and should scale seamlessly with the Hadoop cluster to deliver predictable performance
§ And many more, such as:
§ Integration with operational data systems
§ Authentication, authorization, and encryption
§ Centralized management
Infrastructure Requirements
Figure 1.2: Picture of a row of servers in a Google WSC, 2012.
1.6.1 STORAGE
Disk drives or Flash devices are connected directly to each individual server and managed by a global distributed file system (such as Google's GFS [58]), or they can be part of Network Attached Storage (NAS) devices directly connected to the cluster-level switching fabric. A NAS tends to be a simpler solution to deploy initially because it allows some of the data management responsibilities to be outsourced to a NAS appliance vendor. Keeping storage separate from computing nodes also makes it easier to enforce quality-of-service guarantees, since the NAS runs no compute jobs besides the storage server. In contrast, attaching disks directly to compute nodes can reduce hardware costs (the disks leverage the existing server enclosure) and improve networking fabric utilization (each server network port is effectively dynamically shared between the computing tasks and the file system).
The replication model between these two approaches is also fundamentally different. A NAS tends to provide high availability through replication or error correction capabilities within each appliance, whereas systems like GFS implement replication across different machines and consequently will use more networking bandwidth to complete write operations. However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack, and may allow higher aggregate read bandwidth because the same data can be sourced from multiple replicas.
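To make the write-bandwidth trade-off concrete, here is a minimal back-of-the-envelope sketch in Python; the replication factor of 3 and the assumption that the first replica lands on the writing node are illustrative, not figures from the text:

    # Back-of-the-envelope: network bytes needed to commit a write in a
    # GFS-like system that replicates each block across machines.
    def write_network_bytes(data_bytes, replication_factor=3, first_replica_local=True):
        # Illustrative assumption: one replica may be stored on the writing
        # node itself, so only the remaining copies cross the network.
        remote_copies = replication_factor - (1 if first_replica_local else 0)
        return data_bytes * remote_copies

    one_gib = 1 << 30
    print(write_network_bytes(one_gib))  # 2 GiB of network traffic for a 1 GiB write

A NAS appliance that handles redundancy internally would instead put roughly one copy of the data on the fabric per write, which is why cross-machine replication consumes more networking bandwidth.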
Will my infrastructure meet my needs now and in the future without putting my business at risk?
Where to Deploy your Hadoop Cluster?
When enterprises adopt Hadoop, one of the decisions they must make is the deployment model. There are four options as illustrated in Figure 1:
• On-premise full custom. With this option, businesses purchase commodity hardware, install the software, and operate it themselves. This option gives businesses full control of the Hadoop cluster.
• Hadoop appliance. This preconfigured Hadoop cluster allows businesses to bypass detailed technical configuration decisions and jumpstart data analysis.
• Hadoop hosting. Much as with a traditional ISP model, organizations rely on a service provider to deploy and operate Hadoop clusters on their behalf.
• Hadoop-as-a-Service. This option gives businesses instant access to Hadoop clusters with a pay-per-use consumption model, providing greater business agility.
To determine which of these options presents the right deployment model, organizations must consider five key areas. The first is the price-performance ratio, and it is the focus of this paper. The Hadoop-as-a-service model is typically cloud-based and uses virtualization technology to automate deployment and operation processes (in comparison, the other models typically use physical machines directly).
Two divergent views exist regarding the price-performance ratio of Hadoop deployments. One view holds that a virtualized Hadoop cluster is slower, because Hadoop's workload involves intensive I/O operations, which tend to run slowly in virtualized environments. The other view holds that the cloud-based model provides compelling cost savings, because its individual server nodes tend to be less expensive and Hadoop is horizontally scalable.
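The tension between the two views reduces to simple arithmetic: a virtualized node may process data more slowly, but it also costs less per hour, and the cluster can be scaled out. A minimal sketch, in which every price and throughput figure is invented purely for illustration:

    # Hypothetical cost-to-complete comparison for a fixed workload.
    # All numbers below are illustrative assumptions, not benchmark results.
    def cost_to_process(tb_of_data, nodes, tb_per_node_hour, price_per_node_hour):
        hours = tb_of_data / (nodes * tb_per_node_hour)
        return hours * nodes * price_per_node_hour

    workload_tb = 100
    physical = cost_to_process(workload_tb, nodes=10, tb_per_node_hour=1.0, price_per_node_hour=2.00)
    virtual  = cost_to_process(workload_tb, nodes=20, tb_per_node_hour=0.7, price_per_node_hour=0.60)
    print(f"physical: ${physical:.2f}, virtualized: ${virtual:.2f}")

Note that under linear scaling the cost is independent of node count, since doubling the nodes halves the runtime; what actually decides the comparison is price per unit of throughput.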
The second area of consideration is data privacy, which is a common concern when storing data outside of corporate-owned infrastructure. Cloud-based deployment requires a comprehensive cloud-data privacy strategy that encompasses areas such as proper implementation of legal requirements, well-orchestrated data-protection technologies, and the organization's culture with regard to adopting emerging technologies. The Accenture Cloud Data Privacy Framework outlines a detailed approach to help clients address this issue.
The third area is data gravity. Once data volume reaches a certain point, physical data migration becomes prohibitively slow, which means that many organizations are locked into their current data platform. Therefore, the portability of data, the anticipated future growth of data, and the location of data must all be carefully considered.
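A quick calculation shows why data gravity bites; the link speed, efficiency factor, and volume below are illustrative assumptions:

    # How long does it take to migrate a large dataset over a WAN link?
    def migration_days(volume_tb, link_gbps, efficiency=0.8):
        # efficiency is an assumed factor for protocol overhead and contention
        bits = volume_tb * 8 * 10**12
        seconds = bits / (link_gbps * 10**9 * efficiency)
        return seconds / 86400

    print(f"{migration_days(1000, 1):.0f} days")  # ~1 PB over 1 Gbps: roughly 116 days

At petabyte scale even a dedicated link keeps the data in place for months, which is exactly the lock-in effect described above.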
A related and fourth area is data enrichment, which involves leveraging multiple datasets to uncover new insights. For example, combining a consumer's purchase history and social-networking activities can yield a deeper understanding of the consumer's lifestyle and key personal events, and therefore enable companies to introduce new services and products of interest. The primary challenge is that these multiple datasets add up to a large volume of data, and moving them between locations is slow. Therefore, many organizations choose to co-locate these datasets. Given volume and portability considerations, most organizations choose to move the smaller datasets to the location of the larger ones. Thus, thinking strategically about where to house your data, considering both current and future needs, is key.
The fifth area is the productivity of developers and data scientists. They tap into the datasets, create a “sandbox” environment, explore the data analysis ideas, and deploy them into production. Cloud’s self-service deployment model tends to expedite this process.
Figure 1. The spectrum of Hadoop deployment options
§ Oracle R Support for Big Data
§ R is an open-source language and environment for statistical analysis and graphing
§ The standard R distribution is installed on all nodes of Oracle Big Data Appliance
§ Oracle R Connector for Hadoop provides R users with high-performance, native access to HDFS and the MapReduce programming framework
§ Oracle R Enterprise is a separate package that provides real-time access to Oracle Database.
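Oracle R Connector for Hadoop itself is an R package, but the kind of programmatic HDFS access it provides can be illustrated with a generic Python analogue using the open-source hdfs (HdfsCLI) client. This is not the Oracle R API; the NameNode URL, port, user, and file path below are placeholders:

    # Generic illustration of programmatic HDFS access (not the Oracle R API).
    # Requires: pip install hdfs; URL, user, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode.example.com:50070", user="analyst")
    print(client.list("/data"))            # browse a directory in HDFS
    with client.read("/data/sales.csv") as reader:
        content = reader.read()            # stream a file without copying it locally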
§ Oracle NoSQL Database
§ Oracle NoSQL Database is a distributed key-value database built on the storage technology of Berkeley DB Java Edition.
§ An intelligent driver on top of Berkeley DB keeps track of the underlying storage topology, shards the data and knows where data can be placed with the lowest latency
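The driver's routing behavior can be sketched in a few lines: hash the major key to pick a shard, so any client can send a request directly to the node that owns the key. The hash function and shard count here are illustrative assumptions, not Oracle's actual algorithm:

    # Minimal sketch of key-hash sharding as done by a topology-aware driver.
    # The hash function and shard count are illustrative assumptions.
    import hashlib

    NUM_SHARDS = 12

    def shard_for(major_key: str) -> int:
        digest = hashlib.md5(major_key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # All records sharing a major key land on the same shard, so multi-part
    # reads under one major key stay on a single node.
    print(shard_for("user/42"))  # the same major key always routes to the same shard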
§ Oracle Big Data Lite VM
§ http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ MOS Notes
§ Information Center: Oracle Big Data Appliance (Doc ID 1445762.2)
§ Big Data Connectors (Doc ID 1487399.2)
§ Sqoop Frequently Asked Questions (FAQ) (Doc ID 1510470.1)
Technical white paper | HP Reference Architecture for MapR M5
This section specifies which server to use and the rationale behind it. The Reference Architectures section will provide topologies for the deployment of control and worker services across the nodes for clusters of varying sizes.
Processor configuration
MapR manages the amount of work each server can undertake via the number of Map/Reduce slots configured for that server. The more cores available to the server, the more Map/Reduce slots can be configured for it (see the Computation section for more detail). We recommend six-core processors for a good balance of price and performance, and we recommend enabling Hyper-Threading.
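A minimal sketch of the relationship between cores and slots; the oversubscription ratio, reserved cores, and map-to-reduce split below are assumptions for illustration, not MapR defaults:

    # Illustrative sizing helper: derive Map/Reduce slot counts from core count.
    # The ratios below are assumptions, not MapR defaults.
    def suggest_slots(physical_cores, hyperthreading=True, reserved_cores=2):
        logical = physical_cores * (2 if hyperthreading else 1)
        usable = max(logical - reserved_cores, 1)  # leave headroom for OS/daemons
        map_slots = round(usable * 2 / 3)          # assumed 2:1 map-to-reduce split
        reduce_slots = usable - map_slots
        return map_slots, reduce_slots

    print(suggest_slots(12))  # two six-core CPUs with Hyper-Threading -> (15, 7)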
Drive configuration
Redundancy is built into the MapR architecture, so there is no need for RAID or additional hardware components to improve redundancy on the server; it is all coordinated and managed by the MapR software.
MapR Benefit
Drives should use a Just a Bunch of Disks (JBOD) configuration, which can be achieved with the HP P420 RAID controller by configuring each individual disk as a separate RAID 0 volume. We recommend disabling array acceleration on the controller to better handle large block I/Os in the Hadoop environment.
Lastly, servers should provide a large amount of storage capacity, which increases the total capacity of the distributed file system; we recommend at least twelve 2TB Large Form Factor (LFF) drives for optimum I/O performance. The DL380e supports 14 LFF drives, which allows one either to use all 14 drives for data or to use 12 drives for data and the remaining 2 to mirror the operating system and MapR runtime. Hot-pluggable drives are recommended so that drives can be replaced without restarting the server.
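For capacity planning, the usable space per worker works out as follows; a replication factor of 3 is assumed here as a common default, and filesystem overhead is ignored:

    # Usable capacity of one DL380e worker under the drive options above.
    # Assumes replication factor 3 (a common default); ignores FS overhead.
    def usable_tb(data_drives, drive_tb=2, replication=3):
        return data_drives * drive_tb / replication

    print(usable_tb(14))  # all 14 LFF drives for data   -> ~9.3 TB usable
    print(usable_tb(12))  # 12 data + 2 OS mirror drives ->  8.0 TB usable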
Memory configuration
Servers running the node processes should have sufficient memory for either HBase or the number of Map/Reduce slots configured on the server. A server with a larger RAM configuration will deliver optimum performance for both HBase and Map/Reduce. To ensure optimal memory performance and bandwidth, we recommend using 8GB or 16GB DIMMs to populate each of the 6 memory channels as needed.
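The channel arithmetic behind that recommendation, as a quick sketch; one DIMM per channel is an assumption for illustration:

    # Balanced memory totals for the 6 channels mentioned above,
    # assuming one DIMM per channel (an illustrative assumption).
    for dimm_gb in (8, 16):
        print(f"6 channels x {dimm_gb} GB DIMMs = {6 * dimm_gb} GB per server")
    # 48 GB or 96 GB; populating every channel evenly keeps memory
    # bandwidth balanced across channels.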
Network configuration
The DL380e includes four 1GbE NICs onboard. MapR automatically identifies the available NICs on the server and bonds them via the MapR software to increase throughput.
MapR Benefit
Each of the reference architecture configurations below specifies an additional top-of-rack switch for redundancy. To make best use of this, we recommend cabling the ProLiant DL380e worker nodes so that NIC 1 is cabled to Switch 1 and NIC 2 to Switch 2, repeating the same pattern for NICs 3 and 4. Each NIC in the server should have its own IP subnet rather than sharing a subnet with the other NICs.
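A sketch of the per-NIC addressing scheme this implies, using Python's standard ipaddress module; the subnet ranges are placeholders:

    # Example per-NIC subnet plan: each NIC gets its own subnet so the
    # software-bonded links do not share a broadcast domain. Ranges are
    # placeholders, not values from the reference architecture.
    import ipaddress

    def nic_plan(node_index, nics=4):
        plan = {}
        for nic in range(1, nics + 1):
            subnet = ipaddress.ip_network(f"10.10.{nic}.0/24")
            plan[f"eth{nic - 1}"] = f"{subnet.network_address + node_index}/{subnet.prefixlen}"
        return plan

    print(nic_plan(11))  # worker 11: eth0=10.10.1.11/24, eth1=10.10.2.11/24, ...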
HP ProLiant DL380e Gen8
The HP ProLiant DL380e Gen8 (2U) is an excellent choice as the server platform for the worker nodes.
Appliance, Cloud or DIY?
Oracle BDA
+ High-performance, scalable network architecture
+ Highly integrated into the Oracle ecosystem
+ Complete software stack, Oracle & Hadoop
+ Single point of support
+ Competitive price/performance ratio for enterprise-class demands
Amazon EC2 Instances
+ Fast and easy deployment
+ Scales from very small to very large cluster setups
+ Capacity on demand on an hourly basis
+ Optional enterprise-class Hadoop distribution
+ Attractive price model for volatile utilization and capacity on demand
Do it Yourself
+ Low entry point
+ Free choice of hardware
+ Free choice of software stack