Technical Report
Manila and Sahara Integration in OpenStack Using NetApp NFS Data in Hadoop and Spark
Jeff Applewhite, NetApp
October 2015 | TR-4464
Abstract
Today’s businesses must store an unprecedented volume of data and must manage the depth
and complexity of the data that they capture. Apache Hadoop has gained in popularity
because of its ability to handle large and diverse data types. Apache Spark has recently
gained in popularity because of its ability to rapidly analyze data through its in-memory
approach to processing and to natively use data stored in a Hadoop Distributed File System
(HDFS). Although some companies are successfully bursting to the cloud for these types of
analytics, the options for ingesting and exporting data to and from these technologies have
been limited in OpenStack until now.
The OpenStack Shared File Systems project (Manila) provides basic provisioning and
management of file shares to users and services in an OpenStack cloud. The OpenStack Data
Processing project (Sahara) provides a framework for exposing big data services, such as
Spark and Hadoop, within an OpenStack cloud. Natural synergy and popular demand led the
two project teams to develop a joint solution that exposes Manila file shares within the Sahara
construct to solve real-world big data challenges. This guide assists end users in the task of
using this important new development in the OpenStack cloud capability. It examines common
workflows for how a Sahara user can access big data that resides in Hadoop, Swift, and Manila NFS shares.
TABLE OF CONTENTS
1.2 New Developments in Manila and Sahara
1.3 Relevant Features in Manila
1.4 Ephemeral Storage Versus Cinder Storage in Sahara
3.1 Prepare the Manila Data Source
3.2 Create the Spark Image
3.3 Create the Spark Binary Data
3.4 Update the Cluster Templates
3.5 Create a Spark Job Template Based on the Spark Binary
3.6 Create a Sahara Data Source That Uses Manila Shares
3.7 Create a Spark Template That Uses Manila Shares
3.8 Launch the Spark Cluster from Your Template
4 Use Cases
4.1 Put a Job Binary on a Manila Share
4.2 Put Data from a Manila Share into HDFS
4.3 Launch a Job by Using the Manila Share as a Data Source
Version History
LIST OF TABLES
Table 1) Liberty release updates and proposals related to Manila and Sahara.
1.2 New Developments in Manila and Sahara
Table 1) Liberty release updates and proposals related to Manila and Sahara.

Enhancement | Status
Manila as a Runtime Data Source | Implemented in Liberty release
Addition of Manila as a Binary Store | Implemented in Liberty release
API to Mount and Unmount Manila Shares to Sahara Clusters | Implemented in Liberty release
Adding Manila Shares as an Option for Data Sources in the Horizon UI | In progress
Adding Support for Manila-Based Shares in the Sahara Horizon UI | In progress
Using NetApp NFS Driver to Support NFS Shares as a Data Source in Sahara | In progress
1.3 Relevant Features in Manila
The Manila project offers a vendor-agnostic API framework for the provisioning of file shares in an
OpenStack cloud. The file protocol varies depending on the vendor-specific plug-in that is used for
enabling file sharing on various devices.
NetApp has enabled the NFS and CIFS protocols within the NetApp Manila driver. This document
focuses on the NFS protocol as a data source or target for provisioning file shares. For some use cases,
an HDFS plug-in for Manila might hold some advantages over directly accessing HDFS. The primary
advantage that Manila gives the administrator is an API-driven access control framework with which to
secure instance access to the HDFS data nodes.
Manila has specific data management features that are important for big data applications. It supports
capabilities that have long been associated with NetApp storage arrays. In a NetApp storage context,
Manila operates at the level of NetApp FlexVol® volumes such that the following features come into play:
Creation of volume NetApp Snapshot® copies
Creation of shares from Snapshot copies (NetApp FlexClone® technology)
Deduplication
Compression
Thin provisioning or thick provisioning
Creation of a catalog of differentiated storage pools that are based on the underlying disk media (SSD, SAS, or SATA)
These features can be of enormous value for big data applications. Administrators can tailor the
underlying storage to the workload at hand. For large text-based results or datasets, it is possible to
realize substantial storage savings through NetApp deduplication and/or compression with very little to no
performance impact.
When Manila runs with the NetApp clustered Data ONTAP® operating system as the storage back end, it
can run in two distinct modes:
In the first mode, the Manila driver creates an entire NetApp storage virtual machine (SVM) for each share network on the target NetApp cluster. This mode provides maximum flexibility in the networking and features that are available to the provisioning process.
In the second mode, Manila creates shared exports from a designated SVM that the storage administrator created. In this case, the driver creates a new FlexVol volume and exports this volume to the desired network segments or users.
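In the NetApp driver, these two modes correspond to the driver_handles_share_servers back-end option. The following is a minimal sketch of the two settings; the stanza names and the SVM name are illustrative placeholders:

[svmPerShareNetwork]
# First mode: the driver creates an SVM per share network.
driver_handles_share_servers = True

[existingSVM]
# Second mode: FlexVol volumes are exported from the administrator-created SVM named below.
driver_handles_share_servers = False
netapp_vserver = svm_manila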
The running mode of Manila is largely immaterial for the purposes of this guide. Whether each share
resides within a new SVM or is simply a FlexVol volume within an existing SVM does not matter. What is
important is that the Spark nodes can access the network from which the NFS server on the SVM serves
data.
1.4 Ephemeral Storage Versus Cinder Storage in Sahara
The Sahara project provides mechanisms for placing cluster storage either on ephemeral disks that
reside on a Nova compute node or remotely on Cinder volumes. It also gives the user the option of
using only Cinder volumes that run locally on the node in question. For big data applications, this
option can have significant performance implications because of the volume of data and the design goal
of limiting network round trips to access data. NetApp has reference architectures for Hadoop that
describe how NetApp E-Series arrays can be used to provide both high performance and reduction of the
number of copies of a given object in HDFS. These architectures deliver an easy-to-manage, high-
performance, and cost-effective solution.
As in every design, administrators have important trade-offs to consider when building a cloud that can
host big data applications effectively. One option is to use high-performance storage and mount it on
ephemeral disks that reside on the Nova compute nodes (for example, on
/var/lib/nova/instances). In this way, the end user can be certain that disks for Hadoop or Spark
instances are local to the compute node on which they are run. Another benefit of this design is that it
adheres to the general maxim that locally accessed disks provide higher throughput for Hadoop
applications. In this scenario, disk interconnects are SATA, SAS, or PCIe.
Note: For more details about mounting a file system and the mount options that are desirable for Hadoop on NetApp E-Series arrays, see TR-3969: NetApp Open Solution for Hadoop Solutions Guide.
It might not be feasible to attach specific disks to Nova compute nodes, though, for reasons of
homogeneity of the cloud deployment. In other cases, the simplicity of having a storage pool that can be
accessed remotely through the Cinder block service project outweighs the loss of data locality that is
gained by using an ephemeral disk or local Cinder storage. Although this document puts forward these
issues for architects to consider, its primary purpose is to highlight the new features in Sahara that enable
shared file systems through Manila shares.
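For reference, the choice to attach Cinder volumes is expressed per node group in Sahara. The following node group template is a minimal sketch; the flavor ID and volume count/size are placeholders, and the field names follow the Sahara REST API:

{
    "name": "spark-worker-cinder",
    "plugin_name": "spark",
    "hadoop_version": "1.3.1",
    "node_processes": ["datanode", "slave"],
    "flavor_id": "3",
    "volumes_per_node": 2,
    "volumes_size": 100
}

With volumes_per_node left at 0 (the default), instances use ephemeral disks on the compute node instead.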
2 Solution Description
Sahara and Manila together offer strong usability for big data administrators. This guide describes a
workflow that enables administrators to bring data into Sahara from enterprise data sources that reside on
NetApp NFS storage. The workflow also enables administrators to save the resultant datasets to the
same shares or to different shares for purposes of distribution, further analysis, and so forth.
This solution is based on new integration work by the Sahara and Manila project teams. The sample
workflow shows an administrator of Sahara how to move data between traditional sources such as HDFS
and the newer NFS Manila-based data source.
Figure 1 illustrates the workflow that Sahara uses to enable Manila shares. Starting at the Sahara
controller, the administrator creates a Spark cluster that results in four Spark instances: a master and
three workers. The Sahara service handles all API calls to the Manila service to add access based on a
Manila data source object. Shares are then mounted within the Spark cluster nodes and are accessible
by them for input or output operations as well as for job binaries, libraries, and so on.
Figure 1) Workflow for bringing data from NetApp NFS storage to Sahara.
The specific steps for the workflow in Figure 1 are covered in later sections, but at a high level, the
workflow involves the following tasks:
1. The administrator deploys the Spark cluster from a template that includes predefined Manila shares.
2. The administrator boots the Spark nodes. When the boot process finishes, the Sahara code automatically mounts the Manila shares.
3. A Spark binary (for example, word-count) that is stored on one of the shares becomes the source
for the new workload.
4. The Spark workload initializes and uses data on a Manila NFS share that is mounted on all nodes.
5. The Spark job finishes and writes data to the output Manila file share.
6. End users in the enterprise can consume the resulting datasets natively over familiar protocols.
2.1 Sahara Architecture
The Sahara architecture, which is shown in Figure 2, consists of several components:
Auth component. Responsible for client authentication and authorization; communicates with Keystone, the OpenStack identity service.
Data access layer (DAL). Makes internal models persistent in the database.
Provisioning engine. Responsible for communication with the following OpenStack services: compute (Nova), orchestration (Heat), block storage (Cinder), and image (Glance).
Vendor plug-ins. Pluggable mechanisms that are responsible for configuring and launching data processing frameworks on the provisioned virtual machines. Existing management solutions, such as Apache Ambari and Cloudera Manager Admin Console, can also be used for that purpose.
Elastic data processing (EDP). Responsible for scheduling and managing data processing jobs on clusters that are provisioned by Sahara.
REST API. Exposes Sahara functionality through a REST HTTP interface.
Python client. Client for the Sahara REST API.
Note: Like other OpenStack components, Sahara has its own Python client.
Sahara pages. GUI for Sahara; located in the OpenStack dashboard (Horizon).
Figure 2) Sahara architecture (graphic supplied by OpenStack).
2.2 OpenStack Requirements
The procedures in this guide were compiled from systems running the Liberty release of OpenStack.
Additional enhancements are planned for Horizon to make some of the configuration steps easier
compared with how they are documented here. All functionality works from the CLI in the Liberty release,
and there are no requirements for Horizon beyond the stable Liberty release to achieve the goals that are
outlined in this guide. Future Horizon patches will leverage underlying capabilities in the services and add
a graphical interface to ease administration.
Instructions for the installation of OpenStack are beyond the scope of this guide. For additional
information about OpenStack, see OpenStack Installation Guide for Ubuntu 14.04 (Kilo release).
2.3 Manila Configuration
You configure Manila by changing the contents of the manila.conf file and restarting all of the Manila
processes. Depending on the OpenStack distribution that you use, to restart the processes, you might
need to run commands such as service openstack-manila-api restart or service manila-api restart.
The manila.conf file contains a set of configuration options (one per line) that are specified as
option_name=value. Configuration options are grouped together into stanzas that are denoted by
[stanza_name]. The file must contain at least one stanza named [DEFAULT]. The [DEFAULT] stanza
contains configuration parameters that apply generically to Manila (and not to any particular back end).
You must place options that are associated with a particular Manila back end in a separate stanza.
Note: Although you can specify driver-specific configuration options within the [DEFAULT] stanza, you cannot define multiple Manila back ends within the [DEFAULT] stanza. NetApp strongly recommends that you place driver-specific configuration in separate stanzas. Ensure that you list the back ends that should be enabled as the value for the enabled_share_backends configuration option. For example:
enabled_share_backends=clusterOne,clusterTwo
The enabled_share_backends option must be specified within the [DEFAULT] configuration stanza.
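A minimal manila.conf sketch that follows this layout is shown below; the stanza name matches the clusterOne example above, and the address, credentials, and SVM name are illustrative placeholders:

[DEFAULT]
enabled_share_backends = clusterOne

[clusterOne]
share_backend_name = clusterOne
share_driver = manila.share.drivers.netapp.common.NetAppDriver
driver_handles_share_servers = False
netapp_storage_family = ontap_cluster
netapp_server_hostname = 192.168.0.50
netapp_login = admin
netapp_password = changeme
netapp_vserver = svm_manila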
Manila Network Plug-ins
The following network plug-ins are valid as of the Kilo release of OpenStack:
Standalone network. For this plug-in, you define all IP settings (address range, subnet mask, gateway, and version) through configuration options in the driver-specific configuration stanza.
Nova network: simple. This plug-in uses a single Nova network ID for all share servers. You specify the ID of the Nova network to be leveraged through a configuration option in the driver-specific configuration stanza.
Nova network: configurable. This plug-in enables end users of Manila to create share networks that map to different Nova networks. Values for the segmentation protocol (for example, VLAN), IP address, subnet mask, protocol, and gateway are obtained from the Nova network when a new share server is created. You can specify default values for the network ID and for the subnet ID through configuration options in the driver-specific configuration stanza. However, values that are specified by end users when they define share networks take precedence over values that you declare in the configuration file.
Neutron network. This plug-in uses Neutron networks and subnets for defining share networks. Values for the segmentation protocol (for example, VLAN), IP address, subnet mask, protocol, and gateway are obtained from Neutron when a new share server is created. You can specify default values for the network ID and for the subnet ID through configuration options in the driver-specific configuration stanza. However, values that are specified by end users when they define share networks take precedence over values that you declare in the configuration file.
The Manila network plug-ins provide a variety of approaches for integrating with the network services
that are available in OpenStack. To choose a network plug-in, set the value of the network_api_class
configuration option within the driver-specific configuration stanza of the manila.conf file. The plug-ins
apply only when the NetApp clustered Data ONTAP driver is used with share server management.
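For example, selecting the Neutron plug-in looks like the following; the class path reflects the Kilo/Liberty-era Manila code tree and should be verified against your release:

network_api_class = manila.network.neutron.neutron_network_plugin.NeutronNetworkPlugin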
2.4 Testbed Workflow for Creating Manila Shares
The setup for the testbed on which this document is based did not require network plug-ins or share
server management. Instead, it used existing Neutron networking and created shares from an existing
SVM (no share server was created).
Note: This guide includes information about the Manila network plug-ins for cases in which more advanced network configurations are required. The mode in which Manila runs is not material to the basic functionality of the new Sahara integrations. For more information about the Manila modes, see the NetApp OpenStack Deployment and Operations Guide.
Figure 3 shows the steps that take place when a user requests the creation of a share in Manila and the
selected back end does not use share servers.
Figure 3) Manila workflow for share creation without share servers.
The workflow in Figure 3 has the following steps:
1. The client issues a request to create the share by invoking the REST API (the client might use the
python-manilaclient CLI utility). The manila-api, manila-scheduler, and manila-share
processes then perform the following tasks:
a. The manila-api process validates the request and user credentials. After validation, it puts the
message on the AMQP queue for processing.
b. The manila-share process takes the message off the queue and sends the message to
manila-scheduler to determine the pool and the back end into which to provision the share.
c. The manila-scheduler process takes the message off the queue and generates a candidate
list of resource pools. The list is based on the current state and the criteria for the requested share (size, availability zone, and share type, which includes extra specs).
d. The manila-share process reads the response message from manila-scheduler from the
queue; it iterates through the candidate list by invoking back-end driver methods for the corresponding pools until it is successful.
2. If selected by the scheduler, the NetApp Manila driver creates the requested share through interactions with the storage subsystem (depending on configuration and protocol). Without a share server, the NetApp Manila driver exports the shares through the data LIFs in the SVM that is scoped to the Manila back end.
3. The manila-share process creates share metadata and posts a response message to the AMQP
queue. The manila-api process reads the response message from the queue and responds to the
client with share ID information.
4. After a share is created and exported by the back end, the client uses the ID information to request updated share details, and it uses export information from the response to mount the share (through protocol-specific commands). The Sahara code itself then manages the creation of share-access rules that give the instances in OpenStack access to the newly created shares.
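For reference, the client request in step 1 can be issued with the python-manilaclient CLI; in the following sketch, the protocol, size, share name, and share type are illustrative:

$ manila create NFS 1 --name sahara-input --share-type default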
For more details about the configuration of Manila, see the NetApp OpenStack Deployment and
Operations Guide.
You can create or manage a Manila share and then use it as a data source or a target in a working Spark
cluster within the Sahara framework. To do so, you must follow the complete workflow that Sahara uses
to enable Manila shares. The workflow is described in the sections that follow.
3.1 Prepare the Manila Data Source
Manila is the starting point for the process. Instructions for the configuration of Manila by using a NetApp
FAS system are covered in detail in the NetApp OpenStack Deployment and Operations Guide.
During the cluster setup, Sahara must access instances through an SSH session. To establish this
connection, it uses either the fixed or the floating IP address of the instance. By default, Sahara is
configured to use floating IP addresses for access. This behavior is controlled by the
use_floating_ips configuration parameter. For this setup, you have two options to enable all
instances to gain a floating IP address:
If you are using the Nova network, you can configure it to assign floating IP addresses automatically
by setting the auto_assign_floating_ips parameter to True in the Nova configuration file
(usually nova.conf).
If you are using Neutron, you can specify a floating IP address pool for each node group directly.
Note: When you use floating IP addresses for management (use_floating_ips=True), every instance in the cluster must have a floating IP address, or else Sahara cannot use the cluster.
If you are not using floating IP addresses (use_floating_ips=False), Sahara uses fixed IP
addresses for instance management. If you use Neutron for the networking service, you can choose the fixed IP address network for all instances in the cluster. Whether you use the Nova network or Neutron, verify that all instances running Sahara have access to the fixed IP address networks.
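In sahara.conf, these choices come down to two switches; the following is a minimal sketch, assuming a Neutron-based deployment:

[DEFAULT]
use_floating_ips = True
use_neutron = True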
In the test environment, we tested floating IP addresses and fixed IP addresses, and both approaches
worked well. Floating IP addresses are desirable in certain use cases, but they do not generally perform
as well as fixed, unrouted IP connections to the NFS storage. The advantage of using fixed IP addresses
is that network traffic for NFS flows over a native network connection rather than through the additional
overhead of iptables with NAT. For this reason, the performance of a native, unrouted network is
generally better. On the other hand, if complex network requirements are involved for access to the NFS
share, a public network with a shared pool might provide the needed flexibility to access the target.
To begin the workflow, you must have an existing data source that you want to convert and use within
Sahara, or you must create a data source that will contain the analytics result set.
Scenario One: Manage an Existing NFS Share
In some scenarios, you might need existing data in the enterprise for data processing. To bring existing
shares into Manila, and thus make them available for later provisioning to Spark or Hadoop clusters, use
the manila manage command.
Note: In the command block example, 192.168.90.167:/mnt/cb5c3580-5da8-4d16-9071-0251a5e208c1/ represents an actual Sahara node (IP address) with a Manila mount point in your Spark cluster.
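A minimal sketch of such a command follows, assuming the Liberty-era python-manilaclient syntax; the service host (in host@backend#pool form), protocol, and export path are illustrative:

$ manila manage openstack@clusterOne#aggr1 NFS 192.168.0.50:/existing_data --name existing-data --share_type default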
3. Create the spark-wordcount binary:
To create it by using the internal database, run the following command:
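As a sketch, with the Liberty-era sahara CLI the sequence looks approximately like the following; the file name is illustrative, and <data-id> stands for the ID returned by the first command:

$ sahara job-binary-data-create --name spark-wordcount.jar --file spark-wordcount.jar
$ sahara job-binary-create --name spark-wordcount --url internal-db://<data-id>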
manila-output is the name of the data source. You will reference this name in the Spark jobs
that you create.
cb5c3580-5da8-4d16-9071-0251a5e208c1 represents the ID of the Manila share that you
created in the “Scenario Two: Create an NFS Share and a Share Network” section. You can repeat the command for the second share, substituting the ID as appropriate.
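As a sketch, the corresponding data-source creation with the Liberty-era sahara CLI looks approximately like the following; the manila:// URL scheme and the manila data source type come with the Liberty integration, and the output path is illustrative:

$ sahara data-source-create --name manila-output --type manila \
  --url manila://cb5c3580-5da8-4d16-9071-0251a5e208c1/output/%JOB_EXEC_ID%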
Note: The %JOB_EXEC_ID% string gives each job a unique output directory to prevent duplicate output errors in Hadoop and Spark.
3.7 Create a Spark Template That Uses Manila Shares
The current method for using Manila shares involves modifying the default cluster template. If you pulled
the source for Sahara by using $ git clone https://github.com/openstack/sahara.git (see
the “Create the Spark Binary Data” section), the source includes the
sahara/plugins/default_templates/spark/v1_3_1/cluster.json file, which defines a valid
Spark cluster. You must copy and edit the file to include a reference to the two Manila shares that you
created in Horizon.
To edit the .json file and create your template, complete the following steps:
1. Copy the .json file and save it as manila-spark-cluster.json.
2. Edit your file to include one or more existing Manila share IDs. The share IDs must correspond to the IDs for the two Manila shares that you created in Horizon.
3. Add the ID for your Neutron management network to the file. The network ID must correspond to one of the following options:
An existing public network (from which a floating IP address is obtained by Sahara) if you configured
Sahara with the default setting use_floating_ips=True. Sahara maps the floating IP address
to the VM and uses it when provisioning the shares to Manila through the public network.
An existing private Neutron network if you configured Sahara with the setting
use_floating_ips=False and you can route to or access your NFS storage from this private
network. In this case, your VM instances do not receive a public floating IP address.
Best Practice
Use Manila over nonrouted, private networks that are dedicated to storage.
In the following .json file example, note where the network ID and the two share IDs appear.
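This is a minimal sketch based on the layout of the stock spark v1_3_1 cluster.json; all IDs are placeholders except the first share ID, which reuses the example share ID from the earlier sections:

{
    "plugin_name": "spark",
    "hadoop_version": "1.3.1",
    "name": "manila-spark-cluster",
    "neutron_management_network": "11111111-2222-3333-4444-555566667777",
    "node_groups": [
        {
            "name": "master",
            "count": 1,
            "node_group_template_id": "{spark-master-template-id}"
        },
        {
            "name": "worker",
            "count": 3,
            "node_group_template_id": "{spark-worker-template-id}"
        }
    ],
    "shares": [
        {
            "id": "cb5c3580-5da8-4d16-9071-0251a5e208c1",
            "access_level": "rw"
        },
        {
            "id": "22222222-3333-4444-5555-666677778888",
            "access_level": "rw"
        }
    ]
}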
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.
Trademark Information
NetApp, the NetApp logo, Go Further, Faster, AltaVault, ASUP, AutoSupport, Campaign Express, Cloud
ONTAP, Clustered Data ONTAP, Customer Fitness, Data ONTAP, DataMotion, Fitness, Flash Accel, and
other marks are trademarks or registered trademarks of NetApp, Inc., in the United States and/or other countries.
Software derived from copyrighted NetApp material is subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice. NetApp assumes no responsibility or liability arising from the use of products described herein, except as expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent rights, trademark rights, or any other intellectual property rights of NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).