Lustre® Networking

A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC.
JULY 2007
VERSION 4

BY PETER J. BRAAM

Abstract

This paper provides information about Lustre® networking (LNET™) that can be used to plan cluster file system deployments for optimal performance and scalability. We will review Lustre message passing, Lustre Network Drivers (LNDs), and routing in Lustre networks, and describe how these can be used to improve cluster storage management. The final section of this paper describes some new LNET features that are currently under consideration or planned for release.

Contents

Challenges in Cluster Networking
1. Lustre Networking - Architecture and Current Features
1.1 LNET Architecture
1.2 Network Types Supported in Lustre Networks
1.3 Routers and Multiple Interfaces in Lustre Networks
2. Applications of LNET
2.1 RDMA and LNET
2.2 Using LNET to Implement a Site-Wide or Global File System
2.3 Using Lustre over the WAN
2.4 Using Lustre Routers for Load Balancing
3. Anticipated Features in Future Releases
3.1 New Features for Multiple Interfaces
3.2 Server-Driven QoS
3.3 A Router Control Plane
3.4 Asynchronous I/O
3.5 Message Passing Interface LND
Conclusion

Challenges in Cluster Networking

Networking in today's data centers presents many challenges. For performance, file system clients must access servers using native protocols over a variety of networks, preferably leveraging capabilities like remote direct memory access (RDMA). In large installations, multiple networks may be encountered, and all storage must be simultaneously accessible over those networks through routers and through multiple network interfaces on the servers. Storage management nightmares, such as maintaining multiple copies of data staged on file systems local to each cluster, are common practice but highly undesirable.

LNET provides features that address many of these challenges. In the first section, some of the features of LNET are described, followed by a section discussing how these features can be used in specific high-performance computing (HPC) networking applications. The final section describes how LNET is expected to evolve to enhance load balancing, quality of service (QoS) and high availability in networks on a local and global scale.

1. Lustre Networking - Architecture and Current Features

This section describes some key features of the LNET architecture that can be used to simplify and enhance HPC networking.

1.1 LNET Architecture

The LNET architecture has evolved through extensive research into a set of protocols and APIs to support high-performance, high-availability file systems. In a cluster with a Lustre file system, the system network is the network connecting the servers and the clients.

The disk storage behind the metadata servers (MDSs) and object storage servers (OSSs) in a Lustre file system is connected to these servers using traditional storage area networking (SAN) technologies, but this SAN does not extend to the Lustre client systems and typically does not require SAN switches. LNET is only used over the system network where it provides all communication infrastructure required by the Lustre file system.

Key features of LNET include:

• RDMA, when supported by underlying networks such as Elan, Myrinet, and InfiniBand

• Support for many commonly-used network types such as InfiniBand and IP

• High availability and recovery features enabling transparent recovery in conjunction with failover servers

• Simultaneous availability of multiple network types with routing between them

Figure 1 shows how these network features are implemented in a cluster deployed with LNET.

Figure 1. Lustre architecture for clusters

LNET is implemented using layered software modules. The file system uses a remote procedure call API with interfaces for recovery and bulk transport. This API in turn uses the LNET Message Passing API, which has its roots in the Sandia Portals message passing API, a well-known API in the HPC community. The LNET architecture supports pluggable drivers to provide support for multiple network types, individually or simultaneously, similar in concept to the Portals network abstraction layer (NAL). The drivers, called LNDs, are loaded into the driver stack, with one LND for each network type in use. Routing is possible between the different network types. This capability was implemented early in the Lustre product cycle to provide a key customer, Lawrence Livermore National Laboratory (LLNL), with a site-wide file system, as discussed in Section 2.2, Using LNET to Implement a Site-Wide or Global File System. Figure 2 shows how the software modules and APIs are layered.

[Figure 1 labels: a pool of clustered metadata servers (MDS 1-100), shown as an active MDS 1 and a standby MDS 2 with failover, backed by MDS disk storage containing metadata targets (MDT); object storage servers (OSS 1-7 and beyond) with OSS storage holding object storage targets (OST), backed by commodity storage and enterprise-class storage arrays over a SAN fabric, with shared storage enabling OSS failover; Lustre clients numbering 1-100,000; and a router providing simultaneous support for multiple network types (GigE, Elan, Myrinet, InfiniBand).]

Figure 2. Modular LNET implemented with layered APIs

A Lustre network is a set of configured interfaces on nodes that can send traffic directly from one interface on the network to another. In a Lustre network, configured interfaces are named using network identifiers (NIDs). The NID is a string that has the form <address>@<type><network id>. Examples of NIDs are 192.168.1.1@tcp0, designating an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the 8th Lustre Elan network.
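
As a concrete illustration of the NID format, the short Python sketch below splits a NID string into its address, network type, and network number. It is purely illustrative; the function and type names are ours and are not part of any Lustre or LNET API.

    import re
    from typing import NamedTuple

    class Nid(NamedTuple):
        address: str     # host part, e.g. "192.168.1.1" or "4"
        net_type: str    # network type, e.g. "tcp" or "elan"
        net_number: int  # network number, e.g. 0 for tcp0, 8 for elan8

    def parse_nid(nid: str) -> Nid:
        """Split a NID such as '192.168.1.1@tcp0' or '4@elan8' into its parts."""
        address, network = nid.split("@", 1)
        m = re.fullmatch(r"([a-z_]+)(\d*)", network)
        if m is None:
            raise ValueError(f"malformed network part in NID: {nid!r}")
        net_type, number = m.groups()
        return Nid(address, net_type, int(number) if number else 0)

    print(parse_nid("192.168.1.1@tcp0"))  # Nid(address='192.168.1.1', net_type='tcp', net_number=0)
    print(parse_nid("4@elan8"))           # Nid(address='4', net_type='elan', net_number=8)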

1.2 Network Types Supported in Lustre Networks

LNET includes LNDs to support many network types, including:

• InfiniBand: OpenFabrics versions 1.0 and 1.2, Mellanox Gold, Cisco, Voltaire, and SilverStorm

• TCP: Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB

• Quadrics: Elan3, Elan4

• Myricom: GM, MX

• Cray: SeaStar, RapidArray

The LNDs that support these networks are pluggable modules for the LNET software stack.

[Figure 2 layers, from top to bottom: Lustre request processing (zero-copy marshalling libraries, service framework and request dispatch, connection and address naming, generic recovery infrastructure); the Network I/O (NIO) API, which moves small and large buffers, uses RDMA, and generates events; the Cluster File Systems LNET library, which supports multiple network types, includes a routing API, and is similar to Sandia Portals with some new and different features; the Lustre Network Drivers (LNDs); and the vendor network device libraries, which are not supplied by CFS. The layers above the vendor libraries are portable Lustre components.]

1.3 Routers and Multiple Interfaces in Lustre Networks

A Lustre network consists of one or more interfaces on nodes, configured with NIDs, that communicate without the use of intermediate router nodes with their own NIDs. LNET can conveniently define a Lustre network by enumerating the IP addresses of the interfaces forming that network. A Lustre network is not required to be physically separated from another Lustre network, although that is possible.

When more than one Lustre network is present, LNET can route traffic between networks using routing nodes in the network. An example is shown in Figure 3, where one of the routers is also an OSS. If multiple routers are present between a pair of networks, they offer both load balancing and high availability through redundancy.

Figure 3. Lustre networks connected through routers
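
To make the routing shown in Figure 3 concrete, the following Python sketch models, in a very simplified way, how traffic destined for a remote Lustre network might be spread across several routers and rerouted around a failed one. It is a conceptual model only, not LNET code; the class, method, and NID values are invented for the example.

    import itertools

    class RouterTable:
        """Toy model: round-robin over the live routers that reach a remote network."""
        def __init__(self, routes):
            # routes maps a remote network name to the NIDs of routers that reach it
            self._routes = routes
            self._cycles = {net: itertools.cycle(nids) for net, nids in routes.items()}
            self.down = set()   # router NIDs currently believed to be unreachable

        def next_router(self, remote_network):
            for _ in range(len(self._routes[remote_network])):
                nid = next(self._cycles[remote_network])
                if nid not in self.down:
                    return nid
            raise RuntimeError(f"no live router to reach {remote_network}")

    table = RouterTable({"elan0": ["192.168.0.10@tcp0", "192.168.0.11@tcp0"]})
    print(table.next_router("elan0"))   # routers are used in turn (load balancing)
    print(table.next_router("elan0"))
    table.down.add("192.168.0.11@tcp0")
    print(table.next_router("elan0"))   # traffic falls back to the surviving router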

When multiple interfaces of the same type are available, load balancing traffic across all links becomes important. If the underlying network software for the network type supports interface bonding, resulting in a single address, then LNET can rely on that mechanism. Such interface bonding is available for IP networks and Elan4 but, at present, not for InfiniBand.

If the network does not provide interface bonding, Lustre networks themselves can help. Each interface is placed on a separate Lustre network. Together, the clients on these Lustre networks can utilize all server interfaces, and this configuration provides static load balancing.
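
The following Python fragment sketches the static assignment idea: each server interface sits on its own Lustre network (vib0 and vib1 in Figure 4), and each client is deterministically assigned to one of them, so that the client population as a whole exercises every server interface. The hashing scheme is just one possible way to make the assignment and is our own illustration, not part of Lustre configuration.

    import hashlib

    SERVER_NETWORKS = ["vib0", "vib1"]   # one Lustre network per server interface

    def network_for_client(hostname):
        """Pick a Lustre network for a client once, at configuration time."""
        digest = int(hashlib.md5(hostname.encode()).hexdigest(), 16)
        return SERVER_NETWORKS[digest % len(SERVER_NETWORKS)]

    for host in ("node001", "node002", "node003", "node004"):
        print(host, "->", network_for_client(host))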

Later in this paper, we will describe additional features that may be developed in future releases to allow LNET to even better manage multiple network interfaces.

Figure 4 shows how a Lustre server with several server interfaces can be configured to provide load balancing for clients placed on more than one Lustre network. At the top, two Lustre networks are configured as one physical network using a single switch. At the bottom, they are configured as two physical networks using two switches.

[Figure 3 labels: an elan0 Lustre network (Elan switch, Elan clients, OSS and MDS at addresses 132.6.1.2 and 132.6.1.4) and a tcp0 Lustre network (Ethernet switch, TCP clients at 192.168.0.2) joined by a router with addresses 132.6.1.10 and 192.168.0.10; TCP clients access the MDS through the router.]

Figure 4. A Lustre server with multiple network interfaces offering load balancing to the cluster

[Figure 4 labels: a server with multiple interfaces (10.0.0.1 and 10.0.0.2) on the vib0 and vib1 Lustre networks, serving clients at 10.0.0.3 through 10.0.0.8; in the upper configuration both network rails share a single switch, while in the lower configuration each rail has its own switch.]

2. Applications of LNET

LNET offers considerable versatility in deployments; this section describes a few of the opportunities.

2.1 RDMA and LNET

With the exception of TCP, LNET supports RDMA on all of the network types listed above. When RDMA is used, nodes can achieve nearly full bandwidth with extremely low CPU utilization. This is particularly advantageous for nodes that are busy running other software, such as Lustre server software. The LND uses this feature automatically for large messages.
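
The sketch below illustrates the kind of size-based decision an LND makes, sending small messages inline and switching to RDMA for bulk data. The threshold value is an assumption chosen for the example, not the cut-over point used by any particular LND.

    EAGER_LIMIT = 4096   # bytes; hypothetical cut-over point for this example

    def choose_transfer(payload_len):
        """Small payloads travel inside the message; large ones are moved by RDMA."""
        if payload_len <= EAGER_LIMIT:
            return "eager send (payload copied into the message)"
        return "RDMA (peer reads or writes the buffer directly)"

    print(choose_transfer(512))       # eager send for a small request
    print(choose_transfer(1 << 20))   # RDMA for a 1 MB bulk transfer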

However, nodes provisioned with sufficient CPU power and high-performance motherboards may make TCP networking an acceptable trade-off against RDMA. On 64-bit processors, LNET can saturate several GigE interfaces with relatively low CPU utilization, and with the recently released dual-core Intel Xeon processor 5100 series ("Woodcrest"), bandwidth on a 10GigE network can approach a gigabyte per second. LNET provides extraordinary bandwidth utilization on TCP networks; for example, end-to-end I/O over a single GigE link routinely exceeds 110 MB/sec with LNET.

The Internet Wide Area RDMA Protocol (iWARP), developed by the RDMA Consortium, is an extension to TCP/IP that supports RDMA over TCP/IP networks. Linux supports the iWARP protocol through the OpenFabrics Alliance (OFA) code and interfaces. LNET uses an appropriately modified OFA LND to provide support for iWARP.

2.2 Using LNET to Implement a Site-Wide or Global File System

Site-wide file systems and global file systems provide transparent access from multiple clusters to one or more file systems. Site-wide file systems are typically associated with one site, while global file systems may span multiple locations and therefore rely on wide-area networking.

Site-wide file systems are typically desirable in HPC centers where many clusters exist on different high-speed networks. It is usually not easy either to extend such networks or to connect them to other networks; LNET makes this possible.

An increasingly popular approach is to build a storage island at the center of such an installation. The storage island contains storage arrays and servers and utilizes an InfiniBand or TCP network. Multiple clusters can connect to this island through Lustre routing nodes. The routing nodes are simple Lustre systems with at least two network interfaces, one to the internal cluster network and one to the network used in the storage island. Figure 5 shows an example of a global file system.

Figure 5. A global file system implemented using Lustre networks

The benefits of site-wide and global file systems are not to be underestimated. Traditional data management for multiple clusters frequently involves staging data from the file system of one cluster onto another. By deploying a site-wide Lustre file system, multiple copies of the data are no longer needed, and substantial savings can be achieved through improved storage management and reduced capacity requirements.

[Figure 5 labels: a storage island containing an MDS and an OSS server farm with its storage network, connected by InfiniBand; routers link the storage island to Cluster 1, whose clients are on an Elan4 network, and to Cluster 2, whose clients are on an IP network.]

2.3 Using Lustre over the WAN

Lustre has been successfully deployed over wide area networks (WANs). Typically, even over a WAN, 80 percent of raw bandwidth can be achieved, significantly more than many other file systems achieve over local area networks (LANs). For example, within the United States, Lustre deployments have achieved 970 MB/sec over a WAN using a single 10GigE interface (from a single client!). Between Europe and the United States, 97 MB/sec has been achieved over a single GigE connection. On LANs, observed I/O bandwidths are only slightly higher: 1100 MB/sec on a 10GigE network and 118 MB/sec on a GigE network.
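
As a rough consistency check on these figures, the fragment below compares the quoted rates against the usual raw line rates, assumed here to be 1250 MB/sec for 10GigE and 125 MB/sec for GigE.

    links = {
        "10GigE WAN": (970, 1250), "GigE WAN": (97, 125),
        "10GigE LAN": (1100, 1250), "GigE LAN": (118, 125),
    }
    for name, (achieved, raw) in links.items():
        print(f"{name}: {achieved / raw:.0%} of raw bandwidth")
    # The WAN numbers come out near 78 percent, consistent with the ~80 percent quoted above.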

Routers can also be used advantageously to connect servers distributed over a WAN. For example, a single Lustre cluster may consist of two widely separated groups of Lustre servers and clients, with each group interconnected by an InfiniBand network. As shown in Figure 6, Lustre routing nodes can be used to connect the two groups via an IP-based WAN. Alternatively, the servers could each have both an InfiniBand and an Ethernet interface; however, this configuration may require more switch ports, so the routing solution may be more cost effective. Another alternative would be general-purpose InfiniBand-to-IP routers, which are not yet available and are likely to be costly.

Figure 6. A Lustre cluster distributed over a WAN

[Figure 6 labels: Lustre cluster group 1 (clients and servers on InfiniBand, Location A) and Lustre cluster group 2 (clients and servers on InfiniBand, Location B), each connected through an IP-attached router to the WAN.]

2.4 Using Lustre Routers for Load Balancing

Commodity servers can be used as Lustre routers to provide a cost-effective, load-balanced, redundant router configuration. For example, consider an installation with servers on a 10GigE network and many clients attached to a GigE network. It is possible, but typically costly, to purchase IP switching equipment that can connect to both the servers and the clients.

With a Lustre network, the purchase of such costly switches can be avoided. For a more cost-effective solution, two separate networks can be created. A smaller, faster network contains the servers and a set of router nodes with sufficient aggregate throughput. A second client network with slower interfaces contains all the client nodes and is also attached to the router nodes. If this second network already exists and has sufficient free ports to add the Lustre router nodes, no changes to this client network are required. Figure 7 shows an installation with this configuration.

Figure 7. An installation combining slow and fast networks using Lustre routers

The routers provide a redundant, load-balanced path between the clients and the servers. This network configuration allows many clients together to use the full bandwidth of a server, even if individual clients have insufficient network bandwidth to do so. Because multiple routers stream data to the server network simultaneously, the server network can see data throughput in excess of what a single router can deliver.
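
A back-of-the-envelope sizing of such a router farm, using the throughput figures quoted earlier in this paper (roughly 110 MB/sec per GigE link and about 1 GB/sec on 10GigE), might look as follows; the numbers are illustrative and assume each router forwards about one GigE link's worth of traffic.

    import math

    server_network_bw = 1000   # MB/s achievable on the 10GigE server network
    per_router_bw = 110        # MB/s each GigE-attached router can forward

    routers_needed = math.ceil(server_network_bw / per_router_bw)
    print(f"Routers needed to keep the server network busy: {routers_needed}")   # 10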

[Figure 7 labels: GigE clients behind a GigE switch, a load-balancing, redundant router farm, and 10GigE servers behind a 10GigE switch.]

3. Anticipated Features in Future Releases

Although LNET offers many features today, more are coming in future releases. Possible new features include support for multiple network interfaces, server-driven QoS guarantees, asynchronous I/O, a control interface for routers, and a Message Passing Interface (MPI) LND that will allow LNET to use MPI as its networking API.

A few additional features are being planned but are not described in detail in this paper. Among these are an LNET implementation with LNDs for InfiniBand and TCP on Solaris servers and user-level access to the LNET API.

3.1 New Features for Multiple Interfaces

As discussed above, LNET can currently exploit multiple interfaces by placing them on different Lustre networks. This configuration provides reasonable load balancing for a server with many clients. However, it is a static configuration that does not handle link-level failover or dynamic load balancing.

We plan to address these shortcomings with the following design. First, LNET will virtualize multiple interfaces and offer the aggregate as one NID to users of the LNET API. In concept, this is quite similar to the aggregation (also referred to as bonding or trunking) of Ethernet interfaces using protocols like IEEE 802.3ad Dynamic Link Aggregation. The key features that a future LNET release may offer are:

• Load balancing: All links are used based on availability of throughput capacity.

• Link-level high availability: If one link fails, the other channels transparently continue to be used for communication.

These features are shown in Figure 8.

Figure 8. Link-level load balancing and failover

From a design perspective, these load-balancing and high-availability features are similar to those offered by LNET routing, described in Section 2.4, Using Lustre Routers for Load Balancing. A challenge in developing these features is providing a simple way to configure the network. Assigning and publishing NIDs for the bonded interfaces should be simple and flexible, and should work even if not all links are available at startup. We expect to use the management server protocol to resolve this issue.
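
The toy model below illustrates the intent of the design (it is not the proposed LNET implementation or API): several physical links are presented to callers as a single virtual NID, traffic is spread over the links that are up, and a failed link is simply skipped.

    class BondedNid:
        """Toy model of one virtual NID aggregating several physical links."""
        def __init__(self, virtual_nid, links):
            self.virtual_nid = virtual_nid   # what users of the LNET API would see
            self.links = list(links)         # underlying physical interfaces
            self.up = set(links)
            self._next = 0

        def fail(self, link):
            self.up.discard(link)

        def pick_link(self):
            """Round-robin over healthy links; dead links are transparently skipped."""
            live = [l for l in self.links if l in self.up]
            if not live:
                raise RuntimeError("all links down")
            link = live[self._next % len(live)]
            self._next += 1
            return link

    bond = BondedNid("10.0.0.1@vib0", ["ib0", "ib1"])
    print([bond.pick_link() for _ in range(4)])   # load balanced: ib0, ib1, ib0, ib1
    bond.fail("ib1")
    print([bond.pick_link() for _ in range(2)])   # failover: only ib0 is used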

[Figure 8 labels: a client and server connected by two switched links; with both links up, traffic is spread evenly across them, and when one link fails, all traffic uses the surviving link, so the link failure is accommodated without server failover.]

3.2 Server-Driven QoS

QoS is often a critical issue, for example, when multiple clusters compete for bandwidth from the same storage servers. A primary QoS goal is to avoid overwhelming server systems with conflicting demands from multiple clusters or systems, which degrades performance for all of them. Setting and enforcing policies is one way to avoid this.

For example, a policy can be established that guarantees that a certain minimal bandwidth is allocated to resources that must respond in real-time, such as for visualization. Or a policy can be defined that gives systems or clusters doing mission-critical work priority for bandwidth over less important clusters or systems. The Lustre QoS system’s role is not to determine an appropriate set of policies but to provide capabilities that allow policies to be defined and enforced.

Two components proposed for the Lustre QoS scheduler are a global Epoch Handler (EH) and a Local Request Scheduler (LRS). The EH provides a shared time slice among all servers. This time slice can be relatively large (one second, for example) to avoid overhead due to excessive server-to-server networking and latency. The LRS is responsible for receiving and queuing requests according to a local policy. The EH and LRS together allow all servers in a cluster to execute the same policy during the same time slice. Note that the policy may subdivide the time slices and use the subdivision advantageously. The LRS also provides summary data to the EH to support global knowledge and adaptation.

Figure 9 shows how these features can be used to schedule rendering and visualization of streaming data. In this implementation, LRS policy allocates 30 percent of each epoch time slice to visualization and 70 percent to rendering.

Figure 9. Using server-driven QoS to schedule video rendering and visualization
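
The following toy model (our own simplification, not the proposed Lustre scheduler) shows the effect of the policy in Figure 9: within each shared epoch, a local scheduler spends 30 percent of its time serving the visualization queue and 70 percent serving the rendering queue.

    from collections import deque

    EPOCH = 1.0                                   # seconds; time slice shared by all servers
    SHARES = {"visualization": 0.30, "rendering": 0.70}

    def run_epoch(queues, service_time=0.01):
        """Serve each class for its share of the epoch and report what was handled."""
        served = {}
        for cls, share in SHARES.items():
            budget = round(EPOCH * share / service_time)   # requests this class may consume
            n = min(len(queues[cls]), budget)
            for _ in range(n):
                queues[cls].popleft()
            served[cls] = n
        return served

    queues = {"visualization": deque(range(100)), "rendering": deque(range(100))}
    print(run_epoch(queues))   # {'visualization': 30, 'rendering': 70} with these settings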

[Figure 9 labels: a rendering cluster and a visualization cluster sharing OSS 1-3; epoch messaging marks epochs 1, 2, 3, and within each epoch 30 percent of the time slice goes to visualization and 70 percent to rendering.]

3.3 A Router Control Plane

Lustre technology is expected to be used in vast worldwide file systems that traverse multiple Lustre networks with many routers. To achieve wide-area QoS guarantees that cannot be achieved with static configurations, the configuration of these networks must change dynamically. A control interface between the routers and outside administrative systems is required to handle these situations. Requirements are currently being developed for a Lustre Router Control Plane to address these issues.

For example, one feature under consideration addresses the case in which packets are being routed both from A to B and from C to D, and, for operational reasons, preference must be given to the C-to-D traffic. The control plane would apply a policy to the routers so that packets from C to D are forwarded before packets from A to B.
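
A sketch of the kind of preference the control plane could push to routers is shown below. The control plane itself is still at the requirements stage, so this is only an illustration; the priority table and function are invented for the example.

    import heapq

    PRIORITY = {("C", "D"): 0, ("A", "B"): 1}   # lower value = forwarded first

    def forward_order(packets):
        """packets: iterable of (src, dst, payload); yield them in policy order."""
        heap = [(PRIORITY.get((s, d), 10), i, (s, d, p))
                for i, (s, d, p) in enumerate(packets)]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)[2]

    traffic = [("A", "B", "pkt1"), ("C", "D", "pkt2"), ("A", "B", "pkt3"), ("C", "D", "pkt4")]
    print([p for _, _, p in forward_order(traffic)])   # ['pkt2', 'pkt4', 'pkt1', 'pkt3']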

The Lustre Router Control Plane may also provide input to the server-driven QoS subsystem, linking router policies with server policies and allowing coordinated adjustment of QoS across a cluster and a wide-area network.

3.4 Asynchronous I/O

In large compute clusters, the potential exists for significant I/O optimization. When a client writes large amounts of data, a truly asynchronous I/O mechanism would allow the client to register for RDMA the memory pages to be written and allow the server to transfer the data to storage without causing interrupts on the client. This returns the client CPU fully to the application, a significant benefit in some situations.

Figure 10. Network-level DMA with handshake interrupts and without handshake interrupts

LNET supports RDMA, but currently a handshake at the operating system level is required to initiate the RDMA as shown in Figure 10 (left). The handshake exchanges the network-level DMA addresses to be used. The proposed change to LNET would eliminate the handshake and include the network-level DMA addresses in the initial request to transfer data as shown in Figure 10 (right).
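
The difference between the two flows in Figure 10 can be written out step by step, as in the schematic comparison below (our own paraphrase of the figure, not the LNET wire protocol).

    WITH_HANDSHAKE = [
        "source: register source buffer",
        "source -> sink: put message description",
        "sink: register sink buffer",
        "sink -> source: get DMA address   (the extra round trip)",
        "RDMA data transfer",
        "sink: event delivered",
    ]
    WITHOUT_HANDSHAKE = [
        "source: register source buffer",
        "source -> sink: message description plus source RDMA address",
        "sink: register sink buffer",
        "RDMA data transfer (using the supplied source address)",
        "sink: event delivered",
    ]
    saved = len(WITH_HANDSHAKE) - len(WITHOUT_HANDSHAKE)
    print(f"network messages saved per bulk transfer: {saved}")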

[Figure 10 labels: in the flow with a DMA handshake, the source node registers its buffer and puts a message description, the sink node registers its buffer and gets the DMA address, and the data then moves by RDMA, followed by an event; in the flow without the handshake, the source registers its buffer and sends the description together with the source RDMA address, the sink registers its buffer, and the data moves by RDMA, followed by an event.]

3.5 Message Passing Interface LND

The Lustre file system is often used on HPC platforms where applications use the MPI library as the primary networking library. MPI implementations are usually well tuned because they have a direct impact on the performance of the cluster. MPI-1.x was not suitable for the client-server networking that LNET requires, but MPI-2.x includes suitable features, as well as desirable high-availability features.

CFS is planning to construct an LND that runs over MPI. It will first be used by the liblustre I/O library, which offers access to Lustre servers directly from user-space applications. This will significantly improve the portability of liblustre to other platforms.

Conclusion

LNET provides an exceptionally flexible and innovative networking infrastructure. Among the many features and benefits discussed in this paper, the most significant are:

• Native support for all commonly used HPC networks

• Extremely fast data rates through RDMA and unparalleled TCP throughput

• Support for site-wide file systems through routing, eliminating the staging and copying of data between clusters

• Load-balancing router support to eliminate low-speed network bottlenecks

Lustre networking will continue to evolve with features to handle link aggregation, server-driven QoS, a rich control interface to large routed networks and asynchronous I/O without interrupts.

Legal Disclaimer

Lustre is a registered trademark of Cluster File Systems, Inc., and LNET is a trademark of Cluster File Systems, Inc. Other product names are trademarks of their respective owners. Although CFS strives for accuracy, we reserve the right to change, postpone, or eliminate features at our sole discretion.
