Distributed Access to Parallel File Systems
by
Dean Hildebrand
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Science and Engineering)
in The University of Michigan 2007
Doctoral Committee:
Adjunct Professor Peter Honeyman, Chair
Professor Farnam Jahanian
Professor William R. Martin
Associate Professor Peter M. Chen
Professor Darrell Long, University of California, Santa Cruz
III. A Model of Remote Data Access ....................................................................... 34
    3.1. Architecture for remote data access .......................................................... 34
    3.2. Parallel file system data access architecture ............................................. 37
    3.3. NFSv4 data access architecture ................................................................ 37
    3.4. Remote data access requirements ............................................................. 38
    3.5. Other general data access architectures .................................................... 39
IV. Remote Access to Unmodified Parallel File Systems ....................................... 43
    4.1. NFSv4 state maintenance .......................................................................... 44
    4.2. Architecture ............................................................................................... 44
    4.3. Fault tolerance ........................................................................................... 48
    4.4. Security ..................................................................................................... 49
    4.5. Evaluation ................................................................................................. 49
    4.6. Related work ............................................................................................. 55
    4.7. Conclusion ................................................................................................ 56
V. Flexible Remote Data Access .............................................................................. 57
    5.1. pNFS architecture ..................................................................................... 58
    5.2. Parallel virtual file system version 2 ......................................................... 62
    5.3. pNFS prototype ......................................................................................... 63
    5.4. Evaluation ................................................................................................. 65
    5.5. Additional pNFS design and implementation issues ................................ 69
    5.6. Related work ............................................................................................. 70
    5.7. Conclusion ................................................................................................ 71
VI. Large Files, Small Writes, and pNFS ............................................................... 72
    6.1. Small I/O requests ..................................................................................... 73
    6.2. Small writes and pNFS ............................................................................. 74
    6.3. Evaluation ................................................................................................. 79
    6.4. Related work ............................................................................................. 85
    6.5. Conclusion ................................................................................................ 86
VII. Direct Data Access with a Commodity Storage Protocol .............................. 87
    7.1. Commodity high-performance remote data access ................................... 89
    7.2. pNFS and storage protocol-specific layout drivers ................................... 90
    7.3. Direct-pNFS .............................................................................................. 93
    7.4. Direct-pNFS prototype .............................................................................. 96
    7.5. Evaluation ................................................................................................. 96
    7.6. Related work ........................................................................................... 104
    7.7. Conclusion .............................................................................................. 106
VIII. Summary and Conclusion ............................................................................. 108
tection semantics, instead requiring users to obtain Kerberos [41] tokens that map onto
access control lists, which control access at the granularity of a directory. Kerberos is a
network authentication protocol that provides strong authentication for client/server ap-
plications by using secret-key cryptography. AFS uses a secure RPC called Rx for all
communication.
Some work has begun to create an implementation of AFS that provides remote ac-
cess to existing data stores, although it appears such a system does not yet exist.
Vice, a predecessor of AFS, forms the basis of the high availability Coda file system
[42], which adds support for mutable server replication and disconnected operation.
Coda has three client access strategies: read-one-data, read data from a single preferred
server; read-all-status, obtain version and status information from all servers; and write-
all, write updates to all available servers. Clients can continue to work with cached cop-
ies of files when disconnected, with updates propagated to the servers when the client is
reconnected. If servers contain different versions of the same file, stale replicas are asyn-
chronously refreshed. If conflicting versions exist, user intervention is usually required.
2.2.1.7 DCE/DFS
The Open Software Foundation uses AFS [40] as the basis for the DEcorum file sys-
tem (DCE/DFS) [43], a major component of its Distributed Computing Environment
(DCE). It redesigned the AFS server with an extended virtual file system interface,
called VFS+, enabling it to support a range of underlying file systems. A specialized un-
derlying file system, Episode [44], supports data replication, data migration, and access
control lists (ACLs), which specify the users that can access a file system resource. DFS
supports single-copy consistency semantics, ensuring that clients see the latest changes to
a file. A token manager running on each server manages consistency by returning vari-
ous types of tokens to clients, e.g., read and write tokens, open tokens, and file attribute
tokens. A server can prohibit multiple clients from modifying cached copies of the same
file. Leases [45] are placed on the tokens to let the server revoke tokens that are not re-
newed by the client within a lease period, allowing quick recovery from a failed client
holding exclusive-access tokens. A recovered server enters a grace period—lasting for a
few minutes—which allows clients to detect server failure and reacquire tokens.
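The lease logic can be sketched in a few lines; the class and method names below are invented for illustration and are not part of DCE/DFS:

```python
class TokenManager:
    """Illustrative sketch of DFS-style lease-based token revocation."""
    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.tokens = {}  # (client, token_type) -> lease expiry time

    def grant(self, client, token_type, now):
        self.tokens[(client, token_type)] = now + self.lease_period

    def renew(self, client, token_type, now):
        if (client, token_type) in self.tokens:
            self.tokens[(client, token_type)] = now + self.lease_period

    def expire_stale(self, now):
        # Tokens whose lease was not renewed are revoked, so a failed
        # client cannot hold exclusive-access tokens indefinitely.
        stale = [key for key, expiry in self.tokens.items() if expiry <= now]
        for key in stale:
            del self.tokens[key]
        return stale

mgr = TokenManager(lease_period=30)
mgr.grant("clientA", "write", now=0)
mgr.grant("clientB", "read", now=0)
mgr.renew("clientB", "read", now=25)   # clientB renews; clientA has failed
revoked = mgr.expire_stale(now=31)     # clientA's lease lapsed at t=30
print(revoked)
```

Renewal keeps a live client's tokens valid indefinitely, while a crashed client's tokens disappear after at most one lease period, which is what allows the quick recovery described above.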
2.2.1.8 AppleTalk
Developed by Apple Computer in the early 1980s, the AppleTalk protocol suite [46]
facilitated file transfer, printer sharing, and mail service among Apple systems. Built
from the ground up, AppleTalk managed every layer of the OSI reference model [47].
AppleTalk currently includes a set of protocols to work with existing data link protocols
such as Ethernet, Token Ring, and FDDI.
The AppleTalk Filing Protocol (AFP) allows Macintosh clients to access remote files
in the same manner as local files. AFP uses several other protocols in the AppleTalk pro-
tocol suite including the AppleTalk Session Protocol, the AppleTalk Transaction Proto-
col, and the AppleTalk Echo Protocol. The Mac OS continues to use AFP as a primary
file sharing protocol, but Mac support for NFS is growing.
2.2.1.9 Common Internet File System
The Server Message Block (SMB) protocol, now known as the Common Internet File
System (CIFS) [2], was created for PCs in the 1980s by IBM and later extended by
3Com, Intel, and Microsoft. SMB was designed to provide remote access to the
DOS/FAT file system, but now NTFS forms the basis for CIFS. CIFS uses NetBIOS
(Network Basic Input Output System) sessions, a session management layer originally
designed to operate over a proprietary transport protocol (NETBEUI), but now operates
over TCP/IP and UDP [48, 49]. Once a session ends, the server may close all open files.
Server failure results in the loss of all server state, including all open files and current file
offsets.
CIFS has a unique cache consistency model that uses opportunistic locks (oplocks).
On file open, a client specifies the access it requires (read, write, or both) and the access
to deny others, and in return receives an oplock (if caching is available). There are three
types of oplocks: exclusive, level II, and batch. The first client to open a file receives an
exclusive lock. The server disables caching if two clients request write access to the
same file, forcing both clients to write through the server. When a client requests read access
to a file under an exclusive lock, the server disables caching on that reading client and
downgrades the writing client's cache to read-only (level II). Batch locks allow a client to
retain an oplock across multiple file opens and closes.
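The oplock grant decision can be sketched as a simple function; this is an illustrative simplification (ignoring batch oplocks and deny modes), not the CIFS wire protocol:

```python
def grant_oplock(existing_opens, new_access):
    """Simplified sketch of CIFS oplock granting.
    existing_opens: access modes ("read"/"write") of clients already
    holding the file open; new_access: the mode being requested."""
    if not existing_opens:
        # The first client to open a file receives an exclusive oplock.
        return "exclusive"
    if new_access == "read" and all(a == "read" for a in existing_opens):
        return "level II"   # shared: clients may cache read-only data
    # A writer alongside other openers (or a reader alongside a writer)
    # gets no oplock: the server disables caching and forces write-through.
    return "none"

assert grant_oplock([], "write") == "exclusive"
assert grant_oplock(["read"], "read") == "level II"
assert grant_oplock(["write"], "write") == "none"
```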
Other features of CIFS include request batching and the use of the Microsoft DFS fa-
cility to stitch together several file servers into a single namespace. Security is handled
with password authentication on the server. Although CIFS is exclusively controlled by
Microsoft, Samba [50] is a suite of open source programs that provides file and print ser-
vices to PC clients using the CIFS protocol.
2.2.1.10 Sprite operating system
The Sprite operating system [51, 52] was developed at the University of California at
Berkeley for networked, diskless, and large main memory workstations. The Sprite distributed
file system supports full UNIX semantics. To resolve a file location, each client
maintains a file prefix table, which caches the association of file paths and their home
server. Caching of file prefixes reduces recursive lookup traffic as a client walks through
the directory tree, but the broadcast mechanism used to distribute file prefix information
limits Sprite to LAN environments. Sprite supports single-copy semantics by tracking
open files on clients and whether clients are reading or writing, allowing only a single
writer or multiple readers to cache file data. The server uses callbacks to invalidate client
caches when conflicting open requests occur. Sprite uses a writeback cache, flushing
dirty blocks to the server after thirty seconds and writing them to disk thirty seconds later.
This caching model was integrated into Spritely NFS [53] at the cost of file open
performance and a more complex server recovery model, which stores the list of clients that
have opened a file on disk. Not Quite NFS (NQNFS) [54] avoids storing state on disk
and the use of open and close commands by using a lease mechanism on the client data
cache. This avoids having to introduce additional server recovery semantics to NFS by
simply allowing leases to expire before a failed server recovers.
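Prefix-table resolution amounts to longest-prefix matching on the path; the sketch below uses invented server names:

```python
def resolve_server(prefix_table, path):
    """Sketch of Sprite-style prefix-table lookup: the longest cached
    prefix of the path determines the file's home server."""
    best, server = "", None
    for prefix, srv in prefix_table.items():
        if path.startswith(prefix) and len(prefix) > len(best):
            best, server = prefix, srv
    return server

# Hypothetical prefix table cached on a client.
table = {"/": "rootserver", "/users": "server1", "/users/projects": "server2"}
assert resolve_server(table, "/users/projects/data.txt") == "server2"
assert resolve_server(table, "/users/alice/notes") == "server1"
assert resolve_server(table, "/etc/passwd") == "rootserver"
```

Because the table is consulted locally, a client walking down the directory tree avoids a server round trip per path component; only a miss triggers the (LAN-limited) broadcast described above.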
2.3. High-performance computing
2.3.1. Supercomputers
This section gives an overview of modern supercomputers from a data perspective.
Figure 2.1 displays a typical supercomputer data access architecture, consisting of a pri-
mary I/O (storage) network that connects compute nodes, login/development nodes, visu-
alization facilities, archival facilities, and remote compute and data facilities.
The Thunderbird cluster at Sandia National Laboratories, currently the largest PC
cluster in the world, achieves 38 teraflops. Thunderbird consists of 4,096 dual-processor
nodes using Infiniband for inter-node communication and Gigabit Ethernet for storage
access. This architecture supports direct storage access for all nodes.
Figure 2.1: ASCI platform, data storage, and file system architecture (from ASCI Technology Prospectus, July 2001).

ASCI Purple has 1536 8-way nodes (12,208 CPUs) and achieves 75 teraflops [55].
ASCI Purple is a hybrid of commodity and specialized components. Custom-designed
compute nodes communicate via the IBM Federation interconnect, which has a peak
bidirectional bandwidth of 8 GB/s and a latency of 4.4 µs, and uses a commodity
Infiniband storage network for data access. To meet I/O throughput requirements, 128 nodes
are designated I/O nodes to act as a bridge between the Federation and Infiniband inter-
connects. Compute nodes trap I/O calls and automatically re-execute them on the I/O
nodes, with the results shipped back to the originating compute node. I/O nodes also
handle process authentication, accounting, and authorization on behalf of the compute
nodes.
Figure 2.2 describes the hierarchical architecture of the IBM BlueGene/L hybrid ma-
chine [56, 57]. A full BlueGene/L system has 65,536 dual-processor compute nodes, or-
ders of magnitude more than contemporary systems such as ASCI White [58], Earth
Simulator [59], and ASCI Red [60]. BlueGene/L has one I/O node for every sixty-four
compute nodes, although this ratio is configurable. Currently, each node has a small
amount of memory, limiting the system to highly parallelizable applications.
Figure 2.2: The ASCI BlueGene/L hierarchical architecture (from www.llnl.gov/asc/computing_resources/bluegenel/configuration.html).
2.3.2. High-performance computing applications
The design of modern distributed file system architectures derives mainly from sev-
eral workload characterization studies [40, 61-64] of UNIX users and their applications.
Applications that use thousands of processors have entirely different workloads. This
section gives an overview of HPC applications and their I/O requirements.
High-performance applications aim to use every available processor. Each node cal-
culates the results on a piece of the analysis domain, with the results from every node be-
ing combined at some later point. The example parallel application in Figure 2.3 exe-
cutes on all cluster nodes in parallel. A physical simulation, e.g., propagation of a wave,
moves forward in time. The unit of time depends on the required resolution of the result.
In step 1, nodes load their inputs individually or designate one node to load the data
and distribute it among the nodes. At the beginning of each computation (step 3), nodes
synchronize among themselves. After a node completes its computation for a time step,
step 5 communicates dependent data, a ghost cell, to domain neighbors for the next time
step. Step 6 checkpoints (writes) the data to guard against system or application failures.
The next section discusses checkpointing in more detail. In step 9, all nodes write their
results directly to storage or have a single node gather the results and write the combined
result to storage. Flash [65] is an example of the former; mpiBLAST [66] exemplifies
the latter.
Figure 2.3: A typical high-performance computing application. Pseudo-code that each node executes in parallel.

1. Load initialization data
2. Begin loop
3. BARRIER
4. Compute results for current time step
5. Distribute ghost cells to dependent nodes
6. If required, checkpoint data to storage
7. Advance time forward
8. End loop
9. Write result
2.3.2.1 I/O
Miller and Katz [67] divided high-performance application I/O into three categories:
required, data staging, and checkpoint. Required I/O consists of loading initialization
data and storing the final results. Data staging supports applications whose data do not fit
in main memory. These data can be stored in virtual memory automatically by the oper-
ating system, or manually written to disk using out-of-core techniques for supercomput-
ers that do not support virtual memory. Checkpoint I/O, also known as defensive I/O,
stores intermediate results to prevent data loss. Checkpointing results after every compu-
tation increases application runtime to an unacceptable extent, but deferring checkpoint
I/O too long risks unacceptable data loss. The time between checkpoints depends on a
system’s mean time between hardware failures (MTBHF). Failure rates vary widely
across systems, depending mostly on the number of processors and the type and intensity
of application workloads [68].
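Although no formula is given here, a standard way to pick the checkpoint interval is Young's approximation, which balances checkpoint cost against expected lost work; the formula is not from the dissertation, and the numbers below are hypothetical:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: a near-optimal time between checkpoints is
    sqrt(2 * checkpoint_cost * MTBF). Shorter intervals waste time
    checkpointing; longer intervals risk losing more work per failure."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Hypothetical machine: a checkpoint takes 5 minutes and the mean time
# between hardware failures is 24 hours.
interval = young_interval(300, 24 * 3600)
print(interval / 3600, "hours between checkpoints")  # 2.0 hours
```

The square-root form captures the dependence on MTBHF described above: as failures become more frequent, the optimal interval shrinks, but only as the square root of the failure rate.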
Applications write checkpoint and result files in different ways. These are the most
common three methods [69]:
1. Single file per process/node. Each node or process creates a unique file. This
method is clumsy since an application must use the same number of nodes to re-
start computation after failure or interruption. In addition, the files must later be
integrated or mined by special tools and post-processors to generate a final result.
2. Small number of files. An application can write a smaller number of files than
processes/nodes in the computation by performing some integration work at each
checkpoint (and for the final result). This method allows an application to restart
computation on a different number of processes/nodes. However, special process-
ing and tools are still required.
3. Single file. An application integrates all information into a single file. This
method allows applications to restart computation easily on any number of proc-
esses/nodes and obviates the need for special post-processing and tools.
The overall performance of a supercomputer depends not only on its raw computa-
tional speed but also on job scheduling efficiency, reboot and recovery times (including
checkpoint and restart times), and the level of process management. The Effective
System Performance (ESP) [70] of a supercomputer is a measure of its total system
utilization in a real-world operational environment. ESP is calculated by measuring the time it
takes to run a fixed number of parallel jobs through the batch scheduler of a supercom-
puter. The U.S. Department of Energy procures large systems with an ESP of at least
seventy percent [12].
Defensive I/O is very time consuming and decreases the ESP of a supercomputer.
Some systems report that up to seventy-five percent of I/O is defensive, with only
twenty-five percent being productive I/O, consisting of visualization dumps, diagnostic
physics data, traces of key physics variables over time, etc. To ensure defensive I/O does
not reduce the ESP of a machine below seventy percent, Sandia National Laboratories
uses a general rule of thumb of requiring a throughput of one gigabyte per second for
every teraflop of computing power [12].
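Applying this rule of thumb is simple arithmetic; the 38-teraflop figure matches the Thunderbird cluster described earlier, while the checkpoint size below is hypothetical:

```python
def required_bandwidth_gbs(teraflops):
    """Sandia rule of thumb quoted in the text: one gigabyte per second
    of I/O throughput per teraflop of computing power."""
    return teraflops * 1.0

bw = required_bandwidth_gbs(38)   # the 38-teraflop Thunderbird cluster
# With a hypothetical 10 TB (10,000 GB) checkpoint, the write time is:
seconds = 10_000 / bw
print(bw, "GB/s;", round(seconds), "s per checkpoint")
```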
2.3.2.2 I/O workload characterizations
Workload characterization studies of supercomputing applications on vector com-
puters in the early 1990s [67, 71] found that I/O is cyclic, predictable, and bursty, with
file access sizes relatively constant within each application. Files are read from beginning
to end, benefiting from increased read-ahead. Buffered writes on the client were found
to be of little benefit due to the large amount of data being written and the small
data cache typical of vector machines.
The CHARISMA study [72-74] instrumented I/O libraries to characterize the I/O ac-
cess patterns of distributed memory computers in the mid-1990s. The main difference
from the vector machine study is the large number of small I/O requests. CHARISMA
found that approximately 90% of file accesses are small, but approximately 90% of data
is transferred in large requests. The large number of small write requests, and the short
interval between them, indicate that buffered writes benefit performance in some
instances. Small requests often result from partitioning a data set across many processors
but may also be inherent in some applications.
Write requests dominate read requests in the applications studied by CHARISMA.
Write data sharing between clients is infrequent, since it is rarely useful to re-write the
same byte. With read-only files, approximately 24% are replicated on each client and
64% experience false sharing. A data set can be divided into a series of fixed size data
regions, or data blocks. With false sharing, nodes share data blocks, but do not access the
same byte ranges within the data blocks. This can occur when a data set is divided
among clients according to block divisions instead of the requested byte-ranges, creating
unnecessary data contention. Read-write files exhibit very little byte sharing between
clients due to the difficulty of maintaining consistency, but also experience false sharing.
Separate jobs never share files.
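False sharing can be made concrete by mapping byte ranges to blocks; the 4 KB block size and the ranges below are hypothetical:

```python
def blocks_touched(offset, length, block_size):
    """Set of block indices covered by a byte range."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return set(range(first, last + 1))

BLOCK = 4096  # hypothetical block size
a = blocks_touched(0, 1000, BLOCK)     # client A reads bytes 0-999
b = blocks_touched(2000, 1000, BLOCK)  # client B reads bytes 2000-2999
shared = a & b
# The byte ranges are disjoint, yet both fall in block 0: false sharing.
assert shared == {0}
```

Any consistency protocol that tracks ownership at block granularity will treat these two clients as contending, even though no byte is ever accessed by both.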
Clients interleave access to a single file, creating high inter-process spatial locality on
the I/O nodes, benefiting from I/O server data caching. Strided data access usually uses
standard UNIX I/O operations; application developers cite a lack of portability of the
much more efficient strided interfaces provided by parallel file systems. The rapid pace
of technological change means that applications generally
outlast their targeted platform and must be portable to newer machines.
The Scalable I/O [75-77] initiative in the mid-1990s instrumented applications to
characterize I/O access patterns with distributed memory machines. The findings are
similar to CHARISMA, but emphasize the inefficiency of the UNIX I/O interface with
different file sizes, and spatial and temporal data access within a file. For example, one
sample application generates most of its I/O by seeking through a file. The study led to
proposals for new I/O interfaces that increase the amount of information passed from the
application to the file system, to improve interaction with the I/O
nodes. The Scalable I/O initiative also found that using a single, shared file for storing
information was still common and slow.
In 2004, a group at the University of California, Santa Cruz, analyzed applications on
a large Linux cluster [78]. Like the findings from a decade earlier, this study found that
I/O is bursty, most requests consist of small data transfers, and most data is transferred in
a few large requests. It is common for a master node to collect results from other nodes
and write them to storage using many small requests. Each client reads back the data in
large chunks. In addition, use of a single file is still common and accessing that file—
even with modern parallel file systems—is slower than accessing separate files by a fac-
tor of five.
2.4. Scaling data access
This section discusses common techniques for increasing file system I/O throughput.
2.4.1. Standard parallel application interfaces
A parallel application’s lifespan may be ten years or longer, and it may run on several
different supercomputer architectures. A standard communication interface ensures the
portability of applications. Three communication libraries dominate the development and
execution of supercomputer applications.
MPI (Message Passing Interface) [79] is an API for inter-node communication. MPI-
2 [80] includes a new interface for data access named MPI-IO, which defines standard
I/O interfaces and standard data access hints. MPICH2 [81] is an open-source implemen-
tation of MPI that includes an MPI-IO framework named ROMIO [82]. Specialized im-
plementations of MPI-IO also exist [83].
ROMIO improves single client performance by increasing client request size through
a technique called data sieving [84]. With data sieving, noncontiguous data requests are
converted into a single, large contiguous request consisting of the byte range from the
first to the last requested byte. Data sieving then places the requested data portions in the
user’s buffer. Access to large chunks of data usually outperforms access to smaller
chunks, but data sieving also increases the amount of transferred data and number of
read-modify-write sequences.
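A minimal sketch of data sieving for reads follows; the `read_at` callback stands in for the underlying contiguous read operation, and the offsets are hypothetical:

```python
def sieve_read(read_at, requests):
    """Data sieving sketch: coalesce noncontiguous (offset, length)
    requests into one contiguous read spanning the first to the last
    requested byte, then carve out the pieces the user asked for."""
    start = min(off for off, _ in requests)
    end = max(off + length for off, length in requests)
    buf = read_at(start, end - start)  # one large contiguous read
    return [buf[off - start: off - start + length] for off, length in requests]

data = bytes(range(100))
read_at = lambda off, n: data[off: off + n]
pieces = sieve_read(read_at, [(10, 5), (40, 5), (90, 5)])
assert pieces == [data[10:15], data[40:45], data[90:95]]
```

Here the single read transfers 85 bytes to satisfy 15 requested bytes, illustrating the extra data transfer the text mentions as the cost of fewer, larger requests.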
MPI-IO can also improve I/O performance from multiple clients using collective I/O,
which can exist at the disk level (disk-directed I/O [85]), at the server level (server-
directed I/O [86]), or at the client level (two-phase I/O [87]). In disk-directed I/O, I/O
nodes use disk layout information to optimize the client I/O requests. In server-directed
I/O, a master client communicates the in-memory and on-disk distributions of the arrays
to a master I/O node prior to client data access. The master I/O node then shares the
information with other I/O nodes. This communication allows clients and I/O nodes to
coordinate and improve access to logically sequential regions of files, not just physically
sequential regions. Finally, ROMIO uses two-phase I/O, which organizes noncontiguous
client requests into contiguous requests. For example, two-phase I/O converts interleaved
client read requests from a single file into contiguous read requests by having clients
read contiguous data regions, re-distributing the data to the appropriate clients.
Writing a file is similar except clients first distribute data among themselves so that cli-
ents can write contiguous sections of the file. Accessing data in large chunks outweighs
the increased inter-process communication cost for data redistribution.
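Two-phase I/O for interleaved reads can be sketched as follows, with the inter-client exchange simulated in local memory; the client count and stripe size are hypothetical:

```python
def two_phase_read(file_data, nclients, stripe):
    """Sketch of two-phase I/O. Phase 1: each client issues one large
    contiguous read of its region of the file. Phase 2: clients exchange
    data so each ends up with its round-robin (interleaved) stripes.
    The exchange is simulated here in a single address space."""
    n = len(file_data)
    region = n // nclients
    # Phase 1: contiguous reads, one region per client.
    regions = [file_data[i * region:(i + 1) * region] for i in range(nclients)]
    # Phase 2: redistribute so client c receives every stripe s with
    # (s % nclients) == c, i.e. its interleaved share.
    out = [b""] * nclients
    whole = b"".join(regions)
    for s in range(0, n, stripe):
        out[(s // stripe) % nclients] += whole[s:s + stripe]
    return out

data = bytes(range(16))
c0, c1 = two_phase_read(data, nclients=2, stripe=4)
assert c0 == data[0:4] + data[8:12]
assert c1 == data[4:8] + data[12:16]
```

Each client issues one large sequential read instead of many small strided ones, which is precisely the trade the text describes: extra communication in exchange for large-chunk access.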
Other APIs for inter-process communication include OpenMP [88] and High-
performance Fortran (HPF) [89], but they do not include an I/O interface. OpenMP di-
vides computations among processors in a shared-memory computer. Many applications
mix OpenMP and MPI, using MPI between nodes and
OpenMP to improve performance on a single SMP node. HPF first appeared in 1993.
HPF-2, released in 1997, includes support for data distribution, data and task parallelism,
data mapping, external language support, and asynchronous I/O, but has limited use due
to its restriction to programs written in Fortran.
2.4.2. Parallel file systems
Early high-performance computing connected monolithic computers to monolithic
storage systems. The emergence of low-cost clusters broke this model by creating many
more producers and consumers of data than the memory, CPU, and network
interface of a single file server could manage.
Initial efforts to increase performance included client caches, buffered writes on the
client (Sprite, AFS), and write gathering on the server [90], but they did not address the
single server bottleneck. Many system administrators manually stitch together several
file servers to create a larger and more scalable namespace. This has implications for
administration costs, backup creation, load balancing, and quota management. In addition,
re-organizing the namespace to meet increased demand is visible to users.
A more transparent way to combine file servers into a single namespace is to forward
requests between file servers [91]. Clients access a single server; requests for files not
stored on that server are forwarded to the file’s home server. Unfortunately, servers are
still potential bottlenecks since a directory or file is still bound to a single server. Fur-
thermore, data may now travel through two servers.
Another method aggregates all storage behind file servers using a storage network
[92]. This allows a client to access any server, with servers acting as intermediaries be-
tween clients and storage. This is an in-band solution since control and data both traverse
the same path. This design still requires clients to send all requests for a file to a single
server and can require an expensive storage network.
Out-of-band solutions (OOB), which separate control and data message paths [93-96],
currently offer the best I/O performance in a LAN. They enable direct and parallel access
to storage from multiple endpoints. OOB parallel file systems stripe files across available
storage nodes, increasing the aggregate I/O throughput by distributing the I/O across the
bisectional bandwidth of the storage network between clients and storage. This technique
can reduce the likelihood of any one storage node becoming a bottleneck and offers scal-
able access to a single file. A single network connects clients, metadata servers, and stor-
age, with client-to-client communication occurring over this network or on an optional
host network.
Out-of-band separation of data and control paths has been advocated for decades [93,
94] because it allows an architecture to improve data transfer and control messages sepa-
rately. Note that inter-dependencies between data and control may restrict unbounded,
individual improvement.
Symmetric OOB parallel file systems, depicted in Figure 2.4a, combine clients and
metadata servers into single, fully capable servers, with metadata distributed among
them. Locks can be distributed or centralized. Maintaining consistent metadata informa-
tion among an increasing number of servers limits scalability. These systems generally
require a SAN. Examples include GPFS [97], GFS [98, 99], OCFS2 [100], and
PolyServe Matrix Server [101].
Figure 2.4: (a) Symmetric and (b) asymmetric out-of-band parallel file systems (PFS).
Asymmetric OOB parallel file systems, depicted in Figure 2.4b, divide nodes into cli-
ents and metadata servers. To perform I/O, clients first obtain a file layout map describ-
ing the placement of data in storage from a metadata server. Clients then use the file lay-
out map to access data directly and in parallel. These systems allow data to be accessed
at the block, object, or file level. Block-based systems access disks directly using the
SCSI protocol via FibreChannel or iSCSI. Object- and file-based systems have the po-
tential to improve scalability with a smaller file layout map, shifting the responsibility of
knowing the exact location of every block from clients to storage. Examples of block-
based systems include IBM TotalStorage SAN FS (also known as Storage Tank) [102],
and EMC HighRoad [103]. Examples of object-based systems include Lustre [104] and
Panasas ActiveScale [105]. Examples of file-based systems include Swift [96, 106] and
PVFS [107].
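The layout-map idea can be sketched for a file-based layout with round-robin striping; the stripe size and node names are hypothetical, and real layout maps carry more detail:

```python
def layout_lookup(offset, stripe_size, storage_nodes):
    """Sketch of a file-based layout map with round-robin striping: map a
    file offset to (storage node, offset within that node's object)."""
    stripe = offset // stripe_size
    node = storage_nodes[stripe % len(storage_nodes)]
    # Offset within the node's object: full stripes already stored there,
    # plus the position within the current stripe.
    local = (stripe // len(storage_nodes)) * stripe_size + offset % stripe_size
    return node, local

nodes = ["ds0", "ds1", "ds2"]  # hypothetical storage nodes
assert layout_lookup(0, 65536, nodes) == ("ds0", 0)
assert layout_lookup(65536, 65536, nodes) == ("ds1", 0)
assert layout_lookup(3 * 65536 + 10, 65536, nodes) == ("ds0", 65536 + 10)
```

Because the mapping is computable from a few parameters (stripe size, node list), a file-based layout map stays small no matter how large the file grows, which is the scalability advantage over enumerating every block.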
2.4.2.1 Parallel file systems and POSIX
The POSIX API and semantics impede efficient sharing of a single file’s address
space in a cluster of computers. The POSIX I/O programming model is a single system
in which processes employ fast synchronization and communication primitives to resolve
data access conflicts. Because synchronization and communication are far more expensive
across a cluster, this model breaks down for multiple clients accessing data in a parallel
file system.
Several semantics exemplify the mismatch between POSIX semantics and parallel
file systems.
• Single process/node. The POSIX API forces every application node to execute
the same operation when one node could perform operations such as name resolu-
tion on behalf of all nodes in the distributed application.
• Time stamp freshness. POSIX mandates that file metadata is kept current. Each
I/O operation on the storage nodes alters a time stamp, with single second resolu-
tion, which must be propagated to the metadata server. Examples include modifi-
cation time and access time.
• File size freshness. As part of file metadata, POSIX mandates that the size of a
file is kept current. Clients, storage nodes, and the metadata server must all coordinate
to determine the current size of a file while it is being updated and extended.
• Last writer wins. POSIX mandates that when two or more processes write to a
file at the same location, the file contain the last data written. While a parallel file
system can implement this requirement on each individual storage node, it is hard
to implement this requirement for write requests that span multiple storage nodes.
• Data visibility. POSIX mandates that modified data is visible to all processes immediately after a write operation. With each client maintaining a
separate data cache, satisfying this requirement is tricky.
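The last-writer-wins and data-visibility rules can be demonstrated with ordinary POSIX calls; the following sketch uses Python's os module (the temporary file name is arbitrary). On a local file system, a single page cache satisfies both guarantees trivially; a parallel file system must coordinate the same outcome across multiple storage nodes and client caches.

```python
import os
import tempfile

# Two writers update the same byte range. POSIX requires that a
# subsequent read observe the data written last ("last writer wins"),
# and that the data be visible immediately after write() returns.
path = os.path.join(tempfile.mkdtemp(), "shared")
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.pwrite(fd, b"AAAA", 0)   # writer 1
os.pwrite(fd, b"BBBB", 0)   # writer 2 overwrites the same offset
data = os.pread(fd, 4, 0)   # any reader must now see writer 2's data
os.close(fd)
assert data == b"BBBB"
```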
The high-performance community is proposing extensions to the POSIX I/O API to
address the needs of the growing high-end computing sector [69]. These extensions lev-
erage the intensely cooperative nature of high-performance applications, which are capa-
ble of arranging data access to avoid file address space conflicts.
2.5. NFS architectures
Figure 2.5a displays the NFS architecture, in which trusted clients access a single disk
on a single server. Depending on the required performance, reliability, etc. of an NFS
installation, each component of this model can be realized in different ways using a wide-
range of technologies.
• NFS clients. NFS client implementations exist for nearly every operating system.
Clients can be diskless and can contain multiple network interfaces (NICs) for in-
creased bandwidth.
• Host network. IP-based networks use TCP, UDP, or SCTP [108]. Remote Di-
rect Memory Access (RDMA) support is currently under investigation [109].
(a) Simple NFS remote data access (b) NFS namespace partitioning
Figure 2.5: NFS remote data access.
• NFS server. The NFS server provides file, metadata, and lock services to the
NFSv4 client. The VFS/Vnode interface translates client requests to requests to
the underlying file system.
• Storage and storage network. NFS supports almost any modern storage system.
SCSI and SATA command sets are common with directly connected disks.
Hardware RAID systems are also common, and software RAID systems are
emerging as a cheaper (yet slower) alternative. iSCSI is emerging as a ubiquitous
storage protocol for IP-based networks. FCIP enables communication between
Fibre Channel SANs by tunneling FCP, the protocol for Fibre Channel networks,
over IP. iFCP, on the other hand, allows Fibre Channel devices to connect di-
rectly to IP-based networks by replacing the Fibre Channel transport with TCP/IP.
NFS works well with small groups, but is limited in every aspect of its design: con-
sumption of compute cycles, memory, bandwidth, storage capacity, etc. To scale up the
number of clients, Figure 2.5b shows how many enterprises, such as universities and
large organizations, partition a file system among several NFS servers. For example, par-
titioning student home directories among many servers spreads the load among the serv-
ers.
The Automounter automatically mounts file systems as clients access different parts
of the namespace [110]. High-demand read-only data may be replicated similarly, with
the Automounter automatically mounting the closest replica server. One problem with
the Automounter is a noticeable delay as users navigate into mounted-on-demand directo-
ries. Other disadvantages of this approach include administration costs, backup man-
agement, load balancing, visible namespace reorganization, and quota management. De-
spite these problems, many organizations use this technique to provide access to very
large data stores.
Figure 2.6: NFS with databases.
Database deployments are emerging as another environment for NFS. Figure 2.6 il-
lustrates database clients using a local disk with the database server storing object and log
files in NFS. Database systems manage their own caches and depend on synchronous writes; therefore, these NFS installations disable client caching and asynchronous writes.
To improve the performance of synchronous writes, some NFS hardware vendors (such
as Network Appliance) write to NVRAM synchronously, and then asynchronously write
this data to disk. It is common for database applications to have many servers accessing
a single file at the same time. CIFS is unsuitable for this type of database deployment
due to its lack of appropriate lock semantics, a specific write block size, and appropriate
commit semantics.
The original goals of distributed file systems were to provide distributed access to lo-
cal file systems, but NFS is now widely used to provide distributed access to other net-
work-based file systems. Although parallel file systems already have remote data access
capability, many lack heterogeneous clients, a strong security model, and satisfactory
WAN performance.
Figure 2.7 illustrates standard NFSv4 clients accessing symmetric and asymmetric
out-of-band parallel file systems. The NFSv4 server accesses a single parallel file system
client and translates all NFS requests to parallel file system specific operations. Symmet-
ric OOB file systems are often limited to a small number of nodes due to the high cost of
a SAN. NFS can increase the number of clients accessing a symmetric OOB file system
by attaching additional NFSv3 clients to each node in Figure 2.7a.
2.6. Additional NFS architectures
This section examines NFS architecture variants that attempt to scale one or more as-
pects of NFS. These architectures transform NFS into a type of parallel file system, in-
creasing scalability but eliminating the file system independence of NFS.
2.6.1. NFS-based asymmetric parallel file system
Many examples exist of systems that use the NFS protocol to create an asymmetric
parallel file system. In these systems, a directory service (metadata manager) typically
manages the namespace while files are striped across NFS data servers. NFS clients use
the directory service to retrieve file metadata and file location information. Data servers
store file segments (stripes) in a local file system such as Ext3 [111] or ReiserFS [112].
Several directory service strategies have been suggested, each offering different ad-
vantages.
Explicit metadata node. This strategy uses an NFS server as the metadata node that
manages the file system. Clients access the metadata node to retrieve a list of data serv-
ers and layout information describing how files are striped across them. Clients maintain
data consistency by applying advisory locks to the metadata node. Unmodified NFS
servers are used for storage. Support for mandatory locks or access control lists requires a
communication channel to coordinate state information among the metadata nodes and
data servers.
Store metadata information in files. The Expand file system [113] stores file metadata
and location information in regular files in the file system. Clients determine the NFS
data server and pathname of the metadata file by hashing the file pathname. To perform
I/O, a client opens the metadata file for a particular file to retrieve data access informa-
tion. Expand uses unmodified NFS servers but extends NFS clients to locate, parse, in-
terpret, and use file layout information. A major problem with file hashing based on
pathname is that renaming files requires migrating metadata between data servers. In addition, Expand uses data server names with the metadata file to describe file striping information, which complicates incremental expansion of data servers.

(a) Symmetric (b) Asymmetric
Figure 2.7: NFS exporting symmetric and asymmetric parallel file systems (PFS).
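The pathname-hashing scheme can be sketched as follows. The server names and hash function are illustrative assumptions rather than Expand's actual algorithm; the sketch also shows why a rename generally forces metadata migration.

```python
import hashlib

SERVERS = ["nfs0", "nfs1", "nfs2", "nfs3"]  # hypothetical data servers

def metadata_server(path: str) -> str:
    # Hash the full pathname to pick the server holding the metadata file.
    digest = hashlib.md5(path.encode()).digest()
    return SERVERS[digest[0] % len(SERVERS)]

# Renaming a file generally changes the hash, so its metadata must
# migrate to a different server -- the drawback noted above.
before = metadata_server("/home/alice/results.dat")
after = metadata_server("/home/alice/results-final.dat")
```

Growing the server list also changes nearly every hash result, which is one reason pathname-keyed placement complicates incremental expansion.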
Directory service elimination. The Bigfoot-NFS file system [114] eliminates the direc-
tory service outright, instead requiring clients to gather file information by analyzing the
file system. Any given file is stored on a single unmodified NFS server. Clients discover
which data server stores a file by requesting file information from all servers. Clients ig-
nore failed responses and use the data server that returned a successful response to access
the file. Bigfoot-NFS reduces file discovery time by parallelizing NFS client requests.
The lack of a metadata service simplifies failure recovery, but the inability to stripe files across multiple data servers and increased network traffic limit the I/O bandwidth to a given file.
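A Bigfoot-style parallel probe can be sketched as follows; the server table and probe function simulate NFS lookups rather than issuing real GETATTR requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated cluster: each file lives on exactly one server. A real
# implementation would issue an NFS lookup (GETATTR) to every server.
CONTENTS = {"nfs0": {"a.txt"}, "nfs1": {"b.txt"}, "nfs2": set()}

def probe(server: str, name: str):
    # Returns the server on success, None on a failed lookup.
    return server if name in CONTENTS[server] else None

def locate(name: str):
    # Query all servers in parallel; ignore failures, keep the hit.
    with ThreadPoolExecutor(max_workers=len(CONTENTS)) as pool:
        results = pool.map(lambda s: probe(s, name), CONTENTS)
    hits = [r for r in results if r is not None]
    return hits[0] if hits else None
```

Parallelizing the probes bounds discovery latency by the slowest server rather than the sum of all responses.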
2.6.2. NFS client request forwarding
The nfsp file system [91] architecture contains clients, data servers, and a metadata
node. Unmodified NFS clients mount and issue file metadata and I/O requests to the
metadata node. Each file is stored on a single NFS data server. Metadata nodes forward
client I/O requests to the data server containing the file, which replies directly to the client by spoofing its IP address. The inability to stripe files across multiple data servers and a single metadata node forwarding I/O requests limit the I/O bandwidth to a given file.
2.6.3. NFS-based peer-to-peer file system
The Kosha file system [115] uses the NFS protocol to create a peer-to-peer file sys-
tem. Kosha taps into available storage space on client nodes by placing an NFS client and
server on each node. To perform I/O, unmodified NFS clients mount and access their
local NFS server (through the loopback network device). An NFS server routes local client requests to the remote NFS server containing the requested file. An NFS server determines the correct NFS data server by hashing the file pathname. Each file is stored on
a single NFS data server, which limits the I/O bandwidth to a given file.
2.6.4. NFS request routing
The Slice file system prototype [116] divides NFS requests into three classes: large
I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers,
routes NFS client requests between storage servers, small-file servers, and directory serv-
ers. Large I/O flows directly to storage servers while small-file servers aggregate I/O op-
erations of small files and the initial segments of large files. (Chapter VI investigates the
small I/O problem in more depth, demonstrating that small I/O requests are not limited to
small files.)
Slice introduces two policies for transparent scaling of the name space among the di-
rectory servers. The first method uses a directory as the unit of distribution. This works
well when the number of active directories is large relative to the number of directory
servers, but it binds large directories to a single server. The second method uses a file
pathname as the unit of distribution. This balances request distributions independent of
workload by distributing them probabilistically, but increases the cost and complexity of
coordination among directory servers.
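Slice's request classification can be sketched as a simple routing function; the 64 KB threshold and the request shapes are illustrative assumptions, not values from the Slice prototype.

```python
# Route each NFS request to a server class, in the spirit of Slice's
# muProxy. The threshold and operation names are illustrative.
SMALL_IO_LIMIT = 64 * 1024

def route(op: str, size: int = 0, offset: int = 0) -> str:
    if op in ("lookup", "create", "remove", "rename", "readdir"):
        return "directory-server"
    # Small transfers and the initial segments of files go to the
    # small-file servers; large sequential I/O flows straight to storage.
    if size <= SMALL_IO_LIMIT or offset < SMALL_IO_LIMIT:
        return "small-file-server"
    return "storage-server"
```

Because routing depends only on the request itself, the proxy can be stateless for I/O traffic, which is what makes interposing it transparent to clients.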
2.7. NFSv4 protocol
NFSv4 extends versions 2 and 3 with the following features:
• Fully integrated security. NFSv4 offers authentication, integrity, and privacy by
mandating support of RPCSEC_GSS [117], an API that allows support for a vari-
ety of security mechanisms to be used by the RPC layer. NFSv4 requires support
of the LIPKEY [118] and SPKM-3 [118] public key mechanisms and the Kerberos
V5 symmetric key mechanism [119]. NFSv4 also supports security flavors other
than RPCSEC_GSS, such as AUTH_NONE, AUTH_SYS, and AUTH_DH.
AUTH_NONE provides no authentication. AUTH_SYS provides a UNIX-style
authentication. AUTH_DH provides DES-encrypted authentication based on a
network-wide string name, with session keys exchanged via the Diffie-Hellman
public key scheme. The requirement of support for a base set of security proto-
cols is a departure from earlier NFS versions, which left data privacy and integrity
support as implementation details.
• Compound requests. Operation bundling, a feature supported in CIFS, allows
clients to combine multiple operations into a single RPC request. This feature reduces the number of round trips between the client and the server needed to accomplish a task, e.g., opening a file, and simplifies the specification of the protocol.
• Incremental protocol extensions. NFSv4 allows extensions that do not com-
promise backward compatibility through a series of minor versions.
• Stateful server. The introduction of OPEN and CLOSE commands creates a
stateful server. This allows enhancements such as mandatory locking and server
callbacks and opens the door to consistent client caching. See Section 2.7.1.
• Root file handles. NFSv4 does not use a separate mount protocol to provide the
initial mapping between a path name and file handle. Instead, a client uses a root
file handle and navigates through the file system from there.
• New attribute types. NFSv4 supports three new types of attributes: mandatory,
recommended, and named. NFSv4 also supports access control lists. This attrib-
ute model is extensible in that new attributes can be introduced in minor revisions
of the protocol.
• Internationalization. NFSv4 encodes file and directory names with UTF-8 to accommodate international character sets.
• File system migration and replication. The fs_location attribute provides for
file system migration and replication.
• Cross-platform interoperability. NFSv4 enhances interoperability with the in-
troduction of recommended and named attributes, and by mandating support for
TCP and Windows share reservations.
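The compound-request feature above can be illustrated structurally. The operation names below are taken from the NFSv4 protocol, but the encoding is only a sketch, not the actual XDR representation.

```python
# One RPC carrying several NFSv4 operations. The server executes the
# operations in order and stops at the first failure; the current file
# handle set by PUTROOTFH/LOOKUP is implicit state for later operations.
compound = [
    ("PUTROOTFH", {}),                        # start at the root file handle
    ("LOOKUP", {"name": "data"}),
    ("LOOKUP", {"name": "results.dat"}),
    ("OPEN", {"access": "READ", "deny": "NONE"}),
    ("READ", {"offset": 0, "count": 4096}),
]
# A single round trip replaces five separate requests.
round_trips_saved = len(compound) - 1
```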
2.7.1. Stateful server: the new NFSv4 scalability hurdle
The broadest architectural change for NFSv4 is the introduction of a stateful server to
support exclusive opens called share reservations, mandatory locking, and file delegations. This change significantly increases the complexity of the protocol, its implementations, and most notably its fault tolerance semantics. In addition, access to a single and
shared data store can no longer be exported by multiple NFSv4 servers without a mecha-
nism for maintaining global state consistency among the servers.
A share reservation controls access to a file, based on the CIFS oplocks model [2]. A
client issuing an OPEN operation to a server specifies both the type of access required
(read, write, or both) and the types of access to deny others (deny none, deny read, deny
write, or deny both). The NFSv4 server maintains access/deny state to ensure that future
OPEN requests do not conflict with current share reservations. NFSv4 also supports
mandatory and advisory byte-range locks.
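The access/deny bookkeeping for share reservations can be sketched as follows; the bitmask encoding mirrors the spirit of the NFSv4 OPEN arguments, but the state table itself is a simplification.

```python
# Check whether a new OPEN conflicts with existing share reservations.
READ, WRITE = 1, 2  # deny 0 means "deny none", READ|WRITE "deny both"

def open_conflicts(existing, access, deny):
    """existing: list of (access, deny) pairs for current opens."""
    for held_access, held_deny in existing:
        # The new open's access must not be denied by a current holder,
        # and the new open must not deny a current holder's access.
        if (held_deny & access) or (deny & held_access):
            return True
    return False
```

For example, a reader holding (READ, deny none) blocks a later open that requests deny-read, and an open held with deny-write blocks any later writer.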
An NFSv4 server maintains information about clients and their currently open files,
and can therefore safely pass control of a file to the first client that opens it. A delegation
grants a client exclusive responsibility for consistent access to the file, allowing client
processes to acquire file locks without server communication. Delegations come in two
flavors. A read delegation guarantees the client that no other client has the ability to
write to the file. A write delegation guarantees the client that no other client has read or
write access to the file. If another client opens the file, it breaks these conditions, so the
server must revoke the delegation by way of a callback. The server places a lease on all
state, e.g., client connection information, file and byte-range locks, delegations. If the
server does not hear from a given client within the lease period, the server is permitted to
discard all of the client’s associated state. A failed server that recovers enters a grace pe-
riod, lasting up to a few minutes, which allows clients to detect the server failure and re-
acquire their previously acquired locks and delegations.
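The lease mechanism can be sketched as follows; the 90-second lease period is an illustrative choice, since the actual value is server policy.

```python
import time

LEASE_PERIOD = 90.0  # seconds; illustrative, set by server policy

class ClientState:
    """All server-held state for one client: opens, locks, delegations."""
    def __init__(self):
        self.last_renewal = time.monotonic()

    def renew(self):
        # Any operation from the client implicitly renews its lease.
        self.last_renewal = time.monotonic()

    def expired(self, now=None):
        # After the lease period with no renewal, the server may
        # discard the client's locks and delegations.
        now = time.monotonic() if now is None else now
        return now - self.last_renewal > LEASE_PERIOD

state = ClientState()
```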
The NFSv4 caching model combines the efficient support for close-to-open semantics
of AFS with the block-based caching and easier recovery semantics provided by
DCE/DFS, Sprite, and Not Quite NFS. As with NFSv3, clients cache attribute and direc-
tory information for a duration determined by the client. However, a client holding a
delegation is assured that the cached data for that file is consistent.
Proposed extensions to NFSv4 include directory delegations, which grant clients ex-
clusive responsibility for consistent access to directory contents, and sessions, which pro-
vide exactly-once semantics, multipathing and trunking of transport connections, RDMA
support, and enhanced security.
2.8. pNFS protocol
To meet enterprise and grand challenge-scale performance and interoperability re-
quirements, the University of Michigan hosted a workshop on December 4, 2003, titled
“NFS Extensions for Parallel Storage (NEPS)” [120]. This workshop raised awareness of
the need for increasing the scalability of NFSv4 [121] and created a set of requirements
and design considerations [122]. The result is the pNFS protocol [123], which promises
file access scalability as well as operating system and storage system independence.
pNFS separates the control and data flows of NFSv4, allowing data to transfer in parallel
from many clients to many storage endpoints. This removes the single server bottleneck
by distributing I/O across the bisectional bandwidth of the storage network between the
clients and storage devices.
The goals of pNFS are to:
• Enable implementations to match or exceed the performance of the underlying
file system.
• Provide high per-file, per-directory and per-file system bandwidth and capacity.
• Support any storage protocol, including (but not limited to) block-, object-, and
file-based storage protocols.
• Obey NFSv4 minor versioning rules, which require that all future versions have
legacy support.
• Support existing storage protocols and infrastructures, e.g., SBC on Fibre Channel [16] and iSCSI, OSD on Fibre Channel and iSCSI, and NFSv4.
For a file system to realize scalable data access, it must be able to achieve perform-
ance gains relative to the amount of additional hardware. For example, if physical disk
access is the I/O bottleneck, a truly scalable file system can realize benefits from increas-
ing the number of disks. The cycle of identifying bottlenecks, removing them, and in-
creasing performance is endless. pNFS provides a framework for continuous saturation
of system resources by separating the data and control flows and by not specifying a data
flow protocol. Focusing on the control flow and leaving the details of the data flow to
implementers allows continuous I/O throughput improvements without protocol modification. Implementers are free to use the best storage protocol and data access strategy for
their system.
pNFS extensions to NFSv4 focus on device discovery and file layout management.
Device discovery informs clients of available storage devices. A file layout consists of
all information required by a client to access a byte range of a file. For example, a layout
for the block-based Fibre Channel Protocol may contain information about block size,
offset of the first block on each storage device, and an array of tuples that contains device
identifiers, block numbers, and block counts. To ensure the consistency of the file layout
and the data it describes, pNFS includes operations that synchronize the file layout
among the pNFS server and its clients. To ensure heterogeneous storage protocol support
and unlimited data layout strategies, the file layout is opaque in the protocol.
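As a concrete (but hypothetical) example of the mapping a layout encodes, the following sketch resolves a file offset under simple round-robin striping. Real pNFS layouts are opaque and storage-protocol specific; the stripe size and device list here are assumptions.

```python
# Map a file offset to (device, device offset) under round-robin
# striping across four hypothetical object storage devices.
STRIPE = 64 * 1024
DEVICES = ["osd0", "osd1", "osd2", "osd3"]

def locate_block(offset: int):
    stripe_index = offset // STRIPE
    device = DEVICES[stripe_index % len(DEVICES)]
    # Offset within the device: one stripe per full round of devices.
    device_offset = (stripe_index // len(DEVICES)) * STRIPE + offset % STRIPE
    return device, device_offset
```

Given such a map, a client can issue I/O to every device in parallel, which is precisely the direct data path that pNFS layouts enable.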
pNFS does not address client caching or coherency of data stored in separate client
caches. Rather, it assumes that existing NFSv4 cache-coherency mechanisms suffice.
Separating the control and data paths in pNFS introduces new security concerns. Al-
though RPCSEC_GSS continues to secure the NFSv4 control path, securing the data path
may require additional effort. pNFS does not define a new security architecture but dis-
cusses general security considerations. For example, certain storage protocols cannot
provide protection against eavesdropping. Environments that require confidentiality must
either isolate the communication channel or use standard NFSv4.
In addition, pNFS does not define mechanisms to recover from errors along the data
path, but leaves their definition to the supporting data access protocols instead.
CHAPTER III
A Model of Remote Data Access
Modern data infrastructures are complex, consisting of numerous hardware and soft-
ware components. Issuing a simple question such as, “What is the size of a file?” may
spark a flurry of network traffic and disk accesses, as each component gathers its portion
of the answer. A necessary condition for improving data access is a clear picture of a
system’s control and data flows. This picture includes the components that generate and
receive requests, the number of components that generate and receive requests, the timing
and sequence of requests, and the traversal path of the requests and responses. Successful
application of this knowledge helps to identify bottlenecks and inefficiencies that fetter
the scalability of the system.
To clarify the novel contributions of the NFSv4 and parallel file system architecture variants discussed in the succeeding chapters, this chapter presents a coarse architecture that identifies the components and communication channels for remote data access. I use
this architecture to illustrate the performance bottlenecks of using NFSv4 with parallel
file systems. Finally, I detail the principal requirements for accessing remote data stores.
3.1. Architecture for remote data access
This section describes an architecture that encapsulates the components and commu-
nication channels of data access. Shown in Figure 3.1, remote data access consists of five
major components: application, data, control, metadata, and storage. In addition to the
flow of data, remote data access also consists of five major control paths that manage
data integrity and facilitate data access. These components and communication channels
form a coarse architecture for describing remote data access, which I use to describe and
analyze the remote data access architectures in subsequent chapters. I use circles,
squares, pentagons, hexagons, and disks to represent the application, data, control, meta-
data, and storage components of the data access architecture, respectively.
This dissertation applies the architecture at the granularity of a file system, providing
a clear picture of file system interactions. A file system contains each of the five components, although some are “virtual” and comprise other components. In addition, a single
machine may assume the roles of multiple components, which I portray by adjoining
components.
The following provides a detailed description of each component:
1. Application. Generate file system, file, and data requests. Typically, these
are nodes running applications.
2. Data. Fulfill application component I/O requests through communication
with storage. These support a specific storage protocol.
3. Control. Fulfill application component metadata requests through commu-
nication with metadata components. These support a specific metadata protocol.
Figure 3.1: General architecture for remote data access. Application components generate and analyze data. Data and control components fulfill application requests by accessing storage and metadata components. Metadata components describe and control access to storage. Storage is the persistent repository for data. Directional arrows originate at the node that initiated the communication.
4. Metadata. Describe and control access to storage, e.g., file and directory
location information, access control, and data consistency mechanisms. Examples
include an NFS server and parallel file system metadata nodes.
5. Storage. Persistent repository for data, e.g., Fibre Channel disk array, or
nodes with a directly attached disk.
With data components in the middle, application and storage components bookend
the flow of data. Control components support a metadata protocol for communication
with file system metadata component(s). These components connect to one or more net-
works, each supporting different types of traffic. A storage network, e.g., Fibre Channel,
Infiniband, Ethernet, or SCSI, connects data and storage components. A host network is
IP-based and uses metadata components to facilitate data sharing.
Independent control flows request, control, and manage different types of information. The different types of control flows are as follows:
1. Control ↔ Metadata. To satisfy application component metadata requests, con-
trol components retrieve file and directory information, lock file system objects,
and authenticate. To ensure data consistency, metadata components update or re-
voke file system resources on clients.
2. Control ↔ Control. This flow coordinates data access, e.g., collective I/O.
3. Metadata ↔ Metadata. Systems with multiple metadata nodes use this flow to
maintain metadata consistency and to balance load.
4. Metadata ↔ Storage. This flow manages storage, synchronizing file and direc-
tory metadata information as well as access control information. It can also facili-
tate recovery.
5. Storage ↔ Storage. This flow facilitates data redistribution and migration.
3.2. Parallel file system data access architecture
Figure 3.2a depicts a symmetric parallel file system with data, control, metadata, and
application components all residing on the same machine. Storage consists of storage
devices accessed via a block-based protocol, e.g., GPFS, or GFS. Figure 3.2b shows an
asymmetric parallel file system with application, control, and data components residing
on the same machine, metadata components on separate machines. Storage consists of
storage devices accessed via a block-, file-, or object-based protocol, e.g., Lustre, or
PVFS2. Storage for block-based systems consists of a disk array while storage for ob-
ject- and file-based systems consists of a fully functional node formatted with a local file
system such as Ext3 or XFS [124].
3.3. NFSv4 data access architecture
Viewing the NFSv4 and parallel file system architectures as a single, integrated archi-
tecture allows the identification of performance bottlenecks and opens the door for devis-
ing mitigation strategies. Figure 3.3 displays my base instantiation of the general archi-
tecture, NFSv4 exporting a parallel file system. NFSv4 application metadata requests are
fulfilled by an NFSv4 control component that communicates with an NFSv4 metadata
component on the NFSv4 server, which in turn uses a PFS control component to commu-
nicate with the PFS metadata component.

(a) Symmetric (b) Asymmetric
Figure 3.2: Parallel file system data access architectures. (a) A symmetric parallel file system has data, control, metadata, and application components all on the same machine; storage consists of storage devices accessed via a block-based protocol. (b) An asymmetric parallel file system has data, control, and application components on the same machine, with metadata components on separate machines; storage consists of storage devices accessed via a block-, file-, or object-based protocol.

NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which
in turn communicates directly with storage. The NFSv4 virtual storage component is the
entire PFS architecture. The PFS virtual application component is the entire NFSv4 ar-
chitecture. Figure 3.3 does not display the virtual components.
Figure 3.3 readily illustrates the NFSv4 “single server” bottleneck discussed in Sec-
tion 2.4.2.1. Data requests from every NFSv4 client must fit through a single NFSv4
server using a single PFS data component. Subsequent chapters vary this architecture to
attain different levels of scalability, performance, security, heterogeneity, transparency,
and independence.
3.4. Remote data access requirements
The utility of each remote data access architecture variant presented in this disserta-
tion derives from several data access requirements. These requirements fall into the fol-
lowing categories:
• I/O workload. A data access architecture must deliver satisfactory performance.
An application’s I/O workload determines the type of performance required, e.g.,
single and multiple client I/O throughput, small I/O request, file creation and
metadata management, etc.
Figure 3.3: NFSv4-PFS data access architecture. With NFSv4 exporting a parallel file system, NFSv4 application metadata requests are fulfilled by an NFSv4 control component that communicates with the NFSv4 metadata component, which in turn uses a PFS control component to communicate with the PFS metadata component. NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which in turn communicates directly with storage. The NFSv4 storage component is the entire PFS architecture. The PFS application component is the entire NFSv4 architecture. With a symmetric parallel file system, the PFS metadata and data components are coupled.
• Security and access control. Many high-end computers run applications that
deal with both private and public information. Systems must be able to handle
both types of applications, ensuring that sensitive data is separate and secure. Se-
curity can be realized through air-gap, encryption, node fencing, and numerous
other methods. In addition, cross-realm access control to encourage research col-
laboration must be transparent and foolproof.
• Wide area networks. Beyond heightened security and access control require-
ments, successful global collaborations require high performance, heterogeneous,
and transparent access to data, independent of the underlying storage system.
• Local area networks. Performance and scalability are key requirements of ap-
plications designed to run in LAN environments. Heterogeneous data access and
storage system independence are also becoming increasingly important. For ex-
ample, in many multimedia studios, designers using PCs and large UNIX render-
ing clusters access multiple on- and off-site storage systems [11].
• Development and management. With today’s increasing reliance on middle-
ware applications, reducing development and administrator training costs and
problem determination time is vital.
3.5. Other general data access architectures
3.5.1. Swift architecture
The Swift parallel file system was an early pioneer in achieving scalable I/O through-
put using distributed disk striping [96]. Figure 3.4 displays the Swift architecture com-
ponents. Swift did not define a specific architecture, but instead listed four optional
components. In general, client components perform I/O by using a distribution agent
component to retrieve a transfer plan from a storage mediator component and transfer
data to/from storage agent components. The original Swift prototype used a standard
transfer plan, obviating the need for storage mediators.
Swift architecture components map almost one-to-one with the general architecture
components introduced in this chapter. Swift storage mediators function as metadata
components, Swift storage agents function as storage components, and Swift clients func-
tion as application components. The architecture presented here splits the role of a Swift
distribution agent into control and data components. Separating I/O and metadata re-
quests into separate components lets us represent out-of-band systems that use different
protocols for each channel. In addition, applying the architecture components iteratively
and including all communication channels provides a holistic view of remote data access.
3.5.2. Mass storage system reference model
In the late 1980s, the IEEE Computer Society Mass Storage Systems and Technology
Technical Committee attempted to organize the evolving storage industry by creating a
Mass Storage System Reference Model [93, 94], now referred to as the IEEE Reference
Model for Open Storage Systems Interconnection (OSSI model) [125]. Shown in Figure
3.5, its goal is to provide a framework for the coordination of standards development for
storage systems interconnection and a common perspective for existing standards. One
system—perhaps the only one—based directly on the OSSI model is the High-
performance Storage System (HPSS) [126].
The OSSI model decomposes a complete storage system into the following storage
modules, which are defined by several IEEE P1244 standards documents:
Figure 3.4: Swift architecture [96]. The Swift architecture consists of four components: clients, distribution agents, a storage mediator, and storage agents. Clients perform I/O by using a distribution agent to retrieve a transfer plan from a storage mediator and transfer data to/from storage agents.
• Application Environment Profile. The environmental software interfaces re-
quired by open storage system services.
• Object Identifiers. The format and algorithms used to generate globally unique
and immutable identifiers for every element within an open storage system.
• Physical Volume Library. The software interfaces for services that manage re-
movable media cartridges and their optimization.
• Physical Volume Repository. The human and software interfaces for services
that stow cartridges and mount these cartridges onto devices, employing either ro-
botic or human transfer agents.
• Data Mover. The software interfaces for services that transfer data between two
endpoints.
• Storage System Management. A framework for consistent and portable services
to monitor and control storage system resources as motivated by site-specified
storage management policies.
• Virtual Storage Service. The software interfaces to access and organize persis-
tent storage.
The goals of the architecture presented in this chapter complement those of the OSSI
model. Figure 3.6 demonstrates how my architecture encompasses the OSSI modules
and protocols. The IEEE developed the OSSI model to expose areas where standards are
Figure 3.5: Reference Model for Open Storage Systems Interconnection (OSSI) [125]. The OSSI model diagram displays the software design relationships between the primary modules in a mass storage system to facilitate standards development for storage systems interconnection.
necessary (or need improvement), so they could be implemented and turned into commercial products. The OSSI model does not capture the physical nodes or data and control flows in a data architecture, but rather the design relationships between components.
For example, Figure 3.5 displays data movers and clients as separate objects connected
with a request flow, a representation more in line with modern software design tech-
niques than physical implementation. The architecture presented in this chapter focuses
on identifying potential bottlenecks by grouping a node’s components and identifying the
data and control flows that bind the nodes.
Figure 3.6: General data access architecture view of OSSI model. The OSSI modules in the general data access architecture. The Virtual Storage Service uses Data Movers to route data between application and storage components. Control components use the Physical Volume Library, Physical Volume Repository, and Virtual Storage Service to mount and obtain file metadata information. Metadata components use the Storage System Management protocol to manage storage.
CHAPTER IV
Remote Access to Unmodified Parallel File Systems
Collaborations such as TeraGrid [127] allow global access to massive data sets in a
nearly seamless environment distributed across several sites. Data access transparency
allows users to seamlessly access data from multiple sites using a common set of tools
and semantics. The degree of transparency between sites can determine the success of
these collaborations. Factors affecting data access transparency include latency, band-
width, security, and software interoperability.
To improve performance and transparency at each site, the use of parallel file systems
is on the rise, allowing applications high-performance access to a large data store using a
single set of semantics. Parallel file systems can adapt to spiraling storage needs and re-
duce management costs by aggregating all available storage into a single framework.
Unfortunately, parallel file systems are highly specialized, lack seamless integration and modern security features, are often limited to a single operating system and hardware platform, and suffer from slow offsite performance. In addition, many parallel file systems
are proprietary, which makes it almost impossible to add extensions for a user’s specific
needs or environment.
NFSv4 allows researchers access to remote files and databases using the same pro-
grams and procedures that they use to access local files, as well as obviating the need to
create and update local copies of a data set manually. To meet quality of service re-
quirements across metropolitan and wide-area networks, NFSv4 may need to use all
available bandwidth provided by the parallel file system. In addition, NFSv4 must be
able to provide parallel access to a single file from large numbers of clients, a common
requirement of scientific applications.
This chapter discusses the challenge of achieving full utilization of an unmodified
storage system’s available bandwidth while retaining the security, consistency, and het-
erogeneity features of NFSv4—features missing in many storage systems. I introduce
extensions that allow NFSv4 to scale beyond a single server by distributing data access
across the data components of the remote data store. These extensions include a new
server-to-server protocol and a file description and location mechanism. I refer to NFSv4
with these extensions as Split-Server NFSv4.
The remainder of this chapter is organized as follows. Section 4.1 discusses scaling
limitations of the NFSv4 protocol. Section 4.2 describes the NFSv4 protocol extensions
in Split-Server NFSv4. Sections 4.3 and 4.4 discuss fault tolerance and security implica-
tions of these extensions. Section 4.5 provides performance results of my Linux-based
prototype and discusses performance issues of NFS with parallel file systems. Section
4.5.5 reviews supplementary Split-Server NFSv4 designs, Section 4.6 discusses related work, and Section 4.7 concludes this chapter.
4.1. NFSv4 state maintenance
NFSv4 server state is used to support exclusive opens (called share reservations),
mandatory locking, and file delegations. The need to manage consistency of state infor-
mation on multiple nodes fetters the ability to export an object via multiple NFSv4 serv-
ers. This “single server” constraint becomes a bottleneck if load increases while other
nodes in the parallel file system are underutilized. Partitioning the file system space
among multiple NFS servers helps, but it increases administrative complexity and management cost, and it fails to address scalable access to a single file or directory, a critical requirement of many high-performance applications [7].
4.2. Architecture
Figure 4.1 shows how Split-Server NFSv4 modifies the NFSv4-PFS architecture of
Figure 3.3 by exporting the file system from all available parallel file system clients.
NFSv4 clients use their data component to send data requests to every available PFS data
component, distributing data requests across the bisectional bandwidth of the client net-
work. Any increase or decrease in available throughput of the parallel file system, e.g.,
additional nodes or increased network bandwidth, is reflected in Split-Server NFSv4 I/O
throughput. NFSv4 access control components exist with each PFS data component to
ensure that data servers allow only authorized data requests. The PFS data component
uses a control component to retrieve PFS file layout information for storage access.
The Split-Server NFSv4 extensions have the following goals:
• Read and write performance to scale linearly as parallel file system nodes are
added or removed.
• Support for unmodified parallel file systems.
• Single file system image with no partitioning.
• Negligible impact to NFSv4 security model and fault tolerance semantics.
• No dependency on special features of the underlying parallel file system.
4.2.1. NFSv4 extensions
To export a file from multiple NFSv4 servers sharing storage, the servers need a common view of their shared state. NFSv4 servers must therefore share state information and do so consistently, i.e., with single-copy semantics. Without an identical
view of the shared state, conflicting file and byte-range locks can cause data corruption or
allow malicious clients to read and write data without proper authorization.
Figure 4.1: Split-Server NFSv4 data access architecture. NFSv4 application components use a control component to obtain metadata information from an NFSv4 metadata component and an NFSv4 data component to fulfill I/O requests. The NFSv4 metadata component uses a PFS control component to retrieve PFS metadata and shares its access control information with the data servers to ensure the data servers allow only authorized data requests. The PFS data component uses a PFS control component to obtain PFS file layout information for storage access.
To provide a consistent view, I use a state server to copy the portions of state needed to serve READ, WRITE, and COMMIT requests at I/O nodes (designated data servers). Figure 4.2a shows the Split-Server NFSv4 architecture. Transforming NFSv4 into the out-of-band protocol shown in Figure 4.2b unleashes the I/O scalability of the underlying parallel file system.
Many clients performing simultaneous metadata operations can overburden a state server, e.g., coordinated clients simultaneously opening separate result files.
To reduce the load on the state server, a system administrator can partition file system
metadata among several state servers, ensuring that all state for a single file resides on a
single state server. In addition, control processing can be distributed by allowing data
servers to handle operations that do not affect NFSv4 server state, e.g., SETATTR and
GETATTR.
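The partitioning rule above can be sketched as a deterministic mapping from file handles to state servers, so that all state for one file lands on exactly one server. This is an illustrative sketch, not the dissertation's implementation; the server names are hypothetical.

```python
import hashlib

# Hypothetical state server names; a real deployment would obtain these
# from its administrative configuration.
STATE_SERVERS = ["state-a", "state-b", "state-c"]

def state_server_for(filehandle: bytes) -> str:
    """Map a file handle to the single state server that owns its state."""
    digest = hashlib.sha256(filehandle).digest()
    return STATE_SERVERS[int.from_bytes(digest[:4], "big") % len(STATE_SERVERS)]
```

Because the mapping is a pure function of the file handle, every client and data server computes the same owner for a given file without extra coordination.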
4.2.2. Configuration and setup
The mechanics of a client connection to a server are the same as in NFSv4, with the client mounting the state server that manages the file space of interest. Data servers register with the state server at start-up or any time thereafter and are immediately available to clients.
Figure 4.2: Split-Server NFSv4 design and process flow. Storage consists of a parallel file system such as GPFS. NFSv4 servers are divided into data servers, which handle READ, WRITE, and COMMIT requests, and a state server, which handles file system and stateful requests. The state server coordinates with the data servers to ensure only authorized client I/O requests are fulfilled.
4.2.3. Distribution of state information
On receiving an OPEN request, a state server picks a data server to service the data
request. The selection algorithm is implementation defined. In my prototype, I use
round-robin. The state server then places share reservation state for the request on the
selected data server. The following items constitute a unique identifier for share reserva-
tion state:
• Client name, IP address, and verifier
• Access/Deny authority
• File handle
• File open owner
When a client issues a CLOSE request, the state server first reclaims the state from
the data server. Once reclamation is complete, the standard NFSv4 close procedure pro-
ceeds.
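A minimal sketch of the OPEN/CLOSE state distribution described above, assuming a round-robin selector and treating the client name, IP address, and verifier as a single `client` value; the class and method names are mine, not part of the protocol.

```python
from itertools import cycle

class StateServer:
    """Sketch: place share-reservation state on a round-robin-selected
    data server on OPEN, and reclaim it on CLOSE."""

    def __init__(self, data_servers):
        self.replicated = {ds: {} for ds in data_servers}  # state held per data server
        self._next = cycle(data_servers)                   # round-robin selector

    def open(self, client, access_deny, filehandle, open_owner):
        # These four items constitute the unique identifier for the
        # share-reservation state ('client' stands for name/IP/verifier).
        key = (client, access_deny, filehandle, open_owner)
        ds = next(self._next)                # selection is implementation defined
        self.replicated[ds][key] = "share-reservation"
        return ds                            # the client directs its I/O here

    def close(self, client, access_deny, filehandle, open_owner):
        key = (client, access_deny, filehandle, open_owner)
        for state in self.replicated.values():
            state.pop(key, None)             # reclaim state from the data server first;
        # ... the standard NFSv4 close procedure then proceeds
```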
Support for locks does not require distributing additional state beyond share reserva-
tions. NFSv4 uses POSIX locks and relies on the locking subsystem of the underlying
parallel file system. Delegations also require no additional state on the data servers as the
state server manages conflicting access requests for a delegated file.
4.2.4. Redirection of clients
Split-Server NFSv4 extends the NFSv4 protocol with a new attribute called FILE_LOCATION, enabling it to provide access to a single file via multiple nodes.
The FILE_LOCATION attribute specifies:
• Data server location information
• Root pathname
• Read-only flag
Clients use FILE_LOCATION information to direct READ, WRITE, and COMMIT re-
quests to the named server. The root pathname allows each data server to have its own
namespace. The read-only flag declares whether the data server will accept WRITE
commands.
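The attribute's three fields and their use can be sketched as follows; the field and function names are illustrative and do not reflect the on-the-wire NFSv4 encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileLocation:
    """Illustrative stand-in for the FILE_LOCATION attribute."""
    data_server: str   # data server location information
    root_path: str     # per-data-server root pathname
    read_only: bool    # whether the data server accepts WRITE commands

def resolve(loc: FileLocation, nfs_path: str) -> str:
    """Prefix the data server's root pathname, letting each data server
    keep its own namespace."""
    return loc.root_path.rstrip("/") + "/" + nfs_path.lstrip("/")
```

A client would consult `read_only` before directing WRITE or COMMIT requests at the named server, and use `resolve` to translate its NFS path into that server's namespace.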
4.3. Fault tolerance
The failure model for Split-Server NFSv4 follows that of NFSv4 with the following
modifications:
1. A failed state server can recover its runtime state by retrieving each part of the
state from the data servers.
2. The failure of a data server is not critical to system operation.
4.3.1. Client failure and recovery
An NFSv4 server places a lease on all share reservations, locks, and delegations is-
sued to a client. Clients must send RENEW operations, akin to heartbeat messages, to
the server to retain their leases. If a server does not receive a RENEW operation from a
client within the lease period, the server may unilaterally revoke all state associated with
the given client. Leases are also implicitly renewed as a side effect of a client request
that includes its identifier. However, Split-Server NFSv4 redirects READ, WRITE, and
COMMIT operations to the data servers, so the renewal implicit in these operations is no
longer visible to the state server. Therefore, RENEW operations are sent to a client's mounted state server either by modifying the client to send explicit RENEW operations, or by engineering the data servers that are actively fulfilling client requests to send them.
Enabling data servers to send RENEW messages on behalf of a client improves scalabil-
ity by limiting the maximum number of renewal messages received by a state server to
the number of data server nodes.
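The renewal aggregation can be sketched like this: each data server folds every client it is actively serving into a single periodic RENEW message, so the state server sees at most one message per data server per lease period. The names are illustrative, not protocol elements.

```python
class DataServer:
    """Sketch: batch lease renewal for all actively served clients."""

    def __init__(self):
        self.active = set()

    def io_request(self, client_id):
        # A READ, WRITE, or COMMIT implies the client is alive.
        self.active.add(client_id)

    def send_renewals(self, state_server_inbox):
        # One message per data server per lease period, regardless of
        # how many clients this data server serves.
        if self.active:
            state_server_inbox.append(frozenset(self.active))
            self.active.clear()
```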
4.3.2. State server failure and recovery
A recovering state server stops servicing requests and queries data servers to rebuild
its state.
4.3.3. Data server failure and recovery
A failed data server is discovered by the state server when it tries to replicate state, and by clients when their requests to that server fail. A client obtains a new data server by reissuing the request for the FILE_LOCATION attribute. A data server that experiences a network partition
from the state server immediately stops fulfilling client requests, preventing a state server
from granting conflicting file access requests.
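The client-side recovery path can be sketched as a retry loop, with hypothetical `fetch_location` and `issue_read` callables standing in for the FILE_LOCATION query and the direct READ; this is an assumption-laden sketch, not the prototype's code.

```python
def read_with_failover(fetch_location, issue_read, attempts=3):
    """Sketch: on data server failure, reissue the FILE_LOCATION
    request to obtain a new data server and retry the READ."""
    for _ in range(attempts):
        server = fetch_location()       # state server names a data server
        try:
            return issue_read(server)   # direct READ to that server
        except ConnectionError:
            continue                    # server failed; fetch a new location
    raise ConnectionError("no reachable data server")
```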
4.4. Security
The addition of data servers to the NFSv4 protocol does not require extra security
mechanisms. The client uses the security protocol negotiated with a state server for all
nodes. Servers communicate over RPCSEC_GSS, the secure RPC mandated for all
NFSv4 commands.
4.5. Evaluation
This section compares unmodified NFSv4 with Split-Server NFSv4 as they export a
GPFS file system. The test environment is shown in Figure 4.3. All nodes are connected
via an IntraCore 35160 Gigabit Ethernet switch with 1500-byte Ethernet frames.
Server System: The five server nodes are equipped with Pentium 4 processors with a
clock rate of 850 MHz and a 256 KB cache; 2 GB of RAM; one Seagate 80 GB, 7200
RPM hard drive with an Ultra ATA/100 interface and a 2 MB cache; and two 3Com
3C996B-T Gigabit Ethernet cards. Servers run a modified Linux 2.4.18 kernel with Red
Hat 9.
Figure 4.3: Split-Server NFSv4 experimental setup. The system has four Split-Server NFSv4 clients and five GPFS servers exporting a common file system. The GPFS servers are exported by Split-Server NFSv4, consisting of a state server and at most four data servers.
Client System: Client nodes one through three are equipped with dual 1.7 GHz Pen-
tium 4 processors with a 256 KB cache; 2 GB of RAM; a Seagate 80 GB, 7200 RPM
hard drive with an Ultra ATA/100 interface and a 2 MB cache; and a 3Com 3C996B-T
Gigabit Ethernet card. Client node four is equipped with an Intel Xeon processor with a
clock rate of 1.4 GHz and a 256 KB cache; 1 GB RAM; an Adaptec 40 GB, 10K RPM
SCSI hard drive using an Ultra 160 host adapter; and an AceNIC Gigabit Ethernet card. All
clients run the Linux 2.6.1 kernel with a Red Hat 9 distribution.
Netapp FAS960 Filer: The storage device has two processors, 6 GB of RAM, and a
quad Gigabit Ethernet card. It is connected to eight disks running RAID4.
The five servers run the GPFS v1.3 parallel file system with a 40 GB file system and
a 16 KB block size. GPFS maintains a 32 MB file and metadata cache known as the
pagepool. All experiments use forty NFSv4 server threads except the Split-Server NFSv4 write experiments, which use a single NFSv4 server thread to improve performance (discussed in Section 4.5.4).
4.5.1. Scalability experiments
To evaluate scalability, I measure the aggregate I/O throughput while increasing the
number of clients accessing GPFS, NFSv4, and Split-Server NFSv4. Since both standard
NFSv4 and Split-Server NFSv4 export a GPFS file system, the GPFS configuration con-
stitutes the theoretical ceiling on NFSv4 and Split-Server NFSv4 I/O throughput. The
extra hop between the GPFS server and the NFS client prevents the performance of
NFSv4 and Split-Server NFSv4 from equaling GPFS performance. The goal is for Split-
Server NFSv4 to scale linearly with GPFS.
GPFS is configured as a four node GPFS file system directly connected to the filer.
NFSv4 is configured with a single NFSv4 server running on a GPFS node and four cli-
ents. Split-Server NFSv4 is configured with a state server, four data servers (each run-
ning on a GPFS file system node), and four clients. At most one client accesses each data
server during an experiment.
To measure the aggregate I/O throughput, I use the IOZone [128] benchmark tool. In
the first set of experiments, each client reads/writes a separate 500 MB file. In the second
set of experiments, each client reads/writes disjoint 500 MB portions of a single pre-
existing file. The aggregate I/O throughput is calculated when the last client completes
its task. The value presented is the average over ten executions of the benchmark. The
write timing includes the time to flush the client’s cache to the server. Clients and serv-
ers purge their caches before each read experiment. All read experiments use a warm
filer cache to reduce the effect of disk access irregularities.
The experimental goal is to test whether Split-Server NFSv4 scales linearly with
additional resources. I engineered a server bottleneck in the system by using a small
GPFS pagepool and block size, and by cutting the number of server clock cycles in half.
This ensures that each server is fully utilized, which implies that the results are applicable
to any system that needs to scale with additional servers.
4.5.2. Read performance
First, I measure read performance while increasing the number of clients from one to
four. Figure 4.4a shows the results with separate files. Figure 4.4b presents the results
with a single file. GPFS imposes a ceiling on performance with an aggregate read
throughput of 23 MB/s with a single server. With four servers, GPFS reaches 94.1 MB/s
and 91.9 MB/s in multiple and single file experiments respectively. The decrease in per-
formance for the single file experiment arises because all servers must access a single
metadata server. With Split-Server NFSv4, as I increase the number of clients and data
servers the aggregate read throughput increases linearly, reaching 65.7 MB/s with multi-
ple files and 59.4 MB/s for the single file experiment. NFSv4 aggregate read throughput
remains flat at approximately 16 MB/s in both experiments, a consequence of the single
server bottleneck.
4.5.3. Write performance
The second experiment measures the aggregate write throughput as I increase the
number of clients from one to four. I first measure the performance of all clients writing
to separate files, shown in Figure 4.5a.
GPFS sets the upper limit with an aggregate write throughput of 16.7 MB/s with a
single server and 61.4 MB/s with four servers. The fourth server overloads the filer’s
CPU. NFSv4 and Split-Server NFSv4 initially have an aggregate write throughput of ap-
proximately 8 MB/s. The aggregate write throughput of Split-Server NFSv4 increases
linearly, reaching a maximum of 32 MB/s. As in the read experiments, the aggregate
write throughput of NFSv4 remains flat as the number of clients is increased.
Figure 4.5b shows the results of each client writing to different regions of a single
file. The write performance of GPFS and NFSv4 is similar to the separate file experi-
ments. The major difference occurs with Split-Server NFSv4, achieving an initial aggre-
gate throughput of 6.1 MB/s and increasing to 18.7 MB/s. Poor performance and lack of
scalability is the result of modification time (mtime) synchronization between GPFS
servers. GPFS avoids synchronizing the mtime attribute when accessed directly. GPFS
must synchronize the mtime attribute when accessed with NFSv4 to ensure NFSv4 client
(a) Separate files (b) Single file
Figure 4.4: Split-Server NFSv4 aggregate read throughput (MB/s) versus number of nodes. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat.
(a) Separate files (b) Single file
Figure 4.5: Split-Server NFSv4 aggregate write throughput (MB/s) versus number of nodes. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. With separate files, Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat. With a single file, Split-Server NFSv4 performance is fettered by mtime synchronization.
cache consistency. Furthermore, GPFS includes the state server among servers that syn-
chronize the mtime attribute, further reducing performance.
4.5.4. Discussion
Split-Server NFSv4 scales linearly with the number of GPFS nodes except when mul-
tiple clients write to a single file, which experiences lower performance since GPFS syn-
chronizes the mtime attribute to comply with the NFS protocol. Client cache synchroni-
zation relies on the mtime attribute, but it is unnecessary in some environments. For ex-
ample, some programs cache data themselves and use the OPEN option O_DIRECT to
disable client caching for a file. Other programs require only non-conflicting write con-
sistency, handling data consistency without relying on locks or cache consistency mecha-
nisms. PVFS2 [129] is designed for such programs. To succeed in these environments,
the NFS protocol must relax its client cache consistency semantics.
NFS block sizes have tended to be small. Block sizes were 4 KB in NFSv2, and grew
to 8 KB in NFSv3. Most recent implementations now support 32 KB or 64 KB. Syn-
chronous writes along with hardware and kernel limitations are some of the original rea-
sons for small block sizes. Another is UDP, which uses IP fragmentation to divide each block-sized request into multiple packets; consequently, the loss of a single packet means the loss of the entire block. The introduction in 2002 of TCP and a larger buffer space to the Linux
implementation of NFS allows for larger block sizes, but the current Linux kernel has a
32 KB limit. This creates a disparity with many parallel file systems, which use a stripe
size of greater than 64 KB. To avoid this data request inefficiency, NFS implementations
need to catch up to parallel file systems like GPFS that support block sizes of greater than
1 MB.
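The disparity is easy to quantify. A short worked example, using the 32 KB Linux limit and a 1 MB block size of the kind GPFS supports:

```python
# A 1 MB parallel file system block served through 32 KB NFS requests
# fans out into 32 separate RPCs; a matching block size would need one.
pfs_block = 1024 * 1024        # 1 MB block, as in large GPFS configurations
nfs_block = 32 * 1024          # current Linux NFS limit noted above
print(pfs_block // nfs_block)  # → 32
```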
Multiple NFS server threads can also reduce I/O throughput. Even with a single NFS
client, the parallel file system assumes all requests are from different sources and per-
forms locking between threads. In addition, server threads can process read and write
requests out of order, hampering the parallel file system’s ability to improve its interac-
tion with the physical disk.
In NFSv3, the lack of OPEN and CLOSE commands leads to an implicit open and
close of a file in the underlying file system on every request. This does not degrade per-
formance with local file systems such as Ext3, but the extra communication required to
contact a metadata server in parallel file systems restricts NFSv3 throughput.
4.5.5. Supplementary Split-Server NFSv4 designs
4.5.5.1 File system directed load balancing
Split-Server NFSv4 distributes clients among the data servers using a round-robin al-
gorithm. Allowing the underlying file system to direct clients to data servers may im-
prove efficiency of available resources since the parallel file system may have more in-
sight into the current client load. For example, coordinated use of the parallel file sys-
tem’s data cache may prove effective with certain I/O access patterns. Allowing the un-
derlying file system to direct client load may also facilitate the use of multiple metadata
servers without an additional server-to-server protocol. This suggests extending the inter-
face between an NFSv4 server and its exported parallel file system.
4.5.5.2 Client directed load balancing
Split-Server NFSv4’s use of the FILE_LOCATION attribute enables a centralized way to
balance client load. To avoid having to modify the NFSv4 protocol, random distribution
of client load among data servers may prove sufficient in certain cases. Once clients dis-
cover the available data servers through configuration files or specialized mount options,
they can randomly distribute requests among the data servers. Data servers can retrieve
required state from the state server as needed.
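A sketch of this random, protocol-free distribution; `pick_data_server` is a hypothetical client helper, not an NFS client interface.

```python
import random

def pick_data_server(servers, rng=random):
    """Sketch: a client that has learned the data servers from
    configuration files or mount options distributes requests among
    them at random, with no change to the NFSv4 protocol."""
    return rng.choice(list(servers))
```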
4.5.5.3 NFSv4 client directed state distribution
Split-Server NFSv4’s server-to-server communication increases the load on a central
resource and delays stateful operations. Clients can reduce load on the metadata server
by assuming responsibility for the distribution of state. After a client successfully exe-
cutes a state generating request, e.g., LOCK, on the state server, it sends the same opera-
tion to every data server in the FILE_LOCATION attribute. In addition to reducing load on
the state server, this design also isolates all modifications to the NFS client.
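A sketch of the client-driven fan-out, modeling the state server and data servers as plain callables; the function is illustrative, not protocol text.

```python
def lock_with_client_fanout(state_server, data_servers, lock_op):
    """Sketch: the client first obtains the grant from the state server,
    then replays the same state-generating operation (e.g., LOCK) to
    every data server listed in FILE_LOCATION."""
    grant = state_server(lock_op)      # authoritative state is created here
    for ds in data_servers:
        ds(lock_op)                    # the client, not the server, distributes state
    return grant
```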
4.6. Related work
Several systems aggregate partitioned NFSv3 servers into a single file system image
[91, 113, 114]. These systems, discussed in Section 2.6, transform NFS into a type of
parallel file system, which increases scalability but eliminates the file system independ-
ence of NFS.
The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file
system, an archival system, or a database, into a single data catalogue. The HTTP proto-
col is the most common and widespread way to access remote data stores. SRB and
HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints
and do not integrate with the local file system.
The notion of serverless or peer-to-peer file systems was popularized by xFS [131],
which eliminates the single server bottleneck and provides data redundancy through net-
work disk striping. More recently, wide-area file systems such as LegionFS[132] and
Gfarm [133] provide a fully integrated and distributed environment and a secure means
of cross-domain access. Targeted for the grid, these systems use data replication to pro-
vide reasonable performance to globally distributed data. The major drawback of these
systems is their lack of interoperability with other file systems, each mandating itself as the only grid file system. Split-Server NFSv4 allows file-system-independent access to
remote data stores in the LAN or across the WAN.
GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional
throughput across high-speed, long haul networks, but is focused on large I/O transfers
and is restricted to GPFS storage systems.
GridFTP [4] is also used extensively in Grid computing to enable high I/O through-
put, operating system independence, and secure WAN access to high-performance file
systems. Successful and popular, GridFTP nevertheless has some serious limitations: it
copies data instead of providing shared access to a single copy, which complicates its
consistency model and decreases storage capacity; it lacks direct data access and a global namespace; and it runs as an application, so it cannot be accessed as a file system without operating system modification. Split-Server NFSv4 is not intended to replace GridFTP, but to
work alongside it. For example, in tiered projects such as ATLAS at CERN, GridFTP
remains a natural choice for long-haul scheduled transfers among the upper tiers, while
Split-Server NFSv4 offers advantages in the lower tiers by letting scientists work with
files directly, promoting effective data management.
4.7. Conclusion
This chapter introduces extensions to NFSv4 to utilize an unmodified parallel file sys-
tem’s available bandwidth while retaining NFSv4 features and semantics. Using a new
FILE_LOCATION attribute, Split-Server NFSv4 provides parallel and scalable access to ex-
isting parallel file systems. The I/O throughput of the prototype scales linearly with the
number of parallel file system nodes except when multiple clients write to a single file,
which experiences lower performance due to mtime synchronization in the underlying
parallel file system.
CHAPTER V
Flexible Remote Data Access
Parallel file systems teach us two important lessons. First, direct and parallel data ac-
cess techniques can fully utilize available hardware resources, and second, standard data
flow protocols such as iSCSI [22], OSD [136], and FCP [21] can increase interoperability
and reduce development and management costs. Unfortunately, standard protocols are
useful only if file systems use them, and most parallel file systems support only one pro-
tocol for their data channel. In addition, most parallel file systems use proprietary control
protocols.
Distributed file systems view storage through parallel file system data components,
intermediary nodes through which data must travel. This extra layer of processing pre-
vents distributed file systems from matching the performance of the exported file system,
even for a single client.
An architectural framework should be able to encompass all storage architectures,
i.e., symmetric or asymmetric; in-band or out-of-band; and block-, object-, or file-based,
without sacrificing performance. The NFSv4 file service, with its global namespace,
high level of interoperability and portability, simple and cost-effective management, and
integrated security provides an ideal base for such a framework.
This chapter analyzes a prototype implementation of pNFS [121, 122], an extension
of NFSv4 that provides file access scalability plus operating system, hardware platform,
and storage system independence. pNFS eliminates the performance bottlenecks of NFS
by enabling the NFSv4 client to access storage directly. pNFS facilitates interoperability
between standard protocols by providing a framework for the co-existence of NFSv4 and
other file access protocols. My prototype demonstrates and validates the potential of
pNFS. The I/O throughput of my prototype equals that of its exported file system
(PVFS2 [129]) and is dramatically better than standard NFSv4.
The remainder of this chapter is organized as follows. Section 5.1 describes the
pNFS architecture. Sections 5.2 and 5.3 present PVFS2 and my pNFS prototype. Sec-
tion 5.4 measures performance of my Linux-based prototype. Section 5.5 discusses addi-
tional pNFS design and implementation issues, including the impact of locking and secu-
rity support on the pNFS protocol. Section 5.6 summarizes and concludes this chapter.
5.1. pNFS architecture
To meet enterprise and grand challenge-scale performance and interoperability re-
quirements, a group of engineers—initially ad-hoc but now integrated into the IETF—is
designing extensions to NFSv4 that provide parallel access to storage systems. The result
is pNFS, which promises file access scalability as well as operating system and storage
system independence. pNFS separates the control and data flows of NFSv4, allowing
data to transfer in parallel from many clients to many storage endpoints. This removes
the single server bottleneck by distributing I/O across the bisectional bandwidth of the
storage network between the clients and storage devices.
Figure 5.1 shows how pNFS alters the original NFSv4-PFS data access architecture
(Figure 3.3) by integrating NFSv4 application and control components with the parallel
file system data component. The NFSv4 client (NFSv4 control component) continues to
send control operations to the NFSv4 server (NFSv4 metadata component), but shifts the responsibility for achieving scalable I/O throughput to a storage-specific driver (PFS data component).

Figure 5.1: pNFS data access architecture. NFSv4 control components access the NFSv4 metadata component and use a parallel file system (PFS) data component to perform I/O directly to storage.
Figure 5.2 depicts the architecture of pNFS, which adds a layout and I/O driver, and a
file layout retrieval interface to the standard NFSv4 architecture. pNFS clients send I/O
requests directly to storage and access the pNFS server for file metadata information.
A benefit of pNFS is its ability to match the performance of the underlying storage
system’s native client while continuing to support all standard NFSv4 features. This sup-
port is ensured by introducing pNFS extensions into a “minor version”, a standard exten-
sion mechanism of NFSv4. In addition, pNFS does not impose restrictions that might
limit the underlying file system’s ability to provide quality-enhancing features such as
usage statistics or storage management interfaces.
5.1.1. Layout and I/O driver
The layout driver understands the file layout and storage protocol of the storage sys-
tem. A layout consists of all information required to access any byte range of a file. For
example, a block layout may contain information about block size, offset of the first
block on each storage device, and an array of tuples that contains device identifiers, block
numbers, and block counts. An object layout specifies the storage devices for a file and
the information necessary to translate a logical byte sequence into a collection of objects.
A file layout is similar to an object layout but uses file handles instead of object identifi-
ers. The layout driver uses the layout to translate read and write requests from the pNFS
client into I/O requests understood by the storage devices. The I/O driver performs raw I/O, e.g., Myrinet GM [137], InfiniBand [138], or TCP/IP, to the storage nodes.

Figure 5.2: pNFS design. pNFS extends NFSv4 with the addition of a layout and I/O driver, and a file layout retrieval interface. The pNFS server obtains an opaque file layout map from the storage system and transfers it to the pNFS client and subsequently to its layout driver for direct and parallel data access.
The layout driver can be specialized or (preferably) implement a standard protocol
such as the Fibre Channel Protocol (FCP), allowing multiple file systems to use the same
layout driver. Storage systems adopting this architecture obviate a specialized file system client, reducing the development, management, and maintenance burden of high-end storage systems.
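To make the block-layout description above concrete, the following sketch shows how a layout driver might translate a file offset into a device and device offset. The struct fields and function names are hypothetical illustrations of the idea, not the actual pNFS block layout format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block-layout extent: maps a run of file blocks
 * onto a run of blocks on one storage device. */
struct extent {
    uint32_t device_id;
    uint64_t file_block;   /* first file block covered */
    uint64_t dev_block;    /* corresponding device block */
    uint64_t block_count;  /* length of the run in blocks */
};

struct block_layout {
    uint64_t block_size;
    size_t nextents;
    const struct extent *extents;
};

/* Translate a file byte offset into (device, device byte offset).
 * Returns 0 on success, -1 if the offset is not covered by the layout. */
static int layout_lookup(const struct block_layout *l, uint64_t off,
                         uint32_t *dev, uint64_t *dev_off)
{
    uint64_t fblock = off / l->block_size;
    for (size_t i = 0; i < l->nextents; i++) {
        const struct extent *e = &l->extents[i];
        if (fblock >= e->file_block &&
            fblock < e->file_block + e->block_count) {
            *dev = e->device_id;
            *dev_off = (e->dev_block + (fblock - e->file_block))
                           * l->block_size
                       + off % l->block_size;
            return 0;
        }
    }
    return -1;   /* client must request a layout for this range */
}
```

A real driver would issue the resulting (device, offset) pairs over its I/O driver; the point of the sketch is that the layout alone suffices to resolve any covered byte range without contacting the metadata server.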
5.1.2. NFSv4 protocol extensions
This section describes the NFSv4 protocol extensions to support pNFS.
File system attribute. A new file system attribute, LAYOUT_CLASSES, contains the
layout driver identifiers supported by the underlying file system. Upon encountering an
unknown file system identifier, a pNFS client retrieves this attribute and uses it to select an appropriate layout driver. To prevent namespace collisions, a global registry maintainer such as IANA [139] specifies layout driver identifiers.
LAYOUTGET operation. The LAYOUTGET operation obtains file access information
for a byte-range of a file from the underlying storage system. The client issues a
LAYOUTGET operation after it opens a file and before it accesses file data. Implemen-
tations determine the frequency and byte range of the request.
The arguments are:
• File handle
• Layout type
• Access type
• Offset
• Extent
• Minimum size
• Maximum count
The file handle uniquely identifies the file. The layout type identifies the preferred
layout type. The offset and extent arguments specify the requested region of the file.
The access type specifies whether the requested file layout information is for reading,
writing, or both. This is useful for file systems that, for example, provide read-only rep-
licas of data. The minimum size specifies the minimum overlap with the requested offset
and length. The maximum count specifies the maximum number of bytes for the result,
including XDR overhead.
LAYOUTGET returns the requested layout as an opaque object and its associated
offset and extent. By returning file layout information to the client as an opaque object,
pNFS is able to support arbitrary file layout types. At no time does the pNFS client attempt to interpret this object; it acts simply as a conduit between the storage system and
the layout driver. The byte range described by the returned layout may be larger than the
requested size due to block alignments, layout prefetching, etc.
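The expansion of the returned byte range can be illustrated with a small sketch: a server may round the requested (offset, extent) outward to block boundaries before returning a layout, so the returned range always covers at least the request. The function and struct names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t offset; uint64_t extent; };

/* Hypothetical illustration: expand a requested byte range outward
 * to block boundaries, as a server might before returning a layout. */
static struct range align_layout_range(uint64_t offset, uint64_t extent,
                                       uint64_t block_size)
{
    struct range r;
    uint64_t end = offset + extent;
    r.offset = offset - (offset % block_size);          /* round down */
    r.extent = ((end + block_size - 1) / block_size)    /* round up   */
                   * block_size - r.offset;
    return r;
}
```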
LAYOUTCOMMIT operation. The LAYOUTCOMMIT operation commits changes
to the layout information. The client uses this operation to commit or discard provision-
ally allocated space, update the end of file, and fill in existing holes in the layout.
LAYOUTRETURN operation. The LAYOUTRETURN operation informs the NFSv4
server that layout information obtained earlier is no longer required. A client may return
a layout voluntarily or upon receipt of a server recall request.
CB_LAYOUTRECALL operation. If layout information is exclusive to a specific client and other clients require conflicting access, the server can recall a layout from the client using the CB_LAYOUTRECALL callback operation. (NFSv4 already contains a callback infrastructure, used for delegation support.) The client should complete any in-flight I/O operations that use the recalled layout and either write any buffered dirty data directly to storage before returning the layout or write it later using normal NFSv4 write operations.
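The client's obligations on a recall can be sketched as follows. The state fields and function are hypothetical and only illustrate the protocol rule, not the Linux implementation:

```c
#include <assert.h>

/* Hypothetical client-side layout state. */
struct layout_state {
    int inflight_ios;   /* I/Os issued with the recalled layout */
    int dirty_pages;    /* buffered writes not yet on storage */
    int layout_held;
};

/* On CB_LAYOUTRECALL: drain in-flight I/O, then either flush dirty
 * data directly to storage before returning the layout, or keep it
 * and write it later through normal NFSv4 WRITE operations. */
static void handle_layout_recall(struct layout_state *s, int flush_direct)
{
    while (s->inflight_ios > 0)
        s->inflight_ios--;      /* stand-in for waiting on completions */
    if (flush_direct)
        s->dirty_pages = 0;     /* written directly to storage */
    s->layout_held = 0;         /* LAYOUTRETURN to the server */
}
```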
GETDEVINFO and GETDEVLIST operations. The GETDEVINFO and
GETDEVLIST operations retrieve additional information about one or more storage
nodes. The layout driver issues these operations if the device information inside the file
layout does not provide enough information for file access, e.g., SAN volume label in-
formation or port numbers.
5.2. Parallel virtual file system version 2
This section presents an overview of PVFS2, a user-level, open-source, scalable,
asymmetric parallel file system designed as a research tool and for production environ-
ments. Despite its lack of locking and security support, I chose PVFS2 because its user-level design provides a streamlined architecture for rapid prototyping of new ideas.
Figure 5.3 depicts the PVFS2 architecture.
PVFS2 consists of clients, storage nodes, and metadata servers. Metadata servers
store all information about the file system in a Berkeley DB database [140], distributing
metadata via a hash on the file name. File data is striped across storage nodes, which can
be increased in number as needed.
PVFS2 uses algorithmic file layouts for distributing data among the storage nodes.
The data distribution algorithm is user defined, defaulting to round-robin striping. The
clients and storage nodes share the data distribution algorithm, which does not change
during the lifetime of the file. A series of file handles, one for each storage node,
Figure 5.3: PVFS2 architecture. PVFS2 consists of clients, metadata servers, and storage nodes. The PVFS2 kernel module enables integration with the local file system. Data is striped across storage nodes using a user-defined algorithm.
uniquely identifies the set of file data stripes. Data is not committed with the metadata
server; instead, the client ensures that all data is committed to storage by negotiating with
each individual storage node.
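The default round-robin distribution described above can be sketched as a pure function from file offset to storage node. Because the mapping is algorithmic and shared by clients and storage nodes, no per-request layout lookup is needed. The function and parameter names are illustrative, not the PVFS2 source:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of round-robin striping: map a file byte
 * offset to (storage node index, offset within that node's
 * stripe file). */
static void rr_map(uint64_t off, uint64_t stripe_size, unsigned nnodes,
                   unsigned *node, uint64_t *node_off)
{
    uint64_t stripe = off / stripe_size;
    *node = (unsigned)(stripe % nnodes);
    *node_off = (stripe / nnodes) * stripe_size + off % stripe_size;
}
```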
An operating system specific kernel module provides for integration into user envi-
ronments and for access by other VFS file systems. Users are thus able to mount and ac-
cess PVFS2 through a POSIX interface. Currently, only a Linux implementation of this module exists. Data is memory mapped between the kernel module and the PVFS2 client
program to avoid extra data copies.
Efficient lock management with large numbers of clients is a hard problem. Large
parallel applications generally avoid using locks and manage data consistency through
organized and cooperative clients. PVFS2 shuns POSIX consistency semantics, which
require sequential consistency of file system operations, and replaces them with noncon-
flicting writes semantics, guaranteeing that writes to non-overlapping file regions are
visible on all subsequent reads once the write completes.
5.3. pNFS prototype
Prototypes of new protocols are essential for their clarification and provide insight
and evidence of their viability. A minimum requirement for the fitness of pNFS is its
ability to provide parallel access to arbitrary storage systems. This agnosticism toward
storage system particulars is vital for widespread adoption. As such, my prototype focuses on the retrieval and processing of the file layout to demonstrate that pNFS is agnostic of the underlying storage system and can match the performance of the storage system it exports. Figure 5.4 displays the architecture of my pNFS prototype with PVFS2 as the exported file system.

Figure 5.4: pNFS prototype architecture. The pNFS server obtains the opaque file layout from the PVFS2 metadata server via the PVFS2 client, transferring it back to the pNFS client and subsequently to the PVFS2 layout driver for direct and parallel data access.
5.3.1. PVFS2 layout
The PVFS2 file layout information consists of:
• File system id
• Set of file handles, one for each storage node
• Distribution id, uniquely defines layout algorithm
• Distribution parameters, e.g., stripe size
Since a PVFS2 layout applies to an entire file, no matter what byte range the pNFS
client requests using the LAYOUTGET operation, the returned byte range is the entire
file. Therefore, my prototype requests a layout once for each open file, incurring a single
additional round trip. If a pNFS client is eager with its requests, it can even eliminate this
single round trip time by including the LAYOUTGET in the same request as the OPEN
operation. The differences between these two designs are apparent in my evaluation.
The pNFS server obtains the layout from PVFS2 via a Linux VFS export operation.
5.3.2. Extensible “Pluggable” layout and I/O drivers
My prototype facilitates interoperability by providing a framework for the co-
existence of the NFSv4 control protocol with all storage protocols. As shown in Figure
5.5, layout drivers are pluggable, using a standard set of interfaces for all storage proto-
cols. An I/O interface, based on the Linux file_operations interface (the VFS interface that manages access to a file), facilitates managing layout information and performing I/O with storage. A policy interface informs the pNFS client of storage system specific policies, e.g., stripe and block size and layout retrieval timing. The policy interface also enables layout drivers to specify whether
they support NFSv4 data management services or use customized implementations. The
pNFS client can provide the following data management services: data cache, writeback
cache with write gathering, and readahead.
A layout driver registers with the pNFS client along with a unique identifier. The
pNFS client matches this identifier with the value of the LAYOUT_CLASSES attribute
to select the correct layout driver for file access. If there is no matching layout driver,
standard NFSv4 read and write mechanisms are used.
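Driver selection can be sketched as a lookup of the server's LAYOUT_CLASSES values against the client's registered drivers, falling back to plain NFSv4 I/O when nothing matches. The identifiers and names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define LAYOUT_NFSV4 0u   /* sentinel: use standard NFSv4 reads/writes */

/* Hypothetical registry of layout driver identifiers known to this
 * client (cf. the globally registered LAYOUT_CLASSES values). */
static const unsigned registered_drivers[] = { 3, 7 };

/* Return the first server-supported layout class for which a driver
 * is registered; otherwise fall back to standard NFSv4 I/O. */
static unsigned select_layout_driver(const unsigned *layout_classes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0;
             j < sizeof registered_drivers / sizeof registered_drivers[0];
             j++)
            if (layout_classes[i] == registered_drivers[j])
                return layout_classes[i];
    return LAYOUT_NFSV4;
}
```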
The PVFS2 layout driver supports three operations: read, write, and set_layout. To
inject the file layout map, the pNFS client passes the opaque layout as an argument to the
set_layout function. Once the layout driver has finished processing the layout, the pNFS
client is free to call the layout driver’s read and write functions. When data access is
complete, the pNFS client issues a standard NFSv4 close operation to the pNFS server.
The syntax for the PVFS2 layout driver I/O interface is:

    ssize_t read(struct file *file, char __user *buf, size_t count, loff_t *offset)
    int set_layout(struct inode *ino, struct file *file, unsigned int cmd, unsigned long arg)
5.4. Evaluation
In this section, I describe experiments that assess the performance of my pNFS proto-
type. They demonstrate that pNFS can use the standard layout driver interface to scale
with PVFS2, and can achieve performance vastly superior to NFSv4.
Figure 5.5: Linux pNFS prototype internal structure. pNFS clients use I/O and policy interfaces to access storage nodes and determine file system policies. The pNFS server uses VFS export operations to communicate with the underlying file system.
5.4.1. Experimental Setup
The experiments are performed on a network of forty identical nodes partitioned into
twenty-three clients, sixteen storage nodes, and one metadata server. Each node is a 2
GHz dual-processor Opteron with 2 GB of DDR RAM and four Western Digital Caviar
Serial ATA disks, which have a nominal data rate of 150 MB/s and an average seek time
of 8.9 ms. The disks are configured with software RAID 0. The operating system kernel
is Linux 2.6.9-rc3. The version of PVFS2 is 1.0.1.
I test four configurations: two that access PVFS2 storage nodes directly via pNFS and
PVFS2 clients and two with unmodified NFSv4 clients. One NFSv4 configuration ac-
cesses an Ext3 file system. The other accesses a PVFS2 file system with an NFSv4
server, exported PVFS2 client, and PVFS2 metadata server all residing on the metadata
server. The metadata server runs eight pNFS or NFSv4 server threads when exporting
the PVFS2 or Ext3 file systems. I verified that varying the number of pNFS or NFSv4
server threads does not affect performance.
I compare the aggregate I/O throughput using the IOZone [128] benchmark tool while
increasing numbers of clients. The first set of experiments has two processes on each cli-
ent reading and writing separate 200 MB files. In the second set of experiments, each
client reads and writes disjoint 100 MB portions of a single pre-existing file. Aggregate
I/O throughput is calculated when the last client completes its task. The value presented
is the average over several executions of the benchmark. The write time includes a flush
of the client’s cache to the server. All read experiments use warm storage node caches to
reduce disk access irregularities.
5.4.2. LAYOUTGET performance
If a layout does not apply to an entire file, a LAYOUTGET request would be required
on every read or write. In the test environment, the time for a LAYOUTGET request is
0.85 ms. On a 1 MB transfer, this reduces I/O throughput by only 3-4 percent; with a 10
MB transfer, the relative cost is less than 0.5 percent.
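The reported relative cost can be checked with a little arithmetic. Assuming a per-client transfer rate of roughly 40 MB/s (my assumption, chosen to be consistent with the reported 3-4 percent figure), a fixed 0.85 ms LAYOUTGET contributes the following fraction of total request time:

```c
#include <assert.h>

/* Fraction of total request time consumed by a fixed per-request
 * overhead, for a transfer of `mb` megabytes at `rate_mbps` MB/s. */
static double overhead_fraction(double mb, double rate_mbps,
                                double overhead_ms)
{
    double transfer_ms = mb / rate_mbps * 1000.0;
    return overhead_ms / (transfer_ms + overhead_ms);
}
```

With these assumptions, a 1 MB transfer takes about 25 ms, so the 0.85 ms overhead is roughly 3.3 percent; a 10 MB transfer takes about 250 ms, pushing the overhead below 0.5 percent, matching the figures above.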
5.4.3. I/O throughput performance
In all experiments, NFSv4 exporting PVFS2 achieves an aggregate read and write throughput of only 1.9 MB/s and 0.9 MB/s, respectively. I discuss
this in Section 5.4.4.
Figure 5.6a shows the write performance with each client writing to separate files. In
Figure 5.6b, all clients write to a single file. NFSv4 with Ext3 achieves an average ag-
gregate write throughput of 38 MB/s and 68 MB/s for the separate and single file experi-
ments. pNFS performance tracks PVFS2, reaching a maximum aggregate write through-
put of 384 MB/s with sixteen processes for separate files and 240 MB/s with seven clients writing to a single file. With separate files, the bottleneck is the number of storage nodes. Metadata processing limits the performance with a single file.

Figure 5.6: Aggregate pNFS write throughput. (a) Separate files; (b) single file. pNFS scales with PVFS2 while NFSv4 performance remains flat. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two write processes.

Figure 5.7: Aggregate pNFS read throughput. (a) Separate files; (b) single file. pNFS and PVFS2 scale linearly while NFSv4 performance remains flat. With a single file, pNFS performance is slightly below PVFS2 due to increasing layout retrieval congestion. pNFS-2, which removes the extra round trip time of LAYOUTGET, matches PVFS2 performance. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two read processes.
Figure 5.7a shows the read performance with two processes on each client reading
separate files. NFSv4 with Ext3 achieves its maximum network bandwidth of 115 MB/s.
pNFS again achieves the same performance as PVFS2. Initially, the extra overhead re-
quired to write to sixteen storage nodes reduces read throughput for two processes to 27
MB/s, but it scales almost linearly, reaching an aggregate read throughput of 550 MB/s
with 46 processes.
Figure 5.7b shows the read performance with each client reading from disjoint por-
tions of the same pre-existing file. NFSv4 with Ext3 again achieves its maximum net-
work bandwidth of 115 MB/s. PVFS2 scales linearly, starting with an aggregate read
throughput of 15 MB/s with a single client, increasing to 360 MB/s with twenty-three cli-
ents. The pNFS prototype, which incurs a single round trip time for the LAYOUTGET,
suffers slightly as the PVFS2 layout retrieval function takes longer with increasing num-
bers of clients, reaching an aggregate read throughput of 311 MB/s. A modified proto-
type combines the LAYOUTGET and OPEN operations into a single call. The prototype
labeled pNFS-2 excludes the LAYOUTGET operation from the measurements and
matches the performance of PVFS2.
5.4.4. Discussion
While these experiments offer convincing evidence that pNFS can match the per-
formance of the underlying file system, they also demonstrate that pNFS performance
can be adversely affected by a costly LAYOUTGET operation.
The poor performance of NFSv4 with PVFS2 stems from a difference in block sizes. Per-read and per-write processing overhead is small in NFSv4, which justifies a small block size (32 KB on Linux), but PVFS2 has a much larger per-read and per-write over-
head and therefore uses a block size of 4 MB. In addition, PVFS2 does not perform write
gathering on the client, assuming each data request to be a multiple of the block size. To
make matters worse, the Linux kernel breaks the NFSv4 client’s request on the NFSv4
server into 4 KB chunks before it issues the requests to the PVFS2 client. Data transfer
overhead, e.g., creating connections to the storage nodes, and determining stripe loca-
tions, dominates with 4 KB requests. The impact on performance is devastating.
Lack of a commit operation in the PVFS2 kernel module also reduces the write per-
formance of NFSv4 with PVFS2. To prevent data loss, PVFS2 commits every write op-
eration, ignoring the NFSv4 COMMIT operation. Write gathering [90] on the server
combined with a commit from the PVFS2 client would comply with NFSv4 fault toler-
ance semantics and improve the interaction of PVFS2 with the disk.
5.5. Additional pNFS design and implementation issues
5.5.1. Locking
NFSv4 supports mandatory locking, which requires an additional piece of shared state
between the NFSv4 client and server: a unique identifier of the locking process. An
NFSv4 client includes a locking identifier with every read and write operation.
How pNFS storage nodes support mandatory locks is not covered in the pNFS opera-
tions Internet Draft [123]. Several possibilities exist: enable the storage nodes to interpret
NFSv4 lock identifiers, bundle a new pNFS operation to retrieve file system specific lock
information with the NFSv4 LOCK operation, or include lock information in the existing
file layout.
5.5.2. Security considerations
Separating control and data paths in pNFS introduces new security concerns to
NFSv4. Although RPCSEC_GSS continues to secure the NFSv4 control path, securing
the data path requires additional care. The current pNFS operations Internet Draft de-
scribes the general mechanisms that will be required, but stops short of defining the new security architecture.
A file-based layout driver uses the RPCSEC_GSS security mechanism between the
client and storage nodes.
Object storage uses revocable cryptographic capabilities for file system objects that
the metadata server passes to clients. For data access, the layout driver requires the cor-
rect capability to access the storage nodes. It is expected that the capability will be
passed to the layout driver within the opaque layout object.
Block storage access protocols rely on SAN-based security, which is perhaps a mis-
nomer, as clients are implicitly trusted to access only their allotted blocks. LUN mask-
ing/unmapping and zone-based security schemes can fence clients to specific data blocks.
Some systems employ IPsec to secure the data stream. Placing more trust in the client for SAN file systems is a step backwards from the NFSv4 trust model.
5.6. Related work
Several systems aggregate partitioned NFSv3 servers into a single file system image
[91, 113, 114]. These systems, discussed in Section 2.6, transform NFS into a type of
parallel file system, which increases scalability but eliminates the file system independ-
ence of NFS.
The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file
system, an archival system, or a database, into a single data catalogue. The HTTP proto-
col is the most common and widespread way to access remote data stores. SRB and
HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints
and do not integrate with the local file system.
EMC HighRoad [103] uses the NFS or CIFS protocol for its control operations and
stores data in an aggregated LAN and SAN environment. Its use of file semantics facili-
tates data sharing in SAN environments, but is limited to the EMC Symmetrix storage
system. A similar, non-commercial version is also available [141].
Several pNFS layout drivers are under development. At this writing, Sun Microsystems, Inc. is developing file- and object-based layout implementations, and Panasas object and EMC block drivers are also in progress.
GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional
throughput across high-speed, long haul networks, but is focused on large I/O transfers
and is restricted to GPFS storage systems.
GridFTP [4] is also used extensively in Grid computing to enable high I/O through-
put, operating system independence, and secure WAN access to high-performance file
systems. Successful and popular, GridFTP nevertheless has some serious limitations: it
copies data instead of providing shared access to a single copy, which complicates its
consistency model and decreases storage capacity; it lacks direct data access and a global
namespace; and it runs as an application, so it cannot be accessed as a file system without operating system modification.
Distributed replicas can be vital in reducing network latency when accessing data.
pNFS is not intended to replace GridFTP, but to work alongside it. For example, in
tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-haul
scheduled transfers among the upper tiers, while the file system semantics of pNFS offers
advantages in the lower tiers by letting scientists work with files directly, promoting ef-
fective data management.
5.7. Conclusion
This chapter analyzes an early implementation of pNFS, an NFSv4 extension that
uses the storage protocol of the underlying file system to bypass the server bottleneck and
enable direct and parallel storage access. The prototype validates the viability of the
pNFS protocol by demonstrating that it is possible to achieve high throughput access to a
high-performance file system while retaining the benefits of NFSv4. Experiments dem-
onstrate that the aggregate throughput with the prototype equals that of its exported file
system and far exceeds NFSv4 performance.
CHAPTER VI
Large Files, Small Writes, and pNFS
Parallel file systems improve the aggregate throughput of bulk data transfers by scal-
ing disks, disk controllers, network, and servers—every aspect of the system architecture.
As system size increases, the cost of locating, managing, and protecting data increases the
per-request overhead. This overhead is small relative to the overall cost of large data
transfers, but considerable for smaller data requests. Many parallel file systems ignore
this high penalty for small I/O, focusing entirely on large data transfers.
Unfortunately, not all data comes in big packages. Numerous workload characteriza-
tion studies have highlighted the prevalence of small and sequential data requests in
modern scientific applications [72-78]. This trend will likely continue since many HPC
applications take years to develop, have a productive lifespan of ten years or more, and
are not easily re-architected for the latest file access paradigm [12]. Furthermore, many
current data access libraries such as HDF5 and NetCDF rely heavily on small data ac-
cesses to store individual data elements in a common (large) file [142, 143].
This chapter investigates the performance of parallel file systems with small writes.
Distributed file systems are optimized for small data accesses [2, 36]; not surprisingly,
studies demonstrate that small I/O is their middleware niche [63]. I demonstrate that dis-
tributed file systems can increase write throughput to parallel data stores—regardless of
file size—by overcoming small write inefficiencies in parallel file systems. By using di-
rect, parallel I/O for large write requests and a distributed file system for small write re-
quests, pNFS improves the overall write performance of parallel file systems. The pNFS
heterogeneous metadata protocol allows these improvements in write performance with
any parallel file system.
The remainder of this chapter is organized as follows. Section 6.1 explores the issues
that arise when writing small amounts of data in scientific applications. Section 6.2 de-
scribes how pNFS can improve the performance of these applications. Section 6.3 re-
ports the results of experiments with synthetic benchmarks and a real scientific applica-
tion. Section 6.5 summarizes and concludes the chapter.
6.1. Small I/O requests
Several scientific workload characterization studies demonstrate the need to improve
performance of small I/O requests to small and large files.
The CHARISMA study [72-74] finds that file sizes in scientific workloads are much
larger than those typically found in UNIX workstation environments and that most scien-
tific applications access only a few files. Approximately 90% of file accesses are
small—less than 4 KB—and represent a considerable portion of application execution
time, even though approximately 90% of the data is transferred in large accesses. In ad-
dition, most files are read-only or write-only and are accessed sequentially, but read-write
files are accessed primarily non-sequentially.
The Scalable I/O study [75-77] had similar findings, but remarked that most requests
are small writes into gigabyte-sized files, consuming, for example, 98% of the execution
time of one application that was studied. Furthermore, it is common for a single node to
handle the majority of reads and writes, gathering the data from, or broadcasting the data to, the other nodes as necessary. This indicates that single-node performance still requires
attention from parallel file systems. The study also notes that a lack of portability pre-
vents applications from using enhanced parallel file system interfaces.
A more recent study in 2004 of two physics applications [78] amplifies the earlier
findings. This study found that I/O is bursty, most requests consist of small data trans-
fers, and most data is transferred in a few large requests. It is common for a master node
to collect results from other nodes and write them to storage using many small requests.
Each client reads back the data in large chunks. In addition, use of a single file is still
common and accessing that file—even with modern parallel file systems—is slower than
accessing separate files by a factor of five.
NetCDF (Network Common Data Form) provides a portable and efficient mechanism
for sharing data between scientists and applications [142]. It is the predominant file for-
mat standard within many scientific communities [144]. NetCDF defines a file format
and an API for the storage and retrieval of a file’s contents. NetCDF stores data in a sin-
gle array-oriented file, which contains dimensions, variables, and attributes. Applications
individually define and write thousands of data elements, creating many sequential and
small write requests.
HDF5 is another popular portable file format and programming interface for storing
scientific data in a single file. It provides a rich data model, with emphasis on efficiency
of access, parallel I/O, and support for high-performance computing, but continues to de-
fine and store each data element separately, creating many small write requests.
This chapter demonstrates how pNFS can improve small write performance with par-
allel file systems for small and large files, regardless of whether an application or file
format library generates the write requests.
6.2. Small writes and pNFS
pNFS improves file access scalability by providing the NFSv4 client with support for
direct storage access. I now turn to an investigation of the relative costs of the direct I/O
path and the NFSv4 path.
6.2.1. File system I/O features
A single large I/O request can saturate a client’s network endpoint. Engineering a
parallel file system for large requests entails using large transfer buffers, issuing synchronous data requests, deploying many storage nodes, and employing a write-through cache or no cache at all.
NFS implementations have several features that are sometimes an advantage over the
direct write path:
• Asynchronous client requests. Many parallel file systems incur a per-request
overhead, which adds up for small requests. Directing requests to the NFSv4
server allows the server to absorb this overhead without delaying the client appli-
cation or consuming client CPU cycles. In addition, asynchrony allows request
pipelining on the NFSv4 server, reducing aggregate latency to the storage nodes.
• One server per request. Data written to a byte-range that spans multiple storage
nodes (e.g., multiple stripes) requires two separate requests, further increasing the
per-request overhead. The NFSv4 single server design can reduce client request

Table 6.1: Postmark write throughput with 1 KB block size. NFSv4 outperforms direct, parallel I/O for small writes.

a 3Com 3C996B-T Gigabit Ethernet card. PVFS2 has six storage nodes and one metadata server.
Table 6.1 shows the Postmark results for Ext3, NFSv4, and pNFS. Ext3 outperforms
remote clients, achieving a write throughput of 5.02 MB/s. NFSv4 achieves a write
throughput of 4.03 MB/s. pNFS exporting the PVFS2 parallel file system achieves a
write throughput of only 0.65 MB/s due to its inability to parallelize requests effectively
and its use of a write-through cache. By using the features discussed in Section 6.2.1,
NFSv4 exporting the same PVFS2 file system raises the write throughput to 1.79 MB/s. This
demonstrates that the parallel, direct I/O path is not always the best choice and the indi-
rect path is not always the worst choice.
6.2.3. pNFS write threshold
To enable the indirect I/O path for small writes, I modified the pNFS client prototype
to allow it to choose between the NFSv4 storage protocol and the storage protocol of the
underlying file system. To switch between them, I added a write threshold to the layout
driver. Write requests smaller than the threshold follow the NFSv4 data path. Write re-
quests larger than the threshold follow the layout driver data path. Figure 6.1 shows how
the write threshold alters the pNFS data access architecture. Clients use a PFS data com-
ponent for large write requests and an NFSv4 data component for small write requests.
An additional PFS data component on the metadata server funnels small write requests to storage.
Figure 6.1: pNFS small write data access architecture. Clients use a PFS data component for large write requests and an NFSv4 data component for small write requests.
Figure 6.2 illustrates the implementation of the write threshold in both the general pNFS architecture and in the prototype.
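In the prototype, the choice between the two data paths reduces to a size comparison in the layout driver. A minimal sketch of that dispatch logic (the names are illustrative, not the actual Linux prototype symbols):

```python
# Sketch of the pNFS client's data-path selection with a write
# threshold. NFSV4_PATH is the indirect path through the NFSv4
# server; LAYOUT_PATH is the layout driver's direct storage path.

NFSV4_PATH = "nfsv4"
LAYOUT_PATH = "layout"

def choose_data_path(write_size: int, threshold: int) -> str:
    """Writes smaller than the threshold follow the NFSv4 data path;
    all other writes follow the layout driver data path."""
    return NFSV4_PATH if write_size < threshold else LAYOUT_PATH
```

With a 32 KB threshold, for example, a 1 KB record is funneled through the NFSv4 server while a 1 MB request goes directly to the storage nodes.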
pNFS features a heterogeneous metadata protocol that enables it to benefit from the
strengths of disparate storage protocols. A write threshold improves overall write per-
formance for pNFS by hitting the sweet spot of both the NFSv4 and underlying file sys-
tem storage protocols.
Just as any improvement to NFSv4 improves access to the file system it exports, these
improvements to pNFS are portable and benefit all parallel file systems equally by allow-
ing pNFS (and its exported parallel file systems) to concentrate on large data require-
ments, while NFSv4 efficiently processes small I/O.
6.2.4. Setting the write threshold
The advantage of a write threshold is that applications that mix small and large write
requests get the better performing I/O path automatically.
The optimal write threshold value depends on several factors, including
server capacity, network performance and capability, system load, and features specific to
the distributed and parallel file system. One way to choose a good threshold value is to
compare execution times for distributed and parallel file systems with various write sizes
and see where the performance indicators cross.
Figure 6.2: pNFS write threshold. (a) pNFS data paths: pNFS utilizes NFSv4 I/O along the small write path when the write request size is less than the write threshold. (b) pNFS prototype with write threshold: pNFS retrieves the write threshold from the PVFS2 layout driver to determine the correct data path for a write request.
Figure 6.3 abstracts write request execution time with increasing request size for a
parallel file system and for an idle and busy distributed file system. When the distributed
file system is lightly loaded, the transfer size at which the parallel file system outper-
forms the distributed file system, labeled B, is the optimal write threshold. When the dis-
tributed file system is heavily loaded, each request takes longer to complete, so the slope
increases and intersects the parallel file system at the smaller threshold size, labeled A.
(If the distributed file system is thoroughly overloaded, the threshold value tends to zero,
i.e., an overloaded distributed file system is never a better choice.)
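The comparison described above can be automated: measure per-request execution time for both paths across a range of request sizes and take the first size at which the parallel file system is at least as fast. A minimal sketch with invented timings (the function name and numbers are illustrative):

```python
def crossover_threshold(sizes, dfs_times, pfs_times):
    """Return the smallest request size (bytes) at which the parallel
    file system completes a write at least as fast as the distributed
    file system; `sizes` must be in ascending order."""
    for size, dfs_t, pfs_t in zip(sizes, dfs_times, pfs_times):
        if pfs_t <= dfs_t:
            return size
    return None  # the distributed file system wins at every measured size

# Invented per-request times (ms): the distributed file system has a
# low fixed cost per request; the parallel file system a higher one.
sizes = [4 * 1024, 16 * 1024, 64 * 1024, 128 * 1024]
dfs = [0.5, 1.0, 3.0, 7.0]
pfs = [2.0, 2.2, 2.8, 4.0]
threshold = crossover_threshold(sizes, dfs, pfs)  # 64 KB in this example
```

Under load, the distributed file system's times grow, the crossover moves to a smaller size, and the function returns a smaller threshold, matching the shift from B to A in Figure 6.3.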
The workload characterization studies mentioned in Section 6.1 state that scientific
applications usually have a large gap between small and large write request sizes, with
very few requests in the middle. Experiments reveal that small requests are smaller than
the “busy” write threshold value, shown as A in Figure 6.3, and the large requests are lar-
ger than the “idle” write threshold values, shown as B. Applications should reap large
gains for any write threshold value between A and B.
For example, the ATLAS digitization application (Section 6.3.3) achieves the same
performance for any write threshold between 32 KB and 274 KB. In addition, 87 percent
of the write requests are smaller than 4 KB, which suggests that the threshold could be
even smaller without hurting performance.
The write threshold can be set at any time, including compile time, when a module
loads, and run time. For example, system administrators can determine the write thresh-
old as part of a file system and network installation and optimization. A natural value for
the write threshold is the write gather size of the distributed file system.
Figure 6.3: Determining the write threshold value. Write execution time increases with larger request sizes. Application write requests are either small or large, with few requests in the middle. The write threshold can be any value in this middle region.
6.3. Evaluation
In this section, I evaluate the performance of the write threshold heuristic in my pNFS
prototype.
6.3.1. Experimental setup
IOR and random write IOZone experiments use a pair of sixteen node clusters con-
nected with Myrinet. One cluster consists of dual 1.1 GHz processor PIII Xeon nodes.
The other consists of dual 1 GHz processor PIII Xeon nodes. Each node has 1 GB of
memory. The PVFS2 1.1.0 file system has eight storage nodes and one metadata server.
Each storage node has an Ultra160 SCSI disk controller and one Seagate Cheetah 18 GB,
10,033 RPM drive, with an average seek time of 5.2 ms. The NFSv4 server, PVFS2 cli-
ent, and PVFS2 metadata server are installed on a single node. All nodes run Linux
2.6.12-rc4.
ATLAS experiments use an eight node cluster of 1.7 GHz dual P4 processors, 2 GB
of memory, a Seagate 80 GB 7200 RPM hard drive with an Ultra ATA/100 interface and
a 2 MB cache, and a 3Com 3C996B-T Gigabit Ethernet card. The PVFS2 1.1.0 file sys-
tem has six storage nodes and one metadata server. The NFSv4 server, PVFS2 client,
and PVFS2 metadata server are installed on a single node. All nodes run Linux 2.6.12-
rc4.
6.3.2. IOR and IOZone benchmarks
6.3.2.1 Experimental design
The first experiment consists of a single client issuing one thousand sequential write
requests to a file, using the IOR benchmark [146]. A test completes when data is com-
mitted to disk. I repeat this experiment with ten clients writing to disjoint portions of a
single file. The second experiment consists of a single client randomly writing a 32 MB
file using IOZone [128].
For each experiment, I first compare the aggregate write throughput of pNFS and
NFSv4 with a range of individual request sizes. I then set the write threshold to be the
request size at which pNFS and NFSv4 have the same performance, and re-execute the
benchmark.
6.3.2.2 Experimental evaluation
Our first experiment, shown in Figure 6.4a, examines single client performance. The
performance of NFSv4 writing to PVFS2 or Ext3 is comparable because the NFSv4 32
KB write size is less than the PVFS2 64 KB stripe size, which isolates writes to a single
disk.
With a single pNFS client, writing through the NFSv4 server to PVFS2 is superior to
writing directly to PVFS2 until the request size reaches 64 KB. For 16-byte writes,
NFSv4 has sixty-seven times the throughput, with the ratio decreasing to one at 64 KB.
Write performance through the NFSv4 server reaches its peak at 32 KB, the NFSv4 client
request size. At 64 KB, direct storage access begins to outperform indirect access. pNFS
with a write threshold of 32 KB offers the performance benefits of both storage protocols
by using NFSv4 I/O until 32 KB, then switching to direct storage access with the PVFS2
storage protocol.
Figure 6.4b shows the results of ten nodes writing to disjoint segments of the same
file. Ext3 performance is limited by random requests from the NFSv4 server daemons.
Figure 6.4: Write throughput with threshold. (a) Write throughput of a single client issuing consecutive small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 64 KB. pNFS with a 32 KB write threshold achieves the best overall performance. (b) Aggregate write throughput of ten clients issuing consecutive small write requests to a single file. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 8 KB. pNFS with a 4 KB write threshold achieves the best overall performance. (c) Write throughput of a single client issuing random small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 128 KB. pNFS with a 64 KB write threshold achieves the best overall performance. Data points are a power of two; lines are for readability.
Using NFSv4 I/O to access PVFS2 does not incur as many random accesses since the
writes are spread over eight disks.
PVFS2 throughput grows approximately linearly as the impact of the request over-
head diminishes. The aggregate performance of NFSv4 is the same as with a single cli-
ent, with the write performance crossover point between pNFS and NFSv4 occurring at 4
KB. With 16-byte writes, NFSv4 has twenty times the bandwidth, with the ratio decreas-
ing to one at just below 8 KB. The maximum bandwidth difference of 9 MB/s occurs at
1 KB. At 8 KB, direct storage access begins to outperform indirect access. pNFS with a
write threshold of 4 KB offers the performance benefits of both storage protocols.
Figure 6.4c shows the performance of writing to a 32 MB file in a random manner
with increasing request sizes. NFSv4 outperforms pNFS until the individual write size
reaches 128 KB, with a maximum difference of 13 MB/s occurring at 16 KB. pNFS us-
ing a write threshold of 64 KB experiences the performance benefits of both storage pro-
tocols.
6.3.3. ATLAS applications
Not every application behaves like the ones studied in Section 6.1. For example,
large writes dominate the FLASH I/O benchmark workload [147], with 99.7 percent of
requests greater than 163 KB (with default input parameters). However, beyond the
workload characterization studies, there is increasing anecdotal evidence to suggest that
small writes are quite common.
To assess the impact of the small write heuristic I use the ATLAS simulator, which
does make many small writes. ATLAS [148] is a particle physics experiment that seeks
new discoveries in head-on collisions of high-energy protons using the Large Hadron
Collider accelerator [149]. Scheduled for completion in 2007, ATLAS will generate over
a petabyte of data each year to be distributed for analysis to a multi-tiered collection of
decentralized sites.
Currently, ATLAS physicists are performing large-scale simulations of the events
that will occur within its detector. These simulation efforts influence detector design and
the development of real-time event filtering algorithms for reducing the volume of data.
The ATLAS detector can detect one billion events with a combined data volume of forty
terabytes each second. After filtering, data from fewer than one hundred events per sec-
ond are stored for offline analysis.
The ATLAS simulation event data model consists of four stages. The Event Gen-
eration stage produces pseudo-random events drawn from a statistical distribution of
previous experiments. The Simulation stage then simulates the passage of particles
(events) through the detectors. The Digitization stage combines hit information
with estimates of internal noise, subjecting the hits to a parameterization of the known
response of the detectors to produce simulated digital output (digits). The Recon-
struction stage performs pattern recognition and track reconstruction algorithms on
the digits, converting raw digital data into meaningful physics quantities.
6.3.3.1 Experimental design
Experiments focus on the Digitization stage, the only stage that generates a
large amount of data. With 500 events, Digitization writes approximately 650 MB
of output data to a single file. Data are written randomly, with write request size distribu-
tions shown in Figure 6.5. Figure 6.5a shows that only 4 percent of write request sizes
are 275 KB or greater, with the rest below 32 KB. Figure 6.5b shows that 96 percent of
write requests are only responsible for 5 percent of the data, while 95 percent of the data
are written in requests whose size is greater than 275 KB. This distribution of write re-
quest size and total amount of data output closely matches the workload characterization
studies discussed in Section 6.1.
Figure 6.5: ATLAS digitization write request size distribution with 500 events. (a) Breakdown of total number of requests. (b) Breakdown of total amount of data output.
Analysis of the Digitization write request distribution with varying numbers of events indicates that the distribution in Figure 6.5 is a representative sample.
Analysis of the Digitization trace data reveals a large number of fsync system
calls. For example, executing Digitization with 50 events produces more than 900
synchronous fsync calls. Synchronously committing data to storage reduces request par-
allelism and the effectiveness of write gathering.
ATLAS developers explain that the overwhelming use of fsync is an implementation
issue rather than an application necessity [150]. Therefore, to evaluate Digitization
write throughput I used IOZone to replay the write trace data while omitting fsync calls
for 50 and 500 events.
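The replay just described amounts to filtering the fsync entries out of the trace before issuing the writes. A sketch of such a replay loop, assuming an invented trace format of ("write", offset, data) and ("fsync",) tuples (this is not the actual IOZone replay mechanism):

```python
import os

def replay_trace(trace, file_obj, skip_fsync=True):
    """Replay a write trace against an open file. With skip_fsync=True,
    the fsync calls that Digitization issues are omitted, restoring
    request parallelism and write gathering. Returns the number of
    writes issued."""
    writes = 0
    for entry in trace:
        if entry[0] == "write":
            _, offset, data = entry
            file_obj.seek(offset)
            file_obj.write(data)
            writes += 1
        elif entry[0] == "fsync" and not skip_fsync:
            file_obj.flush()
            os.fsync(file_obj.fileno())  # synchronous commit to storage
    return writes
```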
6.3.3.2 Experimental evaluation
To evaluate pNFS with the ATLAS simulator, I analyze the Digitization write
throughput with several write threshold values.
First, I use the IOZone benchmark to determine the maximum PVFS2 write through-
put. The maximum write throughput for a single-threaded application and an entire client
is 18 MB/s and 54 MB/s respectively. The single threaded application maximum per-
formance value sets the upper limit for ATLAS write throughput. Increasing the number
of threads simultaneously writing to storage increases the maximum write throughput
three-fold. Since ATLAS Digitization is a single threaded application generating
output for serialized events, it cannot directly take advantage of this extra performance.
As shown in Figure 6.6, pNFS achieves a write throughput of 11.3 MB/s and 11.9
MB/s with 50 and 500 events respectively. The small write requests reduce the applica-
tion’s peak write throughput by approximately 6 MB/s.
With a write threshold of 1 KB, 49 percent of requests are re-directed to the NFSv4
server, increasing performance by 23 percent. With a write threshold of 32 KB, 96 per-
cent of write requests use the NFSv4 I/O path. With 50 events, the increase in write per-
formance is 57 percent, for a write throughput of 17.8 MB/s. With 500 events, the in-
crease in write performance is 100 percent, for a write throughput of 23.8 MB/s.
It is interesting to note that 32 KB write threshold performance exceeds the single-
threaded application maximum write throughput. The NFSv4 server is multi-threaded, so
it can process multiple simultaneous write requests and outperform a single-threaded ap-
plication. This is another benefit of the increased parallelism available in distributed file
systems.
When pNFS funnels all Digitization output through the NFSv4 server, perform-
ance drops dramatically, but is still slightly better than the performance of pNFS with di-
rect I/O. In this experiment, the improved write performance of the smaller requests
overshadows the reduced performance of sending large write requests through the NFSv4
server.
The 50 and 500 event experiments have slightly different write request size and offset
distributions. In addition, the 500 event simulation has ten times the number of write re-
quests. The difference between the pNFS write threshold performance improvements in
the 50 and 500 event experiments seems to be due to a difference in behavior of the
NFSv4 writeback cache with these different write workloads.
6.3.4. Discussion
Experiments show that writing to the direct data path is not always the best choice.
Write request size plays an important role in determining the preferred data path.
Figure 6.6: ATLAS digitization write throughput for 50 and 500 events. pNFS with a 32 KB write threshold achieves the best overall performance by directing small requests through the NFSv4 server and the 275 KB and 1 MB requests to the PVFS2 storage nodes.
The Linux NFSv4 client gathers small writes into 32 KB requests. With very small
requests, the overhead of gathering requests diminishes the benefit. As the size of each
write request grows, the increase in throughput is considerable.
Performing an increased number of parallel asynchronous write requests also im-
proves performance. This is seen in both Figure 6.4a and Figure 6.4c, as the performance
of writing 32 KB requests exceeds that of writing directly to storage.
The Linux NFSv4 server does not perform write gathering. Our experiments clearly
show the benefit of increasing the write request size. The ability for the NFSv4 server to
combine small requests from multiple clients into a single large request should lead to
further advantages.
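Server-side write gathering of the kind anticipated here reduces to coalescing adjacent or overlapping byte ranges before issuing them to storage. A simplified sketch (not the Linux NFSv4 server implementation):

```python
def gather_writes(requests):
    """Coalesce adjacent or overlapping (offset, length) write requests
    into the fewest contiguous ranges, so small requests from one or
    more clients can be issued to storage as larger requests.
    Input need not be sorted."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # extends or overlaps the previous range: grow it
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged
```

For example, two adjacent 4 KB writes become one 8 KB request, halving the per-request overhead paid at the storage nodes.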
6.4. Related work
Log-structured file systems [151] increase the size of writes by appending small I/O
requests to a log and later flushing the log to disk. Zebra [152] extends this to distributed
environments. Side effects include large data layouts and erratic block sizes.
The Vesta parallel file system [153] improves I/O performance by using workload
characteristics provided by applications to optimize data layout on storage. Providing
this information can be difficult for applications that lack regular I/O patterns or whose
I/O access patterns change over time.
The Slice file system prototype [116] divides NFS requests into three classes: large
I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers,
routes NFS client requests between storage, small-file servers, and directory servers, re-
spectively. Large I/O flows directly to storage while small-file servers aggregate I/O op-
erations of small files and the initial segments of large files. This method benefits small
file performance, but ignores small I/O to large files.
Both the EMC HighRoad file system [103] and the RAID-II network file server [95]
transfer small files over a low-bandwidth network and use a high-bandwidth network for
large file requests, but differentiating small and large files does not help with small re-
quests to large files. This re-direction benefits only large requests, and may reduce the
performance of small requests.
GPFS [97] forwards data between I/O nodes for requests smaller than the block size.
This reduces the number of messages with the lock manager and possibly reduces the
number of read-modify-write sequences.
Both the Lustre [104] and the Panasas ActiveScale [105] file systems use a write-
behind cache to perform buffered writes. In addition, Lustre allows clients to place small
files on a single storage node to reduce access overhead.
Implementations of MPI-IO such as ROMIO [82] use application hints and file access
patterns to improve I/O request performance. The work reported here benefits and com-
plements MPI-IO and its implementations. MPI-IO is useful to applications that use its
API and have regular I/O access patterns, e.g., strided I/O, but MPI-IO small write per-
formance is limited by the deficiencies of the underlying parallel file system. Our pNFS
enhancements are beneficial for existing and unmodified applications. They are also
beneficial at the file system layer of MPI-IO implementations, to improve the perform-
ance of the underlying parallel file system.
6.5. Conclusion
Diverse file access patterns and computing environments in the high-performance
community make pNFS an indispensable tool for scalable data access. This chapter
demonstrates that pNFS can increase write throughput to parallel data stores—regardless
of file size—by overcoming the inefficient performance of parallel file systems when
write request sizes are small. pNFS improves the overall write performance of parallel
file systems by using direct, parallel I/O for large write requests and a distributed file sys-
tem for small write requests. Evaluation results using a real scientific application and
several benchmark programs demonstrate the benefits of this design. The pNFS hetero-
geneous metadata protocol allows any parallel file system to realize these write perform-
ance improvements.
CHAPTER VII
Direct Data Access with a Commodity Storage Protocol
Parallel file systems feature impressive throughput, but they are highly specialized, have limited operating system and hardware platform support and poor cross-site performance, and often lack strong security mechanisms. In addition, while parallel file systems
excel at large data transfers, many do so at the expense of small I/O performance. While
large data transfers dominate many scientific applications, numerous workload charac-
terization studies have highlighted the prevalence of small, sequential data requests in
modern scientific applications [74, 77, 78].
Many application domains demonstrate the need for high bandwidth, concurrent, and
secure access to large datasets across a variety of platforms and file systems. Scientific
computing connects large computational and data facilities across the globe and can gen-
erate petabytes of data. Digital movie studios that generate terabytes of data every day
require access from Sun, Windows, SGI, and Linux workstations, and compute clusters
[11]. This need for heterogeneous data access creates a conflict between parallel file sys-
tems and application platforms. Distributed file systems such as NFS [38] and CIFS [2]
bridge the interoperability gap, but they are unable to deliver the superior performance of
a high-end storage system.
pNFS overcomes these enterprise- and grand challenge-scale obstacles by enabling
direct access to storage from clients while preserving operating system, hardware plat-
form, and parallel file system independence. pNFS provides file access scalability by
using the storage protocol of the underlying parallel file system to distribute I/O across
the bisectional bandwidth of the storage network between clients and storage devices,
removing the single server bottleneck that is so vexing to client/server-based systems. In
combination, the elimination of the single server bottleneck and the ability for direct ac-
cess to storage by clients yields superior file access performance and scalability.
Regrettably, pNFS does not retain NFSv4 file system access transparency, and can
therefore not shield applications from different parallel file system security protocols and
metadata and data consistency semantics. In addition, implementing pNFS support for
every storage protocol on every operating system and hardware platform is a colossal un-
dertaking. File systems that support standard storage protocols may be able to share de-
velopment costs, but full support for a particular protocol is often unrealized, hampering
interoperability. The pNFS file-based layout access protocol helps bridge this gap in
transparency with middle-tier data servers, but eliminates direct data access, which can
hurt performance.
This chapter introduces Direct-pNFS, a novel augmentation to pNFS that increases
portability and regains parallel file system access transparency while continuing to match
the performance of native parallel file system clients. Architecturally, Direct-pNFS uses
a standard distributed file system protocol for direct access to a parallel file system’s
storage nodes, bridging the gap between performance and transparency. Direct-pNFS
leverages the strengths of NFSv4 to improve I/O performance over the entire range of I/O
workloads. I know of no other distributed file system that offers this level of perform-
ance, scalability, file system access transparency, and file system independence.
Direct-pNFS makes the following contributions:
Heterogeneous and ubiquitous remote file system access. Direct-pNFS benefits are
available with a conventional pNFS client: Direct-pNFS uses the pNFS file-based layout
type, and does not require file system specific layout drivers, e.g., object [154] or PVFS2
[155].
Remote file system access transparency and independence. pNFS uses file system
specific storage protocols that can expose gaps in the underlying file system semantics
(such as security support). Direct-pNFS, on the other hand, retains NFSv4 file system
access transparency by using the NFSv4 storage protocol for data access. In addition,
Direct-pNFS remains independent of the underlying file system and does not interpret file
system-specific information.
I/O workload versatility. While distributed file systems are usually engineered to per-
form well on small data accesses [63], parallel file systems target scientific workloads
dominated by large data transfers. Direct-pNFS combines the strengths of both, provid-
ing versatile data access to manage efficiently a diversity of workloads.
Scalability and throughput. Direct-pNFS can match the I/O throughput and scalability
of the exported parallel file system without requiring the client to support any protocol
other than NFSv4. This chapter uses numerous benchmark programs to demonstrate that
Direct-pNFS matches the I/O throughput of a parallel file system and has superior per-
formance in workloads that contain many small I/O requests.
A case for commodity high-performance remote data access. Direct-pNFS complies
with emerging IETF standards and can use an unmodified pNFS client. This chapter
makes a case for open systems in the design of high-performance clients, demonstrating
that standards-compliant commodity software can deliver the performance of a custom
made parallel file system client. Using standard clients to access specialized storage sys-
tems offers ubiquitous data access and reduces development and support costs without
cramping storage system optimization.
The remainder of this chapter is organized as follows. Section 7.1 makes the case for
open systems in distributed data access. Section 7.2 reviews pNFS and its departure from
traditional client/server distributed file systems. Sections 7.3 and 7.4 describe the Direct-
pNFS architecture and Linux prototype. Section 7.5 reports the results of experiments
with micro-benchmarks and four different I/O workloads. I summarize and conclude in
Section 7.6.
7.1. Commodity high-performance remote data access
NFS owes its success to an open protocol, platform ubiquity, and transparent access
to file systems, independent of the underlying storage technology. Beyond performance
and scalability, standards-based high-performance data access needs all these properties
to be successful in Grid, cluster, enterprise, and personal computing.
The benefits of standards-based data access with these qualities are numerous. A sin-
gle client can access data within a LAN and across a WAN, which reduces the cost of
development, administration, and support. System administrators can select a storage so-
lution with confidence that no matter the operating system and hardware platform, users
are able to access the data. In addition, storage vendors are free to focus on advanced
data management features such as fault tolerance, archiving, manageability, and scalabil-
ity without having to custom tailor their products across a broad spectrum of client plat-
forms.
7.2. pNFS and storage protocol-specific layout drivers
This section revisits the pNFS architecture described in Chapter V and discusses the
drawbacks of using storage protocol-specific layout drivers.
7.2.1. Hybrid file system semantics
Although parallel file systems separate control and data flows, there is tight integra-
tion of the control and data protocols. Users must adapt to different semantics for each
data repository. pNFS, on the other hand, allows applications to realize common file sys-
tem semantics across data repositories. As users access heterogeneous data repositories
with pNFS, the NFSv4 metadata protocol provides a degree of consistency with respect
to the file system semantics within each repository.
Unfortunately, certain semantics are layout driver and storage protocol dependent,
and they can drastically change application behavior. For example, Panasas ActiveScale
[105] supports the OSD security protocol [136], while Lustre [104] uses a specialized security
protocol. This forces clients that need to access both parallel file systems to support multiple authentication, integrity, and privacy mechanisms. Additional examples of these
semantics include client caching and fault tolerance.
7.2.2. The burden of layout and I/O driver development
The pNFS layout and I/O drivers are the workhorses of pNFS high-performance data
access. These specialized components understand the storage system’s storage protocol,
security protocol, file system semantics, device identification, and layout description and
management. For pNFS to achieve broad heterogeneous data access, layout and I/O
drivers must be developed and supported on a multiplicity of operating system and hardware platforms—an effort comparable in magnitude to the development of a parallel file
system client.
7.2.3. The pNFS file-based layout driver
Currently, the IETF is developing three layout specifications: file, object, and block.
The pNFS protocol includes only the file-based layout format, with object- and block-
based to follow in separate specifications. As such, all pNFS implementations will sup-
port the file-based layout format for remote data access, while support for the object- and
block-based access methods will be optional.
A pNFS file-based layout governs an entire file and is valid until recalled by the
pNFS server. To perform data access, the file-based layout driver combines the layout
information with a known list of data servers for the file system, and sends READ,
WRITE, and COMMIT operations to the correct data servers. Once I/O is complete, the
client sends updated file metadata, e.g., size or modification time, to the pNFS server.
Figure 7.1: pNFS file-based architecture with a parallel file system. The pNFS file-based layout architecture consists of pNFS data servers, clients, and a metadata server, plus parallel file system (PFS) storage nodes, clients, and metadata servers. The three-tier design prevents direct storage access and creates overlapping and redundant storage and metadata protocols. The two-tier design, with pNFS servers, PFS clients, and storage on the same node, suffers from these problems plus diminished single-client bandwidth.
pNFS file-based layout information consists of:
• Striping type and stripe size
• Data server identifiers
• File handles (one for each data server)
• Policy parameters
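For concreteness, the layout fields listed above can be sketched as a simple record (a hypothetical illustration; the field names are mine, not the protocol's XDR definitions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileBasedLayout:
    """Hypothetical sketch of the pNFS file-based layout fields."""
    striping_type: str           # e.g., "round-robin"
    stripe_size: int             # bytes per stripe unit
    data_servers: List[str]      # data server identifiers
    file_handles: List[bytes]    # one file handle per data server
    policy: dict = field(default_factory=dict)  # policy parameters

layout = FileBasedLayout(
    striping_type="round-robin",
    stripe_size=2 * 1024 * 1024,          # 2 MB, as in the prototype
    data_servers=["ds1", "ds2", "ds3"],
    file_handles=[b"fh1", b"fh2", b"fh3"],
)
assert len(layout.file_handles) == len(layout.data_servers)
```

The one-file-handle-per-data-server pairing is what lets the client address each data server's copy of the file independently.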
Figure 7.1 illustrates how the pNFS file-based layout provides access to an asymmetric parallel file system. (Henceforth, I refer to this unspecified file system as PFS.) pNFS clients access pNFS data servers that export PFS clients, which in turn access data from PFS storage nodes and metadata from PFS metadata servers. A PFS management protocol binds metadata servers and storage, providing a consistent view of the file system. pNFS clients use NFSv4 for I/O while PFS clients use the PFS storage protocol.
7.2.3.1 Performance issues
Architecturally, using a file-based layout offers some latitude. The architecture depicted in Figure 7.1 might have two tiers, or it might have three. The three-tier architecture places PFS clients and storage on separate nodes, while the two-tier architecture places PFS clients and storage on the same nodes. As shown in Figure 7.2, neither choice features direct data access: the three-tier model has intermediary data servers, while with two tiers, tier-two PFS clients access data from other tier-two storage nodes. In addition, the two-tier model transfers data between data servers, reducing the available bandwidth between clients and data servers. These architectures can improve NFS scalability, but the lack of direct data access—a primary benefit of pNFS—scuttles performance.

Figure 7.2: pNFS file-based data access. Both 3-tier and 2-tier architectures lose direct data access. (a) Intermediary pNFS data servers access PFS storage nodes. (b) pNFS data servers access both local and remote PFS storage nodes.
Block size mismatches and overlapping metadata protocols also diminish performance. If the pNFS block size is greater than the PFS block size, a large pNFS data request produces extra PFS data requests, each incurring a fixed amount of overhead. Conversely, a small pNFS data request forces a large PFS data request, unnecessarily taxing storage resources and delaying the pNFS request. pNFS file system metadata requests to the pNFS server, e.g., for file size or layout information, become PFS client metadata requests to the PFS metadata server. This ripple effect increases overhead and delay for pNFS metadata requests.
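The first mismatch can be quantified with a back-of-the-envelope model (illustrative numbers, not measurements from this chapter):

```python
def backend_requests(pnfs_request: int, pfs_block: int) -> int:
    """Number of PFS requests generated by one pNFS data request
    when the request is split on PFS block boundaries."""
    return -(-pnfs_request // pfs_block)  # ceiling division

# A 2 MB pNFS request over a 64 KB PFS block size produces 32
# backend requests, each paying its fixed per-request overhead.
assert backend_requests(2 * 1024 * 1024, 64 * 1024) == 32

# If each PFS request carried, say, 100 us of fixed overhead, the
# mismatch alone would add ~3.2 ms to the 2 MB transfer.
overhead_us = backend_requests(2 * 1024 * 1024, 64 * 1024) * 100
assert overhead_us == 3200
```

The reverse case is the one-line observation that a sub-block pNFS request still costs a full PFS block of work on the back end.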
It is hard to address these remote access inefficiencies with fully connected block-based parallel file systems, e.g., GPFS [97], GFS [98, 99], and PolyServe Matrix Server [101], but for parallel file systems whose storage nodes admit NFS servers, Direct-pNFS offers a solution.
7.3. Direct-pNFS
Direct-pNFS supports direct data access—without requiring a storage system specific layout driver on every operating system and hardware platform—by exploiting file-based layouts to describe the exact distribution of data on the storage nodes. Since a Direct-pNFS client knows the exact location of a file's contents, it can target I/O requests to the correct data servers. Direct-pNFS supports direct data access to any parallel file system that allows NFS servers on its storage nodes—such as object-based [104, 105], PVFS2 [129], and IBRIX Fusion [156]—and inherits the operational, fault tolerance, and security semantics of NFSv4.

Figure 7.3: Direct-pNFS data access architecture. NFSv4 application components use an NFSv4 data component to perform I/O directly to a PFS data component bundled with storage. The NFSv4 metadata component shares its access control information with the data servers to ensure the data servers allow only authorized data requests.
7.3.1. Architecture
In the two- and three-tier pNFS architectures shown in Figure 7.1, the underlying data layout is opaque to pNFS clients. This forces them to distribute I/O requests among data servers without regard for the actual location of the data. To overcome this inefficient data access, Direct-pNFS, shown in Figure 7.4, uses a layout translator to convert a parallel file system's layout into a pNFS file-based layout. A pNFS server, which exists on every PFS data server, can satisfy Direct-pNFS client data requests by accessing the local PFS storage component. Direct-pNFS and PFS metadata components also co-exist on the same node, which eliminates remote PFS metadata requests from the pNFS server.
The Direct-pNFS data access architecture, shown in Figure 7.3, alters the NFSv4-PFS data access architecture (Figure 3.3) by using an NFSv4 data component to perform direct I/O to a PFS data component bundled with storage. The PFS data component proxies NFSv4 I/O requests to the local disk. An NFSv4 metadata component on storage maintains NFSv4 access control semantics.
Figure 7.4: Direct-pNFS with a parallel file system. Direct-pNFS eliminates overlapping I/O and metadata protocols and uses the NFSv4 storage protocol to directly access storage. The PFS uses a layout translator to convert its layout into a pNFS file-based layout. A Direct-pNFS client may use an aggregation driver to support specialized file striping methods.
In combination, the use of accurate layout information and the placement of pNFS servers on PFS storage and metadata nodes eliminates extra PFS data and metadata requests and obviates the need for data servers to support the PFS storage protocol altogether. The use of a single storage protocol also eliminates block size mismatches between storage protocols.
7.3.2. Layout translator
To give Direct-pNFS clients exact knowledge of the underlying data layout, a parallel file system uses the layout translator to specify a file's storage nodes, file handles, aggregation type, and policy parameters. The layout translator is independent of the underlying parallel file system and does not interpret PFS layout information. The layout translator simply gathers file-based layout information, as specified by the PFS, and creates a pNFS file-based layout. The overhead for a PFS to use the layout translator is small and confined to the PFS metadata server.
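A sketch of this gathering step (hypothetical field names; the real translator operates on kernel data structures, not dictionaries):

```python
def layout_translator(pfs_layout: dict) -> dict:
    """Gather PFS-supplied, file-based layout information into a
    standard pNFS file-based layout. The translator copies values
    verbatim; it never interprets PFS-internal layout state."""
    return {
        "striping_type": pfs_layout["aggregation"],
        "stripe_size": pfs_layout["stripe_size"],
        "data_servers": list(pfs_layout["storage_nodes"]),
        "file_handles": list(pfs_layout["file_handles"]),
        "policy": dict(pfs_layout.get("policy", {})),
    }

pnfs_layout = layout_translator({
    "aggregation": "round-robin",
    "stripe_size": 65536,
    "storage_nodes": ["node1", "node2"],
    "file_handles": ["fhA", "fhB"],
})
assert pnfs_layout["data_servers"] == ["node1", "node2"]
```

Because the translation is a pure copy, the per-file cost on the metadata server is proportional only to the number of storage nodes holding the file.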
7.3.3. Optional aggregation drivers
It is impossible for the pNFS protocol to support every method of distributing data among the storage nodes. At this writing, the pNFS protocol supports two aggregation schemes: round-robin striping and a second method that specifies a list of devices that form a cyclical pattern for all stripes in the file. To broaden support for unconventional aggregation schemes such as variable stripe size [157] and replicated or hierarchical striping [19, 158], Direct-pNFS also supports optional “pluggable” aggregation drivers. An aggregation driver provides a compact way for the Direct-pNFS client to understand how the underlying parallel file system maps file data onto the storage nodes.
Aggregation drivers are operating system and platform independent, and are based on
the distribution drivers in PVFS2, which use a standard interface to adapt to most striping
schemes. Although aggregation drivers are non-standard components, their development
effort is minimal compared to the effort required to develop an entire layout driver.
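As a sketch, an aggregation driver can be modeled as a pluggable mapping from a file offset to a (server, server offset) pair; the round-robin scheme would look roughly like this (the interface and names are hypothetical, not the PVFS2 distribution-driver API):

```python
class AggregationDriver:
    """Hypothetical pluggable driver interface: map a file offset to
    (data server index, offset within that server's object)."""
    def to_storage(self, offset: int) -> tuple:
        raise NotImplementedError

class RoundRobinDriver(AggregationDriver):
    def __init__(self, stripe_size: int, num_servers: int):
        self.stripe_size = stripe_size
        self.num_servers = num_servers

    def to_storage(self, offset: int) -> tuple:
        stripe = offset // self.stripe_size
        server = stripe % self.num_servers
        # Each server holds every num_servers-th stripe, packed densely.
        server_offset = ((stripe // self.num_servers) * self.stripe_size
                         + offset % self.stripe_size)
        return server, server_offset

rr = RoundRobinDriver(stripe_size=1024, num_servers=3)
assert rr.to_storage(0) == (0, 0)        # first stripe on server 0
assert rr.to_storage(1024) == (1, 0)     # second stripe on server 1
assert rr.to_storage(3072) == (0, 1024)  # fourth stripe wraps to server 0
```

A variable-stripe or hierarchical scheme would supply a different `to_storage` implementation behind the same interface, which is what keeps the drivers small relative to a full layout driver.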
7.4. Direct-pNFS prototype
I implemented a Direct-pNFS prototype that maintains strict agnosticism of the underlying storage system and, as we shall see, matches the performance of the storage system that it exports. Figure 7.5 displays the architecture of my Direct-pNFS prototype, using PVFS2 for the exported file system.

Scientific data is easily re-created, so PVFS2 buffers data on storage nodes and sends the data to stable storage only when necessary or at the application's request (fsync). To match this behavior, my Direct-pNFS prototype departs from the NFSv4 protocol, committing data to stable storage only when an application issues an fsync or closes the file.
At this writing, the user-level PVFS2 storage daemon does not support direct VFS access. Instead, the Direct-pNFS data servers simulate direct storage access by way of the existing PVFS2 client and the loopback device. The PVFS2 client on the data servers functions solely as a conduit between the NFSv4 server and the PVFS2 storage server on the same node.
My Direct-pNFS prototype uses special NFSv4 StateIDs for access to the data servers, round-robin striping as its aggregation scheme, and the GETDEVLIST, LAYOUTGET, and LAYOUTCOMMIT pNFS operations. A layout pertains to an entire file, is stored in the file's inode, and is valid for the lifetime of the inode.
7.5. Evaluation
In this section I assess the performance and I/O workload versatility of Direct-pNFS. I first use the IOR micro-benchmark [146] to demonstrate the scalability and performance of Direct-pNFS compared with PVFS2, the pNFS file-based layout with two and three tiers, and NFSv4. To explore the versatility of Direct-pNFS, I use two scientific I/O benchmarks and two macro benchmarks to represent a variety of access patterns to large storage systems:
NAS Parallel Benchmark 2.4 – BTIO. The NAS Parallel Benchmarks (NPB) are used to evaluate the performance of parallel supercomputers. The BTIO benchmark is based on a CFD code that uses an implicit algorithm to solve the 3D compressible Navier-Stokes equations. I use the class A problem set, which uses a 64x64x64 grid, performs 200 time steps, checkpoints data every five time steps, and generates a 400 MB checkpoint file. The benchmark uses MPI-IO collective file operations to ensure large write requests to the storage system. All parameters are left as default.
ATLAS Application. ATLAS [148] is a particle physics experiment that seeks new discoveries in head-on collisions of high-energy protons using the Large Hadron Collider accelerator [149] under construction at CERN. The ATLAS simulation runs in four stages; the Digitization stage simulates detector data generation. With 500 events, Digitization spreads approximately 650 MB randomly over a single file. Each client writes to a separate file. More information regarding ATLAS can be found in Section 6.3.3.
OLTP: OLTP models a database workload as a series of transactions on a single large
file. Each transaction consists of a random 8 KB read, modify, and write. Each client
performs 20,000 transactions, with data sent to stable storage after each transaction.
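The transaction loop can be sketched as follows (a toy reimplementation against a local file, not the benchmark's actual code):

```python
import os
import random
import tempfile

IO_SIZE = 8192  # 8 KB, as in the OLTP workload above

def oltp_transaction(fd: int, file_size: int) -> None:
    """One transaction: random 8 KB read, modify, write, then
    push the update to stable storage."""
    offset = random.randrange(0, file_size // IO_SIZE) * IO_SIZE
    block = bytearray(os.pread(fd, IO_SIZE, offset))
    block[0] ^= 0xFF                    # trivial stand-in for "modify"
    os.pwrite(fd, bytes(block), offset)
    os.fsync(fd)                        # stable storage after each txn

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))          # 1 MB data file for the sketch
    path = f.name

fd = os.open(path, os.O_RDWR)
for _ in range(10):                     # the benchmark runs 20,000
    oltp_transaction(fd, 1 << 20)
os.close(fd)
size = os.path.getsize(path)
os.unlink(path)
```

The per-transaction fsync is what makes this workload punishing: every 8 KB update pays a full round trip to stable storage.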
Postmark: The Postmark benchmark simulates metadata and small I/O intensive applications such as electronic mail, NetNews, and Web-based services [145]. Postmark performs transactions on a large number of small randomly sized files (between 1 KB and 500 KB). Each transaction first deletes, creates, or opens a file, then reads or appends 512 bytes. Data are sent to stable storage before the file is closed. Postmark performs 2,000 transactions on 100 files in 10 directories. All other parameters are left as default.

Figure 7.5: Direct-pNFS prototype architecture with the PVFS2 parallel file system. The PVFS2 metadata server converts the PVFS2 layout into a pNFS file-based layout, which is passed to the pNFS server and then to the Direct-pNFS file-based layout driver. The pNFS data server uses the PVFS2 client as a conduit to retrieve data from the local PVFS2 storage server. Data servers do not communicate.
7.5.1. Experimental setup
All experiments use a sixteen-node cluster connected via Gigabit Ethernet with jumbo
frames. To ensure a fair comparison between architectures, we keep the number of nodes
and disks in the back end constant. The PVFS2 1.5.1 file system has six storage nodes,
with one storage node doubling as a metadata manager, and a 2 MB stripe size. The
pNFS three-tier architecture uses three NFSv4 servers and three PVFS2 storage nodes.
For the three-tier architecture, we move the disks from the data servers to the storage
nodes. All NFS experiments use eight server threads and 2 MB wsize and rsize. All
nodes run Linux 2.6.17.
Storage System: Each PVFS2 storage node is equipped with dual 1.7 GHz P4 processors, 2 GB memory, one Seagate 80 GB 7200 RPM hard drive with an Ultra ATA/100 interface and 2 MB cache, and one 3Com 3C996B-T Gigabit Ethernet card.
Client System: Client nodes one through seven are equipped with dual 1.3 GHz P3
processor, 2 GB memory, and an Intel Pro Gigabit Ethernet card. Client nodes eight and
nine have the same configuration as the storage nodes.
7.5.2. Scalability and performance
Our first set of experiments uses the IOR benchmark to compare the scalability and performance of Direct-pNFS, PVFS2, the pNFS file-based layout with two and three tiers, and NFSv4. In the first set of experiments, clients sequentially read and write separate 500 MB files. In the second set of experiments, clients sequentially read and write a disjoint 500 MB portion of a single file. To view the effect of I/O request size on performance, the experiments use a large block size (2 to 4 MB) and a small block size (8 KB). Read experiments use a warm server cache. The presented value is the average over several executions of the benchmark.
Figure 7.6a and Figure 7.6b display the maximum aggregate write throughput with separate files and a single file. Direct-pNFS matches the performance of PVFS2, reaching a maximum aggregate write throughput of 119.2 MB/s and 110 MB/s for the separate and single file experiments.
pNFS-3tier write performance levels off at 83 MB/s with four clients. pNFS-3tier
must split the six available servers between data servers and storage nodes, which cuts
the maximum network bandwidth in half relative to the network bandwidth for the other
pNFS and PVFS2 architectures. In addition, using two disks in each storage node does
not offer twice the disk bandwidth of a single disk due to the constant level of CPU,
memory, and bus bandwidth.
Lacking direct data access, pNFS-2tier incurs a write delay and performs a little worse than Direct-pNFS and PVFS2. The additional transfer of data between data servers limits the maximum bandwidth between the pNFS clients and data servers. This is not visible in Figure 7.6a and Figure 7.6b because network bandwidth exceeds disk bandwidth, so Figure 7.6c repeats the multiple file write experiments with 100 Mbps Ethernet. With this change, pNFS-2tier yields only half the performance of Direct-pNFS and PVFS2, clearly demonstrating the network bottleneck of the pNFS-2tier architecture. NFSv4 performance is unaffected by the number of clients, indicating a single server bottleneck.
Figure 7.6: Direct-pNFS aggregate write throughput. (a) and (b) With a separate or single file and a large block size, Direct-pNFS scales with PVFS2 while pNFS-2tier suffers from a lack of direct file access. pNFS-3tier and NFSv4 are CPU limited. (c) With separate files and 100 Mbps Ethernet, pNFS-2tier is bandwidth limited due to its need to transfer data between data servers. (d) and (e) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.
Figure 7.6d and Figure 7.6e display the aggregate write throughput with separate files and a single file using an 8 KB block size. The performance of all NFSv4-based architectures is unchanged from the large block size experiments due to the NFSv4 client write back cache, which combines write requests until they reach the NFSv4 wsize (2 MB in my experiments). However, the performance of PVFS2, a parallel file system designed for large I/O, decreases dramatically with small block sizes, reaching a maximum aggregate write throughput of 39.4 MB/s.
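The write-back cache's effect on small sequential writes can be sketched with a toy coalescing model (not the Linux page cache implementation):

```python
def coalesce(writes, wsize):
    """Toy model of NFSv4 client write-back behavior: adjacent dirty
    ranges merge, and a wire request is issued once a merged run
    reaches wsize. Returns the number of wire requests."""
    requests, run = 0, 0
    last_end = None
    for offset, length in sorted(writes):
        if last_end is not None and offset != last_end:
            requests += 1          # flush the non-contiguous run
            run = 0
        run += length
        last_end = offset + length
        while run >= wsize:
            requests += 1
            run -= wsize
    return requests + (1 if run else 0)

# 256 sequential 8 KB application writes coalesce into a single
# 2 MB wire request instead of 256 small ones.
assert coalesce([(i * 8192, 8192) for i in range(256)], 2 * 1024 * 1024) == 1
```

This is why the NFSv4-based curves in Figure 7.6d and 7.6e match the large block size results: by the time requests reach the wire, the small application writes have already become wsize-sized transfers.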
Figure 7.7a and Figure 7.7b display the maximum aggregate read throughput with separate files and a single file. With separate files, Direct-pNFS matches the performance of PVFS2, reaching a maximum aggregate read throughput of 509 MB/s and 482 MB/s. With a single file, PVFS2 has lower throughput than Direct-pNFS with only a few clients, but outperforms Direct-pNFS with eight clients, reaching a maximum aggregate read throughput of 530.7 MB/s.
Direct-pNFS places the NFSv4 and PVFS2 server modules on the same node, inherently placing a higher demand on server resources. In addition, PVFS2 uses a fixed number of buffers to transfer data between the kernel and the user-level storage daemon, which creates an additional bottleneck. This is evident in Figure 7.7b, where PVFS2 achieves a higher aggregate I/O throughput than Direct-pNFS.

Figure 7.7: Direct-pNFS aggregate read throughput. (a) With separate files and a large block size, Direct-pNFS outperforms PVFS2 for some numbers of clients. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is bandwidth and CPU limited. (b) With a single file and a large block size, PVFS2 eventually outperforms Direct-pNFS due to a prototype software limitation. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is CPU limited. (c) and (d) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.
The division of the six available servers between data servers and storage nodes in pNFS-3tier limits its maximum performance again, achieving a maximum aggregate bandwidth of only 115 MB/s. NFSv4 aggregate performance is flat, limited to the bandwidth of a single server.
The pNFS-2tier bandwidth bottleneck is readily visible in Figure 7.7a and Figure
7.7b, where disk bandwidth is no longer a factor. Each data server is responding to client
read requests and transferring data to other data servers so they can satisfy their client
read requests. Sending data to multiple targets limits each data server’s maximum read
bandwidth.
Figure 7.7c and Figure 7.7d display the aggregate read throughput with separate files and a single file using an 8 KB block size. The performance of all NFSv4-based architectures remains unchanged from the large block size experiments due to NFSv4 client read gathering. The performance of PVFS2 again decreases dramatically with small block sizes, reaching a maximum aggregate read throughput of 51 MB/s.
7.5.3. Micro-benchmark discussion
Direct-pNFS matches or outperforms the aggregate I/O throughput of PVFS2. In addition, the asynchronous, multi-threaded design of Linux NFSv4 combined with its write back cache achieves superior performance with smaller block sizes.
In the write experiments, both Direct-pNFS and PVFS2 fully utilize the available disk bandwidth. In the read experiments, data are read directly from the server cache, so the disks are not a bottleneck. Instead, client and server CPU performance becomes the limiting factor. The pNFS-2tier architecture offers comparable performance with fewer clients, but is limited by network bandwidth as I increase the number of clients. The pNFS-3tier architecture demonstrates that using intermediary data servers to access data is inefficient: those resources are better used as storage nodes.
The remaining experiments further demonstrate the versatility of Direct-pNFS with
workloads that use a range of block sizes.
7.5.4. Scientific application benchmarks
This section uses two scientific benchmarks to assess the performance of Direct-pNFS in high-end computing environments.
7.5.4.1 ATLAS
To evaluate ATLAS Digitization write throughput, I use IOZone to replay the write trace data for 500 events. Each client writes to a separate file.
Figure 7.8a shows that Direct-pNFS can efficiently manage the mix of small and large write requests, achieving an aggregate write throughput of 102.5 MB/s with eight clients. While small write requests reduce the maximum write throughput achievable by Direct-pNFS by approximately 14 percent, they severely reduce the performance of PVFS2, which achieves only 41 percent of its maximum aggregate write throughput.
Figure 7.8: Direct-pNFS scientific and macro benchmark performance. (a) ATLAS. Direct-pNFS outperforms PVFS2 with a mixed small and large write request workload. (b) BTIO. Direct-pNFS and PVFS2 achieve comparable performance with a large read and write workload. Lower time values are better. (c) OLTP. Direct-pNFS outperforms PVFS2 with an 8 KB read-modify-write workload. (d) Postmark. Direct-pNFS outperforms PVFS2 in a small read and append workload.
7.5.4.2 NAS Parallel Benchmark 2.4 – BTIO
The Block-Triangle I/O benchmark is an industry standard for measuring the I/O performance of a cluster. Without optimization, BTIO I/O requests are small, ranging from a few hundred bytes to eight kilobytes. The version of the benchmark used in this chapter uses MPI-IO collective buffering [80], which increases the I/O request size to one MB and greater. The benchmark times also include the ingestion and verification of the result file.

BTIO performance experiments are shown in Figure 7.8b. BTIO running time is approximately the same for Direct-pNFS and PVFS2, with a maximum difference of five percent with nine clients.
7.5.5. Synthetic workloads
This section uses two macro-benchmarks to analyze the performance of Direct-pNFS
in a more general setting.
7.5.5.1 OLTP
Figure 7.8c displays the OLTP experimental results. Direct-pNFS scales well with
the workload’s random 8 KB read-modify-write transactions, achieving 26 MB/s with
eight clients. As expected, PVFS2 performs poorly with small I/O requests, achieving an
aggregate I/O throughput of 6 MB/s.
7.5.5.2 Postmark
For the small I/O workload of Postmark, I reduce the stripe size, wsize, and rsize to
64 KB. This allows a more even distribution of requests among the storage nodes.
The Postmark experiments are shown in Figure 7.8d, with results given in transactions per second. Direct-pNFS again leverages the asynchronous, multi-threaded Linux NFSv4 implementation, designed for small I/O intensive workloads like Postmark, to perform up to 36 times as many transactions per second as PVFS2.
7.5.6. Macro-benchmark discussion
This set of experiments demonstrates that Direct-pNFS performance compares well to
the exported parallel file system with the large I/O scientific application benchmark
BTIO. Direct-pNFS performance for ATLAS, for which 95% of the I/O requests are
smaller than 275 KB, far surpasses native file system performance. The Postmark and
OLTP benchmarks, also dominated by small I/O, yield similar results.
With Direct-pNFS demonstrating good performance on small I/O workloads, a natural next step is to explore performance with routine tasks such as a build/development environment. Following the SSH build benchmark [159], I created a benchmark that uncompresses, configures, and builds OpenSSH [160]. Using the same systems as above, I compare the SSH build execution time using Direct-pNFS and PVFS2.
I find that Direct-pNFS reduces compilation time, a stage heavily dominated by small read and write requests, but increases the time to uncompress and configure OpenSSH, stages dominated by file creates and attribute updates. Tasks like file creation—relatively simple for standalone file systems—become complex on parallel file systems, leading to a lot of inter-node communication. Consequently, many parallel file systems distribute metadata over many nodes and have clients gather and reconstruct the information, relieving the overloaded metadata server. The NFSv4 metadata protocol relies on a central metadata server, effectively recentralizing the decentralized parallel file system metadata protocol. The sharp contrast in metadata management cost between NFSv4 and parallel file systems—beyond the scope of this dissertation—merits further study.
7.6. Related work
Several pNFS layout drivers are under development. At this writing, Sun Microsystems, Inc. is developing file- and object-based layout implementations. Panasas object and EMC block drivers are currently under development. pNFS file-based layout drivers with the architecture in Figure 7.1 have been demonstrated with GPFS, Lustre, and PVFS2.
Network Appliance is using the Linux file-based layout driver to bind disparate filers (NFS servers) into a single file system image. The architecture differs from Figure 7.1 in that the filers are not fully connected to storage; each filer is a standalone server attached to Fibre Channel disk arrays. This continues previous work that aggregates partitioned NFS servers into a single file system image [91, 113, 114]. Direct-pNFS generalizes these architectures to be independent of the underlying parallel file system.
The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file system, an archival system, or a database, into a single data catalogue. The HTTP protocol is the most common and widespread way to access remote data stores. SRB and HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints and do not integrate with the local file system.
EMC's HighRoad [103] uses the NFS or CIFS protocol for its control operations and stores data in an aggregated LAN and SAN environment. Its use of file semantics facilitates data sharing in SAN environments, but is limited to the EMC Symmetrix storage system. A similar, non-commercial version is also available [141].
Another commodity protocol used along the high-performance data channel is the
4. Distributing I/O requests across multiple data servers inhibits the underlying file system's ability to perform effective readahead.

5. Overlapping metadata protocols lead to a communication ripple effect that increases overhead and delay for remote metadata requests.
Chapter V investigates a remote data access architecture that allows full access to the information stored by parallel file system components. This enables a distributed file system to utilize resources using the same information parallel file systems use to scale to large numbers of clients. I demonstrate that NFSv4 can use this information along with the storage protocol of the underlying file system to increase performance and scalability while retaining file system independence. In combination, the pNFS protocol, storage-specific layout drivers, and some parallel file system customizations can overcome the above I/O inefficiencies and enable customizable security and data access semantics. The pNFS prototype matches the performance of the underlying parallel file system and demonstrates that efficient layout generation is vital to achieve continuous scalability.
While components in distributed and parallel file systems may provide similar services, their capabilities can be vastly different. Chapter VI demonstrates that pNFS can increase the overall write throughput to parallel data stores—regardless of file size—by using direct, parallel I/O for large write requests and a distributed file system for small write requests. The large buffers, limited asynchrony, and high per-request overhead inherent to parallel file systems scuttle small I/O performance. By completely isolating and separating the control and data protocols in pNFS, a single file system can use any combination of storage and metadata protocols, each excelling at specific workloads or system environments. Use of multiple storage protocols increases the overall write performance of the ATLAS Digitization application by 57 to 100 percent.
Beyond the interaction of distributed and parallel file system components, the physical location of a component can affect overall system capabilities. Chapter VII analyzes the cost of having pNFS clients support a parallel file system data component. A storage-specific layout driver must be developed for every platform and operating system, reducing heterogeneity and widespread access to parallel file systems. In addition, requiring support for semantics that are layout driver and storage protocol dependent, e.g., security, client caching, and fault tolerance, reduces the data access transparency of NFSv4.
To increase the heterogeneity and transparency of pNFS, Direct-pNFS removes the requirement that clients incorporate parallel file system data components. Direct-pNFS uses the NFSv4 storage protocol for direct access to NFSv4-enabled parallel file system storage nodes. A single layout driver potentially reduces development effort while retaining NFSv4 file access and security semantics. To perform heterogeneous data access with only a single layout driver, parallel file-system specific data layout information is converted into the standard pNFS file-based layout format. Pluggable aggregation drivers provide support for most file distribution schemes. While aggregation drivers can limit widespread data access, their development effort is likely to be less than for a layout driver and they can be shared across storage systems. Direct-pNFS experiments demonstrate that a commodity storage protocol can match the I/O throughput of the exported parallel file system. Furthermore, Direct-pNFS leverages the small I/O strengths of NFSv4 to outperform the parallel file system client in diverse workloads.
The pNFS extensions to the NFSv4 protocol are included in the upcoming NFSv4.1
minor version specification [123]. Implementation of NFSv4.1 is under way on several
major operating systems, bringing effective global data access closer to reality.
8.2. Supplementary observations
From a practical standpoint, this dissertation can be used as a guide for capacity planning decisions. Organizations often allocate limited hardware resources to satisfy local data access requirements; adding resources for remote access is an afterthought. Used during the planning and acquisition phases, the analyses in this dissertation can serve as a guide for improving remote data access.
Unfortunately, the storage community suffers from a narrow vision of data access. As this dissertation explains, many parallel file system providers continue to believe that a storage solution can exist in isolation. This idea is quickly becoming old-fashioned. Specialization of parallel file systems limits the availability of data and reduces collaboration, a vital part of innovation [161]. In addition, the bandwidth available for remote access across the WAN is continuing to increase, with global collaborations now using multiple ten Gigabit Ethernet networks [162].
Storage solutions need a holistic approach that accounts for every data access persona. Each persona has a specific I/O workload, set of supported operating systems, and required level of scalability, performance, transparency, security, and data sharing. For example, the data access requirements of compute clusters, archival systems, and individual users (local and remote) are all different, but they all need to access the same storage system. Specializing file systems for a single persona widens the gap between applications and data.
pNFS attempts to address these diverse requirements by combining the strengths of NFSv4 with direct storage access. As a result, pNFS uses NFSv4 semantics, which include a level of fault tolerance as well as close-to-open cache consistency. As described in Section 7.4, NFSv4 fault tolerance semantics can decrease write performance by pushing data aggressively onto stable storage. In addition, some applications or data access techniques, e.g., collective I/O, require more control over the NFSv4 client data cache. Mandating that an application close and re-open a file to refresh its data cache (or acquire a lock on the file, which has the same effect) can increase delay and reduce performance. Maintaining consistent semantics across remote parallel file systems is important, but a distributed file system should allow applications to tune these semantics to meet their needs.
8.3. Beyond NFSv4
This dissertation focuses on the Network File System (NFS) protocol, which is distinguished by its precise definition by the IETF, the availability of open source implementations, and support on virtually every modern operating system. It is natural to ask whether the scalability and performance benefits of the pNFS architecture can be realized by other distributed file systems such as AFS or CIFS. In other words, is it possible to design and engineer pAFS, pCIFS, or even pNFSv3? If so, what are the necessary requirements of a distributed file system that permit this transformation?
To answer these questions, let us first review the pNFS architecture:
1. pNFS client. pNFS extends the standard NFSv4 client by delegating application I/O requests to a storage-specific layout driver. Either the pNFS client or each individual layout driver can manage and cache layout information.
2. pNFS server. pNFS extends the standard NFSv4 server with the ability to relay client file layout requests to the underlying parallel file system and respond with the resultant opaque layout information. In addition, the pNFS server tracks outstanding layout information so that it can be recalled in case a file is renamed or its layout information is modified5.
3. pNFS metadata protocol. The base NFSv4 metadata protocol enables clients and servers to request, update, and recall file system and file metadata information, e.g., vnode information, and request file locks. pNFS extends this protocol to request, return, and recall file layout information.
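A minimal sketch of the bookkeeping in item 2, tracking which clients hold outstanding layouts so they can be recalled, might look like the following. The class and method names are invented for illustration and are not taken from any implementation.

```python
# Hypothetical sketch of server-side layout bookkeeping: record which
# clients hold a layout for each file so the layouts can be recalled
# when the file is renamed or restriped.

class LayoutTracker:
    def __init__(self):
        self.outstanding = {}          # file handle -> set of client ids

    def grant(self, fh, client):
        """Record that `client` now holds a layout for file `fh`."""
        self.outstanding.setdefault(fh, set()).add(client)

    def release(self, fh, client):
        """Client voluntarily returned its layout."""
        self.outstanding.get(fh, set()).discard(client)

    def recall(self, fh):
        """File renamed or restriped: return the clients whose layouts
        must be recalled, and forget them once the recall is issued."""
        return self.outstanding.pop(fh, set())

tracker = LayoutTracker()
tracker.grant("fh1", "clientA")
tracker.grant("fh1", "clientB")
print(sorted(tracker.recall("fh1")))   # both clients must be recalled
```

In a real server this state would also interact with lease expiry and client reboot recovery, which the sketch omits.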
Fundamentally, pNFS extends NFSv4 in its ability to retrieve, manage, and utilize file layout information. Any distributed file system with a client, server, and a metadata protocol that can be extended to retrieve layout information is a candidate for pNFS transformation (although the implementation details will vary). A single server is not necessary along the control path, but a distributed file system must have the ability to fulfill metadata requests. For example, some file systems do not centralize file size on a metadata server, but dynamically build the file size through queries to the data servers.
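The last example can be made concrete: under round-robin striping, such a file system can derive the logical file size from the per-server object sizes alone. The sketch below is illustrative only, with invented names, and assumes a simple round-robin placement of stripe units.

```python
# Illustrative sketch: derive the logical size of a round-robin striped
# file from the sizes of its per-server stripe objects, instead of
# storing the size centrally on a metadata server.

def logical_size(object_sizes, stripe_unit):
    """object_sizes[i] is the byte length of the object on data server i."""
    n = len(object_sizes)
    eof = 0
    for i, size in enumerate(object_sizes):
        if size == 0:
            continue
        full_units, remainder = divmod(size, stripe_unit)
        if remainder:
            # Partial final unit: it is global stripe unit full_units*n + i.
            end = (full_units * n + i) * stripe_unit + remainder
        else:
            # Last unit on this server is full: unit (full_units-1)*n + i.
            end = ((full_units - 1) * n + i + 1) * stripe_unit
        eof = max(eof, end)
    return eof

# Three servers, 4-byte stripe unit: a 10-byte file stores objects of
# 4, 4, and 2 bytes; the logical size is recovered as 10.
print(logical_size([4, 4, 2], stripe_unit=4))
```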
pNFS does not rely on distributed file system support of a storage protocol since it
isolates the storage protocol in the layout driver. However, it is important to note that
Split-Server NFSv4, Direct-pNFS, and file-based pNFS use the NFSv4 storage protocol
along the data path, and a distributed file system must support its own storage protocol to
realize the benefits of these architectures.
The stateful nature of NFSv4, which enables server callbacks to the clients, is not strictly required to support the pNFS protocol. Instead, a layout driver could discover that its layout information is invalid through error messages returned by the data servers. Unfortunately, block-based data servers such as Fibre Channel disks cannot run NFSv4 servers and therefore cannot return pNFS error codes. This would prohibit stateless distributed file systems, such as NFSv3, from supporting block-based layout drivers.
A core benefit of pNFS is its ability to provide remote access to parallel file systems. pNFS achieves this level of independence by leveraging the NFS Vnode definition and VFS interface. DFS, SMB, and the AppleTalk Filing Protocol all use their own mechanisms to achieve this level of independence. Other distributed file systems, such as AFS, do not currently support remote data access to existing data stores6.
5 File layout information can change if a file is restriped due to load balancing, lack of free disk space on a particular data server, or server intervention.
Most NFSv4 implementations also include additional features such as locks, readahead, a write-back cache, and a data cache. These features provide many benefits but are unrelated to the support of the pNFS protocol.
8.4. Extensions
This section presents some research themes that hold the potential to further improve
the overall performance of remote data access.
8.4.1. I/O performance
MPI is the dominant tool for inter-node communication, while MPI-IO is the nascent tool for cluster I/O. Native support for MPI-IO implementations (e.g., ROMIO) by remote access file systems such as pNFS is critical. Figure 8.1 offers an example of an inter-cluster data transfer over the WAN. The application cluster is running an MPI application that wants to read a large amount of data from the server cluster and perhaps write to its backend. The MPI head node obtains the data location from the server cluster and distributes portions of the data location information (via MPI) to the other application cluster nodes, enabling direct access to server cluster storage devices. The application then uses MPI-IO to read data in parallel from the server cluster across the WAN, processes the data, and directs output to the application cluster backend. A natural use case for this architecture is a visualization application processing the results of a scientific application run on the server cluster. Another use case is an application making a local copy of data from the server cluster on the application cluster.
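As an illustrative sketch, and not an implementation of Figure 8.1, the head node's distribution step can be reduced to partitioning the remote file's byte range among the application cluster nodes. The function name is an assumption, and the MPI scatter is only simulated here.

```python
# Hypothetical sketch of the distribution step described above: the head
# node splits a large remote file into contiguous byte ranges, one per
# application cluster node; each node then reads its range directly from
# the server cluster's storage.

def partition(file_size, num_nodes):
    """Split [0, file_size) into num_nodes near-equal contiguous ranges."""
    base, extra = divmod(file_size, num_nodes)
    ranges, start = [], 0
    for rank in range(num_nodes):
        length = base + (1 if rank < extra else 0)
        ranges.append((start, length))
        start += length
    return ranges

# The head node computes one (offset, length) pair per node; with MPI
# these would be delivered by point-to-point sends or a scatter.
print(partition(1000, 3))   # [(0, 334), (334, 333), (667, 333)]
```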
Regarding NFSv4, several improvements to the protocol could benefit applications. For example, List I/O support has shown great application performance benefits with other storage protocols [163, 164]. In addition, many NFSv4 implementations have a maximum data transfer size of 64 KB to 1 MB, while many parallel file systems support a maximum transfer size of 4 MB and larger. Efficient implementation of larger NFSv4 data transfer sizes is vital for high-performance access to parallel file systems.
6 HostAFSd, currently under development, will allow AFS servers to export the local file system.
Finally, this dissertation uses a distributed file system to improve I/O throughput to parallel file systems, but leaves the architecture of the parallel file system largely untouched. Parallel file system architecture modifications may also improve remote data access performance, e.g., data and metadata replication and caching [165, 166], and various readahead algorithms.
8.4.2. Metadata management
This dissertation focuses on improving I/O throughput for scientific applications.
Some research has focused on improving metadata performance in high-performance
computing [116, 167, 168], but most parallel file systems implement proprietary metadata
management schemes7.
To offload overburdened metadata servers, many parallel file systems distribute
metadata information among multiple servers and employ their clients to perform tasks
that involve communication with storage, e.g., file creation. Most distributed file systems
use a single metadata server, creating a metadata management bottleneck that threatens to eliminate the benefits of the parallel file system's metadata management distribution. Stateful distributed file servers only exacerbate the problem. Additional research is needed to explore how distributed file systems can optimize metadata management without introducing these new bottlenecks and inefficiencies.
Figure 8.1: pNFS and inter-cluster data transfers across the WAN. A pNFS cluster retrieves data from a remote storage system, processes the data, and writes to its local storage system. The MPI head node distributes layout information to pNFS clients.
7 I am unaware of any additional research beyond this dissertation into the interaction between distributed and parallel file system metadata subsystems.
Bibliography
[1] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and Implementation of the Sun Network Filesystem," in Proceedings of the Summer USENIX Technical Conference, Portland, OR, 1985.
[2] Common Internet File System File Access Protocol (CIFS), msdn.microsoft.com/library/en-us/cifs/protocol/cifs.asp.
[3] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, NFS Version 4 Protocol Specification. RFC 3530, 2003.
[4] B. Allcock, J. Bester, J. Bresnahan, A.L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke, "Data Management and Transfer in High-Performance Computational Grid Environments," Parallel Computing, 28(5):749-771, 2002.
[5] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-32, 1995.
[6] Biowulf at the NIH, biowulf.nih.gov/apps/blast.
[7] S. Berchtold, C. Boehm, D.A. Keim, and H. Kriegel, "A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space," in Proceedings of the 16th ACM PODS Symposium, Tucson, AZ, 1997.
[8] P. Caulk, "The Design of a Petabyte Archive and Distribution System for the NASA ECS Project," in Proceedings of the 4th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 1995.
[9] J. Behnke and A. Lake, "EOSDIS: Archive and Distribution Systems in the Year 2000," in Proceedings of the 17th IEEE/8th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2000.
[10] J. Behnke, T.H. Watts, B. Kobler, D. Lowe, S. Fox, and R. Meyer, "EOSDIS Petabyte Archives: Tenth Anniversary," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005.
[11] D. Strauss, "Linux Helps Bring Titanic to Life," Linux Journal, 46, 1998.
[13] Petascale Data Storage Institute, www.pdl.cmu.edu/PDSI.
[14] Serial ATA Workgroup, "Serial ATA: High Speed Serialized AT Attachment," Rev. 1, 2001.
[15] Adaptec Inc., "Ultra320 SCSI: New Technology-Still SCSI," www.adaptec.com.
[16] T.M. Anderson and R.S. Cornelius, "High-Performance Switching with Fibre Channel," in Proceedings of the 37th IEEE International Conference on COMPCON, San Francisco, CA, 1992.
[17] K. Salem and H. Garcia-Molina, "Disk Striping," in Proceedings of the 2nd International Conference on Data Engineering, Los Angeles, CA, 1986.
[19] D.A. Patterson, G.A. Gibson, and R.H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proceedings of the ACM SIGMOD Conference on Management of Data, Chicago, IL, 1988.
[20] Small Computer System Interface (SCSI) Specification. ANSI X3.131-1986, www.t10.org, 1986.
[22] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, Internet Small Computer Systems Interface (iSCSI). RFC 3720, 2004.
[23] K.Z. Meth and J. Satran, "Design of the iSCSI Protocol," in Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, San Diego, CA, 2003.
[24] J. Postel and J. Reynolds, File Transfer Protocol (FTP). RFC 959, 1985.
[25] B. Callaghan, NFS Illustrated. Essex, UK: Addison-Wesley, 2000.
[26] PASC, "IEEE Standard Portable Operating System Interface for Computer Environments," IEEE Std 1003.1-1988, 1988.
[27] D.R. Brownbridge, L.F. Marshall, and B. Randell, "The Newcastle Connection or UNIXes of the World Unite!," Software-Practice and Experience, 12(12):1147-1162, 1982.
[28] P.J. Leach, P.H. Levine, J.A. Hamilton, and B.L. Stumpf, "The File System of an Integrated Local Network," in Proceedings of the 13th ACM Annual Conference on Computer Science, New Orleans, LA, 1985.
[29] P.H. Levine, The Apollo DOMAIN Distributed File System. New York, NY: Springer-Verlag, 1987.
[30] P.J. Leach, B.L. Stumpf, J.A. Hamilton, and P.H. Levine, "UIDs as Internal Names in a Distributed File System," in Proceedings of the 1st Symposium on Principles of Distributed Computing, Ottawa, Canada, 1982.
[31] G. Popek, The LOCUS Distributed System Architecture. Cambridge, MA: MIT Press, 1986.
[32] A.P. Rifkin, M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh, "RFS Architectural Overview," in Proceedings of the Summer USENIX Technical Conference, Atlanta, GA, 1986.
[33] S.R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," in Proceedings of the Summer USENIX Technical Conference, Atlanta, GA, 1986.
[34] R. Srinivasan, RPC: Remote Procedure Call Protocol Specification Version 2. RFC 1831, 1995.
[35] R. Srinivasan, XDR: External Data Representation Standard. RFC 1832, 1995.
[36] Sun Microsystems Inc., NFS: Network File System Protocol Specification. RFC 1094, 1989.
[37] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS Version 3 Design and Implementation," in Proceedings of the Summer USENIX Technical Conference, Boston, MA, 1994.
[38] B. Callaghan, B. Pawlowski, and P. Staubach, NFS Version 3 Protocol Specification. RFC 1813, 1995.
[39] B. Callaghan and T. Lyon, "The Automounter," in Proceedings of the Winter USENIX Technical Conference, San Diego, CA, 1989.
[40] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West, "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, 6(1):51-81, 1988.
[41] J.G. Steiner, C. Neuman, and J.I. Schiller, "Kerberos: An Authentication Service for Open Network Systems," in Proceedings of the Winter USENIX Technical Conference, Dallas, TX, 1988.
[42] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Transactions on Computers, 39(4):447-459, 1990.
[43] M.L. Kazar, B.W. Leverett, O.T. Anderson, V. Apostolides, B.A. Bottos, S. Chutani, C.F. Everhart, W.A. Mason, S. Tu, and R. Zayas, "DEcorum File System Architectural Overview," in Proceedings of the Summer USENIX Technical Conference, Anaheim, CA, 1990.
[44] S. Chutani, O.T. Anderson, M.L. Kazar, B.W. Leverett, W.A. Mason, and R.N. Sidebotham, "The Episode File System," in Proceedings of the Winter USENIX Technical Conference, Berkeley, CA, 1992.
[45] C. Gray and D. Cheriton, "Leases: An Efficient Fault-tolerant Mechanism for Distributed File Cache Consistency," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, 1989.
[47] International Standards Organization, Information Processing Systems - Open Systems Interconnection - Basic Reference Model. Draft International Standard 7498, 1984.
[48] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Detailed Specifications. RFC 1002, 1987.
[49] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Concepts and Methods. RFC 1001, 1987.
[50] J.D. Blair, SAMBA: Integrating UNIX and Windows. Specialized Systems Consultants Inc., 1998.
[51] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch, "The Sprite Network Operating System," IEEE Computer, 21(2):23-36, 1988.
[52] M.N. Nelson, B.B. Welch, and J.K. Ousterhout, "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, 6(1):134-154, 1988.
[53] V. Srinivasan and J. Mogul, "Spritely NFS: Experiments With Cache-Consistency Protocols," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989.
[54] R. Macklem, "Not Quite NFS, Soft Cache Consistency for NFS," in Proceedings of the Winter USENIX Technical Conference, San Francisco, CA, 1994.
[59] S. Habata, M. Yokokawa, and S. Kitawaki, "The Earth Simulator System," NEC Research and Development Journal, 44(1):21-26, 2003.
[60] ASCI Red, www.sandia.gov/ASCI/Red.
[61] M. Satyanarayanan, "A Study of File Sizes and Functional Lifetimes," in Proceedings of the 8th ACM Symposium on Operating System Principles, Pacific Grove, CA, 1981.
[62] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer, "Passive NFS Tracing of Email and Research Workloads," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2003.
[63] M.G. Baker, J.H. Hartman, M.D. Kupfer, K.W. Shirriff, and J.K. Ousterhout, "Measurements of a Distributed File System," in Proceedings of the 13th ACM Symposium on Operating Systems Principles, Pacific Grove, CA, 1991.
[64] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M. Kupfer, and J.G. Thompson, "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," in Proceedings of the 10th ACM Symposium on Operating Systems Principles, Orcas Island, WA, 1985.
[65] B. Fryxell, K. Olson, P. Ricker, F.X. Timmes, M. Zingale, D.Q. Lamb, P. MacNeice, R. Rosner, and H. Tufo, "FLASH: An Adaptive Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear Flashes," Astrophysical Journal Supplement, 131:273-334, 2000.
[66] A. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," in Proceedings of the ClusterWorld Conference and Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, San Jose, CA, 2003.
[67] E.L. Miller and R.H. Katz, "Input/Output Behavior of Supercomputing Applications," in Proceedings of Supercomputing '91, Albuquerque, NM, 1991.
[68] B. Schroeder and G. Gibson, "A Large Scale Study of Failures in High-Performance-Computing Systems," in Proceedings of the International Conference on Dependable Systems and Networks, Philadelphia, PA, 2006.
[69] G. Grider, L. Ward, R. Ross, and G. Gibson, "A Business Case for Extensions to the POSIX I/O API for High End, Clustered, and Highly Concurrent Computing," www.opengroup.org/platform/hecewg, 2006.
[70] A.T. Wong, L. Oliker, W.T.C. Kramer, T.L. Kaltz, and D.H. Bailey, "ESP: A System Utilization Benchmark," in Proceedings of Supercomputing '00, Dallas, TX, 2000.
[71] B.K. Pasquale and G.C. Polyzos, "Dynamic I/O Characterization of I/O Intensive Scientific Applications," in Proceedings of Supercomputing '94, Washington, D.C., 1994.
[72] D. Kotz and N. Nieuwejaar, "Dynamic File-Access Characteristics of a Production Parallel Scientific Workload," in Proceedings of Supercomputing '94, Washington, D.C., 1994.
[73] A. Purakayastha, C. Schlatter Ellis, D. Kotz, N. Nieuwejaar, and M. Best, "Characterizing Parallel File-Access Patterns on a Large-Scale Multiprocessor," in Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, 1995.
[74] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. Schlatter Ellis, and M. Best, "File-Access Characteristics of Parallel Scientific Workloads," IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, 1996.
[75] E. Smirni and D.A. Reed, "Workload Characterization of Input/Output Intensive Parallel Applications," in Proceedings of the Conference on Modeling Techniques and Tools for Computer Performance Evaluation, Saint Malo, France, 1997.
[76] E. Smirni, R.A. Aydt, A.A. Chien, and D.A. Reed, "I/O Requirements of Scientific Applications: An Evolutionary View," in Proceedings of the 5th IEEE Conference on High Performance Distributed Computing, Syracuse, NY, 1996.
[77] P.E. Crandall, R.A. Aydt, A.A. Chien, and D.A. Reed, "Input/Output Characteristics of Scalable Parallel Applications," in Proceedings of Supercomputing '95, San Diego, CA, 1995.
[78] F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. McLarty, "File System Workload Analysis For Large Scale Scientific Computing Applications," in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2004.
[79] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra, MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996.
[80] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, MPI: The Complete Reference, Volume 2--The MPI-2 Extensions. Cambridge, MA: MIT Press, 1998.
[81] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, 22(6):789-828, 1996.
[82] R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO," in Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, 1999.
[83] J.P. Prost, R. Treumann, R. Hedges, B. Jia, and A.E. Koniges, "MPI-IO/GPFS, an Optimized Implementation of MPI-IO on top of GPFS," in Proceedings of Supercomputing '01, Denver, CO, 2001.
[84] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi, "Passion: Optimized I/O for Parallel Applications," IEEE Computer, 29(6):70-78, 1996.
[85] D. Kotz, "Disk-directed I/O for MIMD Multiprocessors," ACM Transactions on Computer Systems, 15(1):41-74, 1997.
[86] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett, "Server-Directed Collective I/O in Panda," in Proceedings of Supercomputing '95, 1995.
[87] J. del Rosario, R. Bordawekar, and A. Choudhary, "Improved Parallel I/O via a Two-Phase Run-time Access Strategy," in Proceedings of the Workshop on I/O in Parallel Computer Systems at IPPS '93, Newport Beach, CA, 1993.
[88] OpenMP Consortium, "OpenMP C and C++ Application Program Interface, Version 1.0," www.openmp.org, 1997.
[89] High Performance Fortran Forum, "High Performance Fortran Language Specification, Version 2.0," hpff.rice.edu/versions/hpf2/hpf-v20, 1997.
[90] C. Juszczak, "Improving the Write Performance of an NFS Server," in Proceedings of the Winter USENIX Technical Conference, San Francisco, CA, 1994.
[91] P. Lombard and Y. Denneulin, "nfsp: A Distributed NFS Server for Clusters of Workstations," in Proceedings of the 16th International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, 2002.
[92] C.A. Thekkath, T. Mann, and E.K. Lee, "Frangipani: A Scalable Distributed File System," in Proceedings of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, 1997.
[93] R.A. Coyne and H. Hulen, "An Introduction to the Mass Storage System Reference Model, Version 5," in Proceedings of the 12th IEEE Symposium on Mass Storage Systems, Monterey, CA, 1993.
[94] S.W. Miller, A Reference Model for Mass Storage Systems. San Diego, CA: Academic Press Professional, Inc., 1988.
[95] A.L. Drapeau, K. Shirriff, E.K. Lee, J.H. Hartman, E.L. Miller, S. Seshan, R.H. Katz, K. Lutz, D.A. Patterson, P.H. Chen, and G.A. Gibson, "RAID-II: A High-Bandwidth Network File Server," in Proceedings of the 21st International Symposium on Computer Architecture, Chicago, IL, 1994.
[96] L. Cabrera and D.D.E. Long, "SWIFT: Using Distributed Disk Striping To Provide High I/O Data Rates," Computing Systems, 4(4):405-436, 1991.
[97] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2002.
[98] Red Hat Software Inc., "Red Hat Global File System," www.redhat.com/software/rha/gfs.
[99] S.R. Soltis, T.M. Ruwart, and M.T. O'Keefe, "The Global File System," in Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems, College Park, MD, 1996.
[100] M. Fasheh, "OCFS2: The Oracle Clustered File System, Version 2," in Proceedings of the Linux Symposium, Ottawa, Canada, 2006.
[101] Polyserve Inc., "Matrix Server Architecture," www.polyserve.com.
[102] C. Brooks, H. Dachuan, D. Jackson, M.A. Miller, and M. Resichini, "IBM TotalStorage: Introducing the SAN File System," IBM Redbooks, 2003, www.redbooks.ibm.com/redbooks/pdfs/sg247057.pdf.
[106] D.D.E. Long, B.R. Montague, and L. Cabrera, "Swift/RAID: A Distributed RAID System," Computing Systems, 7(3):333-359, 1994.
[107] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000.
[108] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, Stream Control Transmission Protocol. RFC 2960, 2000.
[109] RDMA Consortium, www.rdmaconsortium.org.
[110] B. Callaghan and S. Singh, "The Autofs Automounter," in Proceedings of the Summer USENIX Technical Conference, Cincinnati, OH, 1993.
[111] S. Tweedie, "Ext3, Journaling Filesystem," in Proceedings of the Linux Symposium, Ottawa, Canada, 2000.
[112] ReiserFS, www.namesys.com.
[113] F. Garcia-Carballeira, A. Calderon, J. Carretero, J. Fernandez, and J.M. Perez, "The Design of the Expand File System," International Journal of High Performance Computing Applications, 17(1):21-37, 2003.
[114] G.H. Kim, R.G. Minnich, and L. McVoy, "Bigfoot-NFS: A Parallel File-Striping NFS Server (Extended Abstract)," 1994, www.bitmover.com/lm.
[115] A. Butt, T.A. Johnson, Y. Zheng, and Y.C. Hu, "Kosha: A Peer-to-Peer Enhancement for the Network File System," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004.
[116] D.C. Anderson, J.S. Chase, and A.M. Vahdat, "Interposed Request Routing for Scalable Network Storage," in Proceedings of the 4th Symposium on Operating System Design and Implementation, San Diego, CA, 2000.
[117] M. Eisler, A. Chiu, and L. Ling, RPCSEC_GSS Protocol Specification. RFC 2203, 1997.
[118] M. Eisler, LIPKEY - A Low Infrastructure Public Key Mechanism Using SPKM. RFC 2847, 2000.
[119] J. Linn, The Kerberos Version 5 GSS-API Mechanism. RFC 1964, 1996.
[120] NFS Extensions for Parallel Storage (NEPS), www.citi.umich.edu/NEPS.
[121] G. Gibson and P. Corbett, pNFS Problem Statement. Internet Draft, 2004.
[122] G. Gibson, B. Welch, G. Goodson, and P. Corbett, Parallel NFS Requirements and Design Considerations. Internet Draft, 2004.
[123] S. Shepler, M. Eisler, and D. Noveck, NFSv4 Minor Version 1. Internet Draft, 2006.
[124] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, "Scalability in the XFS File System," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, 1996.
[125] IEEE Storage System Standards Working Group (SSSWG) (Project 1244), "Reference Model for Open Storage Systems Interconnection - Mass Storage System Reference Model Version 5," ssswg.org/public_documents/MSSRM/V5-pref.html, 1994.
[126] R.A. Coyne, H. Hulen, and R. Watson, "The High Performance Storage System," in Proceedings of Supercomputing '93, Portland, OR, 1993.
[127] TeraGrid, www.teragrid.org.
[128] W.D. Norcott and D. Capps, "IOZone Filesystem Benchmark," 2003, www.iozone.org.
[129] Parallel Virtual File System - Version 2, www.pvfs.org.
[130] C. Baru, R. Moore, A. Rajasekar, and M. Wan, "The SDSC Storage Resource Broker," in Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, Canada, 1998.
[131] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang, "Serverless Network File Systems," in Proceedings of the 15th ACM Symposium on Operating System Principles, Copper Mountain Resort, CO, 1995.
[132] B.S. White, M. Walker, M. Humphrey, and A.S. Grimshaw, "LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications," in Proceedings of Supercomputing '01, Denver, CO, 2001.
[133] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm Architecture for Petascale Data Intensive Computing," in Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Berlin, Germany, 2002.
[134] P. Andrews, C. Jordan, and W. Pfeiffer, "Marching Towards Nirvana: Configurations for Very High Performance Parallel File Systems," in Proceedings of the HiperIO Workshop, Barcelona, Spain, 2006.
[135] P. Andrews, C. Jordan, and H. Lederer, "Design, Implementation, and Production Experiences of a Global Storage Grid," in Proceedings of the 23rd IEEE/14th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2006.
[139] Internet Assigned Numbers Authority, www.iana.org.
[140] M. Olson, K. Bostic, and M. Seltzer, "Berkeley DB," in Proceedings of the Summer USENIX Technical Conference, FREENIX track, Monterey, CA, 1999.
[141] A. Bhide, A. Engineer, A. Kanetkar, A. Kini, C. Karamanolis, D. Muntz, Z. Zhang, and G. Thunquest, "File Virtualization with DirectNFS," in Proceedings of the 19th IEEE/10th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2002.
[142] R. Rew and G. Davis, "The Unidata netCDF: Software for Scientific Data Access," in Proceedings of the 6th International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, CA, 1990.
[143] NCSA, "HDF5 ", hdf.ncsa.uiuc.edu/HDF5.
[144] Unidata Program Center, "Where is NetCDF Used?," www.unidata.ucar.edu/software/netcdf/usage.html.
[145] J. Katcher, "PostMark: A New File System Benchmark," Network Appliance, Technical Report TR3022, 1997.
[150] ATLAS Development Team (private communication), 2005.
[151] M. Rosenblum and J.K. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, 10(1):26-52, 1992.
[152] J.H. Hartman and J.K. Ousterhout, "The Zebra Striped Network File System," ACM Transactions on Computer Systems, 13(3):274-310, 1995.
[153] P.F. Corbett and D.G. Feitelson, "The Vesta Parallel File System," ACM Transactions on Computer Systems, 14(3):225-264, 1996.
[154] B. Halevy, B. Welch, J. Zelenka, and T. Pisek, Object-based pNFS Operations. Internet Draft, 2006.
[155] D. Hildebrand and P. Honeyman, "Exporting Storage Systems in a Scalable Manner with pNFS," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005.
[156] IBRIX Fusion, www.ibrix.com.
[157] S.V. Anastasiadis, K.C. Sevcik, and M. Stumm, "Disk Striping Scalability in the Exedra Media Server," in Proceedings of the ACM/SPIE Multimedia Computing and Networking, San Jose, CA, 2001.
[158] F. Isaila and W.F. Tichy, "Clusterfile: A Flexible Physical Layout Parallel File System," in Proceedings of the IEEE International Conference on Cluster Computing, Newport Beach, CA, 2001.
[159] M. Seltzer, G. Ganger, M.K. McKusick, K. Smith, C. Soules, and C. Stein, "Journaling versus Soft Updates: Asynchronous Meta-data Protection in File Systems," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, 2000.
[160] OpenSSH, www.openssh.org.
[161] M. Enserink and G. Vogel, "Deferring Competition, Global Net Closes In on SARS," Science, 300(5617):224-225, 2003.
[162] H. Newman, J. Bunn, R. Cavanaugh, I. Legrand, S. Low, S. McKee, D. Nae, S. Ravot, C. Steenberg, X. Su, M. Thomas, F. van Lingen, and Y. Xia, "The UltraLight Project: The Network as an Integrated and Managed Resource in Grid Systems for High Energy Physics and Data Intensive Science," Computing in Science and Engineering, 7(6), 2005.
[163] A. Ching, W. Feng, H. Lin, X. Ma, and A. Choudhary, "Exploring I/O Strategies for Parallel Sequence Database Search Tools with S3aSim," in Proceedings of the International Symposium on High Performance Distributed Computing, Paris, France, 2006.
[164] A. Ching, A. Choudhary, W.K. Liao, R. Ross, and W. Gropp, "Noncontiguous I/O through PVFS," in Proceedings of the IEEE International Conference on Cluster Computing, Chicago, IL, 2002.
[165] S.A. Weil, S.A. Brandt, E.L. Miller, and C. Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data," in Proceedings of Supercomputing '06, Tampa, FL, 2006.
[166] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, 2003.
[167] S.A. Weil, K.T. Pollack, S.A. Brandt, and E.L. Miller, "Dynamic Metadata Management for Petabyte-scale File Systems," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004.
[168] R.B. Avila, P.O.A. Navaux, P. Lombard, A. Lebre, and Y. Denneulin, "Performance Evaluation of a Prototype Distributed NFS Server," in Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguaçu, Brazil, 2004.