
GPFS in Today's HPC Processing Center
"This is not your father's GPFS"

Raymond L. Paden, Ph.D.
HPC Technical Architect

Deep Computing

3 June 2005

[email protected]   713 - 940 - 1084

This presentation was produced in the United States. IBM may not offer the products, programs, services or features discussed herein in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the products, programs, services, and features available in your area. Any reference to an IBM product, program, service or feature is not intended to state or imply that only IBM's product, program, service or feature may be used. Any functionally equivalent product, program, service or feature that does not infringe on any of IBM's intellectual property rights may be used instead of the IBM product, program, service or feature.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these products, published announcement material or other publicly available sources. Sources for non-IBM list prices and performance numbers are taken from publicly available information including D.H. Brown, vendor announcements, vendor WWW Home Pages, SPEC Home Page, GPC (Graphics Processing Council) Home Page and TPC (Transaction Processing Performance Council) Home Page. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of a specific Statement of General Direction.

The information contained in this presentation has not been submitted to any formal IBM test and is distributed "AS IS". While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

IBM is not responsible for printing errors in this presentation that result in pricing or information inaccuracies.

The information contained in this presentation represents the current views of IBM on the issues discussed as of the date of publication. IBM cannot guarantee the accuracy of any information presented after the date of publication.

IBM products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this presentation was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this presentation may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this presentation may have been estimated through extrapolation. Actual results may vary. Users of this presentation should verify the applicable data for their specific environment.

Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States and/or other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.

LINUX is a registered trademark of Linus Torvalds. Intel and Pentium are registered trademarks and MMX, Itanium, Pentium II Xeon and Pentium III Xeon are trademarks of Intel Corporation in the United States and/or other countries.

Other company, product and service names may be trademarks or service marks of others.

Special Notices from IBM Legal


GPFS has matured significantly over the years since its version 1.x releases for AIX, yet many HPC practitioners do not fully realize GPFS's flexibility and ease of use today. This presentation examines these new features, including multi-clustering, GPFS in a mixed AIX/Linux environment, its ability to work with a wider variety of disk vendors, and other new capabilities. Sample configurations with benchmark results will be given. The presentation will include a cursory review of the GPFS roadmap.

Abstract


Some say people have evolved...


Some say people have evolved into something intelligent?!?!


But what about HPC systems?

How have they evolved?

A Common HPC Evolutionary Path

Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)

After n steps...

Proprietary Clusters: room full of SPs
Proprietary SMP Clusters: room full of IBM p690s

Then some renegade starts experimenting with Beowulf clusters... starting like this... then evolving into this...

Beowulf cluster
Rack-mounted Linux Nodes: IBM Cluster 1350
Blades (using Linux): IBM BladeCenter

What Next?

[Diagram: multiple sites, each with its own SAN and cluster nodes (rack-mounted Linux nodes / IBM Cluster 1350, blades / IBM BladeCenter, rooms full of IBM p690s), joined by a global SAN interconnect; Cluster 1 nodes use local disk access to the Cluster 1 file system, while Cluster 2 nodes and a visualization system use remote disk access.]

... and everybody is talking about grids!


But Where Does Storage I/O Fit?

Storage I/O... the oft-forgotten stepchild

Early adopters of proprietary clusters (e.g., IBM SP) generally adopted vendor storage solutions (e.g., SSA and GPFS or JFS)

GPFS is NOT the same beast it used to be!

Early adopters of commodity clusters approached storage I/O with a potpourri of approaches (e.g., NFS)

There are alternatives to NFS

Customers trying to integrate proprietary and commodity systems often feel forced to use NFS

There are alternatives to NFS

And what about grids?


Let's take a closer look at this.

I will begin with the Linux cluster perspective.

I will get to the SP-to-pSeries perspective in a moment.

He who does not study history is predestined to relive it... errr, but is NFS really history?


Common First Step
For Something Small

For a configuration like this, it is common to use NFS and FTP between servers over a GbE or 100 MbE network.

sam: IntelliStation M Pro, 2 CPU, 4 GB, Linux 2.4.21 (SuSE); local disk sda (SCSI); local /fs_sam, NFS mounts /fs_frodo and /fs_gandalf
frodo: p615, 2 CPU, 4 GB, AIX 5.2; local disks hdisk0..hdisk3 (SCSI); local /fs_frodo, NFS mounts /fs_sam and /fs_gandalf
gandalf: p615, 2 CPU, 4 GB, AIX 5.2; local disks hdisk0..hdisk3 (SCSI); local /fs_gandalf, NFS mounts /fs_sam and /fs_frodo
All three servers are connected by Ethernet (GbE).
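
To make the cross-mounting concrete, here is a minimal sketch of the Linux side (sam), assuming standard NFS tools and illustrative paths/options; the AIX servers (frodo, gandalf) would export their file systems with the equivalent AIX commands.

  # On sam: export the local file system to the two p615s.
  # Hypothetical /etc/exports entry:
  #   /fs_sam   frodo(rw,sync)   gandalf(rw,sync)
  exportfs -a                         # publish the export list

  # Mount the file systems exported by frodo and gandalf.
  mkdir -p /fs_frodo /fs_gandalf
  mount -t nfs frodo:/fs_frodo     /fs_frodo
  mount -t nfs gandalf:/fs_gandalf /fs_gandalf

  # Bulk transfers between the servers are then done ad hoc with ftp (or scp).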


Common Second Step
Make the Small Solution BIGGER

"Client" Nodese.g., x336 (Linux)access to NFS mounted file system from head nodeinternal SCSI used for local scratch

FTP files as necessary between clients

Head Nodee.g., x346 (Linux) file system based on internal storage and NFS exported to client

Node SwitchGenerally, IP based network

GbEMyrinetHPSIB?

Common Application OrganizationUse IP network to distribute data via MPI, NFS and/or other "home-grown mid-layer" codes

works well for applications using minimal or no parallel I/Odo application developers (i.e., computational scientists) want to become computer scientists?

Node

Switch

Head NodeInternal SCSI or SATA

NFS export local file system

client nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient nodeclient node


Common Third Step
Create "Islands of Nodes" as You Get Even BIGGER

[Diagram: two "islands", each a head node with internal SCSI or SATA disk NFS-exporting its local file system to its own set of client nodes; the islands are connected through a higher-level switch.]

Higher-level switch (LAN or WAN): clusters are connected via a hierarchical switching network

COMMENTS
Provides "any to any" connectivity... the poor man's way
Works well when the I/O model is not parallel, but may require aggregating files
ISLs can be bottlenecks
Inadequacy of NFS semantics (especially for parallel writes)
Poor I/O performance
Limited storage capacity... can add more head nodes
Storage BW limited by the head node's switch adapter
Inconvenience of FTP or other utilities to manually move files
Common fabric for message passing and storage I/O
Naturally generalizes to a grid with all of its issues, but compounded by the variabilities of geographic separation!


Common Third Step
Create "Islands of Nodes" as You Get Even BIGGER

[Diagram: one island in which six NFS server nodes front SAN-attached storage (a disk controller and disk enclosures behind a SAN switch) for a set of client nodes, and a second island in which client nodes use iSCSI-based storage; the islands are connected through a higher-level switch (LAN or WAN) in a hierarchical switching network.]

More sophisticated storage systems can be adapted to work within this NFS/IP-based model over a WAN/grid

iSCSI-based systems
NFS not necessarily required
"plain vanilla" iSCSI can be used, but more sophisticated schemes are being investigated (e.g., file replication, Univ of Tokyo)

A SAN-based file system provides local I/O, but the local file systems are NFS-exported over the WAN
still must deal with NFS shortcomings

In this SAN example, the SAN-attached storage would typically be distributed over the 6 servers with 6 or more different NFS-exported file systems (at least 1 file system per NFS server).


The Final(?) Step
Global Storage with Parallel File System

Storage servers with external storage using the node switch

COMMENTS
homogeneous network
parallel FS facilitates the effective use of this architecture
disks are accessed via the LAN/WAN (as virtual disks)
performance scales linearly in the number of servers
increasing the number of servers will increase BW
can add capacity without increasing the number of servers
server switch adapters can become a bottleneck
can inexpensively scale out the number of nodes
largest GPFS cluster with a single file system: 2300 nodes
natural model for a grid-based file system

[Diagram: a large number of client nodes on a node switch (LAN or WAN); four storage nodes on the same switch attach via a SAN to a disk controller and its disk enclosures.]


The Final(?) Step
Another Global Storage Parallel File System Model

SAN-attached storage

COMMENTS
separate switch fabrics
parallel FS facilitates the effective use of this architecture
performance scales in the number of disk controllers
can add capacity without increasing the number of controllers
scaling out the number of direct-attach nodes is limited by the SAN... largest SAN cluster is 200+ nodes
scaling larger requires remote nodes accessing storage over the IP network
direct-attach nodes get better file system BW... BW is not restricted by server node switch adapters (typically, a FC HBA is faster than GbE... but does this change with IB?)
Allows greater aggregate BW... e.g., 15 GB/s on 40 nodes
SAN works well in a processing center
Use a LAN/WAN to scale out beyond SAN limits

[Diagram: a large number of client nodes attached both to a node switch (LAN or WAN) and to a SAN switch; the SAN switch connects to a disk controller and its disk enclosures.]


In the early days of HPC clusters, there were limited choices for parallel/global file systems... and generally it was necessary to use the vendor's file system. Today there are other choices (at least 18 at my last count) that have been enabled by the development of Linux based clusters.

In order to more clearly understand how GPFS fits into this environment, the following pages discuss a coarse HPC storage architecture taxonomy covering the range of file systems used on HPC systems... this is a work in progress!


The following pages examine a taxonomy of storage I/O architectures commonly used in HPC systems. They support varying degrees of parallel I/O and do not represent mutually exclusive choices.

Conventional I/O

Asynchronous I/O

Network File Systems

Basic Parallel I/O: Single Component Architecture

Centralized Metadata Server with SAN Attached Disk: Dual Component Architecture

Recent Developments: Triple Component Architecture

High Level Parallel I/O

HPC Storage Architecture Taxonomy


Local file systems
Basic, "no frills, out of the box" file system
Journaling, extent-based semantics

journaling: logging information about operations performed on the file system metadata as atomic transactions. In the event of a system failure, the file system is restored to a consistent state by replaying the log and applying the log records for the appropriate transactions.
extent: a sequence of contiguous blocks allocated to a file as a unit, described by a triple <logical offset, length, physical address>

If they are a native FS, they are integrated into the OS (e.g., caching done via the VMM)

more favorable toward temporal than spatial locality

Intra-node process parallelism
Disk-level parallelism possible via striping
Not truly a parallel file system
Examples: Ext3, JFS2, XFS

Conventional I/O

COMMENT: GPFS cache (i.e., pagepool) is more favorable toward spatial than temporal locality. Very large pagepools (up to 8 GB using 64 bit OS) may do better with temporal locality.


Asynchronous I/O

Abstractions allowing multiple threads/tasks to safely and simultaneously access a common file
Built on top of a base file system
Parallelism is available if it is supported in the base file system
Part of POSIX 4, but not supported on all UNIX-based file systems (e.g., not in Linux 2.4, but Linux 2.6 now includes it?)
AIX, IRIX and Solaris support AIO


Network File System (NFS)

Disk access from remote nodes via network access (e.g., TCP/IP over Ethernet)
NFS is ubiquitous and the most common example
it is not truly parallel
old versions are not cache coherent (is V3 or V4 truly safe?)
safe writes require the O_SYNC and noac options (see the mount sketch after this list)
poorer performance for I/O-intensive HPC jobs
write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)
read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
uses the POSIX I/O API, but not its semantics
useful for on-line interactive access to smaller files
while NFS is not designed for general parallel file access on an HPC system, by placing restrictions on an application's storage I/O model some customers get "good enough" performance from it

COMMENT: enhancements have been proposed for NFS V4 under AIX that should improve NFS parallel writes.
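
As a concrete illustration of the "safe but slow" settings noted in the list above, a conservative NFS client mount on Linux looks roughly like the following; the option names are standard Linux NFS options, but defaults and spellings vary by platform and NFS version.

  # Disable attribute caching (noac) and force synchronous writes (sync);
  # vers/hard are shown only as typical companion options.
  mount -t nfs -o vers=3,hard,sync,noac server:/export /mnt/shared

  # Alternatively, the application itself can open files with O_SYNC
  # rather than relying on the sync mount option.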


Parallelizes file, metadata and control operations
single component architecture: does not require a distinction between metadata, storage and client nodes

POSIX I/O model with extensions
byte stream using an API with read(), write(), open(), close(), lseek(), stat(), etc.
extends the POSIX model to support safe parallel data access semantics
these options guarantee portability to other POSIX-based file systems for applications using the POSIX I/O API
generally has API extensions, but these compromise portability

Good performance for large-volume, I/O-intensive jobs
Works best for large-block, sequential access patterns, but vendors can add optimizations for other patterns
Examples: GPFS (IBM... best of class), GFS (Sistina/Red Hat)

Basic Parallel I/O
Single Component Architecture


Centralized Metadata Servers with SAN Attached Disk
Dual Component Architecture

Parallel user data semantics, but non-parallel metadata semantics

Supports the POSIX API, but with parallel data access semantics

Dual component architecture (storage client/server, metadata server)
Metadata maintained and accessed from a single common server
Failover features allow a backup metadata server to take over if the primary fails
Uses Ethernet (100 MbE or 1 GbE) for metadata access
Potential scaling bottleneck (but SANs already limit scaling); latency more than BW is the potential issue

All "disks" connected to all client nodes via the SAN
file data accessed via the SAN, not the node network
removes the need for an expensive node network (e.g., Myrinet)
inhibits scaling due to the cost of the FC switch tree (i.e., SAN)

Ideal for smaller numbers of nodes
SNFS advertises up to 50 clients (and can go as high as 100 nodes), and is capable of very high BW
on a very carefully configured/tuned p690, GPFS, JFS2 and SNFS all got 15 GB/s
CXFS scales only to 10-12 servers for some users, perhaps at most 30?
good enough for large SMPs?

Examples: CXFS (SGI), SNFS (ADIC), SanFS (IBM)
SanFS and SNFS place heavy emphasis on storage virtualization


Recent Developments
Triple Component Architecture

Lustre and Panasas are two recently developed HPC-style parallel file systems that began "from a clean sheet of paper," a design history that distinguishes them from the other file systems in this taxonomy. They have a number of architectural similarities.

Triple component architecture
storage clients, storage servers, metadata servers
file data access over the node network between storage clients and servers (e.g., GbE, Myrinet)

Object-oriented architecture
object-oriented disks are not generally available yet, so the current implementation is in SW and not fully generalized
the OO design is blind to the application (i.e., uses the POSIX API with parallel semantics)

Designed to facilitate storage management (i.e., "storage virtualization")

Focus on Linux/COTS environments


Higher Level Parallel I/O

High-level abstraction layer providing a parallel model
Built on top of a base file system (conventional or parallel)
MPI-IO is the ubiquitous model
parallel disk I/O extension to MPI in the MPI-2 standard
semantically richer API
portable

Requires significant source code modification for use in legacy codes, but it has the advantage of being a standard (e.g., syntactic portability)


Which Architecture is Best?

There is no concise answer to this question; it is application/customer specific.
All of them serve specific needs. All of them work well if properly deployed and used according to their design specs.
Issues to consider are
application requirements... often requires a compromise between competing needs
how the product implements a specific architecture


What Others Say About GPFS

Two recent papers comparing/contrasting parallel file systems.

Margo, Kovatch, Andrews, Banister. "An analysis of State-of-the-Art Parallel File Systems for Linux.", 5th International Conf on Linux Clusters: HPC Revolution 2004, Austin, TX, May 2004

Compared GPFS, Lustre, PVFS
Criteria: performance, system administration, redundancy, special features
"In both SAN and NSD modes, GPFS performed the best. It was also easy to install and had numerous redundancy and special features."

Cope, Oberg, Tufo, Woitaszek. "Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments.", 6th International Conf on Linux Clusters: HPC Revolution 2005, Chapel Hill, NC, April 2005

Compared GPFS, Lustre, NFS, PVFS2, TerraFS
Criteria: performance, usability, stability, special features
"Our experiences with GPFS were very positive, and we found it to be a very powerful filesystem with well documented administrative tools."


What is GPFS?

This question no longer has a simple answer.


GPFS in The "Good Old Days"
A Typical "Winterhawk" Config with GPFS 1.3

[Diagram: 36 nodes on an SP Switch: nodes 1-32 are thin compute clients, nodes 33-36 are wide VSD server nodes with SSA disk (hdisks 1..64) behind them.]

All nodes run AIX
Thin nodes are compute clients, wide nodes are VSD servers
GPFS packets transit the switch
The disk is SSA
Peak aggregate BW for this configuration: at most 440 MB/s
Took an experienced sysadmin a day or 2 to configure


... but GPFS can't do that!?!?!

Old ideas die hard. GPFS is far more versatile than it was in its early days. The following pages highlight many of these newer features.


GPFS Today
GPFS under Linux with GbE

[Diagram: a client system of 28 x345 nodes (x345-1..x345-28) connected by 1 GbE to an Ethernet switch; a file server system of four x345 NSD servers (x345-29..x345-32) connected through a Brocade 2109-F16 SAN to a FAStT900-2 controller with six EXP700 disk enclosures.]

Scaling out: actual GbE-based designs have been extended up to 1100 Linux nodes (e.g., Intel or Opteron) with GPFS 2.2; the current designed maximum for GPFS 2.3 is 2000+ nodes.

While not officially supported, 100 MbE can also be used among the client nodes instead of 1 GbE.

Benchmark results on storage nodes

I/O Benchmark (IBM)
command line: ./ibm_vg{w | r} /gpfs/xxx -nrec 4k -bsz 1m -pattern seq -ntasks 4
summary: write rate 491.4 MB/s*, read rate 533.0 MB/s

iozone
command line: ./iozone.206 -c -R -b output.xls -C -r 32k -s 1024m -i 0 -i 1 -i 2 -i 5 -i 6 -i 7 -i 8 -W -t 4 -+m hostlist.cfs2a
summary: Initial write 554.1 MB/s*, Rewrite 264.1 MB/s*, Read 526.7 MB/s, Re-read 533.6 MB/s, Stride read 31.3 MB/s, Random read 11.4 MB/s, Random mix 26.2 MB/s, Random write 54.0 MB/s

* write caching = on

Red Hat Linux 9.0 (kernel 2.4.24-st2), GPFS 2.2

Benchmark results on "client nodes"
I/O Benchmark (IBM), write caching off
BW constrained by the single GbE adapter on each NSD server
1 client: write = 92.3 MB/s, read = 111 MB/s
8 clients: write = 360 MB/s, read = 384 MB/s

Using Myrinet on 8 clients and 2 FAStT900s
write = 397 MB/s, read = 585 MB/s
A second FAStT900 was needed since peak read BW exceeded the ability of 1 FAStT900.


GPFS Today
Mixed AIX/Linux Config

[Diagram: a mixed cluster with 256 e325 Linux nodes and two p690s (each with RIO drawers) on a GbE switch; a p615 CSM server, two cluster management nodes, and a p615 TSM client/server; two 16-port SAN switches connecting FAStT900-1 and FAStT900-2 (each with EXP700 enclosures) and an LTO tape library (tape_1..tape_4); application and scheduling nodes e325-3..6; an existing user network.]

p690s run AIX, e325s run Linux
p690s are NSD servers and compute clients
e325s are compute clients
GPFS mounted on the TSM/HSM server
GPFS packets transit the GbE network
The disk is FC
Overall peak aggregate BW < 800+ MB/s
Peak aggregate BW on the e325s < 640 MB/s


GPFS Today
Mixed AIX/Linux Config

Benchmark config: 1 p690, 32 x335s

p690 only (write caching = off, pattern = sequential, bsz = 1 MB); all rates in MB/s
Nodes  Tasks/node  Natural W / R     Harmonic Agg W / R   Harmonic Mean W / R   File size (GB)
1      1           640.10 / 765.28   -                    -                     4
1      2           641.07 / 760.75   648.02 / 761.71      324.01 / 380.85       8
1      4           631.84 / 739.31   646.62 / 758.13      161.65 / 189.53       16
1      8           649.14 / 724.88   670.83 / 755.63      83.85 / 94.45         32
1      16          646.97 / 721.26   671.29 / 788.48      41.96 / 49.28         64

x335 only (write caching = off, pattern = sequential, bsz = 1 MB); all rates in MB/s
Nodes  Tasks/node  Natural W / R     Harmonic Agg W / R   Harmonic Mean W / R   File size (GB)
1      1           109.27 / 110.89   -                    -                     4
2      1           198.92 / 218.65   198.92 / 219.51      99.46 / 109.76        8
4      1           269.91 / 435.37   269.95 / 437.68      67.49 / 109.42        16
8      1           282.44 / 624.19   283.44 / 626.81      35.43 / 78.35         32
16     1           253.23 / 595.50   281.78 / 598.18      17.61 / 37.39         64
32     1           269.77 / 577.53   269.83 / 581.23      8.43 / 18.16          128

Mixed nodes (write caching = off, pattern = sequential, bsz = 1 MB, ntasks = 4); all rates in MB/s
Config          Nodes  Tasks/node  Natural W / R     Harmonic Agg W / R   x335+p690 Harmonic Agg W / R   File size (GB)
x335 only***    4      1           267.33 / 435.72   268.29 / 437.23      N/A / N/A                      16
p690 only       1      4           632.45 / 771.00   650.98 / 788.25      N/A / N/A                      16
x335 with p690  4      1           233.03 / 386.27   233.03 / 389.61      707.48* / 801.66*              24**
p690 with x335  1      4           470.22 / 407.79   477.30 / 412.81                                     32**

* Job times are nearly identical; therefore the iostat-measured rate was very close to the harmonic aggregate rate.
** Size of the combined files from each job; file sizes were adjusted so job times were approximately equal. Combined files for the write = 24 GB, combined files for the read = 32 GB.
*** x335 aggregate read rates were gated by the 4 GbE adapters at a little over 100 MB/s per adapter.

GPFS Today
GPFS in a BladeCenter - Standard Configuration

[Diagram: a BladeCenter chassis of HS20 blades (GPFS NSD clients) with GbE ports and FC ports 01-14; four x345 disk server systems (GPFS SAN nodes) attach through an optional Brocade 2109-F16 SAN switch to a DS4500 controller with EXP 710 disk enclosures.]

SYSTEM ANALYSIS
1. DS4500 - sustained peak performance < 540 MB/s
2. FC network - sustained peak performance < 600 MB/s
3. GbE network (adapter aggregate measured over all 4 x345s) - sustained peak performance < 360 MB/s
4. Aggregate x345 rate - sustained peak performance < 500 MB/s
5. Predicted aggregate HS20 rate - sustained peak performance < 360 MB/s

Comments
- HS20 performance is constrained by the limited GbE ports
- The rates for items 1-4 are based on benchmark tests
- The SAN switch is optional; using it may reduce load on the GbE network and reduce aggregate application disk I/O bandwidth

Lower Cost/Bandwidth Alternative
If less file access bandwidth is required or a lower-cost solution is required, the x345/FAStT900 system can be replaced with the following:
- 2 disk servers (x345), each with 1 GbE and 1 FC HBA
- 1 FAStT600 and 1 disk enclosure (EXP700)
- SAN switch is optional
- sustained peak performance < 200 MB/s

Global File System over Multiple HS20 Systems
This stand-alone system can be replicated N times. By routing HS20 and x345 GbE traffic through a switch, the NSD layer in GPFS will enable all blades to see all LUNs; i.e., multiple HS20 systems can all safely mount the same GPFS file system and performance will scale linearly.

WARNING: Do not connect a FC controller to the FC ports on the blade chassis... it's not supported.


GPFS Today
GPFS in a BladeCenter - Alternative Configuration

[Diagram: a BladeCenter with 14 blades (01-14), each with 2 internal IDE drives (28 IDE drives total); a private internal GbE network connects all blades, with external GbE ports to a GbE switch.]

BENCHMARK ANALYSIS
- A private internal GbE network connects all blades
- Each blade has a single GbE adapter: effective BW = 80 to 100 MB/s
- Baseline performance (read from a single local disk using ext2): application read rate = 30 MB/s
- Single-task GPFS performance: application read rate = 80 MB/s, 2.7X faster than the single-disk rate
- Further analysis
  assume the active task is on blade 01
  GPFS stripes over all 28 IDE drives in this configuration
  GPFS uses the GbE network for striping activity, therefore single-blade GPFS performance is limited to the GbE adapter BW (i.e., up to 100 MB/s)
  each blade has only one GbE adapter, used by GPFS and the general system; it acts as a disk server and GPFS client; single-task performance = 80 MB/s
  in a similar test on x345s, where there were separate GPFS clients and disk servers and 2 GbE adapters per node (one for GPFS and one for everything else), single-task performance = 100 MB/s
- Aggregate rate (1 task per blade): read rate = 560 MB/s
- Analysis
  each blade is acting as a disk server (as well as a client)
  since GPFS scales linearly in the number of disk servers, it yields good aggregate performance
  in a Winterhawk 2 system using SSA disk with each node acting as a GPFS client and disk server, the aggregate rate over 14 nodes was < 420 MB/s

NOTE: This is not a configuration recommended by GPFS since the disks are not twin-tailed... "but it works".


GPFS Today
SDSC/IBM StorCloud Architecture

[Diagram: 40 Linux/Intel nodes (intra-node network: GbE?), each with 3 FC HBAs and one connection through each of three 2 Gb FC switches (switch 01-03); 15 storage frames (Frame-01..Frame-15), each holding 4 FAStT600 controllers and 8 EXP700 enclosures, with 8 connections per frame; all disks are directly attached to the servers via the FC switches.]

Aggregate BW = 15 GB/s (sustained)
See http://www.sdsc.edu/Press/2004/11/111204_SC04.html


40 Linux Nodes

3 FC HBAs per Node

15 Storage frames

60 FAStT600s

2520 disks

240 LUNs

8+P

4 LUNs per FAStT600

73 GB/disk @ 15 Krpm

Sustained aggregate rate: 15 GB/s; 380 MB/s per node; 256 MB/s per FAStT600

GPFS Today
Selected SDSC/IBM StorCloud Statistics

COMMENT: IP vs FC
With today's technology, direct-attached disk models (e.g., SAN attached) can yield greater per-node BW than IP-based models.

IP-based systems are limited by the Ethernet adapter (e.g., 80 MB/s for 1 GbE, 120 MB/s for dual bonded GbE)
Direct-attached systems can have multiple FC HBAs (e.g., with 3 HBAs/node the BW is 380 MB/s)
Will 10 GbE change this? Will IB change this?


GPFS Today
Storage Pools

Motivation
Some newer file systems implement a concept called "storage pools"; GPFS supports a form of this.

Disks present themselves to GPFS as LUNs

GPFS can mount a FS on any subset of these LUNs
There is 1 storage pool per FS
Max: 32 file systems per GPFS cluster

Example
Monolithic disk architecture (e.g., SATA)
Access is "bursty"
To avoid striping over all disks and stressing all disks, divide the disks into 16 disjoint subsets and hence 16 file systems. File striping is confined to a file system. When a file system is not in use, GPFS is not spinning its disks. (n.b., all FSs are seen by all nodes in the GPFS cluster... up to 2000+ nodes)

Example
2 classes of disk: FC disk and SATA disk
1 FS for FC disk used for constant access
1 FS for SATA disk used infrequently
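
A hedged sketch of the second example above: building one file system over the FC LUNs and another over the SATA LUNs. The descriptor files, device names and block size are illustrative, and option spellings vary by GPFS release (check the mmcrnsd and mmcrfs documentation).

  # fc_disks.desc and sata_disks.desc are hypothetical disk descriptor files,
  # one descriptor line per LUN; mmcrnsd rewrites them with the NSD names.
  mmcrnsd -F fc_disks.desc
  mmcrnsd -F sata_disks.desc

  # One file system per disk class; striping is confined to each file system.
  mmcrfs /gpfs_fc   fs_fc   -F fc_disks.desc   -B 256K    # constant-access data
  mmcrfs /gpfs_sata fs_sata -F sata_disks.desc -B 256K    # infrequently used data

  # Both file systems are visible to every node in the cluster.
  mount /gpfs_fc
  mount /gpfs_sata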


GPFS Today
Storage Pools in a Mixed BladeCenter/pSeries Cluster

[Diagram: 12 p575 nodes (p575-1..p575-12) on a Federation switch and 4 BladeCenters (#1..#4) on a GbE switch, sharing a SAN switch that attaches a FAStT900 and a FAStT100.]


GPFS Today
Cross-cluster Mounts

Problem: nodes outside the cluster need access to GPFS files

Solution: allow nodes outside the cluster to mount the file system
"Owning" cluster is responsible for admin, managing locking, recovery, ...
Separately administered remote nodes have limited status
Can request locks and other metadata operations
Can do I/O to file system disks over a global SAN (IP, Fibre Channel, ...)
Are trusted to enforce access control, map user IDs, ...

Uses:
High-speed data ingestion, postprocessing (e.g., visualization)
Sharing data among clusters
Separate data and compute sites (Grid)
Forming multiple clusters into a "supercluster" for grand challenge problems

[Diagram: Cluster 1 nodes with local disk access to the Cluster 1 file system over the Site 1 SAN; Cluster 2 nodes (Site 2 SAN) and a visualization system (Site 3 SAN) access the same file system via remote disk access over a global SAN interconnect.]


GPFS Today
Cross-cluster Mounts -- Example

[Diagram: a home cluster (Cluster_A, file system /fsA) with compute nodes and NSD servers nsd_A1..nsd_A4 on an IP-based switch and a SAN switch; a remote cluster (Cluster_B, mounts /fsAonB) with compute nodes and NSD servers NSD_B1 and NSD_B2 on its own IP-based switch and SAN switch; the two clusters are joined by inter-switch links (at least GbE speed!).]

COMMENTS:
Cluster_B accesses /fsA from Cluster_A via the NSD nodes (see example on next page)
Cluster_B mounts /fsA locally as /fsAonB
OpenSSL (secure socket layer) provides secure access between the clusters

UID MAPPING EXAMPLE (i.e., Credential Mapping)
1. pass Cluster_B UID/GID(s) from the I/O thread node to mmuid2name
2. map the UID to GUN(s) (Globally Unique Name)
3. send the GUN(s) to mmname2uid on a node in Cluster_A
4. generate the corresponding Cluster_A UID/GID(s)
5. send the Cluster_A UID/GIDs back to the Cluster_B node running the I/O thread (for the duration of the I/O request)

Flow: UID/GID_B -> mmuid2name -> GUN -> mmname2uid -> UID/GID_A

COMMENTS: mmuid2name and mmname2uid are user-written scripts made available to all users in /var/mmfs/etc; these scripts are called ID remapping helper functions (IRHF) and implement access policies. Simple strategies (e.g., a text-based file with UID <-> GUN mappings) or 3rd-party packages (e.g., Globus Security Infrastructure from TeraGrid) can be used to implement the remapping procedures.

See http://www-1.ibm.com/servers/eserver/clusters/whitepapers/uid_gpfs.html for details.


GPFS Today
Cross-cluster Mounts -- Sysadm Commands

Mount a GPFS file system from Cluster_A onto Cluster_B (assume the diagram from the previous page).

On Cluster_A

1. Generate a public/private key pair
   mmauth genkey
   COMMENTS: creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!
2. Enable authorization
   mmchconfig cipherList=AUTHONLY
3. Sysadm gives the following file to Cluster_B: /var/mmfs/ssl/id_rsa.pub
   COMMENT: rename it as cluster_A.pub
7. Authorize Cluster_B to mount the FS owned by Cluster_A
   mmauth add cluster_B -k cluster_B.pub

On Cluster_B

4. Generate a public/private key pair
   mmauth genkey
   COMMENTS: creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!
5. Enable authorization
   mmchconfig cipherList=AUTHONLY
6. Sysadm gives the following file to Cluster_A: /var/mmfs/ssl/id_rsa.pub
   COMMENT: rename it as cluster_B.pub
8. Define the cluster name, contact nodes and public key for cluster_A
   mmremotecluster add cluster_A -n nsd_A1,nsd_A2,nsd_A3,nsd_A4 -k cluster_A.pub
9. Identify the FS to be accessed on cluster_A
   mmremotefs add /dev/fsAonB -f /dev/fsA -C Cluster_A -T /dev/fsAonB
10. Mount the FS locally
   mount /fsAonB

See http://publib.boulder.ibm.com/clresctr/docs/gpfs/gpfs23/200412/bl1adm10/bl1adm1031.html#admmcch for details.
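
Collected into execution order, the procedure above looks roughly like this (names as in the example; the key exchange between the two sysadmins is shown as comments because the copy mechanism is site specific):

  # --- On Cluster_A (owns /fsA) ---
  mmauth genkey                          # creates /var/mmfs/ssl/id_rsa.pub; start GPFS daemons after this
  mmchconfig cipherList=AUTHONLY
  #   ...send id_rsa.pub to Cluster_B's sysadm, renamed cluster_A.pub...
  mmauth add cluster_B -k cluster_B.pub  # after receiving Cluster_B's public key

  # --- On Cluster_B (mounts /fsA as /fsAonB) ---
  mmauth genkey                          # creates /var/mmfs/ssl/id_rsa.pub; start GPFS daemons after this
  mmchconfig cipherList=AUTHONLY
  #   ...send id_rsa.pub to Cluster_A's sysadm, renamed cluster_B.pub...
  mmremotecluster add cluster_A -n nsd_A1,nsd_A2,nsd_A3,nsd_A4 -k cluster_A.pub
  mmremotefs add /dev/fsAonB -f /dev/fsA -C Cluster_A -T /dev/fsAonB
  mount /fsAonB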


GPFS Today
GPFS is Easier to Administer

[Diagram of the test configuration:
sam: NSD client; IntelliStation M Pro, 2 CPU, 4 GB, Linux 2.4.20, GPFS 2.2; disk0 (SCSI)
frodo: NSD server; p615, 2 CPU, 4 GB, AIX 5.2, GPFS 2.2; hdisk0..hdisk3 (SCSI)
gandalf: p615, 2 CPU, 4 GB, AIX 5.2, GPFS 2.2; hdisk0..hdisk3 (SCSI)
SSA: 16 disks, 9 GB, 10 Krpm; pdisks 0..15 defined as NSDs
EXP300: 14 SCSI disks, 36 GB, 15 Krpm; pdisks 0..6 and 7..13
1 VG over hdisk 4..10 and 8 LVs over the VG on each p615; hdisk 11..26 defined as NSDs (NSD 11..26)
Ethernet (GbE) connects the nodes.]

To build the file system*, do the following on gandalf...
1. mmcrcluster
2. mmstartup
3. mmcrnsd
   specify primary, secondary, client, server nodes in the disk descriptor file
4. mmcrfs
5. mount /<FS name>

* GPFS 2.3

COMMENT: This could be a FAStT disk controller.

COMMENTS: Once the SAN zoning and low-level disk formats are complete, one can build GPFS in under 5 minutes on smaller systems. For StorCloud, it took ~30 minutes, but this was over a 135 TB file system (n.b., 240 LUNs or 2520 disks).
Other dynamic features... mmadddisk, mmaddnode, mmdeldisk, mmdelnode, mmchattr, mmchfs, mmchcluster, mmchconfig, mmchnsd, mmpmon
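
A hedged sketch of those five steps on the small cluster shown above; the input files are hypothetical and the exact option letters for mmcrcluster vary by GPFS release, so treat this as an outline rather than exact syntax.

  # nodes.list names the cluster nodes and their roles; disks.desc lists the
  # LUNs with their primary/secondary NSD server assignments.
  mmcrcluster -n nodes.list -p gandalf -s frodo   # 1. create the cluster (flags vary by release)
  mmstartup -a                                    # 2. start GPFS on all nodes
  mmcrnsd -F disks.desc                           # 3. create the NSDs; disks.desc is rewritten with NSD names
  mmcrfs /gpfs fs1 -F disks.desc                  # 4. create the file system over those NSDs
  mount /gpfs                                     # 5. mount it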


So what is GPFS?... in one line or less


So what is GPFS? It is IBM's shared disk, parallel file system for AIX and Linux clusters.


What is GPFS?

GPFS = General Parallel File System
IBM's shared disk, parallel file system for AIX and Linux clusters

Cluster: 2300+ nodes (tested), fast reliable communication, common admin domain

Shared disk: all data and metadata on disk accessible from any node through a disk I/O interface (i.e., "any to any" connectivity)

Parallel: data and metadata flow from all of the nodes to all of the disks in parallel

RAS: reliability, accessibility, serviceability

General: supports a wide range of HPC application needs over a wide range of configurations


GPFS Features

1. General Parallel File System: mature IBM product, generally available for 7 years
2. Clustered, shared-disk, parallel file system for AIX and Linux
3. Adaptable to many customer environments by supporting a wide range of basic configurations and disk technologies
4. Provides safe, high-BW access using the POSIX I/O API
5. Provides non-POSIX advanced features
   e.g., DMAPI, data-shipping, multiple access hints (also used by MPI-IO)
6. Provides good performance for large-volume, I/O-intensive jobs
7. Works best for large-record, sequential access patterns, has optimizations for other patterns (e.g., strided, backward)
8. Strong RAS features (reliability, accessibility, serviceability)
9. Converting to GPFS does not require application code changes, provided the code works in a POSIX-compatible environment


GPFS Features

GPFS Performance Features
1. striping
2. large blocks (with support for sub-blocks)
3. byte-range locking (rather than file or extent locking)
4. access pattern optimizations
5. file caching (i.e., pagepool) that optimizes streaming access
6. prefetch, write-behind
7. multi-threading
8. distributed management functions (e.g., metadata, tokens)
9. multi-pathing (i.e., multiple, independent paths to the same file data from anywhere in the cluster)


GPFS Features

GPFS provides many of its own RAS features and exploits RAS features provided by various subsystems
1. If a node providing GPFS management functions fails, an alternative node assumes responsibility, reducing the risk of losing the file system.
2. When using dedicated NSD servers with "twin-tailed" disks, specifying primary and secondary nodes lets the secondary node provide access to the disk if the primary node fails.
   WARNING: internal SCSI and IDE drives are not twin-tailed
3. In a SAN environment, failover reduces the risk of lost access to data.
4. GPFS on RAID architectures reduces the risk of lost data.
5. Online and dynamic system management allows file system modifications without bringing down the file system.
   mmadddisk, mmdeldisk, mmaddnode, mmdelnode, mmchconfig, mmchfs
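
As a hedged illustration of point 5, adding a disk to a mounted file system without taking it down might look like this; newdisk.desc is a hypothetical descriptor file and the rebalance flag should be checked against the mmadddisk documentation for your release.

  mmcrnsd -F newdisk.desc            # define the new NSD
  mmadddisk fs1 -F newdisk.desc -r   # add it to file system fs1 and rebalance existing data onto it
  mmdf fs1                           # the added capacity appears without unmounting the file system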


GPFS Features

Other Features
1. Disk scaling allowing large, single-instantiation global file systems (100's of TB now, PB in the future)
2. Node scaling (2300+ nodes) allowing large clusters and high BW (many GB/s)
3. Multi-cluster architecture (i.e., grid)
4. Journaling (logging) file system - logs information about operations performed on the file system metadata as atomic transactions that can be replayed
5. Data Management API (DMAPI) - industry-standard interface allows third-party applications (e.g., TSM) to implement hierarchical storage management


Why is GPFS needed?

Clustered applications impose new requirements on the file system

"Any to any" access: any node in the cluster has access to any data in the cluster

Parallel applications need access to the same data from multiple nodes

Serial applications dynamically assigned to processors based on load need high-performance access to their data from wherever they run

Require both good availability of data and normal file system semantics

Scalability to large numbers of nodes

GPFS supports this via:

Uniform access – single-system image across cluster

Conventional Posix API – no program modification

High capacity – large files, 100TB + file system

High throughput – wide striping, large blocks, many GB/sec to one file

Parallel data and metadata access – shared disk and distributed locking

Reliability and fault-tolerance - node and disk failures

Online system management – dynamic configuration and monitoring


Customer applications that require fast, scalable access to large amounts of file data. These applications may be serial or parallel, reading or writing.

Applications that serve data to visualization engines
Seismic data acquisition processing for serial or parallel reading/writing of files

Environments with very large data, especially when single file servers (such as NFS) reach capacity limits

Digital library file serving
Access to large CAD/CAM file sets
Data mining applications
Data cleansing applications preparing data for data warehouses
Oracle RAC

Applications requiring data rates which exceed what can be delivered by other file systems

Large aggregate scratch space for commercial or scientific applications
Internet serving of content to users with balanced performance

Applications with high availability (HA) file system requirements

Example GPFS Applications


Selected New Features in GPFS 2.3

Scale out to larger clusters (over 2000 nodes)
Current largest production cluster is 2300 blades

Bigger file systems
100's of TB

Bigger LUNs
There is no GPFS limitation; it is now an OS and disk limitation only

over 2 TB on AIX in 64 bit mode, up to 2 TB in "other supported" OS's

Depends less on disk protocols for many features (e.g., SCSI persistent reserve)

therefore GPFS is portable to a wider variety of disk hardware

No longer requires RSCT

Only one cluster type (n.b., no need to specify sp, rpd or hacmp)

Simpler quorum definition rules

GPFS-specific performance monitor (mmpmon)
Measures latency and bandwidth

GPFS is easier to administer and use
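
As a hedged example of that performance monitor (mmpmon): it reads simple requests from an input file and reports I/O counts and bandwidth per node and per file system. The request names and flags below follow the GPFS documentation of this era but may differ slightly by release.

  # Sample per-node (io_s) and per-file-system (fs_io_s) I/O statistics,
  # repeated 10 times at 5-second intervals.
  printf 'reset\nio_s\nfs_io_s\n' > pmon.in
  mmpmon -i pmon.in -r 10 -d 5000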