Page 1: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Scalable Performance of the Panasas Parallel File System

Brent Welch, Director of Software Architecture, Panasas, Inc.

NSC 08

Page 2: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Scalable Performance of the Panasas Parallel File System

Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson*, Brian Mueller, Jason Small, Jim Zelenka, Bin Zhou (Panasas Inc.; *Carnegie Mellon and Panasas Inc.)

USENIX FAST 08 Conference

Page 3: Scalable Performance of the Panasas Parallel File System


Outline

Panasas Background, Hardware and Software

Per-File, Client Driven RAID

Declustering and Scalable Rebuild

Metadata management and performance

Page 4: Scalable Performance of the Panasas Parallel File System


Panasas Company Overview

Founded: 1999 by Prof. Garth Gibson, co-inventor of RAID
Technology: parallel file system and parallel storage appliance
Locations: US: HQ in Fremont, CA, USA; R&D centers in Pittsburgh & Minneapolis; EMEA: UK, DE, FR, IT, ES, BE, Russia; APAC: China, Japan, Korea, India, Australia
Customers: FCS October 2003, deployed at 200+ customers
Market focus: Energy, Academia, Government, Life Sciences, Manufacturing, Finance
Alliances: ISVs, resellers
Primary investors

Page 5: Scalable Performance of the Panasas Parallel File System


Accelerating Enterprise Parallel Storage Adoption

Page 6: Scalable Performance of the Panasas Parallel File System


Panasas Architecture

Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth

Scalable performance with commodity parts provides excellent price/performance

Object-based storage provides additional scalability and security advantages over block-based SAN file systems

Automatic management of storage resources to balance load across the cluster

Shared file system (POSIX) with the advantages of NAS and the direct-to-storage performance of DAS and SAN

[Diagram: disk, CPU, memory, and network scale together as blades are added.]

Page 7: Scalable Performance of the Panasas Parallel File System


Panasas Blade Hardware

DirectorBlade and StorageBlade modules

Integrated GE switch

Shelf front: 1 DirectorBlade, 10 StorageBlades

Shelf rear: midplane routes GE and power; battery module (2 power units)

Page 8: Scalable Performance of the Panasas Parallel File System


Panasas Product Advantages

Proven implementation with appliance-like ease of use/deployment

Running mission-critical workloads at global F500 companies

Scalable performance with Object-based RAID

No degradation as the storage system scales in size

Unmatched RAID rebuild rates – parallel reconstruction

Unique data integrity features

Vertical parity on drives to mitigate media errors and silent corruptions

Per-file RAID provides scalable rebuild and per-file fault isolation

Network verified parity for end-to-end data verification at the client

Scalable system size with integrated cluster management

Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers

Simultaneous access from over 12000 servers

Page 9: Scalable Performance of the Panasas Parallel File System


Internal cluster management makes a large collection of blades work as a single system

Out-of-band architecture with direct, parallel paths from clients to storage nodes

[Diagram: compute nodes (up to 12,000) run the PanFS client or mount via NFS/CIFS; manager nodes (100+) run the SysMgr and NFS/CIFS services and are reached over RPC; storage nodes (1,000+) run OSDFS and are accessed via iSCSI/OSD.]
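To make the out-of-band data path concrete, here is a minimal Python sketch (all class, method, and object names are hypothetical, not the PanFS client API): the client fetches a file's map from a manager node once, then moves data to and from the storage nodes directly and in parallel.

from concurrent.futures import ThreadPoolExecutor

class MetadataManager:
    def get_map(self, path):
        # One control RPC returns the file's map: which storage nodes hold
        # which component objects (the values here are made up).
        return [("osd-01", 0x1A), ("osd-02", 0x2B), ("osd-03", 0x3C)]

class StorageNode:
    def read_object(self, node, obj_id, offset, length):
        # Placeholder for an iSCSI/OSD READ sent straight to one StorageBlade.
        return b"\x00" * length

def parallel_read(path, offset=0, stripe_unit=64 * 1024):
    layout = MetadataManager().get_map(path)       # out-of-band metadata request
    osd = StorageNode()
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        parts = list(pool.map(
            lambda loc: osd.read_object(loc[0], loc[1], offset, stripe_unit),
            layout))                               # data flows client <-> OSDs in parallel
    return b"".join(parts)

print(len(parallel_read("/panfs/sysa/delta/file2")))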

Page 10: Scalable Performance of the Panasas Parallel File System


Proven Panasas Scalability

Storage Cluster Sizes Today (e.g.)

Boeing: 50 DirectorBlades and 500 StorageBlades in one system, plus 25 DirectorBlades and 250 StorageBlades in each of two other, smaller systems

LANL Roadrunner: 100 DirectorBlades and 1,000 StorageBlades in one system today, with plans to grow to 144 shelves next year

Intel has 5,000 active DirectFLOW clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000-client configuration of release 2.3 and will deploy "lots" of compute nodes against release 3.2 later this year.

BP uses 200-StorageBlade storage pools as its building block

LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades

Most customers run systems in the 100 to 200 blade size range

Page 11: Scalable Performance of the Panasas Parallel File System


Linear Performance Scaling

Breakthrough data throughput AND random I/O

Performance and scalability for all workloads

Page 12: Scalable Performance of the Panasas Parallel File System


Scaling the system

Scale the system and clients at the same time (N-to-N IOzone)

IOZone Read and Write Sequential I/O Performance (4MB Block Size)

[Chart: aggregate bandwidth (MB/sec, 0-4,500) vs. number of StorageBlades/clients at 10/8, 40/32, 80/64, and 120/96; separate write and read curves; measured on the Thunderbird cluster.]

Page 13: Scalable Performance of the Panasas Parallel File System


Scaling Clients

IOzone Multi-Shelf Performance Test (4GB sequential write/read)

[Chart: MB/sec (0-3,000) vs. number of clients (10 to 200); write and read curves for 1-, 2-, 4-, and 8-shelf configurations.]

Page 14: Scalable Performance of the Panasas Parallel File System


IOR Segmented IO

IOR -a POSIX -C -i 3 -t 4M -b $num_clients

[Chart: MB/sec (0-1,000) vs. number of clients (0 to 36); curves for shared-file read, shared-file write, separate-file read, and separate-file write.]

Page 15: Scalable Performance of the Panasas Parallel File System


Panasas Parallel Storage Outperforms Clustered NFS

Paradigm GeoDepth Prestack Migration, run time in minutes (lower is better):

NFS, 4 shelves: 7 hours 17 mins (avg. read BW 300 MB/s)
DirectFLOW, 1 shelf: 5 hours 16 mins (avg. read BW 350 MB/s)
DirectFLOW, 4 shelves: 2 hours 51 mins (avg. read BW 650 MB/s), 2.5x faster (less time) than NFS

Source: Paradigm & Panasas, February 2007

Page 16: Scalable Performance of the Panasas Parallel File System


FLUENT Comparison of PanFS vs. NFS on University of Cambridge Cluster

Truck aero, 111M cells. Time of solver + data file write, in seconds (lower is better):

64 cores: PanFS (FLUENT 12) 2541 vs. NFS (FLUENT 6.3) 4568
128 cores: PanFS 1318 vs. NFS 2680
256 cores: PanFS 779 vs. NFS 1790

Scaling as the core count doubles: PanFS 1.9x (64 to 128) and 1.7x (128 to 256); NFS 1.7x and 1.5x.

NOTE: Read times are not included in these results

Page 17: Scalable Performance of the Panasas Parallel File System


Details of the FLUENT 111M Cell Model

Number of cells: 111,091,452
Solver: PBNS, DES, unsteady
Iterations: 5 time steps, 100 total iterations; data save after the last iteration
Output size:
FLUENT v6.3 (serial I/O; size of .dat file): 14,808 MB
FLUENT v12 (serial I/O; size of .dat file): 16,145 MB
FLUENT v12 (parallel I/O; size of .pdat file): 19,683 MB

Unsteady external aero for a 111M cell truck; 5 time steps with 100 iterations and a single .dat file write

University of Cambridge DARWIN cluster: http://www.hpc.cam.ac.uk
Vendor: Dell; 585 nodes; 2,340 cores; 8 GB per node; 4.6 TB total memory
CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache
Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches
File system: Panasas PanFS, 4 shelves, 20 TB capacity
Operating system: Scientific Linux CERN SLC release 4.6

Page 18: Scalable Performance of the Panasas Parallel File System


Automatic per-file RAID

System assigns RAID level based on file size

<= 64 KB: RAID 1 for efficient space allocation

> 64 KB: RAID 5 for optimum system performance

> 1 GB: two-level RAID 5 for scalable performance

RAID-1 and RAID-10 for optimized small writes

Automatic transition from RAID 1 to 5 without re-striping

Programmatic control for application-specific layout optimizations

Create with layout hint

Inherit layout from parent directory

[Diagram: small file: RAID 1 mirroring; large file: RAID 5 striping; very large file: two-level RAID 5 striping.]

Clients are responsible for writing data and its parity
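A minimal Python sketch of the two ideas on this slide, with the size thresholds taken from the bullets above; it is an illustration of the policy and of client-computed parity, not Panasas code.

def choose_layout(file_size_bytes):
    if file_size_bytes <= 64 * 1024:
        return "RAID 1 mirror"          # small files: two mirrored component objects
    elif file_size_bytes <= 1 << 30:
        return "RAID 5 stripe"          # medium files: one parity stripe
    else:
        return "two-level RAID 5"       # very large files: stripes of stripes

def raid5_parity(data_units):
    # The client XORs the data units of a stripe to produce the parity unit
    # it writes alongside the data.
    parity = bytearray(len(data_units[0]))
    for unit in data_units:
        for i, byte in enumerate(unit):
            parity[i] ^= byte
    return bytes(parity)

assert raid5_parity([b"\x01" * 8, b"\x02" * 8, b"\x04" * 8]) == b"\x07" * 8
print(choose_layout(10_000), "|", choose_layout(5 << 20), "|", choose_layout(2 << 30))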

Page 19: Scalable Performance of the Panasas Parallel File System



Declustered RAID

Files are striped across component objects on different StorageBlades

Component objects include file data and file parity for reconstruction

File attributes are replicated on two of the component objects

Declustered, randomized placement distributes RAID workload

[Diagram: component objects scattered across a 2-shelf BladeSet, in mirrored or 9-OSD parity stripes. When one OSD fails: read about half of each surviving OSD, write a little to each OSD. Scales linearly.]

Page 20: Scalable Performance of the Panasas Parallel File System


Scalable RAID Rebuild

Shorter repair time in larger storage pools

Customers report 30-minute rebuilds for 800 GB in a 40+ shelf BladeSet

Variability at 12 shelves due to uneven utilization of DirectorBlade modules

Larger numbers of smaller files rebuilt faster

Reduced rebuild rate at 8 and 10 shelves because of the wider parity stripe

Rebuild bandwidth is the rate at which data is regenerated (writes)

Overall system throughput is N times higher because of the necessary reads

Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel

Declustering spreads disk I/O over more disk arms (StorageBlades)

[Chart: rebuild MB/sec (0-140) vs. number of shelves (0-14); curves for one volume with 1 GB files, one volume with 100 MB files, N volumes with 1 GB files, and N volumes with 100 MB files.]

Page 21: Scalable Performance of the Panasas Parallel File System


RAID Rebuild vs Stripe Width

The Panasas system automatically selects a stripe width, up to 11 wide

8 to 11 wide is best for bandwidth performance

The system packs an even number of stripes into the BladeSet, leaving at least one spare

Narrower stripes rebuild faster

Less data to read for each reconstruction write

More DirectorBlades helps

1, 2, or 3 per shelf

50+ in a single system

[Chart: rebuild MB/sec (0-160) vs. RAID-5 stripe configuration (1+1 through 7+1), for 4 managers + 18 StorageBlades and for 3 managers + 8 StorageBlades.]

Page 22: Scalable Performance of the Panasas Parallel File System


Scalable rebuild is mandatory

Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time

Larger storage pools must mitigate their risk by decreasing repair times

The math says: if (for example) 100 drives are organized as 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, then the risk is the same as with a single RAID set of 100 drives whose rebuild time is N/10.

Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read

Only declustering provides scalable rebuild rates

[Figure: risk grows with the total number of drives, the number of drives per RAID set, and the repair time.]
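A worked version of this argument, using a deliberately simplified Python model (the drive failure rate and the repair time are assumptions, and failures are treated as independent):

def data_loss_rate(total_drives, drives_per_set, repair_hours, lam=1e-5):
    # Rate of a first failure in a RAID set, times the chance a second drive in
    # that set fails during the repair, summed over all sets in the pool.
    sets = total_drives // drives_per_set
    first_failure_rate = drives_per_set * lam
    second_during_repair = (drives_per_set - 1) * lam * repair_hours
    return sets * first_failure_rate * second_during_repair

N = 24  # an assumed repair time, in hours
print(data_loss_rate(100, 10, N))        # ten 10-drive sets, each repaired in N hours
print(data_loss_rate(100, 100, N / 10))  # one 100-drive set repaired in N/10 hours
# The two rates come out roughly equal, which is the slide's point: a bigger
# pool is only as safe as the smaller ones if repair gets proportionally faster.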

Page 23: Scalable Performance of the Panasas Parallel File System


Creating a File in 2 milliseconds

[Diagram: the client sends a Create to the metadata server (steps 1-2); the server records it in its op log and cap log, creates the component objects on the OSDs, and saves the result in its reply cache (steps 3-5); it replies to the client (step 6); the client then writes data directly to the OSDs, which journal the update in a txn_log (steps 7-8). The server's logs live in NVRAM and are replicated to a backup in about 90 usec over 1GE.]

Directories are objects with lists of name to object ID and location hint mappings
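A rough Python sketch of the journaled create path in the diagram (the class names, transaction id, and path are illustrative, not the PanFS implementation): the server makes the create durable in replicated NVRAM logs and answers immediately, deferring the on-disk directory update.

class BackupManager:
    def __init__(self):
        self.nvram_log = []
    def mirror(self, entry):
        # Stand-in for replicating a log entry to the backup's NVRAM (~90 usec over GE).
        self.nvram_log.append(entry)

class MetadataServer:
    def __init__(self, backup):
        self.oplog, self.reply_cache, self.backup = [], {}, backup
    def create(self, txn_id, path):
        entry = ("create", txn_id, path)
        self.oplog.append(entry)          # record the intent in the local op log
        self.backup.mirror(entry)         # make it durable on the backup before replying
        self.reply_cache[txn_id] = "ok"   # so a retransmitted create is not redone
        return "ok"                       # reply; the directory object is updated later

mds = MetadataServer(BackupManager())
print(mds.create(41, "/panfs/sysa/delta/file2"))
# The client then writes the file's component objects directly to the OSDs.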

Page 24: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: create and utime operations/sec (0-4,000) for DB-100 DirectFLOW, DB-100a DirectFLOW, DB-100 NFS, DB-100a NFS, and Linux NFS.]

Page 25: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: utime and create operations/sec (0-2,500) vs. number of clients (1-15).]

Page 26: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: operations/sec (0-50,000) vs. number of clients (1-15).]

Reported rate is speed of slowest client times number of clients, which is an MPI metric

(Cache hits in the client that created all the files)
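For concreteness, a short Python version of that aggregate metric (the per-client rates below are made-up numbers):

def aggregate_metarate(per_client_ops_per_sec):
    # MPI-style reporting: charge every client the rate of the slowest one.
    return min(per_client_ops_per_sec) * len(per_client_ops_per_sec)

print(aggregate_metarate([3300, 3100, 2900, 3200]))   # 4 clients -> 11600 ops/sec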

Page 27: Scalable Performance of the Panasas Parallel File System


MPI Coordinated File Operations

Panasas mdtest Performance

[Chart: operations/second (0-10,000) for Create File, Stat File, Remove File, Create Dir, Stat Dir, and Remove Dir, comparing unique-directory, single-process, and shared-directory runs.]

mpirun -n 64 mdtest -d $dir -n 100 -i 3 -N 1 -v -u

Page 28: Scalable Performance of the Panasas Parallel File System


Summary

Per-file, object-based RAID gives scalable on-line performance

Offloads the metadata server

Parallel block allocation among the storage nodes

Declustered parity group placement yields linear increase in rebuild rates with the size of the storage pool

May become the only way to effectively handle large capacity drives

Metadata is stored as attributes on objects

File create is complex, but made fast with efficient journal implementation

Coarse-grained metadata workload distribution is a simple way to scale

Page 29: Scalable Performance of the Panasas Parallel File System


Technology Review

Turn-key deployment and automatic resource configuration

Scalable Object RAID

Very fast RAID rebuild

Vertical Parity to trap silent corruptions

Network parity for end-to-end data verification

Distributed system platform with quorum-based fault tolerance

Coarse-grained metadata clustering

Metadata fail over

Automatic capacity load leveling

Storage Clusters scaling to ~1000 nodes today

Compute clusters scaling to 12,000 nodes today

Blade-based hardware with 1Gb/sec building block

Bigger building block going forward

Page 30: Scalable Performance of the Panasas Parallel File System


The pNFS Standard

The pNFS standard defines the NFSv4.1 protocol extensions between the server and client

The I/O protocol between the client and storage is specified elsewhere, for example:

SCSI Block Commands (SBC) over Fibre Channel (FC)

SCSI Object-based Storage Device (OSD) over iSCSI

Network File System (NFS)

The control protocol between the server and storage devices is also specified elsewhere, for example:

SCSI Object-based Storage Device (OSD) over iSCSI

[Diagram: client, storage, and NFSv4.1 server, with the pNFS, I/O, and control protocols between them.]
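As a conceptual Python sketch only (the class, method, and field names are illustrative, not the NFSv4.1 wire operations): the client asks the NFSv4.1 server for a layout, then performs I/O through whichever data protocol the layout names.

class NFSv41Server:
    def layoutget(self, path):
        # A layout tells the client which data servers hold the file and which
        # I/O protocol to speak to them (the values here are invented).
        return {"layout_type": "object (OSD over iSCSI)",
                "data_servers": ["osd-01", "osd-02"],
                "stripe_unit": 64 * 1024}

def pnfs_read(path, offset, length):
    layout = NFSv41Server().layoutget(path)            # metadata path: NFSv4.1
    print(f"read {length} bytes at {offset} via {layout['layout_type']} "
          f"from {layout['data_servers']}")            # data path: SBC, OSD, or NFS

pnfs_read("/export/data/file1", 0, 1 << 20)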

Page 31: Scalable Performance of the Panasas Parallel File System


Key pNFS Participants

Panasas (Objects)

Network Appliance (Files over NFSv4)

IBM (Files, based on GPFS)

EMC (Blocks, HighRoad MPFSi)

Sun (Files over NFSv4)

U of Michigan/CITI (Files over PVFS2)

Page 32: Scalable Performance of the Panasas Parallel File System


pNFS Status

pNFS is part of the IETF NFSv4 minor version 1 standard draft

Working group is passing draft up to IETF area directors, expect RFC later in ’08

Prototype interoperability continues: San Jose Connect-a-thon March ’06, February ’07, May ‘08

Ann Arbor NFS Bake-a-thon September ’06, October ’07

Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)

Availability: TBD, gated behind NFSv4 adoption and working implementations of pNFS

Patch sets to be submitted to Linux NFS maintainer starting “soon”

Vendor announcements in 2008

Early adopters in 2009

Production ready in 2010

Page 33: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Questions?

Thank you for your time!

Page 34: Scalable Performance of the Panasas Parallel File System


Panasas Global Storage Model

[Diagram: client nodes on a TCP/IP network mount two Panasas systems by DNS name. Panasas System A has BladeSet 1 (volumes VolX, VolY, VolZ) and BladeSet 2 (volumes delta, home); Panasas System B has BladeSet 3 (volumes VolM, VolN, VolL). A BladeSet is a physical storage pool; a volume is a logical quota tree. Example paths: /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.]
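A small illustrative Python parser for the namespace shown above, /panfs/<system DNS name>/<volume>/<path>, where each volume (a quota tree) lives in one BladeSet (a physical storage pool); this is a sketch of the naming model, not Panasas code.

def resolve(panfs_path):
    # Split "/panfs/<system>/<volume>/<rest...>" into its parts.
    _, panfs, system, volume, *rest = panfs_path.split("/")
    assert panfs == "panfs"
    return {"system": system, "volume": volume, "relative_path": "/".join(rest)}

print(resolve("/panfs/sysa/delta/file2"))
print(resolve("/panfs/sysb/volm/proj38"))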

Page 35: Scalable Performance of the Panasas Parallel File System


IB and other network fabrics

Panasas is a TCP/IP, GE-based storage product

Universal deployment, Universal routability

Commodity price curve

Panasas customers use IB, Myrinet, Quadrics, …

Cluster interconnect du jour for performance, not necessarily cost

IO routers connect cluster fabric to GE backbone

Analogous to an “IO node”, but just does TCP/IP routing (no storage)

Robust connectivity through IP multipath routing

Scalable throughput at approx 650 MB/sec per IO router (PCIe class)

Working on a 1GB/sec IO router

IB-GE switching platforms

Cisco/Voltaire switch provides wire-speed bridging

Page 36: Scalable Performance of the Panasas Parallel File System


Multi-Cluster sharing: scalable BW with fail over

[Diagram: Panasas storage connects through layer-2 switches to the I/O nodes of compute clusters A, B, and C, to NFS, DNS, Kerberos (KRB), and archive servers, and to the site network; colors depict subnets.]

Page 37: Scalable Performance of the Panasas Parallel File System


New and Unique: Network Parity

Horizontal Parity

Vertical Parity

Network Parity

Extends parity capability across the data path to the client or server node

Enables End-to-End data integrity validation

Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission

Client either receives valid data or an error notification
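A hedged Python sketch of the end-to-end check (the function names are mine, not the product's): the client fetches the stripe's parity along with the data and verifies the XOR relationship before accepting the read.

from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def verified_read(data_units, parity_unit):
    # Recompute parity over the received data and compare it with the parity
    # read from storage; a mismatch means something in the path corrupted the data.
    if reduce(xor, data_units) != parity_unit:
        raise IOError("network parity mismatch: return an error, not bad data")
    return b"".join(data_units)

print(verified_read([b"\x01" * 4, b"\x02" * 4, b"\x04" * 4], b"\x07" * 4))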
