Page 1: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Scalable Performance of the Panasas Parallel File System

Brent Welch, Director of Software Architecture, Panasas, Inc.

NSC 08

Page 2: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Scalable Performance of the Panasas Parallel File System

Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson*, Brian Mueller, Jason Small, Jim Zelenka, Bin Zhou (Panasas Inc.; *Carnegie Mellon and Panasas Inc.)

USENIX FAST 08 Conference

Page 3: Scalable Performance of the Panasas Parallel File System


Outline

Panasas Background, Hardware and Software

Per-File, Client Driven RAID

Declustering and Scalable Rebuild

Metadata management and performance

Page 4: Scalable Performance of the Panasas Parallel File System


Panasas Company Overview

Founded: 1999 by Prof. Garth Gibson, co-inventor of RAID
Technology: parallel file system and parallel storage appliance
Locations: US: HQ in Fremont, CA, USA; R&D centers in Pittsburgh & Minneapolis; EMEA: UK, DE, FR, IT, ES, BE, Russia; APAC: China, Japan, Korea, India, Australia
Customers: FCS October 2003, deployed at 200+ customers
Market focus: Energy, Academia, Government, Life Sciences, Manufacturing, Finance
Alliances: ISVs, resellers
Primary investors

Page 5: Scalable Performance of the Panasas Parallel File System


Accelerating Enterprise Parallel Storage Adoption

Page 6: Scalable Performance of the Panasas Parallel File System


Panasas Architecture

Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth

Scalable performance with commodity parts provides excellent price/performance

Object-based storage provides additional scalability and security advantages over block-based SAN file systems

Automatic management of storage resources to balance load across the cluster

Shared file system (POSIX) with the advantages of NAS and the direct-to-storage performance of DAS and SAN

[Diagram: disk, CPU, memory, and network scale together as blades are added.]

Page 7: Scalable Performance of the Panasas Parallel File System


Panasas Blade Hardware

DirectorBlade and StorageBlade modules

Integrated GE switch

Shelf front: 1 DirectorBlade, 10 StorageBlades

Shelf rear: midplane routes GE and power; battery module (2 power units)

Page 8: Scalable Performance of the Panasas Parallel File System


Panasas Product Advantages

Proven implementation with appliance-like ease of use/deployment

Running mission-critical workloads at global F500 companies

Scalable performance with Object-based RAID

No degradation as the storage system scales in size

Unmatched RAID rebuild rates – parallel reconstruction

Unique data integrity features

Vertical parity on drives to mitigate media errors and silent corruptions

Per-file RAID provides scalable rebuild and per-file fault isolation

Network verified parity for end-to-end data verification at the client

Scalable system size with integrated cluster management

Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers

Simultaneous access from over 12000 servers

Page 9: Scalable Performance of the Panasas Parallel File System


Internal cluster management makes a large collection of blades work as a single system

Out-of-band architecture with direct, parallel paths from clients to storage nodes

[Diagram: compute nodes (up to 12,000) run the PanFS client or mount via NFS/CIFS; manager nodes (100+) run the SysMgr and NFS/CIFS services and are reached over RPC; storage nodes (1,000+) run OSDFS and are accessed via iSCSI/OSD.]
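To make the out-of-band data path concrete, here is a minimal Python sketch (all class, method, and object names are hypothetical, not the PanFS client API): the client fetches a file's map from a manager node once, then moves data to and from the storage nodes directly and in parallel.

from concurrent.futures import ThreadPoolExecutor

class MetadataManager:
    def get_map(self, path):
        # One control RPC returns the file's map: which storage nodes hold
        # which component objects (the values here are made up).
        return [("osd-01", 0x1A), ("osd-02", 0x2B), ("osd-03", 0x3C)]

class StorageNode:
    def read_object(self, node, obj_id, offset, length):
        # Placeholder for an iSCSI/OSD READ sent straight to one StorageBlade.
        return b"\x00" * length

def parallel_read(path, offset=0, stripe_unit=64 * 1024):
    layout = MetadataManager().get_map(path)       # out-of-band metadata request
    osd = StorageNode()
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        parts = list(pool.map(
            lambda loc: osd.read_object(loc[0], loc[1], offset, stripe_unit),
            layout))                               # data flows client <-> OSDs in parallel
    return b"".join(parts)

print(len(parallel_read("/panfs/sysa/delta/file2")))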

Page 10: Scalable Performance of the Panasas Parallel File System


Proven Panasas Scalability

Storage Cluster Sizes Today (e.g.)

Boeing: 50 DirectorBlades and 500 StorageBlades in one system, plus 25 DirectorBlades and 250 StorageBlades in each of two other, smaller systems

LANL Roadrunner: 100 DirectorBlades and 1,000 StorageBlades in one system today, with plans to grow to 144 shelves next year

Intel has 5,000 active DirectFLOW clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000-client configuration of release 2.3 and will deploy "lots" of compute nodes against release 3.2 later this year.

BP uses 200-StorageBlade storage pools as its building block

LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades

Most customers run systems in the 100 to 200 blade size range

Page 11: Scalable Performance of the Panasas Parallel File System


Linear Performance Scaling

Breakthrough data throughput AND random I/O

Performance and scalability for all workloads

Page 12: Scalable Performance of the Panasas Parallel File System


Scaling the system

Scale the system and clients at the same time (N-to-N IOzone)

IOZone Read and Write Sequential I/O Performance (4MB Block Size)

[Chart: aggregate bandwidth (MB/sec, 0-4,500) vs. number of StorageBlades/clients at 10/8, 40/32, 80/64, and 120/96; separate write and read curves; measured on the Thunderbird cluster.]

Page 13: Scalable Performance of the Panasas Parallel File System


Scaling Clients

IOzone Multi-Shelf Performance Test (4GB sequential write/read)

[Chart: MB/sec (0-3,000) vs. number of clients (10 to 200); write and read curves for 1-, 2-, 4-, and 8-shelf configurations.]

Page 14: Scalable Performance of the Panasas Parallel File System


IOR Segmented IO

IOR -a POSIX -C -i 3 -t 4M -b $num_clients

[Chart: MB/sec (0-1,000) vs. number of clients (0 to 36); curves for shared-file read, shared-file write, separate-file read, and separate-file write.]

Page 15: Scalable Performance of the Panasas Parallel File System


Panasas Parallel Storage Outperforms Clustered NFS

Paradigm GeoDepth Prestack Migration, run time in minutes (lower is better):

NFS, 4 shelves: 7 hours 17 mins (avg. read BW 300 MB/s)
DirectFLOW, 1 shelf: 5 hours 16 mins (avg. read BW 350 MB/s)
DirectFLOW, 4 shelves: 2 hours 51 mins (avg. read BW 650 MB/s), 2.5x faster (less time) than NFS

Source: Paradigm & Panasas, February 2007

Page 16: Scalable Performance of the Panasas Parallel File System


FLUENT Comparison of PanFS vs. NFS on University of Cambridge Cluster

Truck aero, 111M cells. Time of solver + data file write, in seconds (lower is better):

64 cores: PanFS (FLUENT 12) 2541 vs. NFS (FLUENT 6.3) 4568
128 cores: PanFS 1318 vs. NFS 2680
256 cores: PanFS 779 vs. NFS 1790

Scaling as the core count doubles: PanFS 1.9x (64 to 128) and 1.7x (128 to 256); NFS 1.7x and 1.5x.

NOTE: Read times are not included in these results

Page 17: Scalable Performance of the Panasas Parallel File System


Details of the FLUENT 111M Cell Model

Number of cells: 111,091,452
Solver: PBNS, DES, unsteady
Iterations: 5 time steps, 100 total iterations; data save after the last iteration
Output size:
FLUENT v6.3 (serial I/O; size of .dat file): 14,808 MB
FLUENT v12 (serial I/O; size of .dat file): 16,145 MB
FLUENT v12 (parallel I/O; size of .pdat file): 19,683 MB

Unsteady external aero for a 111M cell truck; 5 time steps with 100 iterations and a single .dat file write

University of Cambridge DARWIN cluster: http://www.hpc.cam.ac.uk
Vendor: Dell; 585 nodes; 2,340 cores; 8 GB per node; 4.6 TB total memory
CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache
Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches
File system: Panasas PanFS, 4 shelves, 20 TB capacity
Operating system: Scientific Linux CERN SLC release 4.6

Page 18: Scalable Performance of the Panasas Parallel File System


Automatic per-file RAID

System assigns RAID level based on file size

<= 64 KB: RAID 1 for efficient space allocation

> 64 KB: RAID 5 for optimum system performance

> 1 GB: two-level RAID 5 for scalable performance

RAID-1 and RAID-10 for optimized small writes

Automatic transition from RAID 1 to 5 without re-striping

Programmatic control for application-specific layout optimizations

Create with layout hint

Inherit layout from parent directory

[Diagram: small file: RAID 1 mirroring; large file: RAID 5 striping; very large file: two-level RAID 5 striping.]

Clients are responsible for writing data and its parity
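A minimal Python sketch of the two ideas on this slide, with the size thresholds taken from the bullets above; it is an illustration of the policy and of client-computed parity, not Panasas code.

def choose_layout(file_size_bytes):
    if file_size_bytes <= 64 * 1024:
        return "RAID 1 mirror"          # small files: two mirrored component objects
    elif file_size_bytes <= 1 << 30:
        return "RAID 5 stripe"          # medium files: one parity stripe
    else:
        return "two-level RAID 5"       # very large files: stripes of stripes

def raid5_parity(data_units):
    # The client XORs the data units of a stripe to produce the parity unit
    # it writes alongside the data.
    parity = bytearray(len(data_units[0]))
    for unit in data_units:
        for i, byte in enumerate(unit):
            parity[i] ^= byte
    return bytes(parity)

assert raid5_parity([b"\x01" * 8, b"\x02" * 8, b"\x04" * 8]) == b"\x07" * 8
print(choose_layout(10_000), "|", choose_layout(5 << 20), "|", choose_layout(2 << 30))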

Page 19: Scalable Performance of the Panasas Parallel File System



Declustered RAID

Files are striped across component objects on different StorageBlades

Component objects include file data and file parity for reconstruction

File attributes are replicated on two of the component objects

Declustered, randomized placement distributes RAID workload

[Diagram: component objects scattered across a 2-shelf BladeSet, in mirrored or 9-OSD parity stripes. When one OSD fails: read about half of each surviving OSD, write a little to each OSD. Scales linearly.]

Page 20: Scalable Performance of the Panasas Parallel File System


Scalable RAID Rebuild

Shorter repair time in larger storage pools

Customers report 30-minute rebuilds for 800 GB in a 40+ shelf BladeSet

Variability at 12 shelves due to uneven utilization of DirectorBlade modules

Larger numbers of smaller files rebuilt faster

Reduced rebuild rate at 8 and 10 shelves because of the wider parity stripe

Rebuild bandwidth is the rate at which data is regenerated (writes)

Overall system throughput is N times higher because of the necessary reads

Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel

Declustering spreads disk I/O over more disk arms (StorageBlades)

[Chart: rebuild MB/sec (0-140) vs. number of shelves (0-14); curves for one volume with 1 GB files, one volume with 100 MB files, N volumes with 1 GB files, and N volumes with 100 MB files.]

Page 21: Scalable Performance of the Panasas Parallel File System


RAID Rebuild vs Stripe Width

The Panasas system automatically selects a stripe width, up to 11 wide

8 to 11 wide is best for bandwidth performance

The system packs an even number of stripes into the BladeSet, leaving at least one spare

Narrower stripes rebuild faster

Less data to read for each reconstruction write

More DirectorBlades helps

1, 2, or 3 per shelf

50+ in a single system

[Chart: rebuild MB/sec (0-160) vs. RAID-5 stripe configuration (1+1 through 7+1), for 4 managers + 18 StorageBlades and for 3 managers + 8 StorageBlades.]

Page 22: Scalable Performance of the Panasas Parallel File System


Scalable rebuild is mandatory

Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time

Larger storage pools must mitigate their risk by decreasing repair times

The math says: if (for example) 100 drives are organized as 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, then the risk is the same as with a single RAID set of 100 drives whose rebuild time is N/10.

Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read

Only declustering provides scalable rebuild rates

[Figure: risk grows with the total number of drives, the number of drives per RAID set, and the repair time.]
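A worked version of this argument, using a deliberately simplified Python model (the drive failure rate and the repair time are assumptions, and failures are treated as independent):

def data_loss_rate(total_drives, drives_per_set, repair_hours, lam=1e-5):
    # Rate of a first failure in a RAID set, times the chance a second drive in
    # that set fails during the repair, summed over all sets in the pool.
    sets = total_drives // drives_per_set
    first_failure_rate = drives_per_set * lam
    second_during_repair = (drives_per_set - 1) * lam * repair_hours
    return sets * first_failure_rate * second_during_repair

N = 24  # an assumed repair time, in hours
print(data_loss_rate(100, 10, N))        # ten 10-drive sets, each repaired in N hours
print(data_loss_rate(100, 100, N / 10))  # one 100-drive set repaired in N/10 hours
# The two rates come out roughly equal, which is the slide's point: a bigger
# pool is only as safe as the smaller ones if repair gets proportionally faster.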

Page 23: Scalable Performance of the Panasas Parallel File System


Creating a File in 2 milliseconds

[Diagram: the client sends a Create to the metadata server (steps 1-2); the server records it in its op log and cap log, creates the component objects on the OSDs, and saves the result in its reply cache (steps 3-5); it replies to the client (step 6); the client then writes data directly to the OSDs, which journal the update in a txn_log (steps 7-8). The server's logs live in NVRAM and are replicated to a backup in about 90 usec over 1GE.]

Directories are objects with lists of name to object ID and location hint mappings
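A rough Python sketch of the journaled create path in the diagram (the class names, transaction id, and path are illustrative, not the PanFS implementation): the server makes the create durable in replicated NVRAM logs and answers immediately, deferring the on-disk directory update.

class BackupManager:
    def __init__(self):
        self.nvram_log = []
    def mirror(self, entry):
        # Stand-in for replicating a log entry to the backup's NVRAM (~90 usec over GE).
        self.nvram_log.append(entry)

class MetadataServer:
    def __init__(self, backup):
        self.oplog, self.reply_cache, self.backup = [], {}, backup
    def create(self, txn_id, path):
        entry = ("create", txn_id, path)
        self.oplog.append(entry)          # record the intent in the local op log
        self.backup.mirror(entry)         # make it durable on the backup before replying
        self.reply_cache[txn_id] = "ok"   # so a retransmitted create is not redone
        return "ok"                       # reply; the directory object is updated later

mds = MetadataServer(BackupManager())
print(mds.create(41, "/panfs/sysa/delta/file2"))
# The client then writes the file's component objects directly to the OSDs.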

Page 24: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: create and utime operations/sec (0-4,000) for DB-100 DirectFLOW, DB-100a DirectFLOW, DB-100 NFS, DB-100a NFS, and Linux NFS.]

Page 25: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: utime and create operations/sec (0-2,500) vs. number of clients (1-15).]

Page 26: Scalable Performance of the Panasas Parallel File System


Metarate operations/sec

[Chart: operations/sec (0-50,000) vs. number of clients (1-15).]

Reported rate is speed of slowest client times number of clients, which is an MPI metric

(Cache hits in the client that created all the files)
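For concreteness, a short Python version of that aggregate metric (the per-client rates below are made-up numbers):

def aggregate_metarate(per_client_ops_per_sec):
    # MPI-style reporting: charge every client the rate of the slowest one.
    return min(per_client_ops_per_sec) * len(per_client_ops_per_sec)

print(aggregate_metarate([3300, 3100, 2900, 3200]))   # 4 clients -> 11600 ops/sec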

Page 27: Scalable Performance of the Panasas Parallel File System


MPI Coordinated File Operations

Panasas mdtest Performance

[Chart: operations/second (0-10,000) for Create File, Stat File, Remove File, Create Dir, Stat Dir, and Remove Dir, comparing unique-directory, single-process, and shared-directory runs.]

mpirun -n 64 mdtest -d $dir -n 100 -i 3 -N 1 -v -u

Page 28: Scalable Performance of the Panasas Parallel File System


Summary

Per-file, object-based RAID gives scalable on-line performance

Offloads the metadata server

Parallel block allocation among the storage nodes

Declustered parity group placement yields linear increase in rebuild rates with the size of the storage pool

May become the only way to effectively handle large capacity drives

Metadata is stored as attributes on objects

File create is complex, but made fast with efficient journal implementation

Coarse-grained metadata workload distribution is a simple way to scale

Page 29: Scalable Performance of the Panasas Parallel File System


Technology Review

Turn-key deployment and automatic resource configuration

Scalable Object RAID

Very fast RAID rebuild

Vertical Parity to trap silent corruptions

Network parity for end-to-end data verification

Distributed system platform with quorum-based fault tolerance

Coarse-grained metadata clustering

Metadata fail over

Automatic capacity load leveling

Storage Clusters scaling to ~1000 nodes today

Compute clusters scaling to 12,000 nodes today

Blade-based hardware with 1Gb/sec building block

Bigger building block going forward

Page 30: Scalable Performance of the Panasas Parallel File System


The pNFS Standard

The pNFS standard defines the NFSv4.1 protocol extensions between the server and client

The I/O protocol between the client and storage is specified elsewhere, for example:

SCSI Block Commands (SBC) over Fibre Channel (FC)

SCSI Object-based Storage Device (OSD) over iSCSI

Network File System (NFS)

The control protocol between the server and storage devices is also specified elsewhere, for example:

SCSI Object-based Storage Device (OSD) over iSCSI

[Diagram: client, storage, and NFSv4.1 server, with the pNFS, I/O, and control protocols between them.]
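As a conceptual Python sketch only (the class, method, and field names are illustrative, not the NFSv4.1 wire operations): the client asks the NFSv4.1 server for a layout, then performs I/O through whichever data protocol the layout names.

class NFSv41Server:
    def layoutget(self, path):
        # A layout tells the client which data servers hold the file and which
        # I/O protocol to speak to them (the values here are invented).
        return {"layout_type": "object (OSD over iSCSI)",
                "data_servers": ["osd-01", "osd-02"],
                "stripe_unit": 64 * 1024}

def pnfs_read(path, offset, length):
    layout = NFSv41Server().layoutget(path)            # metadata path: NFSv4.1
    print(f"read {length} bytes at {offset} via {layout['layout_type']} "
          f"from {layout['data_servers']}")            # data path: SBC, OSD, or NFS

pnfs_read("/export/data/file1", 0, 1 << 20)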

Page 31: Scalable Performance of the Panasas Parallel File System


Key pNFS Participants

Panasas (Objects)

Network Appliance (Files over NFSv4)

IBM (Files, based on GPFS)

EMC (Blocks, HighRoad MPFSi)

Sun (Files over NFSv4)

U of Michigan/CITI (Files over PVFS2)

Page 32: Scalable Performance of the Panasas Parallel File System


pNFS Status

pNFS is part of the IETF NFSv4 minor version 1 standard draft

Working group is passing draft up to IETF area directors, expect RFC later in ’08

Prototype interoperability continues: San Jose Connect-a-thon March ’06, February ’07, May ‘08

Ann Arbor NFS Bake-a-thon September ’06, October ’07

Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)

Availability: TBD, gated behind NFSv4 adoption and working implementations of pNFS

Patch sets to be submitted to Linux NFS maintainer starting “soon”

Vendor announcements in 2008

Early adopters in 2009

Production ready in 2010

Page 33: Scalable Performance of the Panasas Parallel File System

www.panasas.com

Go Faster. Go Parallel.

Questions?

Thank you for your time!

Page 34: Scalable Performance of the Panasas Parallel File System


Panasas Global Storage Model

[Diagram: client nodes on a TCP/IP network mount two Panasas systems by DNS name. Panasas System A has BladeSet 1 (volumes VolX, VolY, VolZ) and BladeSet 2 (volumes delta, home); Panasas System B has BladeSet 3 (volumes VolM, VolN, VolL). A BladeSet is a physical storage pool; a volume is a logical quota tree. Example paths: /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.]
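A small illustrative Python parser for the namespace shown above, /panfs/<system DNS name>/<volume>/<path>, where each volume (a quota tree) lives in one BladeSet (a physical storage pool); this is a sketch of the naming model, not Panasas code.

def resolve(panfs_path):
    # Split "/panfs/<system>/<volume>/<rest...>" into its parts.
    _, panfs, system, volume, *rest = panfs_path.split("/")
    assert panfs == "panfs"
    return {"system": system, "volume": volume, "relative_path": "/".join(rest)}

print(resolve("/panfs/sysa/delta/file2"))
print(resolve("/panfs/sysb/volm/proj38"))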

Page 35: Scalable Performance of the Panasas Parallel File System


IB and other network fabrics

Panasas is a TCP/IP, GE-based storage product

Universal deployment, Universal routability

Commodity price curve

Panasas customers use IB, Myrinet, Quadrics, …

Cluster interconnect du jour for performance, not necessarily cost

IO routers connect cluster fabric to GE backbone

Analogous to an “IO node”, but just does TCP/IP routing (no storage)

Robust connectivity through IP multipath routing

Scalable throughput at approx 650 MB/sec per IO router (PCIe class)

Working on a 1GB/sec IO router

IB-GE switching platforms

Cisco/Voltaire switch provides wire-speed bridging

Page 36: Scalable Performance of the Panasas Parallel File System


Multi-Cluster sharing: scalable BW with fail over

[Diagram: Panasas storage connects through layer-2 switches to the I/O nodes of compute clusters A, B, and C, to NFS, DNS, Kerberos (KRB), and archive servers, and to the site network; colors depict subnets.]

Page 37: Scalable Performance of the Panasas Parallel File System


New and Unique: Network Parity

Horizontal Parity

Vertical Parity

Network Parity

Extends parity capability across the data path to the client or server node

Enables End-to-End data integrity validation

Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission

Client either receives valid data or an error notification
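A hedged Python sketch of the end-to-end check (the function names are mine, not the product's): the client fetches the stripe's parity along with the data and verifies the XOR relationship before accepting the read.

from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def verified_read(data_units, parity_unit):
    # Recompute parity over the received data and compare it with the parity
    # read from storage; a mismatch means something in the path corrupted the data.
    if reduce(xor, data_units) != parity_unit:
        raise IOError("network parity mismatch: return an error, not bad data")
    return b"".join(data_units)

print(verified_read([b"\x01" * 4, b"\x02" * 4, b"\x04" * 4], b"\x07" * 4))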
