The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003


Cluster File Systems, Inc.

Peter J. Braam (braam@clusterfs.com), Tim Reddin (tim.reddin@hp.com)

http://www.clusterfs.com

The Lustre Storage Architecture

Linux Clusters for Super Computing

Linköping 2003

2 - NSC 2003

Topics

History of project
High level picture
Networking
Devices and fundamental APIs
File I/O
Metadata & recovery
Project status
Cluster File Systems, Inc.

3 - NSC 2003

Lustre’s History

4 - NSC 2003

Project history

1999: CMU & Seagate
  Worked with Seagate for one year
  Storage management, clustering
  Built prototypes, much design
  Much survives today

5 - NSC 2003

2000-2002 File system challenge

First put forward Sep 1999, Santa Fe
New architecture for the National Labs
Characteristics:
  100s of GB/sec of I/O throughput
  trillions of files
  10,000s of nodes
  Petabytes

From the start Garth & Peter in the running

6 - NSC 2003

2002 – 2003 fast lane

3-year ASCI PathForward contract with HP and Intel

MCR & ALC: 2x 1000-node Linux clusters
PNNL: HP IA64, 1000-node Linux cluster
Red Storm, Sandia (8000 nodes, Cray)
Lustre Lite 1.0
Many partnerships (HP, Dell, DDN, …)

7 - NSC 2003

2003 – Production, performance

Spring and summer:
  LLNL MCR from no, to partial, to full-time use
  PNNL similar
  Stability much improved

Performance, summer 2003:
  I/O problems tackled
  Metadata much faster

Dec/Jan Lustre 1.0

8 - NSC 2003

High level picture

9 - NSC 2003

Lustre Systems – Major Components

Clients
  Have access to the file system
  Typical role: compute server

OST (object storage targets)
  Handle (stripes of, references to) file data

MDS (metadata server)
  Metadata request transaction engine

Also: LDAP, Kerberos, routers, etc.

10 - NSC 2003

Figure: Lustre cluster layout. Lustre clients (1,000 running Lustre Lite, up to 10,000s) reach the servers over GigE and the QSW (Quadrics) network. MDS 1 (active) and MDS 2 (failover) provide metadata service; the Lustre object storage targets (OST 1 … OST 7) are either Linux OSTs (servers with disk arrays) or 3rd-party OST appliances attached to a SAN.

11 - NSC 2003

Figure: component interactions. Clients obtain configuration information, network connection details, & security management from the LDAP server; they perform file I/O & file locking against the OSTs, and directory operations, meta-data, & concurrency operations against the meta-data server (MDS). Recovery, file status, & file creation traffic runs between the MDS and the OSTs.

12 - NSC 2003

Networking

13 - NSC 2003

Lustre Networking

Currently runs over:
  TCP
  Quadrics Elan 3 & 4
Lustre can route & can use heterogeneous nets
Beta: Myrinet, SCI
Under development: SAN (FC/iSCSI), InfiniBand
Planned: SCTP, some special NUMA and other nets

14 - NSC 2003

Lustre Network Stack - Portals

Stack, bottom to top:
  Device library (Elan, Myrinet, TCP, ...)
  Portal NALs: network abstraction layer for TCP, QSW, etc.; small & hard; includes a routing API (sketched below)
  Portal library: Sandia's API, CFS-improved implementation
  NIO API: move small & large buffers, remote DMA handling, generate events
  Lustre request processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
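As an illustration of the NAL idea, here is a minimal sketch of how each network type might plug into the portal library through a small table of callbacks. The names and signatures are assumptions made for illustration, not the actual Portals or Lustre headers.

/* Hypothetical network abstraction layer (NAL) interface.
 * Names and signatures are illustrative, not the real Portals/Lustre API. */
#include <stddef.h>
#include <stdint.h>

typedef struct nal_ops {
    int  (*startup)(void *priv);                 /* bring the interface up */
    void (*shutdown)(void *priv);                /* and down again */
    /* move a small message or a large buffer to a remote node id */
    int  (*send)(void *priv, uint64_t peer_nid, const void *buf, size_t len);
    /* post a receive buffer; completion is reported as an event */
    int  (*recv_post)(void *priv, void *buf, size_t len);
    /* poll for completion events (sent, received, RDMA done) */
    int  (*event_poll)(void *priv, int timeout_ms);
} nal_ops_t;

/* One instance per network type: TCP, Elan/QSW, Myrinet, ... */
typedef struct nal {
    const char      *name;  /* e.g. "tcp", "qsw" */
    const nal_ops_t *ops;
    void            *priv;  /* per-interface state */
} nal_t;

Routing between heterogeneous nets then amounts to the portal library forwarding a message received on one nal_t out through another.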

15 - NSC 2003

Devices and API’s

16 - NSC 2003

Lustre Devices & APIs

Lustre has numerous driver modules
One API, very different implementations
Driver binds to a named device
Stacking devices is key
Generalized "object devices"

Drivers currently export several APIs:
  Infrastructure (a mandatory API)
  Object storage
  Metadata handling
  Locking
  Recovery

17 - NSC 2003

Lustre Clients & APIs

Figure: client-side driver stack. The Lustre file system (Linux) or Lustre library (Win, Unix, microkernels) sits on a logical object volume (LOV driver) over OSC1 … OSCn, which speak the data object & lock APIs, and on a clustered MD driver over MDC …, which speaks the metadata & lock APIs.

18 - NSC 2003

Object Storage API

Objects are (usually) unnamed files
Improves on the block device API:
  create, destroy, setattr, getattr, read, write
OBD driver does block/extent allocation
Implementation: Linux drivers, using a file system backend (see the sketch below)
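A minimal sketch of such an object API as a C operations table; the identifiers are illustrative assumptions, not the actual Lustre OBD headers. Stacking then means one named device implementing this table on top of others.

/* Hypothetical object storage API as a table of operations.
 * Illustrative only; the real Lustre OBD interfaces differ. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t obj_id_t;

struct obj_attr {
    uint64_t size;
    uint32_t mode;
    uint32_t uid, gid;
    int64_t  mtime;
};

struct obd_ops {
    int (*create) (void *dev, obj_id_t *out_id);
    int (*destroy)(void *dev, obj_id_t id);
    int (*setattr)(void *dev, obj_id_t id, const struct obj_attr *attr);
    int (*getattr)(void *dev, obj_id_t id, struct obj_attr *attr);
    /* extent-style I/O; the driver does block/extent allocation */
    ssize_t (*read) (void *dev, obj_id_t id, void *buf, size_t len, off_t off);
    ssize_t (*write)(void *dev, obj_id_t id, const void *buf, size_t len, off_t off);
};

/* A named object device binds a driver (ops) to private state, so devices
 * stack: e.g. an LOV device whose ops fan writes out to several OSC devices. */
struct obd_device {
    const char           *name;
    const struct obd_ops *ops;
    void                 *private_data;
};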

19 - NSC 2003

Bringing it all together

Figure: overall architecture. The Lustre client file system (metadata write-back cache, OSCs, MDC, lock client) runs on the networking stack (device drivers for Elan, TCP, …, Portal NALs, Portal library, NIO API, request processing, recovery). It performs system & parallel file I/O and file locking against the OSTs (object-based disk server, lock server, recovery, networking; ext3, Reiser, XFS, … file system backend over Fibre Channel), and directory, metadata & concurrency operations against the MDS servers (MDS server, lock server, load balancing, recovery; ext3, Reiser, XFS, … backend over Fibre Channel). Recovery, file status, and file creation traffic runs between the MDS and the OSTs.

20 - NSC 2003

File I/O

21 - NSC 2003

File I/O – Write Operation

Open the file on the meta-data server
Get information on all objects that are part of the file:
  object IDs
  which storage controllers (OSTs) hold them
  what part of the file (offset)
  striping pattern
Create LOV, OSC drivers
Use connection to OST
Object writes go to the OST
No MDS involvement at all (see the sketch below)
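A hypothetical client-side sketch of that flow. The layout structure and the helpers mds_open and ost_write are assumptions made for illustration, not the real Lustre calls, but they show where the MDS and the OSTs come in.

/* Hypothetical write path: open on the MDS to learn the layout,
 * then write stripes directly to the OSTs. Names are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_STRIPES 16

struct layout {                    /* returned by the MDS at open time */
    int      stripe_count;
    size_t   stripe_size;          /* bytes per stripe chunk */
    int      ost_idx[MAX_STRIPES]; /* which OST holds each stripe */
    uint64_t obj_id[MAX_STRIPES];  /* object id on that OST */
};

/* assumed to exist elsewhere in this sketch */
int mds_open(const char *path, struct layout *lo);
ssize_t ost_write(int ost_idx, uint64_t obj_id,
                  const void *buf, size_t len, off_t obj_off);

ssize_t lustre_write(const char *path, const void *buf, size_t len, off_t off)
{
    struct layout lo;
    if (mds_open(path, &lo) < 0)           /* MDS is involved only here */
        return -1;

    size_t done = 0;
    while (done < len) {
        uint64_t foff  = (uint64_t)off + done;
        size_t   chunk = lo.stripe_size - (size_t)(foff % lo.stripe_size);
        if (chunk > len - done)
            chunk = len - done;

        /* round-robin striping: which object, and where inside it */
        uint64_t stripe_no = foff / lo.stripe_size;
        int      idx       = (int)(stripe_no % (uint64_t)lo.stripe_count);
        uint64_t obj_off   = (stripe_no / lo.stripe_count) * lo.stripe_size
                             + foff % lo.stripe_size;

        /* data goes straight to the OST; the MDS is not contacted again */
        if (ost_write(lo.ost_idx[idx], lo.obj_id[idx],
                      (const char *)buf + done, chunk, (off_t)obj_off) < 0)
            return -1;
        done += chunk;
    }
    return (ssize_t)done;
}

For example, with stripe_size = 1 MB and stripe_count = 2, file offset 3 MB falls in the second object (index 1) at object offset 1 MB.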

22 - NSC 2003

Figure: write operation. The Lustre client (file system, LOV, OSC 1, OSC 2, MDC) sends a file open request to the meta-data server (MDS), which returns the file meta-data, e.g. inode A = {(OST 1, obj1), (OST 3, obj2)}. The client then writes obj1 to OST 1 and obj2 to OST 3 directly through the OSCs, without further MDS involvement.

23 - NSC 2003

I/O bandwidth

100s of GB/sec => saturate many 100s of OSTs
OSTs:
  do ext3 extent allocation and non-caching direct I/O (see the example below)
  lock management is spread over the cluster
Achieve 90-95% of network throughput
Single client, single thread, Elan3: writes at 269 MB/sec
OSTs handle up to 260 MB/sec without the extent code, on a 2-way 2.4 GHz Xeon
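For readers unfamiliar with non-caching direct I/O, a generic Linux illustration (plain POSIX open with O_DIRECT, not the actual OST server code): the page cache is bypassed, so buffers and transfer sizes must be suitably aligned.

/* Generic non-caching direct I/O on Linux (illustrative, not OST code). */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;          /* 1 MB, a multiple of the block size */
    void *buf;
    if (posix_memalign(&buf, 4096, len)) /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0xab, len);

    int fd = open("direct_io_demo.bin", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    ssize_t n = write(fd, buf, len);     /* goes to disk, not the page cache */

    close(fd);
    free(buf);
    return n == (ssize_t)len ? 0 : 1;
}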

24 - NSC 2003

Metadata

25 - NSC 2003

Intent locks & Write Back caching

Clients – MDS: protocol adaptation
Low concurrency: write-back caching
  client updates in memory, delayed replay to the MDS
High concurrency (mostly merged in 2.6):
  single network request per transaction
  no lock revocations to clients
  intent-based lock includes the complete request (see the sketch below)
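A rough sketch of the intent idea, using made-up structure names (not the Lustre wire format): the lookup request carries enough information for the MDS to execute the operation, here a mkdir, in the same round trip, so no second request and no lock revocation are needed.

/* Hypothetical "intent" request: a lookup that also carries everything
 * needed to perform a mkdir, so the MDS can execute it in one round trip. */
#include <stdint.h>
#include <sys/types.h>

enum intent_op { INTENT_LOOKUP_ONLY, INTENT_MKDIR, INTENT_CREATE, INTENT_OPEN };

struct md_intent_req {
    uint64_t       parent_id;    /* directory being looked up in */
    char           name[256];    /* component name */
    enum intent_op op;           /* what the client intends to do next */
    mode_t         mode;         /* arguments for that operation */
    uint32_t       uid, gid;
};

struct md_intent_rep {
    int      status;             /* result of the lookup plus executed intent */
    uint64_t id;                 /* identifier of the (possibly new) entry */
    uint64_t lock_handle;        /* lock granted along with the reply */
};

/* client side: one network request instead of separate lookup + mkdir */
int mdc_intent_mkdir(uint64_t parent_id, const char *name, mode_t mode,
                     struct md_intent_rep *rep);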

26 - NSC 2003

Figure: conventional vs. Lustre mkdir.
a) Conventional mkdir: the client performs a lookup and then a mkdir as two separate calls over the network; the file server does the lookup and then creates the directory.
b) Lustre mkdir: the Lustre client sends a single lookup carrying a mkdir intent; the meta-data server's lock module exercises the intent and runs mds_mkdir, so the operation completes in one network round trip.

27 - NSC 2003

Lustre 1.0

Only has the high-concurrency model
Aggregate throughput (1,000 clients):
  ~5000 file creations (open/close) per second
  ~7800 stats per second in 10 x 1M-file directories
Single client: around 1500 creations or stats per second
Handling 10M-file directories is effortless
Many changes to ext3 (all merged in 2.6)

28 - NSC 2003

Metadata Future

Lustre 2.0 – 2004

Metadata clustering: common operations will parallelize
100% write-back caching, in memory or on disk (like AFS)

30 - NSC 2003

Recovery

31 - NSC 2003

Recovery approach

Keep it simple!
Based on failover circles:
  use existing failover software
  your left working neighbor is your failover node
At HP we use failover pairs
  simplifies storage connectivity
I/O failure triggers:
  peer node serves the failed OST
  retry from the client is routed to the new OST node (see the sketch below)
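A hedged sketch of that client-side retry, assuming a simple failover-pair table (the names are invented for illustration; real Lustre recovery also replays outstanding requests):

/* Hypothetical client retry on OST failure: if an I/O request fails,
 * switch to the failover peer of that OST and resend. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

struct ost_target {
    int active_idx;        /* which node of the pair currently serves the OST */
    int node[2];           /* the two server nodes in the failover pair */
};

/* assumed transport call: returns a negative value on node failure */
ssize_t ost_send_write(int node, uint64_t obj_id,
                       const void *buf, size_t len, off_t off);

ssize_t write_with_failover(struct ost_target *t, uint64_t obj_id,
                            const void *buf, size_t len, off_t off)
{
    ssize_t rc = ost_send_write(t->node[t->active_idx], obj_id, buf, len, off);
    if (rc < 0) {
        /* I/O failure triggers failover: the peer node now serves the OST */
        t->active_idx ^= 1;
        rc = ost_send_write(t->node[t->active_idx], obj_id, buf, len, off);
    }
    return rc;
}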

32 - NSC 2003

OST Server – redundant pair

Figure: OST 1 and OST 2 each connect through dual FC switches to the shared storage controllers (C1, C2), so either server can take over the other's OST.

33 - NSC 2003

Configuration

34 - NSC 2003

Lustre 1.0

Good tools to build a configuration
Configuration is recorded on the MDS
  or on a dedicated management server
Configuration can be changed
  1.0 requires downtime
Clients auto-configure:
  mount -t lustre -o … mds://fileset/sub/dir /mnt/pt

SNMP support

35 - NSC 2003

Futures

36 - NSC 2003

Advanced Management

Snapshots: all the features you might expect
Global namespace: combines the best of AFS & autofs4
HSM, hot migration: driven by customer demand (we plan XDSM)
Online zero-downtime re-configuration: part of Lustre 2.0

38 - NSC 2003

Security

Authentication
POSIX-style authorization
NASD-style OST authorization
  refinement: use OST ACLs and cookies
File encryption with a group key service
  STK secure file system

43 - NSC 2003

Project status

44 - NSC 2003

Lustre Feature Roadmap

Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003:
  Failover MDS
  Basic Unix security
  File I/O very fast (~100s of OSTs)
  Intent-based scalable metadata
  POSIX compliant

Lustre 2.0 (Linux 2.6), 2004:
  Metadata cluster
  Basic Unix security
  Collaborative read cache
  Write-back metadata
  Parallel I/O

Lustre 3.0, 2005:
  Metadata cluster
  Advanced security
  Storage management
  Load-balanced MD
  Global namespace

45 - NSC 2003

Cluster File Systems, Inc.

46 - NSC 2003

Cluster File Systems

Small service company: 20-30 people
Software development & service (95% Lustre)
  contract work for government labs
  OSS, but defense contracts
Extremely specialized, extreme expertise
  we only do file systems and storage
Investments not needed; profitable
Partners: HP, Dell, DDN, Cray

47 - NSC 2003

Lustre – conclusions

Great vehicle for advanced storage software
Things are done differently:
  protocols & design from Coda & InterMezzo
  stacking & DB recovery theory applied
Leverages existing components
Initial signs promising

48 - NSC 2003

HP & Lustre

Two projects:
  ASCI PathForward – Hendrix
  Lustre Storage product

Field trial in Q1 of 04

49 - NSC 2003

Questions?
