Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing
Osamu Tatebe 1, Noriyuki Soda 2, Youhei Morita 3, Satoshi Matsuoka 4, Satoshi Sekiguchi 1
1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII
CHEP 04, Sep 27, 2004, Interlaken, Switzerland
Transcript
National Institute of Advanced Industrial Science and Technology
Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing
1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII
CHEP 04, Sep 27, 2004, Interlaken, Switzerland
[Background] Petascale Data Intensive Computing
Detector for ALICE experiment
Detector for LHCb experiment
High Energy Physics: CERN LHC, KEK-B Belle
~MB/collision, 100 collisions/sec, ~PB/year
2,000 physicists, 35 countries
Astronomical Data Analysis: analysis of the whole data set, TB~PB/year/telescope
Subaru telescope: 10 GB/night, 3 TB/year
Petascale Data-intensive Computing: Requirements
Peta/Exabyte-scale files, millions of millions of files
Scalable computational power
> 100 GB/s, hopefully > 1 TB/s, within a system and between systems
Efficient global sharing with group-oriented authentication and access control
Fault tolerance / dynamic re-configuration
Resource management and scheduling
System monitoring and administration
Global computing environment
Goal and features of Grid Datafarm
Goal:
Dependable data sharing among multiple organizations
High-speed data access, high-performance data computing
Grid Datafarm:
Gfarm File System - a global, dependable virtual file system
Federates scratch disks in PCs
Parallel and distributed data computing
Associates the Computational Grid with the Data Grid
Features:
Secure, based on the Grid Security Infrastructure
Scalable with data size and usage scenarios
Location-transparent data access
Automatic and transparent replica selection for fault tolerance
High-performance data access and computing by accessing multiple dispersed storages in parallel (file affinity scheduling)
Grid Datafarm (1): Gfarm file system - World-wide virtual file system [CCGrid 2002]
Transparent access to dispersed file data in a Grid
POSIX I/O APIs
Applications can access the Gfarm file system without any modification, as if it were mounted at /gfarm
Automatic and transparent replica selection for fault tolerance and avoidance of access concentration
[Figure: the Gfarm file system presents a virtual directory tree (/gfarm, ggf, jp, aist, gtrc); file system metadata maps each file (file1 . . . file4) in the tree to its physical replicas, and file replica creation places copies on multiple nodes]
Grid Datafarm (2): High-performance data access and computing support [CCGrid 2002]
Do not separate storage and CPU
Parallel and distributed file I/O
Scientific Applications
ATLAS Data Production:
Distribution kit (binary)
Atlfast - fast simulation; input data stored in the Gfarm file system, not NFS
G4sim - full simulation (collaboration with ICEPP, KEK)
Belle Monte-Carlo Production:
30 TB of data needs to be generated
3 M events (60 GB) / day are being generated using a 50-node PC cluster
Simulation data will be generated in a distributed manner at tens of universities and KEK (collaboration with KEK, U-Tokyo)
Gfarm™ v1
Open source development: Gfarm™ version 1.0.3.1 released on July 5, 2004 (http://datafarm.apgrid.org/)
scp, GridFTP server, samba server, . . .
[Figure: architecture of Gfarm v1. An application links against the Gfarm library, which talks to the metadata server (gfmd and slapd) and to gfsd I/O daemons running on each of the compute and file system nodes]
* Existing applications can access the Gfarm file system without any modification, using LD_PRELOAD
Problems of Gfarm™ v1
Functionality of file access:
File open in read-write mode*, file locking (* supported in version 1.0.4)
Robustness:
Consistency between the metadata and the physical files
at unexpected application crash
at unexpected modification of physical files
Security:
Access control of file system metadata
Access control of files by group
File model - a Gfarm file is a group of files (collection, container):
Flexibility of file grouping
Design of Gfarm™ v2
Supports more than ten thousand clients and file server nodes
Provides scalable file I/O performance
Gfarm v2 - towards a *true* global virtual file system:
POSIX compliant - supports read-write mode, advisory file locking, . . .
Robust, dependable, and secure
Can be substituted for NFS, AFS, . . .
Related work (1)
Lustre:
> 1,000 clients
Object (file) based management, placed in any OST
No replica management; writeback cache, collaborative read cache (planned)
GSSAPI, ACL, StorageTek SFS
Kernel module
http://www.lustre.org/docs/ols2003.pdf
Related work (2)
Google File System:
> 1,000 storage nodes
Fixed-size chunks, placed in any chunkserver; three replicas by default
User-level client library; no client or server cache
Not a POSIX API; supports Google's data processing needs
[SOSP’03]
Opening files in read-write mode (1)
Semantics (the same as AFS):
[Without advisory file locking] Updated content is available only when the file is opened after a writing process closes it.
[With advisory file locking] Among processes that lock a file, up-to-date content is available in the locked region. This is not ensured when a process writes to the same file without file locking.
Opening files in read-write mode (2)
[Figure: two processes open /grid/jp/file2, which has copies on file system nodes FSN1 and FSN2. Process 1 opens it with fopen("/grid/jp/file2", "rw") and accesses the copy on FSN1; Process 2 opens it with fopen("/grid/jp/file2", "r") and accesses the copy on FSN2. Before the writer closes the file, any file copy can be accessed. When Process 1 calls fclose(), the now-invalid copy on FSN2 is deleted from the metadata, but Process 2's ongoing file access continues until its own fclose()]
Advisory file locking
[Figure: Process 1 opens /grid/jp/file2 in read-write mode and accesses the copy on FSN1; Process 2 opens it read-only and accesses the copy on FSN2. When Process 2 issues a read-lock request, its file access is redirected to the up-to-date copy on FSN1; the cache is flushed and caching is disabled for the locked region]
Consistent update of metadata (1)
[Figure: Gfarm v1 - the Gfarm library updates the metadata. The application's Gfarm library opens the file on file system node FSN1 and, on close, updates the metadata server itself. The metadata is therefore not updated at an unexpected application crash]
Consistent update of metadata (2)
[Figure: Gfarm v2 - the file system node updates the metadata. The application's Gfarm library opens the file on file system node FSN1; FSN1 detects the close, or a broken pipe, and updates the metadata server itself, so the metadata is updated even at an unexpected application crash]
Generalization of file grouping model
[Figure: N sets of 10 image files each, taken by the Subaru telescope]
• 10 files executed in parallel
• N files executed in parallel
• 10 x N files executed in parallel
File grouping by directory
[Figure: directory night1/ contains shot1, shot2, . . ., shotN, each holding ccd0, ccd1, . . ., ccd9. An alternate directory night1-ccd1/ contains shot1, shot2, . . ., shotN, each a symlink/hardlink to the corresponding CCD file, e.g. night1-ccd1/shot2 points to night1/shot2/ccd1]
gfs_pio_open("night1/shot2", &gf) - opens a Gfarm file that concatenates ccd0, . . ., ccd9
gfs_pio_set_view_section(gf, "ccd1") - sets the file view to the ccd1 section
gfs_pio_open("night1", &gf) - opens a Gfarm file that concatenates shot1/ccd0, . . ., and shotN/ccd9
Summary and future work
Gfarm™ v2 aims at a global virtual file system having:
Scalability up to more than ten thousand clients and file system nodes
Scalable file I/O performance
POSIX compliance (read-write mode, file locking, . . .)
Fault tolerance, robustness, and dependability
Its design and implementation have been discussed.
Future work:
Implementation and performance evaluation
Evaluation of scalability up to more than ten thousand nodes
Data preservation, automatic replica creation