Page 1: Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing

Osamu Tatebe 1, Noriyuki Soda 2, Youhei Morita 3, Satoshi Matsuoka 4, Satoshi Sekiguchi 1
1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII

CHEP 04, Sep 27, 2004, Interlaken, Switzerland

Page 2: [Background] Petascale Data Intensive Computing

[Figures: Detector for ALICE experiment; detector for LHCb experiment]

High Energy Physics: CERN LHC, KEK-B Belle
- ~MB/collision, 100 collisions/sec, ~PB/year
- 2000 physicists, 35 countries

Astronomical Data Analysis
- Data analysis of the whole data set: TB~PB/year/telescope
- Subaru telescope: 10 GB/night, 3 TB/year
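As a rough consistency check of these rates (my own back-of-the-envelope arithmetic; the ~10^7 seconds of data taking per year is an assumption, not a figure from the slide):

    ~1 MB/collision x 100 collisions/s  ~  100 MB/s
    100 MB/s x ~10^7 s of data taking per year  ~  10^9 MB  ~  1 PB/year
    10 GB/night x ~365 nights  ~  3.7 TB/year  (consistent with the quoted 3 TB/year)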

Page 3: Petascale Data-intensive Computing Requirements

- Peta/Exabyte-scale files, millions of millions of files
- Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Efficient global sharing with group-oriented authentication and access control
- Fault tolerance / dynamic re-configuration
- Resource management and scheduling
- System monitoring and administration
- Global computing environment

Page 4: Goal and features of Grid Datafarm

Goal
- Dependable data sharing among multiple organizations
- High-speed data access, high-performance data computing

Grid Datafarm
- Gfarm File System: global dependable virtual file system
- Federates scratch disks in PCs
- Parallel & distributed data computing
- Associates Computational Grid with Data Grid

Features
- Secured based on Grid Security Infrastructure
- Scalable depending on data size and usage scenarios
- Data-location-transparent data access
- Automatic and transparent replica selection for fault tolerance
- High-performance data access and computing by accessing multiple dispersed storages in parallel (file affinity scheduling)

Page 5: Grid Datafarm (1): Gfarm file system - World-wide virtual file system [CCGrid 2002]

- Transparent access to dispersed file data in a Grid
- POSIX I/O APIs
- Applications can access the Gfarm file system without any modification, as if it is mounted at /gfarm
- Automatic and transparent replica selection for fault tolerance and access-concentration avoidance

[Figure: Gfarm File System - a virtual directory tree (/gfarm, ggf, jp, aist, gtrc) is mapped by file system metadata onto dispersed physical files (file1-file4), with file replica creation across nodes]
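To make the "no modification" point concrete, here is a minimal sketch (my own illustration; the concrete path under /gfarm is made up, not taken from the slides) of an ordinary POSIX program that would work unchanged when the Gfarm file system is visible at /gfarm:

    /* Minimal sketch: an unmodified POSIX program reading data under /gfarm.
     * The path below is a hypothetical example, not from the slides. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        /* Plain POSIX calls; nothing Gfarm-specific is needed when the
         * Gfarm file system is made visible at /gfarm. */
        int fd = open("/gfarm/aist/gtrc/file1", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        char buf[4096];
        ssize_t n;
        long total = 0;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;                 /* just count the bytes we read */

        printf("read %ld bytes\n", total);
        close(fd);
        return 0;
    }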

Page 6: Grid Datafarm (2): High-performance data access and computing support [CCGrid 2002]

Do not separate storage and CPU

Parallel and distributed file I/O

Page 7: Scientific Applications

ATLAS Data Production
- Distribution kit (binary)
- Atlfast: fast simulation
- Input data stored in the Gfarm file system, not NFS
- G4sim: full simulation (collaboration with ICEPP, KEK)

Belle Monte Carlo Production
- 30 TB of data needs to be generated
- 3 M events (60 GB) / day is being generated using a 50-node PC cluster
- Simulation data will be generated in a distributed manner at tens of universities and KEK

(Collaboration with KEK, U-Tokyo)

Page 8: Gfarm™ v1

Open source development
- Gfarm™ version 1.0.3.1 released on July 5, 2004 (http://datafarm.apgrid.org/)
- scp, GridFTP server, samba server, . . .

[Figure: Gfarm v1 architecture - an application linked with the Gfarm library contacts the metadata server (gfmd, slapd) and the gfsd daemons running on the compute and file system nodes (CPU . . . CPU)]

* Existing applications can access the Gfarm file system without any modification using LD_PRELOAD
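The LD_PRELOAD mechanism mentioned above works by interposing a shared library on libc calls. The sketch below shows the general interposition technique only; it is not the actual Gfarm hook library, and the /gfarm prefix check and the logging are just for illustration:

    /* Generic LD_PRELOAD interposition sketch (illustration of the technique
     * only; NOT the Gfarm hook library).
     * Build: gcc -shared -fPIC -o hook_demo.so hook_demo.c -ldl
     * Run:   LD_PRELOAD=./hook_demo.so some_unmodified_application */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...)
    {
        /* Locate the libc open() so that ordinary paths behave normally. */
        int (*real_open)(const char *, int, ...) =
            (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {      /* open() takes a third argument only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t) va_arg(ap, int);
            va_end(ap);
        }

        if (strncmp(path, "/gfarm/", 7) == 0) {
            /* A real hook library would translate the path and contact the
             * metadata server here; this sketch only reports the interception. */
            fprintf(stderr, "intercepted open of %s\n", path);
        }

        return real_open(path, flags, mode);
    }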

Page 9: Problems of Gfarm™ v1

Functionality of file access

- File open in read-write mode*, file locking (* supported in version 1.0.4)

Robustness
- Consistency between metadata and physical file
  - at unexpected application crash
  - at unexpected modification of physical files

Security
- Access control of file system metadata
- Access control of files by group

File model of a Gfarm file - a group of files (collection, container)
- Flexibility of file grouping

Page 10: Design of Gfarm™ v2

- Supports more than ten thousand clients and file server nodes
- Provides scalable file I/O performance

Gfarm v2 - towards a *true* global virtual file system

- POSIX compliant: supports read-write mode, advisory file locking, . . .
- Robust, dependable, and secure
- Can be substituted for NFS, AFS, . . .

Page 11: Related work (1)

Lustre
- > 1,000 clients
- Object (file) based management, placed in any OST
- No replica management; writeback cache, collaborative read cache (planned)
- GSSAPI, ACL, StorageTek SFS
- Kernel module

http://www.lustre.org/docs/ols2003.pdf

Page 12: Related work (2)

Google File System
- > 1,000 storage nodes
- Fixed-size chunks, placed in any chunkserver; three replicas by default
- User client library; no client or server cache
- Not a POSIX API; supports Google's data processing needs

[SOSP’03]

Page 13: Opening files in read-write mode (1)

Semantics (the same as AFS)

[Without advisory file locking]
- Updated content is available only when opening the file after a writing process closes it

[With advisory file locking]
- Among processes that lock a file, up-to-date content is available in the locked region
- This is not ensured when a process writes the same file without file locking
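A small sketch of what these semantics mean in practice (my own illustration; the path is the slide's example and the writer/reader split is hypothetical):

    /* Illustration of the AFS-like open/close semantics described above. */
    #include <stdio.h>

    static void writer(void)
    {
        /* The slide's "rw" corresponds to stdio's "r+" (read-write) mode. */
        FILE *f = fopen("/grid/jp/file2", "r+");
        if (f == NULL)
            return;
        fputs("updated content\n", f);
        /* Without advisory locking, processes that open the file before this
         * fclose() may still see the old content.  The update is guaranteed
         * to be visible to opens issued after fclose(). */
        fclose(f);
    }

    static void reader(void)
    {
        /* An open issued after the writer's fclose() sees the new content.
         * With advisory locking, a locked region is up to date even before
         * the close, but only among the processes that take the lock. */
        FILE *f = fopen("/grid/jp/file2", "r");
        char line[256];
        if (f != NULL && fgets(line, sizeof line, f) != NULL)
            printf("read: %s", line);
        if (f != NULL)
            fclose(f);
    }

    int main(void)
    {
        writer();
        reader();
        return 0;
    }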

Page 14: Opening files in read-write mode (2)

[Figure: Process 1 opens /grid/jp/file2 in read-write mode on file system node FSN1 while Process 2 opens it read-only on FSN2, both via the metadata server. Before Process 1 closes the file, any file copy can be accessed. When Process 1 calls fclose(), the now-invalid copy is deleted from the metadata, but Process 2's ongoing file access continues until its own fclose().]

Page 15: Advisory file locking

[Figure: With the same setup (Process 1 opens /grid/jp/file2 in read-write mode on FSN1, Process 2 opens it read-only on FSN2), a read lock request from Process 2 causes the cache to be flushed and caching to be disabled, and the locked region is then accessed on FSN1.]
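Advisory locking here follows the usual POSIX model; the sketch below uses standard fcntl() record locks to show the pattern from the client's point of view (this is generic POSIX code, not a Gfarm-specific API):

    /* Generic POSIX advisory record locking, as in the scenario above.
     * The path is the slide's example; the byte range is arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/grid/jp/file2", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct flock lk = {0};
        lk.l_type = F_RDLCK;      /* read (shared) lock */
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 4096;          /* lock the first 4 KiB */

        /* While the lock is held, the locked region is kept up to date:
         * cached data is flushed and caching is disabled for that region,
         * as shown in the figure above. */
        if (fcntl(fd, F_SETLKW, &lk) < 0) {
            perror("fcntl(F_SETLKW)");
            close(fd);
            return 1;
        }

        /* ... read the up-to-date region here ... */

        lk.l_type = F_UNLCK;      /* release the lock */
        fcntl(fd, F_SETLK, &lk);
        close(fd);
        return 0;
    }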

Page 16: Consistent update of metadata (1)

[Figure: In Gfarm v1, the application (via the Gfarm library) asks the metadata server to open a file, accesses the file on file system node FSN1, and updates the metadata itself on close.]

Metadata is not updated at an unexpected application crash.

Gfarm v1 - the Gfarm library updates the metadata.

Page 17: Consistent update of metadata (2)

[Figure: In Gfarm v2, the application (via the Gfarm library) asks the metadata server to open a file and accesses it on file system node FSN1; on close, or on a broken pipe, the file system node updates the metadata.]

Metadata is updated by the file system node even at an unexpected application crash.

Gfarm v2 - the file system node updates the metadata.

Page 18: Generalization of the file grouping model

[Figure: N sets of 10 image files taken by the Subaru telescope]

• 10 files executed in parallel
• N files executed in parallel
• 10 x N files executed in parallel

Page 19: File grouping by directory

[Figure: Directory tree night1/shot1 . . . shotN, each shot containing ccd0, ccd1, . . ., ccd9; a parallel tree night1-ccd1/shot1 . . . shotN holds symlinks/hardlinks to night1/shotX/ccd1]

gfs_pio_open("night1/shot2", &gf) - open a Gfarm file that concatenates ccd0, . . ., ccd9
gfs_pio_set_view_section(gf, "ccd1") - set the file view to the ccd1 section
gfs_pio_open("night1", &gf) - open a Gfarm file that concatenates shot1/ccd0, . . ., shotN/ccd9
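Putting the calls on this slide together, a client program would look roughly like the sketch below. The prototypes follow the slide's simplified pseudo-code and are assumptions for illustration; the real Gfarm library API differs in details (error returns, open flags, header names), and gfs_pio_close and the GFS_File typedef are assumed here for completeness:

    /* Sketch of the file-grouping calls shown on this slide (assumed,
     * simplified prototypes; not the actual Gfarm headers). */
    typedef struct gfs_file *GFS_File;                    /* assumed opaque handle */
    void gfs_pio_open(const char *path, GFS_File *gfp);   /* as written on the slide */
    void gfs_pio_set_view_section(GFS_File gf, const char *section);
    void gfs_pio_close(GFS_File gf);                      /* assumed for completeness */

    void process_ccd1_of_shot2(void)
    {
        GFS_File gf;

        /* Open a Gfarm file that concatenates ccd0, . . ., ccd9 of shot2. */
        gfs_pio_open("night1/shot2", &gf);

        /* Restrict the file view to the ccd1 section only. */
        gfs_pio_set_view_section(gf, "ccd1");

        /* ... read or process the ccd1 section here ... */

        gfs_pio_close(gf);
    }

    /* Opening the directory "night1" instead concatenates shot1/ccd0, . . .,
     * shotN/ccd9 into a single logical Gfarm file:
     *     gfs_pio_open("night1", &gf);
     */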

Page 20: Summary and future work

Gfarm™ v2 aims at a global virtual file system having:
- Scalability up to more than ten thousand clients and file system nodes
- Scalable file I/O performance
- POSIX compliance (read-write mode, file locking, . . .)
- Fault tolerance, robustness, and dependability

The design and implementation have been discussed.

Future work
- Implementation and performance evaluation
- Evaluation of scalability up to more than ten thousand nodes
- Data preservation, automatic replica creation