Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing
Osamu Tatebe 1, Noriyuki Soda 2, Youhei Morita 3, Satoshi Matsuoka 4, Satoshi Sekiguchi 1
1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII
CHEP 04, Sep 27, 2004, Interlaken, Switzerland
Transcript
National Institute of Advanced Industrial Science and Technology
Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing
1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII
CHEP 04, Sep 27, 2004, Interlaken, Switzerland
[Background] Petascale Data Intensive Computing
Detector for ALICE experiment
Detector for LHCb experiment
High Energy Physics: CERN LHC, KEK-B Belle
~MB/collision, 100 collisions/sec, ~PB/year
2,000 physicists, 35 countries
Astronomical Data Analysis: analysis of the whole data set, TB~PB/year/telescope
Subaru telescope: 10 GB/night, 3 TB/year
Petascale Data-intensive Computing: Requirements
Peta/Exabyte-scale files, millions of millions of files
Scalable computational power
> 100 GB/s, hopefully > 1 TB/s, within a system and between systems
Efficient global sharing with group-oriented authentication and access control
Fault tolerance / dynamic re-configuration
Resource management and scheduling
System monitoring and administration
Global computing environment
Goal and features of Grid Datafarm
Goal:
Dependable data sharing among multiple organizations
High-speed data access, high-performance data computing
Grid Datafarm:
Gfarm File System - a global, dependable virtual file system
Federates scratch disks in PCs
Parallel and distributed data computing
Associates the Computational Grid with the Data Grid
Features:
Secure, based on the Grid Security Infrastructure
Scalable with data size and usage scenarios
Location-transparent data access
Automatic and transparent replica selection for fault tolerance
High-performance data access and computing by accessing multiple dispersed storages in parallel (file affinity scheduling)
Grid Datafarm (1): Gfarm file system - World-wide virtual file system [CCGrid 2002]
Transparent access to dispersed file data in a Grid
POSIX I/O APIs
Applications can access the Gfarm file system without any modification, as if it were mounted at /gfarm
Automatic and transparent replica selection for fault tolerance and avoidance of access concentration
[Figure: the Gfarm file system presents a virtual directory tree (/gfarm, ggf, jp, aist, gtrc); file system metadata maps each file (file1 . . . file4) in the tree to its physical replicas, and file replica creation places copies on multiple nodes]
Grid Datafarm (2): High-performance data access and computing support [CCGrid 2002]
Do not separate storage and CPU
Parallel and distributed file I/O
Scientific Applications
ATLAS Data Production:
Distribution kit (binary)
Atlfast - fast simulation; input data stored in the Gfarm file system, not NFS
G4sim - full simulation (collaboration with ICEPP, KEK)
Belle Monte-Carlo Production:
30 TB of data needs to be generated
3 M events (60 GB) / day are being generated using a 50-node PC cluster
Simulation data will be generated in a distributed manner at tens of universities and KEK (collaboration with KEK, U-Tokyo)
Gfarm™ v1
Open source development: Gfarm™ version 1.0.3.1 released on July 5, 2004 (http://datafarm.apgrid.org/)
scp, GridFTP server, samba server, . . .
[Figure: architecture of Gfarm v1. An application links against the Gfarm library, which talks to the metadata server (gfmd and slapd) and to gfsd I/O daemons running on each of the compute and file system nodes]
* Existing applications can access the Gfarm file system without any modification, using LD_PRELOAD
Problems of Gfarm™ v1
Functionality of file access:
File open in read-write mode*, file locking (* supported in version 1.0.4)
Robustness:
Consistency between the metadata and the physical files
at unexpected application crash
at unexpected modification of physical files
Security:
Access control of file system metadata
Access control of files by group
File model - a Gfarm file is a group of files (collection, container):
Flexibility of file grouping
Design of Gfarm™ v2
Supports more than ten thousand clients and file server nodes
Provides scalable file I/O performance
Gfarm v2 - towards a *true* global virtual file system:
POSIX compliant - supports read-write mode, advisory file locking, . . .
Robust, dependable, and secure
Can be substituted for NFS, AFS, . . .
Related work (1)
Lustre:
> 1,000 clients
Object (file) based management, placed in any OST
No replica management; writeback cache, collaborative read cache (planned)
GSSAPI, ACL, StorageTek SFS
Kernel module
http://www.lustre.org/docs/ols2003.pdf
Related work (2)
Google File System:
> 1,000 storage nodes
Fixed-size chunks, placed in any chunkserver; three replicas by default
User-level client library; no client or server cache
Not a POSIX API; supports Google's data processing needs
[SOSP’03]
Opening files in read-write mode (1)
Semantics (the same as AFS):
[Without advisory file locking] Updated content is available only when the file is opened after a writing process closes it.
[With advisory file locking] Among processes that lock a file, up-to-date content is available in the locked region. This is not ensured when a process writes to the same file without file locking.
Opening files in read-write mode (2)
[Figure: two processes open /grid/jp/file2, which has copies on file system nodes FSN1 and FSN2. Process 1 opens it with fopen("/grid/jp/file2", "rw") and accesses the copy on FSN1; Process 2 opens it with fopen("/grid/jp/file2", "r") and accesses the copy on FSN2. Before the writer closes the file, any file copy can be accessed. When Process 1 calls fclose(), the now-invalid copy on FSN2 is deleted from the metadata, but Process 2's ongoing file access continues until its own fclose()]
Advisory file locking
[Figure: Process 1 opens /grid/jp/file2 in read-write mode and accesses the copy on FSN1; Process 2 opens it read-only and accesses the copy on FSN2. When Process 2 issues a read-lock request, its file access is redirected to the up-to-date copy on FSN1; the cache is flushed and caching is disabled for the locked region]
Consistent update of metadata (1)
[Figure: Gfarm v1 - the Gfarm library updates the metadata. The application's Gfarm library opens the file on file system node FSN1 and, on close, updates the metadata server itself. The metadata is therefore not updated at an unexpected application crash]
Consistent update of metadata (2)
[Figure: Gfarm v2 - the file system node updates the metadata. The application's Gfarm library opens the file on file system node FSN1; FSN1 detects the close, or a broken pipe, and updates the metadata server itself, so the metadata is updated even at an unexpected application crash]
Generalization of file grouping model
[Figure: N sets of 10 image files each, taken by the Subaru telescope]
• 10 files executed in parallel
• N files executed in parallel
• 10 x N files executed in parallel
File grouping by directory
[Figure: directory night1/ contains shot1, shot2, . . ., shotN, each holding ccd0, ccd1, . . ., ccd9. An alternate directory night1-ccd1/ contains shot1, shot2, . . ., shotN, each a symlink/hardlink to the corresponding CCD file, e.g. night1-ccd1/shot2 points to night1/shot2/ccd1]
gfs_pio_open("night1/shot2", &gf) - opens a Gfarm file that concatenates ccd0, . . ., ccd9
gfs_pio_set_view_section(gf, "ccd1") - sets the file view to the ccd1 section
gfs_pio_open("night1", &gf) - opens a Gfarm file that concatenates shot1/ccd0, . . ., and shotN/ccd9
Summary and future work
Gfarm™ v2 aims at a global virtual file system having:
Scalability up to more than ten thousand clients and file system nodes
Scalable file I/O performance
POSIX compliance (read-write mode, file locking, . . .)
Fault tolerance, robustness, and dependability
Its design and implementation have been discussed.
Future work:
Implementation and performance evaluation
Evaluation of scalability up to more than ten thousand nodes
Data preservation, automatic replica creation