Grid Datafarm Architecture for Petascale Data Intensive Computing
Osamu Tatebe, Grid Technology Research Center, AIST
On behalf of the Gfarm project
http://datafarm.apgrid.org/
ACAT 2002, June 28, 2002, Moscow, Russia
Grid Datafarm Architecture for Petascale Data Intensive Computing
Osamu Tatebe, Grid Technology Research Center, AIST
On behalf of the Gfarm project
http://datafarm.apgrid.org/
Petascale Data Intensive Computing / Large-scale Data Analysis
Data intensive computing, large-scale data analysis, data mining
High Energy Physics
Astronomical observation, Earth science
Bioinformatics…
Good support still needed
Large-scale database search, data mining
E-Government, E-Commerce, data warehouses
Search engines
Other commercial applications
Example: Large Hadron Collider accelerator at CERN
[Figure: the LHC (perimeter 26.7 km); the ATLAS detector (40 m x 20 m, 7,000 tons); detectors for the ALICE and LHCb experiments; a truck for scale]
~2,000 physicists from 35 countries
Peta/Exascale Data Intensive Computing Requirements
Efficient global sharing with group-oriented authentication and access control
Resource management and scheduling
System monitoring and administration
Fault tolerance / dynamic re-configuration
Global computing environment
[Figure: single system image & parallel I/O; file stripes across I/O nodes; I/O bandwidth limited by the IP network]
For Petabyte-scale Computing
Wide-area efficient sharing
Wide-area fast file transfer
Wide-area file replica management
Scalable I/O bandwidth, >TB/s
I/O bandwidth limited by network bandwidth
Utilize local disk I/O as much as possible
Avoid data movement through the network as much as possible
Fault tolerance
Temporary failure of the wide-area network is common
Node and disk failures are not exceptional cases but common
A fundamentally new paradigm is necessary
Our approach: Data Parallel Cluster-of-cluster Filesystem
[Figure: compute & I/O nodes connected by an IP network; a metadata manager backed by a metadata DB]
Single system image & parallel I/O
Local file view and file affinity scheduling to exploit the locality of local I/O channels
[Figure labels: N x 100 MB/s; file fragments on each node]
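The scheduling idea above can be sketched in a few lines. This is an illustrative model, not the actual Gfarm scheduler: given which node stores each file fragment, each parallel process is dispatched to the node that already holds its fragment, so I/O goes over the local disk channel instead of the IP network.

```python
# Hypothetical sketch of file affinity scheduling (illustrative names,
# not the real Gfarm API).

def affinity_schedule(fragment_locations):
    """fragment_locations: dict mapping fragment index -> hosting node.
    Returns a schedule: node -> list of process ranks to run there."""
    schedule = {}
    for rank, node in sorted(fragment_locations.items()):
        # Rank i processes fragment i, so run it where fragment i lives.
        schedule.setdefault(node, []).append(rank)
    return schedule

# A logical file striped into 4 fragments over 2 nodes:
locations = {0: "node00", 1: "node01", 2: "node00", 3: "node01"}
print(affinity_schedule(locations))
# -> {'node00': [0, 2], 'node01': [1, 3]}
```

Because every process reads only its local fragment, aggregate bandwidth scales with the number of disks (N x 100 MB/s in the figure) rather than being capped by the network.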
Our approach (2): Parallel Filesystem for a Grid of Clusters
Cluster-of-cluster filesystem on the Grid
File replicas among clusters for fault tolerance and load balancing
Extension of a striping cluster filesystem
Arbitrary file block length
Unified I/O and compute nodes
Parallel I/O, parallel file transfer, and more
Extreme I/O bandwidth, >TB/s
Exploit data access locality
File affinity scheduling and local file view
Fault tolerance – file recovery
Write-once files can be re-generated using a command history and re-computation
Process / resource monitoring and management
Gfmd – metaserver and process manager running at each site
Filesystem metadata management
Metadata consists of:
Mapping from logical filename to physical distributed fragment filenames
Replica catalog
Command history for regeneration of lost files
Platform information
File status information
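The metadata items listed above can be pictured as one catalog entry per logical file. The sketch below is a toy model, not the real gfmd schema; all field and node names are made up for illustration.

```python
# Toy model of metaserver state: one entry per logical Gfarm file,
# holding the fragment mapping, the per-fragment replica catalog,
# and the command history used for regeneration.

metadata = {
    "gfarm:higgs.events": {
        "fragments": ["higgs.events.0", "higgs.events.1"],
        "replicas": {  # fragment -> nodes holding a copy
            "higgs.events.0": ["node00", "node07"],
            "higgs.events.1": ["node01"],
        },
        "history": "gfrun sim higgs.in",  # hypothetical producing command
    },
}

def lookup(logical_name, fragment_index):
    """Resolve one fragment of a logical file to the nodes storing it."""
    entry = metadata[logical_name]
    fragment = entry["fragments"][fragment_index]
    return entry["replicas"][fragment]

print(lookup("gfarm:higgs.events", 0))
# -> ['node00', 'node07']
```

A scheduler would consult such a lookup to pick among replica-holding nodes, which is where replica-based load balancing and fault tolerance come from.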
Petascale files tend to be accessed with access locality
Local I/O aggressively utilized for scalable I/O throughput
Target architecture – cluster of clusters, each node equipped with large-scale fast local disks
File affinity process scheduling
Almost disk-owner computation
Gfarm parallel I/O extension – local file view
MPI-IO insufficient, especially for irregular and dynamically distributed data
Each parallel process accesses only its own file fragment
Flexible and portable management in a single system image
Grid-aware parallel I/O library
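The local file view can be sketched as a path mapping. This is a conceptual illustration with an assumed fragment naming scheme (`<name>.<rank>` under an assumed local spool directory), not the actual Gfarm I/O library: when a parallel process opens a logical Gfarm file, it is handed only the fragment it owns, so its reads and writes stay on the local disk.

```python
# Conceptual sketch of "local file view" (hypothetical paths and naming;
# not the real Gfarm library): rank r of a parallel job sees only
# fragment r of the logical file.

SPOOL = "/var/spool/gfarm"  # assumed node-local fragment directory

def local_file_view(logical_name, rank):
    """Map a logical filename to the fragment path owned by this rank."""
    return f"{SPOOL}/{logical_name}.{rank}"

# Process of rank 2 in a 64-way parallel job:
print(local_file_view("higgs.events", 2))
# -> /var/spool/gfarm/higgs.events.2
```

After this mapping, ordinary Unix read/write on the returned path suffices, which is why each process's I/O never crosses the network.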
File replicas on an individual fragment basis
Re-generation of lost or needed write-once files using a command history
Program and input files stored in the fault-tolerant Gfarm filesystem
Program should be deterministic
Re-generation also supports the GriPhyN virtual data concept
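The regeneration idea can be captured in a short sketch. Everything here is illustrative (the `run` callable stands in for actually re-executing a job; the history format is assumed): because the producing command is deterministic and its program and inputs survive in the filesystem, a lost write-once file can be recomputed instead of restored from a backup.

```python
# Hedged sketch of regeneration from a command history (not Gfarm code).

def regenerate(lost_file, history, run):
    """history: file -> command that produced it.
    run: callable that re-executes a command (a stand-in here).
    Re-computes a lost write-once file; deterministic programs only."""
    if lost_file not in history:
        raise KeyError(f"no command history for {lost_file}: unrecoverable")
    return run(history[lost_file])

history = {"higgs.events": "gfrun sim higgs.in"}  # hypothetical entry
out = regenerate("higgs.events", history, run=lambda cmd: f"re-ran: {cmd}")
print(out)
# -> re-ran: gfrun sim higgs.in
```

This is the same reasoning behind the GriPhyN virtual data concept: a file is interchangeable with the deterministic recipe that produces it.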
Specifics of Gfarm APIs and Commands
(For details, please see the paper at http://datafarm.apgrid.org/)
gfdf – displays the number of free disk blocks and files
gfsck – checks and repairs filesystems
Porting Legacy or Commercial Applications
Hook syscalls open(), close(), write(), … to utilize the Gfarm filesystem
Intercepted syscalls are executed in local file view
This allows thousands of files to be grouped automatically and processed in parallel
Quick start for legacy apps (but some portability problems have to be coped with)
gfreg command
After creation of thousands of files, gfreg explicitly groups files into a single Gfarm file
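The syscall-hooking trick can be sketched conceptually. The real mechanism intercepts open()/close()/write() at the C level (e.g. in a preloaded library); the Python below is only an illustration of the path redirection involved, and the mount prefix, fragment naming, and function names are all assumptions.

```python
# Conceptual sketch of the hooked open(): a path under an assumed Gfarm
# prefix is redirected to this process's node-local fragment, so an
# unmodified legacy app transparently runs in local file view.

GFARM_PREFIX = "/gfarm/"          # assumed logical mount prefix
SPOOL = "/var/spool/gfarm"        # assumed node-local fragment directory

def hooked_open(path, mode, rank, real_open):
    """Redirect Gfarm paths to the local fragment; pass others through."""
    if path.startswith(GFARM_PREFIX):
        logical = path[len(GFARM_PREFIX):]
        path = f"{SPOOL}/{logical}.{rank}"   # this rank's own fragment
    return real_open(path, mode)

# A legacy app on rank 3 calling open("/gfarm/out", "w") actually opens
# the node-local fragment, with no source changes to the app:
print(hooked_open("/gfarm/out", "w", 3, real_open=lambda p, m: p))
# -> /var/spool/gfarm/out.3
```

Each rank thus writes its own fragment; afterwards a command like gfreg can register the thousands of per-rank files as one logical Gfarm file.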
Initial Performance Evaluation – Presto III Gfarm Development Cluster (Prototype)
Dual Athlon MP 1.2 GHz nodes x 128
768 MB memory and 200 GB HDD each; 98 GB memory and 25 TB storage in total
Myrinet 2000, full bandwidth, 64-bit PCI
614 GFLOPS peak; 331.7 GFLOPS Linpack for Top500
In operation since October 2001
[Chart: Gfarm parallel write and read bandwidth vs. independent Unix write and read]
Initial Performance Evaluation (2) – Parallel I/O (file affinity scheduling and local file view)
1742 MB/s on writes, 1974 MB/s on reads
Presto III, 64 nodes as compute & I/O nodes, 640 GB of data
HDDs: 105 MB/s on writes, 85 MB/s on reads
10-node experimental cluster (to be installed by July 2002)
10U + GbE switch
5 TB in total, RAID with 80 disks
1050 MB/s on writes, 850 MB/s on reads
GridFTP – Grid security and parallel streams
Replica management – replica catalog and GridFTP
Kangaroo – Condor approach: latency hiding to utilize local disks as a cache; no solution for bandwidth
Gfarm is the first attempt at a cluster-of-cluster filesystem on the Grid
File replicas
File affinity scheduling, …
Grid Datafarm Development Schedule
Data streaming
Deploy on the development Gfarm cluster
Second prototype, 2002(-2003)
Grid security infrastructure
Load balance, fault tolerance, scalability
Multiple metaservers with coherent cache
Evaluation in a cluster-of-cluster environment
Study of replication and scheduling policies
ATLAS full-geometry Geant4 simulation (1M events)
Accelerated by the national "Advanced Network Computing initiative" (US$10M over 5 years)
Full production development (2004-2005 and beyond)
Deploy on the production Gfarm cluster
Petascale online storage
Synchronize with the ATLAS schedule
ATLAS-Japan Tier-1 RC as the "prime customer"
[Figure: network map — KEK and AIST/TACC (5 km apart) on the Tsukuba WAN; U-Tokyo (60 km) and TITECH (80 km) via SuperSINET; 10 Gbps and 10 x N Gbps links]
Summary
Petascale data intensive computing wave
Key technology: Grid and cluster
Grid Datafarm is an architecture for:
Online >10 PB storage, >TB/s I/O bandwidth
Efficient sharing on the Grid
Fault tolerance
Initial performance evaluation shows scalable performance:
1742 MB/s on writes on 64 cluster nodes
1974 MB/s on reads on 64 cluster nodes
443 MB/s using 23 parallel streams
Metaserver overhead is negligible
I/O bandwidth limited not by the network but by disk I/O (good!)