Federating Archives in the DELAMAN Network
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE/SRB
Storage Resource Broker
Distributed Data Management Using Data Grids
• Build a shared collection
• Authenticate users independently of the storage systems
• Control access independently of the storage systems
• Organize the file name space independently of the storage systems
• Manage context (metadata) independently of content (files)
• Maintain consistency between context and operations on content
Storage Resource Broker
• Generic distributed data management technology
  • Data grids - sharing
  • Digital libraries - publication
  • Persistent archives - preservation
• Federated server architecture / thin client
  • 250,000 lines of "C" code
  • Supports all major compute and storage platforms
• All requirements listed on the following Scenario slides are supported
Scenario 1 - Data Migration
• Provide URIDs (logical file names) that are independent of the storage system
• Provide metadata for each file
• Support browse and discovery on the collection hierarchy
• Support access interfaces to the data
• Support registration of existing files into a shared collection
• Single sign-on environment
  • GSI / challenge response / tickets
Managing Distributed Data
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Naming conventions provided by storage systems
Data Grids Provide a Level of Indirection for Each Naming Convention
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space (URID)
• Logical context (metadata)
• Control/consistency constraints
Data Collection
Data Access Methods (C library, Unix, Web Browser)
Data is organized as a shared collection
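The level of indirection above can be sketched in a few lines of Python. This is an illustrative toy, not the SRB API: the `Catalog` class, its method names, and the example paths and resource names are all hypothetical. It shows a logical file name (URID) staying fixed while the physical location behind it changes.

```python
# Toy sketch of a logical file name space; not the SRB API.
# All class, method, resource, and path names here are hypothetical.

class Catalog:
    def __init__(self):
        # logical file name (URID) -> list of (resource, physical path)
        self._replicas = {}

    def register(self, urid, resource, path):
        """Record a physical copy of a logical file."""
        self._replicas.setdefault(urid, []).append((resource, path))

    def migrate(self, urid, old_resource, new_resource, new_path):
        """Swap one physical copy for another; the URID does not change."""
        self._replicas[urid] = [
            (new_resource, new_path) if res == old_resource else (res, p)
            for res, p in self._replicas[urid]
        ]

    def resolve(self, urid):
        """Return every physical location known for a logical name."""
        return list(self._replicas[urid])


catalog = Catalog()
urid = "/delaman/audio/rec001.wav"
catalog.register(urid, "tape-archive", "/hpss/a/17.dat")
catalog.migrate(urid, "tape-archive", "disk-cache", "/d1/rec001.wav")
locations = catalog.resolve(urid)
```

Applications keep using the same `urid` before and after the migration; only the catalog entry changes.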
Provide Context for Data
• Properties of files
  • Provenance - source
  • Descriptive attributes
  • State information resulting from operations on files
• Organize properties as metadata in a collection hierarchy
  • Define operations on file properties
  • Manage state information - location, replicas, containers, checksums
• Separate context management from content management
  • Maintain consistency of context as operations are done on content
• Support context management
  • Schema extension, automated SQL generation, bulk metadata load
  • Metadata extraction through a remote procedure parsing the file
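One way to read "maintain consistency of context as operations are done on content" is that every write refreshes the state information in the same step. The sketch below assumes two in-memory dictionaries standing in for a storage system and a metadata catalog; the function names, attribute names, and the choice of SHA-256 are assumptions for illustration, not part of SRB.

```python
import hashlib

# Hypothetical in-memory stand-ins for a storage system and a metadata catalog.
content = {}   # logical name -> bytes
context = {}   # logical name -> dict of descriptive and state attributes


def put(name, data, **descriptive):
    """Store content and refresh its state information in one step."""
    content[name] = data
    entry = context.setdefault(name, {})
    entry.update(descriptive)                            # provenance, descriptive attributes
    entry["checksum"] = hashlib.sha256(data).hexdigest() # state information
    entry["size"] = len(data)


def verify(name):
    """Check that the recorded checksum still matches the stored bytes."""
    return hashlib.sha256(content[name]).hexdigest() == context[name]["checksum"]


put("/coll/report.txt", b"first draft", source="field survey")
put("/coll/report.txt", b"second draft")
consistent = verify("/coll/report.txt")
```

Because `put` is the only path that changes content, the checksum and size never lag behind the bytes, while descriptive attributes such as `source` survive later writes.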
Federated Server Architecture
[Figure: a Read Application, two SRB servers, two SRB agents, and the MCAT, with the request path numbered 1 to 6. The application identifies data by logical name or by attribute condition.]
1. Logical-to-physical mapping
2. Identification of replicas
3. Access & audit control
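The three functions listed above can be illustrated with a small Python sketch of a read request. The catalog layout, the user names, and the rule of returning the first replica are all assumptions made for the example; the real SRB server and MCAT interaction is not shown in the slides.

```python
# Hypothetical catalog contents; the layout is an assumption for illustration.
CATALOG = {
    "/coll/scan01.tif": {
        "replicas": [("sdsc-disk", "/vol1/0001"), ("umd-tape", "/t/9f3a")],
        "readers": {"alice"},
        "attrs": {"type": "image"},
    },
}
AUDIT_LOG = []


def find(logical_name=None, condition=None):
    """Select catalog entries by logical name or by an attribute condition."""
    if logical_name is not None:
        return [logical_name] if logical_name in CATALOG else []
    condition = condition or {}
    return [name for name, entry in CATALOG.items()
            if all(entry["attrs"].get(k) == v for k, v in condition.items())]


def read_location(user, logical_name=None, condition=None):
    """Return (name, resource, path) for the first permitted match, or None."""
    for name in find(logical_name, condition):
        entry = CATALOG[name]
        allowed = user in entry["readers"]          # access control
        AUDIT_LOG.append((user, name, allowed))     # audit trail
        if allowed:
            resource, path = entry["replicas"][0]   # choose among the replicas
            return name, resource, path
    return None


hit = read_location("alice", condition={"type": "image"})
miss = read_location("bob", logical_name="/coll/scan01.tif")
```

Both the permitted and the refused request leave an entry in `AUDIT_LOG`.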
• Bulk load
  • Create a copy of the file on a data grid storage repository
• Bulk unload
  • Provide containers to hold small files and pointers to each file location
• Bulk delete
  • Mark as deleted in metadata catalog
  • After specified interval, delete file
• Bulk metadata load
  • Support parsing of metadata from a remote file at remote storage
• Requests for bulk operations for access control setting, …
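The two-step bulk delete (mark in the catalog, remove after an interval) can be sketched as follows. The dictionary layout, the function names, and the use of plain integer timestamps are hypothetical choices for the example.

```python
# Hypothetical layout: catalog maps logical name -> {"deleted_at": time or None},
# storage maps logical name -> file bytes. Times are plain seconds.

def bulk_delete(catalog, names, now):
    """Step 1: mark entries as deleted; the files themselves are untouched."""
    for name in names:
        catalog[name]["deleted_at"] = now


def purge(catalog, storage, now, retention_seconds):
    """Step 2: remove files whose deletion mark is older than the interval."""
    expired = [name for name, entry in catalog.items()
               if entry["deleted_at"] is not None
               and now - entry["deleted_at"] >= retention_seconds]
    for name in expired:
        storage.pop(name, None)
        del catalog[name]
    return expired


catalog = {"a": {"deleted_at": None}, "b": {"deleted_at": None}}
storage = {"a": b"data-a", "b": b"data-b"}
bulk_delete(catalog, ["a"], now=1000)
early = purge(catalog, storage, now=1500, retention_seconds=3600)
late = purge(catalog, storage, now=5000, retention_seconds=3600)
```

Until the interval has passed, a marked file can still be recovered by clearing `deleted_at`.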
Scenario 3 - Community Access
• Within the shared collection, the digital entities are owned and managed by the data grid
  • Files, URLs, SQL commands, database binary large objects can be registered into the shared collection
• Access controls for
  • Files / metadata / storage systems
• Access controls are defined for multiple roles
  • Schema extension, create new metadata
  • Modify metadata
  • Add annotations
  • Turn on audit trails
  • Write data
  • Read data
• Uniform access mechanisms to data across all storage systems
  • Support for queries on databases
  • Support for formatting results (XML, HTML)
  • Support audit trails, encryption
• Support user-defined collection hierarchy
  • Soft links (build a logical collection of pointers to data within the data grid)
• Support for multiple types of discovery
  • By URID (Logical File Name)
  • By query on metadata (may be unique to a single file)
  • By GUID (handle system)
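The three discovery paths can be sketched against a small relational catalog. The example below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample values are assumptions for illustration and do not reflect the MCAT schema.

```python
import sqlite3

# Hypothetical two-table catalog held in memory.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity (urid TEXT PRIMARY KEY, guid TEXT UNIQUE)")
db.execute("CREATE TABLE metadata (urid TEXT, attr TEXT, value TEXT)")
db.executemany("INSERT INTO entity VALUES (?, ?)", [
    ("/coll/a.wav", "guid-0001"),
    ("/coll/b.wav", "guid-0002"),
])
db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    ("/coll/a.wav", "speaker", "S1"),
    ("/coll/b.wav", "speaker", "S2"),
])


def by_urid(urid):
    """Discovery by logical file name."""
    row = db.execute("SELECT urid FROM entity WHERE urid = ?", (urid,)).fetchone()
    return row[0] if row else None


def by_guid(guid):
    """Discovery by globally unique identifier."""
    row = db.execute("SELECT urid FROM entity WHERE guid = ?", (guid,)).fetchone()
    return row[0] if row else None


def by_metadata(attr, value):
    """Discovery by a query on one metadata attribute."""
    rows = db.execute(
        "SELECT urid FROM metadata WHERE attr = ? AND value = ? ORDER BY urid",
        (attr, value))
    return [row[0] for row in rows]


found = by_metadata("speaker", "S1")
```

All three functions return logical names, so the caller never sees a physical storage path.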
Scenario 5 - Education
• SRB is used to build digital libraries
  • Assemble class material
  • Manage student reports
  • Display material through web browsers
• Federation of digital libraries
  • Controlled sharing across independent data grids or digital libraries
  • Support for cross-registration of logical name spaces
  • Authentication done by "home" data grid
  • Access controls managed by both data grids
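A rough Python sketch of a federated read follows. It assumes a toy password check, a `user@grid` naming scheme for cross-registered users, and a per-file reader list on the remote grid; none of these details come from the slides. The home grid authenticates the user, and the remote grid checks cross-registration and its own access list.

```python
# Toy model of two federated data grids; all names and rules are hypothetical.

class DataGrid:
    def __init__(self, name):
        self.name = name
        self.passwords = {}        # local user -> secret (toy authentication)
        self.remote_users = set()  # cross-registered "user@grid" names
        self.acl = {}              # logical name -> set of "user@grid" readers

    def authenticate(self, user, secret):
        return self.passwords.get(user) == secret

    def cross_register_user(self, user, home):
        self.remote_users.add(f"{user}@{home.name}")


def federated_read(user, secret, home, remote, logical_name):
    """Allow a read on `remote` only if every check passes."""
    qualified = f"{user}@{home.name}"
    if not home.authenticate(user, secret):     # authentication by the home grid
        return False
    if qualified not in remote.remote_users:    # user must be cross-registered
        return False
    return qualified in remote.acl.get(logical_name, set())  # remote access control


grid_a, grid_b = DataGrid("A"), DataGrid("B")
grid_a.passwords["carol"] = "s3cret"
grid_b.cross_register_user("carol", grid_a)
grid_b.acl["/B/coll/doc1"] = {"carol@A"}
granted = federated_read("carol", "s3cret", grid_a, grid_b, "/B/coll/doc1")
refused = federated_read("carol", "wrong", grid_a, grid_b, "/B/coll/doc1")
```

The remote grid never sees the user's secret; it relies on the home grid's answer and on its own lists.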
Federation
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection A
Access controls and consistency constraints on cross registration of digital entities
• Comments can be added by owner
• Annotations can be added by authorized persons
  • Annotations marked by person name, date
  • Can restrict annotation right by group
• Can choose to create explicit metadata attributes to manage comments
  • Can store multiple comments per object
  • Can search across metadata
• Or can use digital library interfaces to manage comments
Sites Using the SRB
CiteSeer, Penn State
City Univ. of New York
Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University
Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)
Storage Resource Broker Collections at SDSC (11/2/2004)
Collection | GBs of data stored | Number of files | Number of users
Data Grid
NSF/ITR - National Virtual Observatory | 53,858 | 9,536,698 | 80
NSF - National Partnership for Advanced Computational Infrastructure | 24,738 | 5,754,890 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
NSF/NPACI - Joint Center for Structural Genomics | 5,228 | 652,031 | 50
NSF/NPACI - Biology and Environmental collections | 8,851 | 33,340 | 67
NIH - Biomedical Informatics Research Network | 6,002 | 4,107,508 | 214
Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 253 | 8,436 | 36
NSF/NPACI - Grid Portal | 2,211 | 51,227 | 407
NIH - Alliance for Cell Signaling microarray data | 856 | 62,291 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,080 | 808,901 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 91,040 | 1,791,494 | 62
Persistent Archive
UCSD Libraries archive | 128 | 204,828 | 29
NARA - Research Prototype Persistent Archive | 166 | 316,813 | 58
NSF - National Science Digital Library persistent archive | 3,571 | 26,908,350 | 122
TOTAL | 328 TB | 51 million | 4,900
Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  • Effort started in 1996 as a DARPA-funded project
  • Now supports over 30 national/international projects
• Development team of 12 staff is led by
  • Michael Wan, data management systems
  • Arcot Rajasekar, information management systems
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
  • Erik Vandekieft
  • Reena Mathew
  • Xi (Cynthia) Sheng
  • Allen Ding
  • Grace Lin
  • Qiao Xin
  • Daniel Moore
  • Ethan Chen
  • Jon Weinburg
National Virtual Observatory
Provide access to large star catalogs and large image sky surveys
• 2MASS
• SDSS
• DPOSS
• USNO-B
• Macho
National Science Digital Library
Web Interface to Persistent Archive
Preserve educational material that has been registered into a central repository at Cornell through URLs
• Crawl web and retrieve material, 10 levels of indirection
• Convert internal URLs into data grid handles
• Aggregate files into containers for storage
• Preserve using SRB data grid technology
• Currently housing over 26 million files
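Two of the steps above, converting internal URLs into data grid handles and aggregating files into containers, can be sketched in Python. The `srb:/...` handle format, the example URLs, the regular expression for `href` attributes, and the simple first-fit packing rule are all assumptions for illustration; the slides do not describe how the NSDL archive implements either step.

```python
import re


def rewrite_links(html, handles):
    """Replace href targets that have a registered handle; leave others alone."""
    def swap(match):
        url = match.group(2)
        return match.group(1) + handles.get(url, url) + match.group(3)
    return re.sub(r'(href=")([^"]*)(")', swap, html)


def pack(files, capacity):
    """Group (name, size) pairs, in order, into containers of at most `capacity`.

    A file larger than `capacity` is placed in a container of its own.
    """
    containers, current, used = [], [], 0
    for name, size in files:
        if current and used + size > capacity:
            containers.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        containers.append(current)
    return containers


# Hypothetical handle table and page content.
handles = {"http://example.org/lesson/fig1.png": "srb:/nsdl/0001/fig1.png"}
page = ('<a href="http://example.org/lesson/fig1.png">figure</a> '
        '<a href="http://other.example/x">x</a>')
rewritten = rewrite_links(page, handles)
groups = pack([("a", 40), ("b", 50), ("c", 30), ("d", 200)], capacity=100)
```

Links without a registered handle are left pointing at their original URL, so a partially crawled page still renders.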
National Archives and Records Administration - Research Prototype Persistent Archive
NARA (MCAT): Principal copy stored at NARA with complete metadata catalog
U Md (MCAT): Replicated copy at U Md for improved access, load balancing and disaster recovery
SDSC (MCAT): Deep archive at SDSC, no user access, but complete copy
Demonstrate preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
  • Replication of data
  • Federation of catalogs
• Management of preservation metadata
• Scalability