Data Grids, Digital Libraries, and Persistent Archives
Post on 07-Jan-2016
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure1
Data Grids, Digital Libraries, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
moore@sdsc.edu
Archive Definition
• Computer science - an archive is the hardware and software infrastructure used to manage data
• Preservation community - an archive is the material that is being preserved
Persistent Archive
• Software system that manages evolution of the hardware and software infrastructure
– A persistent archive preserves the authenticity and integrity of digital entities while the underlying technology evolves
• Combination of the material that is being preserved and the infrastructure used to preserve the material
Data Grid
• Grid community definition
– The infrastructure used to manage distributed data as a collection
• Digital library and preservation community definition
– The distributed data that is being organized and managed as a collection
• A data grid is both the mechanism that supports sharing of data and the collection that is being shared
Data Sharing
• Management of access controls on local resources to share data
– Put controls on resources
• Creation of a collection that is being shared across distributed resources
– Put controls on collection
• The SRB data grid does both: it enforces controls on resources and on collections (data and metadata)
Topics
• Data Grids - managing distributed data
– Distributed data management for a project
• Digital Libraries - publication of data
– Management of collection hierarchies
• Persistent Archives - preservation of data
– Management of technology evolution
• Storage Resource Broker example
– Currently supporting all three (seven) data management environments
Data Management Systems (Supported by Storage Resource Broker)
• Data collecting
– Sensor systems, object ring buffers and portals
• Data organization
– Collections, manage data context
• Data sharing
– Data grids, manage heterogeneity of resources
• Data publication
– Digital libraries, support discovery
• Data preservation
– Persistent archives, manage technology evolution
• Data analysis
– Processing pipelines, manage knowledge extraction
Data Management Systems
• Data grid for managing distributed data
– Latency management for bulk analyses of collections
– Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
– Curation services for managing collections
– Descriptive metadata for discovery
• Persistent archive to manage technology evolution
– Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
Provide Context for Data
• Properties of files
– Provenance - source
– Descriptive attributes
– Structure
• Organize properties as metadata in a collection hierarchy
– Define operations on file properties
– Manage state information - location, replicas, containers
• Separate context management from content management
– Maintain consistency of context as operations are done on content
Data Grids
• Software systems that manage distributed data
• Control global name spaces for
– Resources
– Users
– Files
– Metadata context
• Provide standard operations on each name space
• Provide single sign-on authentication, collection management, latency management, replication, and federation
• Generic distributed data management technology
Managing Distributed Data
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Naming conventions provided by storage systems
Data Grids Provide a Level of Indirection for Each Naming Convention
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection
Data Access Methods (C library, Unix, Web Browser)
Data is organized as a collection
Logical Name Spaces
• Storage resources
– Logical names for managing collections of resources
• User names (user-name / domain / data grid)
– Distinguished names for users to manage access controls
• Digital Entities (files, blobs, structured data, …)
– Logical name space for global identifiers for files
• Context - Metadata attributes
– Standard metadata attributes, Dublin Core
– State information resulting from data grid operations
– User-defined metadata
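The level of indirection behind these logical name spaces can be sketched as a small catalog that maps a location-independent logical file name to its physical replicas and its metadata. This is an illustrative sketch only, not the SRB or MCAT schema or API: the names `LogicalNameSpace`, `Replica`, `Entry`, and the example paths and resource names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a digital entity (hypothetical fields)."""
    resource: str       # logical resource name where the copy lives
    physical_path: str  # name assigned by the underlying storage system


@dataclass
class Entry:
    """Catalog record for one logical file name."""
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # context: attributes, state


class LogicalNameSpace:
    """Maps location-independent logical names to physical replicas."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        """Attach a physical copy (and optional metadata) to a logical name."""
        entry = self._entries.setdefault(logical_name, Entry())
        entry.replicas.append(Replica(resource, physical_path))
        entry.metadata.update(metadata)

    def resolve(self, logical_name):
        """Return every physical replica behind a logical name."""
        return list(self._entries[logical_name].replicas)

    def query(self, **conditions):
        """Find logical names whose metadata matches all conditions."""
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in conditions.items())
        )


ns = LogicalNameSpace()
ns.register("/home/demo/image.tif", "archive-a", "/tape/0001/img", creator="demo")
ns.register("/home/demo/image.tif", "disk-b", "/vol2/x/image.tif")
```

In this sketch the application only ever sees `/home/demo/image.tif`; the physical paths can change or multiply without the logical name changing.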
Logical Resource Name
• Represents a list of physical resources
• Operations on the logical resource name result in operations on the list of physical resources
– Load leveling - write to the next physical resource in the list
– Fault tolerance - write to “k” of “n” physical resources
– Replication - write to each physical resource
– Compound resource - write to the disk cache in front of the tape archive
– Federated resource - write to the controlled resource in another data grid
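Three of the policies listed above (load leveling, fault tolerance, replication) can be sketched as operations on a list of physical resources. This is a minimal illustration under stated assumptions, not SRB code: each physical resource is modeled as a plain Python dict, and the class name `LogicalResource`, its method names, and the `unavailable` parameter are hypothetical.

```python
import itertools


class LogicalResource:
    """A logical resource name that stands for a list of physical resources."""

    def __init__(self, physical_resources):
        # Each physical resource is modeled as a dict: file name -> bytes.
        self.physical = list(physical_resources)
        self._next = itertools.cycle(range(len(self.physical)))

    def write_load_leveled(self, name, data):
        """Load leveling: write to the next physical resource in the list."""
        index = next(self._next)
        self.physical[index][name] = data
        return [index]

    def write_replicated(self, name, data):
        """Replication: write to each physical resource."""
        for store in self.physical:
            store[name] = data
        return list(range(len(self.physical)))

    def write_k_of_n(self, name, data, k, unavailable=()):
        """Fault tolerance: succeed once k of the n resources hold the data."""
        written = []
        for index, store in enumerate(self.physical):
            if index in unavailable:
                continue  # simulate a resource that is down
            store[name] = data
            written.append(index)
            if len(written) == k:
                return written
        # Copies already written are left in place in this sketch.
        raise OSError(f"only {len(written)} of the required {k} writes succeeded")
```

Each method returns the indices of the physical resources that received the data, so a caller could record them as replica locations in a catalog.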
Storage Repository Virtualization
Archive Database File System
User Application
How does one access data stored on multiple systems?
Storage Repository Virtualization (Standard Operations on Logical Resource Names)
Archive Database File System
Common set of operations for interacting with every type of storage repository
User Application
• Remote operations: Unix file system, latency management, procedures, transformations, third party transfer, filtering, queries
• Collective operations: load leveling, fault tolerance, replication
Logical File Name Abstraction
Archive at SDSC
Database at U Md
File System at NARA
User Application
How does one identify files stored on multiple systems?
Context Abstraction
Archive at SDSC
Database at U Md
File System at U Texas
Common naming convention and set of attributes for describing digital entities
User Application
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system
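Federated Server Architecture
[Figure: a read application presents a logical name or attribute condition to an SRB server. Components shown: two SRB servers, each with an SRB agent, the MCAT catalog, and storage resources R1 and R2. Arrows are numbered 1 to 6.]
• MCAT functions: 1. logical-to-physical mapping, 2. identification of replicas, 3. access & audit control
• Peer-to-peer brokering between SRB servers
• Server(s) spawning
• Data access
• Parallel data access (steps 5/6)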
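The three catalog functions named in the figure (logical-to-physical mapping, identification of replicas, access and audit control) can be sketched as a single read path. This is a simplified, single-process illustration and not the SRB protocol: the catalog layout, the `readers` field, and the function name `read` are assumptions made for the example.

```python
class AccessDenied(Exception):
    """Raised when the catalog does not grant the user read access."""


def read(catalog, storage, audit_log, user, logical_name):
    """Resolve a logical name through the catalog, then fetch one replica."""
    record = catalog[logical_name]            # 1. logical-to-physical mapping
    replicas = record["replicas"]             # 2. identification of replicas
    allowed = user in record["readers"]       # 3. access control ...
    audit_log.append((user, "read", logical_name, allowed))  # ... and audit
    if not allowed:
        raise AccessDenied(f"{user} may not read {logical_name}")
    for resource, path in replicas:           # try replicas in catalog order
        data = storage.get(resource, {}).get(path)
        if data is not None:
            return data
    raise FileNotFoundError(logical_name)


# Hypothetical catalog: one logical name with replicas on two resources.
catalog = {
    "/demo/report.txt": {
        "replicas": [("R1", "/a/report"), ("R2", "/b/report")],
        "readers": {"alice"},
    }
}
# R1 has lost its copy, so the read falls through to R2.
storage = {"R1": {}, "R2": {"/b/report": b"contents"}}
audit_log = []
```

Every attempt is appended to the audit log, including denied ones, before any data is touched.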
SRB Latency Management
[Figure: latency management mechanisms placed between source and destination across the network]
• Replication, server-initiated I/O
• Streaming, parallel I/O
• Caching, client-initiated I/O
• Remote proxies, staging
• Data aggregation, containers
• Prefetch
Latency Management - Bulk Operations
• Bulk register
– Create a logical name for a file
• Bulk load
– Create a copy of the file on a data grid storage repository
• Bulk unload
– Provide containers to hold small files and pointers to each file location
• Bulk delete
– Mark as deleted in metadata catalog
– After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …
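The container idea mentioned under bulk unload (one object holding many small files, plus a pointer to each file's location) can be sketched as follows. This is an assumption-laden illustration, not the SRB container format: the functions `build_container` and `read_from_container` and the (offset, length) index layout are hypothetical.

```python
import io


def build_container(files):
    """Pack small files into one blob plus an index of (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # pointer to the file location
        blob.write(data)
    return blob.getvalue(), index


def read_from_container(container, index, name):
    """Extract one file from the container using its pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


files = {"a.txt": b"alpha", "b.txt": b"be"}
container, index = build_container(files)
```

Moving the single container across the network pays the per-request latency once instead of once per small file, which is the motivation for aggregation.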
Data Grid Federation
• Link multiple independent data grids
– Coordinate metadata between independent metadata catalogs
• Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
– Peer-to-peer federations, data access
– Replication federations, shared resources
– Hierarchical federations, consistency constraints
• Tune data grid federation by implementing different consistency and access constraints
Federation
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection A
Access controls and consistency constraints on cross registration of digital entities
Federation Environments
[Figure: federation environments arranged by replication constraints, consistency constraints, and access constraints]
• Families: Peer-to-Peer Data Grids, Replication Data Grids, Hierarchical Data Grids
• Environments named in the figure: Free Floating, Occasional Interchange, Resource Interaction, User and Data Replica, Nomadic, Snow Flake, Master Slave, Replicated Data, Replicated Catalog, Deep Archive
• Constraint labels shown in the figure:
– Partial User-ID Sharing
– Partial Resource Sharing
– No Metadata Synch, Hierarchical Zone Organization, One Shared User-ID
– System Managed Replication, Connection From Any Zone, Complete Resource Sharing
– System Set Access Controls, System Controlled Complete Synch, Complete User-ID Sharing, System Managed Replication
– System Set Access Controls, System Controlled Partial Synch, No Resource Sharing
– Super Administrator Zone Control
– System Controlled Complete Synch, No User-ID Sharing
Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
– Effort started in 1996 as a DARPA funded project
– Now supports over 30 national/international projects
• Development team of 12 staff is led by
– Michael Wan, data management systems
– Arcot Rajasekar, information management systems
Data Grid Capabilities
• Data manipulation
– Containers
– Parallel I/O
– Firewall interactions
• Resource interactions
– Fault tolerance
– Load leveling
– Replication
• HIPAA security requirements
– Authentication of all users
– Access controls on data and metadata
– Audit trails
– Data encryption
– Centralized control
• Application interfaces
– C library, Shell commands, Java, Perl, Python, WSDL, workflow
Digital Library
• Collection hierarchy for organizing data
– User-defined metadata
– Collection level metadata
• Metadata manipulation
– Schema extension
– Bulk metadata processing
– Queries on metadata
– Access controls on metadata
– Views on collections
• Digital library APIs
– DSpace, Fedora, OAI-PMH, web browsers
– METS metadata XML schema
Persistent Archives
• Authenticity metadata
– Provenance
– User logical name space
• Integrity metadata
– Audit trails, checksums
– Access controls
• Consistency
– Context update on all content operations
• Persistency
– Infrastructure independence
• Storage repository abstraction
• Information repository abstraction
• Access abstraction (standard operations)
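The integrity metadata listed above (checksums and audit trails) can be illustrated with a short sketch that records a checksum at ingest and re-verifies it later, logging each operation. The slides do not name a checksum algorithm; SHA-256 is an assumption here, and the function names `ingest` and `verify` and the catalog layout are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest used as integrity metadata (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def ingest(catalog, audit, name, data, user):
    """Record the checksum and size when content enters the archive."""
    catalog[name] = {"sha256": checksum(data), "size": len(data)}
    audit.append((user, "ingest", name))


def verify(catalog, audit, name, data, user):
    """Compare stored and recomputed checksums; log the outcome."""
    ok = catalog[name]["sha256"] == checksum(data)
    audit.append((user, "verify", name, ok))
    return ok


catalog, audit = {}, []
ingest(catalog, audit, "/archive/doc1", b"original bytes", "curator")
```

Because the context (checksum, audit entry) is updated on every content operation, a later migration to new storage technology can be checked by calling `verify` on the migrated bytes.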
National Archives Persistent Archive
• NARA (MCAT): principal copy stored at NARA with complete metadata catalog
• U Md (MCAT): replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC (MCAT): deep archive at SDSC, no user access, but complete copy
[Figure: SRB architecture layers, from application to storage repositories]
• Application
• Access interfaces: C, C++, Java libraries; Unix shell; Linux I/O; DLL / Python, Perl; Java, NT browser; Kepler actors; OAI, WSDL, WSRF; HTTP, DSpace, OpenDAP
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space, Latency Management, Data Transport, Metadata Transport
• Catalog Abstraction
– Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Virtualization
– Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
– Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
– File systems: Unix, NT, Mac OSX
– ORB
• Data Grid Federation - zoneSRB
Examples of Extensibility
• Storage Repository Driver evolution
– Initially supported Unix file system
– Added archival access - UniTree, HPSS
– Added FTP/HTTP
– Added database blob access
– Added database table interface
– Added Windows file system
– Added project archives - Dcache, Castor, ADS
– Added Object Ring Buffer, Datascope
– Adding GridFTP version 3.3
• Database management evolution
– Postgres
– DB2
– Oracle
– Informix
– Sybase
– mySQL (most difficult port - no locks, no views, limited SQL)
Examples of Extensibility
• The 3 fundamental APIs are C library, shell commands, Java
– Other access mechanisms are ported on top of these interfaces
• API evolution
– Initial access through C library, Unix shell command
– Added iNQ Windows browser (C++ library)
– Added mySRB Web browser (C library and shell commands)
– Added Java (Jargon)
– Added Perl/Python load libraries (shell command)
– Added WSDL (Java)
– Added OAI-PMH, OpenDAP, DSpace digital library (Java)
– Added Kepler actors for dataflow access (Java)
– Adding GridFTP version 3.3 (C library)
Sites Using the SRB
• CiteSeer, Penn State
• City Univ. of New York
• Geospatial Environment, UCSD
• Drexel University
• EOSDIS Distributed Active, NASA Goddard
• Georgia Tech
• Kentucky State Libraries & Archives
• Library of Congress
• Los Alamos National Lab
• NASA Ames
• NASA Goddard Space Flight Center
• NCSA Grid Computing
• NIH (NCI Center for Bioinformatics)
• Penn State University
• Pittsburgh Supercomputing Center
• Purdue University, Indiana
• Stanford University
• TACC, University of Texas
• Texas A & M
• UC Santa Cruz
• UCLA
• UCSD Neuroscience
• University of Maryland
• University of Michigan, CAC department
• University of New Mexico
• University of Washington
• University of Wisconsin
• USC
• Yale University
• Academia Sinica, Taiwan
• ASCC, Computing Centre, Taiwan
• Australian National University
• Bedford Oceanography, Canada
• Bioinformatics Institute, Singapore
• CSIRO, Australia
• Data Storage Institute, Singapore
• EGEE, French National Center
• GeoForschungsZentrum, Germany
• James Cook University, Australia
• KEK High Energy Physics, Japan
• Max Planck Institute, Netherlands
• Parallab, Norway
• South Australian Advanced Computing
• UIB (Parallab), Norway
• University of Amsterdam
• University of Cambridge, Astronomy
• University of Cambridge, e-Science
• University of Edinburgh
• University of Genoa, Italy
• University of Hong Kong
• University of Manchester
• University of Oslo
• University of Southampton
• York Univ (UK)
Storage Resource Broker Collections at SDSC (11/2/2004)
Collection | GBs of data stored | Number of files | Number of users
Data Grid
NSF/ITR - National Virtual Observatory | 53,858 | 9,536,698 | 80
NSF - National Partnership for Advanced Computational Infrastructure | 24,738 | 5,754,890 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
NSF/NPACI - Joint Center for Structural Genomics | 5,228 | 652,031 | 50
NSF/NPACI - Biology and Environmental collections | 8,851 | 33,340 | 67
NSF - TeraGrid, ENZO Cosmology simulations | 121,550 | 1,096,947 | 3,247
NIH - Biomedical Informatics Research Network | 6,002 | 4,107,508 | 214
Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 253 | 8,436 | 36
NSF/NPACI - Grid Portal | 2,211 | 51,227 | 407
NIH - Alliance for Cell Signaling microarray data | 856 | 62,291 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,080 | 808,901 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 91,040 | 1,791,494 | 62
Persistent Archive
UCSD Libraries archive | 128 | 204,828 | 29
NARA - Research Prototype Persistent Archive | 166 | 316,813 | 58
NSF - National Science Digital Library persistent archive | 3,571 | 26,908,350 | 122
TOTAL | 328 TB | 51 million | 4,900
Grid Interfaces
• GSI, support versions 1, 2, 3, Java
• GridFTP version 3.3 interface to SRB collection
– Use GSI certificate to identify the user to the SRB
– Reference file by a SRB logical name space
– Use SRB access controls for allowed operations
– Initially support serial transport
– SRB supports 4 different firewall interaction protocols (client-driven parallel I/O, server-driven parallel I/O, bulk file registration, federated data grid access)
• GridFTP version 3.3 driver for SRB collection
– Store data at a remote site under the SRB ID
  • Data will be shareable through SRB access controls
– Store data at a remote site under user GSI certificate
  • Data will not be shareable through SRB access controls
Grid Interfaces
• Replica Location Service Interface
– Simon Metson <s.metson@bristol.ac.uk>
– GMCat mimics the LRC interface, enabling the files registered in an MCat to appear on the giggle framework (RLS).
– Available from http://tuber1.phy.bris.ac.uk:8080/GMCatWS3
– (also linked from the third party software on the SRB page)
• Storage Resource Manager
– SRM Version 1, SRB driver created to store data in SRM
– SRM Version 2, development effort to put SRM interface on top of SRB (Alasdair Earl)
– SRM Version 3, development effort to put SRM interface on top of SRB (Peter Kunszt)
Conclusion
• Distributed data management systems can be built on generic data grid infrastructure
– Data grids to support bulk access across remote sites
– Integration of data grid and digital library capabilities to manage massive data collections
– Federation of data grids to build international discipline-wide collections
SDSC SRB Team (left to right)
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
– Erik Vandekieft
– Reena Mathew
– Xi (Cynthia) Sheng
– Allen Ding
– Grace Lin
– Qiao Xin
– Daniel Moore
– Ethan Chen
– Jon Weinburg
• Supported by about 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)
For More Information
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html