The NERC DataGrid
Bryan Lawrence, BADC
David Boyd, Deputy Director, CLRC e-Science Centre
Kerstin Kleese, DL: Climate Database Expert
Roy Lowry, BODC: Marine Database Expert
Dean Williams, PCMDI: ESG Principal Investigator
Bob Drach, PCMDI: ESG Metadata Architecture
Mike Fiorino, PCMDI: Meteorologist
Acronym Summary:
PCMDI: Program for Climate Model Diagnosis and Intercomparison (US Department of Energy, Lawrence Livermore National Lab)
ESG: Earth System Grid
E-science should be involved with:
• delivering an enhanced metadata record of archived data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
• the ability to acquire large datasets in near real time (NRT).
• the automatic production of metadata, both by models, and where possible, by observing systems.
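As a rough illustration of how a shared 'dictionary' could let one query span several catalogues and return both metadata and data, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionary entries, the local parameter names, the record layout and the example values are invented and are not taken from BADC or BODC holdings.

```python
from dataclasses import dataclass

# Hypothetical shared dictionary: common term -> local parameter name per catalogue.
DICTIONARY = {
    "sea_surface_temperature": {"BADC": "sst", "BODC": "SSTEMP"},
    "air_temperature": {"BADC": "t2m"},
}

@dataclass
class Record:
    catalogue: str
    parameter: str  # local parameter name used by that catalogue
    title: str      # stands in for the metadata record
    values: list    # stands in for the data

# Invented catalogue contents, one list of records per data centre.
CATALOGUES = {
    "BADC": [
        Record("BADC", "sst", "Example analysis SST", [14.2, 14.5]),
        Record("BADC", "t2m", "Example 2 m temperature", [9.1]),
    ],
    "BODC": [
        Record("BODC", "SSTEMP", "Example cruise SST", [13.8]),
    ],
}

def query_all(common_term):
    """Translate one common term for each catalogue and collect matching records."""
    hits = []
    for name, records in CATALOGUES.items():
        local = DICTIONARY.get(common_term, {}).get(name)
        if local is None:
            continue  # this catalogue has no mapping for the term
        hits.extend(r for r in records if r.parameter == local)
    return hits
```

Calling `query_all("sea_surface_temperature")` returns records from both catalogues, each carrying its metadata (`title`) and its data (`values`), even though the two catalogues use different local names for the parameter.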
An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:
• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have
Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.
class-def defined carnivore
  subclass-of animal
  slot-constraint eats value-type animal
class-def defined herbivore
  subclass-of animal
  slot-constraint eats value-type plant OR (slot-constraint is-part-of has-value plant)
With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin building a thesaurus that can form the basis of one, and we hope to spin off a project to build one and integrate it into the NDG.
class-def animal
class-def plant
  subclass-of NOT animal
class-def tree
  subclass-of plant
class-def branch
  slot-constraint is-part-of has-value tree
class-def leaf
  slot-constraint is-part-of has-value branch
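To make the class definitions above more concrete, the following Python sketch encodes the subclass-of and is-part-of links as plain dictionaries and checks what a herbivore may eat. It is only an illustration, not an ontology language implementation. It assumes that is-part-of can be followed transitively (leaf, then branch, then tree), which the definitions above do not state, and the function names are made up.

```python
# subclass-of links taken from the class definitions above.
SUBCLASS_OF = {"tree": "plant", "carnivore": "animal", "herbivore": "animal"}

# is-part-of has-value links taken from the class definitions above.
PART_OF = {"branch": "tree", "leaf": "branch"}

def is_a(cls, ancestor):
    """True if cls equals ancestor or reaches it through subclass-of links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def is_plant_or_part_of_plant(cls):
    """Follow is-part-of links until a plant is reached or the chain ends.

    Assumption: is-part-of is treated as transitive here.
    """
    while cls is not None:
        if is_a(cls, "plant"):
            return True
        cls = PART_OF.get(cls)
    return False

def herbivore_may_eat(cls):
    # herbivore: eats value-type plant OR (is-part-of has-value plant)
    return is_plant_or_part_of_plant(cls)

def carnivore_may_eat(cls):
    # carnivore: eats value-type animal
    return is_a(cls, "animal")
```

With these links, `herbivore_may_eat("leaf")` is true because a leaf is part of a branch, which is part of a tree, which is a plant, while `herbivore_may_eat("carnivore")` is false.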
• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue (&/or database), and data (either in flat files or the database).
– so the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
– In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
• Major NDG design issue: developing appropriate data models, database schema and indexing strategies!
– This is not a generic problem, it will be specific to our datatypes.
– Technology needs to be public domain (i.e. free) for uptake!
– NDG approach to database technology will be developed in conjunction with
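The querying strategy described above (a centralised query on a partial index, followed if necessary by distributed queries mapped into each local schema) might look roughly like the Python sketch below. The site names, dataset identifiers, field names and the in-memory dictionaries are all hypothetical stand-ins for real catalogues and databases.

```python
# Central, partial index: dataset id -> (site holding it, indexed keywords).
CENTRAL_INDEX = {
    "ds1": ("site_a", {"temperature", "analysis"}),
    "ds2": ("site_b", {"temperature", "ocean"}),
}

# Per-site catalogues; each site uses its own field names (its local schema).
SITE_CATALOGUES = {
    "site_a": [{"id": "ds1", "year": 1999}],
    "site_b": [{"dataset": "ds2", "yr": 2001}],
}

# Mapping from common field names to each site's local field names.
SCHEMA_MAP = {
    "site_a": {"id": "id", "year": "year"},
    "site_b": {"id": "dataset", "year": "yr"},
}

def central_query(keyword):
    """Stage 1: use the central index to find candidate datasets and their sites."""
    return {ds: site for ds, (site, kws) in CENTRAL_INDEX.items() if keyword in kws}

def distributed_query(candidates, year):
    """Stage 2: ask each site about a field the central index does not hold."""
    results = []
    for ds, site in candidates.items():
        fields = SCHEMA_MAP[site]  # map the common query into the local schema
        for row in SITE_CATALOGUES[site]:
            if row[fields["id"]] == ds and row[fields["year"]] == year:
                results.append(ds)
    return results

def find(keyword, year=None):
    """Run the central query, then the distributed query only if it is needed."""
    candidates = central_query(keyword)
    if year is None:
        return sorted(candidates)
    return sorted(distributed_query(candidates, year))
```

Here `find("temperature")` is answered from the central index alone, whereas `find("temperature", 2001)` has to visit the sites because the year is not indexed centrally.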
NDG: Ingestion Tasks
NERC DataGrid: BADC Data Ingestion
BNL 03/01/02
[Diagram labels: Raw Data, Docs, Data Files; Generate XML for Granule Catalog; Generate XML for DataSet Catalog; Generate XML for Library Catalog]
Raw Data Input:
- dataset documentation
- binary data files
- possibly doc files with individual data files
Phase One: Produce "Self Describing Data" (e.g. NetCDF).
Phase Two: Generate XML Metadata.
Phase Three: Ingest Metadata into catalogues, and relocate files.
Normally desirable to directly ingest data already in self-describing format (along with additional documentation)!
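Phases two and three can be sketched in a few lines of Python using the standard library XML module. This is a sketch under stated assumptions: the attributes of a self-describing file are represented by a plain dictionary rather than read from a real NetCDF file, the element names `granule` and `attribute` are invented, the catalogue is an in-memory dictionary, and the file relocation is only simulated by computing the new path.

```python
import xml.etree.ElementTree as ET

def granule_xml(path, attrs):
    """Phase two: build an XML metadata record for one data file (a granule).

    The element and attribute names are hypothetical, not an NDG schema.
    """
    granule = ET.Element("granule", {"file": path})
    for name in sorted(attrs):
        child = ET.SubElement(granule, "attribute", {"name": name})
        child.text = str(attrs[name])
    return ET.tostring(granule, encoding="unicode")

def ingest(catalogue, path, attrs, archive_dir):
    """Phase three: record the metadata under the file's archive location.

    The relocation is simulated: only the new path is computed, no file is moved.
    """
    filename = path.rsplit("/", 1)[-1]
    new_path = f"{archive_dir}/{filename}"
    catalogue[new_path] = granule_xml(new_path, attrs)
    return new_path
```

For example, `ingest({}, "incoming/run1.nc", {"title": "Example run"}, "/archive/um")` returns `"/archive/um/run1.nc"` and stores the matching XML record in the catalogue under that path.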
Datasets supported at phase one will be existing 3D data such as ECMWF and Met Office UM analyses at the BADC, and UM simulation data in university groups.
Phase one depends on the integration of existing technologies:
- SRB
- LDAP
- CDAT/CDMS
- XML cataloging
- Live Access Server
- Cookies, and Unix authentication
- wrapping Z39.50 in WSDL (Zoom)?
along with a new request manager.
[Diagram labels: UM Data Files held in Uni Res. Grps; dataflow pathway; registry pathway; Ingres Metadata DB; Web Server Perl Scripts; Existing BADC Technology; NERC Metadata Gateway; Replace with Globus Giggle?]
Next steps include:
• Replacing the transport layers in the metadata gateway with SOAP
• Replacing the SGML in the metadata gateway with XML
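As a hedged illustration of what a SOAP-based transport could carry, the sketch below wraps a metadata search in a SOAP 1.1 envelope using Python's standard library. The `search` operation and `term` element are hypothetical and do not describe the actual metadata gateway interface; only the envelope namespace is the standard SOAP 1.1 one.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def soap_envelope(query_terms):
    """Wrap a list of search terms in a SOAP 1.1 Envelope/Body."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and element names, for illustration only.
    search = ET.SubElement(body, "search")
    for term in query_terms:
        ET.SubElement(search, "term").text = term
    return ET.tostring(envelope, encoding="unicode")
```

The resulting string could be sent as the body of an HTTP POST; because the payload is ordinary XML, the same tooling serves both the transport and the metadata records.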
• Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.
• No longer necessary to rely on personal contacts to locate and acquire data of interest if it’s held in the BADC/BODC.
• Key requirement for interdisciplinarity; the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.
• Other NERC designated data centres implementing NDG.
Take up by community:
• NDG software (but not necessarily graphics tools) in use in GODIVA project and in wider UK university community (including data repositories in research groups).
Risks Of Failure
• Someone else does it first – unlikely!
• Performance too slow for users!
– More cache and replication
– Improve database performance (UK DBTF!)
– Data-compression layer for XML
– Reduce scope and search depth (don’t want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
– Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
– re-deploy existing staff where possible
– Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK architecture
– Close relationship with PCMDI means we will be able to proceed
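Two of the performance mitigations listed above, more caching and a data-compression layer for XML, can be illustrated with the Python standard library. This is only a sketch: `fetch_metadata` fabricates a repetitive XML document in place of a real catalogue lookup, and the cache size is an arbitrary choice.

```python
import zlib
from functools import lru_cache

def pack(xml_text):
    """Compress an XML document before it is stored or sent."""
    return zlib.compress(xml_text.encode("utf-8"))

def unpack(blob):
    """Restore the original XML text from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")

# Counts how often the (simulated) slow lookup really runs.
CALLS = {"count": 0}

@lru_cache(maxsize=128)
def fetch_metadata(dataset_id):
    """Return compressed XML for a dataset, caching repeated requests.

    The XML built here is a made-up stand-in for a slow catalogue query.
    """
    CALLS["count"] += 1
    xml_text = "<dataset id='%s'>" % dataset_id + "<granule/>" * 200 + "</dataset>"
    return pack(xml_text)
```

Repeated calls with the same identifier are served from the cache, and verbose, repetitive XML of this kind generally compresses to a small fraction of its original size.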