The NERC DataGrid

Jan 19, 2016

zuzela

The NERC DataGrid. Bryan Lawrence (BADC), David Boyd, Kerstin Kleese, Roy Lowry, Dean Williams, Bob Drach, Mike Fiorino. PowerPoint presentation.

  • Bryan Lawrence, BADC
    - David Boyd, Deputy Director, CLRC e-Science centre
    - Kerstin Kleese, DL: Climate Database Expert
    - Roy Lowry, BODC: Marine Database Expert
    - Dean Williams, PCMDI: ESG Principal Investigator
    - Bob Drach, PCMDI: ESG Metadata Architecture
    - Mike Fiorino, PCMDI: Meteorologist

    Acronym Summary:
    - PCMDI: Program for Climate Model Diagnosis and Intercomparison (US Department of Energy, Lawrence Livermore National Laboratory)
    - ESG: Earth System Grid (US Grid Project: NCAR, Argonne, PCMDI, USC)

  • Outline
    - Motivation
    - The Earth System Grid
    - Definitions of portals and applications
    - Ontologies
    - Relations with other NERC e-science programmes
    - Architecture
    - Querying
    - Software stack
    - Initial steps and project management
    - Connectivity with other grid projects
    - Success and failure
    - Summary of what we are doing and the road to the future

  • The BADC: part of NCAS!
    The Role: Key words: Curation and Facilitation!
    http://www.badc.rl.ac.uk

  • Just under half of BADC users are NOT atmospheric scientists.

    [Chart: Registered Non-Atmospheric Science Users]
    Earth Observation 56; Earth Science 104; Engineering 132; Geography 152; Marine Sciences 132; Mathematics 42; Biological/Medical 126; Terrestrial/Fresh Water 160.
    (Atmospheric Chemistry 206; Atmospheric Physics 445.)

    [Chart: Number of Queries per Semester]
    Totals for the three semesters shown: 214, 331, 305.

    [Chart: User Queries by Discipline]

  • Motivation: Town meeting 2001
    E-science should be involved with:
    - delivering an enhanced metadata record of archived data.
    - 'dictionary' building.
    - building systems to translate data and link databases.
    - integrating computer and natural science communities.
    - the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
    - the ability to acquire large datasets in near real time (NRT).
    - the automatic production of metadata, both by models and, where possible, by observing systems.

    Summary from two of the four working groups!

  • Relevant to many stakeholders
    (Slide from Julia Slingo's introduction to CGAM as part of NCAS)

  • Motivation
    Page 22: NERC will ... ensure that Earth system science is underpinned by e-science investments to enable access, manipulation of data from diverse sources.

  • The Data Use Chain

  • NERC Metadata Gateway - SST
    Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
    And if I want to compare data from different locations?
    - multiple logins
    - multiple formats
    - discovery?

  • Searching: need comprehensive metadata!
    A priori, would any user know to look in the COAPEC data set? Earth system science means we have to remove these boundaries!
    - Detailed file-level metadata isn't visible, and so data mining applications are impossible.
    - Need ontologies to help queries match actual data descriptions.
    NB: Dynamic catalogues!
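    A minimal sketch of how a thesaurus could help a query match the wording actually used in data descriptions. The thesaurus entries, dataset names, descriptions and function names are all hypothetical, invented for illustration; they are not NDG components.

```python
# Hypothetical thesaurus: each preferred term maps to synonyms that a data
# provider might have used in its own metadata.
THESAURUS = {
    "sea surface temperature": {"sst", "surface temperature of the sea"},
    "precipitation": {"rainfall", "rain rate"},
}

# Hypothetical catalogue: dataset name -> free-text description.
CATALOGUE = {
    "coupled_model_run": "Coupled model output including SST fields",
    "rain_gauge_network": "Hourly rainfall observations",
}


def expand(term):
    """Return the term plus every synonym the thesaurus knows for it."""
    term = term.lower()
    for preferred, synonyms in THESAURUS.items():
        if term == preferred or term in synonyms:
            return {preferred} | synonyms
    return {term}


def search(term):
    """Return sorted dataset names whose description mentions any expanded term.

    Matching is a plain substring test, which is crude but enough to show
    the idea of query expansion.
    """
    terms = expand(term)
    hits = []
    for name, description in CATALOGUE.items():
        text = description.lower()
        if any(t in text for t in terms):
            hits.append(name)
    return sorted(hits)
```

    With these invented entries, a search for "sea surface temperature" finds the dataset whose description only says "SST", which a literal string match on the query would miss.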

  • What is an Ontology?
    An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:
    - Classes (general things) in the many domains of interest
    - The relationships that can exist among things
    - The properties (or attributes) those things may have

    Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.

  • Ontology Example:
    An example of part of an ontology defined using OIL (e.g. see "OIL in a Nutshell", D. Fensel et al.):

    ontology-definitions
      slot-def eats
        inverse is-eaten-by
      slot-def has-part
        inverse is-part-of
        properties transitive
      class-def defined carnivore
        subclass-of animal
        slot-constraint eats
          value-type animal
      class-def defined herbivore
        subclass-of animal
        slot-constraint eats
          value-type plant OR (slot-constraint is-part-of has-value plant)
      class-def animal
      class-def plant
        subclass-of NOT animal
      class-def tree
        subclass-of plant
      class-def branch
        slot-constraint is-part-of
          has-value tree
      class-def leaf
        slot-constraint is-part-of
          has-value branch
      class-def
      class-def giraffe
        subclass-of animal
        slot-constraint eats
          value-type leaf
      class-def lion
        subclass-of animal
        slot-constraint eats
          value-type herbivore

    With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin to build a thesaurus that can form the basis of one, and we do hope to spin off a project to build one and integrate it in the NDG.

    Relationships; Classes; Properties
    (OIL: Ontology Inference Layer)
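    The OIL fragment can be approximated in a few lines of Python to show what the subclass-of links and the transitive is-part-of slot buy you. This is a plain-dictionary sketch, not an OIL reasoner: it ignores OIL's logic-based semantics (for example the difference between defined and primitive classes), and the function names are assumptions made for illustration.

```python
# Class hierarchy from the OIL example: child -> parent.
SUBCLASS_OF = {
    "tree": "plant",
    "giraffe": "animal",
    "lion": "animal",
    "carnivore": "animal",
    "herbivore": "animal",
}

# Direct is-part-of facts from the OIL example: part -> whole.
PART_OF = {
    "leaf": "branch",
    "branch": "tree",
}


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or reaches it by following subclass-of links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def is_part_of(part, whole_class):
    """Transitive is-part-of: climb part -> whole links until one is a whole_class."""
    current = PART_OF.get(part)
    while current is not None:
        if is_subclass(current, whole_class):
            return True
        current = PART_OF.get(current)
    return False


def is_plant_matter(cls):
    """The herbivore diet from the example: a plant, or a part of a plant."""
    return is_subclass(cls, "plant") or is_part_of(cls, "plant")
```

    Because is-part-of is transitive and a tree is a plant, `is_plant_matter("leaf")` holds, so the giraffe's diet satisfies the herbivore definition even though the example never states that a giraffe is a herbivore. A real reasoner would draw that inference from the class definitions themselves.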

  • ESG: Example of a Web-based Data Portal
    ESG will provide support for:
    - large but simple data sets,
    - limited metadata, but not searchable.
    NDG will provide support for:
    - small-but-complex datasets.
    - data mining (searchable metadata).
    NDG is complementary to ESG!

  • Live Access Server (1): we will keep the basic structure, but gradually replace components.

  • Live Access Server (2): Data Request Structure

  • ESG: Example of a Client Application
    We will:
    - Provide Python-based classes for our observational data to complement the access to 3D gridded data.
    - Provide a web services wrapper so that other grid applications can access NDG data.

  • Applications and Portals

  • Relationship to GODIVA (Haines et al.)
    (Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)
    Figure: Architecture of the GODIVA Grid.
    NDG will:
    - improve data discovery tools for GODIVA (even for their own datasets).
    - provide metadata creation tools for GODIVA participants.
    - provide access to data held outside GODIVA participants.

  • ClimatePrediction.com
    CP.COM will need the NDG to make best use of observational data in evaluating their parameter space.

  • Mining on the Grid
    From Hinke's NASA IPG presentation at CEOS, Rome, May 2002

  • Data mining: Grid Miner Architecture
    From Hinke's NASA IPG presentation at CEOS, Rome, May 2002
    The devil is in the detail: how does the data mining agent get at the data? Need data mining client objects which can read specific datatypes and present themselves to agents!

  • Finding data: Querying!
    Requires databases of metadata and querying those databases.
    - Each part of the NDG will have an internal metadata catalogue (and/or database), and data (either in flat files or the database).
    - So the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
    - In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
    Major NDG design issue: developing appropriate data models, database schema and indexing strategies!
    - This is not a generic problem; it will be specific to our datatypes.
    - Technology needs to be public domain (i.e. free) for uptake!
    - NDG approach to database technology will be developed in conjunction with DBTF.
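    The two-stage strategy (centralised query on a partial index, then distributed querying) can be sketched as follows. Everything here is an assumption made for illustration: the `DataCentre` class, the function names and the dataset names in the usage note are hypothetical, and the sketch leaves out both replication and the mapping into a local database schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataCentre:
    """A hypothetical NDG node holding its own detailed metadata catalogue."""
    name: str
    # Local catalogue: dataset name -> set of parameter names it holds.
    catalogue: dict = field(default_factory=dict)

    def local_query(self, parameter):
        """Search this centre's full local catalogue."""
        return sorted(ds for ds, params in self.catalogue.items() if parameter in params)


def build_central_index(centres, indexed_parameters):
    """Build a partial index: parameter -> names of centres holding it.

    Only parameters listed in indexed_parameters are covered, mimicking a
    central index that is deliberately incomplete.
    """
    index = {}
    for centre in centres:
        for params in centre.catalogue.values():
            for p in params & indexed_parameters:
                index.setdefault(p, set()).add(centre.name)
    return index


def query(parameter, centres, central_index):
    """Stage 1: consult the central index. Stage 2: query centres directly.

    If the parameter is indexed, only the centres the index names are asked;
    otherwise every centre is queried (the distributed fallback).
    """
    if parameter in central_index:
        targets = [c for c in centres if c.name in central_index[parameter]]
    else:
        targets = centres
    results = {}
    for centre in targets:
        hits = centre.local_query(parameter)
        if hits:
            results[centre.name] = hits
    return results
```

    For example, with two centres named BADC and BODC holding invented datasets and only "temperature" in the central index, a query for "temperature" is routed by the index, while a query for "salinity" falls through to every centre's local catalogue.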

  • Query Pathway; software components

  • Information Structure
    PCMDI Components; NDG Components; Joint Interfaces; Existing Components

  • Simplified Software Stack
    Key point: make use of existing technology, allow component replacement with time!
    Achievable by: interface definition and integration.
    Note: Any application will be able to access our data services via the OGSA wrapper in the middleware.
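    Component replacement through interface definition can be illustrated with a small sketch. The `Transport` interface and the three classes are hypothetical names chosen for this example; the transports only return tagged strings and make no real network or SOAP calls.

```python
from typing import Protocol


class Transport(Protocol):
    """Interface every transport component must satisfy (hypothetical)."""

    def send(self, request: str) -> str: ...


class LegacyTransport:
    """Stand-in for an existing component."""

    def send(self, request: str) -> str:
        return f"legacy:{request}"


class SoapTransport:
    """Stand-in for a later replacement; no real SOAP call is made."""

    def send(self, request: str) -> str:
        return f"soap:{request}"


class MetadataGateway:
    """Depends only on the Transport interface, not on a concrete class."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def search(self, term: str) -> str:
        return self.transport.send(f"search {term}")
```

    Because `MetadataGateway` is written against the interface, swapping `LegacyTransport()` for `SoapTransport()` changes one constructor argument and nothing else in the gateway.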

  • Software stack

  • NDG: Ingestion Tasks

  • Draft Project Schedule
    Phase One Delivery

  • Metadata Gateway

  • Replace with Globus Giggle?
    Next steps include:
    - Replacing the transport layers in the metadata gateway with SOAP
    - Replacing the SGML in the metadata gateway with XML
    - etc.

  • Connectivity? Evolution! Innovation?

  • Indicators of Success
    Finding and making use of data:
    - Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.
    - No longer necessary to rely on personal contacts to locate and acquire data of interest if it's held in the BADC/BODC.
    - Key requirement for interdisciplinarity: the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.
    - Other NERC designated data centres implementing NDG.
    Take up by community:
    - NDG software (but not neces
