ESM Meeting, Cambridge 2003 BADC: badc.nerc.ac.uk, NERC DataGrid: www.ndg.badc.rl.ac.uk Finding and utilising atmospheric/oceanic data in a distributed world: the UK NERC DataGrid. Bryan Lawrence (Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others) NCAS/British Atmospheric Data Centre Rutherford Appleton Laboratory, CCLRC

Jan 29, 2016


Marylou Baker
Page 1

ESM Meeting, Cambridge 2003 BADC: badc.nerc.ac.uk, NERC DataGrid: www.ndg.badc.rl.ac.uk

Finding and utilising atmospheric/oceanic data in a distributed world: the UK NERC DataGrid.

Bryan Lawrence (Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others)

NCAS/British Atmospheric Data Centre

Rutherford Appleton Laboratory, CCLRC

Page 2

NDG Partners

• As funded, a partnership between:
– British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
– British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
– CLRC E-science Centre (Co-I: Kerstin Kleese)
– PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)

• The project has caught the imagination; extra funding now supports:
– A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
– NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing

• Major collaborators that are not directly funded will include:
– ClimatePrediction.net, GODIVA (NERC e-science projects)
– NCAS/CGAM: The Centre for Global Atmospheric Modelling at the University of Reading (via Lois Stenman-Clark and Katherine Bouton)

• Project will support HIGEM

Page 3

Outline

• Motivation
• The NDG Goals
• Working in a standards based world – ISO and OGC …
• NDG Metadata
• Summary

Page 4

The British Oceanographic Data Centre

(not for much longer, moving to a site on Liverpool University campus imminently)

Page 5

British Atmospheric Data Centre

The Role: Key words: Curation and Facilitation!

Page 6

Easily catalogued, but successful preservation?

One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time …

And yes they’ve both been translated … many times, it’s a shame the meanings are different …

Phaistos Disk, 1700BC

Page 7

NERC Metadata Gateway - SST

No clean handover from discovery to browse and use!

• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
– multiple logins
– multiple formats
– discovery?

Page 8

A priori would any user know to look in the COAPEC data set?

Earth system-science means we have to remove these boundaries!

• Detailed file-level metadata isn’t visible, so data mining applications are impossible.

NB: Dynamic catalogues!

How good is our metadata?

Page 9

Finding Data

The Goal: Very simple interface, hide the complex software!

Page 10

A newer “dataset”

The extreme relevance of this example from Amazon was pointed out by Jon Callahan (LAS project, PMEL)!

Page 11

PCMDI – Best practice!

(if you know where to look)

Final references are papers!

Is the information coupled to the datasets? What if I take a dataset home, and another, and another … and then forget which is which?

Can I ask the question: what datasets used the Semtner sea ice parameterisation?

Page 12

Huge variety of Data Sets

Page 13

Querying datasets

Complex Metadata, held in Ingres database: export DIF and Z39.50

No possibility of automatic data usage …

Page 14

Different types of data returned: Wallingford

Supporting very diverse user community: NetCDF is not enough …

Page 15

Modelling advances: Baseline Numbers

• T42 CCSM (current, 280km)
– 7.5GB/yr, 100 years -> .75TB

• T85 CCSM (140km)
– 29GB/yr, 100 years -> 2.9TB

• T170 CCSM (70km)
– 110GB/yr, 100 years -> 11TB

NCAR

Don Middleton
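
The baseline figures above are simple products of annual output and run length. A minimal Python sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and using only the per-year volumes quoted on the slide; the function and variable names are illustrative and not part of the original talk:

```python
# Annual output volumes (GB per simulated year) as quoted on the slide.
GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year: float, years: int = 100) -> float:
    """Total volume of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


for resolution, rate in GB_PER_YEAR.items():
    print(f"{resolution}: {run_volume_tb(rate):.2f} TB per 100-year run")
```

Changing `years` or adding a multiplier for an ensemble of runs gives a quick feel for how fast the archive requirement grows.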

Page 16

Capacity-related Improvements

Increased turnaround, model development, ensemble of runs

Increase by a factor of 10, linear data

• Current T42 CCSM
– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB

Page 17

Capability-related Improvements

Spatial Resolution: T42 -> T85 -> T170

Increase by factor of ~ 10-20, linear data

Temporal Resolution: Study diurnal cycle, 3 hour data

Increase by factor of ~ 4, linear data

CCM3 at T170 (70km)

Page 18

Capability-related Improvements

Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice

Increase by another factor of 2-3, data flat

Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…

Increase by another factor of 10+, linear data

Page 19

Model Improvement Wishlist

Grand Total:

Increase compute by a Factor O(1000-10000)

NCAR

Don Middleton

Page 20

Climate in 2010 – A Graphic Illustration

Figures from Gary Strand, NCAR, ESG website

Page 21

Summary thus far

Contentions:

• In the average atmospheric science project, about 1/3 of the time goes on data handling (getting, reformatting etc.)!

• The problem for earth system model projects is about to get worse – for everyone, from the initiator, to the archiver, to the analyst, to the contributor, to the improver.

• (Remember the documentation problem is growing exponentially too: new sub-components etc)

Page 22

The NERC DataGrid

[Architecture diagram; recoverable labels: NERC Grid; Wider Internet; BADC NDG Wrapper; BODC NDG Wrapper; Group NDG Wrapper; XML database (one per wrapper, and one for the portal); Online Data; tape robot; Research Group Data Sources; Satellite; Supercomputer; Software Agent; Grid User; Internet User; Internet Link; ESG (& other) Applications; NDG Web Portal.]

Page 23

NDG Query and Data Delivery

[Flow diagram; recoverable labels: User/Software generates query (XML Q); Query destination; NDG Portal; Query type (Discovery, Detailed, Data); one or more local NDG DB; XML D: browse and redefine query; XML B and XML C (documents and annotations): deliver one or more documents to user; define data request, Q; XML A; local NDG DB exists? (Y); ingest A; extract data; physical data; deliver data.]

Note that definitions A do not need to match any ingested A.

Page 24

The Data Use Chain

Discovery

Authentication

Authorisation

Extraction

Sub-Sampling

Regridding

Processing Display

Delivery

Formatting

Time-line

Page 25

Requirements: Information (1)

Amazon Discovery gives good examples:
• Browse
• Similar datasets
• Details
• Content examples

Learn from the library and book handling community!

Our domain issues require:
• Dealing with Volume
• Formats
• Providing Tools

All require documentation (aka metadata);

We need to improve our information handling

Page 26

What is metadata?

The answer depends on who you are!

• Firstly: information to help one use one’s own data, e.g. calibration data. (A)
• It is information passed with the data to enable someone else to use it. It describes the data. (B)
• Metadata can help one find other people’s data … and then help one obtain and use it. (C)
• Metadata can be used to enable automatic software to manipulate data. (D)
• Metadata can be used to enable the preservation of data for posterity. (all of ABCD)

Page 27

NDG Metadata Taxonomy

Definitions

A: Usage metadata generated from (or about) the data. Normally generated directly from internal metadata.

B: Generic complete metadata, semantic, syntactic (A), including discipline specific (E).

C: Metadata generated to describe both documentation and annotations (as opposed to binary data).

D: Discovery metadata suitable for harvesting. Probably based on Dublin Core & GEO. Subset of B and C.

E: Extra metadata, discipline specific.

S: Summary metadata (overlap between A & D).

Q: Schema which defines supported queries upon A, B, C, D.

Relationships

[Diagram showing how the XML metadata types A, B, C, D, E, S and Q relate to one another.]

Page 28

ISO TC211

• ISO 19101: Geographic information – Reference model
• ISO 19103: Geographic information – Conceptual schema language
• ISO 19107: Geographic information – Spatial schema
• ISO 19108: Geographic information – Temporal schema
• ISO 19109: Geographic information – Rules for application schema
• ISO 19111: Geographic information – Spatial referencing by coordinates
• ISO 19115: Geographic information – Metadata
• ISO 19118: Geographic information – Encoding
• ISO 19121: Geographic information – Imagery and gridded data

Page 29

Dataset title

Dataset reference date

Dataset responsible party

Metadata point of contact

Dataset language

Dataset character set

Dataset topic category

Abstract describing dataset

Spatial resolution of dataset

Spatial representation type

Geographic location of dataset

Vertical/temporal extent for dataset

Reference system

Lineage

Distribution format

On-line resource

Metadata character set

Metadata date stamp

Metadata standard name

Metadata standard version

Metadata file identifier

Metadata language

ISO19115

Page 30

• Metadata extensions and profiles

ISO

Direct relationship between ISO19115 and our (B) Intermediate schema.

Page 31

• Profiling of ISO 191xx
“The comprehensiveness and large number of options available in various base standards make it difficult to combine them for practical applications. … A profile integrates a set of base standards and/or modules (predefined subsets) of base standards to meet a specific implementation requirement.”

• Registration of profiles
“A profile that is registered through an ISO registration procedure becomes an International Standardized Profile (ISP). National standards that are expressed as profiles of ISO base standards may be registered at a national level.”

ISO19101

Page 32

NDG A and B metadata in practice

Clear separation of function between use and discovery.
• Standards Compliant
• Avoid tie-in to details of particular fields or data formats or even components

Metadata model (B)
• “Intermediate” schema, supports multiple discovery formats

NDG Data Model (A)
• provides an abstract semantic model for the structure of data within NDG,
• enables the specification of concrete instances for use by NDG Data Services

Page 33

(B) Metadata Model

[Entity-relationship diagram. Entities: Activity; Observation station types; Data production tools; Dataset types; Basic data entities; Derived data entities; Data granules; Common data entities (dimensions, spatial/temporal; grids; organisations; people; places/areas). Relationship labels: Includes / Included-in; Deploys-a / Deployed-on-a; Produces / Output-by; Produces / Output-at; Can-be-aggregated-in.]

Page 34

(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview

[Entity-relationship diagram organised in tiers.
Tiers: Tier-0 Activity; Tier-1 Observation station types; Tier-2 Data production tools; Tier-3 Dataset types; Tier-4 Basic data entities; Tier-5 Derived data entities; plus Common entities.
Entity labels: Instrument; Ensemble; Analysis; Stationary; Moving; Section; Profile; Lagrangian path; Grid; Time; Trajectory; Point; Area; Place; Model; Simulation; Sample; Person; Organisation; Role; Timeseries; Climatology; Measurement; Integration; N-dimensional dataset; Spatial dimensions.
Relationship labels: Includes / Included-in; Collects / Can-be-collected-in; Deploys-a / Deployed-on-a; Produces / Output-by; Superset-of / Subset-of; Can-be-aggregated-in; Integrated-into; Is-time-ordered-series-of; Follows-a; Processed-to-a.
Legend: Inter-tier relationship; Directed relationship; Spatiotemporal entity; Entity with DIF record; Data owning entity.]

Page 35

NDG Data Model

[Diagram labels: Dataset; Variables; Multidimensional array ... of other arrays ... or from aggregated storage.]

Rich spatiotemporal referencing (standards-compliant: ISO19108, ISO19111)

Page 36

NDG Data Model Schema

UML conceptual model:
• ISO 19103 (conceptual schema language)
• ISO 19109 (rules for application schema)

XML schema:
• ISO 19118 (encoding)

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Data Management Group (CCLRC) -->
<xs:schema targetNamespace="http://ndg.badc.rl.ac.uk/ndgDataModel"
    xmlns:ndgDataModel="http://ndg.badc.rl.ac.uk/ndgDataModel"
    xmlns:ndg19118="http://ndg.badc.rl.ac.uk/ndg19118"
    xmlns:ndg19115="http://ndg.badc.rl.ac.uk/ndg19115"
    xmlns:ndg19111="http://ndg.badc.rl.ac.uk/ndg19111"
    xmlns:ndg19108="http://ndg.badc.rl.ac.uk/ndg19108"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified" attributeFormDefault="unqualified" version="0.2">
  <xs:import namespace="http://ndg.badc.rl.ac.uk/ndg19108" schemaLocation="..\ISO TC211 schemas\Version 0.2\ndg19108.xsd"/>
  <xs:import namespace="http://ndg.badc.rl.ac.uk/ndg19111" schemaLocation="..\ISO TC211 schemas\Version 0.2\ndg19111.xsd"/>
  <xs:import namespace="http://ndg.badc.rl.ac.uk/ndg19115" schemaLocation="..\ISO TC211 schemas\Version 0.2\ndg19115.xsd"/>
  <xs:import namespace="http://ndg.badc.rl.ac.uk/ndg19118" schemaLocation="..\ISO TC211 schemas\Version 0.2\ndg19118.xsd"/>
  <!--PART 1 - Objects with identity-->
  <xs:element name="GI">
    <xs:annotation>
      <xs:documentation>This is the root element of the document. It is a limited (and incomplete!) implementation of the root element described in ISO 19118 (clause 5.4). The latter describes the format of a compliant XML exchange file, intended for encoding a single dataset. The application of ISO 19118 for the current NDG Data Model which may contain multiple datasets needs to be resolved.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:group ref="ndgDataModel:Object"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:group name="Object">
    <xs:annotation>
      <xs:documentation>A dataset contains one or more elements that encode objects, grouped in a choice group that shall be used to restrict the legal objects in a dataset (ISO 19118, clause A.5.4.2).</xs:documentation>
    </xs:annotation>
    <xs:choice>
      <xs:element name="Dataset" type="ndgDataModel:Dataset"/>
      <xs:element name="Coordinates" type="ndgDataModel:Coordinates"/>
    </xs:choice>
  </xs:group>

Page 37

Instantiating the NDG Data Model

[Diagram labels: SAX demarshalling; extraction; serialisation; writeData(selectedComponents)]
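
To make the SAX demarshalling step more concrete, here is a minimal Python sketch using the standard-library `xml.sax` module. It streams through an instance document and counts the `Dataset` and `Coordinates` objects directly under the `GI` root, using the element names and namespace from the schema excerpt shown earlier in the talk. The sample document, the `ObjectCounter` class and the `count_objects` function are hypothetical; the slides do not show the real NDG demarshalling code or the internal content of these elements.

```python
import io
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler, feature_namespaces

NDG_NS = "http://ndg.badc.rl.ac.uk/ndgDataModel"

# Hypothetical instance document: the content of Dataset and Coordinates
# is not shown on the slides, so the elements are left empty here.
SAMPLE = f"""<GI xmlns="{NDG_NS}">
  <Dataset/>
  <Coordinates/>
  <Dataset/>
</GI>"""


class ObjectCounter(ContentHandler):
    """Counts the objects that are direct children of the GI root."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = Counter()

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if self.depth == 1 and uri == NDG_NS:
            self.counts[local] += 1
        self.depth += 1

    def endElementNS(self, name, qname):
        self.depth -= 1


def count_objects(document: str) -> Counter:
    """Parse the document with a namespace-aware SAX parser."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = ObjectCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document.encode("utf-8")))
    return handler.counts


print(count_objects(SAMPLE))
```

A streaming parser suits this job because the handler only keeps the state it needs, so a large metadata document does not have to be held in memory as a tree.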

Page 38

Further Application in NERC DataGrid

• e.g. Data model “Coordinates”

[UML class diagram. Classes: Dataset; Variable; Array; Coordinate; GranuleDescriptor. The associations carry multiplicities of 1, * and 0..1. Annotated with ISO 19111 and ISO 19108.]

Page 39

NDG Semantic Data Model

WOCE SR3 section : Dataset
  name : string(idl) = WOCE SR3

Salinity : Variable
  name : string(idl) = salinity
  parameter : CFStandardName = sea_water_salinity
  units : PhysicalUnits = psu
  missingValue : Numeric
  scaleFactor : Numeric
  offset : Numeric

Salinity Data : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  hasData : boolean(idl) = no
  type : NumericTypeName
  isomorphicChildren : boolean(idl) = no

Cruise track longitude : MappedCoordinate
  name : string(idl) = longitude
  axisType : AxisType = x
  axisUnits : AxisUnits = degrees_east
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1

Cruise track latitude : MappedCoordinate
  name : string(idl) = latitude
  axisType : AxisType = y
  axisUnits : AxisUnits = degrees_north
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1

Cast 1 : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 20
  hasData : boolean(idl) = yes
  type : NumericTypeName = float
  isomorphicChildren : boolean(idl)

Cast 50 : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 40
  hasData : boolean(idl) = yes
  type : NumericTypeName = float
  isomorphicChildren : boolean(idl)

Cast 1 depth : ProductCoordinate
  name : string(idl) = depth
  axisType : AxisType = z
  axisUnits : AxisUnits = m
  axisSize : short(idl) = 20
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1
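
The object diagram above can be mirrored with a few plain data classes. The following Python sketch is illustrative only: the attribute values are taken from the slide, but the class layout, the way arrays nest (a parent array with no data of its own holding one child array per cast) and the `count_values` helper are assumptions, not the NDG implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Coordinate:
    name: str
    axis_type: str   # "x", "y" or "z"
    axis_units: str
    axis_size: int


@dataclass
class Array:
    rank: int
    axis_size: int
    has_data: bool
    coordinate: Optional[Coordinate] = None
    children: List["Array"] = field(default_factory=list)


@dataclass
class Variable:
    name: str
    parameter: str   # CF standard name
    units: str
    data: Array


@dataclass
class Dataset:
    name: str
    variables: List[Variable] = field(default_factory=list)


def count_values(array: Array) -> int:
    """Number of stored values in an array and all of its child arrays."""
    own = array.axis_size if array.has_data else 0
    return own + sum(count_values(child) for child in array.children)


# Only Cast 1 and Cast 50 appear on the slide; the casts in between are omitted.
cast_1 = Array(rank=1, axis_size=20, has_data=True,
               coordinate=Coordinate("depth", "z", "m", 20))
cast_50 = Array(rank=1, axis_size=40, has_data=True)
salinity_data = Array(rank=1, axis_size=50, has_data=False,
                      children=[cast_1, cast_50])
salinity = Variable("salinity", "sea_water_salinity", "psu", salinity_data)
dataset = Dataset("WOCE SR3", [salinity])

print(count_values(salinity_data))  # values held by the two casts shown
```

Letting each cast be its own child array is what allows casts of different lengths (20 and 40 levels here) to sit inside one section variable.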

Page 40

NDG Prototype

Layout not important (yet!)

It’s what’s under the hood that counts …

( … the data is NOT in NetCDF. The original data is available …

… the search covered data that could have been harvested …

… the architecture works!)

Page 41

NDG Query and Data Delivery

[Flow diagram, repeated from Page 23.]

Page 42

NDG Discovery Service Element

[Diagram; recoverable labels: Existing Metadata; XSLT Ingest Transformation; Intermediate Schema Document(s) (XML); XSLT Processor (three instances); passthru; NDG Discovery Service Element; output formats: Directory Interchange Format, Dublin Core, GEO Profile (Z39.50), Catalogue Interoperability Protocol ?]

Traditional and Grid Service (GT3) Interfaces

Page 43

NDG Metadata Status

• We have built a SIMPLE prototype based primarily on our data model and used our structures to find, locate, reformat and deliver data typical of BODC and BADC observational data. (This is a first)

• We are about to re-engineer.

• Key issues to address will be

– Vocabularies, and

– Ontologies

– Developing a Model Attribute Language (with CGAM, PRISM, PCMDI and others).

• Populating our metadata; a boring and laborious job!

Page 44

Metadata Origins

[Diagram labels: Wider Internet; Shared Resources; Research Group (three instances); Satellite; SuperComputer; DB]

Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …

To be well integrated the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)

At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.

Page 45

Vocabularies

• BODC has a parameter dictionary with o(10K) entries
• CF standard name vocabulary, o(100) entries
• NASA Global Change Master Directory, o(1000) entries
• … there are more.

Need methods of mapping namespaces, communities will not sacrifice their existing taxonomies …

Page 46

An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:

• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have

Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.

What is an Ontology?

Page 47

RDF: Resource Description Framework

• W3C language which builds on the hierarchical attribute/entity structures of XML.

• Used to create a collection of assertions, specified as triples:
– <Lion> <is an instance of> <Animal>
– <Aslan> <is an instance of> <Lion>
– <Lion> <has the property> <mane>

• Now we can build tools which use these concepts:
– Aslan has a mane!
– Aslan will also have animal properties.

• RDF Schema vocabulary builds on this to allow namespaces and more … (ranges etc).
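
The inference described above can be sketched in a few lines of plain Python, with no RDF library involved. The triples are the three from the slide; the helper functions are illustrative. The sketch treats "is an instance of" as transitive because the slide's conclusions need that; in RDF Schema the Lion/Animal link would more usually be modelled with `rdfs:subClassOf`.

```python
# The three assertions from the slide, as (subject, predicate, object) triples.
TRIPLES = {
    ("Lion", "is an instance of", "Animal"),
    ("Aslan", "is an instance of", "Lion"),
    ("Lion", "has the property", "mane"),
}


def classes_of(subject, triples):
    """All classes reachable from subject by following 'is an instance of'."""
    found, frontier = set(), [subject]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "is an instance of" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def properties_of(subject, triples):
    """Properties asserted on the subject or on any class it belongs to."""
    holders = {subject} | classes_of(subject, triples)
    return {o for s, p, o in triples if p == "has the property" and s in holders}


print(classes_of("Aslan", TRIPLES))
print(properties_of("Aslan", TRIPLES))
```

Adding a triple such as `("Animal", "has the property", ...)` would make that property available to Aslan as well, which is the second conclusion on the slide.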

Page 48

Real Ontologies … immature …

SWEET: Semantic Web for Earth and Environmental Technology (NASA, JPL)

Earth realms,

Numerics

Physical Properties

Units

Phenomena

I believe they are attempting a CF mapping …

Page 49

Requirements (2)

We need to think about our networks and our tools for moving and keeping track of data!

• We can’t rely on “leave it at the supercomputer site”:
– How do we do joint analysis?
– How do we process the data at all?

• Malcolm Atkinson, quoting Jim Gray, pointed out that it takes:
~ o(minute) to grep or ftp a GB
~ o(2 days) to grep or ftp a TB
~ o(3 years) to grep or ftp a PB

• Requires:
– Sophisticated “fire and forget” file transfer (that has to outperform “sneaker net”).
– Disk and compute resources for processing.
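
The orders of magnitude quoted from Jim Gray can be checked with back-of-envelope arithmetic. The Python sketch below assumes a sustained throughput of 10 MB/s and decimal units; that rate is an assumption chosen here because it roughly reproduces the quoted figures, and it does not appear on the slide.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

GB, TB, PB = 1e9, 1e12, 1e15  # decimal units, in bytes


def transfer_seconds(n_bytes: float, rate_bytes_per_s: float = 10e6) -> float:
    """Time to scan or move n_bytes at an assumed sustained rate."""
    return n_bytes / rate_bytes_per_s


print(f"1 GB: {transfer_seconds(GB) / 60:.1f} minutes")
print(f"1 TB: {transfer_seconds(TB) / SECONDS_PER_DAY:.1f} days")
print(f"1 PB: {transfer_seconds(PB) / SECONDS_PER_YEAR:.1f} years")
```

Because the time scales linearly with volume, each factor of 1000 in data size turns minutes into days and days into years unless the transfer rate or the degree of parallelism grows with it.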

Page 50

ESG1 Results (Supercomputing, 2001)

Allcock et al. 2001

Dallas to Chicago:

Page 51

Starting with the LAS

Deployment for UK users within a few weeks (constraint is primarily access control)

Page 52

LAS – Simple Box fill Output

Work for us to do: Labelling is inadequate as yet ..

ERA40

Page 53

Cache management in LAS/CDAT

[Flow diagram: Internet User, LAS, CDAT, BADC/CDAT with localCache.py, ERA-40 spectral archive and grid cache.]

• LAS calls cdms.open (CDAT) to open a data file.
• BADC/CDAT intercepts the command and checks the cache.
• localCache.py locks access to the cache and checks if the regular gridded file is in the cache list.
• NO: the spectral file is converted on-the-fly and placed in the cache. The cache also checks if there is enough room, deletes the oldest files if necessary and checks against the disk space limit.
• YES: the cache is unlocked. A new cdms.open command is sent to CDAT and the cache file is opened.
• Data object delivered to LAS.
• NetCDF file, plot or animations delivered to user.

Data volumes: ERA-40 spectral archive, 4 TB; ERA-40 grid cache, < 1 TB; 18 TB virtual dataset.
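
The cache logic in the diagram (lock, look up, convert on a miss, evict the oldest entries when space runs out) can be sketched as follows. This is a hypothetical, in-memory illustration in Python: the `GridCache` class, its `open` method and the `convert` callable are invented for the example, converted files are held as byte strings rather than on disk, and nothing here is taken from the actual localCache.py.

```python
import threading
from collections import OrderedDict
from typing import Callable


class GridCache:
    """Size-limited cache that evicts the oldest entries first."""

    def __init__(self, limit_bytes: int, convert: Callable[[str], bytes]):
        self.limit_bytes = limit_bytes
        self.convert = convert  # stand-in for the spectral-to-grid conversion
        self._lock = threading.Lock()
        self._files = OrderedDict()  # file name -> converted data, oldest first

    def _used(self) -> int:
        return sum(len(data) for data in self._files.values())

    def open(self, spectral_name: str) -> bytes:
        with self._lock:  # lock access to the cache
            if spectral_name not in self._files:
                data = self.convert(spectral_name)  # convert on the fly
                # Delete the oldest files until the new one fits the limit.
                # An item larger than the limit empties the cache and is
                # still stored; a real implementation would need a policy.
                while self._files and self._used() + len(data) > self.limit_bytes:
                    self._files.popitem(last=False)
                self._files[spectral_name] = data
            return self._files[spectral_name]
```

Holding the lock for the whole lookup-and-convert step keeps two requests from converting the same file at once, at the cost of serialising conversions; that trade-off is a design choice of this sketch.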

Page 54

Summary

• Earth System Modelling extends the data handling challenge.

• We need better information management

• We need better tools for moving things around

• We need better tools for using remote data

• … and we need data manipulation hardware!

The NDG is attempting (with help) to address:

• Information management

• Data movement

• Tools to manipulate large volumes of data.

… and doing this all in as standards compliant a manner as possible.

Page 55

You’ve gone TOO FAR!