Shoaib Sufi
CCLRC e-Science Centre
CCLRC Scientific Metadata (CSMD) Model
April 2004 NESC
Model Motivation
• A common general format/standard for Scientific Studies and data holdings metadata does not exist
• By proposing a Model and Implementation:
– Form a specification for the types of metadata that should be captured by Scientific Studies
– Ease citation, collaboration, exploitation and integration
– Allow easy integration of distributed heterogeneous metadata systems into a homogeneous (albeit virtual) platform
Structure of Metadata Model
• The CCLRC Scientific metadata model (CSMD) is a study-data set orientated model:
– Indexing
– Provenance
– Data Description
– Data Location
– Access Conditions
– Related Material
What influenced CSMD
• CIP from Earth Observation
• DDI from Social Sciences
• DublinCore from the Library community
– Publication only metadata
• XSIL as used on LIGO
– Low level ‘Scientific Data Objects’ focus
• CERA from the MPIM
– A bit specific to Earth Sciences but close
• … hence the need to develop our own General Model – CCLRC Scientific Metadata Model
Some Model Aims
• Abstract class orientated description of the types of metadata that should be captured by Scientific Studies
• Create a common denominator for Scientific Study metadata which forms a specification
• At the metadata workshop at NIEES in 2002, during a discussion on metadata standards, the question was asked: are people capturing metadata at the moment? The simple answer given was no!
CSMD Used on DataPortal
• XML Implementation used as Data Interface for DataPortal
• Single view of heterogeneous systems/schemas
• Acts as a stress test of the model
– Limitations feed into Model Requirements
– New requirements fed back into implementation
Model Breakdown: Provenance
• The Study contains the following metadata:
– The Study Name
– The Study Institution
– The Investigator
– Extended Study Information
  • Abstract
  • Funding
  • Start and End times
– Investigations
Investigations
• A Study can have more than one investigation; possible enumerations are experiment, simulation, measurements etc.
• Investigations contain:
– Name
– Investigation Type
– Abstract
– Resource
– Link to DataHolding
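The Study and Investigation components could be sketched as plain Python dataclasses. This is only an illustration of the structure listed on the slides: the class names, field names and sample values are assumptions, not the CSMD XML or relational implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Investigation:
    """One investigation within a study (field names are illustrative)."""
    name: str
    investigation_type: str                 # e.g. "experiment" or "simulation"
    abstract: str = ""
    resource: str = ""
    data_holding_ref: Optional[str] = None  # link to the DataHolding, if any


@dataclass
class Study:
    """Provenance metadata for a study (field names are illustrative)."""
    name: str
    institution: str
    investigator: str
    abstract: str = ""                      # extended study information
    funding: str = ""
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    investigations: List[Investigation] = field(default_factory=list)


# Hypothetical sample values, used only to show the one-to-many relationship.
study = Study(name="Example study",
              institution="Example institution",
              investigator="A. Researcher")
study.investigations.append(
    Investigation("Run 1", "experiment", data_holding_ref="holding-001"))
study.investigations.append(Investigation("Model A", "simulation"))
```

A study holds a list of investigations, which mirrors the statement that a Study can have more than one investigation.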
Topic (for indexing)
• Keywords
– Discipline (i.e. domain)
– Keyword Source (e.g. domain dictionary)
– Keyword
• Subjects
– Discipline
– Subject Source (e.g. domain taxonomy)
– Subject
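As a rough sketch of how the topic entries support indexing, the snippet below stores each keyword or subject with its discipline and source, then finds the studies tagged with a given term. The names `TopicTerm` and `studies_for_term`, and all sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TopicTerm:
    """A keyword or subject entry (names are illustrative)."""
    discipline: str   # i.e. the domain
    source: str       # e.g. a domain dictionary or taxonomy
    term: str         # the keyword or subject itself


def studies_for_term(index: Dict[str, List[TopicTerm]], term: str) -> List[str]:
    """Return the sorted names of studies tagged with the given term."""
    wanted = term.lower()
    return sorted(
        name for name, terms in index.items()
        if any(t.term.lower() == wanted for t in terms)
    )


# Hypothetical index: study name -> topic entries.
index = {
    "Study A": [TopicTerm("chemistry", "example dictionary", "zeolite")],
    "Study B": [TopicTerm("physics", "example taxonomy", "neutron scattering")],
    "Study C": [TopicTerm("chemistry", "example dictionary", "Zeolite")],
}
matches = studies_for_term(index, "zeolite")
```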
Access Condition & Related Material
• Access Conditions
– Contains a list of users or groups who are allowed access to the metadata and data, or a pointer to an access control system which contains such data for this study
• Related Material
– One or many links and/or textual descriptions of material related to this study, e.g. earlier studies or parallel studies
Data
• Data Description holds a logical description of the Study’s data:
– Data Name
– Type of Data
– Status
– Data Topic
– Parameters
– Related Data Ref
– Relation type (e.g. derived)
• Data Location contains the link between logical name and physical URIs:
– Data Name
– Locator(s)
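The split between a logical description and a physical location could be sketched as follows. The class names, the `resolve` helper and the example URIs are assumptions made for illustration; they do not come from the CSMD implementations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataDescription:
    """Logical description of a piece of data (names are illustrative)."""
    data_name: str
    data_type: str
    status: str
    related_data_ref: Optional[str] = None
    relation_type: Optional[str] = None   # e.g. "derived"


@dataclass
class DataLocation:
    """Maps a logical data name to one or more physical locators."""
    data_name: str
    locators: List[str] = field(default_factory=list)


def resolve(data_name: str, locations: List[DataLocation]) -> List[str]:
    """Return every physical URI recorded for a logical data name."""
    return [uri
            for loc in locations if loc.data_name == data_name
            for uri in loc.locators]


# Hypothetical records; the URIs are placeholders.
raw = DataDescription("run-42-raw", "raw", "complete")
reduced = DataDescription("run-42-reduced", "reduced", "complete",
                          related_data_ref="run-42-raw",
                          relation_type="derived")
locations = [
    DataLocation("run-42-raw", ["file:///archive/run-42.raw",
                                "http://example.org/data/run-42.raw"]),
    DataLocation("run-42-reduced", ["file:///archive/run-42.red"]),
]
uris = resolve("run-42-raw", locations)
```

Keeping the locators separate from the description means a data item can move or be replicated without its logical description changing.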
More on Parameters
• Parameters contain a lot of information about the data objects (DO) and collections
• A collection/DO can have many parameter entries; each parameter entry contains:
– Parameter derivation (e.g. measured/fixed)
– The value
– The units
– Range
– Error margin
• Parameter aggregation is also supported
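A parameter entry with the fields listed above might look like this minimal sketch. The `name` field, the method names and the sample values are assumptions added for illustration, and aggregation is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Parameter:
    """One parameter entry for a data object or collection (illustrative)."""
    name: str                     # hypothetical label, not listed on the slide
    derivation: str               # e.g. "measured" or "fixed"
    value: float
    units: str
    value_range: Optional[Tuple[float, float]] = None
    error_margin: float = 0.0

    def bounds(self) -> Tuple[float, float]:
        """Return the value minus and plus the error margin."""
        return (self.value - self.error_margin,
                self.value + self.error_margin)

    def in_range(self) -> bool:
        """Check the value against the declared range, if one is given."""
        if self.value_range is None:
            return True
        low, high = self.value_range
        return low <= self.value <= high


# Hypothetical sample entry.
temperature = Parameter("temperature", "measured", 10.0, "K",
                        value_range=(0.0, 300.0), error_margin=0.5)
```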
Cardinality Issues
• The model recommends a certain cardinality of elements
• Certain metadata components are necessary for one to have an instance of the implemented model – treating everything as optional is not acceptable
• It is thought that implementations may modify this to suit their needs; the model attempts to remain ideal (i.e. the most common cardinality)
Enumeration Issues
• Enumerations (or controlled vocabularies), e.g. types of investigator or types of institution, are distinct from the model, as taxonomies are
• However, they are necessary for the model to work, so implementations (e.g. the CCLRC DataPortal XML implementation of the model) propose some enumerations for common things
• It is hoped that implementations will use recognised and relevant controlled vocabularies where they are available
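One way an implementation could supply such an enumeration is a Python `Enum` kept apart from the model classes. The sketch below uses the investigation types mentioned earlier in the slides; the class name and the `parse_investigation_type` helper are hypothetical, and a real implementation would use a recognised vocabulary where one exists.

```python
from enum import Enum


class InvestigationType(Enum):
    """Illustrative controlled vocabulary for investigation types."""
    EXPERIMENT = "experiment"
    SIMULATION = "simulation"
    MEASUREMENT = "measurement"


def parse_investigation_type(text: str) -> InvestigationType:
    """Map free text onto the vocabulary; raises ValueError if unknown."""
    return InvestigationType(text.strip().lower())


kind = parse_investigation_type(" Simulation ")
```

Because the vocabulary lives in its own class, it can be swapped or extended without touching the model structure.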
Conformance Level
• For a complete metadata study-dataset record a large amount of metadata has to be stored/processed
• So it’s useful to have conformance levels
• Model uses 5 levels
• Each level specifies that more metadata (and indexing information) should be held
Level 1
• Type of Information captured:
– Study and Investigation metadata with indexing at the Study level
• Level 1 metadata is similar to library/publication style metadata (e.g. DublinCore)
Level 2
• Type of Information captured:
– Level 1 + DataHolding metadata (i.e. DataSets and DataObjects)
Level 3
• Type of Information captured:
– Level 2 + related material, Access condition, indexing to data collection levels
Level 4
• Type of Information captured:
– Level 3 + indexing to data object level and data object parameter information
Level 5
• Type of Information captured:
– All metadata components are filled: Level 4 + funding, resources used, facilities used etc.
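Because each level adds to the one before it, the conformance level of a record can be sketched as the highest level whose cumulative requirements are all present. The component names below are hypothetical labels for the items listed on the Level 1 to Level 5 slides, and returning 0 for a record that does not reach Level 1 is an assumption of this sketch.

```python
from typing import Iterable

# Additional components required at each level (labels are illustrative).
LEVEL_REQUIREMENTS = [
    (1, {"study", "investigation", "study_indexing"}),
    (2, {"data_holding"}),
    (3, {"related_material", "access_conditions", "collection_indexing"}),
    (4, {"object_indexing", "object_parameters"}),
    (5, {"funding", "resources_used", "facilities_used"}),
]


def conformance_level(components: Iterable[str]) -> int:
    """Return the highest level whose cumulative requirements are met."""
    present = set(components)
    level = 0
    for lvl, required in LEVEL_REQUIREMENTS:
        if not required <= present:
            break              # a gap at this level stops the climb
        level = lvl
    return level


record = {"study", "investigation", "study_indexing",
          "data_holding", "object_parameters"}
level = conformance_level(record)
```

In this example the record has parameter information but lacks the Level 3 components, so it is reported at Level 2.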
Conformance Levels
• L1 is similar to library/publication style metadata (e.g. DublinCore)
• The current DataPortal uses somewhere between L2 and L3 – indexing at study level moving towards collection level but with parameter information
• It is envisaged that only new systems designed with CSMD will conform to L4+
• Benefit of conformance levels: the higher the level of conformance to the CSMD, the richer the clients that operate on the data can be
– e.g. identifying datasets and objects which link directly to keywords/taxonomies and not just studies
Facilities using CSMD
• CCLRC Facilities (via CCLRC DataPortal):
– ISIS - Neutron Spallation at Rutherford Appleton Laboratory (test)
– SR – Synchrotron Radiation source at Daresbury Laboratory (test)
– British Atmospheric Data Centre (BADC) at RAL (prototype)
• External Facilities (via CCLRC DataPortal):
– Max-Planck-Institut für Meteorologie (MPIM) in Hamburg
• External Projects using CSMD:
– NERC funded E-mineral ‘environment from the molecular level’
– EPSRC funded E-materials project
– Manchester MyGrid project uses an adapted version
– ISIS (RAL) have taken data needs inhouse and use a model based heavily on CSMD
The Future
• Increased use/recommendation for use of Controlled vocabularies
• Increased support for formal identification systems
• Feeding in relevant ideas from other standards
• Update XML and Relational implementations so they more closely track the model
• Look into internationalisation issues and see if these affect the model or the implementations
More information
• Latest Model description
– http://www-dienst.rl.ac.uk/library/2002/tr/dltr-2002001.pdf
• For an XML implementation, a Relational implementation, or a newer draft of the model documentation, e-mail:
– [email protected] with the subject containing [metadata model request]