Page 1

IMBER workshop BEER: Being Efficient and Environmentally Responsible, Miami, 9 November 2008

Data Integration Made Easier

Gwenaëlle Moncoiffé

British Oceanographic Data Centre
IOC/IODE GEBICH

Image courtesy of Alex Mustard

Page 2

Introduction

• Data integration: not an option but a necessity.

Data integration at the heart of IMBER

JGOFS experience ≠ WOCE ≠ GLOBEC > “lessons learnt”

Legacy

Increasing demand for easily accessible data and for compilation of global datasets for use in climatologies, gridded products

E.g. CO2, DMS, data integration activities under SOLAS

No data integration possible without good data management practices

Page 3

Introduction

Data management standards across three eras

Paper log era (data management standards: HIGH)
• Strong institute or team-based ethic
• Fewer data better managed
• Long-term staff, stable teams

Personal Computer era (data management standards: LOW)
• Left to the individual
• Demise of the long-term lab technician
• Diversification + short contract + high turnover > degradation of data management standards
• Rules disappeared or were no longer appropriate/adapted
• Data management becomes an IT “thing”

Global networks era (data management standards: HIGH)
• Growing community ethic
• Value in data beyond personal research use
• Need to re-instate strong basic principles for efficient data management
• Re-establish data management as a basic scientific skill & an essential requirement
• Not a technicality but our responsibility as scientists.

Page 4

Introduction

“Researchers need to be obliged to document and manage their data with as much professionalism as they devote to their experiments.”

“they should receive greater support in this endeavour than they are afforded at present.”

“such standards require support from researchers, who should adopt them and deploy them consistently.”

“The lack of standards, for instance, confounds many a researcher seeking to harness the diversity of knowledge now available on any chosen topic.”

Nature vol 455 issue 7209, September 2008

“All credit, then, to those in the vanguard of interoperability.”

Page 5

Introduction

• This presentation aims to provide tips on how to make the whole process work more efficiently for individual research scientists and anybody keen to follow good data management practices.

• Practical and non-technical

• Mainly cruise-based examples but could be adapted to any form of data collection activity

• Work in progress which feeds from or will feed into the IMBER Cookbook

• Examples and list of resources will be biased towards areas I am more familiar with: UK/Europe, cruise-based data collections.

• Your input to help fill the gaps and fine-tune some sections is important.

Page 6

Rule 1: start early

• Section relevant to all project PIs

Page 7

Rule 1: start early

• Include data management in your budget! 5-10% of your grant is a good approximation.

• As a project PI / cruise PSO: decide what strategy you will adopt

Do it yourself (not recommended even for small projects): time-consuming and too tempting to prioritise the meaty “publishable” bits

Hire somebody to act as a data manager/integrator (full-time or part-time)

Delegate task to a [young] scientist embedded in your team or approach your national data centre or institute data management team

• A data scientist’s tasks start before the first cruise starts and end after the last samples have been analysed and the last dataset handed over.

• If multiple projects share the same cruise(s) then money can be pooled to support 1 data management berth and pre- and post-cruise work.

Page 8

Rule 2: follow the guidelines

• Section relevant to project PIs, data collectors and data scientists

Page 9

Rule 2: follow the guidelines

Page 10

Rule 2: follow the guidelines

• Will facilitate collaboration between IMBER teams, projects, countries or activities (“interoperability”)

• Will enable IMBER scientists to make use of existing and developing analytical tools

• Will make it easier to provide data to modellers and global dataset builders

• Will give your data more value by making them more re-usable

Page 11

Help is at hand: data centres

• NODCs

• Established expert data management centres/teams

BCO-DMO

CYBER from CNRS at the laboratoire de Villefranche

…

• Specialist and world data centres

CDIAC, COPEPODS, CCHDO

WDC-I Silver Spring, WDC-MARE

…

Page 12

Help is at hand: advising groups, technical support, standard setters

• MMI (Marine Metadata Interoperability) website and very accessible guides on metadata, ontologies, vocabularies

• IODE: Ocean Teacher, Ocean Data Standards, Expert groups

• ICES: Guidelines for data collection and exchange

• CLIVAR and Carbon Hydrographic Data Office (CCHDO): Manuals for data collection and exchange

• GCMD: Discovery metadata, DIFs

• SeaDataNet (in Europe): standards, vocabularies mapping

• …

Page 13

IMBER: what? where? who? how?

• Span a wide range of measurements: standard core oceanographic measurements, biogeochemistry, biodiversity data, chemical and biological stocks and rates

• From surface to deep ocean sediment

• Research cruises, in situ enrichment experiments, modelling activities, lab and mesocosm experiments, remote sensing

Need a central searchable inventory

Page 14

Building the IMBER data inventory

• At the Project level

• At the Cruise level

• At Individual level

Minimum requirement

Project information in GCMD

Cruises in CSR databases and eventually DIFs

Page 15

Building the IMBER data inventory

• GCMD portal but learn from past experience (e.g. GLOBEC)

• Ensure that at least one level of information (the “project” level) is captured consistently and linked to IMBER as a project

• Follow DIF guidelines for IMBER projects (see Cookbook)

• Cruise information will be added once a CSR-to-DIF converter becomes available

• It is good practice to create a DIF for your individual dataset record; in doing this ensure it is linked to IMBER (there will be info in the Cookbook about this)

Page 16

Building the IMBER data inventory

• Why CSRs?

Specifically designed for oceanographic cruise reporting

The only existing international standard (initiated by the IOC in the 60s)

Two international repositories: DOD and ICES

Adopted by European SeaDataNet project and by POGO

Online tools are becoming available

Long history with an important legacy population (38,000 cruises in ICES database)

Page 17

Building the IMBER inventory

[Diagram; labels as they appear on the slide]

DIFs

CSRs

Cruises

Filled in by PSOs or appointed project’s data specialist

Reviewed by IMBER DLO

Converter to be developed

NODC

Data discovery

Filled in by projects’ data scientists

Examples of non cruise-based activities:
- Lab and mesocosm experiments
- Remote-sensing
- Socio-economic data
- Model-derived data

IMBER Metadata Portal in GCMD

Projects and non-cruise activities

Page 18

Data sharing and archiving

• Section relevant to all data collectors and appointed data scientists

Page 19

Preparing data for the archive

• What to archive?

Three levels, each with examples, whether to archive, and where:

Raw (digital) data
– Examples: plankton raw counts, PVR images, FRRF raw files, etc.
– Archive? YES, to enable future reference and for safe keeping.
– Where? NODCs, WDCs, digital archives/repositories

Primary (measured) processed
– Examples: plankton abundance, biovolume, etc.
– Archive? YES, this is the minimum required.
– Where? NODCs, digital archives or repositories, national and international SDCs, WDCs

Derived/interpreted data
– Examples: plankton biomass using conversion factors, ratio of measured quantities, reduced/combined data, etc.
– Archive? OK for sharing and sometimes very important to archive but never on their own (e.g. abundance + biovolumes needed alongside plankton biomass).
– Where? NODCs, digital archives or repositories, national and international SDCs, WDCs

Page 20

Preparing data for sharing and for the archive

• How to share and archive data?

• References

Best Practices for Preparing Environmental Data Sets to Share and Archive (Cook et al 2001, updated by Hook et al 2007, Oak Ridge National Laboratory) – 9 pages.

– http://daac.ornl.gov/PI/bestprac.html

BCO-DMO Data Management Guidelines Manual: a collection of best practice recommendations for collecting and sharing biogeochemical and ecological oceanographic data and metadata – 15 pages.

– http://bcodmo.org/files/bcodmo/BCO-DMO_best_prac_v1d1.pdf

Page 21

How to share and archive data

• The 7 best practices for environmental data (http://daac.ornl.gov/PI/bestprac.html)

1. Assign descriptive (and unique) file names
2. Use consistent and stable file formats
3. Define the content of your data files
4. Use consistent data organisation
5. Perform basic quality assurance
6. Assign descriptive data set titles
7. Provide documentation

Page 22

Preparing data for sharing and for the archive

• 1. Assign descriptive file names

Reflect content and uniquely identify the data file

Use project acronym, cruise identifier, station identifier, study title, investigator, data type, version number, date created, etc.

Compliant with various data systems and platforms

– Only numbers, letters, dashes, underscores; no spaces or special characters

– Use lower case

– Limit length to 64 characters max

– Use appropriate file extension to reflect file type.
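The naming rules above lend themselves to an automated check. The following is a minimal sketch rather than an official validator: the function name `check_file_name` is hypothetical, the 64-character limit is assumed to include the extension, and the extension is assumed to be alphanumeric.

```python
import re

# Hypothetical pattern: letters, digits, dashes and underscores for the stem,
# then a dot and an alphanumeric extension.
_NAME_PATTERN = re.compile(r"^[a-z0-9_-]+\.[a-z0-9]+$")
MAX_LENGTH = 64  # assumed to count the extension as well


def check_file_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 64 characters")
    if name != name.lower():
        problems.append("contains upper-case letters")
    if " " in name:
        problems.append("contains spaces")
    # Case and spaces are reported above, so neutralise them before this test.
    if not _NAME_PATTERN.match(name.lower().replace(" ", "_")):
        problems.append("contains special characters or lacks an extension")
    return problems
```

Calling `check_file_name("Data for bodc.xls")` reports the upper-case letter and the spaces, while a name such as `d999_ctd001_pro_20080901.cnv` returns an empty list.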

Page 23

D999_ctd001_pro_20080901.cnv

D999_frrf001_pro_20080901.asc

D999_frrf001_raw.asc

D999_uway_nutrients_20080901.xls

D999_ctd_nutrients_20080901.xls

D999_blogg_hplc_v1_20080901.csv

D999_blogg_hplc_bodc_20090901.csv

D999_blogg_hplc_bodc_20081201.csv

Bodc.xls

Data for bodc.xls

Nutrients.csv

Hplc.csv
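The descriptive names in this list appear to follow a pattern of cruise identifier, content, optional processing stage or version, and date. A small helper can apply such a pattern consistently. This is only a sketch: the function `build_file_name`, its parameter names, and the reading of "pro" as a processing stage are assumptions, not part of the slides.

```python
from datetime import date


def build_file_name(cruise: str, content: str, stage: str, created: date, ext: str) -> str:
    """Join the parts with underscores and append the date as YYYYMMDD."""
    parts = [cruise, content, stage, created.strftime("%Y%m%d")]
    # Skip empty parts and replace inner spaces so the result has no spaces.
    stem = "_".join(p.strip().replace(" ", "_") for p in parts if p.strip())
    return f"{stem}.{ext.lstrip('.')}"
```

For instance, `build_file_name("D999", "ctd001", "pro", date(2008, 9, 1), "cnv")` gives `D999_ctd001_pro_20080901.cnv`. The helper keeps the case it is given, so lower-casing the cruise identifier is left to the caller.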

Page 24

Preparing data for sharing and for the archive

• 2. Use consistent and stable file formats

Tabular data

Spreadsheet and ASCII format

Use the same format throughout the file or files

For ASCII, use delimiters such as comma, tab or semicolon

Don’t include figures and summary statistics in the data file; keep these in a separate file

Use a header

– Avoid special characters and avoid using the chosen delimiter

– Identify your dataset with data file name, data set title, author, version name, date created, date modified, reason for modifications, associated document file name

– Use column headings for parameter names

– Use one row for parameter units
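As an illustration of the header advice above, the sketch below writes a comma-delimited file with identifying header lines, one row of parameter names and one row of units. The `#` prefix for header lines, the function name `write_data_file` and the metadata keys are assumptions made for the example, not a format required by any data centre.

```python
import csv
import io


def write_data_file(metadata: dict[str, str], names: list[str], units: list[str],
                    rows: list[list[object]]) -> str:
    """Return CSV text: '#'-prefixed metadata lines, then names, units and data rows."""
    if len(names) != len(units):
        raise ValueError("each parameter needs exactly one unit")
    buffer = io.StringIO()
    for key, value in metadata.items():
        # Commas in header text would clash with the chosen delimiter.
        if "," in key or "," in value:
            raise ValueError(f"header entry contains the delimiter: {key}")
        buffer.write(f"# {key}: {value}\n")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)   # column headings for parameter names
    writer.writerow(units)   # one row for parameter units
    writer.writerows(rows)
    return buffer.getvalue()
```

A call with `{"file_name": "d999_ctd_nutrients_20080901.csv", "version": "1"}` as metadata produces two header lines followed by the names row, the units row and the data.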

Page 25

Preparing data for sharing and for the archive

• 3. Define the content of your data files

Parameter Name

– Use commonly accepted parameter names

– Always define abbreviated parameter names in data documentation

– Use consistent capitalisation

– No special characters; use underscores to replace spaces

Units

– Explicit in data file and in documentation

– Use recommended units as much as possible (see for example http://ijgofs.whoi.edu/D_I_M/core_parameters_Dec2003_final.pdf)

– If expressing concentrations in units “per kg” (WOCE standard http://whpo.ucsd.edu/manuals/pdf/90_1/appendxg.pdf), include the information necessary to convert the value to a unit “per litre”.
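The information needed for the “per kg” to “per litre” conversion is the seawater density that applies to the sample. A minimal sketch of the arithmetic follows; the function name is hypothetical, and which density to use (for example at in situ or at laboratory temperature) is a choice the data originator has to document, since the slides do not specify it.

```python
def per_kg_to_per_litre(value_per_kg: float, density_kg_m3: float) -> float:
    """Convert a concentration per kg of seawater to per litre of seawater."""
    if density_kg_m3 <= 0:
        raise ValueError("density must be positive")
    # 1 m3 = 1000 litres, so density in kg per litre is density_kg_m3 / 1000.
    return value_per_kg * density_kg_m3 / 1000.0
```

With a density of 1025 kg m-3, a value of 2.0 µmol kg-1 corresponds to about 2.05 µmol l-1, which is why the density (or the temperature and salinity used to derive it) needs to travel with the data.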

Page 26

Examples of data spreadsheet submission to the BODC

BODC has a history of not being prescriptive with file format; it issues a list of requirements and preferences

Advantage: scientists do not have to reformat their data specially for us

Disadvantage: more time-consuming to ingest; variation in data layout seems to be unlimited; and scientists easily forget our recommendations.

Solution? Distribute clearer guidelines (checklist?)? Provide examples?

Spreadsheet example 1

Page 27

Spreadsheet example 1

Good filename

Good data set title

Good overall data organisation

Good explicit column headers

Page 28

Spreadsheet example 1

Definitions need to be explicit.

Avoid free text for dates: use standard machine-readable date and time formats.

Do not mix characters and numbers in the same field. Use separate columns for flags.

Use consistent data format AND indicate the convention used for absent values in the header.

Avoid blank cells unless the value is missing.
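Two of the points above (separate columns for flags, machine-readable dates) can be sketched in a few lines. Everything named here is an assumption for illustration: the `-999` absent-value code, the `N` flag, and the functions `split_value_and_flag` and `to_iso`; a real submission should use whatever convention is declared in the file header.

```python
import re
from datetime import datetime

MISSING = "-999"  # hypothetical absent-value convention; state it in the header

_VALUE = re.compile(r"^\s*([<>]?)\s*(-?\d+(?:\.\d+)?)\s*$")


def split_value_and_flag(cell: str) -> tuple[str, str]:
    """Split a cell such as '<0.02' into a numeric column and a flag column."""
    if cell.strip() == "":
        return MISSING, "N"          # 'N' = no value; hypothetical flag code
    match = _VALUE.match(cell)
    if match is None:
        raise ValueError(f"cannot interpret cell: {cell!r}")
    qualifier, number = match.groups()
    return number, qualifier         # flag is '', '<' or '>'


def to_iso(free_text: str, layout: str) -> str:
    """Rewrite a date-time string in ISO 8601 form, given its known layout."""
    return datetime.strptime(free_text, layout).strftime("%Y-%m-%dT%H:%M:%S")
```

For example, `split_value_and_flag("<0.02")` yields the number `0.02` and the flag `<`, and `to_iso("01/09/2008 14:30", "%d/%m/%Y %H:%M")` yields `2008-09-01T14:30:00`.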

Page 29

Spreadsheet example 1

Example of a data file formatted for easy ingestion in any data system

Page 30

Spreadsheet example 2

Original data file

Well organised data file but extremely difficult to ingest in a database because it is split across multiple worksheets

Page 31

Spreadsheet example 2

Reformatted data file
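The reformatting step in this example amounts to stacking the worksheets into one table. The sketch below assumes, purely for illustration, that each worksheet holds one station and that all worksheets share the same column headings; the actual layout of the example file is not recoverable from the transcript. Reading the workbook is left out: `sheets` is a plain dictionary of rows, and `combine_sheets` is a hypothetical name.

```python
def combine_sheets(sheets: dict[str, list[list[str]]], label: str = "station") -> list[list[str]]:
    """Stack worksheets that share column headings into one table."""
    combined: list[list[str]] = []
    header = None
    for sheet_name, rows in sheets.items():
        if not rows:
            continue
        if header is None:
            header = rows[0]
            combined.append([label] + header)
        elif rows[0] != header:
            raise ValueError(f"sheet {sheet_name!r} has different column headings")
        # Prefix each data row with the sheet name so no information is lost.
        combined.extend([sheet_name] + row for row in rows[1:])
    return combined
```

The worksheet name becomes an ordinary column, so the combined table can be saved as a single delimited file.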

Page 32

Other issues to look for

• Depth poorly defined: e.g. “SSCM”, “1%LD”, “bottom”, “0m”, “mixed layer”, “O2 minimum”

• Underway samples identified by their position only (no date and time)

• CTD samples identified by a “site” name only although multiple CTD casts were done at that site at different times of day (and night).

• Sampling identified by a date only: no station identifiers, no time

• Use of coloured cells for quality indicators: not machine readable and not safe for long-term preservation (not preserved in ASCII format)
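Several of these issues can be caught before submission with a simple per-row check. The sketch below is illustrative only: the column names in `REQUIRED`, the date and time layout, and the function name are assumptions, and a real check would follow the conventions of the receiving data centre.

```python
from datetime import datetime

REQUIRED = ("station", "date", "time", "depth_m")  # hypothetical column names


def find_identification_problems(row: dict[str, str]) -> list[str]:
    """List the ways a sample row fails to say where and when it was taken."""
    problems = [f"missing {name}" for name in REQUIRED if not row.get(name, "").strip()]
    depth = row.get("depth_m", "").strip()
    if depth:
        try:
            float(depth)  # labels such as "SSCM" or "0m" are rejected here
        except ValueError:
            problems.append(f"depth is not a number: {depth}")
    stamp = f"{row.get('date', '').strip()} {row.get('time', '').strip()}"
    if "missing date" not in problems and "missing time" not in problems:
        try:
            datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        except ValueError:
            problems.append(f"date/time not in YYYY-MM-DD HH:MM form: {stamp}")
    return problems
```

A row with depth “SSCM” and no time is reported with both problems, whereas a row with a station identifier, a date, a time and a numeric depth returns an empty list.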

Page 33: IMBER workshop BEER: Being Efficient and Environmentally Responsible, Miami, 9 November 2008 Data Integration Made Easier Gwenaëlle Moncoiffé British Oceanographic.

IMBER workshop BEER: Being Efficient and Environmentally Responsible, Miami, 9 November 2008

Preparing Data for sharing and for the Archive

• 4. Use consistent data organisation

Matrix layout: rows of observations, columns of variables

Keep similar data together

Avoid very large files BUT

Do not break up the data into many small data files or multiple worksheets.

Only split files when necessary on grounds of file size or different metadata field requirements.
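As an illustration of the matrix layout, the sketch below combines several small per-cast tables into a single table with one row per observation and one column per variable. The cast names, column names and values are invented for the example, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical per-cast tables, as they might sit in separate worksheets.
casts = {
    "CTD001": [{"depth_m": "5", "nitrate_umol_l": "1.2"},
               {"depth_m": "50", "nitrate_umol_l": "8.4"}],
    "CTD002": [{"depth_m": "5", "nitrate_umol_l": "0.9"}],
}


def to_matrix(casts):
    """Combine per-cast tables into one table: one row per observation, one column per variable."""
    columns = ["station", "depth_m", "nitrate_umol_l"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for station, rows in casts.items():
        for row in rows:
            # The cast identifier becomes a column instead of a worksheet name.
            writer.writerow({"station": station, **row})
    return out.getvalue()


print(to_matrix(casts))
```

Moving the cast identifier into a column is what allows the separate tables to be kept together in one file.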

Page 34:

Preparing Data for sharing and for the Archive

• 5. Perform basic quality assurance

Check the file format in the ASCII version

Check file organisation and descriptors

Ensure all essential metadata is included

Check that all the values have transferred correctly to the ASCII version.

In particular, check for the missing reference indicator “#REF!”

Convert formulas to values before saving the CSV file.
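Some of these checks can be automated on the exported CSV file. The sketch below is one possible approach under stated assumptions: “#REF!” is the indicator named above, while the other error strings are common spreadsheet error values added here for completeness. It also reports rows whose field count differs from the header and cells that still hold a formula. The sample data are invented.

```python
import csv
import io

# "#REF!" is named in the text above; the other tokens are an added assumption.
ERROR_TOKENS = {"#REF!", "#N/A", "#VALUE!", "#DIV/0!", "#NAME?", "#NUM!", "#NULL!"}


def scan_csv(csv_text):
    """Return (row, column, message) findings; column 0 means the whole row."""
    findings = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = len(rows[0]) if rows else 0
    for r, row in enumerate(rows, start=1):
        if len(row) != width:
            findings.append((r, 0, f"expected {width} fields, found {len(row)}"))
        for c, cell in enumerate(row, start=1):
            value = cell.strip()
            if value in ERROR_TOKENS:
                findings.append((r, c, f"spreadsheet error {value}"))
            elif value.startswith("="):
                findings.append((r, c, "formula not converted to a value"))
    return findings


# Invented export with one broken reference and one leftover formula.
exported = """station,depth_m,chl_ug_l
CTD001,5,0.42
CTD001,50,#REF!
CTD002,5,=B2*2
"""

for finding in scan_csv(exported):
    print(finding)
```

A script of this kind complements, but does not replace, a visual comparison of the ASCII version with the original spreadsheet.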

Page 35:

Preparing Data for sharing and for the Archive

• 6. Assign descriptive data set titles

Data set titles should contain information about the type of data and for example:

– the date range, the location, and/or the instruments used;

– if your data set is part of a larger field project, the project name too.

Keep title length to less than 80 characters (spaces included) as much as possible

Names should contain only numbers, letters, dashes, underscores and spaces -- no special characters.

The data set title should be consistent with the name(s) of the data file(s) in the data archive whether the data set contains only one data file or many thousands of data files.
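The length and character rules above are easy to check mechanically. The following is a minimal sketch; restricting “letters” to unaccented A to Z is an assumption made here, and the example titles are invented.

```python
import re

MAX_LENGTH = 80  # titles should stay under this many characters, spaces included
# Assumption: "letters" is read as unaccented A-Z and a-z.
ALLOWED = re.compile(r"[A-Za-z0-9_\- ]+")


def check_title(title):
    """Return a list of problems; an empty list means the title follows the rules."""
    problems = []
    if not title:
        return ["empty title"]
    if len(title) >= MAX_LENGTH:
        problems.append(f"{len(title)} characters; aim for fewer than {MAX_LENGTH}")
    if not ALLOWED.fullmatch(title):
        bad = sorted(set(re.findall(r"[^A-Za-z0-9_\- ]", title)))
        problems.append("special characters: " + " ".join(bad))
    return problems


# Invented titles for illustration.
print(check_title("ProjectX CTD nutrients 2008-01 to 2008-03 North Atlantic"))
print(check_title("Chl-a (ug/l) @ station #3"))
```

The same function could be run on data file names, so that titles and file names stay consistent.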

Page 36:

Think Scientific Publication

Preparing Data for sharing and for the Archive

• 7. Provide data set documentation

Title = data set title

Author(s), affiliation and contact person

Background: reason for collecting the data, funding source and project/programme.

Material and methods: sampling strategy and methodology, analytical procedures, quality control procedures

Data set content description

Short quality assessment to report problems that may limit the use of the data

References
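One way to make sure none of these sections is forgotten is to keep the documentation as a simple structure and check it before archiving. The sketch below is illustrative only: the section list follows the slide, but the key names and the “TO BE COMPLETED” placeholder are choices made for this example.

```python
# Section names follow the list above; the key spellings are an illustrative choice.
SECTIONS = [
    "title",
    "authors",
    "background",
    "material_and_methods",
    "content_description",
    "quality_assessment",
    "references",
]


def missing_sections(doc):
    """Return the names of sections that are absent or empty."""
    return [name for name in SECTIONS if not str(doc.get(name, "")).strip()]


def render(doc):
    """Render the documentation as plain text, marking empty sections."""
    lines = []
    for name in SECTIONS:
        lines.append(name.replace("_", " ").capitalize())
        lines.append(str(doc.get(name, "")).strip() or "TO BE COMPLETED")
        lines.append("")
    return "\n".join(lines)


# Invented, partly filled documentation record.
draft = {"title": "ProjectX CTD nutrients 2008", "authors": "A. Scientist, Institute Y"}
print(missing_sections(draft))
print(render(draft))
```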

Page 37:

Archiving the data files

• Section relevant to all data collectors and appointed data scientists

Page 38:

Archiving the data files

• Make your master files read-only

• Take copies on a range of media

• Send a copy to your national data centre with instructions on access policy if not already agreed

• Create a DIF for your dataset in GCMD

• Archive as soon as possible and as soon as you start sharing the data. Use version numbers and documentation to clearly identify the version of the data
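The first two points and the versioning advice can be sketched in a few lines. This is a minimal example under stated assumptions, not a prescribed procedure: the `_v1` file name suffix and the SHA-256 comparison of the copy against the master are additions made here, and the file names are invented. The demonstration runs in a temporary directory.

```python
import hashlib
import os
import shutil
import stat
import tempfile


def sha256_of(path):
    """Checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_copy(master, archive_dir, version):
    """Copy master into archive_dir with a version suffix, verify it, make both read-only."""
    stem, ext = os.path.splitext(os.path.basename(master))
    target = os.path.join(archive_dir, f"{stem}_v{version}{ext}")
    shutil.copy2(master, target)
    if sha256_of(master) != sha256_of(target):
        raise OSError("copy does not match the master file")
    read_only = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    os.chmod(master, read_only)
    os.chmod(target, read_only)
    return target


# Demonstration with an invented file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    master = os.path.join(workdir, "nutrients.csv")
    with open(master, "w", newline="") as handle:
        handle.write("station,depth_m,nitrate_umol_l\nCTD001,5,1.2\n")
    archive_dir = os.path.join(workdir, "archive")
    os.mkdir(archive_dir)
    copy = archive_copy(master, archive_dir, version=1)
    copy_name = os.path.basename(copy)
    owner_can_write = bool(os.stat(copy).st_mode & stat.S_IWUSR)
    print(copy_name, "owner write bit set:", owner_can_write)
```

In practice the archive directory would sit on a different medium from the master file, and the version number would also be recorded in the data set documentation.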

Page 39:

The environmentally responsible data flow

[Diagram. Labels: Underway files; CTD data files; Phyto composition abundance; Macronutrients; Primary production; Etc.; Integrated, documented and quality controlled primary data; Project users; National data centres; Global specialised databases; Global datasets and data products; Agreement on Dataset Publication and Citation; Working groups/surveys from GBIF, SCOR/IODE, NERC/JISC, …]

Page 40:

Thank you