IMBER workshop BEER: Being Efficient and Environmentally Responsible, Miami, 9 November 2008
Data Integration Made Easier
Gwenaëlle Moncoiffé
British Oceanographic Data Centre
IOC/IODE GEBICH
Image courtesy of Alex Mustard
Introduction
• Data integration: not an option but a necessity.
Data integration at the heart of IMBER
Increasing demand for easily accessible data and for compilation of global datasets for use in climatologies, gridded products
E.g. CO2, DMS, data integration activities under SOLAS
No data integration possible without good data management practices
Introduction
Data management standards across three eras:
Paper log era (standards: HIGH)
• Strong institute or team-based ethic
• Fewer data better managed
• Long-term staff, stable teams
Personal Computer era (standards: LOW)
• Left to the individual
• Demise of the long-term lab technician
• Diversification + short contract + high turnover > degradation of data management standards
• Rules disappeared or were no longer appropriate/adapted
• Data management becomes an IT “thing”
Global networks era (standards: HIGH)
• Growing community ethic
• Value in data beyond personal research use
• Need to re-instate strong basic principles for efficient data management
• Re-establish data management as a basic scientific skill & an essential requirement
• Not a technicality but our responsibility as scientists.
Introduction
“Researchers need to be obliged to document and manage their data with as much professionalism as they devote to their experiments.”
“they should receive greater support in this endeavour than they are afforded at present.”
“such standards require support from researchers, who should adopt them and deploy them consistently.”
“The lack of standards, for instance, confounds many a researcher seeking to harness the diversity of knowledge now available on
“All credit, then, to those in the vanguard of interoperability.”
Introduction
• This presentation aims to provide tips on how to make the whole process work more efficiently for individual research scientists and anybody keen to follow good data management practices.
• Practical and non-technical
• Mainly cruise-based examples but could be adapted to any form of data collection activity
• Work in progress which feeds from or will feed into the IMBER Cookbook
• Examples and list of resources will be biased towards areas I am more familiar with: UK/Europe, cruise-based data collections.
• Your input to help fill the gaps and fine-tune some sections is important.
Rule 1: start early
• Section relevant to all project PIs
Rule 1: start early
• Include data management in your budget! 5-10% of your grant is a good approximation.
• As a project PI / cruise PSO: decide what strategy you will adopt
– Do it yourself (not recommended even for small projects): time-consuming and too tempting to prioritise the meaty “publishable” bits
– Hire somebody to act as a data manager/integrator (full-time or part-time)
– Delegate the task to a [young] scientist embedded in your team or approach your national data centre or institute data management team
• A data scientist’s tasks start before the first cruise starts and end after the last samples have been analysed and the last dataset handed over.
• If multiple projects share the same cruise(s) then money can be pooled to support one data management berth and pre- and post-cruise work.
Rule 2: follow the guidelines
• Section relevant to project PIs, data collectors and data scientists
Rule 2: follow the guidelines
• Will facilitate collaboration between IMBER teams, projects, countries or activities (“interoperability”)
• Will enable IMBER scientists to make use of existing and developing analytical tools
• Will make it easier to provide data to modellers and global dataset builders
• Will give your data more value by making them more re-usable
Help is at hand: data centres
• NODCs
• Established expert data management centres/teams
BCO-DMO
CYBER from CNRS at the laboratoire de Villefranche
…
• Specialist and world data centres
Help is at hand: advising groups, technical support, standard setters
• MMI (Marine Metadata Interoperability) website and very accessible guides on metadata, ontologies, vocabularies
• IODE: Ocean Teacher, Ocean Data Standards, Expert groups
• ICES: Guidelines for data collection and exchange
• CLIVAR and Carbon Hydrographic Data Office (CCHDO): Manuals for data collection and exchange
data, chemical and biological stocks and rates
• From surface to deep ocean sediment
• Research cruises, in situ enrichment experiments, modelling activities, lab and mesocosm experiments, remote sensing
Need a central searchable inventory
Building the IMBER data inventory
• At the Project level
• At the Cruise level
• At Individual level
Minimum requirement
Project information in GCMD
Cruises in CSR databases and eventually DIFs
Building the IMBER data inventory
• GCMD portal but learn from past experience (e.g. GLOBEC)
• Ensure that at least one level of information (the “project” level) is captured consistently and linked to IMBER as a project
• Follow DIF guidelines for IMBER projects (see Cookbook)
• Cruise information will be added once a CSR-to-DIF converter becomes available
• It is good practice to create a DIF for your individual dataset record; in doing this, ensure it is linked to IMBER (there will be info in the Cookbook about this)
Building the IMBER data inventory
• Why CSRs?
Specifically designed for oceanographic cruise reporting
The only existing international standard (initiated by the IOC in the 60s)
Two international repositories: DOD and ICES
Adopted by European SeaDataNet project and by POGO
Online tools are becoming available
Long history with an important legacy population (38,000 cruises in ICES database)
Building the IMBER inventory
DIFs
CSRs
Cruises
Filled in by PSOs or appointed project’s data specialist
Reviewed by IMBER DLO
Converter to be developed
NODC
Data discovery
Filled in by projects’ data scientists
Examples of non cruise-based activities:
- Lab and mesocosm experiments
- Remote-sensing
- Socio-economic data
- Model-derived data
IMBER Metadata Portal in GCMD
Projects and non-cruise activities
Data sharing and archiving
• Section relevant to all data collectors and appointed data scientists
Preparing data for the archive
• What to archive?
Table columns: Three levels, Examples, Archive?, Where?
Preparing data for sharing and for the archive
• How to share and archive data?
• References
Best Practices for Preparing Environmental Data Sets to Share and Archive (Cook et al 2001, updated by Hook et al 2007, Oak Ridge National Laboratory) – 9 pages.
– http://daac.ornl.gov/PI/bestprac.html
BCO-DMO Data Management Guidelines Manual: a collection of best practice recommendations for collecting and sharing biogeochemical and ecological oceanographic data and metadata – 15 pages.
Reflect content and uniquely identify the data file
Use project acronym, cruise identifier, station identifier, study title, investigator, data type, version number, date created, etc.
Compliant with various data systems and platforms
– Only numbers, letters, dashes, underscores – no spaces or special characters
– Use lower case
– Limit length to 64 characters max
– Use appropriate file extension to reflect file type.
Preparing Data for sharing and for the Archive
D999_ctd001_pro_20080901.cnv
D999_frrf001_pro_20080901.asc
D999_frrf001_raw.asc
D999_uway_nutrients_20080901.xls
D999_ctd_nutrients_20080901.xls
D999_blogg_hplc_v1_20080901.csv
D999_blogg_hplc_bodc_20090901.csv
D999_blogg_hplc_bodc_20081201.csv
Bodc.xls
Data for bodc.xls
Nutrients.csv
Hplc.csv
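The naming rules above lend themselves to an automated check. The following is a minimal sketch in Python, not part of the original slides: `check_file_name` and its `require_lower_case` switch are hypothetical names, and the 64-character limit is assumed to cover the whole name including the extension. Because the cruise-prefixed examples above start with an upper-case "D", the lower-case rule is optional here.

```python
import re

# Characters allowed by the slide's rule: numbers, letters, dashes, underscores,
# followed by a single dot and a file extension.
_ALLOWED = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")
MAX_LENGTH = 64  # limit quoted on the slide (assumed to include the extension)


def check_file_name(name, require_lower_case=False):
    """Return a list of rule violations for a data file name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 64 characters")
    if not _ALLOWED.match(name):
        problems.append("contains spaces or special characters, or lacks an extension")
    if require_lower_case and name != name.lower():
        problems.append("contains upper-case letters")
    return problems


for candidate in ("D999_ctd001_pro_20080901.cnv", "Data for bodc.xls"):
    print(candidate, check_file_name(candidate))
```

A check like this only covers the mechanical rules. It cannot tell whether a name such as Hplc.csv is descriptive enough to identify the file uniquely, so that judgement stays with the data provider.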
• 2. Use consistent and stable file formats
Tabular data
Spreadsheet and ASCII format
Use the same format throughout the file or files
For ASCII, use delimiters such as comma, tab or semicolon
Don’t include figures and summary statistics in the data file – keep these in a separate file
Use a header
– Avoid special characters and avoid using the chosen delimiter
– Identify your dataset with data file name, data set title, author, version name, date created, date modified, reason for modifications, associated document file name
– Use column headings for parameter names
– Use one row for parameter units
Preparing Data for sharing and for the Archive
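The header layout described above (identifying header lines, one row of column headings, one row of units, then the data) can be sketched in Python. This is an illustration under assumptions rather than a prescribed format: the function name `write_data_file`, the "key: value" style of the header lines and the example parameter names and units are all hypothetical.

```python
import csv
import io


def write_data_file(stream, header, columns, units, rows, delimiter=","):
    """Write header lines, a column-heading row, a units row, then the data rows."""
    if len(columns) != len(units):
        raise ValueError("every column needs exactly one unit entry")
    for key, value in header.items():
        # The header must not contain the delimiter chosen for the data.
        if delimiter in key or delimiter in value:
            raise ValueError(f"header entry {key!r} contains the delimiter")
        stream.write(f"{key}: {value}\n")
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)   # parameter names
    writer.writerow(units)     # one row for parameter units
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match the column headings")
        writer.writerow(row)


buffer = io.StringIO()
write_data_file(
    buffer,
    {"data_set_title": "D999 CTD nutrients", "version": "v1"},
    ["station", "nitrate"],
    ["none", "umol_per_l"],
    [["ctd001", 1.2]],
)
print(buffer.getvalue())
```

Rejecting header values that contain the delimiter keeps the file parseable by any system that splits lines on that character.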
• 3. Define the content of your data files
Parameter Name
– Use commonly accepted parameter names
– Always define abbreviated parameter names in data documentation
– Use consistent capitalisation
– No special characters; use underscores to replace spaces
Units
– Explicit in data file and in documentation
– Use recommended units as much as possible (see for example http://ijgofs.whoi.edu/D_I_M/core_parameters_Dec2003_final.pdf)
– If expressing concentrations in units “per kg” (WOCE standard http://whpo.ucsd.edu/manuals/pdf/90_1/appendxg.pdf), include the information necessary to convert the value to units “per litre”.
Preparing Data for sharing and for the Archive
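The information needed to go from “per kg” to “per litre” is essentially the density of the water sample. As a minimal sketch in Python (the function name and the example density are assumptions for illustration, not values from the slides), the conversion multiplies the per-kg value by the density expressed in kg per litre:

```python
def per_kg_to_per_litre(value_per_kg, density_kg_per_m3):
    """Convert a concentration per kg of seawater to a concentration per litre."""
    if density_kg_per_m3 <= 0:
        raise ValueError("density must be positive")
    density_kg_per_litre = density_kg_per_m3 / 1000.0  # 1 m3 = 1000 litres
    return value_per_kg * density_kg_per_litre


# Hypothetical sample: 2.0 umol per kg at an assumed density of 1025 kg per m3.
print(per_kg_to_per_litre(2.0, 1025.0))
```

The density that applies depends on the temperature and salinity at which the sample was measured, which is why that information needs to travel with the data file.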
Examples of data spreadsheet submission to the BODC
BODC has a history of not being prescriptive with file format – it issues a list of requirements and preferences
Advantage: scientists do not have to reformat their data specially for us
Disadvantage: more time-consuming to ingest; variation in data layout seems to be unlimited; and scientists easily forget our recommendations.
Solution? Distribute clearer guidelines (checklist?)? Provide examples?
Spreadsheet example 1
Good filename
Good data set title
Good overall data organisation
Good explicit column headers
Spreadsheet example 1
Definitions need to be explicit.
Avoid free text for dates - use standard machine readable date and time formats.
Do not mix characters and numbers in same field. Use separate columns for flags.
Use consistent data format AND indicate the convention used for absent values in the header.
Avoid blank cells unless the value is missing.
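Two of these recommendations, separate flag columns and machine-readable dates, can be illustrated with a short Python sketch. Everything named here is an assumption for illustration: the helper names, the "-999" absent-value convention (which would need to be declared in the file header) and the example input formats are not taken from the slides.

```python
import re
from datetime import datetime

ABSENT = "-999"  # hypothetical absent-value convention, to be stated in the header
_MIXED = re.compile(r"^([<>]?)\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]*)$")


def split_value_and_flag(cell):
    """Split a cell such as '<0.02' into a numeric string and a separate flag."""
    text = cell.strip()
    if text == "":
        return (ABSENT, "")
    match = _MIXED.match(text)
    if match is None:
        raise ValueError(f"cannot separate value and flag in {cell!r}")
    prefix, number, suffix = match.groups()
    return (number, prefix or suffix)


def to_iso_datetime(text, input_format):
    """Re-express a date and time string in ISO 8601 form."""
    return datetime.strptime(text, input_format).isoformat()


print(split_value_and_flag("<0.02"))
print(to_iso_datetime("01/09/2008 14:30", "%d/%m/%Y %H:%M"))
```

Cells that cannot be split cleanly raise an error instead of being guessed at, so that ambiguous entries are reviewed by the person who knows the data.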
Spreadsheet example 1
Example of a data file formatted for easy ingestion in any data system
Spreadsheet example 1
Original data file
Well organised data file but extremely difficult to ingest in a database because split in multiple worksheets
Spreadsheet example 2
Spreadsheet example 2
Reformatted data file
• Underway samples identified by their position only (no date and time)
• CTD samples identified by a “site” name only although multiple CTD casts were done at that site at different times of day (and night).
• Sampling identified by a date only: no station identifiers, no time
• Use of coloured cells for quality indicators: not machine readable and not safe for long-term preservation (not preserved in ASCII format)
• 4. Use consistent data organisation
Matrix layout: rows of observations, columns of variables
Keep similar data together
Avoid very large files BUT
Do not break up the data in many small data files or multiple worksheets.
Only split files when necessary on grounds of file size or different metadata field requirements.
Preparing Data for sharing and for the Archive
Check file format in ASCII version
Check file organisation and descriptors
Ensure all essential metadata is included
Check that all the values have transferred correctly to the ASCII version.
In particular, check for missing reference indicator “#REF!”
Convert formulae to values before saving CSV file.
Preparing Data for sharing and for the Archive
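The check for “#REF!” in the exported ASCII file is easy to automate. Below is a minimal Python sketch; the function name is hypothetical, and the extra spreadsheet error values and the test for cells that still start with "=" are additions beyond the single indicator named on the slide.

```python
import csv
import io

# "#REF!" is the indicator named on the slide; the others are further
# spreadsheet error values added here as an assumption.
ERROR_TOKENS = ("#REF!", "#DIV/0!", "#VALUE!", "#N/A", "#NAME?")


def find_spreadsheet_errors(stream, delimiter=","):
    """Return (row, column, cell) for each cell holding an error token or a formula."""
    findings = []
    reader = csv.reader(stream, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            text = cell.strip()
            if text in ERROR_TOKENS or text.startswith("="):
                findings.append((row_number, column_number, text))
    return findings


sample = io.StringIO("station,nitrate\nctd001,1.2\nctd002,#REF!\n")
print(find_spreadsheet_errors(sample))  # [(3, 2, '#REF!')]
```

Row and column numbers are counted from 1 so that they match what a spreadsheet user sees when going back to correct the original file.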
• 6. Assign descriptive data set titles
Data set titles should contain information about the type of data and for example:
– the date range, the location, and/or the instruments used;
– if your data set is part of a larger field project, the project name too.
Keep title length to less than 80 characters (spaces included) as much as possible
Names should contain only numbers, letters, dashes, underscores and spaces -- no special characters.
The data set title should be consistent with the name(s) of the data file(s) in the data archive whether the data set contains only one data file or many thousands of data files.
Preparing Data for sharing and for the Archive
Preparing Data for sharing and for the Archive
• 7. Provide data set documentation
Title = data set title
Author(s), affiliation and contact person
Background: reason for collecting the data, funding source and project/programme.
Material and methods: sampling strategy and methodology, analytical procedures, quality control procedures
Data set content description
Short quality assessment to report problems that may limit the use of the data
References
Archiving the data files
• Section relevant to all data collectors and appointed data scientists
Archiving the data files
• Make your master files read-only
• Take copies on a range of media
• Send a copy to your national data centre with instructions on access policy if not already agreed
• Create a DIF for your dataset in GCMD
• Archive as soon as possible and as soon as you start sharing the data – use version numbers and documentation to clearly identify the version of the data
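The first two steps and the version-numbering advice can be sketched in Python. This is an illustration under assumptions, not a prescribed procedure: `archive_master` is a hypothetical helper, the "_v<version>" suffix is one possible naming convention, and the target directories stand in for whatever backup media are actually used.

```python
import os
import shutil
import stat


def archive_master(master_path, copy_dirs, version):
    """Copy a master file to each directory with a version tag, then lock the master."""
    base, extension = os.path.splitext(os.path.basename(master_path))
    copies = []
    for directory in copy_dirs:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, f"{base}_v{version}{extension}")
        shutil.copy2(master_path, target)  # keeps timestamps along with the content
        copies.append(target)
    # Remove all write permissions from the master file (read-only for everyone).
    os.chmod(master_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return copies
```

A call such as `archive_master("D999_ctd_nutrients_20080901.xls", ["/mnt/backup_a", "/mnt/backup_b"], 1)`, with hypothetical paths, would leave one versioned copy in each location. Sending a copy to the national data centre and creating the DIF remain manual steps.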
The environmentally responsible data flow