NSF Data Management Plans https://www.flickr.com/photos/intersectionconsulting/7537238368/in/set-72157614274686504/ Kate Anderson Chris Elsik 3/22/17
Why we’re here…
“The goal of data management is to produce self-describing data sets” (DataONE Primer)
u Data are important!
Note: today, we’re focusing on final data (rather than raw or intermediate data)
u Benefits you & your collaborators
u Benefits science & inquiry
u Funding Agencies and Journals Require Data Management Plans and Sharing of Data
Data Sharing Policy
u Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
Data Management Plans
u Proposals submitted or due on or after January 18, 2011, must include a supplementary document of no more than two pages labeled “Data Management Plan”.
Data Management Plans
u Subject to peer review
u Read the DMP Guidelines for your Directorate!
u Standard FAQ Answer regarding the sharing of data: “Data resulting from the award should be managed according to the data management plan that accompanied the proposal.”
Data Management Plans
u The types of data, samples, physical collections, software, curricular materials, and other materials to be produced in the course of the project;
u The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
u Policies for access and sharing, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
u Policies and provisions for re-use, re-distribution, and the production of derivatives; and
u Plans for archiving data, samples, and other research products, and for preservation of access to them.
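The five elements above can double as a checklist while drafting. The following is a minimal sketch, not an NSF tool: the dictionary keys and the `missing_elements` helper are hypothetical names chosen for illustration, and the descriptions paraphrase the list above.

```python
# Hypothetical labels for the five DMP elements listed above.
DMP_ELEMENTS = {
    "data_types": "Types of data, samples, collections, software, and other materials",
    "standards": "Standards for data and metadata format and content",
    "access_sharing": "Policies for access and sharing, including privacy and IP protections",
    "reuse": "Policies and provisions for re-use, re-distribution, and derivatives",
    "archiving": "Plans for archiving and for preservation of access",
}


def missing_elements(draft):
    """Return the labels of elements that have no text in the draft.

    `draft` maps an element label to the text written for it so far.
    """
    return [key for key in DMP_ELEMENTS if not draft.get(key, "").strip()]


# Example draft with placeholder text; two elements are absent and one is empty.
draft = {
    "data_types": "Soil moisture readings from in situ probes.",
    "standards": "",
    "archiving": "Data will be deposited in an institutional repository.",
}

for key in missing_elements(draft):
    print(f"Missing: {DMP_ELEMENTS[key]}")
```

A check like this only confirms that each element has some text; whether the text is specific enough is still a question for peer review.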
The Circle of Life…
(DataONE Primer)
DMP Best Practices
http://libraryguides.missouri.edu/datamanagement
DMPs: A little vague
u All files will be stored on PI’s secure computer. All laboratory notebooks will be stored in PI’s office.
u All sample data will be collected and organized using [Specialty Software Name]. The files will contain information about sample characteristics and the conditions under which these characteristics were measured. Approximately 1-2 GB of data will be generated.
u Data will be available to anyone who desires access to our data. When possible, data will be made available online.
DMPs: Better!
u NSF example (excerpt of data and metadata standards section): The project will leverage existing metadata standards currently stored in Ecological Metadata Language (EML) format for the NutNet project. We will add additional metadata entries for the arthropod community composition and arthropod stoichiometry; field notes taken during the time of collection will be recorded. Morpho software will be used to generate the metadata file in EML. We chose EML format for our metadata since it allows integration with existing NutNet data housed in the Knowledge Network for Biocomplexity (KNB) data repository.
https://www.dataone.org/sites/all/documents/DMP_NutNet_Formatted.pdf
Metadata Basics
u Data about data
u Metadata lets others discover, understand, and use your data
u Metadata/annotation must be added throughout the lifecycle
Metadata to Consider: Who, What, Where, When, Why, How
u Name of the data set and data files
u Date of creation and last modification
u Software used to create file (including version)
u Data processing performed
u Who collected the data
u Contact information of responsible party
u Sponsor or funding agencies
u Why the data were collected (abstract; keywords; controlled vocabulary); when and where
u Instrumentation; experimental conditions; calibrations
u Units of measure
u Taxonomic details
u Known problems that limit data use
u How to cite the data set
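One lightweight way to capture these items is a small machine-readable record stored next to the data files. The sketch below is an illustration only: it writes plain JSON rather than a formal standard such as EML, and the field names, the `build_record` helper, and all values are hypothetical placeholders.

```python
import json

# Hypothetical field names covering the items listed above.
REQUIRED_FIELDS = [
    "dataset_name", "files", "created", "last_modified", "software",
    "processing", "collected_by", "contact", "funding", "abstract",
    "keywords", "instrumentation", "units", "known_problems", "citation",
]


def build_record(values):
    """Return a metadata record, raising ValueError if any field is absent."""
    missing = [field for field in REQUIRED_FIELDS if field not in values]
    if missing:
        raise ValueError("Missing metadata fields: " + ", ".join(missing))
    return {field: values[field] for field in REQUIRED_FIELDS}


# Placeholder values only; replace every entry with real project details.
record = build_record({
    "dataset_name": "example_dataset",
    "files": ["readings.csv"],
    "created": "YYYY-MM-DD",
    "last_modified": "YYYY-MM-DD",
    "software": "name and version of the software that created the file",
    "processing": "description of processing steps",
    "collected_by": "name of collector",
    "contact": "email of responsible party",
    "funding": "sponsor or funding agency",
    "abstract": "why, when, and where the data were collected",
    "keywords": ["keyword one", "keyword two"],
    "instrumentation": "instruments, conditions, calibrations",
    "units": {"column_name": "unit of measure"},
    "known_problems": "known issues that limit data use",
    "citation": "how to cite the data set",
})

text = json.dumps(record, indent=2)
print(text)
```

Optional items such as taxonomic details could be added as extra fields where they apply. Requiring every field up front makes gaps visible early, when the information is still easy to recover.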
Data Repositories
Domain Repositories
u Data stored with similar items
u Researchers in your area are familiar with the repository
u Subject-specific / data-type specific needs addressed
u More computational tools available
MOspace
u Subject repository may not exist
u Preserves link to institution with guarantee of support from the university
u Domain repositories can shut down once the grant ends
Domain Repositories: So Many Choices!
Depositing Data to MOspace
u Simply email the MOspace team to get things going!
u Let them know:
  u Author name(s)
  u Project title and description
  u Types of file(s) you want to submit
  u Estimated file size
  u Special software needed to read the file(s)
  u Whether your data have been deposited in another repository (e.g., Dryad, DataONE, ICPSR)
u The MOspace team will contact you about best ways to submit your data.
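Two of the items requested above, file types and estimated file size, can be gathered with a short script before writing the email. This is a sketch under assumptions: the `summarize_files` helper is a hypothetical name, and the demo runs on a temporary folder with made-up files rather than real project data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_files(folder):
    """Return (file counts by extension, total size in bytes) for a folder tree."""
    counts = Counter()
    total_bytes = 0
    for path in Path(folder).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
            total_bytes += path.stat().st_size
    return counts, total_bytes


# Demo on a throwaway folder; point summarize_files at your own data folder instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "readings.csv").write_text("site,value\nA,1\n")
    (Path(tmp) / "README.txt").write_text("example\n")
    counts, total_bytes = summarize_files(tmp)

for extension, number in sorted(counts.items()):
    print(f"{extension}: {number} file(s)")
print(f"Estimated total size: {total_bytes} bytes")
```

The printed summary can be pasted into the message alongside the author names, project description, and any software requirements.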
So, what do I say about MOspace in my NSF DMP?
Remember that DMPs are subject to peer review, so the nature of the plan will be specific to your project.
"[X type of data] will be deposited in MOspace, the University of Missouri's digital institutional repository. MOspace is based on MIT's DSpace technology and is a joint venture of the University of Missouri's Division of Information Technology and the University Libraries. MOspace items will include appropriate metadata and a permanent URL. Items will be freely available via the MOspace web site at https://mospace.umsystem.edu and will be searchable via Google and other search engines."
Think about the Licensing…
More Resources
u MU Libraries Guide on NSF Data Management Plans: http://libraryguides.missouri.edu/datamanagement
u MU Libraries Guide on Data Sets: http://libraryguides.missouri.edu/datasets
u MU Libraries Guide on Open Access: http://libraryguides.missouri.edu/oajournals
u MU Libraries Guide on Public Access: http://libraryguides.missouri.edu/publicaccess
u DataONE: Primer on Data Management: What you always wanted to know* (*but were afraid to ask): https://www.dataone.org/best-practices
u MIT Libraries. Data Management and Publishing: http://libraries.mit.edu/guides/subjects/datamanagement/index.html
u UW-Madison Research Data Services: http://researchdata.wisc.edu/
u University of Arizona Libraries Data Management Resources: http://data.library.arizona.edu/
DMP development exercise for Missouri Transect trainees
u Missouri Transect students and postdocs are tasked with developing a Data Management Plan (DMP) for their research projects.
u The exercise will provide valuable experience to trainees.
u CI Team will provide advice and feedback.
u The individual DMPs will be used to update the Missouri Transect DMP.
Current Missouri Transect Data Management Plan
u The current Missouri Transect DMP is available here:
https://missouriepscor.org/cyberinfrastructure/data-management
u It was developed before proposal submission by someone without expertise in the subject domains (C. Elsik), and some subject areas are not covered.
u We plan to update the Missouri Transect DMP after receiving your individual DMPs.
u A more current and detailed Data Sharing Policy (separate from the DMP) is available here, but it does not include subject-specific information:
https://missouriepscor.org/cyberinfrastructure/data-policy
Excerpt from current Missouri Transect DMP
u Types of Data Produced
This project will produce many diverse datasets. The Climate team will work with current and archived climate data from weather stations throughout the state, including 5-minute, hourly and daily conditions for air temperature, relative humidity, wind direction and speed, soil temperature at 2-inch depth, solar radiation, and rainfall. Microclimate data will be collected by Doppler radar. Climate models will also use data from the North American Regional Climate Change Assessment Program, Missouri Mesonet, the PRISM grid (parameter-elevation regressions on independent slopes model), the National Elevation Dataset 30m Digital Elevation Models grid, the Pennsylvania State University Soil Information for Environmental Modeling Ecosystem Management database, and the National Land Cover Dataset 2001. The climate team will also collect soil redox potential, soil moisture, and pH using in situ probes.
An example from the current Missouri Transect DMP
u Data and metadata standards
Climate data will be stored in CF-compliant NetCDF format. Doppler radar data will be available as Nexrad level III and IRIS, which can be converted to Universal Format. Genomic data formats include Fastq, Fasta, GFF3, BAM/SAM, VCF. File formats include text/ASCII, standard imaging (e.g. jpg, pgm, ppm, tiff for 2D, ply, blend, mesh, pcd for 3D), imaging for GIS (GeoTiff), video (e.g. mp4, MPEG, avi), binary MatLab (mat). For some data types, metadata content is embedded in data files. For example, webcam image data will be stored in JPEG format, because it includes the ability to store metadata as EXIF tags within the jpg format itself, including time-stamps, GPS location, exposure, focal length, focus distance, and what color-correction algorithm has been applied. Similarly, GeoTiff is a public domain metadata standard that allows georeferencing information to be embedded within a TIFF file.
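When writing an individual DMP, it can help to check which of your files already fall under the formats named in this excerpt. The sketch below is an assumption-laden illustration: the category names and the `categorize` helper are hypothetical, and the mapping from format names to file extensions (for example `.nc` for NetCDF) is a common convention rather than something the DMP specifies. GeoTiff is left out because its extension cannot be told apart from plain TIFF.

```python
from pathlib import Path

# Hypothetical mapping from the formats named in the excerpt to file extensions.
FORMAT_CATEGORIES = {
    "climate (NetCDF)": {".nc"},
    "genomic": {".fastq", ".fasta", ".gff3", ".bam", ".sam", ".vcf"},
    "2D image": {".jpg", ".pgm", ".ppm", ".tiff"},
    "3D image": {".ply", ".blend", ".mesh", ".pcd"},
    "video": {".mp4", ".mpeg", ".avi"},
    "MATLAB binary": {".mat"},
}


def categorize(filename):
    """Return the category for a file name, or 'unlisted' if none matches."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return "unlisted"


# Made-up file names for illustration.
for name in ["reads.FASTQ", "webcam.jpg", "notes.docx"]:
    print(f"{name}: {categorize(name)}")
```

Files reported as "unlisted" are candidates for a sentence in your own DMP explaining which format and metadata standard will be used for them.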
An example from the current Missouri Transect DMP
u Plans for archiving & preservation
The CI Team will work with investigators to identify appropriate repositories. Sequencing data will be submitted to the NCBI Sequence Read Archive. iPlant will serve as a repository for plant image and phenotyping data. Other repositories will be identified through resources such as the Open Access Directory (http://oad.simmons.edu/oadwiki/Data_repositories), the Ohio State University science repository (http://library.osu.edu/find/subjects/science-data/) and the Registry of Research Data Repositories (http://www.re3data.org). Metadata, code, small datasets and links to large datasets used in publications will be archived as supplements or in repositories such as Dryad (http://datadryad.org/) or MOspace (http://libraryguides.missouri.edu/MOspace), the UM System institutional repository.
Clarification: Missouri Transect Data Portal vs Archival Repository
u The Missouri Transect data portal provides a means to store and share data throughout the duration of the Missouri Transect Project.
https://data.missouriepscor.org
u According to the Missouri Transect Data Policy, all data must be submitted to this portal or another approved server for data sharing during the project period.
u However, the Missouri Transect Data Portal is not a long-term archival repository.
u Each individual DMP should list a repository where data will be submitted at the end of the Missouri Transect project.