Active Data Curation in Libraries: Issues and Challenges
ASEE ELD Presentation, June 27, 2011
William H. Mischo & Mary C. Schlembach
Dec 21, 2015
Active Data Curation
• Curation is the active use of data. It is a lifecycle process.
• Curation requires discipline-specific knowledge and experience.
• Domain-dependent curation rules and preservation actions must be merged into the scientific workflow processes.
• Need to automate data ingest, descriptive metadata creation, preservation, and digital object relationships.
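As one illustration of automating descriptive metadata creation at ingest, the following is a minimal sketch that builds a Dublin Core record and a fixity checksum for a single file. The function name `describe`, the wrapper element `record`, and the use of the file extension as the format value are assumptions made for this example; they are not taken from the Grainger system.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

# Dublin Core element set namespace (DC is one of the schemas named in the deck).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def describe(path, creator, title):
    """Return a small Dublin Core record (as an XML string) for one file.

    Hypothetical helper: a real ingest pipeline would add many more
    fields and would normally record a MIME type rather than an extension.
    """
    data = Path(path).read_bytes()
    record = ET.Element("record")  # assumed wrapper element
    fields = [
        ("title", title),
        ("creator", creator),
        # SHA-256 digest of the content, usable later for fixity checks
        ("identifier", "sha256:" + hashlib.sha256(data).hexdigest()),
        ("format", Path(path).suffix.lstrip(".") or "unknown"),
    ]
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")
```

Calling `describe("run1.csv", "A. Researcher", "Run 1")` on an existing file returns a single XML string with `dc:title`, `dc:creator`, `dc:identifier`, and `dc:format` elements.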
The Grainger Library Active Data Curation Lifecycle Elements
[Diagram. Recoverable labels:]
• Scientific Workflow
• Fedora/Hydra Trusted Digital Repository (OAIS compliant)
• Preservation Actions
• Metadata Management: METS, PREMIS, MODS, DC, XSLT
• Curation Rule Engine: operates on metadata, content objects, AIPs, OAI-ORE
– Domain dependent
– Can be invoked explicitly
– But also automated based on system trigger events
• CI-3, CI-5 Responses
• Access Mechanisms and E-Scholarship Services, GrIPs
• SIP packages
• DIP packages
• Appraisal and Selection
• Migration and Emulation Tools
• Use, Reuse, Repurposing Tools
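The rule engine's two invocation modes (explicit, and automated on system trigger events) can be sketched as follows. This is an illustrative toy, not the Grainger implementation: the class names, the event name `format_obsolete`, and the sample rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DigitalObject:
    identifier: str
    domain: str                      # e.g. "atmospheric" or "biophysics"
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    name: str
    domain: str                      # rules are domain dependent
    trigger: str                     # hypothetical system event name
    action: Callable[[DigitalObject], None]


class CurationRuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def invoke(self, rule_name, obj):
        """Explicit invocation: run one named rule if it applies to obj's domain."""
        for rule in self.rules:
            if rule.name == rule_name and rule.domain == obj.domain:
                rule.action(obj)
                return True
        return False

    def on_event(self, event, obj):
        """Automated invocation: run every rule bound to this event and domain."""
        fired = []
        for rule in self.rules:
            if rule.trigger == event and rule.domain == obj.domain:
                rule.action(obj)
                fired.append(rule.name)
        return fired


def flag_for_migration(obj):
    # Hypothetical preservation action: mark the object for format migration.
    obj.metadata["preservation_action"] = "migrate"


engine = CurationRuleEngine()
engine.register(
    Rule("migrate-old-format", "atmospheric", "format_obsolete", flag_for_migration)
)
obj = DigitalObject("obj-1", "atmospheric")
fired = engine.on_event("format_obsolete", obj)
```

Keeping the domain on both the rule and the object means a rule written for one discipline never fires on another discipline's data, which mirrors the "domain dependent" label in the diagram.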
Say What?
• What is the role of the library? The engineering librarian? The campus? The subject discipline?
• Libraries are creating content asset preservation systems: Trusted Digital Repositories. Fedora/Hydra/Archivematica at UIUC Library.
• Role for the science/engineering library: connecting data to literature.
• Knowledge creation process and libraries.
• GrIPs (Group Information Profiles).
• NSF Data Management Plans.
What Data should be Curated?
• Defining data curation: DataNet projects: Data Conservancy (Hopkins), DataONE (New Mexico).
• Purdue profiles.
• Raw data and processed data.
• We surveyed several groups in specific disciplines.
– Atmospheric Sciences (experimental)
– Biophysics (simulation data).
Atmospheric Science: Experimental Data
• Five levels and two data streams:
– Level 1: raw voltages from an instrument
– Level 2: calibrated data derived from raw voltages
– Level 3: image products displaying the data
– Level 4: derived parameters, statistics, etc. from calibrated data
– Level 5: analysis of Level 4 data that winds up in papers, publications, etc.
• Two other necessary data streams: ancillary instrument information and metadata.
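The five levels above form a derivation chain, which a repository could record so that any published product can be traced back to its raw source. The sketch below is one possible representation; the class names, the `lineage` helper, and the sample file names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    RAW_VOLTAGES = 1        # raw voltages from an instrument
    CALIBRATED = 2          # calibrated data derived from raw voltages
    IMAGE_PRODUCTS = 3      # image products displaying the data
    DERIVED_PARAMETERS = 4  # parameters, statistics from calibrated data
    PUBLISHED_ANALYSIS = 5  # analysis of Level 4 data used in publications


@dataclass
class DataProduct:
    name: str
    level: Level
    derived_from: Optional["DataProduct"] = None
    instrument_info: Optional[str] = None  # ancillary instrument stream
    metadata: Optional[dict] = None        # metadata stream


def lineage(product):
    """Walk derived_from links back to the source; return (level, name) pairs, oldest first."""
    chain = []
    while product is not None:
        chain.append((int(product.level), product.name))
        product = product.derived_from
    return list(reversed(chain))


# Hypothetical example products
raw = DataProduct("radar_volts.dat", Level.RAW_VOLTAGES, instrument_info="radar A")
cal = DataProduct("radar_cal.nc", Level.CALIBRATED, derived_from=raw)
stats = DataProduct("storm_stats.csv", Level.DERIVED_PARAMETERS, derived_from=cal)
```

Here `lineage(stats)` walks from the Level 4 product back through the calibrated data to the raw voltages, which is the kind of digital object relationship the curation system would need to preserve.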
Biophysics: Simulation Data
• Modeling of interactions of atomic-level molecular data.
• Three levels:
– Level 1: raw data from simulation run: positions and velocities of particles; software widely used.
– Level 2: various raw data extracts of subsets of particles run data.
– Level 3: visualization files (movie, images); analysis products generated from the visualization data for publication.
• Also necessary are input parameters (starting coordinates, etc.) and other metadata.
Data Management Plan
• The Data Management Plan (DMP) is a new mandatory NSF supplementary document for all research proposals.
– http://www.nsf.gov/bfa/dias/policy/dmp.jsp
• Each directorate, including the Engineering Directorate (ENG), is providing specific directions and required elements.
• The ENG document: http://nsf.gov/eng/general/ENG_DMP_Policy.pdf
Data Management Plan
• The digital data to be archived includes analyzed data (typically data that will go into articles and papers) and the metadata that defines the data that was generated.
• For Engineering Directorate grants, raw data from sensors or other instruments is not required to be archived.
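The archiving rule stated in these two bullets could be encoded as a simple check at deposit time. This sketch covers only what the slide states for ENG grants; the function name and the category strings `"analyzed"`, `"metadata"`, and `"raw"` are hypothetical labels, and anything the slide does not address raises an error rather than guessing.

```python
def required_for_archive(kind, directorate="ENG"):
    """Return True if the slide's stated rule requires archiving this kind of data.

    Only encodes the two statements above; it is not a rendering of the
    full NSF or ENG policy text.
    """
    if kind in {"analyzed", "metadata"}:
        return True
    if kind == "raw" and directorate == "ENG":
        return False  # raw sensor/instrument data is not required for ENG grants
    raise ValueError(f"no rule stated for kind={kind!r}, directorate={directorate!r}")
```

Raising on unstated cases keeps the check from silently inventing policy for other directorates.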