Transcript
1. Best Practices Creating and Managing Research Data
   Presented by Sherry Lake [email protected]
   http://dmconsult.library.virginia.edu/
   [Figure: Data Life Cycle. Labels: Proposal Planning Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Discovery, Deposit, Data Archive, Re-Use, Re-Purpose]
2. Why Manage Your Data?
3. Best Practices for Creating Data
   1. Use Consistent Data Organization
   2. Use Standardized Naming, Codes and Formats
   3. Assign Descriptive File Names
   4. Perform Basic Quality Assurance / Quality Control
   5. Preserve Information - Use Scripted Languages
   6. Define Contents of Data Files; Create Documentation
   7. Use Consistent, Stable and Open File Formats
4. Spreadsheet Examples
5. Spreadsheets
6. Consistent Data Organization
   - Spreadsheets (such as those found in Excel) are sometimes a necessary evil
   - They allow shortcuts which will result in your data not being machine-readable
   - But there are some simple steps you can take to ensure that you are creating spreadsheets that are machine-readable and will withstand the test of time
7. Spreadsheets
8. Spreadsheet Problems?
9. Problems
   - Dates are not stored consistently
   - Values are labeled inconsistently
   - Data coding is inconsistent
   - Order of values is different
10. Problems
   - Confusion between numbers and text
   - Different types of data are stored in the same columns
   - The spreadsheet loses interpretability if it is sorted
11. How would you correct this file?
12. Spreadsheet Best Practices
   - Include a Header Line
     - 1st line (or record)
   - Label each Column with a short but descriptive name
     - Names should be unique
     - Use letters, numbers, or _ (underscore)
     - Do not include blank spaces or symbols (+ - & ^ *)
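The header rules above can be checked mechanically. The following is a minimal sketch, assuming the header row has already been read into a list of strings; the function name `check_header` and the message wording are illustrative, not part of the original slides.

```python
import re

# Rule from the slide: letters, numbers, or underscore only.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")

def check_header(names):
    """Return a list of problems found in a list of column names."""
    problems = []
    seen = set()
    for name in names:
        # fullmatch rejects spaces, symbols, and empty names.
        if not VALID_NAME.fullmatch(name):
            problems.append(f"invalid characters in column name: {name!r}")
        if name in seen:
            problems.append(f"duplicate column name: {name!r}")
        seen.add(name)
    return problems

if __name__ == "__main__":
    for problem in check_header(["site_id", "air temp", "precip", "precip"]):
        print(problem)
```

An empty result means the header passes all three naming rules; whether a name is "short but descriptive" still needs a human reader.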
13. Spreadsheet Best Practices
   - Columns of data should be consistent
     - Use the same naming convention for text data
   - Each line should be complete
   - Each line should have a unique identifier
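Completeness and identifier uniqueness can also be tested with a short script. This sketch assumes the data is CSV text with one record per line and that the caller names the identifier column; `check_rows` and the sample column names are hypothetical.

```python
import csv
import io

def check_rows(csv_text, id_column):
    """Report incomplete rows and repeated identifiers in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        # DictReader fills missing trailing fields with None;
        # surplus fields are stored under the key None and skipped here.
        empty = [k for k, v in row.items()
                 if k is not None and (v is None or v.strip() == "")]
        if empty:
            problems.append((line_no, "missing: " + ", ".join(empty)))
        row_id = row.get(id_column)
        if row_id in seen_ids:
            problems.append((line_no, f"duplicate id: {row_id}"))
        seen_ids.add(row_id)
    return problems

if __name__ == "__main__":
    sample = "id,site,temp\n1,A,20.5\n2,B,\n2,C,19.0\n"
    print(check_rows(sample, "id"))
```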
14. Spreadsheet Best Practices
   - Columns should include only a single kind of data
     - Text or string data
     - Integer numbers
     - Floating point or real numbers
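One way to spot a column that mixes kinds of data is to classify every cell and collect the distinct kinds. This is a rough sketch that relies on Python's own `int` and `float` parsing rules, which is an assumption; a real project may need stricter rules for its data.

```python
def classify(value):
    """Classify one cell as 'integer', 'float', or 'text'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"

def column_kinds(values):
    """Return the set of kinds found in a column, ignoring empty cells."""
    return {classify(v) for v in values if v.strip() != ""}

if __name__ == "__main__":
    print(column_kinds(["1.5", "2", "n/a"]))
```

A column that returns more than one kind, especially one that includes "text" alongside a numeric kind, is a candidate for cleanup.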
15. Use Naming Standards & Codes
   - Use commonly accepted label names that describe the contents (e.g., precip for precipitation)
   - Use consistent capitalization (e.g., not: temp, Temp, and TEMP in same file)
   - Standard codes
     - State Postal (VA, MA)
     - FIPS Codes for Counties and County Equivalent Entities (http://www.census.gov/geo/reference/codes/cou.html)
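Inconsistent capitalization is easy to miss by eye. The sketch below groups labels that differ only by case; the function name `case_variants` is an illustrative choice.

```python
from collections import defaultdict

def case_variants(labels):
    """Map each lowercased label to its spellings when more than one exists."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.lower()].add(label)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

if __name__ == "__main__":
    print(case_variants(["temp", "Temp", "TEMP", "precip"]))
```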
16. Use Standardized Formats
   - Use standardized formats for units
     - International System of Units (SI): http://physics.nist.gov/Pubs/SP330/sp330.pdf
   - ISO 8601 Standard for Date and Time
     - YYYY-MM-DDThh:mm:ss.sTZD
     - 2009-10-13T09:12:34.9Z
     - 2009-10-13T09:12:34.9+05:00
   - Spatial Coordinates for Latitude/Longitude
     - +/- DD.DDDDD
     - -78.476 (longitude)
     - +38.029 (latitude)
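As a minimal sketch of producing these formats in a script, Python's standard `datetime` module can emit ISO 8601 timestamps in the extended form (with hyphens and colons), and a format string can write signed decimal degrees. The helper names are assumptions for illustration. Note that Python writes UTC as `+00:00` rather than `Z`.

```python
from datetime import datetime, timedelta, timezone

def iso_timestamp(dt):
    """Format a time-zone-aware datetime as ISO 8601 with milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("timestamp needs a time zone")
    return dt.isoformat(timespec="milliseconds")

def format_coordinate(value):
    """Format decimal degrees with an explicit sign and 5 decimals."""
    return f"{value:+.5f}"

if __name__ == "__main__":
    utc = datetime(2009, 10, 13, 9, 12, 34, 900000, tzinfo=timezone.utc)
    print(iso_timestamp(utc))
    print(format_coordinate(-78.476), format_coordinate(38.029))
```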
17. File Names
18. File Names
   - Use descriptive names
   - Not too long; CamelCase
   - Try to include time
     - Date using YYYYMMDD
   - Use version numbers
   - Don't use spaces
     - May use - or _
   - Don't change default extensions
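Generating file names from a script keeps them consistent. The following sketch assumes one possible convention (project, description, date, two-digit version); the function `make_file_name` and the component order are illustrative, not prescribed by the slides.

```python
import re
from datetime import date

def make_file_name(project, description, when, version, extension):
    """Build a name such as Biodiv_heatExp_20050612_v02.csv."""
    parts = [project, description, when.strftime("%Y%m%d"), f"v{version:02d}"]
    name = "_".join(parts) + "." + extension
    # Allow only letters, digits, underscore, hyphen, and the dot.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", name):
        raise ValueError(f"unsupported characters in file name: {name!r}")
    return name

if __name__ == "__main__":
    print(make_file_name("Biodiv", "heatExp", date(2005, 6, 12), 2, "csv"))
```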
19. Organize Files Logically
   - Make sure your file system is logical and efficient
   - Example folders: Biodiversity, Lake, Grassland, Experiments, Field Work
   - Example files:
       Biodiv_H20_heatExp_2005_2008.csv
       Biodiv_H20_predatorExp_2001_2003.csv
       Biodiv_H20_planktonCount_start2001_active.csv
       Biodiv_H20_chla_profiles_2003.csv
   - File name components: Project Name, Location, Experiment Name, Date, File Format
20. Data Validation
   - Check for missing, impossible, anomalous values
     - Plotting
     - Mapping
   - Examine summary statistics
   - Verify data transfers from notebooks to digital files
   - Verify data conversion from one file format to another
   Source: Hook, et al. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
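A basic check for missing and impossible values, together with summary statistics, might look like the sketch below. The missing-value code `-999` and the plausible range passed in by the caller are assumptions for illustration; use the codes and limits that apply to your own data.

```python
import statistics

def qc_report(values, low, high, missing_code="-999"):
    """Summarize one numeric column whose cells arrive as strings."""
    # Row positions (0-based) that are blank or hold the missing-value code.
    missing = [i for i, v in enumerate(values) if v.strip() in ("", missing_code)]
    skip = set(missing)
    numbers = [(i, float(v)) for i, v in enumerate(values) if i not in skip]
    # Values outside the plausible range are flagged, not silently dropped.
    impossible = [i for i, x in numbers if not low <= x <= high]
    valid = [x for _, x in numbers if low <= x <= high]
    report = {"missing_rows": missing, "impossible_rows": impossible, "count": len(valid)}
    if valid:
        report.update(min=min(valid), max=max(valid), mean=statistics.fmean(valid))
    return report

if __name__ == "__main__":
    temps = ["12.0", "", "14.0", "-999", "250.0", "13.0"]
    print(qc_report(temps, low=-50, high=60))
```

The report lists row positions rather than fixing anything, so the flagged rows can be checked against the original notebooks.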
21. Data Manipulation
   - You will need to repeat reduction and analysis procedures many times
     - You need to have a workflow that recognizes this
     - Scripted languages can help capture the workflow
   - You could just document all steps by hand
     - After the 20th iteration through your data set, however, you may feel more fondly towards scripted languages
   - Learn the analytical tools of your field
     - Talk to colleagues, etc. and choose at least one tool to master
22. Preserve Information
   - Keep Original (Raw) File
     - Do not include transformations, interpolations, etc.
     - Consider making the raw data read-only
   - Run a Processing Script (R) on the raw file
   - Save the result as a new file
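The raw-file workflow above can be sketched in a few lines. The slide names R as the scripting language; this example uses Python instead, and the column names `temp_f` and `temp_c`, the file names, and the Fahrenheit-to-Celsius step are all hypothetical stand-ins for a real transformation.

```python
import csv
import os
import stat
import tempfile

def protect_raw(path):
    """Make the raw file read-only for owner, group, and others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def process(raw_path, out_path):
    """Read the raw CSV, add a derived column, and write a new file."""
    if os.path.abspath(raw_path) == os.path.abspath(out_path):
        raise ValueError("refusing to overwrite the raw file")
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # temp_f and temp_c are hypothetical column names.
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["temp_c"])
        writer.writeheader()
        for row in reader:
            row["temp_c"] = round((float(row["temp_f"]) - 32) * 5 / 9, 2)
            writer.writerow(row)

if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    raw = os.path.join(folder, "site_raw.csv")
    with open(raw, "w", newline="") as handle:
        handle.write("site,temp_f\nA,32\nB,212\n")
    protect_raw(raw)
    process(raw, os.path.join(folder, "site_processed.csv"))
```

Because every transformation lives in the script, the processed file can be regenerated from the untouched raw file at any time.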
23. Preserving: Scripted Notes
   - Use a scripted language to process data
     - R Statistical package (free, powerful)
     - SAS
     - MATLAB
   - Processing scripts record the processing
     - Steps are recorded in textual format
     - Can be easily revised and re-executed
     - Easy to document
   - GUI-based analysis may be easier, but is harder to reproduce
24. Data Documentation (Metadata)
   - Informal or formal methods to describe your data
   - Important if you want to reuse your own data in the future
   - Also necessary when sharing your data
25. Define Contents of Data Files
   - Create a Project Document File (Lab Notebook)
   - Details such as:
     - Names of data & analysis files associated with study
     - Definitions for data and codes (include missing value codes, names)
     - Units of measure (accuracy and precision)
     - Standards or instrument calibrations
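A data dictionary can itself be stored as a small, plain-text table. The sketch below is one possible layout; the variables, units, and missing-value code shown are hypothetical examples, and the field list is an assumption rather than a standard.

```python
import csv
import io

FIELDS = ["name", "description", "type", "units", "missing_code"]

# Hypothetical entries for illustration only.
DATA_DICTIONARY = [
    {"name": "site_id", "description": "Unique site identifier",
     "type": "text", "units": "", "missing_code": ""},
    {"name": "precip", "description": "Daily precipitation",
     "type": "float", "units": "mm", "missing_code": "-999"},
]

def write_dictionary(entries, handle):
    """Write the data dictionary as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)

def undocumented(columns, entries):
    """Return data-file columns that have no dictionary entry."""
    documented = {entry["name"] for entry in entries}
    return [column for column in columns if column not in documented]

if __name__ == "__main__":
    buffer = io.StringIO()
    write_dictionary(DATA_DICTIONARY, buffer)
    print(buffer.getvalue())
    print(undocumented(["site_id", "precip", "notes"], DATA_DICTIONARY))
```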
26. Data Dictionary Example
27. Data Dictionary Example
28. Data Documentation
   - Project Documentation
     - Context of data collection
     - Data collection methods
     - Structure, organization of data files
     - Data sources used
     - Data validation, quality assurance
     - Transformations of data from the raw data through analysis
     - Information on confidentiality, access and use conditions
   - Dataset Documentation
     - Variable names and descriptions
     - Explanation of codes and schemas used
     - Algorithms used to transform data
     - File format and software (including version) used
29. File Format Sustainability
   Types                 Examples
   Text                  ASCII, Word, PDF
   Numerical             ASCII, SPSS, STATA, Excel, Access, MySQL
   Multimedia            Jpeg, tiff, mpeg, quicktime
   Models                3D, statistical
   Software              Java, C, Fortran
   Domain-specific       FITS in astronomy, CIF in chemistry
   Instrument-specific   Olympus Confocal Microscope Data Format
30. Choosing File Formats
   - Accessible Data (in the future)
     - Non-proprietary (software formats)
     - Open, documented standard
     - Common, used by the research community
     - Standard representation (ASCII, Unicode)
     - Unencrypted & Uncompressed
31. Best Practices for Creating Data
   1. Use Consistent Data Organization
   2. Use Standardized Naming, Codes and Formats
   3. Assign Descriptive File Names
   4. Perform Basic Quality Assurance / Quality Control
   5. Preserve Information - Use Scripted Languages
   6. Define Contents of Data Files; Create Documentation
   7. Use Consistent, Stable and Open File Formats
32. Following these Best Practices:
   - Will improve the usability of the data by you or by others
   - Will make your data computer-ready
   - Will save you time
33. Research Life Cycle
   [Figure: Data Life Cycle. Labels: Proposal Planning Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Discovery, Deposit, Data Archive, Re-Use, Re-Purpose]
34. Managing Data in the Data Life Cycle
   - Choosing file formats
   - File naming conventions
   - Document all data details
   - Access control & security
   - Backup & storage
35. Data Security & Access Control
   - Network security
     - Keep confidential or sensitive data off internet servers or computers connected to the internet
   - Physical security
     - Access to buildings and rooms
   - Computer Systems & Files
     - Use passwords on files/system
     - Virus protection
36. Backup Your Data
   - Reduce the risk of damage or loss
   - Use multiple locations (here, near, far)
   - Create a backup schedule
   - Use a reliable backup medium
   - Test your backup system (i.e., test file recovery)
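Testing a backup means confirming that the copy can be read back and matches the original. The following is a minimal sketch that copies a file to several existing directories and compares SHA-256 checksums; the function names and the temporary folders standing in for the "here, near, far" locations are assumptions for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source, backup_dirs):
    """Copy source into each existing directory and compare checksums."""
    expected = sha256_of(source)
    results = {}
    for directory in backup_dirs:
        target = Path(directory) / Path(source).name
        shutil.copy2(source, target)
        # Re-read the copy from disk rather than trusting the copy call.
        results[str(target)] = sha256_of(target) == expected
    return results

if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    data = work / "data.csv"
    data.write_text("id,value\n1,2\n")
    # Temporary folders stand in for real backup locations.
    print(backup_and_verify(data, [tempfile.mkdtemp(), tempfile.mkdtemp()]))
```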
37. Storage & Backup
38. Sustainable Storage Lifespan of Storage Media:
http://www.crashplan.com/medialifespan/
39. Best Practices Bibliography
   - Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple guidelines for effective data management. Bulletin of the Ecological Society of America, 90(2), 205-214. http://dx.doi.org/10.1890/0012-9623-90.2.205
   - Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/.
   - Hook, L. A., Santhana Vannan, S. K., Beaty, T. W., Cook, R. B., & Wilson, B. E. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010.
40. Best Practices Bibliography (Cont.)
   - Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf.
   - Van den Eynden, V., Corti, L., Woollard, M., & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf.