Introduction to Data Management
Jan 01, 2016
Data Management
• Overview of research data – Joel Roselin, Office of Research Compliance and Training
• Data Storage and Retention – Danianne Mizzy, Engineering Librarian
• Data Sharing – Kathryn Pope, Center for Digital Research and Scholarship
Goals of research
• The primary goals of research are:
– To advance knowledge
– To improve life for people (or animals)
• Secondary goals of research:
– Career advancement
– Professional recognition
– Financial gain
When you conduct research…
• …You are entrusted with:
– Human subjects
– Animals
– Access to specialized materials and technology
  • Chemicals
  • Drugs
  • Machinery
  • Information (personal or confidential)
– Funding from government or industry
When you conduct research…
• Not everyone is granted the privilege to conduct research:
– Qualifications include:
  • Advanced degree (or enrollment in a degree program)
  • Position in a research institution
– Promise to:
  • Be responsible in the conduct of the research
  • Be responsible stewards of the research dollars and other resources
  • Share the results of the research for the good of society
When you conduct research…
• The privilege can be revoked for failing to fulfill professional responsibilities:
– Loss of funding
– Debarment
– Loss of position
What are data?
• What counts as data in your field?
What are data?
• What counts as data in your field?
– Subject data (humans or animals)
  • Blood cell counts
  • Observational
  • Survey responses
– Lab data
  • Test results
  • Assays
– Other data
  • Library information
  • Photographs
What are data?
True or False
In scientific research, only the information and observations that are made as part of scientific inquiry are considered data.
It’s ALL data
• False!
• Data are not only the information and observations made as part of scientific inquiry but also the materials, the means, and the products of that inquiry (sometimes called data sources).
• Examples:
  • Cell lines
  • Survey instruments
  • Associated software
  • Specimens
Everything is Data
Everything is data and data is everything!
Sensitive Data
• Some data are highly sensitive
– Protected Health Information (PHI), including insurance information
– Personal information such as Social Security numbers, financial data
• Inappropriate release of sensitive information can lead to harms:
– Privacy violations
– Identity theft
– Financial liability for the University
• Sensitive information is highly regulated and requires security, e.g., encryption
• University resources:
– HIPAA website
– IRB website
– Policy on Electronic Data Security Breach Reporting and Response
Takeaways
• Everything is data and data is everything!
• The PI has stewardship (control) of a project's data, with regard to publication and copyright.
Data Management & Retention
Danianne Mizzy
Engineering Librarian
Data Management & Retention
• Funder requirements
– Minimum or maximum?
– Even when retention is not required, you still need to consider and address long-term access
• Columbia Data Retention Policy
– Research data must be archived for a minimum of three years after the final project close-out, with original data retained wherever possible.
Relevant Policies
• CU Policies & Procedures
– Administrative Code of Conduct
– Statement of Ethical Conduct
– Faculty Handbook
– Sponsored Projects Handbook
– Clinical Research Handbook
– Electronic Information Resources Security
• Funder Requirements
Agency Retention Periods
• HIPAA – At least 6 years
• NIH – 3 years
• NSF – No fixed period: what constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management.
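As a rough illustration of how these minimums combine, the sketch below picks the longest applicable period and adds it to the close-out date. It assumes every period is counted from the final project close-out, which the slides state only for the Columbia policy; the function and dictionary names are hypothetical, and NSF is left out because it gives no fixed number of years.

```python
from datetime import date

# Minimum retention periods in years, as listed on the slides.
# Assumption: all are counted from the final project close-out date.
MIN_RETENTION_YEARS = {"HIPAA": 6, "NIH": 3, "Columbia": 3}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(closeout: date, applicable: list[str]) -> date:
    """Apply the longest of the applicable minimum retention periods."""
    years = max(MIN_RETENTION_YEARS[name] for name in applicable)
    return add_years(closeout, years)


if __name__ == "__main__":
    closeout = date(2016, 6, 30)
    print(earliest_disposal_date(closeout, ["NIH", "Columbia"]))  # 2019-06-30
    print(earliest_disposal_date(closeout, ["NIH", "HIPAA"]))     # 2022-06-30
```

These are minimums only; a funder or sponsor agreement may require longer retention, so treat the result as the earliest date to review, not a disposal order.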
Data Storage Planning
• Need to plan for entire life-cycle
• Establish a baseline and project the rate of growth for the duration of the project.
• Active
– Frequent additions & updates
• Archival
– In fixed form; only need periodic access
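One way to turn "baseline plus rate of growth" into a number is sketched below. The compound monthly growth model, the function name, and the example figures are all assumptions made for illustration; a project that adds a fixed amount of data per month would need a linear model instead.

```python
def projected_storage_gb(baseline_gb: float, monthly_growth_rate: float, months: int) -> float:
    """Project storage need assuming compound growth: baseline * (1 + rate) ** months."""
    if baseline_gb < 0 or months < 0:
        raise ValueError("baseline_gb and months must be non-negative")
    return baseline_gb * (1 + monthly_growth_rate) ** months


if __name__ == "__main__":
    # Hypothetical project: 200 GB today, growing 5% per month, for 3 years.
    need = projected_storage_gb(200, 0.05, 36)
    print(f"Plan for roughly {need:.0f} GB of active storage")
```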
Data Storage Considerations
• Size
• Retention period
• Privacy or security requirements?
• Sharing?
Data Storage Options at CU
Active (Working) Storage
CUIT
– 500 MB personal critical data
– Workgroup Space on Central
  • $400 per gigabyte per year with a minimum of a half gigabyte (500 MB)
– Research Computing Services
  • High Performance Cluster
  • For more information contact [email protected]
School & Departmental servers
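For budgeting, the workgroup-space price quoted above can be worked through as simple arithmetic. This sketch assumes the charge scales linearly with size above the half-gigabyte minimum and is not rounded to whole gigabytes; the slide does not say how billing is rounded, and the function name is hypothetical.

```python
PRICE_PER_GB_YEAR = 400.0  # dollars, from the slide
MINIMUM_GB = 0.5           # half-gigabyte minimum, from the slide


def workgroup_storage_cost(size_gb: float, years: int = 1) -> float:
    """Estimated cost in dollars, billing at least the minimum size each year."""
    billable_gb = max(size_gb, MINIMUM_GB)
    return billable_gb * PRICE_PER_GB_YEAR * years


if __name__ == "__main__":
    print(workgroup_storage_cost(0.2))     # 200.0 (minimum applies)
    print(workgroup_storage_cost(2.0, 3))  # 2400.0
```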
Data Storage Options at CU
Active Storage
Library
Center for Digital Research & Scholarship (CDRS)
– Consultation available
Data Storage Options at CU
Archival Storage
• Library – Academic Commons
Data Management Planning
• What file formats? Are they long-lived?
– Long-lived
– Non-proprietary
• Storage and backup strategy?
– Media: CDs and DVDs are not long-lived
• What project and data identifiers will be assigned?
• Naming conventions, file/directory structure
• Version control
• Is there a metadata scheme or other community standard for data sharing/integration?
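A naming convention is easiest to follow when file names are generated rather than typed. The sketch below shows one possible convention (project identifier, short description, ISO date, zero-padded version); the pattern, the function names, and the example project ID are hypothetical and are not a Columbia or funder standard.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lowercase and replace runs of non-alphanumeric characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def data_file_name(project_id: str, description: str, created: date,
                   version: int, extension: str) -> str:
    """Build '<project>_<description>_<YYYY-MM-DD>_vNN.<ext>'."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    ext = extension.lstrip(".").lower()
    return (f"{slugify(project_id)}_{slugify(description)}_"
            f"{created.isoformat()}_v{version:02d}.{ext}")


if __name__ == "__main__":
    # "LAB7" is a made-up project identifier.
    print(data_file_name("LAB7", "Survey Responses (raw)", date(2016, 1, 15), 3, ".CSV"))
    # lab7_survey-responses-raw_2016-01-15_v03.csv
```

ISO dates and zero-padded versions keep files in chronological order when sorted by name, which is why they are used here.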
CU Security Policy
• Individuals who access or control University electronic information resources must take appropriate and necessary measures to ensure the security, integrity, and protection of these resources, using appropriate physical and logical security measures.
Data Security and Data Integrity
• Unencrypted vs. Encrypted
– Keep passwords & keys on paper in a secure location and in an encrypted digital file
• Uncompressed vs. Compressed
Security - Physical
• Restrict access to computers, offices and storage media
• Store lab notebooks, samples in locked cabinets
• Only let trusted individuals troubleshoot computer problems
• Appropriate environmental controls
Security - Network
• Keep confidential and sensitive data on computers not connected to the Internet
• Keep virus protection up to date
• Don't send confidential data via e-mail or FTP (use encryption, if you must)
• Use passwords on files and computers
• Data disposition at end of retention period
Security – CU Encryption Options
CUIT
• BitLocker for removable storage devices
• Can purchase Guardian Hard Disk Encryption through CUIT
• Windows Encrypting File System (native)
• Apple – FileVault (native)
• WinZip / 7-Zip / TrueCrypt
• Savant Application Whitelist software
Back-ups
• Make 3 copies
– Original
– External/local
– External/remote – different geographic area
• Verify recovery is possible
– Checksum validation
– Test file restore after initial set-up
– Periodically thereafter
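Checksum validation can be done with nothing more than the Python standard library. The following is a minimal sketch, assuming SHA-256 is an acceptable checksum for your project; the function names and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True when the backup copy has the same SHA-256 checksum as the original."""
    return sha256_of(original) == sha256_of(backup)


if __name__ == "__main__":
    # Self-contained demo using a temporary directory and a made-up file.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "results.csv"
        original.write_text("sample,value\nA,1\n")
        backup = Path(tmp) / "results_backup.csv"
        shutil.copyfile(original, backup)
        print(backup_matches(original, backup))  # True
```

Recording the checksum alongside each archived file also lets you re-check the copies periodically, not only right after the backup is made.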
Data Back-up Options
• Hard Drive
• Tape Back-up
• Server
• Cloud Storage
– Amazon S3
– Subject Repository / Data Centers
  • (PubChem, Dryad, IRI/LDEO)
Metadata
Structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource.
3 main types:
• Descriptive
• Administrative
• Structural
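To make the three types concrete, here is a small sketch of a metadata record for one dataset, grouped by type. Every field name and value is invented for illustration and does not follow any particular metadata standard; a real project would map these onto the scheme its community uses.

```python
import json

REQUIRED_SECTIONS = ("descriptive", "administrative", "structural")

# Hypothetical record for a made-up dataset.
EXAMPLE_RECORD = {
    "descriptive": {      # what the data are and how to find them
        "title": "Survey responses, wave 1",
        "creator": "Example Lab",
        "keywords": ["survey", "wave 1"],
    },
    "administrative": {   # how the data are managed
        "created": "2016-01-15",
        "file_format": "text/csv",
        "access": "restricted",
    },
    "structural": {       # how the pieces fit together
        "files": ["responses.csv", "codebook.txt"],
        "codebook_for": {"responses.csv": "codebook.txt"},
    },
}


def missing_sections(record: dict) -> list[str]:
    """List the metadata types that are absent or empty in a record."""
    return [name for name in REQUIRED_SECTIONS if not record.get(name)]


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_RECORD, indent=2))
    print(missing_sections({"descriptive": {"title": "Untitled"}}))
    # ['administrative', 'structural']
```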
Major Research Metadata Standards
• Darwin Core (Biology)
• DDI (Data Documentation Initiative, for data sets in social and behavioral sciences)
• DIF (Directory Interchange Format, for scientific data sets)
• EML (Ecological Metadata Language)
• FGDC/CSDGM (geographic data)
• National Biological Information Infrastructure (NBII)
Other DMP elements
• Who in the research group will be responsible for data management?
• Are there tools or software needed to create/process/visualize the data?
Writing Data Management Plans
• Follow CU and funder policies and guidelines
• Can use CUL template as starting point
• Visit SCP web site for further information
http://scholcomm.columbia.edu/
Data Management Plans - NSF
1. TYPES of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. STANDARDS to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. ACCESS and sharing policies including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. Policies and provisions for RE-USE, re-distribution, and the production of derivatives
5. Plans for ARCHIVING data, samples, and other research products, and for preservation of access to them
6. OR justification why no plan is needed
Data Sharing Plan - NIH
1. Expected schedule for data sharing
2. Format of the final dataset
3. Documentation to be provided
4. Whether or not any analytic tools will be provided
5. Whether or not a data-sharing agreement will be required and, if so, a brief description of such an agreement
6. Mode of data sharing
Takeaways
• Create a plan to manage your research data before the project begins
• Follow the plan
• At the end of the project, securely archive data of long-term value
• Properly dispose of obsolete or sensitive data
• Guidance available from OVPR and Scholarly Communications Program
Sharing your data
Emerging practices
Why isn’t data sharing the norm?
• not common in many disciplines
• not recognized in promotion/tenure
• researcher gives up control of data
• worries about being scooped or misinterpreted
• time required to present data in usable format
• lack of infrastructure and standards
Sharing increasingly seen as valuable
“More and more often these days, a research project's success is measured not just by the publications it produces, but also by the data it makes available to the wider community.”
- Nature editorial, 9.10.09
“It is obvious that making data widely available is an essential element of scientific research.”
- Science editorial, 2.11.11
New need for openness
“Science has always been about open debate. But incidents such as the UEA email leaks have prompted the Royal Society to look at how open science really is. With the advent of the Internet, the public now expect a greater degree of transparency. The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say “trust me, I’m an expert.” … Science has to adapt.”
- Geoffrey Boulton, chair, Royal Society working group for study: Science as a public enterprise: opening up scientific information, 5.13.11
Sharing advances science
Sharing can help produce significant advances in research, as these projects have demonstrated.
• Human Genome Project
• NIH-funded Alzheimer’s study published in April 2011
• Sloan Digital Sky Survey
Sharing benefits researchers
Rewards of sharing may include:
• opportunities to do innovative research
• research with higher impact
• support for transparency in research
• recognition, reciprocity from colleagues
• more opportunities to preserve data
You may have to share
More funders are requiring it
The National Science Foundation now asks researchers requesting funding to show how they will share data.
• Grant applications must include a two-page data management plan.
• Data management and access plans will be evaluated “through the process of peer review and program management.”
You may have to share
More journals are requiring it
“…authors are required to make materials, data and associated protocols promptly available to readers…. Nature journals reserve the right to refuse publication in cases where authors do not provide adequate assurances that they can comply...”
What do you share?
NSF says data covered by its data management and sharing requirements will “be determined by the community of interest.”
This “may include, but is not limited to: data, publications, samples, physical collections, software and models.”
Some data are not shareable
Be aware of reasons you may NOT want to share your data:
• Data must be scrubbed of confidential information before sharing.
• You may be able to justify not sharing if your data include proprietary licenses or patentable items, are useful for further analyses, etc.
How and when do you share?
“How” depends on…
• the format of your data
• funder and publisher requirements
• any restrictions on your data
“When” depends on…
• customary embargo periods
• whether relevant guidelines specify the amount of time within which data must be shared
Guidelines from the NSF
Division of Earth Sciences (EAR)
Data should be provided at lowest possible cost.
Data may be made available via
• national data center
• widely available journal, book, or website
• institutional archives standard for discipline
• other EAR-specified repositories.
Data should be made available as soon as possible, but no later than two years after collection.
Online repositories
Repositories:
• are organized around institutions or subjects
• are often open access
• provide archival, not active, storage for digital data
• may offer:
  o long-term preservation and access
  o search engine optimization
  o permanent URL or DOI
Columbia’s repository
Academic Commons (AC) accepts data and other materials from Columbia faculty, students, and staff, and provides:
• a permanent URL
• secure replicated storage
• accurate metadata
• globally accessible repository
• option for contextual linking between data and published research results
Some subject-based repositories
• NASA’s space science mission repository
• Cryospheric data repository run by U of Colorado
• Macromolecular structural data repository run by international consortium
• NOAA’s marine data repository
• Biological activities of small molecules data repository run by NCBI at Nat’l Library of Medicine
More subject-based repositories
• Deep-sea core samples repository housed at LDEO
• Data repository for archeology and related disciplines run by nonprofit consortium
• Basic and applied biosciences data repository run by consortium of publishers
• Geodesy data repository run by university consortium
• Social science data repository run by consortium
Data licenses
• Copyright issues around data can be complex
• These groups offer “ready-made” licenses for data that help clarify any restrictions on reuse
Data sharing is here to stay
Initiatives are underway to:
• establish norms for sharing
• create sharing and preservation infrastructure
• establish standards for interoperability
• clarify copyright and licensing issues
Data Conservancy
Digital Curation Centre
Takeaways
• Data sharing requirements are being implemented by more funders and publishers.
• Norms and standards for sharing are not set and vary across disciplines.
• Be aware of sharing requirements and restrictions on your data.
• Find links to a variety of institutional and data repositories at http://scholcomm.columbia.edu/
Contacts
• Joel Roselin
  • Office of Research Compliance and Training
  • [email protected]
• Danianne Mizzy
  • Engineering Librarian
  • [email protected]
• Kathryn Pope
  • Center for Digital Research and Scholarship
  • [email protected]