Data accessibility and challenges
Jyoti Khadake
24th October 2016, EMBLABR workshop
Data life cycle
The life cycle of data depends on project aims and purpose.
• Planning / project design
• Finding / creating the data
• Extracting, transforming and loading
• Processing
• Analyzing data – information – publication
• Data associated with the study can be reused
Data access and data sharing
• What do you expect when we access data?
• What do you expect when we share data?
• These are two sides of the same coin
Open access data policy
• Data created from research are valuable resources that can be used and reused for future scientific and educational purposes. Sharing data facilitates new scientific inquiry, avoids duplicate data collection and provides real-life resources for education and training.
OR
• Publicly funded research data should be, as far as possible, openly available to the scientific community.
What does this achieve?
• Encourages scientific enquiry and debate
• Promotes innovation and potential new data uses
• New collaborations between users and creators of data
• Maximises transparency and accountability
• Enables scrutiny of research findings
• Encourages improvement and validation of research findings
• Reduces the cost of duplicating data collection
• Increases visibility of research
• Provides direct credit to the researcher
• Research outcomes for education and training
Encouraged by
• Research funders, under guidance from the OECD, have developed data sharing policies that allow researchers exclusive use of data for a limited time, with a mandate to publish at the end of the agreed period. This can be done via repositories or data centers. The funders also require a data management and sharing plan.
• Journals: data that form the basis of a publication need to be shared or deposited within an accessible database or repository.
• Initiatives like the DataCite registry assign unique digital object identifiers (DOIs) to research data, helping scientists make data discoverable, citable and traceable, so that research data as well as publications based on those data form part of the scientific output.
• Use of metadata-dependent URIs to identify and share data
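As a rough illustration of how a DOI makes a dataset resolvable and citable, the sketch below turns a DOI into a URL and builds a simple citation string. The https://doi.org/ resolver prefix is the standard one; the function names, the DOI and all citation values are made-up placeholders, and the citation layout (creator, year, title, publisher, identifier) is only one commonly used pattern, not a rule taken from the slides.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # standard DOI resolver prefix


def doi_to_url(doi: str) -> str:
    """Turn a bare or 'doi:'-prefixed DOI into a resolvable URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the directory indicator "10." and has a "/" suffix separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode anything unsafe in a URL, but keep the "/" separator.
    return DOI_RESOLVER + quote(doi, safe="/")


def cite_dataset(creator: str, year: int, title: str, publisher: str, doi: str) -> str:
    """Build a simple data citation (hypothetical layout for illustration)."""
    return f"{creator} ({year}). {title}. {publisher}. {doi_to_url(doi)}"


if __name__ == "__main__":
    # Placeholder values, not a real dataset.
    print(cite_dataset("Doe, J.", 2016, "Example dataset",
                       "Example Repository", "doi:10.1234/example.dataset"))
```

Keeping the DOI in the citation, rather than a project web address, is what lets the reference survive when the data move to a different server.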
How to share / access data
• Specialist data centers, archives or data banks
• Journal to support publication
• Institutional repository
• Online via project or institutional website
• Informally between researchers on a peer-to-peer basis
URI identifies data
Advantages of deposi2ng data with data center or repository
• Assurance that data meet set standards
• Long-term preservation in standardised, accessible data formats, with format conversion when software is upgraded
• Safe keeping with attribution in a secure environment
• Regular data backup
• Online resource discovery through catalogues
• Access in popular formats
• Licensing arrangement to acknowledge data rights
• Standardised citation mechanism to acknowledge data ownership
• Promotion of data to many users
• Monitoring secondary usage of data
• Management of access to data and user queries on behalf of the data owner
What affects Sharing/Accessing data
• Size of data and compute
• Community-developed data standards
• Existing repositories or storage facilities
• Nature of data
• Appropriate data tracking and governance
• Key management points
• Metadata
Size of data
Decides what kind of storage/archival is used
Cloud storage
• OK for data that does not go into terabytes or does not have restrictions
• Cost implications
• Available as DaaS, SaaS, PaaS, IaaS
Static storage: cluster-based computing/storage
• Geographical restrictions
• Provides compute for analysis, since big data does not move
• Good access control?
Compute for analysis
• Once there is data access, a decision needs to be made on how much compute is required for analysis.
• Cloud-based solutions are available for small-scale data
• Data centers like Aimes allow for compute on clusters
• Institute/repository may provide HPC as well as software for analysis
Community developed data standards
An active collaborative community is essential for the development of community standards.
The standards are required for:
• format/s for data storage/exchange
• vocabulary for data representation
Absence of community standards?
Catalogues can be found at:
http://www.ebi.ac.uk/ols/index
http://bioportal.bioontology.org/
Existing data repositories/storage
• Topic-specific repositories will give maximum exposure to the data / access to relevant data
• Issue with multiple repositories – collaborative approaches to repositories, e.g. RCSB for structure data
• Absence of repositories?
• http://datacite.org/repolist
• http://databib.org
Nature of data
• This decides whether the data can be open access or controlled access.
• There may be further geographical restrictions on the data.
• If controlled access is required, there is a need for development of Data Access Agreements & Application Forms.
• Management of the access control
Approaches to secure access
• DAC-controlled access, with or without monitoring
• Highly controlled access where only analysis results can be taken away - DataSHIELD
Roles and responsibilities
Particularly important where sensitive data, personal data or patent data are involved.
Appropriate consents and ethics need to be in place.
Sometimes only processed, anonymized data can be used.
• Requires the establishment of DAC and MC
– Manages applications
– Approves applications
– Manages access
– Manages destruction of data if required
Data management planning
• Plan ahead to create high-quality and sustainable data that can be shared
• This will need checking periodically to see that the plan still meets requirements
Available resources:
https://dmponline.dcc.ac.uk
http://www.mrc.ac.uk/documents/doc/data-management-plan-template/
Metadata
• What is metadata?
– Documentation and description associated with data
– Required to make sense of the data, e.g. description of variables, classification scheme, dates and project
There are metadata standards, e.g. Dublin Core, Darwin Core, OECD minimal data set, AGROVOC
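To make the idea of a metadata standard concrete, here is a minimal sketch of a dataset description using the fifteen Dublin Core element names. Dublin Core itself leaves every element optional, so the REQUIRED_HERE set is a local assumption for illustration, and the record values (including the identifier) are placeholders.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Assumption: a project-level rule, not something Dublin Core mandates.
REQUIRED_HERE = {"title", "creator", "date", "identifier"}


def check_record(record: dict) -> list:
    """Return a list of problems found in a metadata record (empty if none)."""
    problems = [f"unknown element: {key}"
                for key in sorted(record) if key not in DC_ELEMENTS]
    problems += [f"missing element: {key}"
                 for key in sorted(REQUIRED_HERE) if not record.get(key)]
    return problems


# Placeholder record describing a hypothetical dataset.
example_record = {
    "title": "Example growth measurements",
    "creator": "Doe, J.",
    "date": "2016-10-24",
    "format": "text/csv",
    "identifier": "doi:10.1234/example.dataset",
    "description": "Variables: plant_id, day, height_cm",
    "rights": "CC BY 4.0",
}

if __name__ == "__main__":
    print(check_record(example_record))
    print(json.dumps(example_record, indent=2, sort_keys=True))
```

Storing such a record next to the data file, in an open format like JSON, keeps the description of variables and dates with the data when it is shared.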
Formatting your data
• Different formats are good for different purposes
• Open formats adopted by the community are more sustainable, e.g. rtf, tif, wav, xml, csv
• Proprietary and/or compressed formats that have widespread use, e.g. doc, jpg, mp3, gzip
• Organising files and folders
• Quality assurance
• Version control and authenticity
• Transcription
Available resources
Storing your data
• Keep your digital data safe, secure and recoverable
• Make at least 2 backups
• Institutional back-up policies
• Manage backups: snapshots, integrity, recoverability
• Data storage strategy
• Data security
• Security of personal data
• Data destruction / disposal
• Data transmission and encryption
• File sharing and collaborative environment
- email, dropbox, ftp, encrypted media, file store, VREs ..
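One way to manage the integrity and recoverability of backups is to record a checksum for every file and later compare each backup copy against that record. The sketch below does this with SHA-256 from the Python standard library; the function names and the manifest layout are assumptions for illustration, not an institutional procedure.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_backup(backup_root, manifest):
    """List files that are missing or altered in a backup copy."""
    backup_root = Path(backup_root)
    problems = []
    for relative_path, expected in manifest.items():
        copy = backup_root / relative_path
        if not copy.is_file():
            problems.append(f"missing: {relative_path}")
        elif sha256_of(copy) != expected:
            problems.append(f"corrupted: {relative_path}")
    return problems
```

A typical use would be to call build_manifest on the master copy once, store the result with the data, and run verify_backup against each backup location on a schedule; an empty list means every file is present and unchanged.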
Institutional backup/storage
Institutes are required to provide storage of data. Make sure you allocate funds for this when you write the proposal.
[Figure: data life cycle diagram. Labels: Planning, Generating, Reliability, Ownership, Metadata, Versioning, Standardisation, Quality, Publishing, Archiving, Destroy]
Resources for archiving data
• Dryad: an international repository of data underlying peer-reviewed articles in the basic and applied biosciences.
• The Dataverse Network: an open source application to publish, share, reference, extract and analyze research data. (Harvard)
Destroy data
• Physical destruction
• Overwriting
• Demagnetising the storage
• Disc destruction
• Purging the printers and other devices
Best Practices
• Make a DMP
• Use standard vocabulary
• Standardised format
• Check institutional policy for data storage and exchange
• Check funder's policy for data exchange
• Check legal constraints and requirements
• Make data available under a DAA
• Written policy for retention and disposal of data
• Safe and secure sharing of data
Strategies for centers
• Provide a management framework for researchers
Some sources are:
• UK Data Archive
• Boston University
• Melbourne Data Curation Center