Research Data Planning ...for the Sciences, MSGR UpSkills Program, Jeff Christiansen & Steve Bennett, 13 September 2012

UpSkills: Research Data Management for the Sciences

Jan 18, 2015

stevage

A 2 hour introductory session presented to PhD students at the University of Melbourne, 13 September 2012.

Given by Steve Bennett (VeRSI) and Jeff Christiansen (ANDS).
Transcript
Page 1: UpSkills: Research Data Management for the Sciences

Research Data Planning ...for the Sciences

MSGR UpSkills Program

Jeff Christiansen & Steve Bennett

13 September 2012

Page 2: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Bonus: start work on a data management plan!

Page 3: UpSkills: Research Data Management for the Sciences

Intro – who we are

Dr Jeff Christiansen [email protected]

Australian National Data Service
Previously researcher in molecular genetics

Steve Bennett: [email protected]

Victorian e-Research Strategic Initiative
Helps researchers with systems for digital data

Page 4: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Page 5: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

BSc (Hons)

Experiment 1

Experiment 2

?

Page 6: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 7: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

CCACGCGTCCGGTGTGAGCTCTCCTTCAGCTGCTGCAGGCATTACACTCAGCTCTGCTGT CCAAGCTGCTCATGTGATTGCCCTCTAATCCATTCAGGCAAAGTGAGCTAGACTTGTTTA AGCTGCAGGTCTTATTTTGATTGTAGCAGGCTAGTGAACAGTCACAGAAGTGGTTCAAGT ATTGTGCCCCTTGGAGCTGTTATCTTTGAAAATGTGGCCGTGGCTGGAAAAGGATGCATC TGCACCAATGGCACAGTGACCAGCCAGTTGCTTAGGGGCTTAGCTGGTGGATTTGGACCT GTCTTCTGCAACCTGGGGAAAGCATAATCTACTGTGTTATTTGATAATGGAAGCGCCGTG ATCAGATCCATCCCTCTGCTTTGAATTTTCAAACAAATAATCAAGAATTTGGCTCGTGTT AAAAAAAAAAAAAAAA

Page 8: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 9: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 10: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 11: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 12: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

PhD

Page 13: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

Postdoc

Page 14: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

EMAGE Database Project Manager

Page 15: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

EMAGE Database Project Manager

Page 16: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

EMAGE Database Project Manager

Page 17: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

EMAGE Database Project Manager

Page 18: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

EMAGE Database Project Manager

Cross-database queries need to use appropriate descriptors, not just free text, e.g. gene name identifiers.
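To illustrate why shared identifiers matter for cross-database queries, here is a minimal sketch. The two "databases", their field names, and every ID, label and value in them are made up for illustration; they do not come from EMAGE or any real resource.

```python
# Hypothetical records from two databases describing the same two genes.
# All identifiers, labels and values below are invented for illustration.
expression_db = [
    {"gene_id": "GENE:0001", "label": "ProteinABC", "assay": "in situ"},
    {"gene_id": "GENE:0002", "label": "Protein XYZ", "assay": "in situ"},
]
sequence_db = [
    {"gene_id": "GENE:0001", "label": "protein abc (ABC1)", "length_bp": 1200},
    {"gene_id": "GENE:0002", "label": "XYZ", "length_bp": 800},
]


def join_on(key, left, right):
    """Pair up records from two lists whose values for `key` match exactly."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


# Free-text labels differ between the databases, so nothing matches.
by_label = join_on("label", expression_db, sequence_db)

# A shared identifier links every record to its counterpart.
by_id = join_on("gene_id", expression_db, sequence_db)
```

Joining on the free-text label finds no pairs, while joining on the shared identifier finds both.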

Page 19: UpSkills: Research Data Management for the Sciences

Becoming aware of data management in research

Being organised, having systems in place and adopting community standards are all helpful in data management.

Think about what you will be required to do when publishing.

There are obligations for having data available for others post publication.

It’s useful to have your data organised so you can collaborate with others easily.

What will happen to your data when you leave the lab? Your supervisor would like to know what’s what/where.

Page 20: UpSkills: Research Data Management for the Sciences

Data Planning & Managing: Motivators

#1 Meet your obligations
  legal, ethical, funding requirements; uni, department, group policies
  Find out now – avoid hassle later (ask [email protected])

#2 Make your life easier
  a data management system to make your research work
  a data management plan to save time
  keeping data, finding stuff again, labelling, security
  sharing & collaborating

#3 Helping your career
  being a professional researcher
  data – your assets and records – finding, understanding data in years to come
  contributing to global research community
  manage your data now, help your future self.

Page 21: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Ask: [email protected]

Page 22: UpSkills: Research Data Management for the Sciences

What is data?

Observational data: sensor readings, telemetry (non-reproducible)

Experimental data: gene sequences, chromatograms (reproducible, but expensive)

Simulation data: climate models (model the most important thing)

Derived/compiled data: compiled database (reproducible but expensive)

Page 23: UpSkills: Research Data Management for the Sciences

What else is data?

Social sciences: surveys, statistical data

Humanities: cultural artefacts (video, photos, sound…)

Physical samples: soil, biological, water, archeological…

Does anyone here not have data?

Page 24: UpSkills: Research Data Management for the Sciences

The University’s definitions

Research Data
  laboratory notebooks; field notebooks; primary research data (hardcopy or in computer); questionnaires; audiotapes; videotapes; models; photographs; films; test responses; slides; artefacts; specimens; samples

Research Records
  Includes correspondence (electronic mail and paper-based correspondence); project files; grant applications; ethics applications; technical reports; research reports; master lists; signed consent forms; and information sheets for research participants

Administrative Records (Research Office, Central Records)
  Includes contracts and agreements, patents, licences, grants, intellectual property and trademarks, policies, ethics, research project files, reports, publications

What is often included as “Research Data”:
  = data + records + copies (physical & digital)
  = stuff you used and/or created

Page 25: UpSkills: Research Data Management for the Sciences

Group activity (15 mins)

Form groups of similar discipline
  Earth sciences/forestry/botany/agriculture
  Health/medical biology/physio/social work
  Engineering/computer science/linguistics

Discuss:
  What kind of data do you collect?
  How do you get it?

Your data management checklist: Section 1.1

Page 26: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Page 27: UpSkills: Research Data Management for the Sciences

Research trends

Research Data is increasing in size
  Protein crystallography: 100 GB/experiment
  Gene sequencing: 1,000 GB/day
  High-energy physics: 10,000,000s GB/year
  Astronomy (SKA): 1,000,000,000 GB/day

Research Collaborations are increasing
  Human Genome project (1990-2003): 113 people, 20 orgs
  Belle collaboration (1994-..): ~370 people, 60 inst., 14 countries
  ATLAS collaboration @ LHC CERN (1994-2020+): ~2500 people, 169 inst., 37 countries

Research Data is increasingly digital
  Wonderful opportunities for reuse, sharing, collaboration, analysis
  Data science (4th paradigm)
  “eResearch”!
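To get a feel for the data rates quoted above, a small sketch that converts them into larger units. It assumes decimal units (1 TB = 1,000 GB) and a 365-day year; the helper names are illustrative only.

```python
# Sizes are given in decimal gigabytes; each step up the list is a factor of 1,000.
UNITS = ["GB", "TB", "PB", "EB", "ZB"]


def human_size(gigabytes):
    """Express a size given in GB using the largest unit that keeps the value >= 1."""
    value = float(gigabytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


def per_year(gb_per_day, days=365):
    """Scale a daily rate in GB up to a yearly total in GB (assumes 365 days)."""
    return gb_per_day * days


print(human_size(1_000))                # gene sequencing, per day
print(human_size(per_year(1_000)))      # gene sequencing, per year
print(human_size(1_000_000_000))        # SKA figure from the slide, per day
```

Under these assumptions the gene sequencing figure works out to 1 TB a day, or 365 TB a year, and the SKA figure to 1 EB a day.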

Page 28: UpSkills: Research Data Management for the Sciences

Research trends

Large scale data intensive science
  “A totally new way of doing research”
  New research methods, new skills, therefore new training needed

New skills...
  Specialists – in both technology and research
  Informatics – dealing with data from collection through analysis
  Data Management and Planning – collecting, maintaining, sharing data
  Everyone!

Page 29: UpSkills: Research Data Management for the Sciences

How big?

1 MB (spreadsheets)
10 GB (numerical, video)
1 TB (simulations, synchrotron)
1 PB

Limit of Google Drive, DropBox…

Easy? (Probably already solved)
Awkward
Easy!

Page 30: UpSkills: Research Data Management for the Sciences

Where to keep it?

Possibilities:
  Research group storage: Ask!
  Local computer: Backups crucial. Sharing hard. Disaster looms.
  Cloud (Dropbox, Google Drive): Check security, legals. How to archive?

Ask [email protected]

Page 31: UpSkills: Research Data Management for the Sciences

Sharing

Page 32: UpSkills: Research Data Management for the Sciences

Page 33: UpSkills: Research Data Management for the Sciences

Group activity #2 (15 mins)

Discuss
  How much data will you have?
  Where will you store it?
  What data formats?

Data management checklist
  Complete section 2.3 & 2.4
  If non-digital: 2.1, 2.2

Page 34: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Page 35: UpSkills: Research Data Management for the Sciences

In collaborations, get IP right early. Find out:
  Does the University own your data?
  Can you still share it?
  Restrictions?
  Licences?

Page 36: UpSkills: Research Data Management for the Sciences

IP – who claims to own it

Copyright – who has legal backing (not all data can be copyrighted)

Ethics – more rules you agreed to
  Must you keep the data private?
  Must you share it?

Privacy – can you de-identify the data?
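One common first step towards de-identifying data is to replace names with stable codes and drop direct contact details. The sketch below shows that step only. The field names (`name`, `email`, `participant_code`) and the record layout are assumptions, and swapping identifiers for codes is not by itself full de-identification: the remaining fields may still identify someone, so check what your ethics approval requires.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable code from an identifier using a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same code, but the code
    cannot be recomputed without the key, so keep the key separate from the data.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def deidentify(records, secret_key, id_field="name", drop_fields=("email",)):
    """Return copies of the records with direct identifiers removed or replaced."""
    cleaned = []
    for record in records:
        row = {
            key: value
            for key, value in record.items()
            if key != id_field and key not in drop_fields
        }
        row["participant_code"] = pseudonymise(record[id_field], secret_key)
        cleaned.append(row)
    return cleaned
```

A keyed hash is used rather than a plain hash so that someone who knows a participant's name cannot simply recompute their code.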

Page 37: UpSkills: Research Data Management for the Sciences

Group activity #3 (15 mins)

Discuss
  Who owns your data?
  What data can you share? With whom?
  How will you protect confidential information?

Data management checklist
  Complete section 1.3

Page 38: UpSkills: Research Data Management for the Sciences

Why data management
What data
Where you store it
Who owns it
How you manage it

Page 39: UpSkills: Research Data Management for the Sciences

University Code of Conduct for Research

Page 40: UpSkills: Research Data Management for the Sciences

University Policy on Management of Research Data and Records

Page 41: UpSkills: Research Data Management for the Sciences

Starting your system

Consider your goals – what do you want to get out of managing your data?
Figure out your criteria for keeping data
Picture your data three years from now
Consider the metadata you want to collect to document your datasets

Page 42: UpSkills: Research Data Management for the Sciences

Benefits

Find your data 3 years from now
Get more papers out of your data
Save time and stress – get organised
Share with collaborators
Some journals require data submission

Page 43: UpSkills: Research Data Management for the Sciences

Being more professional...

Not rocket science! Stop and think about what data you have, what you’re doing, what you should be doing

Some scary facts:
  Microfilm, non-acidic paper last 100+ years
  magnetic media lasts 10+ years
  optical media lasts 20+ years
  2-10% of hard drives fail every year
  software & hardware can outdate quickly

Scary stories:
  US study: 100’s charges “research misconduct”. 40% avoided by better data management!
  UniMelb: ~20 cases research misconduct 2008. Most involved students. All needed good records!
  Climategate scandal, UK – FOI

Proper Planning & Management is needed!!!

Image: Burroughs 1977 – B 9495 Magnetic Tape Subsystem
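What does "2-10% of hard drives fail every year" mean over the length of a PhD? A rough sketch, assuming the annual failure rate is constant, that drives fail independently of each other, and that a failed copy is never replaced. None of these assumptions come from the slides, and real drives do not behave this neatly.

```python
def p_fail_within(years, annual_rate):
    """Chance that a single drive fails at least once within `years`."""
    return 1 - (1 - annual_rate) ** years


def p_all_copies_lost(years, annual_rate, copies):
    """Chance that every one of `copies` independent drives fails within `years`."""
    return p_fail_within(years, annual_rate) ** copies


for rate in (0.02, 0.10):
    one = p_fail_within(3, rate)
    two = p_all_copies_lost(3, rate, copies=2)
    print(f"annual rate {rate:.0%}: one drive {one:.1%}, two copies {two:.2%}")
```

Over three years a single drive at the 10% rate has roughly a 27% chance of failing. Under the same assumptions, a second independent copy cuts the chance of losing everything to about 7%.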

Page 44: UpSkills: Research Data Management for the Sciences

High level view

Your data management system needs to cover:

Create, Capture, Describe

(Use, Transform, Update)

Store, Secure, Preserve

Keep, Transfer, Destroy

(National Archives)

Page 45: UpSkills: Research Data Management for the Sciences

A simple Data Management System

Identify key data in your context, important stuff to keep (your Data Assets)
Find secure places to keep physical & digital Records + Data (filing cabinet, department shared drive) – backups are essential
Where and when should there be checks on your data (sanity checks, quality control, standards)
File your data and records into logical divisions, say activities, projects, or pieces of work
  eg. folders /DeptShare/johnsmith/Records/ProteinABC Investigation
  Don’t break things down too much, makes things harder to find!
Have a consistent file naming convention:
  perhaps: ActivityOrContents-LocationOrPerson-CreateDate-Id-Description.ext
  eg. “ProteinABC-LJW-20100409-0001 Raw data from instrument.dat”
Keep good metadata (notes, records) on how you captured your data, particularly for physical records
  Descriptions of collections or files – Structured text files good enough
    eg. FileOrCollectionName-metadata.txt
  On other things, entities that are not files – Structured text files or spreadsheets
    Have a good labeling/ID/coding system
    Perhaps keep a registry (spreadsheet will do; IDs, names, location, basic metadata)
Find the right balance in digitising physical stuff (easy and quick)
  Digital is easy to keep/transfer/search if stored properly. However, digitising/scanning everything can be time consuming and without good descriptions may not be useful.
  Link digital notes/metadata to physical stuff (IDs, names, labels, codes, location)
  Have some basic digital representations or notes of important physical stuff
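A naming convention is easier to keep consistent if a small script builds and checks the names. The sketch below follows the example file name on the slide, and it makes a few assumptions: the date is written YYYYMMDD, the Id is a four-digit counter, and a space (as in the slide's example) separates the Id from the description. The function names are illustrative only.

```python
import re
from datetime import date

# Matches names like "ProteinABC-LJW-20100409-0001 Raw data from instrument.dat".
NAME_PATTERN = re.compile(
    r"^(?P<activity>[^-]+)-(?P<who>[^-]+)-(?P<created>\d{8})-(?P<serial>\d{4})"
    r" (?P<description>.+)\.(?P<ext>[^.]+)$"
)


def build_name(activity, who, created, serial, description, ext):
    """Assemble a file name from its parts; `created` is a datetime.date."""
    return f"{activity}-{who}-{created:%Y%m%d}-{serial:04d} {description}.{ext}"


def parse_name(filename):
    """Split a file name back into its parts, or raise if it breaks the convention."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not in the agreed naming convention: {filename!r}")
    return match.groupdict()


def metadata_name(filename):
    """Name of the notes file for a data file: FileOrCollectionName-metadata.txt."""
    stem = filename.rsplit(".", 1)[0]
    return f"{stem}-metadata.txt"


name = build_name("ProteinABC", "LJW", date(2010, 4, 9), 1,
                  "Raw data from instrument", "dat")
print(name)
print(parse_name(name)["created"])
print(metadata_name(name))
```

Running `parse_name` over a whole folder is a quick way to spot files that have drifted from the convention.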

Page 46: UpSkills: Research Data Management for the Sciences

Free Tools

jEdit – text file editor (private notes, metadata and records)
local disk + file share + Cobian Backup (private project records, data)
Google Desktop (file and email search)
Zotero (reference material) (EndNote is Uni default)
EVO & Skype & Google chat (video/tele/chat communication)
  http://evo.arcs.org.au/
Sakai@Melbourne (project workspace)
  https://sakai.unimelb.edu.au/
Google docs + Sites (collaborative editing)
Google groups (email list)
research data storage, a tricky one…
  use local storage in preference, ask around
  DropBox, Google Drive, Microsoft SkyDrive, box.com…
too many others to list, heaps on the web…
  See Digital Research Tools (DiRT) wiki for a huge list
  http://digitalresearchtools.pbworks.com/

Check with your supervisor

See Info Skills classes on EndNote, UpSkills 29 June on VC

Page 47: UpSkills: Research Data Management for the Sciences

Data Security

2 aspects to security
  Safety from damage or loss: How important is the data to you?
  Safety from incorrect use: What are the possible consequences?

Safety from damage or loss (unintended and intentional)…
  What’s acceptable loss (safety can cost, use up time)
  Backups (data, software, system)
    How often (hourly, daily, weekly, monthly, manually, automated)?
    How many and where (onsite, offsite, both, multiple)?
    Departmental storage? Probably backed up already!
  Disaster Recovery
    Quality hardware, multiple/spare servers, spare disk drives, Operating System and Applications image backups

(talk with someone technical, your local IT guys)

Page 48: UpSkills: Research Data Management for the Sciences

Data Security

Safety from damage or loss (continued)…
  Make sure Backup is occurring
    Essential data and records... “Your Archive”
    Frequency should depend on how often your data changes
    Incremental backups are essential. Replication IS NOT SAFE!!!
    Keep some copies (one?) offsite.
    Database backups should use database tools (mysqldump, pg_dump etc.)
  Departmental storage is best... probably backed up already!
  Worst case... DIY, use external hard drives or remote storage
  Seek advice on software
    for Windows I use... Cobian Backup, DriveImage XML
    for Linux I use... rsync (see http://rsync.samba.org/examples.html )
    for Mac there is... Time Machine

(talk with someone technical, your local IT guys)
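Why incremental backups rather than replication? A replica mirrors mistakes: delete or corrupt a file and the copy follows. An incremental backup keeps earlier versions. The sketch below is a toy illustration of that idea using only the Python standard library. It is not a substitute for the tools named above, and the layout it uses (one folder per run plus a `manifest.json` of file hashes) is an assumption made for the example.

```python
import hashlib
import json
import shutil
from datetime import datetime
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, read in chunks so large files fit in memory."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()


def incremental_backup(source, backup_root, label=None):
    """Copy only new or changed files into a fresh folder under backup_root.

    Earlier runs are never touched, so old versions and deleted files stay
    recoverable. backup_root must live outside source.
    Returns the relative paths copied in this run.
    """
    source, backup_root = Path(source), Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    manifest_path = backup_root / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}

    label = label or datetime.now().strftime("%Y%m%dT%H%M%S")
    snapshot = backup_root / label
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source).as_posix()
        digest = file_digest(path)
        if manifest.get(relative) != digest:  # new file, or contents changed
            target = snapshot / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            manifest[relative] = digest
            copied.append(relative)
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return copied
```

Usage would be along the lines of `incremental_backup("/path/to/project", "/path/to/backups")`, with both paths being placeholders. Comparing hashes rather than timestamps costs a full read of every file, which is fine for a sketch but slow for large datasets.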

Page 49: UpSkills: Research Data Management for the Sciences

Data Security

Safety from incorrect use (unintended and malicious)…
  PCI DSS - a recommendation (Payment Card Industry Data Security Standard)
    eg. google for: “nacubo.org payment card data security”
    12 requirements that are good practice (first 10 are the basics)

10 IT basics…
  Firewall servers
  Do not use default usernames/passwords
  Physically protect stored data (lock up servers, disk, tape, source material)
  Use encrypted transmission over internet (VPN, SSL, SSH, GridFTP, S/MIME email)
  Update antivirus/antimalware software regularly
  Use secure and trusted applications
  Restrict access to sensitive data (tighter control, or put it somewhere else)
  Assign unique IDs for each user
  Record and monitor all access to data

Plus some good practice…
  Don’t retain sensitive data
  Or encrypt sensitive information

Page 50: UpSkills: Research Data Management for the Sciences

Read up!

Google: research data toolkit
http://researchdata.unimelb.edu.au
ANDS guides
To consider: identifiers, DOIs, archival, security, licensing, metadata formats, ontologies, controlled vocabularies, definition of “collection”, data reuse, metadata stores…!

Page 51: UpSkills: Research Data Management for the Sciences

Group activity #4 (15 mins)

Data management checklist
  Complete section 3.1

Page 52: UpSkills: Research Data Management for the Sciences

Questions?

[email protected]

researchdata.unimelb.edu.au

Copyright (c) 2012, VeRSI Consortium, Lyle Winton, Steve Bennett, Jeff Christiansen