Open Data in a Big Data World: easy to say, but hard to do? Sarah Callaghan [email protected] @sorcha_ni ORCID: 0000-0002-0517-1031 Geoffrey Boulton, Dominique Babini, Simon Hodson, Jianhui Li, Tshilidzi Marwala, Maria Musoke, Paul Uhlir, Sally Wyatt 3rd LEARN workshop on Research Data Management, “Make research data management policies work”, Helsinki, 28 June 2016
Hard copy of the Human Genome at the Wellcome Collection
Example Big Data: CMIP5
CMIP5: Fifth Coupled Model Intercomparison Project
• Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
• Aim:
– to address outstanding scientific questions that arose as part of the 4th Assessment Report process,
– to improve understanding of climate, and
– to provide estimates of future climate change that will be useful to those considering its possible consequences.
Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).
Simulations:
~90,000 years
~60 experiments
~20 modelling centres (from around the world) using
~30 major(*) model configurations
~2 million “atomic” output datasets
~tens of petabytes of output
~2 petabytes of CMIP5 requested output
~1 petabyte of CMIP5 “replicated” output, replicated at a number of sites (including ours)
A major international collaboration, funded by EU FP7 projects (IS-ENES2, Metafor), the US (ESG) and other national sources (e.g. NERC for the UK).
CMIP5 numbers
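To give a feel for the scale, a quick back-of-envelope calculation (mine, not from the slide deck) using the numbers above:

```python
# Back-of-envelope arithmetic from the CMIP5 figures quoted above:
# the "requested" archive divided by the number of atomic datasets
# gives a rough average size per dataset.
datasets = 2_000_000          # ~2 million "atomic" output datasets
requested_bytes = 2 * 10**15  # ~2 petabytes of requested output

avg_gb = requested_bytes / datasets / 10**9
print(f"average atomic dataset size: ~{avg_gb:.0f} GB")  # ~1 GB
```

At roughly a gigabyte per atomic dataset, moving and replicating even a fraction of the archive is a serious engineering problem in its own right.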
Big Data:
• Industrialised and standardised data and metadata production
• Large groups of people involved
• Established methods for making the data open, and for attribution and credit for data creation

Long Tail Data:
• Bespoke data and metadata creation methods
• Small groups/lone researchers
• No generally accepted methods for attribution and credit for data creation; often data is closed due to the lack of effort needed to open it
https://flic.kr/p/g1EHPR
Most people have an idea of what a publication is
Some examples of data (just from the Earth Sciences)
1. Time series, some still being updated e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
3. 2D scans e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
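Whatever the type, each dataset only becomes usable when it travels with its metadata. A minimal sketch (the field names and values are my own illustration, not a standard) of pairing a time series with the metadata needed to interpret and reuse it:

```python
# A hypothetical, minimal representation of example 1 above: a
# meteorological time series carrying its own descriptive metadata.
from dataclasses import dataclass, field

@dataclass
class TimeSeries:
    """One measured time series plus the metadata needed to reuse it."""
    variable: str             # what was measured
    units: str                # without units the numbers are meaningless
    station: str              # where it was measured
    times: list               # ISO 8601 timestamps
    values: list              # the observations themselves
    licence: str = "CC-BY"    # reuse terms: part of making data open

ts = TimeSeries(
    variable="air_temperature",
    units="degC",
    station="Helsinki-Vantaa",
    times=["2016-06-28T00:00Z", "2016-06-28T06:00Z"],
    values=[12.4, 15.1],
)
print(ts.variable, ts.units, len(ts.values))
```

The point is not the class itself but that variable name, units, location, and licence are inseparable from the numbers.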
The data providing the evidence for a published concept MUST be concurrently published, together with the metadata. To do otherwise is scientific MALPRACTICE.
Pre-clinical oncology – 89% not reproducible
Why?
• Misconduct/fraud
• Invalid reasoning
• Absent or inadequate data and/or metadata
We’re only going to get more data
More big data - linked data – machine learning
The internet of things
So, what must we do?
• Concurrently publish data and metadata that are the evidence for a published scientific claim – to do otherwise is malpractice
• Data science skills for researchers
• Re-establish standards of reproducibility for a data-intensive age
• Patterns not hitherto seen
• Unsuspected relationships
• Integrated analysis of diverse data (e.g. natural & social science)
• Complex systems
e.g. complexity: dynamic evolution and system state
But not all research is or needs to be data-intensive
Scientific Opportunities of Big Data
https://www.clickz.com/clickz/column/2389218/create-better-content-via-humor
http://www.tylervigen.com/spurious-correlations
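The spurious-correlations site makes the warning concrete; a toy demonstration (my own illustration, not from the slides) shows why: two completely independent random walks routinely exhibit large correlations, simply because both drift over time.

```python
# Two random walks generated independently of each other will often
# show a strong "correlation" - a pattern with no meaning behind it.
import random

def random_walk(n, seed):
    """Cumulative sum of Gaussian steps: a simple trending series."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        out.append(x)
    return out

def pearson(a, b):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

a = random_walk(200, seed=1)
b = random_walk(200, seed=2)   # generated with no reference to a
print(f"correlation of two unrelated walks: {pearson(a, b):.2f}")
```

This is exactly the trap of naive mining of big data: with enough series to compare, impressive-looking correlations are guaranteed.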
Caveat Emptor!
Data supporting a published claim Other data for re-use & integration
Pillars of the Digital Revolution
• Big Data: volume, velocity, variety, veracity
• Linked Data: many databases, semantic relations, deeper meaning
• Machine analysis & learning
Foundations: openness
The Open Data Edifice
Open Data initiatives in areas of:
Life sciences
Earth Science
Environmental Science
Food Science
Agricultural Science
Chemical Crystallography
Bioinformatics/Genomics
Linguistics
Social Sciences
Evolutionary biology
Biodiversity
Astronomy
Earth Observation (GEO)
Archaeology
Atmospheric sciences
EMBL-EBI services: labs around the world send us their data and we…
• Archive it
• Classify it
• Share it with other data providers
• Analyse, add value and integrate it
• …provide tools to help researchers use it
A collaborative enterprise: the Elixir programme
It is happening: bottom-up Open Data initiatives
The Open Data Iceberg
• The Technical Challenge
• The Consent Challenge
• The Institutional Challenge
• The Funding Challenge
• The Support Challenge
• The Skills Challenge
• The Incentives Challenge
• The Mindset Challenge
The challenges span technology, processes & organisation, and people.
Developed from: Deetjen, U., E. T. Meyer and R. Schroeder (2015). OECD Digital Economy Papers, No. 246, OECD Publishing.
A National Infrastructure
Scientists
i. Publicly funded scientists have a responsibility to contribute to the public good through the creation and communication of new knowledge, of which associated data are intrinsic parts. They should make such data openly available to others as soon as possible after their production in ways that permit them to be re-used and re-purposed.
ii. The data that provide evidence for published scientific claims should be made concurrently and publicly available in an intelligently open form. This should permit the logic of the link between data and claim to be rigorously scrutinised and the validity of the data to be tested by replication of experiments or observations. To the extent possible, data should be deposited in well-managed and trusted repositories with low access barriers.
From the Accord: Responsibilities
Open is not enough!
“When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.”
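By contrast with the satirical quote above, an intelligently open deposit pairs the data with a machine-readable metadata record and an integrity check. A minimal sketch (the filenames and field values are hypothetical placeholders, not a prescribed schema):

```python
# What the quote's author refuses to do, sketched: data with a
# human-readable description, declared format and units, a licence,
# and a checksum so re-users can verify the file they received.
import hashlib
import json

data = b"time,air_temperature_degC\n2016-06-28T00:00Z,12.4\n"

metadata = {
    "title": "Surface air temperature, Helsinki, 2016-06-28",
    "creator": "Example Researcher",            # placeholder value
    "format": "text/csv",
    "variables": {"air_temperature": "degC"},
    "licence": "CC-BY-4.0",
    "sha256": hashlib.sha256(data).hexdigest(),  # integrity check
}

print(json.dumps(metadata, indent=2))
```

A few lines of metadata like this are the difference between a reusable dataset and an anonymous blob on an FTP site.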