Open Science and Data sharing: the DataFirst experience Martin Wittenberg DataFirst 26 October 2017

Jan 22, 2018

Transcript
Page 1: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Open Science and Data sharing: the DataFirst experience

Martin Wittenberg
DataFirst

26 October 2017

Page 2: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Open Science

Overview
• Introduction
• Data and the research ecosystem
• The problem of measurement in the social sciences
• Difficulties with sharing data
• Why sharing data is essential
• The role of a data platform like DataFirst

Page 3: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Introduction
• I’m an economist trying to understand what has happened to South Africa since the end of apartheid
– Particularly in relation to wages, employment, inequality, service delivery
• Data and data quality are key

• I also direct DataFirst, which is an organisation based at UCT dedicated to making it easier for researchers to access social science microdata

• www.datafirst.uct.ac.za
• https://sites.google.com/site/martinwwittenberg/home

Page 4: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Data and the research ecosystem

• Data doesn’t just appear
• The value and meaning of data arise from how it emerges within the

Page 5: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Data and the research ecosystem

Theory
• e.g. how markets work

Application
• e.g. the impact of imposing a minimum wage in 2018

Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns

Page 6: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Measurement
• Sometimes for research purposes
• But also incidental to other purposes
– e.g. tax data, satellite “night light” data
• Understand context, rules and procedures used
– Sampling theory
– Measurement instrument (e.g. questionnaire)
– Fieldwork practice
– Post-fieldwork data capture & processing
– Imputations for missing values

Page 7: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Measurement in the social sciences

• Crucial to also understand what you are not seeing
– Non-response
• In the social sciences the subjects of research often have an interest in the outcome
– Choose what to report
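
A minimal sketch of why non-response matters, assuming (purely for illustration) that the highest earners are the ones who decline to answer. The incomes and response flags below are invented toy values, not figures from any survey or from this talk, and the function names are hypothetical.

```python
# Toy illustration only: incomes and response flags are invented, not survey data.
from statistics import mean

# (monthly income, responded?) -- assumption: the two highest earners refuse to answer
population = [
    (2000, True),
    (3000, True),
    (4000, True),
    (5000, True),
    (20000, False),
    (50000, False),
]


def true_mean(people):
    """Mean income over everyone, which a survey never observes directly."""
    return mean(income for income, _ in people)


def observed_mean(people):
    """Mean income over respondents only, which is what the survey reports."""
    return mean(income for income, responded in people if responded)


def response_rate(people):
    """Share of the population that responded."""
    return sum(1 for _, responded in people if responded) / len(people)


if __name__ == "__main__":
    print("response rate:", response_rate(population))
    print("true mean:", true_mean(population))
    print("observed mean:", observed_mean(population))
```

In this toy case the respondents' mean sits far below the population mean, even though every respondent answered truthfully. What is not seen drives the gap.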

Page 8: Open science and data sharing: the DataFirst experience/Martin Wittenberg

An example from my research
Compare earnings in tax data and surveys
• Wages of employees

Blog post at http://www.econ3x3.org/

Page 9: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Measurement issues
The picture when looking at earnings from self-employment (business profits)

Why?
• Penalties for not reporting
• But accurate reporting means paying more tax

Page 10: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Data within the research ecosystem

• In summary, data is not useful for research unless
– We know where it has come from
– We know what sort of errors/biases are likely to be involved in the measurement process
• AND
– People who are working on applied questions know that it exists and can be accessed

Page 11: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Difficulties with sharing data
• One of the challenges of sharing data is to provide enough information about
– Context
– Measurement process (Metadata)
• Plus the data must be stored in a way that it is “discoverable”
• All of this costs time and effort
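
One way to picture the metadata burden is as a checklist a dataset should pass before release. The sketch below is hypothetical: the field names and the example record are assumptions made for illustration, not DataFirst's actual schema or any real dataset.

```python
# Hypothetical metadata checklist; field names are illustrative assumptions,
# not an actual DataFirst or standards-body schema.
REQUIRED_FIELDS = [
    "title",
    "producer",
    "collection_dates",
    "sampling_design",
    "questionnaire",
    "processing_notes",
    "access_conditions",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Invented example record with some documentation still outstanding.
example = {
    "title": "Example household survey",
    "producer": "Example statistics office",
    "collection_dates": "2017",
    "sampling_design": "",  # present but not yet documented
    "questionnaire": "questionnaire.pdf",
}

if __name__ == "__main__":
    print("still to document:", missing_fields(example))
```

Each missing field is a piece of context a secondary user would otherwise have to guess, which is where the time and effort go.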

Page 12: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Other difficulties
• Fear of getting scooped with one’s own data
• Fear of someone else finding a path-breaking application of the data that one hadn’t thought of
• Fear of problems/errors in the measurement process being exposed
• Confidentiality/privacy of respondents
– Ethics clearance

Page 13: Open science and data sharing: the DataFirst experience/Martin Wittenberg

How might one deal with these?

• Getting scooped
– Delay public release
• “Important Science” vs “Mere data gathering”
– Underlying issue is really one of skill
– Response is often “data squatting”/rent extraction
– A more creative response is to find ways to get training programmes up around the data

Page 14: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Issues with sharing, cont.
• Exposing problems with the measurement process
– Becomes more critical if these data are the only ones available
– Reality is that there is no 100% clean dataset
– Provided that there is still a detectable “signal” in the data, it can still be used for science
  • It becomes easier to “fix” the problems if they are openly acknowledged

Page 15: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Issues with sharing, cont.

• Confidentiality
– “Open science” doesn’t mean that the data has to be available on the web for anyone
– Key issue is that there have to be transparent protocols for access
– e.g. “Secure Labs” as recently established in DataFirst

Page 16: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Why sharing is essential
• Proper science
– Can only be done if results can be replicated
– Errors in analysis/measurement exposed
• New insights
– It is impossible for one team to be on top of all the ways in which a dataset could be used
– Making data available allows some of the best and brightest people in the world to think about your issues/problems
  • e.g. many of our insights into the impact and effectiveness of South Africa’s old age pension system came from American academics
– Of course some garbage is likely to be generated in the process too

Page 17: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Why sharing is essential, cont.

• Improvement in skills
– South African quantitative social scientists of my generation learned most of what we know from seeing international economists (notably Nobel prize winner Angus Deaton) work on our data

• He showed that there are fascinating questions to be answered

• He made his code available

Page 18: Open science and data sharing: the DataFirst experience/Martin Wittenberg

How do we make sharing more successful?

• This is really a question not only about the incentives to researchers and research organisations

• But also about institutions that can facilitate this process

• Organisations like DataFirst play an important role here

Page 19: Open science and data sharing: the DataFirst experience/Martin Wittenberg

The issue is really how to strengthen the links

Theory
• e.g. how markets work

Application
• e.g. the impact of imposing a minimum wage in 2018

Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns

Page 20: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Overview

Dissemination

[Diagram: Data Producer and Skilled user, linked by “Dissemination” and “Feedback”]

Page 21: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Replicability of results

[Diagram labels: Data, Analysis, Published Paper, Review/Replication, Follow-up, Skilled Researcher, Reader]

Page 22: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Best practice data production

[Diagram labels: Data Producer, Methodological Research, “Best practice”, Practical Issues, Feedback]

Page 23: Open science and data sharing: the DataFirst experience/Martin Wittenberg

Best practice data analysis

Page 24: Open science and data sharing: the DataFirst experience/Martin Wittenberg

How can we strengthen these loops?

• These are not “add-ons” – they are an integral part of a successful science infrastructure
– Like libraries, research clouds etc.
– Need to be supported:
  • Financially
  • Mandates for sharing data, particularly if public funds have been used in collecting them