Linking the DAMES & e-Stat Nodes Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting DAMES is the ‘Data Management through e-Social Science’ research Node , www.dames.org.uk

Mar 28, 2015


Molly Adair

Linking the DAMES & e-Stat Nodes

Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting

DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk


1. Some background on DAMES

2. First thoughts on linking DAMES and e-Stat

3. Some proposals on usability / services


1) Data Management through e-Social Science

DAMES – www.dames.org.uk

ESRC Node funded 2008-2011

Aim: Useful social science provisions
• Specialist data topics: occupations; educational qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources

Aim: To exploit/engage with existing DM resources
• In social science, e.g. ESDS, CESSDA
• In e-Science, e.g. OGSA-DAI; OMII


To us ‘Data management’ means…

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]

Usually performed by social scientists themselves
• Pre-analysis tasks (though often revised/updated)
• Inputs also from data providers

Usually a substantial component of the work process
• But may not be explicitly rewarded (and sometimes penalised)

Differentiate from archiving / controlling data itself


Some components…

Manipulating data
• Recoding categories / ‘operationalising’ variables

Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)

Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions

Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’

Cleaning data
• ‘Missing values’; implausible responses; extreme values


Example – recoding data

Count: highest educational qualification (rows) by educ4 (columns)

educ4 codes: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00  Total
-9 Missing or wild                    323      0      0      0      0    323
-7 Proxy respondent                   982      0      0      0      0    982
1 Higher Degree                         0    425      0      0      0    425
2 First Degree                          0   1597      0      0      0   1597
3 Teaching QF                           0      0    340      0      0    340
4 Other Higher QF                       0      0   3434      0      0   3434
5 Nursing QF                            0      0    161      0      0    161
6 GCE A Levels                          0      0      0   1811      0   1811
7 GCE O Levels or Equiv                 0      0      0      0   2518   2518
8 Commercial QF, No O Levels            0      0      0    331      0    331
9 CSE Grade 2-5, Scot Grade 4-5         0      0      0      0    421    421
10 Apprenticeship                       0      0      0    257      0    257
11 Other QF                           102      0      0      0      0    102
12 No QF                                0      0      0      0   2787   2787
13 Still At School No QF              138      0      0      0      0    138
Total                                1545   2022   3935   2399   5726  15627
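The recode behind this crosstab can be written down as an explicit lookup, which is the kind of annotated record the talk argues for. The sketch below is an illustration in Python, not the SPSS syntax that produced the slide. It assumes the rows of counts appear in the same order as the value labels, and the names EDUC4, COUNTS, recode and tabulate are made up for this example.

```python
# Source code -> educ4 code, read off the crosstab (assumed row order).
EDUC4 = {
    -9: -9, -7: -9,        # missing or wild; proxy respondent
    1: 1, 2: 1,            # 1 = degree
    3: 2, 4: 2, 5: 2,      # 2 = diploma
    6: 3, 8: 3, 10: 3,     # 3 = higher school or vocational
    7: 4, 9: 4, 12: 4,     # 4 = school level or below
    11: -9, 13: -9,        # 'other' and 'still at school' treated as missing
}

# Row totals from the crosstab, keyed by source code.
COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode(value, mapping=EDUC4, missing=-9):
    """Return the educ4 code for a source code; unknown codes become missing."""
    return mapping.get(value, missing)


def tabulate(counts):
    """Sum the source-category counts within each recoded category."""
    totals = {}
    for code, n in counts.items():
        new = recode(code)
        totals[new] = totals.get(new, 0) + n
    return totals


totals = tabulate(COUNTS)
```

Under these assumptions the recoded totals match the column totals of the crosstab, which is a quick check that a recode has done what was intended.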


Example – Linking data

Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk


Matching files (‘deterministic’)

Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...

One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta

One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta

Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)

Many-to-Many matches

Related cases matching
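For readers without SPSS or Stata, the first three match types can be sketched in plain Python. This is a minimal illustration using lists of dictionaries, not a replacement for the package commands. Apart from pid, every name here (the function names, hid, meaninc, and the example records) is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean


def match_one_to_one(left, right, key="pid"):
    """Join two files in which each key value identifies exactly one record."""
    for rows in (left, right):
        keys = [row[key] for row in rows]
        if len(keys) != len(set(keys)):
            raise ValueError("duplicate key values: not a one-to-one match")
    lookup = {row[key]: row for row in right}
    # Records without a partner are kept unchanged.
    return [{**row, **lookup.get(row[key], {})} for row in left]


def match_table(records, table, key):
    """Distribute each table row to every record sharing its key value."""
    lookup = {row[key]: row for row in table}
    return [{**row, **lookup.get(row[key], {})} for row in records]


def aggregate_mean(records, key, var, out):
    """Collapse many records per key value into one row with the mean of var."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[var])
    return [{key: k, out: mean(values)} for k, values in groups.items()]


# Made-up illustration data.
wave_a = [{"pid": 1, "age": 34}, {"pid": 2, "age": 51}]
wave_b = [{"pid": 1, "job": "nurse"}, {"pid": 2, "job": "clerk"}]
households = [{"hid": 10, "region": "Scotland"}]
persons = [{"pid": 1, "hid": 10}, {"pid": 2, "hid": 10}]
spells = [{"pid": 1, "income": 100}, {"pid": 1, "income": 200},
          {"pid": 2, "income": 50}]

linked = match_one_to_one(wave_a, wave_b)
distributed = match_table(persons, households, key="hid")
collapsed = aggregate_mean(spells, key="pid", var="income", out="meaninc")
```

The one-to-one function refuses duplicate keys on purpose: a silent many-to-many join is one of the easier ways to corrupt a linked file.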


A bit of focus…

I tend to emphasise two data management activities:

1) Variable constructions
o Coding and re-coding values

2) Linking datasets
o Internal and external linkages


..plus the centrality of keeping clear records of DM activities

Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
Cf. Dale 2006; Freese 2007

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

Syntax Examples: www.longitudinal.stir.ac.uk


Principal DAMES services (current status)

GESDE specialist data environments (prototypes)

Occupations, educational qualifications, ethnicity

Data curation tool (prototype)

Data fusion tool (prototype)

Secure data demonstrator for e-Health research (complete)

Micro-simulation model for social care data (prototype)

Training workshops and events (in progress)


GESDE – Grid Enabled Specialist Data Environments


GEODE – Occupational data


Data curation tool


The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way


Data fusion tool


2. Linking DAMES and e-Stat

High level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities

- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables

- Making replication central to the data story


Data and variables

DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)

Anything on educational qualifications, occupations, ethnicity is of particular interest

Generic tools for merging micro-data

Generic tools for other variable processes


Data oriented review

Applied research perspective

Range of data resources

Accessing and documenting data resource options


The implementation for e-Stat

This is mostly a blank space…
…and we’ve not hitherto used Python

Data curation tool and GEODE/GEEDE use iRODS

GEMDE uses a bespoke SQL database

Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal


3. A pitch for specific e-Stat facilities

..harvest the best of data analysis packages from an applied data perspective

Replication in ‘human readable syntax’

Something like Stata’s ‘est store’ for multiple model comparisons

Fluency in data oriented options

Training resources in data
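In Stata, `estimates store` keeps a fitted model under a name and `estimates table` lays stored models side by side. The Python sketch below shows what a comparable facility could look like. It is a hypothetical illustration only: the function names, the dictionary layout and the data are assumptions, and nothing here is taken from e-Stat.

```python
from statistics import mean

# Hypothetical in-memory store of named model results.
_store = {}


def fit_mean(y):
    """Intercept-only model: the estimate is the sample mean."""
    m = mean(y)
    rss = sum((v - m) ** 2 for v in y)
    return {"coef": {"_cons": m}, "rss": rss, "n": len(y)}


def fit_line(x, y):
    """Least-squares regression of y on a single x (closed form)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    cons = my - slope * mx
    rss = sum((b - (cons + slope * a)) ** 2 for a, b in zip(x, y))
    return {"coef": {"x": slope, "_cons": cons}, "rss": rss, "n": len(y)}


def est_store(name, result):
    """Keep a fitted result under a name for later comparison."""
    _store[name] = result


def est_table(names):
    """Return (term, [value per model]) rows; None where a term is absent."""
    terms = []
    for n in names:
        for t in _store[n]["coef"]:
            if t not in terms:
                terms.append(t)
    rows = [(t, [_store[n]["coef"].get(t) for n in names]) for t in terms]
    rows.append(("rss", [_store[n]["rss"] for n in names]))
    return rows


# Made-up illustration data.
age = [20, 30, 40, 50]
income = [10.0, 14.0, 18.0, 22.0]

est_store("m0", fit_mean(income))
est_store("m1", fit_line(age, income))
table = est_table(["m0", "m1"])
```

Keeping results by name means the comparison table can be rebuilt from the syntax alone, which ties this facility to the replication point made above.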


Est store demo here


Appendix items


Data file specification Variable manipulation & analysis

DAMES most common commands:

-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A; individuals wave B; link variable=pid; format=wide}

Commands invoking other packages:

-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
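A text interface of this kind needs some way to split each line into a command name and a body before handing it to the right back end. The sketch below is one hypothetical way to do that in Python. The grammar ("-> name{body}", with no nested braces), the function names and the native/pass-through split are all assumptions made for illustration, not a description of any DAMES or e-Stat implementation.

```python
import re

# Assumed grammar: "-> name{body}"; the body may span lines, no nested braces.
COMMAND = re.compile(r"->\s*(\w+)\s*\{(.*?)\}", re.DOTALL)

# Names treated as calls out to another package (taken from the slide).
PACKAGES = {"SPSS", "Stata", "R"}


def parse_commands(text):
    """Return (name, body) pairs, collapsing whitespace inside each body."""
    return [(name, " ".join(body.split()))
            for name, body in COMMAND.findall(text)]


def classify(name):
    """Label a command as passed through to a package or handled natively."""
    return "pass-through" if name in PACKAGES else "native"


script = """
-> usedataset{UKDA_5151}
-> matchdata{individuals wave A;individuals wave B; link
variable=pid; format=wide}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
"""

commands = parse_commands(script)
kinds = [classify(name) for name, _ in commands]
```

Because the pattern stops at the first closing brace, a body that itself contains braces would need a fuller parser.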

[Figure: interface mock-up. Panel labels: Model 1; Graphics; Text interface (invoked manually or in response to manipulating graphs). Data files shown: BHPS wave A individuals; BHPS wave B individuals; Wave C; Analytical file. Variables shown: Gender; Current job RGSC; Spouse CAMSIS; Age (yrs); Age bands; Spouse SOC.]


‘The significance of data management for social survey research’

(see http://www.esds.ac.uk/news/eventdetail.asp?id=2151)

The data manipulations described above are a major component of the social survey research workload

Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records

Post-release manipulations performed by researchers
• Re-coding measures into simple categories

We do have existing tools, facilities and expert experience to help us…but we don’t make a good job of using them efficiently or consistently

So the ‘significance’ of DM is about how much better research might be if we did things more effectively…


Some provocative examples for the UK…

Social mobility is increasing, not decreasing!
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables

Degrees, MScs and PhDs are getting easier!
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread

‘Black-Caribbeans’ are not disappearing!
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures


Comment – growing interest in data management..?

Historically, references covering DM were few and far between
• Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.

Recently, there’s been a small burst of relevant references
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, IL: SPSS Inc.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
• http://www.esds.ac.uk/support/onlineguides.asp
• http://www.longitudinal.stir.ac.uk/

..and growing interest re. ‘documentation for replication’
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.


E-Science and Data Management

E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…

1. Concern with standards setting in communication and enhancement of data

2. Linking distributed/heterogeneous/dynamic data
   Coordinating disparate resources; interrogating live resources

3. Contribution of metadata tools/standards for variable harmonisation and standardisation

4. Linking data subject to different security levels

5. The workflow nature of many DM tasks