Page 1: For the e-Stat meeting of 27 Sept 2010. Paul Lambert / DAMES Node inputs.

Page 2:

1) Progress updates

• DAMES Node services of a hopefully generic/transferable nature
  – GESDE services on occupations, educational qualifications and ethnicity (www.dames.org.uk)
  – Data curation tool
  – Data fusion tool for merging data and recoding/standardising variables

Page 3:

GESDE: online services for data coordination/organisation

Tools for handling variables in social science data: recoding measures; standardisation/harmonisation; linking; curating.

17/MAR/2010 DIR workshop: Handling Social Science Data

Page 4:

The data curation tool

The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way.

Page 5:

Fusion tool (invoking R) - scenarios

Component                  | Description                          | Fusion Tool Requirements
Input 1 = Data 1           | e.g. working data                    | Declaration of file location (online or in iRODS system)
Input 2 = Data 2           | e.g. external information            | Ditto (or other expert input)
Linkage mechanism          | 1. Deterministic; 2. Probabilistic; 3. Recode/Transform | Declaration of which of the three types of link is to be used, and file formats involved
Argument specification     |                                      | Listing of required arguments to the mechanism (e.g. input and output files; variable names, linked to standard classifications)
R script invocation        | e.g. Condor submission               | API or other device collects the above inputs and applies them to an R template
Output = Data 3            | File 1 + [a bit of File 2]           | New file is generated, to be supplied to user

Page 6:

Currently: Expected inputs to e-Stat, Autumn 2010

First applications in integrating DAMES data preparation tools with e-Stat model-building systems:

1) {Coordination/planning on WP1.6 workflow tools for pre-analysis} (?De Roure, McDonald, Michaelides, Lambert, Goldstein, Southampton RA?)

2) Template construction with applications using variable recodes and other pre-analysis adjustments from DAMES systems, with a view to generating generic template facilities

3) Preparation of some 'typical' example survey data/models (e.g. 10k+ cases, 50+ variables) and their implementation in e-Stat, e.g. cross-national/longitudinal comparability examples

4) Possible e-Stat inputs to DAMES workshops (Nov 24-5 / Jan 25-6)

Page 7:

7a) Links with DAMES

• The DAMES Node core funding period is Feb 2008 to Jan 2011
• Further discussion of integrating pre-analysis services from DAMES into e-Stat facilities and templates
• Appetite for other application-oriented contributions?
  – Alternative measures for the 'changing circumstances during childhood' application?
  – ?Preparation of illustrative application(s) with complex survey data
    • Would need data spec. and broad analytical plan

Page 8:

Pre-analysis options associated with DAMES

Things that could be facilitated by the fusion tool (R scripts), in combination with the curation tool and, if relevant, specialist data (e.g. from GESDE):

• Alternative measures/derived data
  – [via deterministic matches/variable transformation routines]
  – Using GESDE: occupations, educational quals, ethnicity
  – (?Health-oriented measures using Obesity e-Lab dropbox facility?)
  – Generic routines: arithmetic standardisation tools
  – Replicability of measurement construction (e.g. syntax log of tasks)

• Other possible data/review possibilities
  – [new but easy] Routine for summarizing data (see wish list)
  – [new, probably not easy] Weighting data options; routine for identifying values with high leverage / high residuals
  – (?provided elsewhere) Probabilistic matching routines

Page 9:

Model for data locations?

• The 'curation tool' can be used to attach variable names and metadata to facilitate variable processing
• We then have a model of storing the data in a secure remote location ('iRODS server'), from where jobs can be run on it (e.g. in R)
  – Is this a suitable model for e-Stat?
  – Is there another data location model?
  – Or better to supply scripts to run on files in an unspecified location?

Page 10:

Fusion tool (invoking R) - scenarios

Component                  | Description                          | Fusion Tool Requirements
Input 1 = Data 1           | e.g. working data                    | Declaration of file location (online or in iRODS system)
Input 2 = Data 2           | e.g. external information            | Ditto (or other expert input)
Linkage mechanism          | 1. Deterministic; 2. Probabilistic; 3. Recode/Transform | Declaration of which of the three types of link is to be used, and file formats involved
Argument specification     | (example overleaf)                   | Listing of required arguments to the mechanism (e.g. input and output files; variable names, linked to standard classifications)
R script invocation        | e.g. Condor submission               | API or other device collects the above inputs and applies them to an R template
Output = Data 3            | File 1 + [a bit of File 2]           | New file is generated, to be supplied to user

Page 11:

Mechanism 1: Deterministic link

• Here information is joined on the basis of exactly matching values

• Example Condor job:

universe = vanilla
executable = /usr/bin/R
arguments = --slave --vanilla --file=bhps_test.R --args /home/pl3/condor/condor_5/wave1.dta /home/pl3/condor/condor_5/wave17.dta /home/pl3/condor/condor_5/bhps_combined.dat pid wave file pid wave file
notification = Never
log = test1.log
output = test1.out
error = test1.err
queue

Page 12:

• The input files here are Stata-format data
• The output is plain-text format data
• There are 3 linking variables, which happen to have the same names on both files
  – I.e. 'pid wave file' on file 1, and also 'pid wave file' on file 2
  – Different names would be fine, but the same number of linking variables on both files is essential
  – Different total numbers of linking variables are fine (most often there is only 1)
• Different R templates can be used to read data in different formats (e.g. Stata, SPSS, plain text), though exported data can only be readily supplied in plain text

Page 13:

The R template being run in the above application is:

args <- as.factor(commandArgs(trailingOnly = TRUE))
options(useFancyQuotes = TRUE)
fileAinp <- as.character(args[1])
fileBinp <- as.character(args[2])
fileCout <- as.character(args[3])
##
library(foreign)
fileA <- read.dta(fileAinp, convert.factors = F)
fileB <- read.dta(fileBinp, convert.factors = F)
nargs <- sum(!is.na(args))
allvars <- args[4:nargs]
nargs2 <- sum(!is.na(allvars))
first_vars <- as.character(allvars[1:(nargs2/2)])
second_vars <- as.character(allvars[((nargs2/2)+1):nargs2])
######
combined2 <- merge(fileA, fileB, by.x = c(first_vars), by.y = c(second_vars),
                   all.x = T, all.y = F, sort = F, suffixes = c(".x", ".y"))
######
write.table(combined2, file = fileCout, col.names = TRUE, sep = ",")
###

Page 14:

Mechanism 2: Probabilistic link

• This is when data from different files are linked on criteria which are not just an exact match of values, but include some probabilistic algorithm
  – E.g. for each person in data 1 with the same characteristics, select a random person from the pool of people in data 2 who are aged 35-40, male, education = high, marital status = married, and link their voting preference data to the person in data 1
• Other implementation requirements are equivalent to deterministic matching, so long as the criteria for the matching algorithm are determined
• Status: we don't yet have a pool of probabilistic matching algorithms; we have one so far, which is random matching as in the above example
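The random within-pool matching described above can be sketched in R as follows. This is an illustrative sketch, not the DAMES template itself; the function name, column names and data are invented for the example.

```r
# Random within-pool matching: for each row of data1, draw one random row
# from data2 that shares the same key values, and attach its donor variable.
# All names here (random_pool_match, age_band, vote) are illustrative.
set.seed(42)

random_pool_match <- function(data1, data2, keys, donor_var) {
  key1 <- do.call(paste, data1[keys])
  key2 <- do.call(paste, data2[keys])
  data1[[donor_var]] <- sapply(key1, function(k) {
    pool <- data2[[donor_var]][key2 == k]
    # Sample one donor from the matching pool; NA if the pool is empty
    if (length(pool) == 0) NA else pool[sample.int(length(pool), 1)]
  })
  data1
}

data1 <- data.frame(id = 1:3, age_band = c("35-40", "35-40", "41-45"))
data2 <- data.frame(age_band = c("35-40", "35-40", "41-45"),
                    vote = c("A", "B", "C"))
out <- random_pool_match(data1, data2, keys = "age_band", donor_var = "vote")
```

Several keys can be passed at once (e.g. age band, sex, education, marital status) to reproduce the pooling criteria in the voting example.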

Page 15:

Mechanism 3: Recoding/Transforming

• Here the scenario is the application of an externally provided data recode, or other externally instructed arithmetic operation, to a variable within data 1
• E.g. take the educational qualifications measure, which is coded 1 to 20 in data 1; recode 1 thru 5 to the value 1, 6 thru 10 to the value 2, and all others to the value 3 (this is statistically equivalent to a deterministic match, but some recode inputs may not list every possible value)
• E.g. take the measure of income and calculate its mean-standardised values within subgroups defined by regions (i.e. minus the regional mean, divided by the regional standard deviation)
  – Status/requirement: we need to develop a suitable mechanism to take recode-style information/instructions from relevant external sources, and convert it into a suitable format for applying either a 'recode' or 'merge' routine in R
  – We'd like to support:
    • Recode information supplied via SPSS and Stata syntax specifications; data file matrices; and, potentially, manual specifications
    • Other transformation procedures supplied in advance from a small range of possibilities (e.g. mean standardisation; log transformation; cropping of extreme values) plus a small set of related arguments (e.g. category variables)
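The within-region mean standardisation mentioned above can be sketched in base R with ave(). The data and column names are illustrative assumptions, not taken from the DAMES templates.

```r
# Mean-standardise income within regions: subtract the regional mean
# and divide by the regional standard deviation.
# Data and column names (region, income) are illustrative.
data1 <- data.frame(region = c("N", "N", "N", "S", "S", "S"),
                    income = c(10, 20, 30, 100, 200, 300))

data1$income_std <- ave(data1$income, data1$region,
                        FUN = function(x) (x - mean(x)) / sd(x))
```

After the transformation, each region's standardised income has mean 0 and standard deviation 1, which is the property the slide describes.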

Page 16:

Recode examples

Stata syntax:
recode var1 1/5=1 6/10=2 *=3, generate(var2)

SPSS syntax:
recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) /into=var2.

Data matrix format and manual entry interface (SPSS example): [shown on slide]
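For comparison, the same recode expressed in R (the language the fusion tool invokes) might look like this sketch; the variable names follow the Stata/SPSS examples above.

```r
# R equivalent of the recode examples above:
# values 1-5 -> 1, values 6-10 -> 2, everything else -> 3.
var1 <- c(1, 5, 6, 10, 11, 20)

var2 <- ifelse(var1 >= 1 & var1 <= 5, 1,
        ifelse(var1 >= 6 & var1 <= 10, 2, 3))
```

A generic recode mechanism would generate expressions like this from the supplied Stata/SPSS syntax or a data-matrix specification.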

Page 17:

=> Linking data management services into the e-Stat template

Add 'data review' and 'data construction' elements, plus possible additional requests for modelling options

• Data review: single script with minor variations on data
• Data construction: as above, these involve variable operations and linkages with other files/resources
  – Derive measure on occupations, educational qualifications or ethnicity, given information on the character of existing data
    • Collected via the curation tool, or, more realistically, from a short range of pre-supplied alternatives?
  – Distributional transformations, including standardisation; numeric transformation; review variable distribution
• Model extensions: weight cases options; leverage review

Page 18:

8d) Wish lists/Suggestions

• Include tools for describing/summarizing data
  – Outputs from generic 'summarize' commands in R linked to all templates
  – Tool for reviewing model results / leverage, feeding back into model respecification
  – Tools for applying survey weight variables to analysis (?)
• User notes for models constructed ('What was that?')
  – Of benefit to novice and advanced practitioners
  – Potentially a part of the e-notebook, but could be a linked online guide (static)
  – E-Stat commands to provide documentation for replication
  – Terminologies used for the model / other user notes
  – Software equivalents or near equivalents (including estimator specs)
  – Algebraic expression and model abstract
• Tools for storing/compiling multiple model results
  – (mentioned previously, cf. 'est table' in Stata)
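One way the "storing/compiling multiple model results" wish could look in R, in the spirit of Stata's 'est table', is sketched below. The models and dataset are illustrative only (mtcars is a standard R demo dataset, not DAMES data).

```r
# Compile coefficients from several fitted models into one comparison
# table, similar in spirit to Stata's 'est table'.
# Models and data are illustrative.
data(mtcars)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

models <- list(m1 = m1, m2 = m2)
# Union of all coefficient names across the stored models
all_terms <- unique(unlist(lapply(models, function(m) names(coef(m)))))
# One column per model; NA where a model does not include a term
est_table <- sapply(models, function(m) coef(m)[all_terms])
rownames(est_table) <- all_terms
```

This produces a terms-by-models matrix, with NA marking terms absent from a given model, which is the basic layout an e-Stat results-compilation tool would need.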

Page 19:

Possible components of 'model description' user notes

1) E-Stat model syntax:

model {
  for (i in 1:length(y36)) {
    y36[i] ~ dnorm(mu[i], tau)
    mu[i] <- cons[i] * beta0 + y8[i] * beta1
  }
  # Priors
  beta0 ~ dflat()
  beta1 ~ dflat()
  tau ~ dgamma(0.001000, 0.001000)
}

2) E-Stat model and name:

Template1Lev = Linear regression using MCMC

3) Model abstract, e.g. something like: "This model is suitable for a single outcome measure with a continuous distribution. It is comparable to the widely used OLS regression model, and usually leads to identical results, but by using the MCMC estimation method it can lead to different parameter estimates in some circumstances. The model presumes no structured relationship between different cases in the data. See … for further description."

4) Other common names for this model: Bayesian regression; etc.

5) Specification of the model in other popular packages:
  BUGS syntax: [as E-Stat syntax]
  MLwiN syntax: [input here]
  R: [input here]
  Stata: MCMC estimation routines not available
  SPSS: …
  Etc.

6) Algebraic representation: [Y = bX + e, etc.]

Page 20:

• Est store demo here