Data re-use in the CALIBER programme

Post on 11-May-2015


DESCRIPTION

An overview of work being performed to make research data easier to manage, analyse and use in the CALIBER programme. Presentation given by Anoop Shah of UCL at the Data Management in Practice workshop, which took place on Nov 14th at the London School of Hygiene and Tropical Medicine.

Transcript

Data re-use in the CALIBER programme

Anoop Shah (a.shah@ucl.ac.uk)

Clinical Epidemiology Group, University College London

14th November 2013

1 The CALIBER programme

2 Why make research data re-usable?

3 The CALIBER approach

4 Summary

The CALIBER programme

UCL & LSHTM collaboration

[Diagram: General practice, Hospital Episode Statistics, MINAP registry and death registrations, linked to form the CALIBER linked research database]

Funded by NIHR and Wellcome Trust

CALIBER data

Defining continuous variables

Clinical, e.g. blood pressure; laboratory, e.g. white cell count

• Recorded in CPRD (primary care)
• Identified by ‘entity code’ and medcode (more granular)
• Lab data now electronically transferred
• Problems:
  • Missing units
  • Erroneous values
  • Inconsistent recording
  • Missing data

Medcodes associated with a test result

Example: neutrophil counts (a type of white blood cell) – may be absolute or percentage

Medcode   Percent   Term
18        89.6      Neutrophil count
17622     9.9       Percentage neutrophils
23114     0.3       Granulocyte count
23115     0.1       Percentage granulocytes
13777     0.1       Neutrophil count NOS
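Because the same test can arrive as an absolute count or a percentage, one way to keep the two apart is a small lookup keyed on medcode. The sketch below uses the medcodes and terms listed above; the "absolute" and "percentage" labels are an assumption inferred from the term text, and `result_kind` is a hypothetical helper, not part of any CALIBER package.

```python
# Medcodes and terms come from the slide; the kind assigned to each
# medcode is an assumption inferred from the wording of the term.
MEDCODE_KIND = {
    18: "absolute",       # Neutrophil count
    17622: "percentage",  # Percentage neutrophils
    23114: "absolute",    # Granulocyte count
    23115: "percentage",  # Percentage granulocytes
    13777: "absolute",    # Neutrophil count NOS
}


def result_kind(medcode: int) -> str:
    """Return 'absolute', 'percentage', or 'unknown' for a medcode."""
    return MEDCODE_KIND.get(medcode, "unknown")
```

Records of unknown kind can then be set aside for manual review rather than silently pooled with the rest.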

Distribution of values for different units

Most common units

Analysis issues

• Extraction algorithm
• Remove biologically implausible extreme values
  • In a huge dataset with no restriction on possible values, there will be some errors
• Standardise units
• Decide how to analyse
  • Timing e.g. relative to index date
  • Repeat measures
  • Transformation, splines, categories etc.
  • Missing data (e.g. multiple imputation)
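The first two cleaning steps (standardising units and removing implausible values) might be sketched as follows. The unit strings, divisors and plausibility range are all hypothetical placeholders chosen for illustration; a real study would take them from its own lookup tables and clinical review.

```python
from typing import Optional

# Hypothetical divisors that convert each unit to 10^9 cells/L.
UNIT_DIVISORS = {
    "10^9/L": 1.0,
    "10^6/L": 1000.0,
}

# Hypothetical plausibility range for an absolute count, in 10^9/L.
PLAUSIBLE_RANGE = (0.0, 100.0)


def clean_neutrophil_count(value: float, unit: Optional[str]) -> Optional[float]:
    """Return the value in 10^9/L, or None if it cannot be used."""
    if unit not in UNIT_DIVISORS:
        return None  # missing or unrecognised unit
    standardised = value / UNIT_DIVISORS[unit]
    low, high = PLAUSIBLE_RANGE
    if not (low <= standardised <= high):
        return None  # outside the assumed plausible range
    return standardised


# Hypothetical records: (value, unit).
records = [(4.2, "10^9/L"), (4200.0, "10^6/L"), (4200.0, "10^9/L"), (3.1, None)]
cleaned = [clean_neutrophil_count(v, u) for v, u in records]
```

Returning `None` rather than dropping the record keeps the missingness visible, so that a later step (for example multiple imputation) can decide how to handle it.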

Observation time in GP practice

• Observation time – when registered at GP practice
• Practice ‘up to standard date’ – date after which we expect that data are recorded
• If nothing recorded while registered at GP:
  • Patient may be abroad
  • Patient may be genuinely healthy
• Excluding observation time with no records risks bias
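One common way to combine the two dates above is to start follow-up at whichever comes later, so that a patient only contributes time when they are registered and the practice is recording reliably. The slide does not state this rule explicitly, so treat the sketch below as an assumption; the function name and dates are hypothetical.

```python
from datetime import date


def observation_start(registration_date: date, up_to_standard_date: date) -> date:
    """Start of usable follow-up: the later of the two dates (assumed rule)."""
    return max(registration_date, up_to_standard_date)


# Hypothetical patient who registered before the practice was up to standard.
start = observation_start(date(1995, 6, 1), date(1998, 1, 1))
```

Note that the start date depends only on registration and practice quality, not on whether the patient has any records, which avoids the bias described in the last bullet.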

Defining a diagnosis, e.g. atrial fibrillation

Defining a diagnosis

• Cross-map against different datasets
• Individual data sources may miss cases, so consider using linked datasets
• Important for accurate measures of incidence
• May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
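Cross-mapping can be as simple as comparing the set of patients each source identifies. The following is a minimal sketch with made-up patient IDs and hypothetical source names; it shows which cases each source would miss if used alone.

```python
# Hypothetical patient IDs identified as cases in each data source.
cases_by_source = {
    "primary_care": {1, 2, 3, 5},
    "registry": {2, 3, 4},
    "hospital": {3, 4, 5, 6},
}

# Every patient identified by at least one source.
all_cases = set().union(*cases_by_source.values())

# Cases that each source would miss if it were used on its own.
missed = {
    source: sorted(all_cases - ids) for source, ids in cases_by_source.items()
}
```

Here every source misses at least one case, which is why incidence estimated from a single source tends to be too low.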

Non-fatal myocardial infarction – all sources miss cases

[Venn diagram: overlap of cases recorded in primary care (CPRD), the MINAP disease registry and Hospital Episode Statistics. Segment labels: 8%, 6%, 7%, 20%, 18%, 10%]

Motivations for re-using data

• Time taken to prepare data and define variables
• Cost
• Different definitions used by different groups
• Lack of transparency and reproducibility

Possible approaches

• Ad hoc sharing of codelists and algorithms within a group
• Publish codelists and algorithms with papers
• The CALIBER approach
  • Repository of codelists and algorithms
  • Web portal for researcher access

CALIBER ‘LEGO’ data access model

1001, 2000-01-01, 23,1,NULL,I48
1001, 1994-08-11,1234,1,3,7L1H300
1001, 1993-01-01, 253,1,1,793Mz00
1231, 2012-03-03, 23,1,123,K65
1121, 2013-05-04, 7,1,3,5,14AN.00
1121, 2011-05-21, 81,1,9, G573100
1511, 1993-01-11, 91,1,6,9hF1.00
1511, 199-03-11, 91,1,6, G573100
9913, 2012-05-21, 81,1,9, G573100
67222, 1994-11-01,1234,1,3,7L1H300
67222, 1995-12-21,1234,1,3,7L1H300
67222, 1991-03-03,1234,1,3,7L1H310
682444, 1993-01-01, 253,1,1,793Mz00

1001, 2000-01-01, af_gprd=1
1231, 2012-03-03, af_hes=3
1121, 2013-05-04, af_procs_gprd=1
1511, 1993-01-11, heart_valve_gprd=2
9913, 2012-05-21, af_hes=1
67222, 1994-08-11, af_hes=1
682444, 1993-01-01, heart_valve_hes=2

af=1, af_diag_date=2001-12-01
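The three stages shown on this slide appear to be: raw coded records, source-specific variables obtained by matching codes against codelists, and a combined phenotype per patient. A minimal sketch of that pipeline follows. The codelist entries, event rows and function names are hypothetical and are not CALIBER's actual definitions; the rule of taking the earliest matching date as the diagnosis date is also an assumption.

```python
from datetime import date

# Hypothetical codelists: (source, code) -> (variable name, category).
CODELISTS = {
    ("hes", "I48"): ("af_hes", 1),
    ("gprd", "G573000"): ("af_gprd", 1),
}

# Hypothetical raw events: (patient ID, event date, source, code).
raw_events = [
    (1001, date(2000, 1, 1), "hes", "I48"),
    (1001, date(1994, 8, 11), "gprd", "7L1H300"),
    (1001, date(1998, 5, 2), "gprd", "G573000"),
    (1231, date(2012, 3, 3), "hes", "K65"),
]


def map_events(events):
    """Stage 2: keep events whose code is in a codelist, labelled by variable."""
    mapped = []
    for patid, when, source, code in events:
        if (source, code) in CODELISTS:
            variable, category = CODELISTS[(source, code)]
            mapped.append((patid, when, variable, category))
    return mapped


def derive_af(mapped):
    """Stage 3: one entry per patient with an AF flag and earliest date."""
    earliest = {}
    for patid, when, variable, _category in mapped:
        if variable.startswith("af_"):
            if patid not in earliest or when < earliest[patid]:
                earliest[patid] = when
    return {p: {"af": 1, "af_diag_date": d} for p, d in earliest.items()}


phenotypes = derive_af(map_events(raw_events))
```

Keeping the codelists as data, separate from the functions that apply them, is what lets the same building blocks be re-used across studies.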

CALIBER phenotypes (research variables)

• Consistent definitions for multiple studies (over 300 variables curated)
• Read, ICD-9, ICD-10, OPCS codelists
• Web portal to view variable definitions, and registered users can view codelists (https://www.caliberresearch.org/portal)
• Future: able to download scripts (e.g. Stata, R, SQL)

CALIBER data portal

Open data

CALIBER data portal

• Encourage researchers to define variables in a way that will be of use to others
• Final validated versions of codelists and variables
• Review by clinician and researcher

CALIBER analysis software

• R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
• Lookup tables and data dictionaries
• Functions to simplify / automate common steps in data preparation

CALIBER expects researchers to contribute to the resource

[Diagram: CALIBER research workflow. Labels: Research coordinator; Website content; Analysis; Publication; Impacts; Investigators, Non-investigators, Industry; Experienced, Non-experienced; Website form; Approvals; Data; Unified data access form; LEGO data access model; Contribute phenotyping algorithms, linkages; Project feasibility and prioritization; Open access; Contribute to knowledge base; Advancement of knowledge; Translation; Legislation, policy, guidelines; Economic benefit, industry]

Difficulties encountered

• Setting up the data portal takes time, needs dedicated staff
• Researchers need to think outside their own project
• Variables are updated / corrected; need to store different versions

Summary

• When analysing routine data, think about how the data were collected, and cross-check different sources of information
• Data sharing and re-use can bring benefits but needs time and resources to manage
