Data re-use in the CALIBER programme
Anoop Shah ([email protected])
Clinical Epidemiology Group, University College London
14th November 2013
May 11, 2015
1 The CALIBER programme
2 Why make research data re-usable?
3 The CALIBER approach
4 Summary
The CALIBER programme
UCL & LSHTM collaboration
Hospital Episode Statistics
MINAP registry
General practice
Death registrations
CALIBER linked research database
Funded by NIHR and Wellcome Trust
CALIBER data
Defining continuous variables
clinical e.g. blood pressure, laboratory e.g. white cell count
• Recorded in CPRD (primary care)
• Identified by ‘entity code’ and medcode (more granular)
• Lab data now electronically transferred
• Problems:
  • Missing units
  • Erroneous values
  • Inconsistent recording
  • Missing data
Medcodes associated with a test result
Example: neutrophil counts (a type of white blood cell), which may be absolute or percentage
Medcode  Percent  Term
18       89.6     Neutrophil count
17622    9.9      Percentage neutrophils
23114    0.3      Granulocyte count
23115    0.1      Percentage granulocytes
13777    0.1      Neutrophil count NOS
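Because one test can arrive under several medcodes, an extraction step needs to know which medcodes carry an absolute count and which a percentage. The following is a minimal Python sketch using the five medcodes from the table above. The "absolute" / "percentage" labels are an assumption inferred from the term text, and the function name and record layout are hypothetical, not part of the CALIBER software.

```python
# Medcodes and terms copied from the table above; the 'kind' labels are
# an assumption inferred from the term text, not an official CALIBER mapping.
NEUTROPHIL_MEDCODES = {
    18: ("Neutrophil count", "absolute"),
    17622: ("Percentage neutrophils", "percentage"),
    23114: ("Granulocyte count", "absolute"),
    23115: ("Percentage granulocytes", "percentage"),
    13777: ("Neutrophil count NOS", "absolute"),
}


def classify_test_record(medcode):
    """Return 'absolute', 'percentage', or None if the medcode is not listed."""
    entry = NEUTROPHIL_MEDCODES.get(medcode)
    return entry[1] if entry else None


# Hypothetical records: (patient id, medcode, recorded value)
records = [(1, 18, 4.2), (2, 17622, 61.0), (3, 99999, 5.0)]
classified = [(pid, value, classify_test_record(code)) for pid, code, value in records]
```

Records whose medcode is not in the lookup are labelled None rather than guessed, so they can be reviewed separately.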
Distribution of values for different units
Most common units
Analysis issues
• Extraction algorithm
• Remove biologically implausible extreme values
  • In a huge dataset with no restriction on possible values, there will be some errors
• Standardise units
• Decide how to analyse
  • Timing e.g. relative to index date
  • Repeat measures
  • Transformation, splines, categories etc.
  • Missing data (e.g. multiple imputation)
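Two of the steps above, standardising units and removing implausible values, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the unit strings are hypothetical examples, the plausibility limits are placeholders rather than clinical rules, and treating a missing unit as unusable is a simplification (in practice the unit might be inferred from the distribution of values).

```python
# Conversion factors to 10^9 cells/L. The unit strings are hypothetical
# examples of what might appear in the data.
TO_1E9_PER_L = {
    "10^9/L": 1.0,
    "10^6/L": 0.001,
    "/mm3": 0.001,  # 1 cell per mm3 = 10^6 cells per litre
}

# Illustrative plausibility limits (10^9/L); an assumption, not a clinical rule.
PLAUSIBLE_RANGE = (0.0, 100.0)


def standardise(value, unit):
    """Convert to 10^9/L; return None for missing or unknown units
    and for values outside the plausible range."""
    factor = TO_1E9_PER_L.get(unit)
    if value is None or factor is None:
        return None
    converted = value * factor
    low, high = PLAUSIBLE_RANGE
    return converted if low <= converted <= high else None


# Hypothetical raw (value, unit) pairs, including a missing unit and an
# implausible value that is probably recorded in the wrong unit.
raw = [(4.2, "10^9/L"), (4200.0, "/mm3"), (3.9, None), (4200.0, "10^9/L")]
cleaned = [standardise(value, unit) for value, unit in raw]
```

Values set to None here become missing data, which then need handling at the analysis stage (e.g. by multiple imputation).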
Observation time in GP practice
• Observation time: when registered at GP practice
• Practice ‘up to standard date’: date after which we expect that data are recorded
• If nothing recorded while registered at GP:
  • Patient may be abroad
  • Patient may be genuinely healthy
• Excluding observation time with no records risks bias
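One way to turn these dates into a follow-up window is sketched below in Python. It assumes that usable observation time starts at the later of the patient's registration date and the practice's up-to-standard date; the function and parameter names are hypothetical.

```python
from datetime import date


def observation_window(registration_start, registration_end, up_to_standard):
    """Return (start, end) of usable follow-up, or None if there is none.

    Assumption: follow-up starts at the later of the patient's registration
    date and the practice's up-to-standard date.
    """
    start = max(registration_start, up_to_standard)
    if start > registration_end:
        return None
    return start, registration_end


window = observation_window(
    registration_start=date(1995, 3, 1),
    registration_end=date(2010, 6, 30),
    up_to_standard=date(1998, 1, 1),
)
```

The window is kept whole even if it contains no records, in line with the point above that excluding record-free time risks bias.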
Defining a diagnosis, e.g. atrial fibrillation
Defining a diagnosis
• Cross-map against different datasets
• Individual data sources may miss cases, so consider using linked datasets
  • Important for accurate measures of incidence
  • May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
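As an illustration of using linked datasets, the Python sketch below merges first-diagnosis dates from several sources into one record per patient, keeping the earliest date and noting which sources recorded the diagnosis. The source names, patient identifiers and dates are made up for the example.

```python
from datetime import date


def combine_sources(diagnoses_by_source):
    """Merge per-source diagnosis dates into one record per patient.

    diagnoses_by_source maps a source name to {patient_id: first diagnosis date}.
    Returns {patient_id: (earliest date, sorted list of sources with a record)}.
    """
    combined = {}
    for source, diagnoses in diagnoses_by_source.items():
        for patient_id, diag_date in diagnoses.items():
            earliest, sources = combined.get(patient_id, (diag_date, []))
            combined[patient_id] = (min(earliest, diag_date), sources + [source])
    return {pid: (d, sorted(s)) for pid, (d, s) in combined.items()}


# Hypothetical data: patient 2 is missed by primary care, patient 3 by hospital data.
linked = combine_sources({
    "primary_care": {1: date(2005, 4, 2), 3: date(2009, 1, 15)},
    "hospital": {1: date(2005, 3, 28), 2: date(2007, 11, 9)},
})
cases_any_source = len(linked)
cases_primary_care_only = sum(1 for _, s in linked.values() if s == ["primary_care"])
```

Counting cases from any source rather than from one source alone is what matters for incidence estimates; the list of sources per patient also makes it easy to see how many cases each source would have missed.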
Non-fatal myocardial infarction: all sources miss cases
[Figure: non-fatal myocardial infarction records in primary care (CPRD), the MINAP disease registry and Hospital Episode Statistics. Figure labels: 8%, 6%, 7%, 20%, 18%, 10%.]
Motivations for re-using data
• Time taken to prepare data and define variables
• Cost
• Different definitions used by different groups
• Lack of transparency and reproducibility
Possible approaches
• Ad hoc sharing of codelists and algorithms within a group
• Publish codelists and algorithms with papers
• The CALIBER approach
  • Repository of codelists and algorithms
  • Web portal for researcher access
CALIBER ‘LEGO’ data access model
1001, 2000-01-01, 23,1,NULL,I48
1001, 1994-08-11,1234,1,3,7L1H300
1001, 1993-01-01, 253,1,1,793Mz00
1231, 2012-03-03, 23,1,123,K65
1121, 2013-05-04, 7,1,3,5,14AN.00
1121, 2011-05-21, 81,1,9, G573100
1511, 1993-01-11, 91,1,6,9hF1.00
1511, 199-03-11, 91,1,6, G573100
9913, 2012-05-21, 81,1,9, G573100
67222, 1994-11-01,1234,1,3,7L1H300
67222, 1995-12-21,1234,1,3,7L1H300
67222, 1991-03-03,1234,1,3,7L1H310
682444, 1993-01-01, 253,1,1,793Mz00

1001, 2000-01-01, af_gprd=1
1231, 2012-03-03, af_hes=3
1121, 2013-05-04, af_procs_gprd=1
1511, 1993-01-11, heart_valve_gprd=2
9913, 2012-05-21, af_hes=1
67222, 1994-08-11, af_hes=1
682444, 1993-01-01, heart_valve_hes=2
af=1, af_diag_date=2001-12-01
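The three levels shown above (raw coded records, codelist-derived variables, and a combined research variable) can be sketched in Python as follows. This is an illustration only: the codes are taken from the example rows, but their assignment to variables and categories is a placeholder rather than the curated CALIBER codelists, the raw rows are reduced to three fields, and the function names are hypothetical.

```python
from datetime import date

# Placeholder codelists: variable name -> {code: category}. The codes appear in
# the example rows above, but this assignment is an assumption for illustration.
CODELISTS = {
    "af_gprd": {"G573100": 1},
    "af_hes": {"I48": 1},
    "heart_valve_gprd": {"7L1H300": 2},
}

# Level 1: raw records, reduced to (patient id, event date, code).
raw_records = [
    (1001, date(2000, 1, 1), "I48"),
    (1001, date(1994, 8, 11), "7L1H300"),
    (1121, date(2011, 5, 21), "G573100"),
    (1231, date(2012, 3, 3), "K65"),
]


def apply_codelists(records, codelists):
    """Level 2: turn raw coded records into (patient, date, variable, category)."""
    flagged = []
    for patient_id, event_date, code in records:
        for variable, codes in codelists.items():
            if code in codes:
                flagged.append((patient_id, event_date, variable, codes[code]))
    return flagged


def af_phenotype(flagged, patient_id):
    """Level 3: combine source-specific AF variables into one research variable."""
    dates = [d for pid, d, variable, _ in flagged
             if pid == patient_id and variable.startswith("af_")]
    if not dates:
        return {"af": 0, "af_diag_date": None}
    return {"af": 1, "af_diag_date": min(dates)}


flagged = apply_codelists(raw_records, CODELISTS)
patient_1001 = af_phenotype(flagged, 1001)
patient_1231 = af_phenotype(flagged, 1231)
```

Keeping the codelists separate from the functions means the same building blocks can be re-used with a different codelist to define another variable.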
CALIBER phenotypes (research variables)
• Consistent definitions for multiple studies (over 300 variables curated)
• Read, ICD-9, ICD-10, OPCS codelists
• Web portal to view variable definitions, and registered users can view codelists (https://www.caliberresearch.org/portal)
• Future: able to download scripts (e.g. Stata, R, SQL)
CALIBER data portal
Open data
CALIBER data portal
• Encourage researchers to define variables in a way that will be of use to others
• Final validated versions of codelists and variables
• Review by clinician and researcher
CALIBER analysis software
• R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
• Lookup tables and data dictionaries
• Functions to simplify / automate common steps in data preparation
CALIBER expects researchers to contribute to the resource
[Figure: diagram with the following labels: Research coordinator; Investigators, Non-investigators, Industry; Experienced, Non-experienced; Website form; Unified data access form; Approvals; Data; LEGO data access model; Contribute phenotyping algorithms, linkages; Project feasibility and prioritization; Website content; Analysis; Publication; Open access; Contribute to knowledge base; Impacts; Advancement of knowledge; Translation; Legislation, policy, guidelines; Economic benefit, industry.]
Difficulties encountered
• Setting up the data portal takes time, needs dedicated staff
• Researchers need to think outside their own project
• Variables are updated / corrected; need to store different versions
Summary
• When analysing routine data, think about how the data were collected, and cross-check different sources of information
• Data sharing and re-use can bring benefits but need time and resources to manage