High-Throughput Phenotyping and Cohort Identification from Electronic Health Records for Clinical and Translational Research Jyotishman Pathak, PhD Assistant Professor of Biomedical Informatics Health Sciences Research Grand Rounds April 23, 2012
Feb 25, 2016
High-Throughput Phenotyping from EHRs
Background – The Problem
• Patient recruitment is a major bottleneck in conducting successful clinical research studies
• 50% of time is spent in recruitment
• Low participation rates (~5%); studies are underpowered
• Clinicians: lack resources to help patients find appropriate studies and trials
• Patients: have difficulty finding appropriate studies that are locally available
©2012 MFMER | slide-2
Background – Use Cases
• Large-scale genomics research
• Linking biospecimens and genetic data to personal health data via biorepositories
• Need large sample sizes for study design
• Population-based epidemiological studies in understanding disease etiology
• Often limited in scope or population diversity
• Quality metrics and HITECH Act
• Pay-for-Performance and quality-based incentives
• Population management and cohort identification are non-trivial
Electronic health records (EHRs) driven phenotyping – The Proposed Solution
• EHRs are becoming more and more prevalent within the U.S. healthcare system
• Meaningful Use is one of the major drivers
• Overarching goal: to develop techniques and algorithms that operate on normalized EHR data to identify cohorts of potentially eligible subjects on the basis of disease, symptoms, or related findings
Advantages: EHR-derived phenotyping
• There is a LOT of information about subjects
• Demographics, labs, meds, procedures, clinical notes…
• Identification of otherwise latent population differences
• Minimal costs for case ascertainment, no study-specific recruitment
• Records are “retrospectively longitudinal”
• Records are real world and contain many different phenotypes
• Transportability and reuse of phenotype definitions across EHR enabled sites = power for clinical and research studies
Challenges: EHR-derived phenotyping
• There is a LOT of information about subjects…
• Non-standardized, heterogeneous, unstructured data (compared to protocol-based structured data collection)
• Measured (e.g., demographics) vs. un-measured (e.g., socio-economic status) population differences
• Hospital specialization and coding practices
• Population/regional market landscape
The challenges can be addressed…if we
• Develop techniques for standardization and normalization of clinical data and phenotypes
• Develop techniques for transforming and managing unstructured clinical text into structured representations
• Develop techniques for transportability of EHR-driven phenotyping algorithms
• Develop a scalable, robust and flexible framework for demonstrating all of the above in a “real-world setting”
• Funded by the NHGRI/NIGMS
• Goal: to assess utility of EHRs as resources for genome science
• Each site includes a biorepository linked to EHRs
• Each project includes informatics, biostatistics, community engagement, ELSI, genetics experts
• Initial proposals included identifying a primary phenotype of interest in 3,000 subjects and conduct of a genome-wide association study at each center: Σ=18,000
• eMERGE Phase II has a target of developing ~40 phenotype algorithms by the end of 2014
• Algorithm transportability an integral component
EHR-based Phenotyping Algorithms
• Typical components
• Billing and diagnosis codes
• Procedure codes
• Labs
• Medications
• Phenotype-specific co-variates (e.g., Demographics, Vitals, Smoking Status, CASI scores)
• Pathology
• Imaging?
• Organized into inclusion and exclusion criteria
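The component categories above can be combined into inclusion and exclusion criteria. The following is a minimal sketch in Python, assuming a hypothetical in-memory patient record. The class, field names, codes, and thresholds are illustrative placeholders and are not taken from any eMERGE algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    """Hypothetical, simplified patient record (all field names are assumptions)."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)  # billing/diagnosis codes
    medications: set = field(default_factory=set)
    labs: dict = field(default_factory=dict)           # lab name -> latest value


def matches_phenotype(patient, inclusion, exclusion):
    """True when every inclusion test passes and no exclusion test passes."""
    return all(test(patient) for test in inclusion) and not any(
        test(patient) for test in exclusion
    )


# Illustrative criteria only; the code and cut-off are placeholders.
inclusion = [
    lambda p: "DX-EXAMPLE" in p.diagnosis_codes,
    lambda p: p.labs.get("example_lab", 0.0) > 5.0,
]
exclusion = [
    lambda p: "confounding_drug" in p.medications,
]

cohort = [
    Patient("p1", {"DX-EXAMPLE"}, set(), {"example_lab": 7.2}),
    Patient("p2", {"DX-EXAMPLE"}, {"confounding_drug"}, {"example_lab": 9.0}),
    Patient("p3", set(), set(), {"example_lab": 8.1}),
]
eligible = [p.patient_id for p in cohort if matches_phenotype(p, inclusion, exclusion)]
print(eligible)
```

Keeping each criterion as a separate small test makes it easier to review and refine one criterion at a time during chart review.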
EHR-based Phenotyping Algorithms
• Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95%
• For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
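The PPV and NPV targets above can be checked from chart-review counts. The sketch below uses the standard definitions, PPV = TP / (TP + FP) and NPV = TN / (TN + FN). The review counts are made up for illustration, and the helper names are assumptions.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed cases / all algorithm-flagged cases reviewed."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed cases")
    return true_positives / reviewed


def npv(true_negatives, false_negatives):
    """Negative predictive value: confirmed controls / all algorithm-flagged controls reviewed."""
    reviewed = true_negatives + false_negatives
    if reviewed == 0:
        raise ValueError("no reviewed controls")
    return true_negatives / reviewed


def meets_targets(case_ppv, control_npv, ppv_target=0.95, npv_target=0.98):
    """Compare observed values against the targets stated on the slide."""
    return case_ppv >= ppv_target and control_npv >= npv_target


# Hypothetical chart review: 50 sampled cases (48 confirmed),
# 100 sampled controls (99 confirmed).
case_ppv = ppv(true_positives=48, false_positives=2)
control_npv = npv(true_negatives=99, false_negatives=1)
print(case_ppv, control_npv, meets_targets(case_ppv, control_npv))
```

If either target is missed, the case or control definition is refined and a new sample is reviewed.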
Algorithm Development Process
[Figure: workflow diagram with the labels Data, Transform, Phenotype Algorithm, Visualization, Evaluation, NLP/SQL, Rules, and Mappings] [eMERGE Network]
Hypothyroidism: Initial Algorithm
[Figure: flowchart assigning patients to Case 1, Case 2, or Control; box labels follow] [Denny et al., 2012]
• No secondary causes (e.g., pregnancy, ablation)
• No ICD-9s for hypothyroidism
• No abnormal TSH/FT4
• No antibodies for TTG/TPO
• ICD-9s for hypothyroidism
• Antibodies for TTG or TPO (anti-thyroglobulin, anti-thyroperoxidase)
• Abnormal TSH/FT4
• No thyroid-altering medications (e.g., Phenytoin, Lithium)
• Thyroid replace. meds
• No thyroid replace. meds
• 2+ non-acute visits in 3 yrs
• No hx of myasthenia gravis
Hypothyroidism: Initial Algorithm
[Conway et al. 2011]
Hypothyroidism: Algorithm Refinement
[Figure: flowchart with the same Case 1, Case 2, and Control box labels as the initial algorithm] [Denny et al., 2012]
New Hypothyroidism Algorithm: Validation
Positive Predictive Values (PPV) Based on Chart Review – All Sites [Denny et al., 2012]

Site            EHR-based         Sampled for chart review   Old case   New case
                cases/controls    cases/controls             PPV (%)    PPV (%)
Group Health    430/1,188         50/50                      92         98
Marshfield      509/1,193         50/50                      88         91
Mayo Clinic     250/2,145         100/100                    76         97
Northwestern    103/516           50/50                      88         98
Vanderbilt      184/1,344         50/50                      90         98
All sites       1,421/6,362       -                          87         96
Data Categories used to define the EHR-driven Phenotyping Algorithms [eMERGE Network]
Phenotype | Clinical gold standard | EHR-derived phenotype | Phenotype definitions | Validation (PPV/NPV)
Alzheimer’s Dementia | Demographics, clinical examination of mental status, histopathologic examination | Diagnoses, medications | Demographics, laboratory tests, radiology reports | 73%
Cataracts | Clinical exam finding (ophthalmologic examination) | Diagnoses, procedure codes | Demographics, medications | 98%/98%
Peripheral Arterial Disease | Radiology test results (ankle-brachial index or arteriography) | Diagnoses, procedure codes, medications, radiology test results | Demographics | 94%/99%
Type 2 Diabetes | Laboratory tests | Diagnoses, laboratory tests, medications | Demographics, height, weight, family history | 98%/100%
Cardiac Conduction | ECG measurements | ECG report results | Demographics, diagnoses, procedure codes, medications, laboratory tests | 97%
Genotype-Phenotype Association Results
[Figure: plot comparing observed and published odds ratios for each disease, gene/region, and marker; axis: Odds Ratio (ticks at 0.5, 1.0, 2.0, 5.0)] [Ritchie et al. 2010]
Diseases: Atrial fibrillation, Crohn's disease, Multiple sclerosis, Rheumatoid arthritis, Type 2 diabetes
Markers (gene / region): rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25), rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22), rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA), rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22), rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2)
Key lessons learned from eMERGE
• Algorithm design and transportability
• Non-trivial; requires significant expert involvement
• Highly iterative process
• Time-consuming manual chart reviews
• Representation of “phenotype logic” for transportability is critical
• Standardized data access and representation
• Importance of unified vocabularies, data elements, and value sets
• Questionable reliability of ICD & CPT codes (e.g., billing the wrong code since it is easier to find)
• Natural Language Processing (NLP) needs
Algorithm Development Process - Modified
[Figure: workflow diagram with the labels Data, Transform, Phenotype Algorithm, Visualization, Evaluation, NLP/SQL, Rules, Mappings, and Semi-Automatic Execution] [eMERGE Network]
Strategic Health IT Advanced Research Projects (SHARPn): Secondary Uses of EHR Data [Chute et al. 2011]
• Mission: To enable the use of EHR data for secondary purposes, such as clinical research and public health. Leveraging clinical and health informatics to:
• generate new knowledge
• improve care
• address population needs
http://sharpn.org
SHARPn: Secondary Use of EHR Data
A $15M National Consortium
• Harvard University
• Intermountain Healthcare
• Mayo Clinic
• Mirth Corporation, Inc.
• MIT
• MITRE Corp.
• Regenstrief Institute, Inc.
• SUNY, Buffalo
• University of Colorado
• Agilex Technologies
• CDISC (Clinical Data Interchange Standards Consortium)
• Centerphase Solutions
• Deloitte
• Group Health, Seattle
• IBM Watson Research Labs
• University of Utah
• University of Pittsburgh
Cross-integrated suite of projects
Algorithm Development Process - Modified
[Figure: workflow diagram with the labels Data, Transform, Phenotype Algorithm, Visualization, Evaluation, NLP/SQL, Rules, Mappings, and Semi-Automatic Execution]
• Standardized representation of clinical data
• Create new and re-use existing clinical element models (CEMs)
• Standardized and structured representation of phenotype definition criteria
• Use the NQF Quality Data Model (QDM)
• Conversion of structured phenotype criteria into executable queries
• Use JBoss® Drools (DRLs)
[Welch et al. 2012][Thompson et al., submitted 2012]
[Li et al., submitted 2012]
The SHARPn “phenotyping funnel”
[Figure: funnel with the labels Intermountain EHR, Mayo Clinic EHR, CEMs, QDMs, DRLs, and Phenotype specific patient cohorts]
[Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]
Clinical data normalization
• Data Normalization
• Clinical data comes in all different forms even for the same kind of information
• Comparable and consistent data is foundational to secondary use
• Clinical Element Models (CEMs)
• Basis for retaining computable meaning when data is exchanged between heterogeneous computer systems
• Basis for shared computable meaning when clinical data is referenced in decision support logic
Clinical Element Models
Higher-Order Structured Representations
[Stan Huff, IHC]
Pre- and Post-Coordination
[Stan Huff, IHC]
[Figure-only slide; content not recovered] [Stan Huff, IHC]
Data element harmonization
• Stan Huff (Intermountain Healthcare)
• Clinical Information Modeling Initiative (CIMI)
• NHS Clinical Statement
• CEN TC251/OpenEHR Archetypes
• HL7 Templates
• ISO TC215 Detailed Clinical Models
• CDISC Common Clinical Elements
• Intermountain/GE CEMs
SHARPn data normalization flow - I
SHARPn data normalization flow - II
CEM MySQL database with normalized patient information
[Welch et al. 2012]
Algorithm Development Process - Modified
[Figure: workflow diagram with the labels Data, Transform, Phenotype Algorithm, Visualization, Evaluation, NLP/SQL, Rules, Mappings, and Semi-Automatic Execution]
• Standardized representation of clinical data
• Create new and re-use existing clinical element models (CEMs)
• Standardized and structured representation of phenotype definition criteria
• Use the NQF Quality Data Model (QDM)
[Welch et al. 2012][Thompson et al., submitted 2012]
[Li et al., submitted 2012]
NQF Quality Data Model (QDM) - I
• Standard of the National Quality Forum (NQF)
• A standard structure and grammar to represent quality measures precisely and accurately in a standardized format that can be used across electronic patient care systems
• First (and only) standard for “eMeasures”
• “All patients 65 years of age or older with at least two provider visits during the measurement period receiving influenza vaccine subcutaneously”
• Implemented as set of XML schemas
• Links to standard terminologies (ICD-9, ICD-10, SNOMED-CT, CPT-4, LOINC, RxNorm etc.)
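To make the quoted eMeasure concrete, here is a hedged sketch of its logic as a Python predicate. It is not QDM or its XML representation. The function names are hypothetical, and two interpretation choices are assumptions: age is evaluated at the start of the measurement period, and the vaccine must be given within the period (route of administration is not modeled).

```python
from datetime import date


def age_on(birth_date, on_date):
    """Whole years of age on a given date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def meets_measure(birth_date, visit_dates, vaccine_dates, period_start, period_end):
    """65 or older, at least two visits in the period, and a vaccine in the period."""
    def in_period(d):
        return period_start <= d <= period_end

    visits_in_period = sum(1 for d in visit_dates if in_period(d))
    vaccinated = any(in_period(d) for d in vaccine_dates)
    # Assumption: age is checked at the start of the measurement period.
    return age_on(birth_date, period_start) >= 65 and visits_in_period >= 2 and vaccinated


start, end = date(2011, 1, 1), date(2011, 12, 31)
older_patient = meets_measure(
    date(1940, 6, 15),
    [date(2011, 2, 1), date(2011, 9, 10)],
    [date(2011, 10, 5)],
    start,
    end,
)
younger_patient = meets_measure(
    date(1950, 3, 1),
    [date(2011, 2, 1), date(2011, 9, 10)],
    [date(2011, 10, 5)],
    start,
    end,
)
print(older_patient, younger_patient)
```

Each clause of the sentence maps to one condition, which is the kind of structure a standard grammar needs to capture unambiguously.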
NQF Quality Data Model (QDM) - II
• Supports temporality & sequences
• AND: "Procedure, Performed: eye exam" > 1 year(s) starts before or during "Measurement end date"
• Groups of codes in a code set (ICD9, etc.)
• Can group groups
• Represented by OIDs, requires lookup
• "Diagnosis, Active: steroid induced diabetes" using "steroid induced diabetes Value Set GROUPING (2.16.840.1.113883.3.464.0001.113)"
• Focus on structured data
• Would require extensions for NLP
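The value-set points above (groups of codes, groups of groups, lookup by OID) can be illustrated with a small sketch. The registry layout, the identifiers, and the codes are all placeholders invented for this example; they are not real OIDs or real terminology codes, and this is not how any particular value-set service is implemented.

```python
# Hypothetical registry: identifier -> directly listed codes plus member value sets.
VALUE_SETS = {
    "oid.example.icd9": {"codes": {"ICD9:000.1", "ICD9:000.2"}, "members": []},
    "oid.example.snomed": {"codes": {"SNOMED:0000001"}, "members": []},
    "oid.example.grouping": {
        "codes": set(),
        "members": ["oid.example.icd9", "oid.example.snomed"],
    },
}


def expand(oid, registry, _seen=None):
    """Resolve a value set to its flat set of codes, following nested groupings."""
    seen = set() if _seen is None else _seen
    if oid in seen:  # guard against groupings that reference each other
        return set()
    seen.add(oid)
    entry = registry[oid]
    codes = set(entry["codes"])
    for member in entry["members"]:
        codes |= expand(member, registry, seen)
    return codes


grouping_codes = expand("oid.example.grouping", VALUE_SETS)
print(sorted(grouping_codes))
```

A criterion such as an active diagnosis can then be tested by checking whether any of a patient's codes falls in the expanded set.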
116 Meaningful Use Phase I Quality Measures
Example: Diabetes & Lipid Mgmt. - I
Example: Diabetes & Lipid Mgmt. - II
NQF Measure Authoring Tool (MAT)
Our task: from human readable to machine computable
[Thompson et al., submitted 2012]
Algorithm Development Process - Modified
[Figure: workflow diagram with the labels Data, Transform, Phenotype Algorithm, Visualization, Evaluation, NLP/SQL, Rules, Mappings, and Semi-Automatic Execution]
• Standardized representation of clinical data
• Create new and re-use existing clinical element models (CEMs)
• Standardized and structured representation of phenotype definition criteria
• Use the NQF Quality Data Model (QDM)
• Conversion of structured phenotype criteria into executable queries
• Use JBoss® Drools (DRLs)
[Welch et al. 2012][Thompson et al., submitted 2012]
[Li et al., submitted 2012]
JBoss® open-source Drools environment
• Represents knowledge with declarative production rules
• Origins in artificial intelligence expert systems
• Simple when <pattern> then <action> rules specified in text files
• Separation of data and logic into separate components
• Forward chaining inference model (Rete algorithm)
• Domain specific languages (DSL)
Drools inference architecture
Inference Execution Model
1. Define a Knowledge Base
• Compiled Rules
• Produces Production Memory
2. Extract Knowledge Session from Knowledge Base
3. Insert Facts (data) into Knowledge Session “Agenda”
4. Fire Rules (Race Conditions/Infinite Loops)
5. Retrieve End Results
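The execution steps above can be mimicked with a toy forward-chaining loop. This sketch is plain Python, not the Drools API, and it uses naive repeated matching rather than the Rete algorithm. The `Rule` and `Session` classes, the fact tuples, and the thresholds are all hypothetical. The `max_cycles` guard stands in for the infinite-loop concern noted under "Fire Rules".

```python
class Rule:
    """A named when/then pair: `when` tests a fact, `then` returns a derived fact or None."""

    def __init__(self, name, when, then):
        self.name = name
        self.when = when
        self.then = then


class Session:
    """Holds facts and fires rules until no rule produces anything new."""

    def __init__(self, rules):
        self.rules = rules          # step 1: the "knowledge base"
        self.facts = set()

    def insert(self, fact):         # step 3: insert facts
        self.facts.add(fact)

    def fire_all_rules(self, max_cycles=100):   # step 4: fire rules
        fired = set()               # (rule name, fact) pairs that already fired
        for _ in range(max_cycles):
            progress = False
            for rule in self.rules:
                for fact in list(self.facts):
                    key = (rule.name, fact)
                    if key in fired or not rule.when(fact):
                        continue
                    fired.add(key)
                    derived = rule.then(fact)
                    if derived is not None:
                        self.facts.add(derived)
                    progress = True
            if not progress:
                return
        raise RuntimeError("rules did not reach a fixed point")


rules = [
    Rule(
        "low glucose",
        when=lambda f: f[0] == "glucose" and f[1] <= 40,
        then=lambda f: ("alert", "hypoglycemia"),
    ),
    Rule(
        "escalate alert",
        when=lambda f: f == ("alert", "hypoglycemia"),
        then=lambda f: ("action", "notify clinician"),
    ),
]
session = Session(rules)            # step 2: obtain a session
session.insert(("glucose", 38))
session.fire_all_rules()
results = session.facts             # step 5: retrieve end results
print(sorted(results, key=str))
```

The second rule fires only because the first rule derived a new fact, which is the essence of forward chaining.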
Example Drools rule
rule "Glucose <= 40, Insulin On"    // {Rule Name}
when
    // $msg is the {binding}; GlucoseMsg is a {Java Class};
    // glucoseFinding and currentInsulinDrip are read through {Class Getter Method}s
    $msg : GlucoseMsg(glucoseFinding <= 40, currentInsulinDrip > 0)
then
    // glucoseProtocolResult is the Parameter {Java Class}; setInstruction is the {Class Setter Method}
    glucoseProtocolResult.setInstruction(GlucoseInstructions.GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG);
end
The “obvious” slide - T2DM Drools flow
Automatic translation from NQF QDM criteria to Drools
[Figure: “From non-executable to executable” pipeline between the Measure Authoring Toolkit and the Drools Engine. Inputs: data types (XML-based structured representation), value sets (saved in XLS files), and measures (XML-based structured representation). Steps: mapping data types and value sets to fact models; converting measures to Drools scripts.]
[Li et al., submitted 2012]
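As a rough illustration of turning a structured criterion into rule text, the sketch below renders a Drools-style rule string from a few fields. It is not the translator described on the slide: the `to_drl` helper and its input shape are assumptions, the real input would be QDM XML rather than Python tuples, and the fact type and action are borrowed from the example rule shown earlier in the deck.

```python
def to_drl(rule_name, fact_type, constraints, action):
    """Render a single when/then rule as text from (field, operator, value) constraints."""
    conditions = ", ".join(f"{name} {op} {value}" for name, op, value in constraints)
    return "\n".join(
        [
            f'rule "{rule_name}"',
            "when",
            f"    $fact : {fact_type}({conditions})",
            "then",
            f"    {action}",
            "end",
        ]
    )


drl_text = to_drl(
    rule_name="Glucose <= 40, Insulin On",
    fact_type="GlucoseMsg",
    constraints=[("glucoseFinding", "<=", 40), ("currentInsulinDrip", ">", 0)],
    action=(
        "glucoseProtocolResult.setInstruction("
        "GlucoseInstructions.GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG);"
    ),
)
print(drl_text)
```

A real translator would also need to resolve value sets and map QDM data types onto the fact model before emitting rules.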
SHARPn phenotyping architecture using CEMs, QDMs, and DRLs
[Welch et al. 2012]
The SHARPn “phenotyping funnel”
[Figure: funnel with the labels Intermountain EHR, Mayo Clinic EHR, CEMs, QDMs, DRLs, and Phenotype specific patient cohorts]
[Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]
Phenotype library and workbench - I
http://phenotypeportal.org
Phenotype library and workbench - I
1. Converts QDM to Drools
2. Rule execution by querying the CEM database
3. Generate summary reports
http://phenotypeportal.org
Phenotype library and workbench - II
http://phenotypeportal.org
Phenotype library and workbench - III
http://phenotypeportal.org
Additional on-going research efforts
• Machine learning and association rule mining
• Manual creation of algorithms takes time
• Let computers do the “hard work”
• Validate against expert-developed ones
[Carroll et al. 2011]
Additional on-going research efforts
• Machine learning and association rule mining
• Manual creation of algorithms takes time
• Let computers do the “hard work”
• Validate against expert-developed ones
• Just-in-time phenotyping
• Current approach: retrospective, longitudinal and offline data processing for phenotypes
• Future: online, real-time phenotyping by implementing “phenotype sniffers”
• Applications in active syndrome surveillance for transfusion medicine [Kor et al. 2012]
What does this R&D mean to HSR?
• Common, agreed-upon and well-validated phenotype definitions and criteria
• Standardized clinical data retrieval and management
• “One-stop place” for visualization, execution, and report generation of phenotyping algorithms
• Implications for (to name a few):
• Center for Science of Healthcare Delivery (SHCD)
• Data Management Services (DMS/BSI)
• Epidemiology & Health Care and Policy Research (Epi./HCPR/Rochester Epi. Project)
• Mayo Clinic Biobank/Genome Consortia (MayoGC)
Summary
• EHRs contain a wealth of phenotypes for clinical and translational research
• EHRs represent real-world data, and hence have challenges with interpretation, wrong diagnoses, and compliance with medications
• Handling referral patients even more so
• Standardization and normalization of clinical data and phenotype definitions is critical
• Phenotyping algorithms are often transportable between multiple EHR settings
• Validation is an important component
Acknowledgements: eMERGE collaborators
• NHGRI: Rongling Li, Teri Manolio
• Group Health: Eric Larson, Gail Jarvik, Chris Carlson, Wylie Burke, Gene Jart, David Carrell, Malia Fullerton, Walter Kukull, Paul Crane, Noah Weston
• Marshfield: Cathy McCarty, Peggy Peissig, Marilyn Ritchie, Russ Wilke
• Northwestern: Rex Chisholm, Bill Lowe, Phil Greenland, Luke Rassmussen, Justin Starren, Maureen Smith, Jen Allen-Pacheco, Will Thompson
• Mayo Clinic: Christopher G. Chute, Iftikhar J. Kullo, Suzette Bielinski, Mariza de Andrade, John Heit, Jyoti Pathak, Matt Durski, Sean Murphy, Kevin Bruce
• Vanderbilt: Dan Roden, Josh Denny, Brad Malin, Ellen Wright Clayton, Dana Crawford, Melissa Basford
Acknowledgement: SHARPn collaborators
• Harvard University
• Intermountain Healthcare
• Mayo Clinic
• Mirth Corporation, Inc.
• MIT
• MITRE Corp.
• Regenstrief Institute, Inc.
• SUNY, Buffalo
• University of Colorado
• Agilex Technologies
• CDISC (Clinical Data Interchange Standards Consortium)
• Centerphase Solutions
• Deloitte
• Group Health, Seattle
• IBM Watson Research Labs
• University of Utah
• University of Pittsburgh
Thank You!
http://jyotishman.info