High-Throughput Phenotyping and Cohort Identification from Electronic Health Records for Clinical and Translational Research

Jyotishman Pathak, PhD
Assistant Professor of Biomedical Informatics

Health Sciences Research Grand Rounds
April 23, 2012

High-Throughput Phenotyping from EHRs

Background – The Problem

• Patient recruitment is a major bottleneck in conducting successful clinical research studies
  • 50% of time is spent in recruitment
  • Low participation rates (~5%); studies are underpowered
• Clinicians lack resources to help patients find appropriate studies and trials
• Patients have difficulty finding appropriate studies that are locally available

©2012 MFMER | slide-2


Background – Use Cases

• Large-scale genomics research
  • Linking biospecimens and genetic data to personal health data via biorepositories
  • Need large sample sizes for study design
• Population-based epidemiological studies in understanding disease etiology
  • Often limited in scope or population diversity
• Quality metrics and HITECH Act
  • Pay-for-Performance and quality-based incentives
  • Population management and cohort identification are non-trivial



Electronic health records (EHRs) driven phenotyping – The Proposed Solution

• EHRs are becoming increasingly prevalent within the U.S. healthcare system
  • Meaningful Use is one of the major drivers
• Overarching goal
  • To develop techniques and algorithms that operate on normalized EHR data to identify cohorts of potentially eligible subjects on the basis of disease, symptoms, or related findings



Advantages: EHR-derived phenotyping

• There is a LOT of information about subjects
  • Demographics, labs, meds, procedures, clinical notes…
  • Identification of otherwise latent population differences
• Minimal costs for case ascertainment, no study-specific recruitment
• Records are “retrospectively longitudinal”
• Records are real world and contain many different phenotypes
• Transportability and reuse of phenotype definitions across EHR-enabled sites = power for clinical and research studies



Challenges: EHR-derived phenotyping

• There is a LOT of information about subjects…
  • Non-standardized, heterogeneous, unstructured data (compared to protocol-based structured data collection)
• Measured (e.g., demographics) vs. un-measured (e.g., socio-economic status) population differences
• Hospital specialization and coding practices
• Population/regional market landscape



The challenges can be addressed…if we

• Develop techniques for standardization and normalization of clinical data and phenotypes
• Develop techniques for transforming and managing unstructured clinical text into structured representations
• Develop techniques for transportability of EHR-driven phenotyping algorithms
• Develop a scalable, robust and flexible framework for demonstrating all of the above in a “real-world setting”



http://gwas.org



• Funded by the NHGRI/NIGMS
• Goal: to assess utility of EHRs as resources for genome science
• Each site includes a biorepository linked to EHRs
• Each project includes informatics, biostatistics, community engagement, ELSI, genetics experts
• Initial proposals included identifying a primary phenotype of interest in 3,000 subjects and conduct of a genome-wide association study at each center: Σ=18,000
• eMERGE Phase II has a target of developing ~40 phenotype algorithms by the end of 2014
• Algorithm transportability is an integral component



EHR-based Phenotyping Algorithms

• Typical components
  • Billing and diagnosis codes
  • Procedure codes
  • Labs
  • Medications
  • Phenotype-specific covariates (e.g., demographics, vitals, smoking status, CASI scores)
  • Pathology
  • Imaging?
• Organized into inclusion and exclusion criteria
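A minimal sketch of how the components above could be organized into inclusion and exclusion criteria. The `Patient` fields, the `PhenotypeAlgorithm` class, and all codes and drug names are hypothetical placeholders for illustration, not a validated phenotype definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Patient:
    """Hypothetical, minimal patient record; field names are illustrative."""
    patient_id: str
    diagnosis_codes: Set[str] = field(default_factory=set)
    medications: Set[str] = field(default_factory=set)
    abnormal_labs: Set[str] = field(default_factory=set)


Criterion = Callable[[Patient], bool]


@dataclass
class PhenotypeAlgorithm:
    name: str
    inclusion: List[Criterion]
    exclusion: List[Criterion]

    def matches(self, patient: Patient) -> bool:
        # A patient qualifies when every inclusion criterion holds
        # and no exclusion criterion holds.
        return (all(rule(patient) for rule in self.inclusion)
                and not any(rule(patient) for rule in self.exclusion))


# Placeholder codes and drug names (assumptions, not real value sets).
example = PhenotypeAlgorithm(
    name="example-phenotype",
    inclusion=[
        lambda p: "DX-TARGET" in p.diagnosis_codes,
        lambda p: "TARGET-LAB" in p.abnormal_labs,
    ],
    exclusion=[
        lambda p: "CONFOUNDING-DRUG" in p.medications,
    ],
)

patients = [
    Patient("p1", {"DX-TARGET"}, set(), {"TARGET-LAB"}),
    Patient("p2", {"DX-TARGET"}, {"CONFOUNDING-DRUG"}, {"TARGET-LAB"}),
    Patient("p3", set(), set(), {"TARGET-LAB"}),
]
cohort = [p.patient_id for p in patients if example.matches(p)]
print(cohort)
```

Keeping each criterion as a separate predicate makes it easier to refine one criterion at a time during chart review.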



EHR-based Phenotyping Algorithms

• Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95%

• For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
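As a worked illustration of the targets above, the sketch below computes PPV and NPV from chart-review counts. The counts are hypothetical and the helper names (`ppv`, `npv`, `meets_targets`) are assumptions introduced for this example.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: confirmed cases / all algorithm-flagged cases reviewed."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no algorithm-positive charts were reviewed")
    return true_positives / reviewed


def npv(true_negatives: int, false_negatives: int) -> float:
    """Negative predictive value: confirmed controls / all algorithm-flagged controls reviewed."""
    reviewed = true_negatives + false_negatives
    if reviewed == 0:
        raise ValueError("no algorithm-negative charts were reviewed")
    return true_negatives / reviewed


def meets_targets(case_ppv: float, control_npv: float,
                  ppv_target: float = 0.95, npv_target: float = 0.98) -> bool:
    # Targets follow the approximate thresholds on the slide (PPV >= 95%, NPV >= 98%).
    return case_ppv >= ppv_target and control_npv >= npv_target


# Hypothetical chart review: 50 sampled cases (48 confirmed),
# 50 sampled controls (49 confirmed).
case_ppv = ppv(true_positives=48, false_positives=2)
control_npv = npv(true_negatives=49, false_negatives=1)
print(f"PPV={case_ppv:.2f} NPV={control_npv:.2f} ok={meets_targets(case_ppv, control_npv)}")
```

If either value falls below its target, the case or control definition goes back for another round of refinement and review.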



Algorithm Development Process [eMERGE Network]

[Figure: workflow diagram. Labels: Data, Transform, Phenotype Algorithm, Visualization, Evaluation; methods: NLP, SQL, Rules, Mappings]



Hypothyroidism: Initial Algorithm [Denny et al., 2012]

[Flowchart; box labels as extracted]
• No secondary causes (e.g., pregnancy, ablation)
• No ICD-9s for hypothyroidism
• No abnormal TSH/FT4
• No antibodies for TTG/TPO
• ICD-9s for hypothyroidism
• Antibodies for TTG or TPO (anti-thyroglobulin, anti-thyroperoxidase)
• Abnormal TSH/FT4
• No thyroid-altering medications (e.g., phenytoin, lithium)
• Thyroid replacement meds
• No thyroid replacement meds
• 2+ non-acute visits in 3 yrs
• No hx of myasthenia gravis
• Outcomes: Case 1, Case 2, Control


Hypothyroidism: Initial Algorithm [Conway et al. 2011]


Hypothyroidism: Algorithm Refinement

[Flowchart; box labels as extracted]
• No secondary causes (e.g., pregnancy, ablation)
• No ICD-9s for hypothyroidism
• No abnormal TSH/FT4
• No antibodies for TTG/TPO
• ICD-9s for hypothyroidism
• Antibodies for TTG or TPO (anti-thyroglobulin, anti-thyroperoxidase)
• Abnormal TSH/FT4
• No thyroid-altering medications (e.g., phenytoin, lithium)
• Thyroid replacement meds
• No thyroid replacement meds
• 2+ non-acute visits in 3 yrs
• No hx of myasthenia gravis
• Outcomes: Case 1, Case 2, Control




New Hypothyroidism Algorithm: Validation
Positive Predictive Values (PPV) Based on Chart Review – All Sites

Site          EHR-based        Sampled for Chart Review   Old Case   New Case
              Cases/Controls   Cases/Controls             PPV (%)    PPV (%)
Group Health  430/1,188        50/50                      92         98
Marshfield    509/1,193        50/50                      88         91
Mayo Clinic   250/2,145        100/100                    76         97
Northwestern  103/516          50/50                      88         98
Vanderbilt    184/1,344        50/50                      90         98
All sites     1,421/6,362      -                          87         96



Data Categories used to define the EHR-driven Phenotyping Algorithms [eMERGE Network]

Alzheimer’s Dementia
  Clinical gold standard: Demographics, clinical examination of mental status, histopathologic examination
  EHR-derived phenotype: Diagnoses, medications
  Phenotype definitions: Demographics, laboratory tests, radiology reports
  Validation (PPV/NPV): 73%

Cataracts
  Clinical gold standard: Clinical exam finding (ophthalmologic examination)
  EHR-derived phenotype: Diagnoses, procedure codes
  Phenotype definitions: Demographics, medications
  Validation (PPV/NPV): 98%/98%

Peripheral Arterial Disease
  Clinical gold standard: Radiology test results (ankle-brachial index or arteriography)
  EHR-derived phenotype: Diagnoses, procedure codes, medications, radiology test results
  Phenotype definitions: Demographics
  Validation (PPV/NPV): 94%/99%

Type 2 Diabetes
  Clinical gold standard: Laboratory tests
  EHR-derived phenotype: Diagnoses, laboratory tests, medications
  Phenotype definitions: Demographics, height, weight, family history
  Validation (PPV/NPV): 98%/100%

Cardiac Conduction
  Clinical gold standard: ECG measurements
  EHR-derived phenotype: ECG report results
  Phenotype definitions: Demographics, diagnoses, procedure codes, medications, laboratory tests
  Validation (PPV/NPV): 97%

The Linked Clinical Data (LCD) Project

Genotype-Phenotype Association Results [Ritchie et al. 2010]

[Figure: forest plot of odds ratios (axis from 0.5 to 5.0), observed vs. published, by disease and marker (gene/region)]

Diseases: Atrial fibrillation, Crohn's disease, Multiple sclerosis, Rheumatoid arthritis, Type 2 diabetes

Markers (gene/region), in extracted order: rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25), rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22), rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA), rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22), rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2)



Key lessons learned from eMERGE

• Algorithm design and transportability
  • Non-trivial; requires significant expert involvement
  • Highly iterative process
  • Time-consuming manual chart reviews
  • Representation of “phenotype logic” for transportability is critical
• Standardized data access and representation
  • Importance of unified vocabularies, data elements, and value sets
  • Questionable reliability of ICD & CPT codes (e.g., billing the wrong code since it is easier to find)
  • Natural Language Processing (NLP) needs




Algorithm Development Process - Modified [eMERGE Network]

[Figure: workflow diagram. Labels: Phenotype Algorithm, Visualization, Evaluation, Semi-Automatic Execution; methods: NLP, SQL, Rules, Mappings]



Strategic Health IT Advanced Research Projects (SHARPn): Secondary Uses of EHR Data [Chute et al. 2011]

http://sharpn.org

• Mission: To enable the use of EHR data for secondary purposes, such as clinical research and public health. Leveraging clinical and health informatics to:
  • generate new knowledge
  • improve care
  • address population needs


SHARPn: Secondary Use of EHR Data
A $15M National Consortium

• Harvard University
• Intermountain Healthcare
• Mayo Clinic
• Mirth Corporation, Inc.
• MIT
• MITRE Corp.
• Regenstrief Institute, Inc.
• SUNY, Buffalo
• University of Colorado
• Agilex Technologies
• CDISC (Clinical Data Interchange Standards Consortium)
• Centerphase Solutions
• Deloitte
• Group Health, Seattle
• IBM Watson Research Labs
• University of Utah
• University of Pittsburgh



Cross-integrated suite of projects








• Standardized representation of clinical data
  • Create new and re-use existing clinical element models (CEMs)
• Standardized and structured representation of phenotype definition criteria
  • Use the NQF Quality Data Model (QDM)
• Conversion of structured phenotype criteria into executable queries
  • Use JBoss® Drools (DRLs)

[Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]


The SHARPn “phenotyping funnel”

[Figure: funnel diagram. Layers as extracted: phenotype-specific patient cohorts, DRLs, QDMs, CEMs; data sources: Intermountain EHR, Mayo Clinic EHR]


Clinical data normalization

• Data Normalization
  • Clinical data comes in many different forms, even for the same kind of information
  • Comparable and consistent data is foundational to secondary use
• Clinical Element Models (CEMs)
  • Basis for retaining computable meaning when data is exchanged between heterogeneous computer systems
  • Basis for shared computable meaning when clinical data is referenced in decision support logic
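To make the normalization idea concrete, here is a minimal sketch in which two sites record the same glucose result in different shapes and units, and both are mapped to one common structure. The record layouts, field names, and the `NormalizedLab` class are hypothetical; this is not the CEM specification, only an illustration of "comparable and consistent data".

```python
from dataclasses import dataclass

# Conversion factor derived from the molar mass of glucose (about 180.16 g/mol).
MG_DL_PER_MMOL_L_GLUCOSE = 18.016


@dataclass(frozen=True)
class NormalizedLab:
    """Hypothetical common model; a real CEM carries far more structure."""
    patient_id: str
    concept: str
    value: float
    unit: str


def from_site_a(record: dict) -> NormalizedLab:
    # Site A (assumed layout): local test abbreviations, values as strings, SI units.
    local_to_concept = {"GLU": "glucose"}
    value = float(record["val"])
    if record["unit"] == "mmol/L":
        value *= MG_DL_PER_MMOL_L_GLUCOSE
    elif record["unit"] != "mg/dL":
        raise ValueError(f"unsupported unit: {record['unit']}")
    return NormalizedLab(record["pid"], local_to_concept[record["test"]],
                         round(value, 1), "mg/dL")


def from_site_b(record: dict) -> NormalizedLab:
    # Site B (assumed layout): full test names, numeric values already in mg/dL.
    return NormalizedLab(record["patient"], record["lab_name"].lower(),
                         float(record["result_mg_dl"]), "mg/dL")


a = from_site_a({"pid": "A-1", "test": "GLU", "val": "5.5", "unit": "mmol/L"})
b = from_site_b({"patient": "B-7", "lab_name": "Glucose", "result_mg_dl": 99})
print(a)
print(b)
```

Once both records share a concept name and unit, the same phenotype criterion can be evaluated against either site's data.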



Clinical Element Models
Higher-Order Structured Representations


[Stan Huff, IHC]


Pre- and Post-Coordination


[Stan Huff, IHC]



Data element harmonization

• Stan Huff (Intermountain Healthcare)
• Clinical Information Modeling Initiative (CIMI)
  • NHS Clinical Statement
  • CEN TC251/OpenEHR Archetypes
  • HL7 Templates
  • ISO TC215 Detailed Clinical Models
  • CDISC Common Clinical Elements
  • Intermountain/GE CEMs



SHARPn data normalization flow - I



SHARPn data normalization flow - II

CEM MySQL database with normalized patient information

[Welch et al. 2012]














NQF Quality Data Model (QDM) - I

• Standard of the National Quality Forum (NQF)
  • A standard structure and grammar to represent quality measures precisely and accurately in a standardized format that can be used across electronic patient care systems
• First (and only) standard for “eMeasures”
  • “All patients 65 years of age or older with at least two provider visits during the measurement period receiving influenza vaccine subcutaneously”
• Implemented as a set of XML schemas
• Links to standard terminologies (ICD-9, ICD-10, SNOMED-CT, CPT-4, LOINC, RxNorm, etc.)



NQF Quality Data Model (QDM) - II

• Supports temporality & sequences
  • AND: "Procedure, Performed: eye exam" > 1 year(s) starts before or during "Measurement end date"
• Groups of codes in a code set (ICD-9, etc.)
  • Can group groups
  • Represented by OIDs, requires lookup
  • "Diagnosis, Active: steroid induced diabetes" using "steroid induced diabetes Value Set GROUPING (2.16.840.1.113883.3.464.0001.113)"
• Focus on structured data
  • Would require extensions for NLP
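The two QDM features above (value sets looked up by OID, and temporal operators) can be sketched as follows. The OID is the one quoted on the slide; the member codes are placeholders, and the reading of "> 1 year(s) starts before or during" as "started more than one year before the reference date" is an assumption made for this illustration, not the normative QDM semantics.

```python
from datetime import date
from typing import Optional

# OID taken from the slide; the member codes are placeholders, not the real value set.
VALUE_SETS = {
    "2.16.840.1.113883.3.464.0001.113": {
        "name": "steroid induced diabetes",
        "codes": {"PLACEHOLDER-1", "PLACEHOLDER-2"},
    },
}


def in_value_set(code: str, oid: str) -> bool:
    """Resolve the OID to its value set, then test membership."""
    return code in VALUE_SETS[oid]["codes"]


def starts_before_or_during(event_start: date, reference_end: date,
                            more_than_years: Optional[float] = None) -> bool:
    """True when the event starts on or before the reference date and, if an
    offset is given, started more than that many years earlier (assumed reading)."""
    if event_start > reference_end:
        return False
    if more_than_years is None:
        return True
    elapsed_years = (reference_end - event_start).days / 365.25
    return elapsed_years > more_than_years


measurement_end = date(2011, 12, 31)
print(in_value_set("PLACEHOLDER-1", "2.16.840.1.113883.3.464.0001.113"))
print(starts_before_or_during(date(2010, 6, 1), measurement_end, more_than_years=1))
print(starts_before_or_during(date(2011, 6, 1), measurement_end, more_than_years=1))
```

The OID indirection is what "requires lookup" refers to: the criterion names a value set, and the codes are resolved separately at execution time.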



116 Meaningful Use Phase I Quality Measures


Example: Diabetes & Lipid Mgmt. - I



Example: Diabetes & Lipid Mgmt. - II



NQF Measure Authoring Tool (MAT)



Our task: from human-readable to machine-computable


[Thompson et al., submitted 2012]











• Conversion of structured phenotype criteria into executable queries

• Use JBoss® Drools (DRLs)




JBoss® open-source Drools environment

• Represents knowledge with declarative production rules
  • Origins in artificial intelligence expert systems
  • Simple when <pattern> then <action> rules specified in text files
• Separation of data and logic into separate components
  • Forward chaining inference model (Rete algorithm)
• Domain specific languages (DSL)



Drools inference architecture


Inference Execution Model

1. Define a Knowledge Base
   • Compiled rules
   • Produces production memory
2. Extract a Knowledge Session from the Knowledge Base
3. Insert facts (data) into the Knowledge Session “Agenda”
4. Fire rules (watch for race conditions and infinite loops)
5. Retrieve end results
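The execution steps above can be sketched with a toy forward-chaining engine. This is naive forward chaining written in Python for illustration; it is not the Rete algorithm and not the Drools API. The class names mirror the slide's vocabulary, and the rules and fact labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Rule:
    name: str
    premises: FrozenSet[str]   # "when": facts that must all be present
    conclusion: str            # "then": fact asserted when the rule fires


class Session:
    def __init__(self, rules: Iterable[Rule]):
        self._rules = tuple(rules)
        self.facts: Set[str] = set()
        self.fired: List[str] = []

    def insert(self, *facts: str) -> None:
        self.facts.update(facts)

    def fire_all_rules(self, max_cycles: int = 1000) -> None:
        # Repeat until a full pass derives nothing new; the cycle cap is a
        # simple guard against runaway rule sets.
        for _ in range(max_cycles):
            new_facts: Set[str] = set()
            for rule in self._rules:
                already_known = rule.conclusion in self.facts or rule.conclusion in new_facts
                if rule.premises <= self.facts and not already_known:
                    new_facts.add(rule.conclusion)
                    self.fired.append(rule.name)
            if not new_facts:
                return
            self.facts |= new_facts
        raise RuntimeError("rules did not settle within max_cycles")


class KnowledgeBase:
    def __init__(self, rules: Iterable[Rule]):
        self.rules = tuple(rules)

    def new_session(self) -> Session:
        return Session(self.rules)


# 1. Define a knowledge base (hypothetical rules, listed out of dependency order).
kb = KnowledgeBase([
    Rule("candidate", frozenset({"abnormal_lab", "on_replacement_med"}), "candidate_case"),
    Rule("abnormal-lab", frozenset({"lab_out_of_range"}), "abnormal_lab"),
])
session = kb.new_session()                               # 2. extract a session
session.insert("lab_out_of_range", "on_replacement_med") # 3. insert facts
session.fire_all_rules()                                 # 4. fire rules
print(sorted(session.facts), session.fired)              # 5. retrieve results
```

Because a derived fact can satisfy another rule's premises, the second rule fires only after the first, regardless of the order in which the rules were declared.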


Example Drools rule


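rule "Glucose <= 40, Insulin On"
when
    // Match a glucose message with a low reading while the insulin drip is running
    $msg : GlucoseMsg(glucoseFinding <= 40, currentInsulinDrip > 0)
then
    // Record the protocol instruction for this condition
    glucoseProtocolResult.setInstruction(GlucoseInstructions.GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG);
end

Callouts:
• {Rule Name}: "Glucose <= 40, Insulin On"
• {binding}: $msg
• {Java Class}: GlucoseMsg
• {Class Getter Method}: glucoseFinding, currentInsulinDrip
• {Class Setter Method}: setInstruction
• Parameter {Java Class}: GlucoseInstructions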


The “obvious” slide - T2DM Drools flow



Automatic translation from NQF QDM criteria to Drools

From non-executable to executable

[Figure: translation pipeline from the Measure Authoring Toolkit to the Drools Engine]

• Measure Authoring Toolkit outputs
  • Data types: XML-based structured representation
  • Value sets: saved in XLS files
  • Measures: XML-based structured representation
• Pipeline labels
  • Mapping data types and value sets
  • Fact models
  • Converting measures to Drools scripts
  • Drools scripts
  • Drools Engine
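As a rough illustration of "converting measures to Drools scripts", the sketch below renders one structured criterion as DRL rule text. It is not the SHARPn translator: the `criterion_to_drl` function, the `Patient` fact class, its fields, and the `cohort` variable are hypothetical, and a complete DRL file would also need package, import, and global declarations that are omitted here.

```python
import json
from typing import List, Tuple

Constraint = Tuple[str, str, object]  # (field, operator, value)

ALLOWED_OPERATORS = {"==", "!=", "<", "<=", ">", ">="}


def criterion_to_drl(rule_name: str, fact_class: str,
                     constraints: List[Constraint], action: str) -> str:
    """Render a single-pattern rule as DRL text (illustrative fragment only)."""
    parts = []
    for field_name, operator, value in constraints:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        # json.dumps leaves numbers bare and wraps strings in double quotes.
        parts.append(f"{field_name} {operator} {json.dumps(value)}")
    pattern = f"$p : {fact_class}({', '.join(parts)})"
    return "\n".join([
        f"rule {json.dumps(rule_name)}",
        "when",
        f"    {pattern}",
        "then",
        f"    {action}",
        "end",
    ])


# Criterion loosely based on the eMeasure example quoted earlier in the deck
# (age 65 or older with at least two provider visits).
drl = criterion_to_drl(
    rule_name="Age 65 or older with two visits",
    fact_class="Patient",
    constraints=[("age", ">=", 65), ("visitCount", ">=", 2)],
    action="cohort.add($p);",
)
print(drl)
```

Generating rule text from structured criteria keeps the phenotype definition itself independent of the rules engine that executes it.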




SHARPn phenotyping architecture using CEMs, QDMs, and DRLs


[Welch et al. 2012]




Phenotype library and workbench - I
http://phenotypeportal.org

1. Converts QDM to Drools
2. Executes rules by querying the CEM database
3. Generates summary reports


Phenotype library and workbench - II
http://phenotypeportal.org

Phenotype library and workbench - III
http://phenotypeportal.org


Additional on-going research efforts

• Machine learning and association rule mining [Carroll et al. 2011]
  • Manual creation of algorithms takes time
  • Let computers do the “hard work”
  • Validate against expert-developed ones


Additional on-going research efforts

• Machine learning and association rule mining
  • Manual creation of algorithms takes time
  • Let computers do the “hard work”
  • Validate against expert-developed ones
• Just-in-time phenotyping
  • Current approach: retrospective, longitudinal and offline data processing for phenotypes
  • Future: online, real-time phenotyping by implementing “phenotype sniffers”
  • Applications in active syndrome surveillance for transfusion medicine [Kor et al. 2012]
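One way to picture a “phenotype sniffer” is as a function that consumes clinical events as they arrive and flags a patient the moment all required observations have been seen. The sketch below is only an illustration of the online idea: the `sniff` function, the event format, and the observation labels are hypothetical and are not clinical criteria.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

Event = Tuple[str, str]  # (patient_id, observation label)


def sniff(events: Iterable[Event], required: Set[str]) -> Iterator[str]:
    """Yield each patient ID once, as soon as all required observations are seen."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    flagged: Set[str] = set()
    for patient_id, observation in events:
        seen[patient_id].add(observation)
        if patient_id not in flagged and required <= seen[patient_id]:
            flagged.add(patient_id)
            yield patient_id


# Hypothetical event stream and placeholder observation labels.
stream = [
    ("p1", "observation_a"),
    ("p2", "observation_a"),
    ("p1", "observation_b"),
    ("p1", "observation_b"),
    ("p3", "observation_b"),
]
alerts = list(sniff(stream, required={"observation_a", "observation_b"}))
print(alerts)
```

Unlike the retrospective approach, nothing here waits for a complete record; each event updates the patient's state and may trigger an alert immediately.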



What does this R&D mean to HSR?

• Common, agreed-upon and well-validated phenotype definitions and criteria
• Standardized clinical data retrieval and management
• “One-stop place” for visualization, execution, and report generation of phenotyping algorithms
• Implications for (to name a few):
  • Center for Science of Healthcare Delivery (SHCD)
  • Data Management Services (DMS/BSI)
  • Epidemiology & Health Care and Policy Research (Epi./HCPR/Rochester Epi. Project)
  • Mayo Clinic Biobank/Genome Consortia (MayoGC)



Summary

• EHRs contain a wealth of phenotypes for clinical and translational research
• EHRs represent real-world data, and hence have challenges with interpretation, wrong diagnoses, and compliance with medications
  • Handling referral patients even more so
• Standardization and normalization of clinical data and phenotype definitions is critical
• Phenotyping algorithms are often transportable between multiple EHR settings
  • Validation is an important component



Acknowledgements: eMERGE collaborators

• NHGRI: Rongling Li, Teri Manolio
• Group Health: Eric Larson, Gail Jarvik, Chris Carlson, Wylie Burke, Gene Jart, David Carrell, Malia Fullerton, Walter Kukull, Paul Crane, Noah Weston
• Marshfield: Cathy McCarty, Peggy Peissig, Marilyn Ritchie, Russ Wilke
• Northwestern: Rex Chisholm, Bill Lowe, Phil Greenland, Luke Rasmussen, Justin Starren, Maureen Smith, Jen Allen-Pacheco, Will Thompson
• Mayo Clinic: Christopher G. Chute, Iftikhar J. Kullo, Suzette Bielinski, Mariza de Andrade, John Heit, Jyoti Pathak, Matt Durski, Sean Murphy, Kevin Bruce
• Vanderbilt: Dan Roden, Josh Denny, Brad Malin, Ellen Wright Clayton, Dana Crawford, Melissa Basford


Acknowledgement: SHARPn collaborators

• Harvard University
• Intermountain Healthcare
• Mayo Clinic
• Mirth Corporation, Inc.
• MIT
• MITRE Corp.
• Regenstrief Institute, Inc.
• SUNY, Buffalo
• University of Colorado
• Agilex Technologies
• CDISC (Clinical Data Interchange Standards Consortium)
• Centerphase Solutions
• Deloitte
• Group Health, Seattle
• IBM Watson Research Labs
• University of Utah
• University of Pittsburgh



Thank You!


http://jyotishman.info
