Predicting Hospital Readmission using Cascading
by H. Michael Covert and Victoria Loewengart
September 3, 2015
Proprietary and Confidential
Agenda
• Analytics Inside
• Predictive Analytics Use Case: Hospital Readmissions
• Technical Solution Overview
• Why Cascading and Alternatives Explored
• Implementation Considerations
• Best Practices for Operational Readiness
Analytics Inside
• A Westerville, Ohio-based company founded in late 2011.
• Self-funded, profitable, and growing.
• An outgrowth of business intelligence and advanced analytics consulting (2007), big data consulting (2010), and academic research.
• Our mission is to:
– Be data scientists and develop and deliver innovative advanced technologies to vertical industries.
  • Health Care
  • Cyber-security
  • Intelligence and Law Enforcement
– Engage and focus on endeavors that deliver social value.
– Make some money and have some fun.
– Partner with organizations that have delivery muscle, vertical industry expertise, and share common values.
• We provide products, services, and training.
Analytics Inside
• We are Concurrent business partners
– We deliver Concurrent Cascading training services
– We also have an advanced Cascading course
– We now have 18 Big Data Technology modules that can be assembled into customized training curricula
Training modules
Phases: Planning, Comprehension, Development, Administration
Course Title | Description
Hadoop Planning (Planning) | Steps involved in identifying need, planning for the introduction of Hadoop into your environment, systems integration, and architecture.
Introduction to Linux | A comprehensive review of the Linux operating system.
Introduction to Java | A comprehensive review of the Java technology stack.
Introduction to Hadoop | A detailed overview of the Hadoop technology stack.
Hadoop Software Development | MapReduce programming, NoSQL (HBase) programming, Sqoop.
Hadoop Quality Assurance | Overview of QA processes for Hadoop and how they differ from standard QA.
Introduction to YARN | Development of Hadoop version 2 (YARN) applications.
Introduction to Spark | Development of Spark systems.
Introduction to Storm | Development of Storm systems.
Introduction to Kafka | Development of Kafka message queueing systems.
Natural Language Processing | Overview of computational linguistics.
Introduction to Graph Theory | Overview of using Hadoop and Spark to do advanced graph theory.
Advanced Text Analytics | Text analytics using big data frameworks.
Real time Big Data | Overview of next generation big data solutions that utilize real-time (non-batch) technology.
Cascading | Developer training for the Concurrent Cascading product set.
Advanced Cascading | Developer training for advanced usages of the Cascading framework.
Advanced Analytics | Using Mahout to do predictive analytics.
Hadoop Systems Administration (Administration) | Installing, configuring, and maintaining a Hadoop cluster.
Analytics Inside Solutions
• Our solution family, collectively known as RelMiner™, consists of a set of software components that are designed to be integrated into domain specific products
• Specialization in:
– Big data solutions
  • Hadoop, YARN, Tez, and Spark
  • NoSQL Databases
  • Streaming – Kafka, Storm, and Spark Streaming
– Machine Learning
  • Predictive and Prescriptive Analytics
– Natural language processing (Computational Linguistics, Text Analytics, and Text Mining)
– Graph theory and graph databases
• Cascading is core to all of our development
– We started with 1.3 and have now seamlessly migrated to 3.0 using Tez
Predictive Analytics Use Case: Hospital Readmissions
• A hospital patient readmission is a costly event that health care providers are attempting to reduce.
– A readmission is defined as ANY reentry to a hospital 30 days or less from a prior discharge.
• The US Affordable Care Act mandates lower readmission rates
– If not achieved, providers face fines or reduced government reimbursement.
– Specifically, US Medicare and Medicaid will either not pay or will reduce the payment made to hospitals for expenses incurred when readmissions occur.
– By the end of 2014, over 2,600 hospitals had incurred in excess of $24B in losses from Medicare and Medicaid expenses, a figure expected to rise to $50B by the end of this year.
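The 30-day definition above can be expressed as a small check over admission records. The following is a minimal Python sketch, not MedPredict code: the record layout (patient id, admit date, discharge date) and the function name are assumptions made for illustration.

```python
from datetime import date

# From the slide: a readmission is any re-entry 30 days or less after a prior discharge.
READMISSION_WINDOW_DAYS = 30


def flag_readmissions(stays):
    """Return (patient_id, admit_date) pairs that count as readmissions.

    `stays` is an iterable of (patient_id, admit_date, discharge_date) tuples.
    This record layout is hypothetical.
    """
    readmissions = []
    last_discharge = {}  # patient_id -> discharge date of the previous stay
    # Process each patient's stays in chronological order of admission.
    for patient_id, admitted, discharged in sorted(stays, key=lambda s: (s[0], s[1])):
        previous = last_discharge.get(patient_id)
        if previous is not None and 0 <= (admitted - previous).days <= READMISSION_WINDOW_DAYS:
            readmissions.append((patient_id, admitted))
        last_discharge[patient_id] = discharged
    return readmissions


stays = [
    ("p1", date(2015, 1, 2), date(2015, 1, 6)),
    ("p1", date(2015, 1, 20), date(2015, 1, 22)),  # 14 days after the prior discharge
    ("p2", date(2015, 3, 1), date(2015, 3, 4)),
    ("p2", date(2015, 5, 1), date(2015, 5, 3)),    # 58 days after the prior discharge
]
print(flag_readmissions(stays))
```

Only the second stay of p1 falls inside the window, so it is the only one flagged.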
Predictive Analytics Use Case: Hospital Readmissions
• Predictive analytics is now being used to categorize and prioritize those patients with the highest likelihood of readmission
– Impact is both clinical (better outcome for the patient) AND financial
  • And better financially performing hospitals generally have better outcomes!
• Many such predictive systems exist. One is the LACE score.
– Invented by the Ottawa Hospital Research Institute, Institute for Clinical Evaluative Sciences, University of Toronto, University of Ottawa and University of Calgary
– It is a calculation that predicts the probability of readmission or death following a hospital discharge based on:
  • Length of Stay (days in hospital)
  • Acuity (acuteness or severity of condition)
  • Comorbidity (all of a patient’s diagnosed conditions)
  • Emergency Visits
Hospital Readmissions: Why This is a Big Data Problem
• Typical patient can have several gigabytes of data
– Much information is hidden in unstructured chart data and clinical notes
– 68,000 diagnosis codes and a very large number of modifiers, and new diagnosis codes are constructed to contain a lot of information
  • Site – where on the body the diagnosis applies
  • Combination codes – encode two or more related conditions
– 87,000 procedure codes, again now highly encoded with information
• Machine learning uses wide and variable vector lengths
• Multiple models are desirable
– Possibly trained by segmentation (Neonatal, geriatric, etc.)
• Data variation is quite large – HCPCS, ICD-10, CPT, and more
– And researchers want to add more data!
LACE Subassembly
LACE Score = Length of Stay Score + Acuity Score + Emergency Visits Score + Comorbidity Score
• Length of Stay Score: based on length of stay in days
• Acuity Score: based on acuity (serious condition)
• Emergency Visits Score: based on the number of visits to the emergency room over some period of time
• Comorbidity Score = Charlson Comorbidity Index = Age Score + ∑ Diagnosis Scores
Inputs: Patient Demographic Records, Patient Admission Records, Patient Diagnosis Records
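The LACE sum above can be sketched in a few lines of Python. This is an illustration only, not the MedPredict LACE engine: the slides give the formula but no point values, so every tier table, diagnosis weight, and function name below is an assumption.

```python
# Every point table below is an illustrative assumption; the slides give no values.
LENGTH_OF_STAY_TIERS = [(1, 0), (2, 1), (3, 2), (4, 3), (7, 4), (14, 5)]  # (days upper bound, score)
LENGTH_OF_STAY_TOP = 7
AGE_TIERS = [(50, 0), (60, 1), (70, 2), (80, 3)]  # (age upper bound, score)
AGE_TOP = 4
DIAGNOSIS_SCORES = {"diabetes": 1, "heart_failure": 1, "renal_disease": 2}
ACUTE_ADMISSION_SCORE = 3
MAX_EMERGENCY_VISITS_SCORE = 4


def tier_score(value, tiers, top_score):
    """Return the score of the first tier whose exclusive upper bound exceeds value."""
    for upper_bound, score in tiers:
        if value < upper_bound:
            return score
    return top_score


def comorbidity_score(age, diagnoses):
    """Comorbidity Score = Age Score + sum of Diagnosis Scores (formula from the slide)."""
    age_score = tier_score(age, AGE_TIERS, AGE_TOP)
    # Each distinct diagnosis is counted once; unknown diagnoses score zero.
    return age_score + sum(DIAGNOSIS_SCORES.get(d, 0) for d in set(diagnoses))


def lace_score(length_of_stay_days, acute_admission, emergency_visits, age, diagnoses):
    """LACE = Length of Stay + Acuity + Emergency Visits + Comorbidity scores."""
    length_score = tier_score(length_of_stay_days, LENGTH_OF_STAY_TIERS, LENGTH_OF_STAY_TOP)
    acuity_score = ACUTE_ADMISSION_SCORE if acute_admission else 0
    emergency_score = min(emergency_visits, MAX_EMERGENCY_VISITS_SCORE)
    return length_score + acuity_score + emergency_score + comorbidity_score(age, diagnoses)


# 5-day acute stay, 2 ER visits, age 72, two scored diagnoses: 4 + 3 + 2 + (3 + 1 + 1)
print(lace_score(5, True, 2, 72, ["diabetes", "heart_failure"]))  # prints 14
```

Keeping the tiers in data tables rather than in branching code means a parameter adjustment only touches the tables.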
MedPredict™: Advanced Analytics for Health Care
Advanced. Analytical. Intelligence.
Fundamentals
Classification and prediction of care outcome. It provides dashboard level reporting, early identification of potential undesirable outcomes, and actionable suggestions for remediation. It combines a variety of available data sources including:
• Patient biometric data and historic data
• Value Based Purchasing scores
• ICD-9 and ICD-10 diagnosis and procedures
• DRG and MDC classifications
• NCQA metric categories and HEDIS standard data
• Pharmaceutical usage
• Patient and facility demographic data
• Patient psychographic data from chart data
• Re-admissions and mortality data
• Inpatient and emergency department discharge data
• Patient satisfaction survey data
• Meaningful Use Summary of Care records
• Incorporates LACE score through a sophisticated LACE calculation engine
MedPredict™ [architecture diagram; labels only]
Inputs (each with an ETL step and a look back time):
• Patient Records: Name/ID, DOB/Age, Gender, Race, Ethnicity, …
• Patient Admission Records: Admittance and Discharge Date, Admission Source and Type, Admission Type, Discharge Disposition, Hospital, Expense, Insurer, …
• Patient Diagnosis Records: Date of Diagnosis, Location of Diagnosis, Diagnosis Code, Physician Notes, DRG and MDC, CC/MCC, …
LACE Engine: Age Tier Scores, ER Buckets, Diagnosis Scores
Patient Summary Record: Patient Data, LACE Metrics, Diagnosis Vector
Prediction Engine: Length of Stay, Expected Expense, Expected Outcome, Readmission Probabilities
MedPredict™ [expanded architecture diagram; labels only]
Inputs (each with an ETL step, a Transform and Score step, and a look back time):
• Patient Records: Name/ID, DOB/Age, Gender, Race, Ethnicity
• Patient Admission Records: Admittance and Discharge Date, Admission Source and Type, Admission Type, Discharge Disposition, Hospital, Expense, Insurer
• Patient Diagnosis Records: Date of Diagnosis, Location of Diagnosis, Diagnosis Code, Physician Notes, DRG and MDC, CC/MCC
• Patient Procedure Records: Date, Procedure Codes, Pharmaceuticals, Patient Chart Data
Scores: Age Tier Score, ER Buckets, Diagnosis Score, Procedure Score
LACE
Patient Prediction Record: Patient Data, LACE Metrics, Diagnosis Vector, Procedure Vector
Prediction Engine: Length of Stay, Expected Expense, Expected Outcome, Readmission Probabilities
Other labels: Appropriateness of Care Index, Profitability
MedExtract™: Advanced Analytics for Health Care
Fundamentals
Clinical information extraction from unstructured text.
The Problem - Even with the advent of data management technology, most patient information is recorded as unstructured text. That includes health care plans, prescriptions, doctors’ observations, and patients’ communications with their health care providers. A wealth of information is buried within these documents, yet it is difficult to find because, unlike a database, it cannot be queried with a uniform method.
The Solution - MedExtract utilizes advanced Text Analytics techniques and technologies for effective information extraction from unstructured text. From the free form text it can extract:
• Diseases, diagnoses, and procedures
• Names, addresses, phone numbers, locations
• Drugs, dosages, and usage
• Patient observations of sentiment (depression, anger, etc.)
Technical Solution Overview
• MedPredict™ contains several parts:
– ETL
  • Ingestion and cleansing using subassemblies to trap and record errors
  • Creation of the patient diagnosis record from patient diagnosis history
  • Creation of the patient admittance record from hospital discharge records
  • Creation of full patient scoring record
  • Text extraction from unstructured sources
– LACE calculation
  • Multiple subassemblies
– Predictive modeling
  • Training predictive models
  • Model testing
  • Model deployment
– Nightly run
  • Compute patient metrics and insert new record
  • Sorting and prioritization -> Reporting and alerting
  • Trend analysis
– Analysis and Refinement
  • Adding new data
  • Adding new calculations
  • Adjusting parameters
  • Retraining…
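The "trap and record errors" idea in the ETL step can be illustrated outside Cascading. The sketch below is plain Python, not the Cascading trap API. Rows that fail validation are diverted to a side list together with the reason, and good rows continue downstream. The field names and validation rules are assumptions made for illustration.

```python
from datetime import date


def cleanse_discharge_records(rows):
    """Split raw rows into (clean_records, trapped_rows).

    Each row is a dict with 'patient_id', 'admitted' and 'discharged'
    (ISO date strings). This layout is hypothetical.
    """
    clean, trapped = [], []
    for row in rows:
        try:
            patient_id = row["patient_id"].strip()
            if not patient_id:
                raise ValueError("empty patient_id")
            admitted = date.fromisoformat(row["admitted"])
            discharged = date.fromisoformat(row["discharged"])
            if discharged < admitted:
                raise ValueError("discharge precedes admission")
        except (KeyError, ValueError, AttributeError, TypeError) as error:
            # Trap: keep the offending row and the reason instead of failing the run.
            trapped.append({"row": row, "error": f"{type(error).__name__}: {error}"})
            continue
        clean.append({
            "patient_id": patient_id,
            "admitted": admitted,
            "discharged": discharged,
            "length_of_stay": (discharged - admitted).days,
        })
    return clean, trapped


raw_rows = [
    {"patient_id": "p1", "admitted": "2015-01-02", "discharged": "2015-01-06"},
    {"patient_id": "p2", "admitted": "2015-13-40", "discharged": "2015-01-06"},  # invalid date
    {"patient_id": "p3", "discharged": "2015-02-01"},                            # missing field
]
clean, trapped = cleanse_discharge_records(raw_rows)
print(len(clean), "clean;", len(trapped), "trapped")
```

Recording the failure reason next to the raw row makes baffling input errors easier to diagnose after the run.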
Technology Underpinnings
[Diagram labels: Discharge, Chart, Diagnosis, Procedure, Patient History; Kafka pub/sub; ETL; Training; Model (three instances); Scoring; Reporting; Alerting; Analysis]
[Diagram labels: Cascading; MedPredict and MedExtract; Workflow control; Technology Migration; Driven; Monitoring; Performance and Tuning; MedMiner infrastructure; Mahout; OpenNLP]
Why Cascading?
• We had already been using Cascading!
– We started here in 2011, so we had four years under our belt
– Reliability and stability were a concern. Cascading is mature, and unlike other systems, we understand how it works.
  • We literally “wrote the book” on Cascading :-)
– Cascading has been cost-effective and has preserved a large initial investment for us.
  • Core product set was written in version 1.3 in 2011.
  • We moved to version 2 in 2013.
  • Since June 2015, we have been running under version 3 using Tez
– Cascading Test Driven Design principles have made development easier.
– Cascading subassemblies and cascades have provided us with tremendous code reuse.
Why Cascading?
• We have one (portable) language and framework to learn and maintain.
• We use Cascading dynamic control.
– We heavily instrument our flows, and these metrics control the number of iterations that we use
– We tried to do this in Pig and MapReduce, but found it to be very difficult to impossible
• RelPredict was written in Mahout. Cascading wraps this functionality and gives us “hyper-parallelism”
• RelExtract uses a heavily augmented OpenNLP and Cascading also wraps this functionality in a customizable “pipeline” much like UIMAfit.
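The metric-driven iteration described above follows a simple pattern: run a step, read its counters, and decide whether another pass is needed. The Python sketch below shows that pattern only, not the authors' flows or the Cascading counter API. The step (propagating the smallest label across linked records) and all names are hypothetical.

```python
def run_until_stable(step, state, max_iterations=20):
    """Re-run `step` until its 'changed' counter reaches zero or the cap is hit.

    Returns the final state and the number of passes executed.
    """
    for iteration in range(1, max_iterations + 1):
        state, counters = step(state)
        if counters["changed"] == 0:
            return state, iteration
    return state, max_iterations


def propagate_min_label(state):
    """One pass: copy the smaller label across each link and count the updates."""
    labels, links = state
    new_labels = dict(labels)
    changed = 0
    for a, b in links:
        low = min(new_labels[a], new_labels[b])
        for node in (a, b):
            if new_labels[node] != low:
                new_labels[node] = low
                changed += 1
    return (new_labels, links), {"changed": changed}


labels = {"a": 1, "b": 2, "c": 3, "d": 4}
links = [("c", "d"), ("b", "c"), ("a", "b")]
(final_labels, _), passes = run_until_stable(propagate_min_label, (labels, links))
print(final_labels, passes)
```

The loop does not know in advance how many passes the data needs. The counter emitted by each pass decides, which is the role the slide assigns to flow instrumentation.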
Why Cascading?
• Modules in MedPredict seemed to fit naturally into the Cascading model
– Several discrete steps must be performed and fed forward
– The pipe metaphor is an ideal manner of expression
– HashJoins give huge performance benefits during the calculation phase
– Checkpointing is used extensively due to the expected high error rate in data
– Cascading Local mode allows smaller hospitals to use the system without requiring a full Hadoop stack
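The benefit of a hash join comes from holding the smaller side in memory as a lookup table while the larger side streams past it once. The sketch below shows that idea in plain Python, not with Cascading's HashJoin pipe. The record fields and the score values are assumptions made for illustration.

```python
def hash_join(large_rows, small_rows, key):
    """Inner-join two iterables of dicts on `key`.

    Only `small_rows` is held in memory (as a hash table); `large_rows`
    is streamed once, so it never needs to be sorted or fully loaded.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical data: a long stream of diagnoses joined to a small score table.
diagnoses = [
    {"patient_id": "p1", "code": "E11"},
    {"patient_id": "p1", "code": "I50"},
    {"patient_id": "p2", "code": "Z00"},  # no score entry, so dropped by the inner join
]
diagnosis_scores = [
    {"code": "E11", "score": 1},
    {"code": "I50", "score": 1},
]
joined = list(hash_join(diagnoses, diagnosis_scores, "code"))
print(joined)
```

This only pays off when one side is small enough to fit in memory, which is why large multi-dataset joins can still spill to disk.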
Implementation Considerations
• Occasionally, we find that disk spills impact performance due to two large multi-dataset joins that the LACE calculation performs.
• We make extensive use of Buffer operations, so we need to monitor these phases closely due to memory and compute requirements.
• Usually at some predefined trigger point, we have to run a very large predictive retraining job. It is very resource intensive.
• Data lineage is a big issue since MedPredict's data originates from many sources.
• ETL is complex and is customized relative to the data that the app is provided.
– It must be transformed into a common format before MedPredict can consume it.
– Errors in incoming data streams can be quite baffling at times.
• MedExtract and Natural Language Processing are very resource intensive.
– Named Entity extraction uses both machine learning and dictionaries
– We produce indices of searchable terms (usually consumed by Solr/Lucene).
– We produce records that augment the other ETL streams.
You have to manage and monitor all these things.
MedMiner Deep Learning
[Diagram labels: Inpatient Associations and Relationships; Health Care Team Model; Chart Model; Pharmacy Model; VBP Model; Diagnosis, Treatment, and Outcome; Comparative Performance; Financial Impact; Status and Discharge Disposition; Financial Cost; Emergency Room; VBP Metrics; Hospitals and Facilities; Data Mgmt; RelMiner Router; Fees and Services; Pharmacy; Diagnostic and Procedural; Outcome; Length of Stay; Readmittance; Neonatal, Juvenile, Adult, Geriatric; Learning Feedback Loop; Chart and Notes; NLP]
Best Practices for Operational Readiness
• We monitor end-to-end performance and use Driven to tune some of the larger flows.
• We use flow skipping extensively, and we monitor when steps have not been skipped (indicating that a data refresh has occurred).
• Errors in incoming data streams can be quite baffling at times. We use Traps and Checkpoints extensively to help here.
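Flow skipping amounts to a freshness check: a step runs only when its output is missing or older than one of its inputs, and a step that does run signals refreshed data. The Python sketch below illustrates the check with file modification times. It is not Cascading's implementation, and the function names and the file-based setup are assumptions.

```python
import os
import tempfile
import time


def sink_is_stale(source_paths, sink_path):
    """True when the sink is missing or older than any of its sources."""
    if not os.path.exists(sink_path):
        return True
    sink_mtime = os.path.getmtime(sink_path)
    return any(os.path.getmtime(path) > sink_mtime for path in source_paths)


def run_step(name, source_paths, sink_path, work, log):
    """Run `work` only if the sink is stale; record 'ran' or 'skipped' in `log`."""
    if sink_is_stale(source_paths, sink_path):
        work(source_paths, sink_path)
        log.append((name, "ran"))  # a step that ran indicates refreshed input data
    else:
        log.append((name, "skipped"))


def concatenate(source_paths, sink_path):
    """Hypothetical unit of work: copy all sources into the sink."""
    with open(sink_path, "w") as sink:
        for path in source_paths:
            with open(path) as source:
                sink.write(source.read())


def demo():
    log = []
    with tempfile.TemporaryDirectory() as folder:
        source = os.path.join(folder, "discharges.txt")
        sink = os.path.join(folder, "scored.txt")
        with open(source, "w") as handle:
            handle.write("p1,2015-01-06\n")
        os.utime(source, (1000, 1000))               # pretend the source is old
        run_step("score", [source], sink, concatenate, log)  # sink missing
        run_step("score", [source], sink, concatenate, log)  # sink newer than source
        refreshed = time.time() + 3600
        os.utime(source, (refreshed, refreshed))     # simulate a data refresh
        run_step("score", [source], sink, concatenate, log)
    return log


print(demo())
```

The log from `demo()` shows which runs did real work, which is the signal worth monitoring.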
Driven, from Concurrent, is used to track the performance of MedPredict™ to solve these operational problems
We are now testing our system using Cascading version 3 and Tez.
Best Practices for Operational Readiness
• Visualize your pipelines to make sure your applications are executing as expected in dev, test and prod environments.
• If you are in a regulated industry, like Healthcare, track data lineage. You will have to report on it for internal and external audits
Technical Overview - Driven
• MedExtract and Natural Language Processing are very resource intensive.
– Named Entity extraction uses both machine learning and dictionaries
– We produce indices of searchable terms (usually consumed by Solr/Lucene).
– We produce records that augment the other ETL streams. Driven is used here for performance tuning and to monitor the overall NLP pipes.
• We are now testing a Tez port of our system. Driven plays a role here as well.
Questions and Answers
[email protected]@AnalyticsInside.us
http://www.AnalyticsInside.us