Herding Ponies: How big data methods facilitate collaborative analytics.

Dec 15, 2015

Carla Easter

Changes in Outcomes Research

New monikers…

Patient Centered Outcomes Research

Health Services Research

Comparative Effectiveness Research

Safety and Surveillance

Changes in funding agencies

PCORI - AHRQ

FDA – CMS

NIH

Changes in research models

More multi-site studies

Larger “center-based” studies

Greater interest in Patient Generated Data

Greater interest in EHR-based data

Less interest in claims

Collaboration Frameworks from Other Disciplines

Open Science Grid (OSG): physics, nanotechnology, structural biology

1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010

LIGO: physics/astrophysics

Established practices and metadata standards

1 PB data in last science run, distributed worldwide

ESGF

1.2 PB climate data delivered to 23,000 users; 600+ pubs

Collage (executable papers): computer science

“Why hasn’t Outcomes Research adopted collaborative methods used in physics, climate science, and genomics?”

- Everyone in data-driven research

Adapting to Collaborative Science

1. Healthcare data are not collected for research

Not standardized

Not complete

2. Privacy protection has legal and ethical implications

3. Data are an asset

4. Data sharing is not incentivized or supported by journals, funding agencies, or the business of healthcare

Obtaining consent is expensive

Data hoarding is rewarded and is the conservative choice

Are Federated Research Networks the solution?

In federated models, data are not centralized. AHRQ and PCORI have invested heavily in this approach.

5. Each data holder independently assumes responsibility for “data wrangling” and standardization

6. Requires distributed analysis as opposed to traditional central data pooling and analysis.

If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis

7. Requires greater levels of coordination of governance, standards, software, and policies.

8. High barriers to entry – what is the ROI?
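
The remark under point 6 can be made concrete. When each site fits its own model and only the estimates are combined, the network is effectively running a meta-analysis. The sketch below assumes fixed-effect inverse-variance weighting, which is a common choice but not one the slides specify; the function name and the per-site numbers are hypothetical.

```python
import math

def pool_fixed_effect(estimates):
    """Combine per-site (estimate, standard_error) pairs by inverse-variance weighting."""
    # Each site's weight is the reciprocal of its estimate's variance.
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se

# Hypothetical log odds ratios and standard errors reported by two sites.
site_results = [(0.40, 0.20), (0.10, 0.40)]
pooled, pooled_se = pool_fixed_effect(site_results)
print(pooled, pooled_se)
```

Each site contributes one finished estimate, so nothing is learned jointly across sites beyond this weighted average.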

Federated Meta-Analysis vs. Distributed Analysis

Parallel Meta-analysis (Independently Estimated Results)

Data Site 1: 100 patients

Data Site 2: 50 patients

Query Portal

Analysis Program

Results Site 1: model fit to 100 patients

Results Site 2: model fit to 50 patients

Parallel Distributed Analytics (Jointly Estimated Results)

Data Site 1: 100 patients

Data Site 2: 50 patients

Query Portal & Aggregator

Iterative Analysis Program

Intermediate Statistics Site 1

Intermediate Statistics Site 2

Model Fit: 150 patients

Converged Estimate

Meta-analysis

• One independently estimated model for each node in the network

• Not iterative

Distributed Analysis

• One jointly estimated model using data from all sites

• Typically iterative

• Leverages computational power of the entire network
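
The jointly estimated case can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by any network named in these slides: it assumes a logistic model with an intercept and one covariate, fitted by Newton-Raphson, with simulated data for a 100-patient site and a 50-patient site. The names `site_statistics`, `fit_distributed`, and `simulate` are hypothetical. In each iteration a site returns only its gradient and Hessian sums (the "intermediate statistics"), and the aggregator adds them up and updates the shared coefficients.

```python
import math
import random

def site_statistics(rows, beta):
    """Run at a data site: return gradient and Hessian sums, never patient rows."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def fit_distributed(sites, iterations=25, tol=1e-10):
    """Run at the aggregator: sum site statistics and take Newton steps."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        stats = [site_statistics(rows, beta) for rows in sites]
        g0 = sum(s[0][0] for s in stats)
        g1 = sum(s[0][1] for s in stats)
        h00 = sum(s[1][0] for s in stats)
        h01 = sum(s[1][1] for s in stats)
        h11 = sum(s[1][2] for s in stats)
        # Solve the 2x2 system H * step = g in closed form.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if abs(step0) + abs(step1) < tol:
            break
    return beta

def simulate(n, rng):
    """Generate synthetic (covariate, outcome) rows; the true coefficients are assumptions."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows

rng = random.Random(42)
site_1 = simulate(100, rng)   # stays at site 1
site_2 = simulate(50, rng)    # stays at site 2

joint = fit_distributed([site_1, site_2])    # one model fit to 150 patients
pooled = fit_distributed([site_1 + site_2])  # reference: rows pooled centrally
print(joint, pooled)
```

Because the sums over sites equal the sums over the pooled rows, the joint estimate matches the centrally pooled fit up to floating-point rounding. Calling `fit_distributed([site_1])` alone gives the per-site model that a meta-analysis would start from.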

What does this have to do with “big data”?

Two (of 8) barriers to collaborative data science solved with “Big Data” methods

Privacy protection has legal and ethical implications

If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis

Bonus – specialized software or hardware like SAS and CMS repositories can be replaced with parallelized systems

Parallel Evolution of Distributed Computing and Federated Research Networks

[Figure: timeline from 1993 to 2013 in five-year steps; caGrid is the only recoverable label]

“Big Data” Analytics vs. Outcomes Research Analytics

| | “Big Data” in Distributed Environments | Outcomes Research in Federated Research Networks |
| Analysis Questions | Patterns, Predictions, Classification | Causal Inference, Predictions, Hypothesis testing |
| Data Distribution | Data can be randomly distributed across processors by a master | Data are non-randomly anchored to sites |
| # Nodes on network | 100s or more | 10s |
| Data Governance constraints between network nodes | Typically none or low | Typically very high |
| Data set size | Very large | Relatively small |
| Query Distribution Platforms | Apache Spark, Hadoop Map-Reduce, Apache Pig | SHRINE, PopMedNet, TRIAD |
| Common Analytic Platforms | R-Volution/R-Hadoop, Apache Mahout, Spark Machine Learning Lib, Spark Graph X Lib | R, SAS, Stata |
| Size of developer community | 1000s | Dozens |

“Big-Data” Methods are Incidentally “Privacy Preserving”

| Feature | Clinical Research Rationale | “Big Data” Rationale |
| Federation in the form of multiple networked nodes or processing cores | Multiple independently operating data partners | Inefficient to rely on a single very powerful processor or specialized hardware |
| Distributed computation across networked nodes (instead of central pooling of data) | Transferring patient-level data incurs re-identification risks | Inefficient to transfer large data sets across the network |
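
The second feature can be illustrated with a map-reduce style fit in which only summary statistics cross the network. This is a minimal sketch, assuming ordinary least squares with one covariate and noise-free toy data; `map_site`, `combine`, and `solve` are hypothetical names, not functions from any platform mentioned in these slides.

```python
from functools import reduce

def map_site(rows):
    """Run locally at a site: reduce patient rows to five sums."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Run at the aggregator: add two sites' sums element by element."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Recover the least-squares intercept and slope from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Toy data following y = 2 + 3x exactly, split across two sites.
site_1 = [(x, 2.0 + 3.0 * x) for x in range(10)]
site_2 = [(x, 2.0 + 3.0 * x) for x in range(10, 15)]

intercept, slope = solve(reduce(combine, map(map_site, [site_1, site_2])))
print(intercept, slope)
```

Each site sends five numbers regardless of how many patients it holds, which serves both rationales in the table: no patient-level rows leave the site, and network traffic stays small.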

Distributed Computing Frameworks

Grid Computing Architectures

Statistical Query Oracle: mostly an academic effort

Hadoop: derived from Google’s MapReduce design

Hundreds of developers

591 active projects and organizations

Apache Spark: Berkeley Computer Science’s answer to Hadoop

Most rapidly growing user base

99 Active projects and organizations

Collaboration Frameworks In Outcomes Research

SHRINE for i2b2

PopMedNet for Mini-Sentinel, PCORnet

TRIAD for caGrid, SAFTINet DRN

What distributed methods in the standard biostats toolbox are already supported in “Big Data” vs. Clinical Frameworks?

Columns: Algorithm/Method; Apache Spark Libraries; Map-Reduce Multi-Core or RHadoop; Federated Clinical Research Networks

Linear regression (weighted) X X

Logistic regression X X X

Cox Proportional Hazard X

Generalized Linear Models X

Naïve Bayes X X

Gaussian Discriminative Analysis X

K-means X X

Neural Network Backpropagation X

Matrix Factorization X

PCA * X

ICA * X

Support Vector Machine X X X

Expectation Maximization X

Random Forest Classifier X

No Longer a Technical Challenge

We have the tools we need to overcome privacy and liability concerns. Now we “only” need to change culture.

Moving Collaborative Outcomes Science Forward

Policies (aka incentives)

Payer-driven incentives for better data hygiene and standardization

Payer incentives for sharing

Funding agency incentives for collaborative data management vs. data hoarding

Journal incentives

HIPAA clarification

Infrastructure

As a community, adopt existing easy-to-use, flexible platforms for sharing code and data

Link clinical data and patient device infrastructure to research infrastructure

Culture

Clinician demand

Patient demand

Tenure and promotion transformation

Replace “not invented here syndrome” with collective credit and shared efficiencies