DH03 Educating for the Future: Data Engineering Techniques

Post on 27-Jul-2020


https://education.phuse.eu/eftf/data-engineering/

Contribution:

Andy Richardson, Amy Gillespie, Beate Hientzsch, Berber Snoeijer, Beverly Hayes, Jatin Patel, Karnika Dalal, Mark Bynens, Mike Carniello, Parag Shiralkar, Paul Slagle, Ralf Goetzelmann, Rohit Banga, Sagar Jain, Susan Olson, Vijay Pasapula, Vince Marinelli, Xiaohui Wang

Special thanks: Ian Fleming, James McDermott, Wendy Dobson

Presentation:

Guy Garrett, Achieve Intelligence

Renu Shukla, Janssen Research & Development

Mohit Juneja, LyfeScience

Agenda

• Introduction

• Master Data Management - Principle

• ETL Overview – “Transforming”

• Data Lake – “Trawling”

• Data Marketplace – “Targeting”

• Conclusion

Data Engineering Landscape

[Diagram: data sources (ePro, RWE, IoT, Bio-markers, eDiaries, HEcon, Wearables, Omics, RWD, Profiles, EHR, CrowdSource, eCRF, Lab); Data Pipeline; Data Lake; Centralised Data Hub; Data Marketplace]

Master Data Management

[Diagram: Study (×3); Data Capture & Site Management; Data Management & Quality Control; Analysis & Reporting]

[Diagram: Data Capture & Site Management; Data Management & Quality Control; Analysis & Reporting; eCRF; DM Scripts; A&R Scripts]

[Diagram: eCRF; Lab; Others; Automated Data Ingestion Routines; Centralised Data Hub]

* F.A.I.R. = Findability, Accessibility, Interoperability, Reusability

• Principle

Approach                  Investment                  Return on Investment
Master Data Management    More up-front investment    Longer-term savings
Siloed Data Processing    Less up-front investment    More costly long-term

The Data Engineering Project is looking for use cases to evidence this principle.

1) Measure current processes

2) Adopt MDM

3) Measure MDM processes

4) Analyse ROI
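The four steps above could feed a simple ROI calculation. The sketch below is illustrative only: the function name `roi`, the cost figures, and the number of studies are hypothetical and are not taken from the presentation.

```python
def roi(cost_before: float, cost_after: float, investment: float, periods: int) -> float:
    """Return ROI as a fraction: (total savings - investment) / investment."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    # Savings accumulate once per period (here: once per study).
    savings = (cost_before - cost_after) * periods
    return (savings - investment) / investment


# Hypothetical figures (not from the source): processing cost per study
# measured before MDM (step 1) and after MDM (step 3).
example = roi(cost_before=100_000, cost_after=70_000, investment=120_000, periods=6)
print(f"ROI over 6 studies: {example:.0%}")
```

With these made-up numbers the up-front investment is only recovered after the fourth study, which is the "longer-term savings" pattern the table describes.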

Agenda

• Introduction

• Master Data Management - Principle

• ETL Overview – “Transforming”

– Definition

– Staging & Loading

– Change Data Capture

– Data Validation

– Scheduling & Batch Processing

• Data Lake – “Trawling”

• Data Marketplace – “Targeting”

• Conclusion

ETL Overview – “Transforming”

• Wikipedia definition

“In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).”

“…data engineering, is a software engineering approach to designing and developing information systems.”

i.e. it’s about building data pipelines rather than datasets.
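A minimal sketch of the extract, transform, load idea, written as three small functions chained into one reusable pipeline. The column names, the sample CSV, the `vitals` table and the unit conversion are all hypothetical, chosen only to show the destination representing the data differently from the source.

```python
import csv
import io
import sqlite3
from typing import Iterable, Iterator

# Hypothetical source extract; a real pipeline would read from a source system.
SOURCE_CSV = "subject_id,visit,weight_lb\nS001,1,154.0\nS002,1,176.4\n"


def extract(text: str) -> Iterator[dict]:
    """Extract: read rows from the source representation (CSV text here)."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[dict]) -> Iterator[tuple]:
    """Transform: re-represent the data (pounds to kilograms, typed columns)."""
    for row in rows:
        weight_kg = round(float(row["weight_lb"]) * 0.45359237, 1)
        yield row["subject_id"], int(row["visit"]), weight_kg


def load(records: Iterable[tuple], conn: sqlite3.Connection) -> int:
    """Load: write the transformed records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals (subject_id TEXT, visit INTEGER, weight_kg REAL)"
    )
    cur = conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", records)
    conn.commit()
    return cur.rowcount


def run_pipeline(text: str, conn: sqlite3.Connection) -> int:
    """Chain the three steps; the pipeline, not the dataset, is the reusable unit."""
    return load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(SOURCE_CSV, conn)
```

The same `run_pipeline` can be re-run on each new delivery of source data, which is the sense in which the effort goes into the pipeline rather than into any single dataset.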

[Diagram: ETL pipeline components: Source Data; Data Pipelines; Staging Area; Central Data Hub; Control Tables; Data Validation (Simple); Data Validation (Complex); DV Rules (Simple); DV Rules (Complex); Scheduling Controls]

Staging

Loading

Change Data Capture

Changes: Additions, Updates, Deletions

Methods:

• Database Timestamping (Control Tables)

• Delta Identification (Comparison)
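Both methods can be sketched in a few lines. This is an illustration under assumptions, not the presenters' implementation: the snapshot dictionaries, the `modified_at` field and the function names are hypothetical, and the watermark that a control table would hold is passed in as a plain argument.

```python
from datetime import datetime


def identify_delta(previous: dict, current: dict) -> dict:
    """Delta identification: compare two snapshots keyed by record id."""
    return {
        "additions": sorted(k for k in current if k not in previous),
        "updates": sorted(
            k for k in current if k in previous and current[k] != previous[k]
        ),
        "deletions": sorted(k for k in previous if k not in current),
    }


def changed_since(rows: list, last_run: datetime) -> list:
    """Timestamping: keep rows modified after the last-run watermark."""
    return [r for r in rows if r["modified_at"] > last_run]


# Hypothetical snapshots of a source table, keyed by subject id.
previous = {"S001": {"weight_kg": 69.9}, "S002": {"weight_kg": 80.0}}
current = {"S001": {"weight_kg": 70.4}, "S003": {"weight_kg": 65.2}}
delta = identify_delta(previous, current)

# Hypothetical rows carrying a last-modified timestamp.
rows = [
    {"id": "S001", "modified_at": datetime(2020, 7, 1, 9, 0)},
    {"id": "S003", "modified_at": datetime(2020, 7, 3, 14, 30)},
]
recent = changed_since(rows, last_run=datetime(2020, 7, 2))
```

Note that a timestamp filter on its own cannot see rows that have been physically deleted from the source, whereas a snapshot comparison can.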

Data Validation (At Source)

e.g.: EDC Patient DOB must be a valid date.
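A check of this kind might look like the sketch below. The ISO `YYYY-MM-DD` input format and the extra "not in the future" rule are assumptions made for the example; the slide only requires a valid date.

```python
from datetime import date, datetime
from typing import Optional


def is_valid_dob(value: str, today: Optional[date] = None) -> bool:
    """Return True if value parses as a real calendar date that is not in the future."""
    today = today or date.today()
    try:
        # strptime rejects impossible dates such as 30 February.
        dob = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return False
    return dob <= today


# Hypothetical entries as they might be typed into a date-of-birth field.
checks = {v: is_valid_dob(v, today=date(2020, 7, 27)) for v in ["1980-02-29", "1981-02-29", "unknown"]}
```

Running the rule at the point of entry means a bad value can be queried immediately instead of being discovered later in the pipeline.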

Data Validation (Simple)

e.g.: Lab Data Value unlikely to be > nnn

Failed records get diverted for cleansing processes
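The rule and the diversion of failed records can be sketched as a single routing function. The field names and sample records are hypothetical, and the limit of 500 merely stands in for the slide's unspecified "nnn".

```python
from typing import Iterable, List, Tuple


def split_by_threshold(
    records: Iterable[dict], field: str, upper: float
) -> Tuple[List[dict], List[dict]]:
    """Route each record to 'passed' or 'failed' according to one simple rule."""
    passed, failed = [], []
    for rec in records:
        try:
            ok = float(rec[field]) <= upper
        except (KeyError, TypeError, ValueError):
            ok = False  # missing or non-numeric values are also diverted
        (passed if ok else failed).append(rec)
    return passed, failed


# Hypothetical lab results; 500 is a placeholder limit, not a clinical reference value.
labs = [
    {"subject_id": "S001", "test": "GLUC", "value": "98"},
    {"subject_id": "S002", "test": "GLUC", "value": "9800"},
    {"subject_id": "S003", "test": "GLUC", "value": "n/a"},
]
to_hub, to_cleansing = split_by_threshold(labs, field="value", upper=500)
```

Failed records are kept rather than dropped, so the cleansing process can correct and resubmit them.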

Data Validation (Complex)

• Across different sources: this lab value is not consistent with that EDC value (e.g. males can’t be pregnant)

• Summarisation Checks
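A rough sketch of both kinds of complex check follows. The record layouts, test codes and function names are hypothetical, and reading "summarisation check" as a row-count reconciliation is an assumption; the slide does not define the term.

```python
def cross_source_issues(edc: dict, labs: list) -> list:
    """Flag lab records that contradict the EDC demographics for the same subject."""
    issues = []
    for lab in labs:
        subject = edc.get(lab["subject_id"])
        if subject is None:
            issues.append((lab["subject_id"], "no EDC record"))
        elif (
            lab["test"] == "PREG"
            and lab["result"] == "POSITIVE"
            and subject["sex"] == "M"
        ):
            issues.append((lab["subject_id"], "positive pregnancy test for male subject"))
    return issues


def counts_match(source_rows: list, loaded_rows: list) -> bool:
    """Summarisation check (assumed meaning): rows loaded equal rows received."""
    return len(source_rows) == len(loaded_rows)


# Hypothetical EDC demographics and lab results.
edc = {"S001": {"sex": "M"}, "S002": {"sex": "F"}}
labs = [
    {"subject_id": "S001", "test": "PREG", "result": "POSITIVE"},
    {"subject_id": "S002", "test": "PREG", "result": "POSITIVE"},
    {"subject_id": "S009", "test": "GLUC", "result": "98"},
]
issues = cross_source_issues(edc, labs)
```

Unlike the simple rules, these checks need more than one source to be available at the same time, which is why they sit later in the pipeline.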

Scheduling & Batch Processing

• Frequencies

• Dependencies

• Alerts
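Dependencies and alerts can be illustrated with a small batch runner built on the standard-library `graphlib` module (Python 3.9+). The job names and the failing lab feed are hypothetical. Frequency is not shown; it is assumed that an external scheduler triggers `run_batch` at the required interval.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each maps to the set of jobs it depends on.
DEPENDENCIES = {
    "stage_edc": set(),
    "stage_lab": set(),
    "validate": {"stage_edc", "stage_lab"},
    "load_hub": {"validate"},
}


def run_batch(jobs: dict, dependencies: dict) -> list:
    """Run jobs in dependency order; return alert messages for failed or skipped jobs."""
    alerts, failed = [], set()
    for name in TopologicalSorter(dependencies).static_order():
        if dependencies[name] & failed:
            # An upstream failure blocks everything that depends on it.
            failed.add(name)
            alerts.append(f"{name}: skipped, upstream job failed")
            continue
        try:
            jobs[name]()
        except Exception as exc:
            failed.add(name)
            alerts.append(f"{name}: failed ({exc})")
    return alerts


def broken_lab_feed():
    raise RuntimeError("lab file missing")


jobs = {
    "stage_edc": lambda: None,
    "stage_lab": broken_lab_feed,
    "validate": lambda: None,
    "load_hub": lambda: None,
}
alerts = run_batch(jobs, DEPENDENCIES)
```

In this run the EDC staging job still completes, while validation and loading are skipped and reported, so the alert list shows exactly where the batch stopped.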

Data Lake – “Trawling”

• Big Data

• Unstructured/Semi-Structured

• Just-in-case - “Disk space is cheap”

Data Marketplace – “Targeting”

• Small Data

• Specific Information

• “As required” via APIs

Conclusion

Data Engineering Project (Educating for the Future PHUSE Working Group)

BHayes2@its.jnj.com

Guy.Garrett@AchieveIntelligence.com

No copyright infringement intended.
