
Jul 27, 2020


DH03 – Educating for the Future:

Data Engineering Techniques

https://education.phuse.eu/eftf/data-engineering/

Contribution:

Andy Richardson, Amy Gillespie, Beate Hientzsch, Berber Snoeijer, Beverly Hayes, Jatin Patel, Karnika Dalal, Mark Bynens, Mike Carniello, Parag Shiralkar, Paul Slagle, Ralf Goetzelmann, Rohit Banga, Sagar Jain, Susan Olson, Vijay Pasapula, Vince Marinelli, Xiaohui Wang

Special thanks: Ian Fleming, James McDermott, Wendy Dobson

Presentation:

Guy Garrett, Achieve Intelligence

Renu Shukla, Janssen Research & Development

Mohit Juneja, LyfeScience

Agenda

• Introduction

• Master Data Management - Principle

• ETL Overview – “Transforming”

• Data Lake – “Trawling”

• Data Marketplace – “Targeting”

• Conclusion

Data Engineering Landscape

[Diagram]

Data sources: ePro, RWE, IoT, Bio-markers, eDiaries, HEcon, Wearables, Omics, RWD, Profiles, EHR, CrowdSource, eCRF, Lab

Components: Data Pipeline, Data Lake, Centralised Data Hub, Data Marketplace


Master Data Management

[Diagram labels: Study (×3); Data Capture & Site Management; Data Management & Quality Control; Analysis & Reporting]

Master Data Management

[Diagram labels: Data Capture & Site Management; Data Management & Quality Control; Analysis & Reporting; eCRF; DM Scripts; A&R Scripts]

Master Data Management

[Diagram labels: eCRF; eCRF; Lab; Others; Automated Data Ingestion Routines; Centralised Data Hub]

* F.A.I.R. = Findability, Accessibility, Interoperability, Reusability

Master Data Management

• Principle

Approach                  Investment                  Return on Investment
Master Data Management    More up-front investment    Longer term savings
Siloed Data Processing    Less up-front investment    Most costly long-term

Data Engineering Project is looking for Use Cases to evidence this principle.

1) Measure current processes

2) Adopt MDM

3) Measure MDM processes

4) Analyse ROI
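The four steps above amount to a before/after cost comparison. A minimal sketch in Python, assuming one simple ROI definition (savings net of investment, divided by investment). The function name, the inputs and the figures are hypothetical and are not taken from the project:

```python
def roi(baseline_cost: float, mdm_cost: float, investment: float) -> float:
    """Return ROI as a fraction: (savings - investment) / investment.

    baseline_cost: measured cost of the current processes over a period (step 1)
    mdm_cost:      measured cost of the same processes under MDM (step 3)
    investment:    up-front cost of adopting MDM (step 2)
    All three inputs are hypothetical, for illustration only.
    """
    if investment <= 0:
        raise ValueError("investment must be positive")
    savings = baseline_cost - mdm_cost
    return (savings - investment) / investment


# Hypothetical figures, not measurements.
example = roi(baseline_cost=500_000, mdm_cost=300_000, investment=100_000)
print(f"ROI: {example:.0%}")
```

A real use case would also need to fix the measurement period and decide which costs count, which this sketch leaves open.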

Agenda

• Introduction

• Master Data Management - Principle

• ETL Overview – “Transforming”

– Definition

– Staging & Loading

– Change Data Capture

– Data Validation

– Scheduling & Batch Processing

• Data Lake – “Trawling”

• Data Marketplace – “Targeting”

• Conclusion

ETL Overview – “Transforming”

[Diagram: Data Engineering Landscape, repeated]

• Wikipedia definition

“In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).”

“…data engineering, is a software engineering approach to designing and developing information systems.”

i.e. it’s about building data pipelines rather than datasets.
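To make the "pipelines rather than datasets" point concrete, here is a minimal extract, transform, load sketch in Python. It assumes a hypothetical CSV source and an in-memory SQLite database standing in for the destination system. The column names, table name and function names are illustrative only:

```python
import csv
import io
import sqlite3

# Hypothetical source extract; column names are assumptions for illustration.
SOURCE_CSV = """subject_id,visit,weight_kg
001,BASELINE,70.5
002,BASELINE,82.0
"""


def extract(text: str) -> list[dict]:
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: re-represent the data for the destination (casing, types)."""
    return [(r["subject_id"], r["visit"].title(), float(r["weight_kg"])) for r in rows]


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals (subject_id TEXT, visit TEXT, weight_kg REAL)"
    )
    conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn: sqlite3.Connection, text: str) -> int:
    """Run extract -> transform -> load and return the number of rows loaded."""
    rows = transform(extract(text))
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn, SOURCE_CSV)
```

The reusable part is `run_pipeline`: the same steps can be re-run on each new delivery of source data, whereas a one-off script would produce a single dataset.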

ETL Overview – “Transforming”

[Diagram labels: Source Data (×3); Data Pipeline (×3); Data Pipelines; Staging Area; Central Data Hub; Control Tables; Data Validation (Simple); Data Validation (Complex); DV Rules (Simple); DV Rules (Complex); Scheduling Controls]

Staging

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

Loading

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

Change Data Capture

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

Changes: Additions, Updates, Deletions

Methods:

• Database Timestamping (Control Tables)

• Delta Identification (Comparison)
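Both methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: snapshots are plain dictionaries keyed by a record ID, the control table is reduced to a single "last load" timestamp, and all names and values are hypothetical.

```python
from datetime import datetime


def identify_delta(previous: dict, current: dict) -> dict:
    """Delta identification: compare two snapshots keyed by record ID."""
    return {
        "additions": {k: current[k] for k in current.keys() - previous.keys()},
        "deletions": {k: previous[k] for k in previous.keys() - current.keys()},
        "updates": {
            k: current[k]
            for k in current.keys() & previous.keys()
            if current[k] != previous[k]
        },
    }


def changed_since(rows: list[dict], last_load: datetime) -> list[dict]:
    """Timestamping: keep rows modified after the time held in a control table."""
    return [r for r in rows if r["modified_at"] > last_load]


# Hypothetical snapshots of a source table, keyed by subject ID.
previous = {"001": {"weight_kg": 70.5}, "002": {"weight_kg": 82.0}}
current = {"001": {"weight_kg": 71.0}, "003": {"weight_kg": 64.2}}
delta = identify_delta(previous, current)

# Hypothetical control-table value and timestamped source rows.
last_load = datetime(2020, 7, 1)
rows = [
    {"id": "001", "modified_at": datetime(2020, 6, 30)},
    {"id": "002", "modified_at": datetime(2020, 7, 2)},
]
recent = changed_since(rows, last_load)
```

One design point worth noting: filtering on a modification timestamp finds additions and updates, but it cannot see a row that has been physically deleted from the source, whereas snapshot comparison can.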

Data Validation (At Source)

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

e.g.: EDC Patient DOB must be a valid date.
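A check of this kind might look as follows in Python. The ISO `YYYY-MM-DD` input format and the extra rule that a date of birth cannot lie in the future are assumptions added for illustration. The function name is hypothetical:

```python
from datetime import date, datetime


def parse_dob(value: str, fmt: str = "%Y-%m-%d"):
    """Return the date of birth as a date, or None if the value is rejected.

    A value is rejected when it is not a real calendar date in the expected
    format, or (an added assumption) when it lies in the future.
    """
    try:
        dob = datetime.strptime(value, fmt).date()
    except ValueError:
        return None
    return dob if dob <= date.today() else None
```

For example, `parse_dob("1981-02-29")` returns `None` because 1981 was not a leap year, so the entry can be rejected at the point of capture rather than later in the pipeline.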

Data Validation (Simple)

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

e.g.: Lab Data Value unlikely to be > nnn

Failed records get diverted for cleansing processes
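A simple rule and the diversion of failed records can be sketched like this in Python. The rule table, test codes and limits are hypothetical placeholders standing in for "nnn" and are not clinical reference values:

```python
from typing import Callable

# Hypothetical rule table: test code -> upper plausibility limit.
# The limits are placeholders, not clinical reference values.
UPPER_LIMITS = {"GLUC": 50.0, "HGB": 25.0}


def passes_range_rule(record: dict) -> bool:
    """Simple rule: the value must not exceed the limit configured for its test."""
    limit = UPPER_LIMITS.get(record["test"])
    return limit is None or record["value"] <= limit


def route(records: list[dict], rule: Callable[[dict], bool]):
    """Split records into those that pass and those diverted for cleansing."""
    passed, diverted = [], []
    for record in records:
        (passed if rule(record) else diverted).append(record)
    return passed, diverted


records = [
    {"subject_id": "001", "test": "GLUC", "value": 5.4},
    {"subject_id": "002", "test": "GLUC", "value": 540.0},  # implausibly high
]
passed, diverted = route(records, passes_range_rule)
```

Keeping the limits in a table rather than in the code mirrors the separation between validation steps and DV rules shown in the diagram.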

Data Validation (Complex)

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

• Across different sources

This lab value is not consistent with that EDC value (e.g. Males can’t be pregnant)

• Summarisation Checks
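The two kinds of complex check above can be sketched in Python as follows. The record layouts, the `PREG` test code and the choice of per-subject row counts as the summary are assumptions made for illustration:

```python
from collections import Counter


def cross_source_issues(edc: dict, lab_rows: list[dict]) -> list[str]:
    """Flag lab results that contradict demographics recorded in the EDC."""
    issues = []
    for row in lab_rows:
        subject = edc.get(row["subject_id"])
        if subject is None:
            issues.append(f"{row['subject_id']}: lab record has no EDC subject")
        elif (
            row["test"] == "PREG"
            and row["result"] == "POSITIVE"
            and subject["sex"] == "M"
        ):
            issues.append(f"{row['subject_id']}: positive pregnancy test for male subject")
    return issues


def count_mismatches(source_rows: list[dict], hub_rows: list[dict]) -> dict:
    """Summarisation check: per-subject row counts should match between source and hub."""
    source = Counter(r["subject_id"] for r in source_rows)
    hub = Counter(r["subject_id"] for r in hub_rows)
    return {s: (source[s], hub[s]) for s in source.keys() | hub.keys() if source[s] != hub[s]}


# Hypothetical EDC demographics and lab records.
edc = {"001": {"sex": "M"}, "002": {"sex": "F"}}
lab = [
    {"subject_id": "001", "test": "PREG", "result": "POSITIVE"},
    {"subject_id": "002", "test": "PREG", "result": "POSITIVE"},
    {"subject_id": "003", "test": "GLUC", "result": "5.4"},
]
issues = cross_source_issues(edc, lab)
mismatches = count_mismatches(lab, lab[:2])  # pretend the last row was not loaded
```

Unlike the simple rules, these checks need data from more than one source, or from both ends of the pipeline, to be available at the same time.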

Scheduling & Batch Processing

[Diagram: ETL pipeline diagram, repeated from the “ETL Overview” slide]

• Frequencies

• Dependencies

• Alerts
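Dependencies and alerts can be illustrated with Python's standard-library `graphlib` module (Python 3.9+). The job names and the `run_batch` helper are hypothetical, and frequency (when each batch is triggered) is left to whatever scheduler is in use:

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each maps to the set of jobs it depends on.
DEPENDENCIES = {
    "stage_lab": set(),
    "stage_ecrf": set(),
    "validate": {"stage_lab", "stage_ecrf"},
    "load_hub": {"validate"},
}


def run_batch(jobs: dict, actions: dict, alert=print) -> list[str]:
    """Run jobs in dependency order; raise an alert and stop at the first failure."""
    completed = []
    for name in TopologicalSorter(jobs).static_order():
        try:
            actions[name]()
        except Exception as exc:
            alert(f"ALERT: job {name!r} failed: {exc}")
            break
        completed.append(name)
    return completed


def failing_job():
    raise RuntimeError("source unavailable")


alerts = []
actions = {name: (lambda: None) for name in DEPENDENCIES}
order = run_batch(DEPENDENCIES, actions, alert=alerts.append)

# Simulate a failure in the validation step.
actions["validate"] = failing_job
partial = run_batch(DEPENDENCIES, actions, alert=alerts.append)
```

Stopping the whole batch at the first failure is a simplification: a fuller scheduler would skip only the jobs downstream of the failed one and let independent jobs continue.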

Data Lake – “Trawling”

[Diagram: Data Engineering Landscape, repeated]

• Big-Data

• Unstructured/Semi-Structured

• Just-in-case - “Disk space is cheap”

Data Marketplace – “Targeting”

[Diagram: Data Engineering Landscape, repeated]

• Small-Data

• Specific Information

• “As required” via APIs

Conclusion

Data Engineering Project (Educating for the Future PHUSE Working Group)

[email protected]

[email protected]

No copyright infringement intended.

https://education.phuse.eu/eftf/data-engineering/