Page 1
DH03 – Educating for the Future:
Data Engineering Techniques
https://education.phuse.eu/eftf/data-engineering/
Contribution:
Andy Richardson, Amy Gillespie, Beate Hientzsch, Berber Snoeijer, Beverly Hayes, Jatin Patel, Karnika Dalal, Mark Bynens, Mike Carniello, Parag Shiralkar, Paul Slagle, Ralf Goetzelmann, Rohit Banga, Sagar Jain, Susan Olson, Vijay Pasapula, Vince Marinelli, Xiaohui Wang
Special thanks: Ian Fleming, James McDermott, Wendy Dobson
Presentation:
Guy Garrett, Achieve Intelligence
Renu Shukla, Janssen Research & Development
Mohit Juneja, LyfeScience
Page 2
Agenda
• Introduction
• Master Data Management - Principle
• ETL Overview – “Transforming”
• Data Lake – “Trawling”
• Data Marketplace – “Targeting”
• Conclusion
Page 4
Data Engineering Landscape
[Diagram. Data sources: ePro, RWE, IoT, Bio-markers, eDiaries, HEcon, Wearables, Omics, RWD, Profiles, EHR, CrowdSource, eCRF, Lab. Components: Data Pipeline, Data Lake, Centralised Data Hub, Data Marketplace.]
Page 6
Study | Study | Study
Data Management & Quality Control
Analysis & Reporting
Data Capture & Site Management
Master Data Management
Page 7
Master Data Management
Data Management & Quality Control
Analysis & Reporting
Data Capture & Site Management
eCRF
DM Scripts
A&R Scripts
Page 8
eCRF
eCRF
Lab
Others
Automated Data Ingestion Routines
Master Data Management
* F.A.I.R. = Findability, Accessibility, Interoperability, Reusability
Centralised Data Hub
Page 9
Master Data Management
• Principle
Approach                  Investment                  Return on Investment
Master Data Management    More up-front investment    Longer-term savings
Siloed Data Processing    Less up-front investment    More costly long-term
The Data Engineering Project is looking for use cases to evidence this principle:
1) Measure current processes
2) Adopt MDM
3) Measure MDM processes
4) Analyse ROI
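One way the "Analyse ROI" step could be framed is as a break-even calculation. The sketch below assumes a simple linear cost model (a one-off up-front cost plus a constant cost per study); the function names and every figure in the usage lines are hypothetical placeholders, not measurements from any use case.

```python
import math
from typing import Optional


def cumulative_cost(upfront: float, per_study: float, studies: int) -> float:
    """Total cost after a number of studies under a linear cost model (assumption)."""
    return upfront + per_study * studies


def break_even_studies(
    mdm_upfront: float,
    mdm_per_study: float,
    silo_upfront: float,
    silo_per_study: float,
) -> Optional[int]:
    """Smallest number of studies at which MDM is no more costly than the siloed approach.

    Returns None if MDM never catches up under this model.
    """
    extra_upfront = mdm_upfront - silo_upfront
    saving_per_study = silo_per_study - mdm_per_study
    if extra_upfront <= 0:
        return 0  # MDM is already no more expensive before any study runs
    if saving_per_study <= 0:
        return None  # no per-study saving, so the extra investment is never recovered
    return math.ceil(extra_upfront / saving_per_study)


# Hypothetical cost units, for illustration only.
n = break_even_studies(mdm_upfront=100, mdm_per_study=10, silo_upfront=20, silo_per_study=30)
```

With measured values from steps 1 and 3 in place of the placeholders, the same comparison indicates after how many studies the up-front investment would be recovered.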
Page 12
Agenda
• Introduction
• Master Data Management - Principle
• ETL Overview – “Transforming”
– Definition
– Staging & Loading
– Change Data Capture
– Data Validation
– Scheduling & Batch Processing
• Data Lake – “Trawling”
• Data Marketplace – “Targeting”
• Conclusion
Page 14
ETL Overview – “Transforming”
• Wikipedia definition
“In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).”
“…data engineering, is a software engineering approach to designing and developing information systems.”
i.e. it’s about building data pipelines rather than datasets.
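A minimal sketch of the "pipelines rather than datasets" idea: extract, transform and load are written as composable steps, so the same pipeline can be re-run whenever the source changes. The record layout, the field names (`SUBJID`, `SEX`) and the in-memory list standing in for the destination system are assumptions for illustration only.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def extract(rows: Iterable[Record]) -> Iterator[Record]:
    """Extract: yield records from a source (here, an in-memory iterable)."""
    for row in rows:
        yield dict(row)  # copy so the source is never mutated


def transform(rows: Iterable[Record]) -> Iterator[Record]:
    """Transform: represent the data differently from the source."""
    for row in rows:
        yield {
            "subject_id": row["SUBJID"].strip(),
            "sex": row["SEX"].strip().upper(),
        }


def load(rows: Iterable[Record], target: List[Record]) -> int:
    """Load: append records to the destination and return how many were loaded."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


def run_pipeline(source: Iterable[Record], target: List[Record]) -> int:
    """Chain the three steps; nothing is materialised between them."""
    return load(transform(extract(source)), target)


source_data = [{"SUBJID": " 001 ", "SEX": "f"}, {"SUBJID": "002", "SEX": "M "}]
hub: List[Record] = []
loaded = run_pipeline(source_data, hub)
```

The dataset in `hub` is a by-product; the reusable asset is `run_pipeline`.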
Page 16
ETL Overview – “Transforming”
[Diagram. Components: Source Data, Data Pipelines, Staging Area, Central Data Hub, Control Tables, Data Validation (Simple), Data Validation (Complex), DV Rules (Simple), DV Rules (Complex), Scheduling Controls.]
Page 17
Staging
Page 18
Loading
Page 19
Change Data Capture
Changes: Additions, Updates, Deletions
Methods:
• Database Timestamping (Control Tables)
• Delta Identification (Comparison)
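The two methods above can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed design: snapshots are modelled as dictionaries keyed by a record identifier, the `updated_at` field and the `last_loaded` value (as it might be read from a control table) are hypothetical names.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Snapshot = Dict[str, dict]


def identify_delta(previous: Snapshot, current: Snapshot) -> Tuple[List[str], List[str], List[str]]:
    """Delta identification: compare two snapshots and classify the changes by key."""
    additions = [key for key in current if key not in previous]
    updates = [key for key in current if key in previous and current[key] != previous[key]]
    deletions = [key for key in previous if key not in current]
    return additions, updates, deletions


def changed_since(rows: List[dict], last_loaded: datetime) -> List[dict]:
    """Timestamping: keep rows modified after the time recorded for the last load."""
    return [row for row in rows if row["updated_at"] > last_loaded]


previous = {"001": {"weight": 70}, "002": {"weight": 82}}
current = {"001": {"weight": 71}, "003": {"weight": 64}}
additions, updates, deletions = identify_delta(previous, current)

rows = [
    {"id": "001", "updated_at": datetime(2020, 1, 10)},
    {"id": "002", "updated_at": datetime(2020, 1, 20)},
]
recent = changed_since(rows, last_loaded=datetime(2020, 1, 15))
```

Timestamping only sees rows that still exist in the source, so physically deleted records need comparison (or soft-delete flags) to be detected.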
Page 21
Data Validation (At Source)
e.g. EDC Patient DOB must be a valid date.
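A check of this kind might look like the sketch below. It assumes the date of birth arrives as text in `YYYY-MM-DD` form; the function name and that format are illustrative assumptions, not a description of any particular EDC system.

```python
from datetime import date, datetime
from typing import Optional


def parse_dob(value: Optional[str]) -> Optional[date]:
    """Return the date if `value` is a real calendar date in YYYY-MM-DD form, else None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # TypeError: value is not a string; ValueError: wrong format or impossible date
        return None


valid = parse_dob("1980-02-29")    # 1980 was a leap year
invalid = parse_dob("1981-02-29")  # 29 February does not exist in 1981
```

Rejecting the value at the point of entry means the problem never reaches the pipeline.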
Page 22
Data Validation (Simple)
e.g. Lab data value unlikely to be > nnn
Failed records get diverted for cleansing processes
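A simple rule with diversion could be sketched as below. The limit is passed in as a parameter because the actual threshold would come from the DV rules; the value used in the usage lines is an arbitrary placeholder, and the field names are hypothetical.

```python
from typing import List, Tuple


def split_by_upper_limit(records: List[dict], field: str, upper_limit: float) -> Tuple[List[dict], List[dict]]:
    """Route each record to `passed` or `diverted` based on a single-field range rule."""
    passed, diverted = [], []
    for record in records:
        value = record.get(field)
        if isinstance(value, (int, float)) and value <= upper_limit:
            passed.append(record)
        else:
            # Missing, non-numeric or implausibly high values go to cleansing
            diverted.append(record)
    return passed, diverted


lab_records = [
    {"id": 1, "result": 5.0},
    {"id": 2, "result": 250.0},
    {"id": 3, "result": None},
]
# 100.0 is an arbitrary placeholder limit, not a clinical threshold.
passed, diverted = split_by_upper_limit(lab_records, "result", upper_limit=100.0)
```

Diverted records are kept rather than dropped, so they can be corrected and re-submitted to the pipeline.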
Page 23
Data Validation (Complex)
• Across different sources: this lab value is not consistent with that EDC value (e.g. males can’t be pregnant)
• Summarisation checks
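Both kinds of complex check can be illustrated with a short sketch. The data shapes are assumptions: EDC sex is modelled as a mapping from subject to `"M"`/`"F"`, lab rows as dictionaries with hypothetical `test` and `result` codes, and the summarisation check as a per-subject record count comparison between source and hub.

```python
from collections import Counter
from typing import Dict, List


def cross_source_issues(edc_sex: Dict[str, str], lab_rows: List[dict]) -> List[str]:
    """Return subjects whose lab data contradicts the EDC record (positive pregnancy test, male)."""
    issues = []
    for row in lab_rows:
        is_positive_pregnancy = row["test"] == "PREGNANCY" and row["result"] == "POSITIVE"
        if is_positive_pregnancy and edc_sex.get(row["subject_id"]) == "M":
            issues.append(row["subject_id"])
    return issues


def counts_match(source_rows: List[dict], hub_rows: List[dict], key: str = "subject_id") -> bool:
    """Summarisation check: the number of records per subject is the same in source and hub."""
    source_counts = Counter(row[key] for row in source_rows)
    hub_counts = Counter(row[key] for row in hub_rows)
    return source_counts == hub_counts


edc_sex = {"001": "F", "002": "M"}
lab_rows = [
    {"subject_id": "001", "test": "PREGNANCY", "result": "POSITIVE"},
    {"subject_id": "002", "test": "PREGNANCY", "result": "POSITIVE"},
    {"subject_id": "002", "test": "GLUCOSE", "result": "NORMAL"},
]
issues = cross_source_issues(edc_sex, lab_rows)
complete = counts_match(lab_rows, lab_rows[:2])
```

These checks need more than one source to be available together, which is why they sit later in the pipeline than the simple rules.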
Page 25
Scheduling & Batch Processing
• Frequencies
• Dependencies
• Alerts
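Dependencies and alerts can be sketched with the standard-library `graphlib` module (Python 3.9+). The job names and the `alert` callback are hypothetical, and frequency (when the batch starts) is assumed to be handled by an external scheduler.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_batch(
    jobs: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
    alert: Callable[[str], None],
) -> Dict[str, str]:
    """Run jobs in dependency order; alert on failure and skip anything downstream."""
    graph = {name: dependencies.get(name, set()) for name in jobs}
    status: Dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = dependencies.get(name, set())
        if any(status.get(dep) != "ok" for dep in upstream):
            status[name] = "skipped"
            alert(f"{name} skipped: upstream job did not succeed")
            continue
        try:
            jobs[name]()
            status[name] = "ok"
        except Exception as exc:  # any failure is reported, not propagated
            status[name] = "failed"
            alert(f"{name} failed: {exc}")
    return status


def failing_job() -> None:
    raise RuntimeError("source unavailable")


alerts = []
status = run_batch(
    jobs={
        "stage_lab": lambda: None,
        "stage_edc": failing_job,
        "validate_lab": lambda: None,
        "load_hub": lambda: None,
    },
    dependencies={"validate_lab": {"stage_lab"}, "load_hub": {"validate_lab", "stage_edc"}},
    alert=alerts.append,
)
```

Here `load_hub` is skipped because `stage_edc` failed, and both events reach the alert callback.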
Page 28
Data Lake – “Trawling”
• Big-Data
• Unstructured/Semi-Structured
• Just-in-case - “Disk space is cheap”
Page 30
Data Marketplace – “Targeting”
• Small-Data
• Specific Information
• “As required” via APIs
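As a rough illustration of "specific information, as required", the sketch below shows an accessor that returns only the rows and fields a consumer asks for. It is an in-process stand-in for an API call; the function name, parameters and record layout are hypothetical.

```python
from typing import List, Sequence


def get_lab_results(
    hub: List[dict],
    subject_id: str,
    test: str,
    fields: Sequence[str] = ("subject_id", "test", "result"),
) -> List[dict]:
    """Return only the requested fields for one subject and one test."""
    return [
        {field: row[field] for field in fields}
        for row in hub
        if row["subject_id"] == subject_id and row["test"] == test
    ]


hub = [
    {"subject_id": "001", "test": "GLUCOSE", "result": 5.1, "site": "A"},
    {"subject_id": "001", "test": "ALT", "result": 30, "site": "A"},
    {"subject_id": "002", "test": "GLUCOSE", "result": 6.3, "site": "B"},
]
answer = get_lab_results(hub, subject_id="001", test="GLUCOSE")
```

The consumer receives a small, targeted answer instead of trawling the full store.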
Page 32
Conclusion
Page 33
Data Engineering Project (Educating for the Future PHUSE Working Group)
[email protected]
[email protected]
No copyright infringement intended.
https://education.phuse.eu/eftf/data-engineering/