LECTURE @DHBW: DATA WAREHOUSE
PART III: ETL AND DB SPECIFICS
ANDREAS BUCKENHOFER, DAIMLER TSS
A company of Daimler AG

Oct 05, 2020


ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer
Senior DB [email protected]

Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics

NOT JUST AVERAGE: OUTSTANDING.

As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.


INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making process, ability to react quickly)


Daimler TSS

LOCATIONS

Data Warehouse / DHBW

Daimler TSS Germany: 7 locations, 1000 employees*, including Ulm (Headquarters), Stuttgart, Berlin, Karlsruhe
Daimler TSS China: Hub Beijing, 10 employees
Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees
Daimler TSS India: Hub Bangalore, 22 employees

* as of August 2017

WHAT YOU WILL LEARN TODAY

After the end of this lecture you will be able to
• Understand the concepts behind ETL

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Figure: logical standard data warehouse architecture. Internal and external data sources (e.g. OLTP systems) feed the backend of the data warehouse. The layers are: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, and Mart Layer (Output Layer / Reporting Layer), which serves the frontend. Metadata Management, Security, and the DWH Manager incl. Monitor span the layers. Four positions in the diagram are marked with "?".]

ETL PROCESS

Extract – Transform – Load

Other term: data integration (better, more neutral)

TASKS OF THE ETL PROCESS - EXTRACT

• Capture and copy data from source systems (e.g. operational systems)
• Many different types of sources
  • Relational, hierarchical DBMSs
  • Flat files
  • Other internal/external sources

TASKS OF THE ETL PROCESS - TRANSFORM

• Filter data
• Integrate data
• Check and cleanse data

TASKS OF THE ETL PROCESS - LOAD

• Original meaning: fast load into the staging area
• General meaning: loading data into the staging area or another layer

ETL VS ELT

"ETL" is often used as a term for data integration in general (covering both ETL and ELT).

But: if ELT is mentioned explicitly, it is differentiated from ETL.

[Figure: two diagrams of the data flow between source DB and target DB, one with an ETL server and one with an ELT server.]

ETL VS ELT

ETL | ELT
Data is transferred to the ETL server and transferred back to the DB; high network bandwidth required | Data remains in the DB except for cross-database loads (e.g. source to target)
Transformations are performed in the ETL server | Transformations are performed (in the source or) in the target
Proprietary code is executed in the ETL server | Generated code, e.g. SQL, PL/SQL, SQLT
Typically used for source-to-target transfer, compute-intensive transformations, small amounts of data | Typically used for high amounts of data

ETL/ELT TOOL VS MANUAL ETL/ELT

ETL tool | Manual ETL
Informatica, Talend, Oracle ODI, etc. | SQL, PL/SQL, SQLT, etc.
Separate license | No additional license
Workflow, error handling, and restart/recovery functionality included | Workflow, error handling, and restart/recovery functionality must be implemented manually
Impact analysis and where-used (lineage) functionality available | Impact analysis and where-used (lineage) functionality difficult
Faster development, easier maintenance | Slower development, more difficult maintenance
Additional (tool) know-how required | Know-how often available

ETL/ELT TOOL VS MANUAL ETL/ELT

[Figure: services provided by an ETL tool]
• Extract services: Connectors, Sorter
• Load services: Connector, Sorter, Bulk Loader
• Data Profiling services: Source analysis
• Data Quality services: Data cleansing
• Data Transformation and Integration services: Data mapping, Business rules, Slowly Changing Dimensions, Datatype conversion, Lookups
• Operations management services: Scheduler, Control, Repository Management, Job Monitoring, Auditing, Error Handling, Security

MAPPING - INFORMATICA

[Screenshot: Informatica mapping from Source to Target with Filter and Lookup transformations]

MAPPING WITH TRANSFORMATIONS - INFORMATICA

[Screenshot: Informatica mapping with Sorter, Aggregator Transformation, and Union Transformation]

DATA MAPPING

• Specification of the mapping between source and target columns
  • Source tables + columns
  • Target table + columns
  • Join rules
  • Filter criteria
  • Transformation rules

WORKFLOW - INFORMATICA

[Screenshot: Informatica workflow with decision & coordination steps and sessions, each session containing a mapping]

JOB MONITORING - INFORMATICA

[Screenshot: job monitoring in Informatica]

MONITORING (DATA CHANGE DETECTION)

Extracts from source systems
• Initial extract for setting up the data warehouse: Initial Load
• Periodic extracts for adding new/changed information to the data warehouse: Incremental Load

Question: How to determine what is new or what has changed in the source systems?

This is the task of “monitoring”.

MONITORING: NET EFFECT OF CHANGES

Discovery of all changes vs. determining only the net effect at extract/load time

• Example: an attribute value can be changed in two ways:
  • by one update operation
  • by one delete and one insert operation

The net effect of both is the same.

However, history information is lost if only the net effect is recorded.
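The difference between all changes and the net effect can be illustrated with a small sketch. The change log format and the `net_effect` helper below are hypothetical and only serve the illustration; they do not correspond to any specific product.

```python
# Hypothetical change log for one customer row: (operation, key, value).
change_log = [
    ("insert", 1, "Ulm"),
    ("update", 1, "Stuttgart"),
    ("delete", 1, None),
    ("insert", 1, "Berlin"),
]


def net_effect(log):
    """Fold a change log into the final state per key (history is discarded)."""
    state = {}
    for op, key, value in log:
        if op == "delete":
            state.pop(key, None)
        else:  # insert or update
            state[key] = value
    return state


final_state = net_effect(change_log)
print(final_state)       # only the last value per key survives
print(len(change_log))   # the full log still holds every intermediate change
```

An extract that sees only `final_state` cannot tell that the row was in Ulm and Stuttgart before, or that it was deleted and re-inserted in between.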

EXERCISE MONITORING

Which techniques can be used to identify changes in a source system (RDBMS)?

• E.g. in an OLTP system
  • New products are inserted
  • A customer address changes
  • A product is deleted because it is out of stock

How would you identify such changes? List advantages / disadvantages of possible solutions.

Think about making changes in the source system. Think also about other solutions without any change in the source system.

MONITORING TECHNIQUES

The applicable techniques depend on the characteristics of the data sources.

The following techniques are based on modern relational DBMSs.

Types of techniques
• Based on DBMS
  • Trigger-based
  • Log-based discovery
  • Replication techniques
• Controlled by application
  • Timestamp-based discovery
  • Snapshot-based discovery

TRIGGER-BASED

Active monitoring mechanisms

Based on (database) triggers
• Example: if a new record is inserted into the sales transaction table, then insert the transaction id and a timestamp into a change table

Advantage:
• Triggers do not change operational applications

Disadvantages:
• Performance impact on operational systems if triggers are used extensively
• Triggers have to be implemented for every table in the source systems

TRIGGER-BASED

Sample trigger code, Oracle:

CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE}
ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>

A trigger is created for each source table in the OLTP DB and stores insert/update/delete changes in a “log/journal table”
• The trigger body contains insert statements into the log/journal table
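A minimal runnable sketch of the journal-table idea follows. It uses SQLite through Python's `sqlite3` module instead of Oracle, so the trigger syntax differs slightly from the template above; the table names `sales` and `sales_journal` and the operation codes `I`/`U`/`D` are assumptions made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);
-- Journal table filled by the triggers (hypothetical layout).
CREATE TABLE sales_journal (
    id INTEGER, operation TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER sales_ins AFTER INSERT ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (id, operation) VALUES (NEW.id, 'I');
END;
CREATE TRIGGER sales_upd AFTER UPDATE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (id, operation) VALUES (NEW.id, 'U');
END;
CREATE TRIGGER sales_del AFTER DELETE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (id, operation) VALUES (OLD.id, 'D');
END;
""")

# The operational application is unchanged: it only touches the sales table.
con.execute("INSERT INTO sales VALUES (1, 10.0)")
con.execute("UPDATE sales SET amount = 12.5 WHERE id = 1")
con.execute("DELETE FROM sales WHERE id = 1")

journal = con.execute(
    "SELECT id, operation FROM sales_journal ORDER BY rowid"
).fetchall()
print(journal)
```

The journal records all three changes, including the delete, although the row no longer exists in `sales`. The cost is one extra insert per changed row in the operational system.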

LOG-BASED

Log-based discovery

Also often referred to as CDC (Change Data Capture)

Usage of database transaction logs to determine changes
• DBMSs write transaction logs in order to be able to undo partially executed transactions
• This information can be used to determine all changes
• A log reader identifies inserts, updates, deletes, and truncates and writes the changes as inserts into the staging layer

Transaction log files can be transferred to other systems to avoid additional load on the source systems

LOG-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)

[Figure: sample product architecture with IIDR. Components shown: OLTP DB (source datastore) with its transaction logs and the IIDR replication engine for the source; IIDR replication engine for the DWH and the DWH DB (DWH datastore) with Staging Layer, Core Layer, and Mart Layer; frontend with standard reports and ad hoc reports.]

REPLICATION-BASED

Replication techniques

Data replication
• Target tables are not necessarily on the local system
• Typically uses transaction logs
• A log reader identifies inserts, updates, deletes, and truncates and writes the changes into replicated tables (an insert remains an insert, an update remains an update, etc.)
• Useful for 1:1 copies (e.g. ODS, Operational Data Store), but detecting changes for loading the data mart remains a challenge

REPLICATION-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)

[Figure: sample product architecture with IIDR. Components shown: OLTP DB (source datastore) with its transaction logs and the IIDR replication engine for the source; IIDR replication engine for the DWH and the DWH DB (DWH datastore) with Staging Layer, Core Layer, and Mart Layer; frontend with standard reports and ad hoc reports.]

TIMESTAMP-BASED

Timestamp-based discovery
• Every data item in a table is associated with timestamp information about its validity period
• Changed data can be determined from this timestamp information

TIMESTAMP-BASED

Sample customer table in OLTP
• Each table gets a change timestamp
• The delta process reads the latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows

CustomerID | Name | Department | Change Timestamp
1 | Miller | DWH | 15.01.2015 17:00:01
2 | Powell | DB | 22.03.2016 08:30:22
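The delta read and its blind spot for deletes can be sketched as follows. This is a minimal example using SQLite via Python's `sqlite3`; the column names, the ISO timestamp format, and the `extract_delta` helper are assumptions made for the illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, "
    "department TEXT, change_ts TEXT)"  # ISO timestamps sort correctly as text
)
con.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Miller", "DWH", "2015-01-15 17:00:01"),
     (2, "Powell", "DB", "2016-03-22 08:30:22")],
)


def extract_delta(con, last_extract_ts):
    """Return rows changed after the previous extract (the 'watermark')."""
    return con.execute(
        "SELECT customer_id, name FROM customer WHERE change_ts > ? "
        "ORDER BY customer_id",
        (last_extract_ts,),
    ).fetchall()


delta = extract_delta(con, "2016-01-01 00:00:00")
print(delta)

# A physical delete leaves no row behind, so the next delta cannot report it.
con.execute("DELETE FROM customer WHERE customer_id = 1")
delta_after_delete = extract_delta(con, "2016-01-01 00:00:00")
print(delta_after_delete)
```

Both calls return only the Powell row: the deletion of customer 1 is invisible to the delta process, which matches the problem stated above.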

SNAPSHOT-BASED

Data comparison

Comparison of snapshots of the operational data at different points in time
• Compute the difference between the two latest snapshots
• E.g. unload all data from a table into a file and diff the newest file content with the previous file content

Can be very complex

Sometimes the only possibility, for instance for legacy applications

High performance impact on the source
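A snapshot comparison can be sketched in a few lines if each snapshot is held as a dictionary keyed by primary key. The `diff_snapshots` function and the sample rows are hypothetical; real unloads are usually far too large for this in-memory approach and are compared with sorted files or hash values instead.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots, each a dict mapping primary key -> row tuple."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


# Hypothetical unloads of a customer table on two consecutive days.
yesterday = {1: ("Miller", "DWH"), 2: ("Powell", "DB"), 3: ("Smith", "BI")}
today = {1: ("Miller", "DWH"), 2: ("Powell", "Analytics"), 4: ("Jones", "DB")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
print(inserted, updated, deleted)
```

Deletes are detected, but only the net effect between the two snapshots is visible: several changes to the same row between the unloads collapse into one.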

MONITORING TECHNIQUES COMPARISON

Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Performance impact on source system | Medium | Low | Low | Medium | High
Performance impact on target system | Low | Low | Low | Low | High
Load on network | Low | Low | Low | Low | High
Data loss if nologging operations | No | Yes | Yes | No | No

MONITORING TECHNIQUES COMPARISON

Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Identify DELETE operations | Yes | Yes | Yes | No | Yes
Identify ALL changes (changes between extractions) | Yes | Yes | Yes | No | No

DATA TRANSPORT – DIRECT ACCESS

Direct access
• Source writes data into target, or
• Target reads data from source
• Security concerns
• High coupling / dependencies

[Figure: source and target connected directly]

DATA TRANSPORT – FILE TRANSFER

File transfer (or other transport medium)
• csv, json, xml, binary, etc.
• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc.
• Often high amounts of data, therefore bulk transfer of compressed data is most widely used
• Better decoupling of source and target

[Figure: source and target connected via files]

EXTRACTION INTERVALS

Extraction intervals
• Periodically, in regular intervals
  • Every day, week, etc.
• Instantly / continuously
  • Every change is directly propagated into the data warehouse
  • “Real time data warehouse”
• Depends on the requirements on timeliness of the data warehouse data

EXTRACTION INTERVALS

Triggered by a specific request
• Addition of a new product
• Query which involves more recent data

Triggered by specific events
• Number of changes in operational data exceeds a threshold

PREREQUISITE OF TRANSFORMATION: UNDERSTANDING THE DATA

• Profile existing data sources and extracted data
• Analyze data structure, content, and quality
• Find data relationships across systems
  • Foreign keys are often badly documented or missing
• Uncover data issues that can affect subsequent transformation steps
  • Missing values
  • Duplicates
  • Inconsistencies

DATA PYRAMID AND DATA QUALITY

[Figure: data pyramid]

Source: By Matthew.viel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49310779
LinkedIn 11/2017: https://www.linkedin.com/feed/update/urn:li:activity:6334062387355746304

DATA QUALITY ISSUES

CustomerNo | Name | Birthdate | Age | Gender | Zip code
1 | Miller, Tom | 33.01.2001 | 15 | M | NULL
1 | John Mayor | 15.01.2001 | 15 | M | 98144
2 | Mrs. Bush | 31.10.1988 | 22 | Q | 00000
3 | Martin | 31.10.1988 | 22 | M | 75890

Issues annotated in the figure:
• PK / Unique Key violated (CustomerNo 1 occurs twice)
• Data not uniform (different name formats)
• Not valid (birthdate 33.01.2001)
• Inconsistent (birthdate and age do not match)
• Wrong value (gender Q)
• Unknown / missing (zip code NULL)
• FK violated (zip code)

DATA QUALITY ISSUES AND POSSIBLE SOLUTIONS IN THE SOURCE RDBMS

Issue | Solution
Wrong data, e.g. 31.02.2016 | Proper data type definition
Wrong values, e.g. number out of range | CHECK constraint
Missing values | NOT NULL constraint
Violated references | FOREIGN KEY constraint
Duplicates | PRIMARY or UNIQUE KEY constraint
Inconsistent data | ACID transactions, business logic, additional checks
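The constraint types listed above can be tried out with a small sketch. It uses SQLite through Python's `sqlite3` rather than the source RDBMS of the lecture; the table layout and sample rows are assumptions that loosely follow the customer example. Note that SQLite does not enforce strict data types by default, so the "proper data type" row of the table is not covered here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE zip_code (zip TEXT PRIMARY KEY);
CREATE TABLE customer (
    customer_no INTEGER PRIMARY KEY,                -- no duplicates
    name        TEXT NOT NULL,                      -- no missing values
    gender      TEXT CHECK (gender IN ('F', 'M')),  -- only valid codes
    zip         TEXT REFERENCES zip_code (zip)      -- no violated references
);
INSERT INTO zip_code VALUES ('98144');
INSERT INTO customer VALUES (1, 'John Mayor', 'M', '98144');
""")

bad_rows = [
    (1, "Tom Miller", "M", "98144"),  # duplicate primary key
    (2, None, "F", "98144"),          # missing name
    (3, "Mrs. Bush", "Q", "98144"),   # invalid gender code
    (4, "Martin", "M", "00000"),      # zip code not in the reference table
]
rejected = 0
for row in bad_rows:
    try:
        con.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejected += 1
        print(row, "->", exc)

row_count = con.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(rejected, row_count)
```

Every bad row is rejected by the database itself, so none of these issues can reach the data warehouse from this table.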


DATA QUALITY ISSUES: WORKAROUNDS IN DWH

Correcting the data
• Automatically during ETL
  • E.g. the address of a customer if a correct reference table exists
• Manually after ETL is finished
  • ETL stores bad data in error log tables or files
  • ETL flags bad data (e.g. as invalid)

DATA QUALITY ISSUES: CORRECT DATA IN THE SOURCE

Correcting the data
• In the source systems
  • Common master data management across all operational applications
  • Dedicated systems are the “master” of e.g. customer data
• Correcting the data at the source is the best approach, but it is slow and often not feasible

DATA QUALITY ISSUES: MISSING DATA

• Column is null
  • Reject data
  • Use default values
• Missing values can represent
  • an unknown value, like the date of birth of a customer
  • a missing value, like the engine_id for a car (logical not null constraint)
• Dimension tables can include some dummy values:

DimensionTable_X | Description
-1 | Unknown
-2 | Missing

DATA QUALITY ISSUES: INACCURATE DATA

• Data is inaccurate, e.g. wrong date 32.12.2015 or wrong number 55U
  • Reject data
  • Replace with a value that represents “Invalid”
• Dimension tables can include some dummy values:

DimensionTable_X | Description
-1 | Unknown
-2 | Missing
-3 | Invalid
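How an ETL step could assign the dummy keys -1, -2, and -3 is sketched below. The `date_key` function, the `mandatory` flag, and the YYYYMMDD key format are assumptions for this illustration; the rule that an absent optional value counts as "Unknown" and an absent mandatory value as "Missing" follows the distinction made in the slides.

```python
from datetime import datetime

UNKNOWN, MISSING, INVALID = -1, -2, -3  # dummy keys of the dimension table


def date_key(raw, mandatory=False):
    """Map a raw 'DD.MM.YYYY' string to a YYYYMMDD key or a dummy key."""
    if raw is None or raw.strip() == "":
        return MISSING if mandatory else UNKNOWN
    try:
        parsed = datetime.strptime(raw.strip(), "%d.%m.%Y")
    except ValueError:
        return INVALID
    return int(parsed.strftime("%Y%m%d"))


print(date_key("15.01.2001"))
print(date_key("32.12.2015"))
print(date_key(None))
print(date_key("", mandatory=True))
```

Using dummy keys instead of NULL keeps fact rows joinable to the dimension, and reports can show "Unknown", "Missing", or "Invalid" explicitly.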

DATA QUALITY ISSUES: CONFLICTING DATA

• Data has conflicts, e.g. wrong postal code: 80995 Stuttgart
  • Reject data
  • Replace one of the values with a value that represents “Invalid” or with a corrected value. Which value to replace? Rules are necessary.

DATA QUALITY ISSUES: INCONSISTENT DATA

• Data is inconsistent, e.g. unlikely high price for a product
• Can be discovered by statistical and data mining methods

DATA QUALITY ISSUES: DUPLICATES

• Data is duplicated, e.g. “Martin Miller” vs “Miller, Martin” vs “M.Miller”
• Multiple representations for one entity
  • Different keys
  • Different encodings
• Duplicate detection can be very difficult / tricky
• Products are available for e.g.
  • address duplicate detection
  • address validation (Kingstreet: does this address actually exist?)
  • address harmonization (Kingstr, Kingstreet, King Street, etc.)
• Standardize / harmonize data during the ETL flow: “unification”

TRANSFORM - UNIFICATION OF DATA

• Unification of data types
  • Character string “20.01.2006” → date 20.01.2006
  • Character string “12345” → number 12345
• Unification of encodings
  • For instance, unify gender codes to F and M
  • Lookup tables contain the mapping from old to new encodings
• Combination of different attributes into one attribute
  • day, month, year → date
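The three kinds of unification can be sketched as small functions. The lookup table content and the function names are hypothetical; a real lookup table would be maintained per source system.

```python
from datetime import date, datetime

# Hypothetical lookup table: source-specific gender codes -> unified codes.
GENDER_LOOKUP = {"f": "F", "female": "F", "2": "F",
                 "m": "M", "male": "M", "1": "M"}


def unify_gender(raw):
    """Return the unified code, or None if the source code is not mapped."""
    return GENDER_LOOKUP.get(raw.strip().lower())


def unify_date(raw):
    """Convert a 'DD.MM.YYYY' character string into a real date value."""
    return datetime.strptime(raw, "%d.%m.%Y").date()


def combine_date(day, month, year):
    """Combine separate day/month/year attributes into one date attribute."""
    return date(int(year), int(month), int(day))


print(unify_date("20.01.2006"), int("12345"), unify_gender("Female"),
      combine_date("20", "1", "2006"))
```

Codes that are not in the lookup table come back as `None`, so they can be rejected or mapped to an "Invalid" dummy value in a later step.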

TRANSFORM - UNIFICATION OF DATA

• Split of one attribute into two or more
  • Name → first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”)
  • Unification of names can become very challenging: “Herr Prof. Dr. Hans M. vom und zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine Huberta Gräfin von Schönburg-Glauchau”
  • Product name “Cola, 0.33 l” → product short name “Cola”, size in liters 0.33

TRANSFORM - UNIFICATION OF DATA

• Unification of dates and timestamps
  • Rules for representing incomplete date information, e.g. if only month and year are known
  • Dates and timestamps with regard to one specific time zone
    • Important for multi-national organizations
    • UTC (Coordinated Universal Time) has no daylight saving time
• What can happen if the clock is changed to winter time and UTC is not used?
  - An update arrives at 02:15 in the staging layer (CDC / log-based monitor)
  - The clock is changed to winter time: -1h
  - An update of the same row arrives at 02:10 in the staging layer (CDC / log-based)
  - How can the batch load running the next night discover which update is the most recent one?
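The winter-time scenario can be reproduced with Python's `datetime`. The sketch assumes Central European time with fixed offsets (+02:00 for summer time, +01:00 for winter time) and uses 28 October 2018 as the switch date; the variable names are chosen for this example only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Central European summer and winter time.
CEST = timezone(timedelta(hours=2))
CET = timezone(timedelta(hours=1))

# Two updates of the same row during the night the clock is set back.
first = datetime(2018, 10, 28, 2, 15, tzinfo=CEST)   # before the switch
second = datetime(2018, 10, 28, 2, 10, tzinfo=CET)   # after the switch (-1h)

# Comparing wall-clock values only (offset dropped) gives the wrong order.
by_local = sorted([first, second], key=lambda t: t.replace(tzinfo=None))
# Comparing in UTC gives the true order.
by_utc = sorted([first, second], key=lambda t: t.astimezone(timezone.utc))

print([t.strftime("%H:%M") for t in by_local])
print([t.astimezone(timezone.utc).strftime("%H:%M") for t in by_utc])
```

With local wall-clock timestamps the older update (02:15) looks like the most recent one. Stored in UTC, the updates are 00:15 and 01:10, and the batch load picks the correct latest version.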

TRANSFORM - UNIFICATION OF DATA

• Computation of derived values
  • Profit = sales price – purchase price
  • Without a clear definition, different interpretations are possible
    • Net or gross sales price?
    • Net or gross purchase price?
• Aggregations
  • Revenue of the year computed from the revenues of the day
  • Without a clear definition, different interpretations are possible
    • Calendar year?
    • Fiscal year?

LOAD

• Efficient load operations are important
  • Bulk load: single row processing vs set based processing
• Online load
  • Data warehouse (especially Data Mart) is still accessible
• Offline load
  • Data warehouse (especially Data Mart) is offline
  • For updates that require the recomputation of a cube
  • Offline load is often a tool limitation because the tool locks data structures. But offline load can be faster.

BULK PROCESSING

• Specific bulk load operations are provided by the RDBMS, e.g. external tables in Oracle or the LOAD command in DB2
• Single row vs set based processing

Single row processing (pseudocode):
  Cursor curs = SELECT * FROM <source>
  WHILE NOT EOF(curs)
    FETCH NEXT ROW INTO myRow;
    INSERT INTO <target> VALUES(myRow);
  LOOP
• Error handling easy
• Slow for high amounts of data
• More coding

Set based processing:
  INSERT INTO <target>
  SELECT * FROM <source>
• All or nothing if there are errors
• Performs well for small and high amounts of data
• Less code = fewer errors
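Both processing styles can be run side by side in a small sketch. It uses SQLite via Python's `sqlite3`; the tables `source`, `target_row`, and `target_set` are hypothetical names for this example, and no timing is measured.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE target_row (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE target_set (id INTEGER PRIMARY KEY, amount REAL);
""")
con.executemany("INSERT INTO source VALUES (?, ?)",
                [(i, i * 1.5) for i in range(1, 1001)])

# Single row processing: fetch each row and insert it individually.
for row in con.execute("SELECT id, amount FROM source").fetchall():
    con.execute("INSERT INTO target_row VALUES (?, ?)", row)

# Set based processing: one statement, executed entirely inside the database.
con.execute("INSERT INTO target_set SELECT id, amount FROM source")

count_row = con.execute("SELECT COUNT(*) FROM target_row").fetchone()[0]
count_set = con.execute("SELECT COUNT(*) FROM target_set").fetchone()[0]
print(count_row, count_set)
```

The loop allows per-row error handling (a `try`/`except` around each insert), while the single statement either loads all rows or none. With a client/server database the loop also pays one network round trip per row.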

ETL-JOB PARALLELISM FOR LOADING DATA INTO CORE WAREHOUSE LAYER

Time windows for loads, e.g. 00:00-06:00

Classical load
• Complex
• Many dependencies
• Many sequential jobs
• Integration of new jobs: marked with question marks in the figure

Data Vault load (stages: HUB loaded; LINK and HUB-SAT loaded; LINK-SAT loaded)
• Systematic / methodic
• Few, well defined dependencies
• Massive parallel

EXAMPLE FOR DATA INTEGRATION IN DATA VAULT 2.0 ARCHITECTURE

[Figure: example architecture with a Raw Data Vault (hard rules only) and a Business Data Vault (soft rules). The data flows are labelled “ETL, Monitoring”, “ETL”, “(E)T(L)”, and “ETL”.]

Source: Hans Hultgren: Modeling the agile Data Warehouse with Data Vault, New Hamilton 2012, p. 224

EXERCISE: LOAD DATA VAULT TABLE

Draw a flow diagram of how to load a HUB, LINK, and SAT table and describe the SQL statements

EXERCISE: LOAD HUB TABLE

Flow diagram:

Source data exist → Load distinct business keys → Does the business key exist in the HUB?

• no: Insert row into HUB (conflict if there is a PK hash key collision!)
• yes: Reject data

→ Data loaded into HUB

EXERCISE: LOAD HUB TABLE

-- Insert only business keys that are not yet present in the hub.
INSERT INTO core.hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.fin_bk
     , f.loaddate
     , f.recordsource
  FROM staging.fahrzeugdaten f
 WHERE f.fin_bk NOT IN (SELECT fin FROM core.hub_fahrzeug)
   AND f.loaddate = <date to load>;
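The hub load can be run end to end in a small sketch. It assumes SQLite via Python's sqlite3 module (so the table names carry no schema prefix), the helper names hash_key, stage and load_hub are hypothetical, and the sample FINs are made up.

```python
import hashlib
import sqlite3

def hash_key(business_key: str) -> str:
    # MD5 over the trimmed, upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_fahrzeugdaten (fahrzeug_hashkey TEXT, fin_bk TEXT,
                                loaddate TEXT, recordsource TEXT);
CREATE TABLE hub_fahrzeug (vehicle_hk TEXT PRIMARY KEY, fin TEXT,
                           loaddate TEXT, recordsource TEXT);
""")

def stage(fins, loaddate):
    # Fill the staging table with one delivery.
    conn.executemany(
        "INSERT INTO stg_fahrzeugdaten VALUES (?, ?, ?, 'demo')",
        [(hash_key(fin), fin, loaddate) for fin in fins])

def load_hub(loaddate):
    # Same statement shape as on the slide: distinct keys that are new to the hub.
    conn.execute("""
        INSERT INTO hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
        SELECT DISTINCT f.fahrzeug_hashkey, f.fin_bk, f.loaddate, f.recordsource
          FROM stg_fahrzeugdaten f
         WHERE f.fin_bk NOT IN (SELECT fin FROM hub_fahrzeug)
           AND f.loaddate = ?""", (loaddate,))
    return conn.execute("SELECT COUNT(*) FROM hub_fahrzeug").fetchone()[0]

stage(["FIN1", "FIN2", "FIN1"], "2024-01-01")   # FIN1 is delivered twice
after_first = load_hub("2024-01-01")
stage(["FIN2", "FIN3"], "2024-01-02")           # FIN2 is already known
after_second = load_hub("2024-01-02")
```

The primary key on vehicle_hk is what would raise the conflict from the flow diagram if two different business keys ever produced the same hash key.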

EXERCISE: LOAD LINK TABLE

Flow diagram:

Source data exist → Load distinct business keys → Does the hash key relationship exist in the LINK?

• no: Insert row into LINK (conflict if there is a PK hash key collision!)
• yes: Reject data

→ Data loaded into LINK

EXERCISE: LOAD LINK TABLE

-- Insert only relationships (motor, vehicle) that are not yet present in the link.
INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
SELECT DISTINCT f.verbaut_hashkey   -- link hash key; staging column name assumed
     , f.motor_hashkey
     , f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
  FROM staging.fahrzeugdaten f
 WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT IN (SELECT v.motor_hk, v.vehicle_hk FROM core.link_verbaut v)
   AND f.loaddate = <date to load>;
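A runnable sketch of the link load, again assuming SQLite and made-up keys. It swaps the multi-column NOT IN for NOT EXISTS, which expresses the same check and is not affected by NULLs in the subquery; the staging table name stg and the column verbaut_hashkey are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg (verbaut_hashkey TEXT, motor_hashkey TEXT,
                  fahrzeug_hashkey TEXT, loaddate TEXT, recordsource TEXT);
CREATE TABLE link_verbaut (verbaut_hk TEXT PRIMARY KEY, motor_hk TEXT,
                           vehicle_hk TEXT, loaddate TEXT, recordsource TEXT);
INSERT INTO stg VALUES ('L1', 'M1', 'V1', '2024-01-01', 'demo'),
                       ('L1', 'M1', 'V1', '2024-01-01', 'demo'),
                       ('L2', 'M2', 'V1', '2024-01-01', 'demo');
-- The relationship M2/V1 was already loaded on an earlier day.
INSERT INTO link_verbaut VALUES ('L2', 'M2', 'V1', '2023-12-31', 'demo');
""")

conn.execute("""
    INSERT INTO link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
    SELECT DISTINCT f.verbaut_hashkey, f.motor_hashkey, f.fahrzeug_hashkey,
           f.loaddate, f.recordsource
      FROM stg f
     WHERE NOT EXISTS (SELECT 1 FROM link_verbaut v
                        WHERE v.motor_hk = f.motor_hashkey
                          AND v.vehicle_hk = f.fahrzeug_hashkey)
       AND f.loaddate = ?""", ("2024-01-01",))

links = conn.execute(
    "SELECT verbaut_hk, loaddate FROM link_verbaut ORDER BY verbaut_hk").fetchall()
```

The duplicate staging row collapses through DISTINCT, and the known relationship keeps its original load date.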

EXERCISE: LOAD SAT TABLE

Flow diagram:

Source data exist → Load distinct source data and load the current/latest row from the SAT table → Is the MD5 hash diff identical?

• no: Insert row into SAT
• yes: Reject data

→ Data loaded into SAT

EXERCISE: LOAD SAT TABLE

-- Insert a new satellite row only if the key is new or its hash diff has changed.
INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
     , f.md5hash
     , f.codeleiste
     , f.kommentar
  FROM staging.fahrzeugdaten f
  LEFT OUTER JOIN (-- k: current (latest) row per vehicle in the satellite
                   SELECT s.vehicle_hk, s.md5_hash
                     FROM core.sat_fahrzeug_text s
                     JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
                             FROM core.sat_fahrzeug_text i
                            GROUP BY i.vehicle_hk) m
                       ON s.vehicle_hk = m.vehicle_hk
                      AND s.loaddate = m.loaddate) k
    ON f.fahrzeug_hashkey = k.vehicle_hk
 WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
   AND f.loaddate = <date to load>;
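The satellite load can be checked against three typical cases: a changed row, an unchanged row and a new key. The sketch assumes SQLite, a reduced column list and made-up hash values; the table name stg is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg (fahrzeug_hashkey TEXT, loaddate TEXT, md5hash TEXT, kommentar TEXT);
CREATE TABLE sat_fahrzeug_text (vehicle_hk TEXT, loaddate TEXT, md5_hash TEXT,
                                kommentar TEXT, PRIMARY KEY (vehicle_hk, loaddate));
INSERT INTO sat_fahrzeug_text VALUES ('V1', '2024-01-01', 'h1', 'old'),
                                     ('V2', '2024-01-01', 'h2', 'same');
INSERT INTO stg VALUES ('V1', '2024-01-02', 'h1x', 'new'),   -- changed
                       ('V2', '2024-01-02', 'h2',  'same'),  -- unchanged
                       ('V3', '2024-01-02', 'h3',  'first'); -- unknown key
""")

conn.execute("""
    INSERT INTO sat_fahrzeug_text (vehicle_hk, loaddate, md5_hash, kommentar)
    SELECT DISTINCT f.fahrzeug_hashkey, f.loaddate, f.md5hash, f.kommentar
      FROM stg f
      LEFT JOIN (SELECT s.vehicle_hk, s.md5_hash
                   FROM sat_fahrzeug_text s
                   JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
                           FROM sat_fahrzeug_text i
                          GROUP BY i.vehicle_hk) m
                     ON s.vehicle_hk = m.vehicle_hk
                    AND s.loaddate = m.loaddate) k
        ON f.fahrzeug_hashkey = k.vehicle_hk
     WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
       AND f.loaddate = ?""", ("2024-01-02",))

inserted = conn.execute("""SELECT vehicle_hk FROM sat_fahrzeug_text
                            WHERE loaddate = '2024-01-02'
                            ORDER BY vehicle_hk""").fetchall()
```

Only V1 (changed hash diff) and V3 (no satellite row yet) receive a new version; V2 is rejected because its hash diff is identical.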

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Figure: internal data sources (OLTP) and external data sources feed the backend of the Data Warehouse, which consists of the Staging Layer (Input Layer), the Integration Layer (Cleansing Layer), the Core Warehouse Layer (Storage Layer), the Aggregation Layer and the Mart Layer (Output Layer / Reporting Layer) serving the frontend. Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor. Four question marks are placed in the diagram.]

DB SPECIFICS FOR DWH

WHAT YOU WILL LEARN TODAY

• After the end of this lecture you will be able to
  • Understand DB techniques that are specific to a DWH
    • Analytic/windowing functions
    • Bitemporal data
    • Indexing, Partitioning, Parallelism, Compression

EXERCISE: COMPUTE MOST RECENT ROWS

Write an SQL statement that returns the most recent row for each customer.

Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts

Customer_key  Name    Status   Valid_from
1             Brown   Single   01-MAY-2014
2             Bush    Married  05-JAN-2015
1             Miller  Married  15-DEC-2015
3             Stein            15-DEC-2015
3             Stein   Single   18-DEC-2015

SIN1.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 1: MAX-FUNCTION

-- Join each row with the maximum valid_from of its customer.
SELECT s.*
  FROM S_CUSTOMER s
  JOIN (SELECT i.customer_key,
               MAX(i.valid_from) AS max_valid_from
          FROM S_CUSTOMER i
         GROUP BY i.customer_key) b
    ON s.customer_key = b.customer_key
   AND s.valid_from = b.max_valid_from;

S2IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 2: EXISTS

-- Keep a row only if no newer row exists for the same customer.
SELECT s.*
  FROM S_CUSTOMER s
 WHERE NOT EXISTS (SELECT 1
                     FROM S_CUSTOMER i
                    WHERE s.customer_key = i.customer_key
                      AND s.valid_from < i.valid_from);

S2IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 3: MAX IN CORRELATED SUB-SELECT

-- Compare valid_from with the maximum computed per customer in a correlated sub-select.
SELECT s.*
  FROM S_CUSTOMER s
 WHERE s.valid_from = (SELECT MAX(i.valid_from)
                         FROM S_CUSTOMER i
                        WHERE s.customer_key = i.customer_key);

S2IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 4: COALESCE WITH SUB-SELECT

-- end_ts is the valid_from of the next row; the most recent row gets 31.12.9999.
SELECT *
  FROM (SELECT COALESCE((SELECT MIN(i.valid_from)
                           FROM S_CUSTOMER i
                          WHERE s.customer_key = i.customer_key
                            AND s.valid_from < i.valid_from),
                        TO_DATE('31.12.9999', 'DD.MM.YYYY')) AS end_ts,
               s.*
          FROM S_CUSTOMER s)
 WHERE end_ts = TO_DATE('31.12.9999', 'DD.MM.YYYY');

S2IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE

-- Same logic as solution 1, with the sub-select moved into a WITH clause.
WITH max_cust AS (
    SELECT i.customer_key,
           MAX(i.valid_from) AS max_valid_from
      FROM S_CUSTOMER i
     GROUP BY i.customer_key)
SELECT s.*
  FROM S_CUSTOMER s
  JOIN max_cust b ON s.customer_key = b.customer_key
                 AND s.valid_from = b.max_valid_from;

S2IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SQL ANALYTIC / WINDOWING FUNCTIONS

How analytic / windowing functions work:

• Partition the data
• Compute functions over these partitions, e.g. rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
• Return the result

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD

-- LEAD returns the valid_from of the next row per customer; it is NULL for the most recent row.
WITH lead_cust AS (
    SELECT LEAD(s.valid_from, 1) OVER (PARTITION BY s.customer_key
                                       ORDER BY s.valid_from ASC) AS end_ts
         , s.*
      FROM s_customer s)
SELECT *
  FROM lead_cust b
 WHERE b.end_ts IS NULL;

S3IN.sql

EXERCISE: COMPUTE MOST RECENT ROWS
SOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER

-- Number the rows per customer from newest to oldest and keep number 1.
WITH lead_cust AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY s.customer_key
                              ORDER BY s.valid_from DESC) AS rn
         , s.*
      FROM s_customer s)
SELECT *
  FROM lead_cust b
 WHERE b.rn = 1;

S3IN.sql
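The MAX-based and the ROW_NUMBER-based solutions can be compared on the exercise data with a short sketch. It assumes Python's sqlite3 module with SQLite 3.25 or newer (needed for window functions) and stores the dates as ISO strings instead of Oracle DATE values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE s_customer (customer_key INTEGER, name TEXT,
                                         status TEXT, valid_from TEXT)""")
conn.executemany("INSERT INTO s_customer VALUES (?, ?, ?, ?)", [
    (1, "Brown", "Single", "2014-05-01"),
    (2, "Bush", "Married", "2015-01-05"),
    (1, "Miller", "Married", "2015-12-15"),
    (3, "Stein", None, "2015-12-15"),
    (3, "Stein", "Single", "2015-12-18"),
])

# Solution 1 style: self-join with the maximum valid_from per customer.
max_join = conn.execute("""
    SELECT s.customer_key, s.name, s.status, s.valid_from
      FROM s_customer s
      JOIN (SELECT customer_key, MAX(valid_from) AS max_valid_from
              FROM s_customer GROUP BY customer_key) b
        ON s.customer_key = b.customer_key
       AND s.valid_from = b.max_valid_from
     ORDER BY s.customer_key""").fetchall()

# Solution 7 style: one pass over the table with ROW_NUMBER, no self-join.
row_number = conn.execute("""
    WITH ranked AS (
        SELECT s.*, ROW_NUMBER() OVER (PARTITION BY s.customer_key
                                       ORDER BY s.valid_from DESC) AS rn
          FROM s_customer s)
    SELECT customer_key, name, status, valid_from
      FROM ranked WHERE rn = 1
     ORDER BY customer_key""").fetchall()
```

Both queries return the rows for Miller, Bush and Stein (Single). One difference to keep in mind: if two rows of a customer share the same latest valid_from, the MAX variants return both, while ROW_NUMBER picks exactly one.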

MAX OR ANALYTIC / WINDOWING FUNCTIONS: WHICH ALTERNATIVE WOULD YOU RECOMMEND?

• For the final decision, check the execution plans, the execution time (service time and response time) and the resource usage
• The solutions with analytic / windowing functions do not need a self-join and show better statistics than the other solutions shown
• Analytic / windowing functions are very powerful
• Remark: a WITH clause is preferable to sub-selects because it improves readability, understandability and maintainability

TEMPORAL DATA STORAGE (BITEMPORAL DATA)

[Figure: timeline with the dates 10.09., 20.09., 30.09. and 10.10. The price changes from 15 EUR to 16 EUR. The new price of 16 EUR is entered into the DB on 10.09. (transaction time) and applies from 20.09. (valid time).]

TEMPORAL DATA STORAGE (BITEMPORAL DATA)
DEFINITION

Business validity: valid time

• Time period during which a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)

Technical validity: transaction time

• Time period during which a fact stored in the database is known
• The ETL process determines start and end date/time

Bitemporal data

• Combines both valid time and transaction time

TEMPORAL DATA STORAGE (BITEMPORAL DATA)
SQL STANDARD

• Part of the SQL standard since SQL:2011
• But implemented differently by RDBMSes like Oracle, DB2, SQL Server and others
  • Different syntax!
  • Different coverage of the standard!
• Very useful for slowly changing dimensions type 2, but also for other purposes

DB2 VALID TIME EXAMPLE

CREATE TABLE customer_address
( customerID  INTEGER NOT NULL
, name        VARCHAR(100)
, city        VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end   DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );

DB2 VALID TIME EXAMPLE

INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');

UPDATE customer_address FOR PORTION OF BUSINESS_TIME
  FROM '22.05.2013' TO '31.12.2013'
   SET city = 'San Diego' WHERE customerID = 1;

Resulting rows:

customerID  Name    City       Valid_start  Valid_end
1           Miller  Seattle    01.01.2013   22.05.2013
1           Miller  San Diego  22.05.2013   31.12.2013

DB2 VALID TIME EXAMPLE

-- Returns the row that was valid on 17.05.2013, i.e. the Seattle row.
SELECT *
  FROM customer_address
   FOR BUSINESS_TIME AS OF '17.05.2013';
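What the database does for a valid-time table can be emulated with plain columns. The following sketch assumes SQLite, ISO date strings and periods that include the start and exclude the end; the helper functions update_city_for_portion and city_as_of are hypothetical and cover only the simple case from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_address (
    customer_id INTEGER NOT NULL, name TEXT, city TEXT,
    valid_start TEXT NOT NULL, valid_end TEXT NOT NULL)""")
conn.execute("INSERT INTO customer_address VALUES "
             "(1, 'Miller', 'Seattle', '2013-01-01', '2013-12-31')")

def update_city_for_portion(customer_id, city, start, end):
    """Emulate UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO end
    for the simple case where the portion is the tail of one existing row."""
    row = conn.execute(
        """SELECT name FROM customer_address
            WHERE customer_id = ? AND valid_start < ? AND valid_end >= ?""",
        (customer_id, start, end)).fetchone()
    if row is None:
        raise ValueError("no row covers the requested portion")
    # Shorten the old row so that it ends where the new portion starts ...
    conn.execute("""UPDATE customer_address SET valid_end = ?
                     WHERE customer_id = ? AND valid_start < ? AND valid_end >= ?""",
                 (start, customer_id, start, end))
    # ... and add a row for the updated portion.
    conn.execute("INSERT INTO customer_address VALUES (?, ?, ?, ?, ?)",
                 (customer_id, row[0], city, start, end))

def city_as_of(customer_id, day):
    # Equivalent of FOR BUSINESS_TIME AS OF day.
    row = conn.execute(
        """SELECT city FROM customer_address
            WHERE customer_id = ? AND valid_start <= ? AND ? < valid_end""",
        (customer_id, day, day)).fetchone()
    return row[0] if row else None

update_city_for_portion(1, "San Diego", "2013-05-22", "2013-12-31")
```

With native temporal support the row splitting and the period predicate are generated by the database, which is the main convenience of the SQL:2011 syntax.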

DB2 TRANSACTION TIME EXAMPLE

-- Simplified: a complete DB2 definition additionally needs a TRANSACTION START ID column
-- and a history table that is enabled with ALTER TABLE ... ADD VERSIONING.
CREATE TABLE customer_info(
  customerId INTEGER NOT NULL,
  comment    VARCHAR(1000) NOT NULL,
  sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  PERIOD SYSTEM_TIME (sys_start, sys_end)
);

DB2 TRANSACTION TIME EXAMPLE

Transaction on 15.10.2013:

INSERT INTO customer_info VALUES (1, 'comment 1');

Transaction on 31.10.2013:

UPDATE customer_info SET comment = 'comment 2'
 WHERE customerID = 1;

Current row after both transactions:

CustomerId  comment    Sys_start   Sys_end
1           comment 2  31.10.2013  31.12.2999

DB2 TRANSACTION TIME EXAMPLE

SELECT *
  FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';

The data comes from a history table:

CustomerId  comment    Sys_start   Sys_end
1           comment 1  15.10.2013  31.10.2013

Valid time and transaction time can be combined = bitemporal table
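The interplay of the current table and the history table can be emulated by hand. This is a minimal sketch assuming SQLite, ISO date strings and an explicit "today" parameter in place of the system clock; the table customer_info_hist and the helper functions are hypothetical.

```python
import sqlite3

END_OF_TIME = "9999-12-31"
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_info (customer_id INTEGER PRIMARY KEY, comment TEXT,
                            sys_start TEXT, sys_end TEXT);
CREATE TABLE customer_info_hist (customer_id INTEGER, comment TEXT,
                                 sys_start TEXT, sys_end TEXT);
""")

def insert_comment(customer_id, comment, today):
    conn.execute("INSERT INTO customer_info VALUES (?, ?, ?, ?)",
                 (customer_id, comment, today, END_OF_TIME))

def update_comment(customer_id, comment, today):
    # Copy the current version to the history table and close its period ...
    conn.execute("""INSERT INTO customer_info_hist
                    SELECT customer_id, comment, sys_start, ?
                      FROM customer_info WHERE customer_id = ?""",
                 (today, customer_id))
    # ... then overwrite the current row with a new period.
    conn.execute("""UPDATE customer_info SET comment = ?, sys_start = ?
                     WHERE customer_id = ?""", (comment, today, customer_id))

def comment_as_of(customer_id, day):
    # Equivalent of FOR SYSTEM_TIME AS OF day: search current and history rows.
    row = conn.execute("""
        SELECT comment FROM (SELECT * FROM customer_info
                             UNION ALL
                             SELECT * FROM customer_info_hist)
         WHERE customer_id = ? AND sys_start <= ? AND ? < sys_end""",
        (customer_id, day, day)).fetchone()
    return row[0] if row else None

insert_comment(1, "comment 1", "2013-10-15")
update_comment(1, "comment 2", "2013-10-31")
```

A query as of 17.10.2013 is answered from the history table, a query for a later date from the current table.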

INDEXING - WHY

• Very important performance improvement technique
• Good for many reads with high selectivity; comes with a write penalty
• B-trees are the most common index type

[Figure: B-tree index with a root block, branch blocks and leaf blocks; the leaf entries point to rows in the table.]

INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?

• DBs index primary keys by default
• Dimension table columns that are regularly used in WHERE clauses are candidates
• Possibly the foreign key columns in the fact table (see also Star Transformation later)

STAR TRANSFORMATION

• The fact table normally has many more rows than the dimension tables
• Common join techniques would first join one dimension table with the fact table
• Alternative technique: evaluate all dimensions first (cartesian join)
• Then join with the fact table in the last step
• Oracle uses bitmap indexes on the foreign key columns of the fact table to achieve a star join; not supported by many DBs

PARTITIONING

Original table:

Col1  Col2  Col3  Col4
1     A     AA    AAA
2     B     BB    BBB
3     C     CC    CCC

Vertical partitioning (the columns are split):

Col1  Col2
1     A
2     B
3     C

Col3  Col4
AA    AAA
BB    BBB
CC    CCC

Horizontal partitioning (sharding; the rows are split):

Col1  Col2  Col3  Col4
1     A     AA    AAA
2     B     BB    BBB

Col1  Col2  Col3  Col4
3     C     CC    CCC

HORIZONTAL PARTITIONING

• Very powerful feature in a DWH to reduce workload
• Splits a table into logically smaller tables
• Avoids full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm

HORIZONTAL PARTITIONING – SPLITTING OPTIONS

• By range
  • Most common
  • Use a date field like the order date to partition the table into months, days, etc.
• By list
  • Use a field that has a limited number of different values, e.g. split customer data by country if end users most likely select customers from within one country
• By hash
  • Use a field that most likely splits the data into evenly distributed chunks
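The three splitting options boil down to three ways of mapping a row to a partition. The sketch below shows them as plain Python functions; the partition names, the country list and the number of hash partitions are hypothetical, and a real RDBMS performs this mapping internally.

```python
import zlib
from datetime import date

def range_partition(order_date: date) -> str:
    # By range: one partition per month of the date column.
    return f"p_{order_date.year}_{order_date.month:02d}"

COUNTRY_PARTITIONS = {"DE": "p_de", "FR": "p_fr"}  # hypothetical list values

def list_partition(country: str) -> str:
    # By list: explicit values map to a partition, everything else to a default.
    return COUNTRY_PARTITIONS.get(country, "p_other")

def hash_partition(customer_id: int, partitions: int = 4) -> str:
    # By hash: a deterministic hash spreads the keys over a fixed number of partitions.
    return f"p_{zlib.crc32(str(customer_id).encode()) % partitions}"

# Count how 1000 customer ids are distributed over the hash partitions.
counts = {}
for customer_id in range(1000):
    name = hash_partition(customer_id)
    counts[name] = counts.get(name, 0) + 1
```

A query that filters on the partitioning column (e.g. one month or one country) only has to read the matching partition, which is how partitioning avoids full table scans.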

PARALLELISM

• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution across several CPUs
• Powerful in combination with partitioning
• Parallelism is limited by the number of CPUs: if the degree of parallelism is too high, performance will degrade
• Intra-query parallelism and inter-query parallelism

COMPRESSION

• Data compression + index compression
• Store more data in a block/page = read more data per I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduces storage
• Additionally reduces backup time

ALREADY COVERED IN A PREVIOUS LECTURE

• Relational columnar in-memory DB
• Materialized Views / Query Tables

EXERCISE - RECAP ETL AND DB SPECIFICS

• Recap the ETL and DB-specific topics
• Which topics do you remember, and which do you find important?
• Write down 1-2 topics on sticky notes.

THANK YOU

Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

MOST OF THE TIME, IT’S BAD CODE OR BAD DESIGN

Two main strategies:

One big job

• How much data?
• How much volume?

Many small jobs

• How many times?
• How many rows?

CONSTRAINTS

• What about foreign key constraints?
• Create FKs or not (for performance reasons)?
  • Yes!
  • Normally slower during inserts / updates / deletes
  • Very often faster during selects because the optimizer has more information about the data and can compute better execution plans, e.g. table elimination
  • Reading data is much more common than writing data
• Alternative: create informational/RELY constraints if DML performance is not good enough
• Check data quality regularly if you work without constraints or with RELY constraints

HOW TO UPDATE OR DELETE MILLIONS OF ROWS?
INSERT, UPDATE, DELETE

• INSERT is the fastest operation as there is no search phase and reduced transaction logging (just the after image). Indexes affect insert performance. Bulk operations load the data in chunks.
• DELETE is slower as there is a search phase to find the rows that have to be deleted. Transaction logging has to store the before image. Indexes affect performance as each index has to be maintained.
• UPDATE is the slowest as there is a search phase. Transaction logging has to store the before and after images. Indexes affect performance as each index has to be maintained.

HOW TO UPDATE OR DELETE MILLIONS OF ROWS?
DELETE

If millions of rows have to be deleted, it’s often faster to insert the data:

• Create a working table that is identical to the huge table
• Insert the data from the huge table into the working table, except the rows that need to be deleted
• Exchange the huge table and the working table
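The delete-by-insert steps can be sketched as follows. The example assumes SQLite, where the exchange is done by renaming tables; the table names and the year filter are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE huge (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany("INSERT INTO huge VALUES (?, ?)",
                 [(i, 2010 + i % 10) for i in range(10_000)])

# Goal: remove all rows before 2018 (80 % of the table in this data set).
conn.executescript("""
-- 1. Working table with the same structure as the huge table.
CREATE TABLE working (id INTEGER PRIMARY KEY, year INTEGER);
-- 2. Copy only the rows that should survive.
INSERT INTO working SELECT id, year FROM huge WHERE year >= 2018;
-- 3. Exchange the tables.
ALTER TABLE huge RENAME TO huge_old;
ALTER TABLE working RENAME TO huge;
DROP TABLE huge_old;
""")

remaining = conn.execute("SELECT COUNT(*), MIN(year) FROM huge").fetchone()
```

The approach pays off when most rows are removed: the surviving rows are written once with a bulk insert instead of logging a before image for every deleted row.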

HOW TO UPDATE OR DELETE MILLIONS OF ROWS?
UPDATE

If millions of rows have to be updated, it’s often faster to insert the data:

• Create a working table that is identical to the huge table
  • The huge table can consist of just one partition to allow a partition exchange
• Use a window function to compute the new data from the huge table and the new incoming data (Staging)
• Insert the computed data into the working table
• Create local indexes on the working table
• Exchange the huge table partition and the working table
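The update-by-insert steps can be sketched in the same way. The example assumes SQLite 3.25 or newer, uses a table rename in place of a partition exchange (SQLite has no partitions), and the tables and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE huge    (id INTEGER, price INTEGER, loaddate TEXT);
CREATE TABLE staging (id INTEGER, price INTEGER, loaddate TEXT);
INSERT INTO huge    VALUES (1, 10, '2024-01-01'), (2, 20, '2024-01-01'),
                           (3, 30, '2024-01-01');
INSERT INTO staging VALUES (2, 25, '2024-01-02'), (4, 40, '2024-01-02');

-- Working table with the same structure as the huge table.
CREATE TABLE working (id INTEGER, price INTEGER, loaddate TEXT);

-- Per id, keep only the newest version from existing and incoming rows.
INSERT INTO working
SELECT id, price, loaddate
  FROM (SELECT u.*, ROW_NUMBER() OVER (PARTITION BY u.id
                                       ORDER BY u.loaddate DESC) AS rn
          FROM (SELECT * FROM huge UNION ALL SELECT * FROM staging) u)
 WHERE rn = 1;

-- Indexes are created after the bulk insert, then the tables are exchanged.
CREATE INDEX working_id_idx ON working (id);
ALTER TABLE huge RENAME TO huge_old;
ALTER TABLE working RENAME TO huge;
DROP TABLE huge_old;
""")

rows = conn.execute("SELECT id, price FROM huge ORDER BY id").fetchall()
```

Row 2 carries the new price and row 4 is added, without a single UPDATE statement.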

DATA QUALITY

• Data quality is the degree to which data correctly represents the real world.
• Aspects of data quality:
  • Accuracy
  • Completeness
  • Relevance
  • Consistency
  • Reliability
  • Trustworthiness
  • Traceability

EXERCISE: PERFORMANCE

• Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are only interested in
  • Data from yesterday
  • Data from the last 2 years
• How could you improve performance?

EXERCISE: PERFORMANCE

• Suppose you have a fact table containing data for the last 10 years with millions of rows, but you are only interested in a part of it
• A columnar in-memory DB may be an option in general (this option has already been discussed during the lecture)
• Data from yesterday
  • Indexing might be a good choice as only few rows are read
• Data from the last 2 years
  • Indexing is most likely a bad choice, as reading a rather large amount of data via an index quickly becomes inefficient
  • Partitioning

HOW TO COMPARE MANY COLUMNS?
HASH KEYS

• Data that has not changed does not need to be loaded again
• Naive approach: check each column for changes; if any column has changed, load the data
• Better approach: compute a hash key over all relevant columns
  • Store a hash key in each table in the Core Warehouse
  • Compare only the hash keys of the new data with those of the existing data!
  • Faster if there are many duplicates to detect
  • Less code!

HASH KEY COMPUTATION

HK = MD5_HASH (UPPER (TRIM (column-1)) || SEPARATOR ||
               UPPER (TRIM (column-2)) || SEPARATOR ||
               UPPER (TRIM (column-3)) || SEPARATOR ||
               …)

• Use UPPER only if values that differ only in upper/lower case are to be treated as identical
• MD5 is just an example; you can use other hash functions, too
• Be aware that there is a danger of collisions
• Always use a SEPARATOR that is unlikely to occur in the data
  • Without a separator, “10” + “1” and “1” + “01” both concatenate to “101” and therefore get the same hash key
  • With a separator the hash keys differ: “10|1” and “1|01”
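The formula and the separator problem can be reproduced with a few lines of Python. This is a sketch using hashlib.md5; the function name hash_key, the "|" separator and the treatment of NULL as an empty string are assumptions, and the separator is placed between the columns rather than after each one.

```python
import hashlib

SEPARATOR = "|"  # assumed not to occur in the data

def hash_key(*columns, separator=SEPARATOR, case_insensitive=True):
    parts = []
    for value in columns:
        # Assumption: NULL (None) is represented as an empty string.
        text = "" if value is None else str(value).strip()
        parts.append(text.upper() if case_insensitive else text)
    return hashlib.md5(separator.join(parts).encode("utf-8")).hexdigest()

# With a separator the two column combinations get different hash keys ...
with_separator = (hash_key("10", "1"), hash_key("1", "01"))
# ... without a separator both inputs collapse to "101".
without_separator = (hash_key("10", "1", separator=""),
                     hash_key("1", "01", separator=""))
```

The same function can serve as business-key hash for hubs and links and as hash diff for satellites, as long as the column order and the separator stay fixed.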