H. Galhardas, SAD 2007/08

Extract, Transform, Load (ETL)

Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
DW Phases
• Design phase
  – Modeling, DB design, source selection, ...
• Loading phase
  – First load/population of the DW
  – Based on all data in sources
• Refreshment phase
  – Keep the DW up-to-date w.r.t. source data changes
The ETL Process
• The most underestimated process in DW development
• The most time-consuming process in DW development
  – Often, 80% of development time is spent on ETL
• Extract
  – Extract relevant data
• Transform
  – Transform data to DW format
  – Build keys, etc.
  – Cleansing of data
• Load
  – Load data into DW
  – Build aggregates, etc.
DW Architecture

[Figure: data sources (operational DBs, other sources) pass through Extract/Transform/Load/Refresh and a data staging area into data storage (the data warehouse and data marts), with a monitor & integrator maintaining metadata; an OLAP server serves front-end tools for analysis, queries/reports, and data mining.]
ETL process

[Figure: the same architecture with Extract/Transform/Load/Refresh and the data staging area highlighted as the ETL process, feeding the data warehouse from operational DBs and other sources.]

Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services
Data Staging Area
• Transit storage for data underway in the ETL process
  – Transformations/cleansing done here
• No user queries (some do it)
• Sequential operations (few) on large data volumes
  – Performed by central ETL logic
  – Easily restarted
  – No need for locking, logging, etc.
  – RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions copied from DSA to relevant marts
ETL construction process
Plan
1) Make high-level diagram of source-destination flow
2) Test, choose and implement ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
4) Construct and test static dimension builds
5) Construct and test change mechanisms for one dimension
6) Construct and test remaining dimension builds
Construction of fact tables and automation
7) Construct and test initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation
Building Dimensions
• Static dimension table
  – Assignment of keys: map production keys to DW keys using a table
  – Combination of data sources: find common key?
• Handling dimension changes
  – Slowly changing dimensions
  – Find newest DW key for a given production key
  – Table for mapping production keys to DW keys must be updated
• Load of dimensions
  – Small dimensions: replace
  – Large dimensions: load only changes
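The key-mapping idea above can be sketched in a few lines. This is an illustrative Python sketch (function, field, and key names are invented, not from the slides), assuming the production-key-to-DW-key table is kept as a dictionary and new surrogate keys are allocated from a counter:

```python
def assign_surrogate_keys(rows, key_map, next_key):
    """Attach a DW surrogate key to each dimension row.
    rows: dicts with a 'prod_key' field; key_map: prod_key -> dw_key."""
    out = []
    for row in rows:
        pk = row["prod_key"]
        if pk not in key_map:       # unseen production key: allocate a DW key
            key_map[pk] = next_key
            next_key += 1
        out.append({**row, "dw_key": key_map[pk]})
    return out, next_key
```

Persisting `key_map` between loads is what lets later loads (and fact builds) find the newest DW key for a given production key.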
Building fact tables
Two types of load:
• Initial load
  – ETL for all data up till now
  – Done when DW is started the first time
  – Often problematic to get correct historical data
  – Very heavy: large data volumes
• Incremental update
  – Move only changes since last load
  – Done periodically (month/week/day/hour/...) after DW start
  – Less heavy: smaller data volumes
• Dimensions must be updated before facts
  – The relevant dimension rows for new facts must be in place
  – Special key considerations if initial load must be performed again
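Why dimensions must go first is easy to see in code. A minimal sketch (names invented), assuming the dimension's production-key-to-DW-key map has already been refreshed: facts whose production key has no dimension row yet cannot be assigned a surrogate key and are held back.

```python
def load_facts_incremental(new_facts, dim_key_map, fact_table):
    """Append facts whose dimension row exists; return the rest for retry.
    dim_key_map: production key -> DW surrogate key (dimension already loaded)."""
    rejected = []
    for f in new_facts:
        dw_key = dim_key_map.get(f["prod_key"])
        if dw_key is None:
            rejected.append(f)      # dimension row missing: cannot assign key
        else:
            fact_table.append({"dim_key": dw_key, "amount": f["amount"]})
    return rejected
```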
Types of data sources
Non-cooperative sources
• Snapshot sources: provide only a full copy of the source
• Specific sources: each is different, e.g., legacy systems
• Logged sources: write a change log (DB log)
• Queryable sources: provide a query interface, e.g., SQL
Cooperative sources
• Replicated sources: publish/subscribe mechanism
• Call-back sources: call external code (ETL) when changes occur
• Internal action sources: only internal actions when changes occur (DB triggers are an example)

Extract strategy is very dependent on the source types.
Extract phase
Goal: fast extract of relevant data
• Extract from source systems can take a long time
Types of extracts:
• Extract applications (SQL): co-existence with other applications
• DB unload tools: much faster than SQL-based extracts
• Extract applications are sometimes the only solution
Often too time-consuming to ETL all data:
– Extracts can take days/weeks
– Drain on the operational systems
– Drain on DW systems
=> Extract/ETL only changes since last load (delta)
Computing deltas
• Much faster to only "ETL" changes since last load
• A number of methods can be used
• Store sorted total extracts in DSA
  – Delta can easily be computed from current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
• Put update timestamp on all rows
  – Updated by DB trigger
  – Extract only where "timestamp > time of last extract"
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - Source system must be changed
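The first method (comparing total extracts kept in the DSA) can be sketched as follows. A minimal Python sketch, assuming for simplicity that each extract fits in memory as a key-to-row mapping rather than as the sorted files a real DSA would use:

```python
def compute_delta(last_extract, current_extract):
    """Compare the previous and current total extracts and return the delta.
    Both extracts map a record key to its row contents."""
    inserts = {k: v for k, v in current_extract.items() if k not in last_extract}
    deletes = {k: v for k, v in last_extract.items() if k not in current_extract}
    updates = {k: v for k, v in current_extract.items()
               if k in last_extract and last_extract[k] != v}
    return inserts, deletes, updates
```

Note how deletions fall out naturally (keys present last time but not now), which the timestamp method alone cannot detect.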
Load (1)
Goal: fast loading into DW
– Loading deltas is much faster than a total load
• SQL-based update is slow
  – Large overhead (optimization, locking, etc.) for every SQL call
  – DB load tools are much faster
  – Some load tools can also perform UPDATEs
• Indexes on tables slow the load a lot
  – Drop indexes and rebuild after load
  – Can be done per partition
• Parallelization
  – Dimensions can be loaded concurrently
  – Fact tables can be loaded concurrently
  – Partitions can be loaded concurrently
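Partition-level parallelism can be sketched like this; `bulk_load` is a hypothetical stand-in for whatever bulk-load call the target DBMS offers, and a thread pool is just one way to run the loads concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def load_partitions(partitions, bulk_load, workers=4):
    """Load each partition concurrently; results come back in partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(bulk_load, partitions))
```

Threads suffice here because the work is I/O-bound (the DBMS does the heavy lifting); the same shape works for loading independent dimensions or fact partitions side by side.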
Load (2)
• Relationships in the data
  – Referential integrity must be ensured
  – Can be done by the loader
• Aggregates
  – Must be built and loaded at the same time as the detail data
  – Today, RDBMSs can often do this
• Load tuning
  – Load without log
  – Sort the load file first
  – Make only simple transformations in the loader
  – Use loader facilities for building aggregates
  – Use the loader within the same database
• Should the DW be on-line 24x7?
  – Use partitions or several sets of tables
Data Cleaning

Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e.,

Cleaning and Transforming to get...

High-quality data!
Why Data Cleaning and Transformation?

Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=""
• noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  – e.g., Age="42", Birthday="03/07/1997"
  – e.g., Was rating "1,2,3", now rating "A, B, C"
  – e.g., discrepancy between duplicate records
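The three kinds of dirt above can be detected mechanically. An illustrative sketch (field names and thresholds invented; the "today" reference date is an assumption) that flags exactly the example records from this slide:

```python
from datetime import date

def check_record(rec, today=date(2007, 1, 1)):
    """Flag incomplete, noisy, and inconsistent fields in one record."""
    problems = []
    if not rec.get("occupation"):                               # incomplete
        problems.append("incomplete: occupation missing")
    if rec.get("salary") is not None and rec["salary"] < 0:     # noisy
        problems.append("noisy: negative salary")
    bday, age = rec.get("birthday"), rec.get("age")
    if bday is not None and age is not None:
        if abs((today.year - bday.year) - age) > 1:             # inconsistent
            problems.append("inconsistent: age vs. birthday")
    return problems
```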
Why Is Data Dirty?
• Incomplete data comes from:
  – data value not available when collected
  – different criteria between the time when the data was collected and when it is analyzed
  – human/hardware/software problems
• Noisy data comes from:
  – data collection: faulty instruments
  – data entry: human or computer errors
  – data transmission
• Inconsistent (and redundant) data comes from:
  – different data sources, so non-uniform naming conventions/data
  – approximate, "fuzzy" joins on not-quite-matching keys
Data quality vs. cleaning
• Data quality = Data cleaning +
  – Data enrichment
    • enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
  – Data profiling
    • analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  – Data monitoring
    • deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
  – Data stewards responsible for data quality
  – DW-controlled improvement
  – Source-controlled improvement
  – Construct programs to check data quality
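Data profiling as defined above reduces to collecting per-column statistics. A minimal sketch (statistic names chosen for illustration) of what a profiler computes over one column:

```python
def profile_column(values):
    """Capture simple statistics (metadata) for one column: row count,
    null count, distinct non-null count, and the null ratio."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or v == "")
    distinct = len({v for v in values if v is not None and v != ""})
    return {"rows": total, "nulls": nulls, "distinct": distinct,
            "null_ratio": nulls / total if total else 0.0}
```

A high null ratio or an unexpectedly low distinct count is exactly the kind of signal that points at data quality issues before cleaning starts.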
Data Cleaning & Transformation

Dirty data (two records describing the same publication):
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS'95

Clean data:
Events:
  PDIS | Conference on Parallel and Distributed Information Systems
Publications:
  QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors:
  DQua | Dallan Quass
  AGup | Ashish Gupta
  JWid | Jennifer Widom
  ...
PubsAuthors:
  QGMW96 | DQua
  QGMW96 | AGup
  ...
Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data transformations.

[Figure: a DAG of transformations: DirtyData is split by Extraction into DirtyAuthors, DirtyTitles, ..., DirtyEvents; Standardization (using an auxiliary CitiesTags table), Duplicate Elimination, and Formatting then produce the clean Authors table.]
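The DAG-of-transformations model can be sketched as a tiny scheduler: each node is a function from named tables to named tables, run after its prerequisites. This is a minimal illustrative sketch (node names follow the figure, the transformations themselves are stubs, and an acyclic graph is assumed, so there is no cycle detection):

```python
def run_dag(nodes, deps, tables):
    """Run each transformation once, after its prerequisites.
    nodes: name -> function taking and returning a dict of named tables.
    deps:  name -> list of prerequisite node names."""
    done = set()
    def run(name):
        if name in done:
            return
        for d in deps.get(name, []):
            run(d)                       # topological order via recursion
        tables.update(nodes[name](tables))
        done.add(name)
    for name in nodes:
        run(name)
    return tables
```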
Existing technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  – Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or a data quality tool
Problems of ETL and data quality solutions (1)

The semantics of some data transformations is defined in terms of their implementation algorithms.

[Figure: data cleaning transformations tied to application domains 1, 2, 3, ...]
Problems of ETL and data quality solutions (2)

There is a lack of interactive facilities to tune a data cleaning application program.

[Figure: dirty data flows into the cleaning process, which produces clean data and rejected data.]
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
Logical level

[Figure: the cleaning DAG again: DirtyData through Extraction, Standardization (using CitiesTags), Duplicate Elimination, and Formatting down to Authors, with DirtyAuthors, DirtyTitles, ... as intermediate tables.]
Logical and physical level

[Figure: the same DAG annotated with logical operators (Map for Extraction, Standardization, and Formatting; Cluster, Match, and Merge for Duplicate Elimination) and, at the physical level, with an execution algorithm per operator: SQL Scan and Java Scan for the Maps, NL for Match, TC for Cluster.]
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
Example

[Figure: duplicate elimination of DirtyAuthors into Authors in three steps: Match (producing MatchAuthors), then Cluster, then Merge.]
Example

Declarative specification of the Match step:

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Declarative specification of a Mapping:

CREATE MAPPING mapKeDiDa
FROM DirtyData Dd
LET keyKdd = generateId(1)
{SELECT keyKdd AS paperKey,
        Dd.paper AS paper
 KEY paperKey
 CONSTRAINT NOT NULL mapKeDiDa.paper}
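What the MATCH specification computes on the DirtyAuthors sample can be sketched in Python. This is a hedged sketch: a standard dynamic-programming Levenshtein distance stands in for editDistance (AJAX's actual function may differ), and the `key1 < key2` filter is one simple way to skip self-pairs and symmetric duplicates in the self-join:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def match_authors(dirty_authors, max_dist):
    """Nested-loop self-join of DirtyAuthors(authorKey, name):
    keep pairs whose name distance is below max_dist."""
    return [(k1, k2) for k1, n1 in dirty_authors for k2, n2 in dirty_authors
            if k1 < k2 and edit_distance(n1, n2) < max_dist]
```

With max_dist = 3, the abbreviated freytag variants pair up while the full name stays apart, which is why a Cluster step (e.g., transitive closure over matching pairs) follows Match.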
Management of exceptions
• Problem: to mark tuples not handled by the cleaning criteria of an operator
• Solution: to specify the generation of exception tuples within a logical operator
  – exceptions are thrown by external functions
  – output constraints are violated
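The two exception sources above (a throwing external function, a violated output constraint) can be sketched as a generic operator wrapper; this is an illustrative Python sketch, not the AJAX runtime:

```python
def apply_with_exceptions(rows, fn, constraint=lambda out: True):
    """Run fn over each input tuple; route failures to an exception table."""
    output, exceptions = [], []
    for row in rows:
        try:
            out = fn(row)               # external function may throw
            if not constraint(out):     # output constraint violated
                raise ValueError("output constraint violated")
            output.append(out)
        except Exception as exc:
            exceptions.append((row, str(exc)))   # marked for later debugging
    return output, exceptions
```

Keeping the offending input tuple alongside the error message is what makes the debugger's backward derivation possible later.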
Debugger facility
• Supports the (backward and forward) data derivation of tuples w.r.t. an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators