Information Systems 30 (2005) 492–525

A generic and customizable framework for the design of ETL scenarios

Panos Vassiliadis (a), Alkis Simitsis (b), Panos Georgantas (b), Manolis Terrovitis (b), Spiros Skiadopoulos (b)

(a) Department of Computer Science, University of Ioannina, Ioannina, Greece
(b) Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece

Abstract

Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the logical design of ETL scenarios and provide a generic and customizable framework in order to support the DW designer in his task. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Data warehousing; ETL

1. Introduction

Data warehouse operational processes normally compose a labor-intensive workflow, involving data extraction, transformation, integration, cleaning and transport. To deal with this workflow, specialized tools are already available in the market [1-4], under the general title extraction–transformation–loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include (a) the identification of relevant information at the source side, (b) the extraction of this information,

[Fig. 1. Different perspectives for an ETL workflow. The logical perspective (left) comprises the execution plan (execution sequence, execution schedule, recovery plan), the administration plan (monitoring & logging, security & access rights management) and the relationship with data (primary data flow, data flow for logical exceptions); the physical perspective (right) comprises the resource layer and the operational layer.]


(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

If we treat an ETL scenario as a composite workflow in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled manner. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively. At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1 and the physical perspective on the right-hand side. At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far). On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.


In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator, either on-line (monitoring) or off-line (logging), for the status of an executed activity, as well as the security and authentication management for the ETL environment.

We find that research has not dealt with the definition of data-centric workflows to the entirety of its extent. In the ETL case, for example, due to the data-centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data, in other words, the rows that violate integrity or business rules. It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

In this paper, we work on the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues; thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

Our contributions can be listed as follows:

- First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schemata, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.

- Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved from a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.

- Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work. In Section 6, we make a general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

[1] In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.

[2] The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling, this is a simple passing of data from one site to another.

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging, while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, are also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example. To motivate our discussion, we will present an example involving the propagation of data from a certain source, S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA) [1], DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA [2].

2. In the DSA we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any not newly inserted row is rejected and so it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.


[Fig. 3. Bird's-eye view of the motivating example: the data flow from S1.PARTSUPP in the Source, through FTP_PS1, DS.PS1_NEW, DS.PS1_OLD, Diff_PS1 (with Diff_PS1_REJ), NotNull1 (with NotNull1_REJ), Add_Attr1, DS.PS1, SK1 and the LOOKUP table in the DSA, towards DW.PARTSUPP in the Data Warehouse.]

- Data types: black ellipsoids. RecordSets: cylinders.
- Function types: black rectangles. Functions: gray rectangles.
- Constants: black circles. Parameters: white rectangles.
- Attributes: unshaded ellipsoids. Activities: triangles.
- Part-of relationships: simple lines with diamond edges.
- Provider relationships: bold solid arrows (from provider to consumer).
- Instance-of relationships: dotted arrows (from instance towards the type).
- Derived provider relationships: bold dotted arrows (from provider to consumer).
- Regulator relationships: dotted lines.
We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 2. Graphical notation for the architecture graph.


3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST, through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.

4. Although we consider the data flow for only one source, namely S1, the data warehouse can clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (a small illustrative instance of such a lookup table follows this list). In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is that they be ignored; thus, in this case, we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically restructure it into a flat typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, ..., Ak] for a certain recordset, its extension is a mapping S = [A1, ..., Ak] → dom(A1) × ... × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × ... × dom(Ak), and a record is the instance of a mapping dom(A1) × ... × dom(Ak) → [x1, ..., xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.
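As a small illustration of these definitions (the schema and values are ours, not the paper's): for a recordset with schema S = [PKEY, QTY], where both attributes are of type Integer, the extension {[10, 200], [25, 40]} is a finite subset of dom(PKEY) × dom(QTY), and [10, 200] is a record, i.e., an instantiation of S with 10 ∈ dom(PKEY) and 200 ∈ dom(QTY).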

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type; consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function, and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
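For example (ours, not the paper's): a function type addConst with parameter data types (Integer, Integer) and return data type Integer; a particular function of this type could take as input parameters the attribute QTY and the constant 10, both legal candidates since they are of type Integer, together with a parameter of type Integer for its return value.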

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed from a particular program. Normally, ETL activities will be either performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:

- Name: A unique identifier for the activity.

- Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.

- Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

- Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

- Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

- Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

- Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently (a minimal sketch of such a pair of statements follows this list).
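As a minimal sketch of such a pair of LDL statements, consider a hypothetical activity sel1 whose check keeps only the rows with a positive QTY; the activity, predicate and attribute names below are ours and purely illustrative, they are not part of the paper's scenario, and we assume the usual arithmetic comparison predicates of the language.

   sel1_out(A_OUT_PKEY, A_OUT_QTY) <-
      sel1_in1(A_IN1_PKEY, A_IN1_QTY),
      A_IN1_QTY > 0,
      A_OUT_PKEY = A_IN1_PKEY, A_OUT_QTY = A_IN1_QTY.

   sel1_rej(A_REJ_PKEY, A_REJ_QTY) <-
      sel1_in1(A_IN1_PKEY, A_IN1_QTY),
      0 > A_IN1_QTY,
      A_REJ_PKEY = A_IN1_PKEY, A_REJ_QTY = A_IN1_QTY.

   sel1_rej(A_REJ_PKEY, A_REJ_QTY) <-
      sel1_in1(A_IN1_PKEY, A_IN1_QTY),
      A_IN1_QTY = 0,
      A_REJ_PKEY = A_IN1_PKEY, A_REJ_QTY = A_IN1_QTY.

The first rule plays the role of the output operational semantics; the other two, taken together, form its complement and thus play the role of the rejection operational semantics.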

There are two issues that we would like to elaborate on here.

NULL schemata. Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity; they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

[Fig. 4. Instance-of relationships of the architecture graph: the attributes of the two activities of the example are connected through instance-of edges to the data types Integer and Date.]

- Part-of relationships: These relationships involve attributes and parameters, and relate them to the respective activity, recordset or function to which they belong.

- Instance-of relationships: These relationships are defined among a data/function type and its instances.

- Provider relationships: These are relationships that involve attributes with a provider-consumer relationship.

- Regulator relationships: These relationships are defined among the parameters of activities and the terms that populate these activities.

- Derived provider relationships: A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types.


Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely Integer and Date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively).

[Fig. 5. Part-of, regulator and provider relationships of the architecture graph: the recordsets DS.PS1, LOOKUP and DW.PARTSUPP, the activity SK1 with its IN, OUT and PAR schemata, and the edges connecting their attributes and parameters.]

Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions, along with their respective parameters, and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP and DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata, in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed, in order to derive a new attribute through the respective parameter.
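Spelled out as the (name, term) pairs of the parameter list defined in Section 2.3, the regulator relationships described above bind the parameters of SK1 roughly as follows; the exact pairing of PKEY and SOURCE with the input attributes is our reading of the text, given only for illustration:

   (PKEY,    SK1.IN.PKEY)
   (SOURCE,  SK1.IN.SOURCE)
   (LPKEY,   LOOKUP.PKEY)
   (LSOURCE, LOOKUP.SOURCE)
   (LSKEY,   LOOKUP.SKEY)

In addition, LSKEY is connected by a second regulator relationship to SK1.OUT.SKEY, the output attribute that it populates.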

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

- Name: A unique identifier for the provider relationship.

- Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider or a constant which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed.

   addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
      ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
      A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
      A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

   addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
      addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
      lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
      A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
      A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

   dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
      addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
      DATE=A_IN1_DATE, QTY=A_IN1_QTY, COST=A_IN1_COST,
      SOURCE=A_IN1_SOURCE, PKEY=A_IN1_SKEY.

NOTE: For reasons of readability we do not replace the A_in attribute names with the activity name, i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

The first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
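This deduction can itself be phrased as a Datalog-style rule over the edges of the graph. The encoding below is only a sketch of ours, not part of the paper's model: regulator/2, parameterOf/2 and outputAttributeOf/2 are hypothetical predicates standing for the regulator edges, the part-of edges of parameters and the part-of edges of output attributes, respectively.

   derivedProvider(SOURCE, TARGET) <-
      regulator(SOURCE, X), parameterOf(X, A),
      regulator(Y, TARGET), parameterOf(Y, A),
      outputAttributeOf(TARGET, A).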



[Fig. 7. Derived provider relationships of the architecture graph, for activity SK1, its parameters and the LOOKUP table: the original situation on the left and the derived provider relationships on the right.]


Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), or (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

- Name: A unique identifier for the scenario.

- Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.


   Entity                            Model-specific   Scenario-specific
   Built-in:
     Data Types                      D^I              D
     Function Types                  F^I              F
     Constants                       C^I              C
   User-provided:
     Attributes                      Ω^I              Ω
     Functions                       Φ^I              Φ
     Schemata                        S^I              S
     RecordSets                      RS^I             RS
     Activities                      A^I              A
     Provider Relationships          Pr^I             Pr
     Part-Of Relationships           Po^I             Po
     Instance-Of Relationships       Io^I             Io
     Regulator Relationships         Rr^I             Rr
     Derived Provider Relationships  Dr^I             Dr

Fig. 8. Formal definition of domains and notation.


- Recordsets: A finite set of recordsets.

- Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

- Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

   V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
   E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines, during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

- All the attributes of the input schema(ta) of an activity should have a provider.


- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

- All the attributes of the schemata of the target recordsets should have a data provider (a sketch of how such checks could be phrased follows this list).
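Since the architecture graph is itself data, such data flow constraints can be checked declaratively. The rules below are an LDL-style sketch of ours, not part of the paper's model: inputAttributeOf/2, targetAttributeOf/1 and providerRel/2 are hypothetical predicates encoding the part-of edges of input schemata, the attributes of target recordsets and the provider edges, and we assume the dialect's negation operator, written here as ~.

   hasProvider(ATTR) <- providerRel(SRC, ATTR).

   violation(ATTR) <- inputAttributeOf(ATTR, ACT), ~hasProvider(ATTR).
   violation(ATTR) <- targetAttributeOf(ATTR), ~hasProvider(ATTR).

A scenario that satisfies the data flow constraints yields an empty violation relation.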

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

[Fig. 9. The metamodel for the logical entities of the ETL environment: the metamodel layer (Elementary Activity, RecordSet, Relationship, data types, functions), the template layer (e.g., NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Relationship), connected to it through IsA links, and the schema layer (e.g., S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP), connected to the upper layers through InstanceOf links.]

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

- Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
- Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
- Binary operations: Union (U), Join, Diff (Δ), Update Detection (ΔUPD).
- Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).
- File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses, whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: A unique identifier for the template activity.

- Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

- Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

- Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined, and possibly rearranged, at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time, (b) a function that returns the arity of an input, output or parameter schema, (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines, (d) keywords to simplify the creation of unique predicate and attribute names, and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a special symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by <parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -; valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
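For intuition only, the effect of such a loop can be mimicked with a tiny Python helper (ours, under the simplifying assumption of a single, non-nested loop with step 1); expand_loop is a hypothetical name:

def expand_loop(body: str, iterator: str, lower: int, upper: int, sep: str = ", ") -> str:
    # Reproduce the loop body once per iteration value, replacing the
    # marked appearances $iterator$ by the current value.
    return sep.join(body.replace(f"${iterator}$", str(value))
                    for value in range(lower, upper + 1))

# Example: listing the attributes of an output schema of arity 4.
print(expand_loop("A_OUT_$i$", "i", 1, 4))   # A_OUT_1, A_OUT_2, A_OUT_3, A_OUT_4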

Keywords: Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Fig. 11. Keywords for templates:

    Keyword: a_out / a_in
    Usage:   A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
    Example: difference3_out / difference3_in

    Keyword: A_OUT / A_IN
    Usage:   A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
    Example: DIFFERENCE3_OUT / DIFFERENCE3_IN
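Purely as an illustration of this renaming step (the actual tool derives the names from the activity chosen by the user), a Python sketch with our own hypothetical helper rename_keywords could be:

def rename_keywords(ldl: str, activity: str, inputs: int = 1) -> str:
    # Map the standard template keywords to unique, activity-specific names;
    # keywords are recognized even when they are part of a larger string.
    renamings = {"a_out": f"{activity}_out", "A_OUT": f"{activity.upper()}_OUT"}
    for k in range(1, inputs + 1):
        renamings[f"a_in{k}"] = f"{activity}_in{k}"
        renamings[f"A_IN{k}"] = f"{activity.upper()}_IN{k}"
    for keyword, unique_name in renamings.items():
        ldl = ldl.replace(keyword, unique_name)
    return ldl

print(rename_keywords("a_out(A_OUT_1) <- a_in1(A_IN1_1), a_in2(A_IN2_1)", "dm1", inputs=2))
# dm1_out(DM1_OUT_1) <- dm1_in1(DM1_IN1_1), dm1_in2(DM1_IN2_1)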

Macros: To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator < maxLimit] name_theme_$iterator$,
[iterator = maxLimit] name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$) <-
    a_in1([i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$),
    expr([i<arityOf(@PARAM)] @PARAM[$i$], [i=arityOf(@PARAM)] @PARAM[$i$]),
    [i<arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

defining the attributes of the output schema a_out, simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer to define certain macros that simplify the management of variable length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
    [i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
    [i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

DEFINE PARAM_SCHEMA AS
    [i<arityOf(@PARAM)] @PARAM[$i$], [i=arityOf(@PARAM)] @PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
    [i<arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING
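A minimal sketch of such a macro facility, assuming macros are plain textual substitutions kept in a dictionary (our own simplification and names, not the ARKTOS II implementation), could be:

def expand_macros(template: str, macros: dict) -> str:
    # Textually replace each macro name with its definition; longer names
    # first, so that a name is never clobbered by a shorter overlapping one.
    for name in sorted(macros, key=len, reverse=True):
        template = template.replace(name, macros[name])
    return template

macros = {
    "INPUT_SCHEMA":  "[i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$",
    "OUTPUT_SCHEMA": "[i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$",
}
print(expand_macros("a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA)", macros))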

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.

2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.

3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.

4. All the remaining parameter variables are instantiated.

5. Keywords are recognized and renamed.

We will try to briefly explain the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
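To make the order tangible, the following self-contained Python sketch (ours, with a deliberately simplified loop syntax [i<=arityOf(x)]{...} instead of the paper's two-loop idiom, and purely hypothetical helper names) chains the five steps in exactly this order:

import re

def instantiate(template: str, macros: dict, params: dict,
                renamings: dict, arities: dict) -> str:
    # 1. Replace macro definitions with their expansions.
    for name in sorted(macros, key=len, reverse=True):
        template = template.replace(name, macros[name])
    # 2.+3. Evaluate the arityOf() loop boundaries and produce the loops.
    def produce(match):
        schema, body = match.group(1), match.group(2)
        return ",".join(body.replace("$i$", str(v))
                        for v in range(1, arities[schema] + 1))
    template = re.sub(r"\[i<=arityOf\((\w+)\)\]\{([^}]*)\}", produce, template)
    # 4. Instantiate the remaining parameter variables.
    for name, value in params.items():
        template = template.replace(f"@{name}", str(value))
    # 5. Recognize and rename the keywords.
    for keyword, unique_name in renamings.items():
        template = template.replace(keyword, unique_name)
    return template

print(instantiate(
    "a_out([i<=arityOf(a_out)]{A_OUT_$i$}) <- a_in1([i<=arityOf(a_in1)]{A_IN1_$i$}), @FIELD>=@Xlow",
    macros={}, params={"FIELD": "A_IN1_3", "Xlow": 5},
    renamings={"a_out": "dm1_out", "a_in1": "dm1_in1",
               "A_OUT": "DM1_OUT", "A_IN1": "DM1_IN1"},
    arities={"a_out": 4, "a_in1": 4}))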

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

Fig. 12. Instantiation procedure.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates: Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Fig. 13. Simple template example: domain mismatch.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limits of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual ranges for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.
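In more familiar procedural terms, the net effect of the instantiated DM1 activity can be sketched as follows (an illustrative Python rendering of the semantics, ours and not output of the tool; the third attribute, index 2, plays the role of A_IN_3):

def domain_mismatch(rows, field_index, low, high):
    # Propagate only the rows whose checked field lies within [low, high];
    # the remaining rows fail the check.
    return [row for row in rows if low <= row[field_index] <= high]

dm1_in = [(1, 'a', 7, 'x'), (2, 'b', 12, 'y'), (3, 'c', 5, 'z')]
print(domain_mismatch(dm1_in, field_index=2, low=5, high=10))
# [(1, 'a', 7, 'x'), (3, 'c', 5, 'z')]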

Program-based templates: The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of the attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = { x ∈ R | ¬(∃ y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]) }

Fig. 14. Program-based template example: Difference activity.

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
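An illustrative, key-based rendering of these semantics in Python (ours; it is not the LDL program of Fig. 14) could be:

def difference(r_new, r_old, key):
    # Keep the tuples of r_new that have no tuple in r_old agreeing on all key attributes.
    old_keys = {tuple(row[a] for a in key) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in key) not in old_keys]

r_old = [{"PKEY": 1, "COST": 10}, {"PKEY": 2, "COST": 20}]
r_new = [{"PKEY": 1, "COST": 10}, {"PKEY": 3, "COST": 30}]
print(difference(r_new, r_old, key=["PKEY"]))   # [{'PKEY': 3, 'COST': 30}]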

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" onto the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed by the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they are slowly starting to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM: DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides a metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft: The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: A GUI used to interactively design and execute DTS packages.

DTS export and import wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle: Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software: The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica: Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL: The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning: An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows: To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios and, in [43], a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness: A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results: A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics: A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing the loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized views maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for their valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse

[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm

[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com

[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html

[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm

[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.

[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.

[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.

[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org

[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.

[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.

[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.

[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs

[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.

[15] Ascential Software Inc., available at http://www.ascentialsoftware.com

[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html

[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.

[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.

[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb

[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.

[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.

[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, AJAX: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.

[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop (in conjunction with ACM SIGMOD), 1999.

[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report RR-3742, INRIA, 1999.

[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/rshankar/papers/pwheel.pdf

[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.

[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.

[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).

[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), Lecture Notes in Computer Science, vol. 2681, Springer, 2003, pp. 79–94.

[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), Lecture Notes in Computer Science, vol. 2348, Springer, 2002, pp. 262–279.

[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.

[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.

[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.

[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf

[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.

[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.

[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.

[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.

[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.

[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.

[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.

[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.

[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.

[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.

[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.


Fig. 1. Different perspectives for an ETL workflow (logical vs. physical perspective; execution plan comprising the execution sequence, execution schedule and recovery plan; administration plan comprising monitoring & logging and security & access rights management; primary data flow and data flow for logical exceptions; resource and operational layers).

(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

If we treat an ETL scenario as a composite workflow, in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here, we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled manner. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively.

At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1 and the physical perspective on the right-hand side. At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far). On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.


In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator either on-line (monitoring) or off-line (logging) for the status of an executed activity, as well as the security and authentication management for the ETL environment.

We find that research has not dealt with the definition of data-centric workflows to its full extent. In the ETL case, for example, due to the data-centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data (the rows that violate integrity or business rules). It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

In this paper, we work on the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues; thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

Our contributions can be listed as follows:

First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schemata, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.

Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved through a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as the template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.

Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work.


In Section 6, we make a general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

1 In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.

2 The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling, this is a simple passing of data from one site to another.

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging, while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario through specific parameters are also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example. To motivate our discussion, we will present an example involving the propagation of data from a certain source S1 towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)1, DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.2

2. In the DSA we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any not newly inserted row is rejected and so it is propagated to Diff_PS1_REJ that stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.

Fig. 3. Bird's-eye view of the motivating example (source S1.PARTSUPP; activities FTP_PS1, Diff_PS1, NotNull1, Add_Attr1, SK1; DSA recordsets DS.PS1_NEW, DS.PS1_OLD, DS.PS1; lookup table LOOKUP; rejection recordsets Diff_PS1_REJ, NotNull1_REJ; target DW.PARTSUPP).

Fig. 2. Graphical notation for the architecture graph: data types (black ellipsoids), recordsets (cylinders), function types (black rectangles), functions (gray rectangles), constants (black circles), parameters (white rectangles), attributes (unshaded ellipsoids), activities (triangles); part-of relationships (simple lines with diamond edges), provider relationships (bold solid arrows, from provider to consumer), instance-of relationships (dotted arrows, from instance towards the type), derived provider relationships (bold dotted arrows, from provider to consumer), regulator relationships (dotted lines). We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.


3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.

4. Although we consider the data flow for only one source, namely S1, the data warehouse can clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is for them to be ignored; thus, in this case, we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak) and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
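As a small illustration (our own example, not taken from the paper's figures): the function subtract that appears in the template instantiation discussion of Section 3.2.2 can be viewed as an instance of a function type named subtract, with two numeric parameter data types and a numeric return data type; when such a function is employed in an activity, its input parameters can be bound to attributes such as COST_IN and PRICE_IN, and its return parameter to a newly produced attribute such as PROFIT.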

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:

Name. A unique identifier for the activity.

Input schemata. A finite set of one or more input schemata that receive data from the data providers of the activity.

Output schema. A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

Rejections schema. A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

Parameter list. A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

Output operational semantics. An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

Rejection operational semantics. An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
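To make the formalism concrete, the following fragment is a sketch of our own (it is not one of the paper's figures) of how the NotNull1 activity of the motivating example could be registered. The attribute naming follows the A_IN/A_OUT convention used later in Fig. 6, while the representation of missing values by a designated constant 'null' and the inequality operator are assumptions made purely for illustration:

notNull1_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST) <-
   notNull1_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST),
   A_IN1_COST ~= 'null',
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE,
   A_OUT_QTY=A_IN1_QTY, A_OUT_COST=A_IN1_COST.

notNull1_rej(A_REJ_PKEY, A_REJ_DATE, A_REJ_QTY, A_REJ_COST) <-
   notNull1_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST),
   A_IN1_COST = 'null',
   A_REJ_PKEY=A_IN1_PKEY, A_REJ_DATE=A_IN1_DATE,
   A_REJ_QTY=A_IN1_QTY, A_REJ_COST=A_IN1_COST.

Here, the single input schema is [PKEY, DATE, QTY, COST], the output and rejection schemata are identical to it, the parameter list contains the checked attribute COST, and the rejection semantics is, as stated above, the complement of the output semantics.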

There are two issues that we would like to elaborate on here:

NULL schemata. Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper.

Fig. 4. Instance-of relationships of the architecture graph.

The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.

Instance-of relationships. These relationships are defined among a data/function type and its instances.

Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.

Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.

Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name. A unique identifier for the provider relationship.

Mapping. An ordered pair. The first part of the pair is a term (i.e., an attribute or constant), acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1-1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attributes. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example.

Fig. 6. LDL specification of the motivating example. (Note: for reasons of readability, we do not replace the A_in attribute names with the activity name, i.e., A_OUT_PKEY should be diff_PS1_OUT_PKEY.)
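Because the figure is only partially legible in this copy, the following is a best-effort reconstruction of its three rules, assembled from the surviving fragments. The exact ordering of the attribute lists is an assumption, and in the third rule the fragments mix A_IN1/A_OUT prefixes; we use the A_OUT names so that the rule is well-formed:

addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
   ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
   addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
   lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
   addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
   DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
   SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.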

The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
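As a concrete reading of this rule on activity SK1 (our own walk-through of the definition, using only the relationships described above): the parameter LSKEY populates the output attribute SK1.OUT.SKEY, i.e., rr2(LSKEY, SK1.OUT.SKEY) holds. Consequently, every term that feeds some parameter of SK1 through a regulator relationship, e.g., rr1(LOOKUP.SKEY, LSKEY), rr1(LOOKUP.PKEY, LPKEY) or rr1(SK1.IN.PKEY, PKEY), gives rise to a derived provider relationship towards the computed attribute, such as pr(LOOKUP.SKEY, SK1.OUT.SKEY) and pr(LOOKUP.PKEY, SK1.OUT.SKEY). This is why, in Fig. 7, five derived provider relationships end up pointing to the computed attribute SKEY.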


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name. A unique identifier for the scenario.

Activities. A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation. For each entity we list the model-specific domain and its scenario-specific counterpart: Data Types (D^I, D), Function Types (F^I, F) and Constants (C^I, C) are built-in; Attributes (Ω^I, Ω), Functions (Φ^I, Φ), Schemata (S^I, S), RecordSets (RS^I, RS), Activities (A^I, A), Provider Relationships (Pr^I, Pr), Part-Of Relationships (Po^I, Po), Instance-Of Relationships (Io^I, Io), Regulator Relationships (Rr^I, Rr) and Derived Provider Relationships (Dr^I, Dr) are user-provided.


Recordsets. A finite set of recordsets.

Targets. A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships. A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints.
- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints.
- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment (metamodel layer, template layer and schema layer, related through InstanceOf and IsA links).

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely schema layer, involves a specific ETL scenario.


All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
- Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
- Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
- Binary operations: Union (U), Join (⋈), Diff (Δ), Update detection (ΔUPD).
- Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).
- File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer, we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see in Fig. 10, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name. A unique identifier for the template activity.

Parameter list. A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mapping at instantiation time, etc.

Expression. A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping. A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where <simple constraint> has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
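As a quick illustration of our own (the arity value below is assumed only for the example): for an input schema a_in1 with arityOf(a_in1) = 3, the fragment

[i<arityOf(a_in1)] { A_IN1_$i$, } [i=arityOf(a_in1)] { A_IN1_$i$ }

is expanded at instantiation time into the attribute list

A_IN1_1, A_IN1_2, A_IN1_3

The split into two loops, one for i below the arity and one for i equal to it, simply avoids a trailing comma after the last attribute; the same pattern appears in the selection template given later in this section.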

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Fig. 11. Keywords for templates.
- a_out, a_in: a unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively). Example: difference3_out, difference3_in.
- A_OUT, A_IN: used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively). Example: DIFFERENCE3_OUT, DIFFERENCE3_IN.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$,
      [i=arityOf(a_out)]A_OUT_$i$) <-
   a_in1([i<arityOf(a_in1)]A_IN1_$i$,
         [i=arityOf(a_in1)]A_IN1_$i$),
   expr([i<arityOf(PARAM)]PARAM[$i$],
        [i=arityOf(PARAM)]PARAM[$i$]),
   [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
   [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$.

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply lists a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer



to define certain macros that simplify the management of attribute lists of arbitrary length. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]A_IN1_$i$,
  [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]A_OUT_$i$,
  [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)]PARAM[$i$],
  [i=arityOf(PARAM)]PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
  [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then the template definition is as follows

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING.

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
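The following self-contained Python sketch runs these five phases, in the stated order, over a toy template. All names are illustrative assumptions of ours (the @-prefixed parameter variables, the helper functions, the chosen arities and the activity name dm1), the default input-to-output mapping is omitted for brevity, and the actual ARKTOS II engine is certainly richer than this.

```python
import re

# Toy template and instantiation-time choices (illustrative values only).
TEMPLATE = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), @FIELD > @LOW."
MACROS = {
    "OUTPUT_SCHEMA": "[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$",
    "INPUT_SCHEMA":  "[i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$",
}
ARITIES = {"a_out": 2, "a_in1": 2}
PARAMS  = {"@FIELD": "A_IN1_2", "@LOW": "5"}

def macros(t):                 # step 1: expand macro definitions
    for name, body in MACROS.items():
        t = t.replace(name, body)
    return t

def bounds(t):                 # step 2: evaluate arityOf() in loop boundaries
    return re.sub(r"arityOf\((\w+)\)", lambda m: str(ARITIES[m.group(1)]), t)

def loops(t):                  # step 3: unroll loops, substituting $i$
    def unroll(m):
        n, body, last = int(m.group(1)), m.group(2), m.group(3)
        return "".join([body.replace("$i$", str(i)) for i in range(1, n)]
                       + [last.replace("$i$", str(n))])
    return re.sub(r"\[i<(\d+)\](.*?)\[i=\1\]([^,)]*)", unroll, t)

def params(t):                 # step 4: bind the remaining parameter variables
    for p, v in PARAMS.items():
        t = t.replace(p, v)
    return t

def keywords(t, name="dm1"):   # step 5: rename keywords (cf. Fig. 11)
    for kw in ("a_out", "a_in1"):
        new = name + kw[1:]
        t = t.replace(kw, new).replace(kw.upper(), new.upper())
    return t

text = TEMPLATE
for phase in (macros, bounds, loops, params, keywords):
    text = phase(text)
print(text)
# dm1_out(DM1_OUT_1, DM1_OUT_2) <- dm1_in1(DM1_IN1_1, DM1_IN1_2), DM1_IN1_2 > 5.
```

Reordering the phases breaks the result: for instance, binding parameters before unrolling the loops would make it impossible to refer to the list of attributes produced by the loop, which is exactly the reason for the ordering explained above.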

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and the PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).


Fig. 12. Instantiation procedure.


The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before the PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed.



Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Consider, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow and Xhigh parameters provide values for constants.


Fig. 13. Simple template example: domain mismatch.


The mapping from the input to the output is hardcoded in the template.

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ⟨PK⟩(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.




given by the following calculus-like definition:

Δ⟨A1,...,Ak⟩(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
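As a side illustration of the semantics given above (of the operator itself, not of the LDL template), the following Python sketch detects newly inserted rows by comparing two snapshots on the primary-key attributes only; the function name and the sample data are hypothetical.

```python
def delta(r_new, r_old, pk):
    """Rows of r_new for which no row of r_old agrees on all attributes in pk.

    Mirrors the calculus-like definition: keep x in R_new unless some y in
    R_old has x[A1]=y[A1] and ... and x[Ak]=y[Ak].  Records are plain dicts.
    """
    old_keys = {tuple(row[a] for a in pk) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in pk) not in old_keys]

previous = [{"pkey": 1, "cost": 10}, {"pkey": 2, "cost": 20}]
current  = [{"pkey": 1, "cost": 10}, {"pkey": 2, "cost": 25}, {"pkey": 3, "cost": 30}]

# Only the truly new row is reported; the updated row (pkey=2) is not,
# because equality is checked on the primary-key attributes alone.
print(delta(current, previous, pk=["pkey"]))   # [{'pkey': 3, 'cost': 30}]
```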

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter).

Fig. 15. The motivating example in ARKTOS II.

Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" onto the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the model.



Fig. 16. A detailed zoom-in view of the motivating example.


As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from the application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects.



Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system involving the scenario configuration, the employed templates and their constituents are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse.



The Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors, through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.

DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies



[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures that data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns, in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value, by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster, in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.



We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35-37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures, like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further.



It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47].



Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.



Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52-61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16-20 June 2003, pp. 520-535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381-390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79-94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262-279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8-11, 2002, pp. 326-339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5-9, 2000, pp. 431-445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14-21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46-57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247-262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.

Page 3: Etl design document



In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with

the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan

should be specified involving the notification ofthe administrator either on-line (monitoring) oroff-line (logging) for the status of an executedactivity as well as the security and authenticationmanagement for the ETL environmentWe find that research has not dealt with the

definition of data-centric workflows to the entiretyof its extent In the ETL case for example due tothe data centric nature of the process the designermust deal with the relationship of the involved

activities with the underlying data This involves thedefinition of a primary data flow that describes theroute of data from the sources towards their finaldestination in the data warehouse as they passthrough the activities of the scenario Also due topossible quality problems of the processed datathe designer is obliged to define a data flow for

logical exceptions ie a flow for the problematicdata ie the rows that violate integrity or businessrules It is the combination of the executionsequence and the data flow that generates thesemantics of the ETL workflow the data flowdefines what each activity does and the executionplan defines in which order and combinationIn this paper we work in the internals of the

data flow of ETL scenarios First we present ametamodel particularly customized for the defini-tion of ETL activities We follow a workflow-likeapproach where the output of a certain activitycan either be stored persistently or passed to asubsequent activity Moreover we employ adeclarative database programming languageLDL to define the semantics of each activityThe metamodel is generic enough to capture anypossible ETL activity nevertheless reusability andease-of-use dictate that we can do better in aidingthe data warehouse designer in his task In thispursuit of higher reusability and flexibility wespecialize the set of our generic metamodelconstructs with a palette of frequently used ETL

activities which we call templates Moreover inorder to achieve a uniform extensibility mechan-ism for this library of built-ins we have to dealwith specific language issues thus we also discussthe mechanics of template instantiation to concreteactivities The design concepts that we introducehave been implemented in a tool ARKTOS II whichis also presentedOur contributions can be listed as follows

First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schemata, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.

Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved from a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.

Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work. In Section 6, we make a general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

1 In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.

2 The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling, this is a simple passing of data from one site to another.

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging, while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, are also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example. To motivate our discussion, we will present an example involving the propagation of data from a certain source S1 towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)^1, DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.^2

2. In the DSA we maintain locally a copy of the snapshot of the source, as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any not newly inserted row is rejected and so it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.


[Fig. 3. Bird's-eye view of the motivating example: data flows from S1.PARTSUPP (Source) through FTP_PS1 into DS.PS1_NEW, is compared against DS.PS1_OLD by Diff_PS1 (on DS.PS1_NEW.PKEY = DS.PS1_OLD.PKEY, rejections to Diff_PS1_REJ), checked by NotNull1 on COST (rejections to NotNull1_REJ), extended by Add_Attr1 (SOURCE = 1) into DS.PS1, and assigned surrogate keys by SK1 via LOOKUP (DS.PS1.PKEY, LOOKUP.PKEY, LOOKUP.SOURCE, LOOKUP.SKEY), before being loaded to DW.PARTSUPP in the data warehouse; the intermediate recordsets reside in the DSA.]

Fig. 2. Graphical notation for the architecture graph:
- Data Types: black ellipsoids. Attributes: unshaded ellipsoids. Constants: black circles.
- Function Types: black rectangles. Functions: gray rectangles. Parameters: white rectangles.
- RecordSets: cylinders. Activities: triangles.
- Part-Of relationships: simple lines with diamond edges (we annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters).
- Provider relationships: bold solid arrows (from provider to consumer).
- Instance-Of relationships: dotted arrows (from instance towards the type).
- Derived Provider relationships: bold dotted arrows (from provider to consumer).
- Regulator relationships: dotted lines.


3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.

4. Although we consider the data flow for only one source, namely S1, the data warehouse can clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to be ignored; thus, in this case, we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, ..., Ak] for a certain recordset, its extension is a mapping S = [A1, ..., Ak] → dom(A1) × ... × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × ... × dom(Ak), and a record is the instance of a mapping dom(A1) × ... × dom(Ak) → [x1, ..., xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.
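To make these definitions concrete, the following minimal Python sketch (not part of the original formalism; all class and variable names are illustrative) encodes attributes, schemata and recordsets as defined above, keeping the extension as a finite set of records over the schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str          # e.g. "Integer", "Date"

# A schema is a finite list of attributes.
Schema = Tuple[Attribute, ...]

@dataclass
class RecordSet:
    """A recordset: a name, a logical schema and a finite extension."""
    name: str
    schema: Schema
    extension: List[tuple] = field(default_factory=list)

    def insert(self, record: tuple) -> None:
        # A record instantiates the schema to a list of values.
        assert len(record) == len(self.schema)
        self.extension.append(record)

# Example: the source table S1.PARTSUPP of the motivating example (sample values are invented).
pkey, date, qty, cost = (Attribute(n, t) for n, t in
                         [("PKEY", "Integer"), ("DATE", "Date"),
                          ("QTY", "Integer"), ("COST", "Integer")])
s1_partsupp = RecordSet("S1.PARTSUPP", (pkey, date, qty, cost))
s1_partsupp.insert((10, "2004-01-01", 5, 100))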

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type; consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity, in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:

Name: A unique identifier for the activity.

Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.

Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
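As a rough illustration of the elements just listed (illustrative only; the activity semantics are specified in LDL, kept here as plain strings, and the field names are ours, not the tool's), an elementary activity could be represented by a structure such as the following sketch.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Schema = Tuple[str, ...]                       # attribute names only, for brevity
Term = Union[str, int, float]                  # an attribute name or a constant

@dataclass
class ElementaryActivity:
    name: str                                  # unique identifier
    input_schemata: List[Schema]               # one or more input schemata
    output_schema: Schema                      # rows that pass the check
    rejection_schema: Optional[Schema]         # rows that fail (may be a NULL schema)
    parameters: Dict[str, Term]                # regulators, e.g. {"LPKEY": "LOOKUP.PKEY"}
    output_semantics: str                      # LDL statement for the output
    rejection_semantics: Optional[str] = None  # defaults to the complement of the output

# Example: a sketch of the SK1 surrogate-key activity of the motivating example.
sk1 = ElementaryActivity(
    name="SK1",
    input_schemata=[("PKEY", "DATE", "QTY", "COST", "SOURCE")],
    output_schema=("PKEY", "DATE", "QTY", "COST", "SOURCE", "SKEY"),
    rejection_schema=None,
    parameters={"PKEY": "IN.PKEY", "SOURCE": "IN.SOURCE",
                "LPKEY": "LOOKUP.PKEY", "LSOURCE": "LOOKUP.SOURCE",
                "LSKEY": "LOOKUP.SKEY"},
    output_semantics="addSkey_out(...) <- addSkey_in1(...), lookup(...), ...",
)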

There are two issues that we would like to elaborate on here.

NULL schemata. Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.

Instance-of relationships. These relationships are defined among a data/function type and its instances.

Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.

Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.

Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the Architecture Graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges, so that the figure does not become too crowded.

[Fig. 4. Instance-of relationships of the architecture graph (attributes such as PKEY, QTY, COST, SOURCE and SKEY of the involved IN/OUT schemata and of DW.PARTSUPP are instances of the data type Integer; DATE is an instance of Date).]

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets), along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume a IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

[Fig. 5. Part-of, regulator and provider relationships of the architecture graph: the recordsets DS.PS1, LOOKUP and DW.PARTSUPP, and the activity SK1 with its IN, OUT and PAR schemata (parameters PKEY, SOURCE, LPKEY, LSOURCE, LSKEY).]

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer


value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name: A unique identifier for the provider relationship.

Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default, assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6 we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
   ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
   addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
   lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
   addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
   DATE=A_IN1_DATE, QTY=A_IN1_QTY, COST=A_IN1_COST, SOURCE=A_IN1_SOURCE, PKEY=A_IN1_SKEY.

NOTE: For reasons of readability we do not replace the A in attribute names with the activity name, i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
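This rule is straightforward to operationalize. The following sketch (illustrative only, not the ARKTOS II implementation; node names follow the example) derives the derived provider edges of an activity from its regulator edges, exactly as stated: an edge pr(source, target) is added whenever some regulator edge leads from source into a parameter and some regulator edge leads from a parameter to the output attribute target.

from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]   # (from_node, to_node), nodes named as in the architecture graph

def derived_provider_edges(regulator_edges: Iterable[Edge],
                           parameters: Set[str],
                           output_attributes: Set[str]) -> Set[Edge]:
    """pr(source, target) iff rr1(source, x) and rr2(y, target) for parameters x, y."""
    rr = list(regulator_edges)
    sources = {s for (s, x) in rr if x in parameters and s not in output_attributes}
    targets = {t for (y, t) in rr if y in parameters and t in output_attributes}
    return {(s, t) for s in sources for t in targets}

# Example (cf. Fig. 7): every term feeding a parameter of SK1 becomes a provider of SK1.OUT.SKEY.
params = {"SK1.PAR.PKEY", "SK1.PAR.SOURCE", "SK1.PAR.LPKEY", "SK1.PAR.LSOURCE", "SK1.PAR.LSKEY"}
rr = [("SK1.IN.PKEY", "SK1.PAR.PKEY"), ("SK1.IN.SOURCE", "SK1.PAR.SOURCE"),
      ("LOOKUP.PKEY", "SK1.PAR.LPKEY"), ("LOOKUP.SOURCE", "SK1.PAR.LSOURCE"),
      ("LOOKUP.SKEY", "SK1.PAR.LSKEY"), ("SK1.PAR.LSKEY", "SK1.OUT.SKEY")]
print(derived_provider_edges(rr, params, {"SK1.OUT.SKEY"}))   # five derived edges, as in the text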


[Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right (activity SK1 with its IN, OUT and PAR schemata; the attributes feeding the parameters PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, including those of LOOKUP.OUT, become providers of the computed output attribute SKEY).]


Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name: A unique identifier for the scenario.

Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Recordsets: A finite set of recordsets.

Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

Fig. 8. Formal definition of domains and notation:

  Entity                            Model-specific   Scenario-specific
  Built-in:
    Data Types                      D^I              D
    Function Types                  F^I              F
    Constants                       C^I              C
  User-provided:
    Attributes                      Ω^I              Ω
    Functions                       Φ^I              Φ
    Schemata                        S^I              S
    RecordSets                      RS^I             RS
    Activities                      A^I              A
    Provider Relationships          Pr^I             Pr
    Part-Of Relationships           Po^I             Po
    Instance-Of Relationships       Io^I             Io
    Regulator Relationships         Rr^I             Rr
    Derived Provider Relationships  Dr^I             Dr

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

   V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A,    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr.
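For illustration, G(V, E) fits in a very small data structure; the sketch below (illustrative naming, not the tool's internal model) simply tags each node and edge with the domain it belongs to, following the notation of Fig. 8.

from dataclasses import dataclass, field
from enum import Enum
from typing import Set, Tuple

class NodeKind(Enum):
    DATA_TYPE = "D"; FUNCTION_TYPE = "F"; CONSTANT = "C"; ATTRIBUTE = "Omega"
    FUNCTION = "Phi"; SCHEMA = "S"; RECORDSET = "RS"; ACTIVITY = "A"

class EdgeKind(Enum):
    PROVIDER = "Pr"; PART_OF = "Po"; INSTANCE_OF = "Io"; REGULATOR = "Rr"; DERIVED = "Dr"

@dataclass
class ArchitectureGraph:
    nodes: Set[Tuple[NodeKind, str]] = field(default_factory=set)
    edges: Set[Tuple[EdgeKind, str, str]] = field(default_factory=set)

    def add_node(self, kind: NodeKind, name: str) -> None:
        self.nodes.add((kind, name))

    def add_edge(self, kind: EdgeKind, src: str, dst: str) -> None:
        self.edges.add((kind, src, dst))

# A fragment of the motivating example: DS.PS1.PKEY provides SK1.IN.PKEY.
g = ArchitectureGraph()
g.add_node(NodeKind.RECORDSET, "DS.PS1"); g.add_node(NodeKind.ACTIVITY, "SK1")
g.add_node(NodeKind.ATTRIBUTE, "DS.PS1.PKEY"); g.add_node(NodeKind.ATTRIBUTE, "SK1.IN.PKEY")
g.add_edge(EdgeKind.PART_OF, "DS.PS1", "DS.PS1.PKEY")
g.add_edge(EdgeKind.PROVIDER, "DS.PS1.PKEY", "SK1.IN.PKEY")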

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:
- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:
- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.
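A hedged sketch of how the first data-flow constraint could be checked over such a graph follows (illustrative only; edges are plain tuples tagged with the kinds of Fig. 8, and SK1.IN.FOO is a made-up attribute used to trigger the check).

from typing import Set, Tuple

Edge = Tuple[str, str, str]   # (edge_kind, source, destination)

def attributes_without_provider(edges: Set[Edge], input_attributes: Set[str]) -> Set[str]:
    """Data-flow constraint: every attribute of an input schema must have a provider."""
    provided = {dst for (kind, _src, dst) in edges if kind == "Pr"}
    return {a for a in input_attributes if a not in provided}

# SK1.IN.PKEY is fed by DS.PS1.PKEY; the hypothetical SK1.IN.FOO is not, so only it is reported.
edges = {("Pr", "DS.PS1.PKEY", "SK1.IN.PKEY")}
print(attributes_without_provider(edges, {"SK1.IN.PKEY", "SK1.IN.FOO"}))   # {'SK1.IN.FOO'}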

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

[Fig. 9. The metamodel for the logical entities of the ETL environment: the metamodel layer (Data Types, Functions, Elementary Activity, RecordSet, Relationships), the template layer (e.g., NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Relationships), related to the metamodel layer through IsA links, and the schema layer (S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP), related to the upper layers through InstanceOf links.]

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,


Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations, not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name: A unique identifier for the template activity.

Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mapping at instantiation time, etc.

Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
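Illustratively (the field names and the abbreviated expression below are ours, not the tool's registry format), a registered template can be thought of as a record holding these four elements; instantiation then binds the parameters and expands the expression into concrete LDL.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TemplateActivity:
    name: str                        # unique identifier, e.g. "Domain Mismatch"
    parameters: List[str]            # regulator names, e.g. ["@FIELD", "@Xlow", "@Xhigh"]
    expression: str                  # LDL text containing variables, loops and keywords
    mapping: List[Tuple[str, str]]   # default input-to-output bindings, refined at instantiation

# Hypothetical registration of a domain-mismatch-like template (the real expression is longer).
domain_mismatch = TemplateActivity(
    name="Domain Mismatch",
    parameters=["@FIELD", "@Xlow", "@Xhigh"],
    expression="a_out(...) <- a_in1(...), @FIELD >= @Xlow, @FIELD <= @Xhigh, ...",
    mapping=[("A_IN1_$i$", "A_OUT_$i$")],
)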

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @⟨parameter name⟩[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(⟨input/output/parameter schema⟩), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

   [⟨simple constraint⟩] { ⟨loop body⟩ }

where ⟨simple constraint⟩ has the form

   ⟨lower bound⟩ ⟨comparison operator⟩ ⟨iterator⟩ ⟨comparison operator⟩ ⟨upper bound⟩

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
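As a small illustration of how such loops could be expanded (a sketch under our own simplifying assumptions, not the grammar actually used by ARKTOS II: a single, non-nested iterator named i and bounds of the form arityOf(x)), the following function unrolls the two loop forms used in the examples of this section.

import re

def expand_loops(text: str, arities: dict) -> str:
    """Unroll loops of the form [i<arityOf(x)]{body} / [i=arityOf(x)]{body}.

    Marked iterator appearances ($i$) in the body are replaced by the current value.
    Only linear iteration with step 1 and the single iterator 'i' is supported here.
    """
    pattern = re.compile(r"\[i(<|=)arityOf\((\w+)\)\]\{(.*?)\}")

    def unroll(match: re.Match) -> str:
        op, schema, body = match.groups()
        n = arities[schema]
        values = range(1, n) if op == "<" else [n]          # '<' stops before the bound
        return "".join(body.replace("$i$", str(v)) for v in values)

    return pattern.sub(unroll, text)

# Example: the attribute list of an output schema of arity 3.
tmpl = "a_out([i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$})"
print(expand_loops(tmpl, {"a_out": 3}))
# -> a_out(A_OUT_1,A_OUT_2, A_OUT_3)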

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form ⟨unique_pred_name⟩_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form ⟨predicate unique name in upper case⟩_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

   name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

   [iterator<maxLimit]{name_theme_$iterator$,}
   [iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}) <-
   a_in1([i<arityOf(a_in1)]{A_IN1_$i$,} [i=arityOf(a_in1)]{A_IN1_$i$}),
   expr([i<arityOf(@PARAM)]{@PARAM[$i$],} [i=arityOf(@PARAM)]{@PARAM[$i$]}),
   [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned at the syntax for loops, the expression

   [i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
   [i<arityOf(a_in1)]{A_IN1_$i$,} [i=arityOf(a_in1)]{A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
   [i<arityOf(a_in)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
   [i<arityOf(@PARAM)]{@PARAM[$i$],} [i=arityOf(@PARAM)]{@PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
   [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,} [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.

2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.

3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.

4. All the remaining parameter variables are instantiated.

5. Keywords are recognized and renamed.
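A sketch of this ordering as a pipeline follows (our own illustrative decomposition, not the actual ARKTOS II code); the loop expansion of step 3 is passed in as a callable, for instance the expand_loops sketch shown earlier, so that steps 1, 2, 4 and 5 remain simple textual substitutions.

from typing import Callable, Dict

def instantiate_template(expression: str,
                         macros: Dict[str, str],
                         params: Dict[str, str],
                         keyword_map: Dict[str, str],
                         loop_expander: Callable[[str], str]) -> str:
    """Apply the five instantiation steps in the order listed above (illustrative only)."""
    text = expression
    for macro, body in macros.items():            # 1. replace macro definitions by their expansions
        text = text.replace(macro, body)
    # 2. arityOf() calls and parameter variables in loop boundaries are resolved by the expander
    text = loop_expander(text)                    # 3. unroll the loop productions
    for var, value in params.items():             # 4. instantiate the remaining parameter variables
        text = text.replace(var, value)
    for kw, concrete in keyword_map.items():      # 5. recognize and rename keywords (a_out -> dm1_out, ...)
        text = text.replace(kw, concrete)
    return text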

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and the @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes

[Fig. 12. Instantiation procedure.]

(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,} @OUTFIELD, as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As it can easily be seen, these expansions must be done before @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
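To make the outcome of the instantiation more concrete, the following is a minimal LDL-style sketch of the kind of rule that the last row of Fig. 12 produces. The output attribute names used here are illustrative (the figure uses the names generated for the specific activity); the input attributes COST_IN, PRICE_IN and the function subtract are the ones mentioned above:

   fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
      myFunc_in(COST_IN, PRICE_IN),
      subtract(COST_IN, PRICE_IN, PROFIT),
      COST_OUT = COST_IN,
      PRICE_OUT = PRICE_IN,
      PROFIT_OUT = PROFIT.

The first conjunct corresponds to the input schema, the second applies the function and binds its result to PROFIT, and the last three equalities realize the input-to-output mapping.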

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.
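As an illustration of how the fourth row of Fig. 13 plausibly reads, the following is a minimal LDL-style sketch of the instantiated Domain Mismatch activity; the exact comparison operators and formatting in the figure may differ, but the structure follows the simple-template form OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING described above:

   dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
      dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
      DM1_IN_3 >= 5,
      DM1_IN_3 <= 10,
      DM1_OUT_1 = DM1_IN_1,
      DM1_OUT_2 = DM1_IN_2,
      DM1_OUT_3 = DM1_IN_3,
      DM1_OUT_4 = DM1_IN_4.

The two comparison clauses instantiate FIELD, Xlow and Xhigh with DM1_IN_3, 5 and 10, respectively, while the four equalities realize the default input-to-output mapping.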

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data satisfying it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ_PK(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.



given by the following calculus-like definition: Δ_{A1,…,Ak}(R, S) = {x ∈ R | ¬∃ y ∈ S: x[A1] = y[A1] ∧ … ∧ x[Ak] = y[Ak]}.

In Fig. 14, we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so that we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
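To convey the flavor of such a program-based template, the following is a minimal LDL-style sketch in the spirit of Fig. 14; the predicate and attribute names are illustrative rather than the exact ones generated by the instantiation, and equality is assumed to be checked on a single key attribute:

   semijoin(A_IN1_1, A_IN1_2) <-
      a_in1(A_IN1_1, A_IN1_2),
      a_in2(A_IN2_1, A_IN2_2),
      A_IN1_1 = A_IN2_1.

   a_out(A_OUT_1, A_OUT_2) <-
      a_in1(A_IN1_1, A_IN1_2),
      ~semijoin(A_IN1_1, A_IN1_2),
      A_OUT_1 = A_IN1_1,
      A_OUT_2 = A_IN1_2.

The intermediate semijoin predicate collects the tuples of the first input that have a matching key in the second input; the output rule then keeps only the tuples of the first input for which no such match exists, i.e., exactly the does-not-belong constraint discussed above.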

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already

Fig. 15. The motivating example in ARKTOS II.

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" into the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed by the



Fig. 16. A detailed zoom-in view of the motivating example.


model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16, we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. The Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, as well as different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.

DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards being an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions for specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35,36,37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the WfMC standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43], a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/hour and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 hours of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html


[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL Magic Quadrant Update: Market Pressure Increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, AJAX: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report RR-3742, INRIA, 1999.
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), Lecture Notes in Computer Science, vol. 2681, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), Lecture Notes in Computer Science, vol. 2348, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.


[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.



general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

1 In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.

2 The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling, this is a simple passing of data from one site to another.

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging, while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform, graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, are also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example. To motivate our discussion, we will present an example involving the propagation of data from a certain source, S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)1 DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.2

2. In the DSA we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any not newly inserted row is rejected and so it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.


Fig. 3. Bird's-eye view of the motivating example.

Fig. 2. Graphical notation for the architecture graph: data types (black ellipsoids), recordsets (cylinders), function types (black rectangles), functions (gray rectangles), constants (black circles), parameters (white rectangles), attributes (unshaded ellipsoids), activities (triangles), part-of relationships (simple lines with diamond edges; the part-of relationship between a function and its return type is annotated with a directed edge, to distinguish it from the rest of the parameters), provider relationships (bold solid arrows, from provider to consumer), instance-of relationships (dotted arrows, from instance towards the type), derived provider relationships (bold dotted arrows, from provider to consumer), and regulator relationships (dotted lines).


3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.

4. Although we consider the data flow for only one source, namely S1, the data warehouse can


clearly have more sources for part supplies. In order to keep track of the source of each row entering the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity (a small sketch of this surrogate key assignment is given right after this list).
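To give a feeling of how such an activity is captured in LDL terms, the following is a minimal, illustrative sketch of the surrogate key assignment performed by SK1; the predicate and attribute names are hypothetical and the actual rule generated by the respective template may differ:

   dw_partsupp(PKEY_NEW, SOURCE, DATE, QTY, COST) <-
      ds_ps1(PKEY, SOURCE, DATE, QTY, COST),
      lookup(PKEY, SOURCE, SKEY),
      PKEY_NEW = SKEY.

The join with the lookup table replaces the production key PKEY with the surrogate key SKEY, while the SOURCE attribute guarantees that synonyms coming from different sources are mapped to different data warehouse objects.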

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case, we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically


restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai). In the rest of this paper we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity, in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:

Name: a unique identifier for the activity.

Input schemata: a finite set of one or more input schemata that receive data from the data providers of the activity.

Output schema: a schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

Rejections schema: a schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

Parameter list: a set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

Output operational semantics: an LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

Rejection operational semantics: an LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently (a small sketch of such a pair of statements follows this list).
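To illustrate the last two elements, here is a small hedged sketch of our own (it does not reproduce any of the paper's figures): for a selection-like activity that keeps only rows with a positive COST, the output and rejection operational semantics could be written as the following pair of LDL-style rules, the second being the complement of the first. The predicate names, the three-attribute schema and the use of the usual arithmetic comparison predicates are illustrative assumptions.

    sel1_out(A_OUT_PKEY, A_OUT_QTY, A_OUT_COST) <-
        sel1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
        A_IN1_COST > 0,
        A_OUT_PKEY = A_IN1_PKEY, A_OUT_QTY = A_IN1_QTY, A_OUT_COST = A_IN1_COST.

    sel1_rej(A_REJ_PKEY, A_REJ_QTY, A_REJ_COST) <-
        sel1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
        A_IN1_COST <= 0,
        A_REJ_PKEY = A_IN1_PKEY, A_REJ_QTY = A_IN1_QTY, A_REJ_COST = A_IN1_COST.

If no consumer is attached to the rejection predicate, the respective NULL schema is implied and the rejected rows are simply discarded, as discussed next.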

There are two issues that we would like to elaborate on here:

NULL schemata: whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity; they will simply be ignored.

Language issues: initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper.

Fig. 4. Instance-of relationships of the architecture graph.

The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships: these relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.

Instance-of relationships: these relationships are defined among a data/function type and its instances.

Provider relationships: these are relationships that involve attributes with a provider-consumer relationship.

Regulator relationships: these relationships are defined among the parameters of activities and the terms that populate these activities.

Derived provider relationships: a special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the Architecture Graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and


functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively).

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer


value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed, in order to derive a new attribute through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name: a unique identifier for the provider relationship.

Mapping: an ordered pair. The first part of the pair is a term (i.e., an attribute or constant), acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attributes. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider (or a constant) which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6 we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

    addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
        ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
        addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
        lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
        addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
        DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
        SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability, we do not replace the A_IN/A_OUT attribute names with the name of the providing activity; i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as: (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name: a unique identifier for the scenario.

Activities: a finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.


Fig. 8. Formal definition of domains and notation.

Entity                            Model-specific   Scenario-specific
Built-in:
  Data Types                      D^I              D
  Function Types                  F^I              F
  Constants                       C^I              C
User-provided:
  Attributes                      Ω^I              Ω
  Functions                       Φ^I              Φ
  Schemata                        S^I              S
  RecordSets                      RS^I             RS
  Activities                      A^I              A
  Provider Relationships          Pr^I             Pr
  Part-Of Relationships           Po^I             Po
  Instance-Of Relationships       Io^I             Io
  Regulator Relationships         Rr^I             Rr
  Derived Provider Relationships  Dr^I             Dr

Recordsets: a finite set of recordsets.

Targets: a special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: a finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario practically acts as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

All the attributes of the input schema(ta) of an activity should have a provider.


Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,


Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).


Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.).

The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization).

The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name: a unique identifier for the template activity.

Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables: we have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions: we employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

    [<simple constraint>] { <loop body> }

where simple constraint has the form

    <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords: keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program.

Fig. 11. Keywords for templates:

a_out, a_in — a unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively). Example: difference3_out, difference3_in.

A_OUT, A_IN — used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively). Example: DIFFERENCE3_OUT, DIFFERENCE3_IN.

In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros: to make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, …, name_theme_k

we need to give the following expression:

    [iterator<maxLimit]{name_theme_$iterator$,}
    [iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

    a_out([i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}) <-
        a_in1([i<arityOf(a_in1)]{A_IN1_$i$,} [i=arityOf(a_in1)]{A_IN1_$i$}),
        expr([i<arityOf(@PARAM)]{@PARAM[$i$],} [i=arityOf(@PARAM)]{@PARAM[$i$]}),
        [i<arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$}

As already mentioned at the syntax for loops, the expression

    [i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary-length attribute lists. We employ the following macros:

    DEFINE INPUT_SCHEMA AS
        [i<arityOf(a_in1)]{A_IN1_$i$,}
        [i=arityOf(a_in1)]{A_IN1_$i$}

    DEFINE OUTPUT_SCHEMA AS
        [i<arityOf(a_out)]{A_OUT_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$}

    DEFINE PARAM_SCHEMA AS
        [i<arityOf(@PARAM)]{@PARAM[$i$],}
        [i=arityOf(@PARAM)]{@PARAM[$i$]}

    DEFINE DEFAULT_MAPPING AS
        [i<arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$}

Then, the template definition is as follows:

    a_out(OUTPUT_SCHEMA) <-
        a_in1(INPUT_SCHEMA),
        expr(PARAM_SCHEMA),
        DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as it can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.

2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.

3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.

4. All the rest parameter variables are instantiated.

5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

instantiation for the function application activityTo understand the overall process better firstobserve the outcome of it ie the specific activitywhich is produced as depicted in the final row ofFig 12 labeled keyword renaming The outputschema of the activity fa12_out is the head ofthe LDL rule that specifies the activity The bodyof the rule says that the output records arespecified by the conjunction of the followingclauses (a) the input schema myFunc_in (b)the application of function subtract over theattributes COST_IN PRICE_IN and the produc-tion of a value PROFIT and (c) the mapping ofthe input to the respective output attributes asspecified in the last three conjuncts of the ruleThe first row template shows the initial

template as it has been registered by the designerFUNCTION holds the name of the function to beused subtract in our case and the PARAM[ ]holds the inputs of the function which in our caseare the two attributes of the input schema Theproblem we have to face is that all input outputand function schemata have a variable number ofparameters To abstract from the complexity ofthis problem we define four macro definitions onefor each schema (INPUT_SCHEMA OUTPUT_SCHEMA FUNCTION_INPUT) along with a macrofor the mapping of input to output attributes

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,} OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As it can easily be seen, these expansions must be done before @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
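Since Fig. 12 is available only as an image, the following is a rough sketch of what the fully instantiated rule of its final row could look like, under the assumptions stated in the text (a two-attribute input schema with COST and PRICE, and the subtract function producing PROFIT); the exact attribute names and the treatment of the function's return value are illustrative and may differ from the actual figure:

    fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
        myFunc_in(COST_IN, PRICE_IN),
        subtract(COST_IN, PRICE_IN, PROFIT),
        COST_OUT = COST_IN,
        PRICE_OUT = PRICE_IN,
        PROFIT_OUT = PROFIT.

Under these assumptions, the head corresponds to the output schema fa12_out, the first conjunct to the input schema myFunc_in, the second to the application of subtract, and the last three conjuncts to the input-to-output mapping, matching the reading of the rule given above.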

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

    OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by

specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants.

Fig. 13. Simple template example: domain mismatch.

The mapping from the input to the output is hardcoded in the template.
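Fig. 13 itself is not reproduced here; as a rough sketch under the parameter values just mentioned (check on the third attribute, acceptable range 5 to 10) and assuming the usual arithmetic comparison predicates, the instantiated DM1 activity could be expressed as follows:

    dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
        dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
        DM1_IN_3 >= 5,
        DM1_IN_3 <= 10,
        DM1_OUT_1 = DM1_IN_1, DM1_OUT_2 = DM1_IN_2,
        DM1_OUT_3 = DM1_IN_3, DM1_OUT_4 = DM1_IN_4.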

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that it requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rules for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R).

Fig. 14. Program-based template example: Difference activity.

The formal semantics of the difference operator are given by the following calculus-like definition:

    Δ<A1,…,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ … ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
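Fig. 14 is likewise available only as an image; a minimal sketch of the program-based structure it describes, assuming an illustrative three-attribute schema keyed on PKEY and writing negation with the ~ operator of LDL++, could be:

    semijoin(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST) <-
        dF1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
        dF1_in2(A_IN2_PKEY, A_IN2_QTY, A_IN2_COST),
        A_IN1_PKEY = A_IN2_PKEY.

    dF1_out(A_OUT_PKEY, A_OUT_QTY, A_OUT_COST) <-
        dF1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
        ~semijoin(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
        A_OUT_PKEY = A_IN1_PKEY, A_OUT_QTY = A_IN1_QTY, A_OUT_COST = A_IN1_COST.

The intermediate semijoin predicate collects the tuples of the first input that have a matching PKEY in the second input; its negated appearance in the output rule keeps exactly the newly inserted rows, in line with the Δ<PK>(R_new, R) semantics given above.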

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already


defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the model.

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of

relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 517

design quality by employing a set of metrics thatare presented in [6] either for the whole scenarioor for each activity of itThe scenarios are stored in ARKTOS II repository
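To give a flavor of this mechanism, the sketch below shows how an activity of the motivating example could be derived from a template. The @-style placeholders, the predicate names and the textual form of the template are ours, used here purely for illustration; the exact template language and instantiation mechanics are those discussed earlier in the paper.

   % Hypothetical template AddConstantAttribute: append a new attribute @ATTR,
   % populated by the constant @VALUE, to every incoming row.
   %
   % Instantiation for activity Add_Attr1 of the motivating example
   % (@ATTR := SOURCE, @VALUE := 1), which flags each row with its source:
   addAttr1_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE) <-
         addAttr1_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST),
         A_OUT_PKEY = A_IN1_PKEY, A_OUT_DATE = A_IN1_DATE,
         A_OUT_QTY = A_IN1_QTY, A_OUT_COST = A_IN1_COST,
         A_OUT_SOURCE = 1.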

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we report (a) on related commercial studies and tools in the field of ETL, (b) on related research efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for the year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3], and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis in IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they are slowly starting to take a portion of the ETL market through their DBMS built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL functionality over and above the base capabilities of the DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulation services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS Designer: a GUI used to interactively design and execute DTS packages.
- DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.
- DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules (Manager, Designer, Director and Administrator), as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.
- Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).
- Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

system which is targeted to provide interactivedata cleaning to its users The system offers thepossibility of performing several algebraic opera-tions over an underlying data set including format

(application of a function) drop copy add acolumn merge delimited columns split a columnon the basis of a regular expression or a position ina string divide a column on the basis of a predicate(resulting in two columns the first involving therows satisfying the condition of the predicate andthe second involving the rest) selection of rows onthe basis of a condition folding columns (where aset of attributes of a record is split into severalrows) and unfolding Optimization algorithms arealso provided for the CPU usage for certain classesof operators The general idea behind PotterrsquosWheel is that users build data transformations initerative and interactive way In the backgroundPotterrsquos Wheel automatically infers structures fordata values in terms of user-defined domains andaccordingly checks for constraint violations Usersgradually build transformations to clean the databy adding or undoing transforms on a spread-sheet-like interface the effect of a transform isshown at once on records visible on screen Thesetransforms are specified either through simplegraphical operations or by showing the desiredeffects on example data values In the backgroundPotterrsquos Wheel automatically infers structures fordata values in terms of user-defined domains andaccordingly checks for constraint violations Thususers can gradually build a transformation asdiscrepancies are found and clean the data with-out writing complex programs or enduring longdelays


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards being an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35,36,37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of a workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow relevant data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section we briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the scope of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have been formally defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner, Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.



Fig. 3. Bird's-eye view of the motivating example.

Fig. 2. Graphical notation for the architecture graph:
- Data types: black ellipsoids
- RecordSets: cylinders
- Function types: black rectangles
- Functions: gray rectangles
- Constants: black circles
- Parameters: white rectangles
- Attributes: unshaded ellipsoids
- Activities: triangles
- Part-of relationships: simple lines with diamond edges (1)
- Provider relationships: bold solid arrows (from provider to consumer)
- Instance-of relationships: dotted arrows (from instance towards the type)
- Derived provider relationships: bold dotted arrows (from provider to consumer)
- Regulator relationships: dotted lines
(1) We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.


3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.

4. Although we consider the data flow for only one source, namely S1, the data warehouse can clearly have more sources for part supplies. In order to keep track of the source of each row entering the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DSPS1 (PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys; thus, they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (a small illustrative instance of such a lookup table is shown right after this list). In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.
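For intuition, a small hypothetical instance of such a lookup table is shown below; the values are ours and purely illustrative. It shows how the same production key, arriving from two different sources, is mapped to two different surrogate keys, while each (PKEY, SOURCE) combination receives exactly one surrogate:

   PKEY    SOURCE   SKEY
   P101    1        1001
   P101    2        1002
   P205    1        1003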

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family, instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.
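As a small worked example (the schema and the values are ours, purely illustrative), consider a recordset with schema S = [PKEY, COST], where dom(PKEY) and dom(COST) are subsets of dom(Integer):

   S = [PKEY, COST]
   extension = { [10, 500], [20, 300] } ⊆ dom(PKEY) × dom(COST)

Each record of the extension, e.g., [10, 500], is the instance of a mapping that assigns the value 10 to PKEY and 500 to COST.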

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function, and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements (an illustrative LDL sketch follows the list):

- Name: a unique identifier for the activity.
- Input schemata: a finite set of one or more input schemata that receive data from the data providers of the activity.
- Output schema: a schema that describes the placeholder for the rows that pass the check performed by the elementary activity.
- Rejections schema: a schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.
- Parameter list: a set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.
- Output operational semantics: an LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.
- Rejection operational semantics: an LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
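As an illustration of the last two elements, the output and rejection semantics of the activity NotNull1 of the motivating example could be sketched in LDL along the following lines. The predicate names, the representation of a missing value by the constant 'null' and the inequality operator are ours, for illustration only.

   notNull1_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST) <-
         notNull1_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST),
         A_IN1_COST ~= 'null',
         A_OUT_PKEY = A_IN1_PKEY, A_OUT_DATE = A_IN1_DATE,
         A_OUT_QTY = A_IN1_QTY, A_OUT_COST = A_IN1_COST.

   notNull1_rej(A_REJ_PKEY, A_REJ_DATE, A_REJ_QTY, A_REJ_COST) <-
         notNull1_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST),
         A_IN1_COST = 'null',
         A_REJ_PKEY = A_IN1_PKEY, A_REJ_DATE = A_IN1_DATE,
         A_REJ_QTY = A_IN1_QTY, A_REJ_COST = A_IN1_COST.

Note how the rejection rule is simply the complement of the output rule with respect to the condition on COST.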

There are two issues that we would like to elaborate on here:

- NULL schemata: whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity; they will simply be ignored.

- Language issues: initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships, along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

- Part-of relationships: these relationships involve attributes and parameters, and relate them to the respective activity, recordset or function to which they belong.
- Instance-of relationships: these relationships are defined among a data/function type and its instances.
- Provider relationships: these are relationships that involve attributes with a provider–consumer relationship.
- Regulator relationships: these relationships are defined among the parameters of activities and the terms that populate these activities.
- Derived provider relationships: a special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussion on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges, so that the figure does not become too crowded.

Fig. 4. Instance-of relationships of the architecture graph.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets), along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DSPS1, LOOKUP, DWPARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata, in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table, and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute through the respective parameter.
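To make these bindings concrete, the five regulator relationships of SK1 shown in Fig. 5 can be summarized as follows; the qualified attribute names are ours, added only for readability:

   PKEY     <-  SK1.IN.PKEY
   SOURCE   <-  SK1.IN.SOURCE
   LPKEY    <-  LOOKUP.PKEY
   LSOURCE  <-  LOOKUP.SOURCE
   LSKEY    <-  LOOKUP.SKEY   (and, in turn, populates SK1.OUT.SKEY)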

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:
- Name: a unique identifier for the provider relationship.
- Mapping: an ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

providers of an activity towards its consumers weneed the following three groups of providerrelationships

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default, assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6 we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider

addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
   ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
   addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
   lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
   A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
   A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
   addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
   DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
   SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability we do not replace the A_IN/A_OUT attribute names with names derived from the respective recordset; i.e., A_OUT_PKEY in the first rule should be PS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
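Assuming the regulator edges are available as rr/2 facts (as in the illustrative listing given earlier) and that hypothetical bookkeeping predicates param(A, X) and outputAttr(A, T) record, respectively, the parameters and the output-schema attributes of an activity A, the definition above can be read as a single Datalog-style rule; this is only a sketch in the spirit of the LDL notation, not part of the paper's formal model.

dr(Source, Target) <-
   rr(Source, X), param(A, X),
   param(A, Y), rr(Y, Target),
   outputAttr(A, Target).

Applied to the rr facts listed earlier for SK1, the rule yields five derived provider relationships, one from each of SK1.IN.PKEY, SK1.IN.SOURCE, LOOKUP.PKEY, LOOKUP.SOURCE and LOOKUP.SKEY towards SK1.OUT.SKEY, which is exactly the situation depicted on the right-hand side of Fig. 7.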



Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.


Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name. A unique identifier for the scenario.

Activities. A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.


Entity                              Model-specific   Scenario-specific
Built-in:
  Data Types                        D^I              D
  Function Types                    F^I              F
  Constants                         C^I              C
User-provided:
  Attributes                        Ω^I              Ω
  Functions                         Φ^I              Φ
  Schemata                          S^I              S
  RecordSets                        RS^I             RS
  Activities                        A^I              A
  Provider Relationships            Pr^I             Pr
  Part-Of Relationships             Po^I             Po
  Instance-Of Relationships         Io^I             Io
  Regulator Relationships           Rr^I             Rr
  Derived Provider Relationships    Dr^I             Dr

Fig. 8. Formal definition of domains and notation.


Recordsets. A finite set of recordsets.

Targets. A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships. A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph, in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A,    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr.

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

All the attributes of the input schema(ta) of an activity should have a provider.


Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider.
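As an illustration of how such constraints could be checked mechanically over the architecture graph, the following Datalog-style rules flag violations of the first and second data flow constraints; the predicates inputAttr/2, provider/2, paramOf/2, container/2 and priority/2 are hypothetical bookkeeping relations introduced only for this sketch, and rr/2 is the illustrative regulator-edge predicate used earlier.

hasProvider(Attr) <- provider(_, Attr).
missingProvider(Attr) <-
   inputAttr(Act, Attr), ~hasProvider(Attr).

badOrdering(Act, Term) <-
   paramOf(Act, Par), rr(Term, Par),
   container(Term, C),
   priority(C, Pc), priority(Act, Pa), Pc >= Pa.

The first two rules detect input attributes without any provider; the third detects a term populating a parameter of an activity whose container does not precede that activity in the execution order.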

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,



Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (ΔUPD).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several "patterns", not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name. A unique identifier for the template activity.

Parameter list. A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression. A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping. A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanisms and the template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -; valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
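As a small illustration (our own, not one of the paper's figures): assuming a schema a_out of arity 3, the loop expression

[i<arityOf(a_out)] {A_OUT_$i$,} [i=arityOf(a_out)] {A_OUT_$i$}

would be expanded at instantiation time into the attribute list A_OUT_1, A_OUT_2, A_OUT_3, with the second loop emitting the last attribute without the trailing comma.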

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to


Keyword: a_out / a_in. Usage: a unique name for the output/input schema of the activity; the predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively). Example: difference3_out / difference3_in.

Keyword: A_OUT / A_IN. Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes; the names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively). Example: DIFFERENCE3_OUT / DIFFERENCE3_IN.

Fig. 11. Keywords for templates.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit] {name_theme_$iterator$,}
[iterator=maxLimit] {name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out( [i<arityOf(a_out)] {A_OUT_$i$,} [i=arityOf(a_out)] {A_OUT_$i$} ) <-
   a_in1( [i<arityOf(a_in1)] {A_IN1_$i$,} [i=arityOf(a_in1)] {A_IN1_$i$} ),
   expr( [i<arityOf(@PARAM)] {@PARAM[$i$],} [i=arityOf(@PARAM)] {@PARAM[$i$]} ),
   [i<arityOf(a_out)] {A_OUT_$i$ = A_IN1_$i$,}
   [i=arityOf(a_out)] {A_OUT_$i$ = A_IN1_$i$}

As already mentioned at the syntax for loops, the expression

[i<arityOf(a_out)] {A_OUT_$i$,} [i=arityOf(a_out)] {A_OUT_$i$}

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)] {A_IN1_$i$,}
  [i=arityOf(a_in1)] {A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)] {A_OUT_$i$,}
  [i=arityOf(a_out)] {A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
  [i<arityOf(@PARAM)] {@PARAM[$i$],}
  [i=arityOf(@PARAM)] {@PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)] {A_OUT_$i$ = A_IN1_$i$,}
  [i=arityOf(a_out)] {A_OUT_$i$ = A_IN1_$i$}

Then the template definition is as follows

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING.
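To show the effect of the whole substitution chain, the following is a hedged sketch (our own illustration, not a figure of the paper) of the LDL that this selection template could produce for a hypothetical activity named sel1, assuming input and output schemata of arity 3 and a parameter list bound to the single attribute SEL1_IN1_2:

sel1_out(SEL1_OUT_1, SEL1_OUT_2, SEL1_OUT_3) <-
   sel1_in1(SEL1_IN1_1, SEL1_IN1_2, SEL1_IN1_3),
   expr(SEL1_IN1_2),
   SEL1_OUT_1 = SEL1_IN1_1,
   SEL1_OUT_2 = SEL1_IN1_2,
   SEL1_OUT_3 = SEL1_IN1_3.

Here, the keywords a_out/a_in1 have been renamed to sel1_out/sel1_in1 and A_OUT/A_IN1 to SEL1_OUT/SEL1_IN1, following the renaming conventions of Fig. 11.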

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.

2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.

3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.

4. All the rest of the parameter variables are instantiated.

5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and the PARAM[ ] list holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] {A_OUT_$i$,} OUTFIELD, as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
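Since the content of Fig. 12 is not reproduced here, the following hedged sketch reconstructs, from the description above, what the final keyword-renamed rule of the function application activity could look like; apart from COST_IN, PRICE_IN, PROFIT and the predicate names fa12_out, myFunc_in and subtract, which are mentioned in the text, the remaining attribute names are our own assumptions.

fa12_out(PKEY_OUT, COST_OUT, PRICE_OUT, PROFIT) <-
   myFunc_in(PKEY_IN, COST_IN, PRICE_IN),
   subtract(COST_IN, PRICE_IN, PROFIT),
   PKEY_OUT = PKEY_IN,
   COST_OUT = COST_IN,
   PRICE_OUT = PRICE_IN.

The last three conjuncts realize the default input-to-output mapping mentioned in the text.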

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of a LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limits of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.
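Since Fig. 13 itself is not reproduced here, the following is a hedged sketch, in the notation introduced above, of what the Domain Mismatch template and its DM1 instantiation could look like; the exact formulation in the figure may differ, and the range check is assumed to use ordinary arithmetic comparisons.

a_out(OUTPUT_SCHEMA) <-
   a_in(INPUT_SCHEMA),
   @FIELD >= @Xlow,
   @FIELD <= @Xhigh,
   DEFAULT_MAPPING.

dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
   dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
   DM1_IN_3 >= 5,
   DM1_IN_3 <= 10,
   DM1_OUT_1 = DM1_IN_1, DM1_OUT_2 = DM1_IN_2,
   DM1_OUT_3 = DM1_IN_3, DM1_OUT_4 = DM1_IN_4.

The first rule sketches the template (first row of Fig. 13); the second, its instantiation for DM1 with FIELD bound to the third input attribute and Xlow = 5, Xhigh = 10, after keyword renaming (fourth row).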

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case for operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q∧p) ≡ ¬q∨¬p. Thus, in general, we allow the construction of a LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.


given by the following calculus-like definition: Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1]=y[A1] ∧ ... ∧ x[Ak]=y[Ak]}.

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
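As Fig. 14 is not reproduced here, the following hedged sketch shows only the overall structure that the text describes for such a program-based template: an intermediate semijoin predicate that is then negated in the output rule. KEY_EQUALITIES stands for the equalities on the key attributes of the two inputs, INPUT_SCHEMA2 for the attribute list of the second input, and ~ is assumed as the negation syntax; all three are our own shorthand, and the actual template in the figure may differ in its details.

semijoin(INPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   a_in2(INPUT_SCHEMA2),
   KEY_EQUALITIES.

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   ~semijoin(INPUT_SCHEMA),
   DEFAULT_MAPPING.

The second rule mirrors the structure described in the text: the does-not-belong condition appears as the negated intermediate predicate in the output rule.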

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already

Fig. 15. The motivating example in ARKTOS II.

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" into the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the



Fig. 16. A detailed zoom-in view of the motivating example.


model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from the application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16, we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer. A GUI used to interactively design and execute DTS packages.

DTS export and import wizards. Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces. A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules: Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns, in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value, by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster, in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and, accordingly, checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework, through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions to specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35,36,37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36] the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39] the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics and, in particular, on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48] the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a large amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for their valuable comments, which improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, Power Center, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.



clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1(PKEY, SOURCE, DATE, QTY, COST).

5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L(PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP(PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.
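To make the mechanics of this step concrete, the following is a minimal sketch, in Python, of the surrogate key lookup just described; it is not part of the paper's model, the lookup table L(PKEY, SOURCE, SKEY) is emulated by a dictionary, and all concrete values are hypothetical.

# Illustrative sketch of surrogate key assignment via a lookup table.
# The lookup is keyed by (production key, source); rows are plain dicts.
lookup = {
    ("P-017", 1): 1001,   # (PKEY, SOURCE) -> SKEY
    ("P-017", 2): 1002,   # the same production key in another source maps elsewhere
}

def assign_surrogate_key(row, lookup):
    """Return a copy of the row extended with its surrogate key SKEY."""
    out = dict(row)
    out["SKEY"] = lookup[(row["PKEY"], row["SOURCE"])]
    return out

row = {"PKEY": "P-017", "SOURCE": 1, "DATE": "2004-03-01", "QTY": 5, "COST": 10.0}
print(assign_surrogate_key(row, lookup))

In the scenario itself this logic is, of course, expressed declaratively by the SK1 activity rather than procedurally; the sketch only illustrates the intended effect of the lookup.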

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is that they be ignored; thus, in this case, we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities. We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family, instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets. We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset, provided that there are ways to logically


restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
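The following is a minimal sketch, in Python, of how the elementary entities just defined (data types, attributes, schemata, recordsets) could be represented as plain data structures; the class and field names are illustrative and are not part of the formal model.

# Elementary entities of the metamodel as simple dataclasses (illustrative only).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class DataType:
    name: str                      # e.g. "Integer"; dom(T) is left implicit here

@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: DataType            # the attribute's domain is a subset of dom(data_type)

# A schema is simply a finite list of attributes.
Schema = List[Attribute]

@dataclass
class RecordSet:
    name: str
    schema: Schema
    extension: List[Tuple] = field(default_factory=list)  # finite set of records

integer = DataType("Integer")
pkey, qty = Attribute("PKEY", integer), Attribute("QTY", integer)
ds_ps1 = RecordSet("DS.PS1", [pkey, qty], [(17, 5), (42, 3)])
print(len(ds_ps1.extension))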

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs, and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements (an illustrative structural sketch follows the list):

Name. A unique identifier for the activity.

Input schemata. A finite set of one or more input schemata that receive data from the data providers of the activity.

Output schema. A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

Rejections schema. A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

Parameter list. A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

Output operational semantics. An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

Rejection operational semantics. An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
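As announced above, the following is a minimal structural sketch, in Python, of an elementary activity with the elements just listed; the LDL statements are kept as opaque strings, the schemata are plain attribute-name lists, and all concrete values are illustrative, not the actual ARKTOS II data structures.

# A structural sketch of an elementary activity (illustrative only).
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class ElementaryActivity:
    name: str
    input_schemata: List[List[str]]          # one or more input schemata
    output_schema: List[str]                 # placeholder for rows that pass the check
    rejection_schema: List[str]              # placeholder for rows that do not pass
    parameters: Dict[str, object] = field(default_factory=dict)  # name -> term/schema/function
    output_semantics: str = ""               # LDL statement for the output
    rejection_semantics: str = ""            # LDL statement for the rejections (default: complement)

sk1 = ElementaryActivity(
    name="SK1",
    input_schemata=[["PKEY", "DATE", "QTY", "COST", "SOURCE"]],
    output_schema=["PKEY", "DATE", "QTY", "COST", "SOURCE", "SKEY"],
    rejection_schema=[],
    parameters={"PKEY": "IN.PKEY", "LPKEY": "LOOKUP.PKEY", "LSKEY": "LOOKUP.SKEY"},
)
print(sk1.name, len(sk1.input_schemata))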

There are two issues that we would like to elaborate on here.

NULL schemata. Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store nor be propagated to another activity; they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships. These relationships involve attributes and parameters, and relate them to the respective activity, recordset or function to which they belong.

Instance-of relationships. These relationships are defined among a data/function type and its instances.

Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.

Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.

Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the Architecture Graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Fig. 4. Instance-of relationships of the architecture graph.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Fig. 5 depicts a part of the motivating example.

In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata, in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name. A unique identifier for the provider relationship.

Mapping. An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attributes. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider (or a constant) which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute (a small wiring sketch for the running example follows).
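As a hedged illustration of the three groups of provider relationships listed above, the following Python fragment wires the DS.PS1 → SK1 → DW.PARTSUPP part of the running example as a set of (provider, consumer) attribute pairs; the encoding is only a sketch, not the representation used by ARKTOS II.

# Illustrative wiring of the three groups of provider relationships.
provider_edges = set()

# 1. Output schema of the data provider -> input schema of the activity.
for a in ["PKEY", "DATE", "QTY", "COST", "SOURCE"]:
    provider_edges.add((f"DS.PS1.{a}", f"SK1.IN.{a}"))

# 2. Input schema of the activity -> its output schema (internal; in the model
#    these mappings are derivable from the activity's LDL statement).
for a in ["PKEY", "DATE", "QTY", "COST", "SOURCE"]:
    provider_edges.add((f"SK1.IN.{a}", f"SK1.OUT.{a}"))

# 3. Output schema of the activity -> input schema of its data consumer.
for a in ["PKEY", "DATE", "QTY", "COST", "SOURCE", "SKEY"]:
    provider_edges.add((f"SK1.OUT.{a}", f"DW.PARTSUPP.{a}"))

print(len(provider_edges))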


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values of the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6 we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
    ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
    A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
    addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
    lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
    A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
    addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
    DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
    SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability we do not replace the A_in attribute names with the activity name, i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
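The rule above can be read operationally: whenever some source term feeds a parameter of the activity and some parameter of the activity feeds an output attribute, a derived provider edge from the source to that output attribute is implied. The following Python fragment is a minimal sketch of that derivation over plain edge sets; node names follow the running example, and the representation is illustrative rather than the tool's actual one.

# Derive provider edges implied by regulator relationships (illustrative only).
def derived_provider_edges(regulator_edges, parameters, output_attrs):
    """regulator_edges: set of (from_node, to_node) pairs for one activity."""
    into_params = [(src, p) for (src, p) in regulator_edges if p in parameters]
    from_params = [(p, tgt) for (p, tgt) in regulator_edges
                   if p in parameters and tgt in output_attrs]
    return {(src, tgt) for (src, _x) in into_params for (_y, tgt) in from_params}

params = {"PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"}
outputs = {"SK1.OUT.SKEY"}
rr = {("SK1.IN.PKEY", "PKEY"), ("LOOKUP.SKEY", "LSKEY"), ("LSKEY", "SK1.OUT.SKEY")}
print(derived_provider_edges(rr, params, outputs))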


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationships.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend on the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name. A unique identifier for the scenario.

Activities. A finite list of activities. Note that, by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Entity                              Model-specific    Scenario-specific
Built-in:
  Data Types                        D^I               D
  Function Types                    F^I               F
  Constants                         C^I               C
User-provided:
  Attributes                        Ω^I               Ω
  Functions                         Φ^I               Φ
  Schemata                          S^I               S
  RecordSets                        RS^I              RS
  Activities                        A^I               A
  Provider Relationships            Pr^I              Pr
  Part-Of Relationships             Po^I              Po
  Instance-Of Relationships         Io^I              Io
  Regulator Relationships           Rr^I              Rr
  Derived Provider Relationships    Dr^I              Dr

Fig. 8. Formal definition of domains and notation.

Recordsets. A finite set of recordsets.

Targets. A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships. A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario (a small checking sketch follows the list):

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.
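As announced above, the following Python fragment sketches how two of these constraints could be checked over a scenario graph given as plain sets and dictionaries; the data layout and names are assumptions made for the example only, not the representation used by the tool.

# Checking two of the integrity constraints (illustrative sketch).
def check_constraints(attributes, part_of, provider_edges, input_attrs):
    """attributes: all attribute nodes; part_of: attribute -> container;
    provider_edges: set of (provider, consumer) pairs; input_attrs: attributes
    belonging to input schemata of activities."""
    errors = []
    # Static constraint: every weak entity must have a container object.
    for a in attributes:
        if a not in part_of:
            errors.append(f"{a} has no container (part-of) relationship")
    # Data flow constraint: every input attribute must have a provider.
    consumers = {c for (_p, c) in provider_edges}
    for a in input_attrs:
        if a not in consumers:
            errors.append(f"{a} has no data provider")
    return errors

attrs = {"SK1.IN.PKEY", "SK1.OUT.SKEY"}
part_of = {"SK1.IN.PKEY": "SK1", "SK1.OUT.SKEY": "SK1"}
providers = {("DS.PS1.PKEY", "SK1.IN.PKEY")}
print(check_constraints(attrs, part_of, providers, {"SK1.IN.PKEY"}))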

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,


Elementary Activity, RecordSet and Relationship. Thus, as one can see in the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update detection (Δ_UPD).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).
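To make the two-step instantiation concrete, the following is a minimal sketch, in Python, of the specialization mechanism just described: template classes subclass the generic metamodel class, and a schema-layer entity instantiates a template while remaining an instance of the metamodel class. The Python class names follow Fig. 9; rendering the layers as Python classes is only an analogy, not part of the model.

# Metamodel layer: the generic class.
class ElementaryActivity:
    pass

# Template layer: ETL-specific specializations (subclasses, i.e. IsA links).
class NotNull(ElementaryActivity):
    pass

class DomainMismatch(ElementaryActivity):
    pass

class SKAssignment(ElementaryActivity):
    pass

# Schema layer: a concrete activity instantiates a template...
sk1 = SKAssignment()
# ...and is therefore simultaneously an instance of the generic metamodel class.
print(isinstance(sk1, ElementaryActivity))   # True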


Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see in the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities into subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses, whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

frequent elements of ETL scenarios Moreoverapart from this lsquolsquobuilt-inrsquorsquo ETL-specific extensionof the generic metamodel if the designer decidesthat several lsquopatternsrsquo not included in the paletteof the template layer occur repeatedly in his datawarehousing projects he can easily fit them intothe customizable template layer through a specia-lization mechanism

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name: A unique identifier for the template activity.

Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.
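To illustrate (this fragment is our own, purely for exposition, and is not one of the built-in templates), consider a parameter variable @FIELD and a marked iterator appearance inside an attribute name:

Template fragment:    @FIELD = A_IN1_$i$
After instantiation:  COST = A_IN1_2

assuming that @FIELD is given the value COST and that the iterator i currently equals 2.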

Functions. We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] {<loop body>}

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
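As a worked illustration (our own example, assuming arityOf(a_in) evaluates to 3), the loop expression

[i<arityOf(a_in)] A_IN_$i$, [i=arityOf(a_in)] A_IN_$i$

is reproduced once per iteration and yields the attribute list

A_IN_1, A_IN_2, A_IN_3

where the first branch produces the comma-terminated elements and the second branch the last element, so that no trailing comma is emitted.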

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit] name_theme_$iterator$, [iterator=maxLimit] name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$) <-
   a_in1([i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$),
   expr([i<arityOf(@PARAM)] @PARAM[$i$], [i=arityOf(@PARAM)] @PARAM[$i$]),
   [i<arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$.

As already mentioned at the syntax for loops, the expression

[i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

DEFINE PARAM_SCHEMA AS
  [i<arityOf(@PARAM)] @PARAM[$i$], [i=arityOf(@PARAM)] @PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$

Then the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING.
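To see what this template yields in practice, consider a hypothetical instantiation (our own illustration, not taken from the paper's figures) for an activity named sel1, whose input and output schemata have two attributes and whose parameter list @PARAM holds a single attribute participating in the selection condition. After macro expansion, loop production, variable instantiation and keyword renaming, the generated LDL rule would look roughly as follows:

sel1_out(SEL1_OUT_1, SEL1_OUT_2) <-
   sel1_in1(SEL1_IN1_1, SEL1_IN1_2),
   expr(SEL1_IN1_1),
   SEL1_OUT_1 = SEL1_IN1_1,
   SEL1_OUT_2 = SEL1_IN1_2.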

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
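A small worked illustration of why this order matters (our own example; the parameter name @N and the bound 3 are assumptions) is the following fragment, traced through steps (2) and (3):

Template fragment:               [i<@N] A_OUT_$i$, [i=@N] A_OUT_$i$
After step (2), with @N = 3:     [i<3] A_OUT_$i$, [i=3] A_OUT_$i$
After step (3), loop production: A_OUT_1, A_OUT_2, A_OUT_3

Had the loops been expanded before the boundary variable @N was instantiated, the upper bound of the iteration would still be unknown at production time.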

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).
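Fig. 12 itself is not reproduced in this text; a rough sketch of the registered template, consistent with the description above but with its exact layout in the figure remaining an assumption, is:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   @FUNCTION(FUNCTION_INPUT, OUTFIELD),
   DEFAULT_MAPPING.

where OUTPUT_SCHEMA expands to the input attributes followed by OUTFIELD, and FUNCTION_INPUT expands to the attributes listed in @PARAM[].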

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed.


Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
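Putting the pieces together, a sketch of the kind of rule this process produces for the example activity (our reading of the description above, not a verbatim copy of Fig. 12; in particular, the exact names of the output attributes are assumptions) is:

fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
   myFunc_in(COST_IN, PRICE_IN),
   subtract(COST_IN, PRICE_IN, PROFIT),
   COST_OUT = COST_IN,
   PRICE_OUT = PRICE_IN,
   PROFIT_OUT = PROFIT.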

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of a LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from

Fig. 13. Simple template example: domain mismatch.

the input to the output is hardcoded in the template.
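Since Fig. 13 is not reproduced here, a minimal sketch of what the instantiated activity DM1 might look like, assuming the four-attribute schemata described above and a closed range check (the exact comparison operators used in the figure are an assumption), is:

dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
   dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
   DM1_IN_3 >= 5,
   DM1_IN_3 <= 10,
   DM1_OUT_1 = DM1_IN_1,
   DM1_OUT_2 = DM1_IN_2,
   DM1_OUT_3 = DM1_IN_3,
   DM1_OUT_4 = DM1_IN_4.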

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of a LDL program with intermediate predicates in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of the attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are

Fig. 14. Program-based template example: Difference activity.


given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so that we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
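Fig. 14 is likewise not reproduced in this text. A minimal sketch of the kind of program such an instantiation produces, assuming two-attribute inputs whose first attribute plays the role of the primary key (the predicate and attribute names below are our assumptions, not the exact content of the figure), is:

dF1_out(DF1_OUT_1, DF1_OUT_2) <-
   dF1_in1(DF1_IN1_1, DF1_IN1_2),
   ~semijoin(DF1_IN1_1, DF1_IN1_2),
   DF1_OUT_1 = DF1_IN1_1,
   DF1_OUT_2 = DF1_IN1_2.

semijoin(DF1_IN1_1, DF1_IN1_2) <-
   dF1_in1(DF1_IN1_1, DF1_IN1_2),
   dF1_in2(DF1_IN2_1, DF1_IN2_2),
   DF1_IN1_1 = DF1_IN2_1.

The intermediate semijoin predicate captures exactly the tuples of the first input that have a match on the primary-key attribute in the second input; negating it in the output rule keeps only the newly inserted rows.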

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

Fig. 15. The motivating example in ARKTOS II.

All the details defining an activity can be captured through forms and/or simple point and click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed from the

Fig. 16. A detailed zoom-in view of the motivating example.

model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types that are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3], and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: A GUI used to interactively design and execute DTS packages.

DTS Export and Import Wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS Programming Interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages) in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the field of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, for the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

restructure it into a flat typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1,…,Ak] for a certain recordset, its extension is a mapping S = [A1,…,Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1,…,xk], xi ∈ dom(Ai). In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions. We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
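As a hedged illustration of this distinction (the names below are ours, chosen to match the subtract function that reappears in the instantiation example of Section 3.2.2): a function type fixes a signature, while a function binds concrete terms to it.

    function type:     subtract(Number, Number) -> Number
    function instance: subtract(COST_IN, PRICE_IN, PROFIT)

In the LDL-style predicate form of the second line, COST_IN and PRICE_IN act as input parameters and the last position, PROFIT, plays the role of the return-value parameter.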

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed from a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity in the form of an LDL statement. Using LDL we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements (a small illustrative sketch follows the list):

Name: A unique identifier for the activity.

Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.

Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.

Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.

Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.

Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.

Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
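To make these elements concrete, the following is a minimal, hedged sketch of a hypothetical selection activity over a three-attribute input (all names and the choice of condition are ours; the LDL style follows the conventions used throughout the paper, and the rejection rule is written explicitly as the complement of the output rule):

    Name:               SEL1
    Input schema:       sel1_in(PKEY, QTY, COST)
    Output schema:      sel1_out(PKEY, QTY, COST)
    Rejections schema:  sel1_rej(PKEY, QTY, COST)
    Parameter list:     {(FIELD, QTY), (LIMIT, 0)}

    Output operational semantics:
    sel1_out(PKEY_OUT, QTY_OUT, COST_OUT) <-
        sel1_in(PKEY, QTY, COST), QTY > 0,
        PKEY_OUT = PKEY, QTY_OUT = QTY, COST_OUT = COST.

    Rejection operational semantics:
    sel1_rej(PKEY_OUT, QTY_OUT, COST_OUT) <-
        sel1_in(PKEY, QTY, COST), QTY <= 0,
        PKEY_OUT = PKEY, QTY_OUT = QTY, COST_OUT = COST.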

There are two issues that we would like to elaborate on here:

NULL schemata. Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding we list these types of relationships, along with the related terminology that we will use in this paper (a short illustrative sketch follows the list). The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships: These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.

Instance-of relationships: These relationships are defined among a data/function type and its instances.

Provider relationships: These are relationships that involve attributes with a provider–consumer relationship.

Regulator relationships: These relationships are defined among the parameters of activities and the terms that populate these activities.

Derived provider relationships: A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.
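One way to picture these five kinds of edges is as plain facts over the nodes of the graph, one predicate per relationship kind. The sketch below is ours; the node names are shorthands loosely modeled on the motivating example and are only meant to fix intuition, not to prescribe a storage format:

    part_of(ds_ps1_out_pkey, ds_ps1).
    instance_of(ds_ps1_out_pkey, integer).
    provider(ds_ps1_out_pkey, sk1_in_pkey).
    regulator(sk1_in_pkey, sk1_param_pkey).
    derived_provider(lookup_out_skey, sk1_out_skey).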

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges, so that the figure does not become too crowded.

Fig. 4. Instance-of relationships of the architecture graph.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets), along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata, in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer


value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed, in order to derive a new attribute through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store, or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name: A unique identifier for the provider relationship.

Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider (or a constant) which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
    ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
    A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
    addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
    lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
    A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
    addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
    DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST, SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability, we do not replace the A in attribute names with the name of the respective predicate, i.e., A_OUT_PKEY should be DS_PS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
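Read as a rule over the edges of the architecture graph, the above condition can be sketched in Datalog/LDL style as follows (the predicate names are ours: rr holds the regulator relationships, param relates an activity to its parameters and output_attribute marks the output attributes of an activity):

    dr(SOURCE, TARGET) <-
        rr(SOURCE, X), param(A, X),
        rr(Y, TARGET), param(A, Y),
        output_attribute(A, TARGET).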


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name: A unique identifier for the scenario.

Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation.

  Entity                            Model-specific   Scenario-specific
  Built-in:
    Data Types                      D^I              D
    Function Types                  F^I              F
    Constants                       C^I              C
  User-provided:
    Attributes                      Ω^I              Ω
    Functions                       Φ^I              Φ
    Schemata                        S^I              S
    RecordSets                      RS^I             RS
    Activities                      A^I              A
    Provider Relationships          Pr^I             Pr
    Part-Of Relationships           Po^I             Po
    Instance-Of Relationships       Io^I             Io
    Regulator Relationships         Rr^I             Rr
    Derived Provider Relationships  Dr^I             Dr

Recordsets: A finite set of recordsets.

Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E), defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines, during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario (a sketch of how such constraints could be checked declaratively follows the list):

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

- All the attributes of the input schema(ta) of an activity should have a provider.

- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

- All the attributes of the schemata of the target recordsets should have a data provider.
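Treating the architecture graph as a set of facts, such constraints lend themselves to simple declarative checks. The sketch below is ours, with hypothetical predicate names and with the tilde used for negation in the LDL++ style; it flags input attributes that violate the first data flow constraint:

    has_provider(X) <- provider(S, X).
    missing_provider(X) <- input_attribute(X, A), ~has_provider(X).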

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our

framework. The lower layer of Fig. 9, namely schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).


Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name: A unique identifier for the template activity.

Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mapping at instantiation time, etc.

Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
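As a small worked example (ours): for an output schema of arity 3, the expression below is expanded in three iterations, with the i<arityOf(a_out) branch covering i=1,2 and the i=arityOf(a_out) branch covering the last attribute, so that no trailing comma is produced:

    [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

    expands to:   A_OUT_1, A_OUT_2, A_OUT_3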

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to

dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Fig. 11. Keywords for templates.

Keyword: a_out, a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out, difference3_in

Keyword: A_OUT, A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT, DIFFERENCE3_IN

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

    name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

    [iterator<maxLimit]name_theme_$iterator$,
    [iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$,
      [i=arityOf(a_out)]A_OUT_$i$) <-
    a_in1([i<arityOf(a_in1)]A_IN1_$i$,
          [i=arityOf(a_in1)]A_IN1_$i$),
    expr([i<arityOf(@PARAM)]@PARAM[$i$],
         [i=arityOf(@PARAM)]@PARAM[$i$]),
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
    [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$
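For instance (our own hedged instantiation, not one taken from the paper's figures): if the schemata are fixed to two attributes at instantiation time and the user binds the two parameters to the attribute A_IN1_2 and the constant 0, the template above unfolds to an ordinary LDL rule of the following shape, whose predicate names are subsequently renamed by the keyword mechanism described below:

    a_out(A_OUT_1, A_OUT_2) <-
        a_in1(A_IN1_1, A_IN1_2),
        expr(A_IN1_2, 0),
        A_OUT_1 = A_IN1_1,
        A_OUT_2 = A_IN1_2.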

As already mentioned at the syntax for loops, the expression

    [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
    [i<arityOf(a_in1)]A_IN1_$i$,
    [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
    [i<arityOf(a_out)]A_OUT_$i$,
    [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
    [i<arityOf(@PARAM)]@PARAM[$i$],
    [i=arityOf(@PARAM)]@PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
    [i<arityOf(a_out)]A_OUT_$i$ = A_IN1_$i$,
    [i=arityOf(a_out)]A_OUT_$i$ = A_IN1_$i$

Then the template definition is as follows

a_out(OUTPUT_SCHEMA) <-
    a_in1(INPUT_SCHEMA),
    expr(PARAM_SCHEMA),
    DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.

2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.

3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.

4. All the rest parameter variables are instantiated.

5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

Fig. 12. Instantiation procedure.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and the @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, @OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented, after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
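Since Fig. 12 itself is not reproduced in this text, the following is a hedged reconstruction, from the description above, of roughly what its final keyword renaming row contains; only fa12_out, myFunc_in, subtract, COST_IN, PRICE_IN and PROFIT are named in the text, the remaining attribute names are ours:

    fa12_out(PKEY_OUT, COST_OUT, PRICE_OUT, PROFIT) <-
        myFunc_in(PKEY_IN, COST_IN, PRICE_IN),
        subtract(COST_IN, PRICE_IN, PROFIT),
        PKEY_OUT = PKEY_IN,
        COST_OUT = COST_IN,
        PRICE_OUT = PRICE_IN.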

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

    OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of

the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from


the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
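Since Fig. 13 is not reproduced in this text, the following is a hedged sketch, based on the description above, of what its fourth row (the fully instantiated activity DM1) plausibly looks like, taking the range bounds as inclusive:

    dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
        dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
        DM1_IN_3 >= 5, DM1_IN_3 <= 10,
        DM1_OUT_1 = DM1_IN_1, DM1_OUT_2 = DM1_IN_2,
        DM1_OUT_3 = DM1_IN_3, DM1_OUT_4 = DM1_IN_4.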

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation, to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are

ARTICLE IN PRESS

Fig 14 Program-based template example Difference activity

P Vassiliadis et al Information Systems 30 (2005) 492ndash525514

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 515

given by the following calculus-like definitionDA1yAkS(R S)frac14 xAR|(yAS x[A1]frac14 y[A1]4y4x[Ak]frac14 y[Ak]In Fig 14 we can see the template of the

Difference activity and a resulting instantiationfor an activity named dF1 As we can see we needthe semijoin predicate so we can exclude alltuples that satisfy it Note also that we have twodifferent inputs which are denoted as distinct byadding a number at the end of the keyword a_in
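For illustration, a sketch of how the two predicates of such an instantiation could be phrased in LDL follows; the three-attribute schemata and the predicate names are assumptions made for the example, not the exact content of Fig. 14.

   % a new-snapshot tuple satisfies the semijoin if an old-snapshot tuple
   % with the same PKEY exists
   df1_semijoin(A_IN1_PKEY) <-
      df1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
      df1_in2(A_IN2_PKEY, A_IN2_QTY, A_IN2_COST),
      A_IN1_PKEY = A_IN2_PKEY.

   % the output keeps the tuples of the first input that do NOT satisfy the
   % semijoin predicate, i.e., the newly inserted rows
   df1_out(A_OUT_PKEY, A_OUT_QTY, A_OUT_COST) <-
      df1_in1(A_IN1_PKEY, A_IN1_QTY, A_IN1_COST),
      ~df1_semijoin(A_IN1_PKEY),
      A_OUT_PKEY = A_IN1_PKEY, A_OUT_QTY = A_IN1_QTY, A_OUT_COST = A_IN1_COST.

The intermediate semijoin predicate is exactly what makes the negation readable: the second rule negates it as a whole instead of spelling out the negated conjunction.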

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools, and we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. But, we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit to aim at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, that extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: a GUI used to interactively design and execute DTS packages.
DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.
DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).
Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24] where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel

system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.

We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality in the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35-37], where the authors are primarily concerned with providing a metamodel for workflows, (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed, and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the WfMC standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, for the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular, on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible that we apply this modeling for the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/hour and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities. Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and therefore, outside the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, Power Center, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52-61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381-390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79-94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262-279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326-339, Bratislava, Slovakia, September 8-11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431-445, Stockholm, Sweden, June 5-9, 2000.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14-21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46-57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247-262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.



(involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.

Language issues. Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.
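To give a flavor of the language, two tiny LDL-style rules are sketched below; the relation names (src, edge, reaches) are generic examples of our own and do not belong to the scenarios of this paper. The first rule filters rows of a hypothetical source relation, while the recursive pair computes reachability, something that plain relational algebra cannot express.

   % keep only the rows of a (hypothetical) source relation with positive cost
   clean_out(PKEY, QTY, COST) <- src(PKEY, QTY, COST), COST > 0.

   % recursion: transitive reachability over a (hypothetical) edge relation
   reaches(X, Y) <- edge(X, Y).
   reaches(X, Y) <- edge(X, Z), reaches(Z, Y).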

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
Instance-of relationships. These relationships are defined among a data/function type and its instances.
Provider relationships. These are relationships that involve attributes with a provider-consumer relationship.
Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the Architecture Graph; the knowledgeable reader is referred to Section 2.5, where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

Fig. 4. Instance-of relationships of the architecture graph.

Data types and instance-of relationships. To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges, so that the figure does not become too crowded.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5 depicts a part of the motivating example.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

Name: A unique identifier for the provider relationship.
Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider or a constant, which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6 we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated), and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

   addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
      ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
      A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
      A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

   addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
      addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
      lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
      A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
      A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

   dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
      addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
      DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
      SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability we do not replace the A_IN/A_OUT attribute names with the activity name, i.e., A_OUT_PKEY should be DS_PS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
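Since derived provider relationships are deducible, the rule itself can be sketched in the same Datalog-like style over the edges of the graph; the predicate names used here (rr for regulator edges, paramOf and outAttr for part-of information, dr for the deduced edge) are our own shorthand for this illustration, not identifiers used by the tool.

   % a derived provider edge from SOURCE to TARGET exists whenever SOURCE
   % populates a parameter X of activity A and a parameter Y of the same
   % activity populates the output attribute TARGET
   dr(SOURCE, TARGET) <-
      rr(SOURCE, X), paramOf(A, X),
      paramOf(A, Y), rr(Y, TARGET), outAttr(A, TARGET).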


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), or (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

- Name: a unique identifier for the scenario.

- Activities: a finite list of activities. Note that, by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation:

  Entity                            Model-specific   Scenario-specific
  Built-in:
    Data Types                      D^I              D
    Function Types                  F^I              F
    Constants                       C^I              C
  User-provided:
    Attributes                      Ω^I              Ω
    Functions                       Φ^I              Φ
    Schemata                        S^I              S
    RecordSets                      RS^I             RS
    Activities                      A^I              A
    Provider Relationships          Pr^I             Pr
    Part-Of Relationships           Po^I             Po
    Instance-Of Relationships       Io^I             Io
    Regulator Relationships         Rr^I             Rr
    Derived Provider Relationships  Dr^I             Dr

- Recordsets: a finite set of recordsets.

- Targets: a special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

- Provider relationships: a finite list of provider relationships among activities and recordsets of the scenario.
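As a rough illustration of this definition, a scenario could be represented by a record such as the following (a hypothetical Python sketch of ours; the field names and the simplified flow are assumptions made for illustration, not the tool's internal representation).

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Scenario:
    # A scenario, following the elements listed above.
    name: str                                              # unique identifier
    activities: List[str] = field(default_factory=list)    # ordered list -> total execution order
    recordsets: Set[str] = field(default_factory=set)
    targets: Set[str] = field(default_factory=set)          # subset of the recordsets
    provider_relationships: List[Tuple[str, str]] = field(default_factory=list)  # (provider, consumer)

    def execution_priority(self, activity: str) -> int:
        # The position in the activity list acts as the discrete execution priority.
        return self.activities.index(activity) + 1

# Toy usage, loosely following the motivating example (flow heavily simplified):
s = Scenario(name="PopulatePartSupp",
             activities=["NN", "DM1", "SK1"],
             recordsets={"DS.PS1", "LOOKUP", "DW.PARTSUPP"},
             targets={"DW.PARTSUPP"},
             provider_relationships=[("DS.PS1", "SK1"), ("SK1", "DW.PARTSUPP")])
print(s.execution_priority("SK1"))   # 3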

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E), defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines, during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

- All the attributes of the input schema(ta) of an activity should have a provider.


- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

- All the attributes of the schemata of the target recordsets should have a data provider.
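The following sketch (hypothetical Python, written by us for illustration) shows one way the architecture graph could be represented as typed nodes and typed edges, together with checks for the constraints just listed; the node kinds, edge encodings and example names are assumptions of the sketch, not the paper's notation.

# Hypothetical sketch: the architecture graph as typed nodes and typed edges,
# with checks for the integrity constraints listed above.
from collections import defaultdict

class ArchitectureGraph:
    def __init__(self):
        self.nodes = {}                    # name -> kind ('attribute', 'parameter', 'activity', 'recordset', ...)
        self.edges = defaultdict(set)      # edge kind ('Pr', 'Po', 'Io', 'Rr', 'Dr') -> set of (from, to)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, kind, a, b):
        self.edges[kind].add((a, b))

    def check_static_constraints(self):
        errors = []
        # every weak entity (attribute or parameter) must take part in a part-of relationship
        contained = {a for (_, a) in self.edges['Po']} | {a for (a, _) in self.edges['Po']}
        for name, kind in self.nodes.items():
            if kind in ('attribute', 'parameter') and name not in contained:
                errors.append(f"{name} has no container (part-of) relationship")
        return errors

    def check_data_flow_constraints(self, input_attributes, target_attributes):
        errors = []
        provided = {b for (_, b) in self.edges['Pr']}
        for a in input_attributes:         # every input attribute needs a provider
            if a not in provided:
                errors.append(f"input attribute {a} has no provider")
        for a in target_attributes:        # every target recordset attribute needs a data provider
            if a not in provided:
                errors.append(f"target attribute {a} has no data provider")
        return errors

# Minimal usage with illustrative names:
g = ArchitectureGraph()
g.add_node("SK1.IN.PKEY", "attribute")
g.add_node("SK1", "activity")
g.add_edge('Po', "SK1", "SK1.IN.PKEY")
g.add_edge('Pr', "DS.PS1.PKEY", "SK1.IN.PKEY")
print(g.check_static_constraints(), g.check_data_flow_constraints({"SK1.IN.PKEY"}, set()))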

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and by appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment (layers shown: metamodel layer, template layer, schema layer).

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship. Thus, as one can see in the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:

  Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).

  Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).

  Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).

  Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

  File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses depicted in Fig. 9. Relationships do not escape this rule either: for instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or an RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).
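The layering just described can be pictured as ordinary subclassing plus instantiation; the sketch below (hypothetical Python of ours, not the ARKTOS II code) mirrors the IsA links of the template layer and the InstanceOf links of the schema layer for a few of the classes named above.

# Hypothetical sketch: metamodel classes, template-layer subclasses (IsA),
# and schema-layer objects as their instances (InstanceOf).

class ElementaryActivity:                 # metamodel layer
    def __init__(self, name):
        self.name = name

class RecordSet:                          # metamodel layer
    def __init__(self, name):
        self.name = name

# Template layer: specializations of the generic metamodel classes.
class NotNull(ElementaryActivity): pass
class DomainMismatch(ElementaryActivity): pass
class SKAssignment(ElementaryActivity): pass
class SourceTable(RecordSet): pass
class FactTable(RecordSet): pass

# Schema layer: a concrete scenario instantiates the template classes.
s1_partsupp = SourceTable("S1.PARTSUPP")
dw_partsupp = FactTable("DW.PARTSUPP")
nn, dm1, sk1 = NotNull("NN"), DomainMismatch("DM1"), SKAssignment("SK1")

# Every schema-layer entity is (transitively) an instance of the metamodel classes:
print(isinstance(sk1, ElementaryActivity), isinstance(dw_partsupp, RecordSet))   # True True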



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see in the top part of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the checks for null values, primary or foreign key violations, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: a unique identifier for the template activity.

- Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, to create dynamic mappings at instantiation time, etc.

- Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

- Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
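Put together, a registered template can be viewed as a small record; the sketch below (hypothetical Python of ours) simply mirrors the four elements listed above. The abbreviated expression and the parameter names are illustrative placeholders, loosely anticipating the Domain Mismatch example discussed later, and are not the registered LDL text of the tool.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TemplateActivity:
    name: str                     # unique identifier of the template
    parameters: List[str]         # names acting as regulators of the semantics
    expression: str               # declarative (LDL) statement written in the template notation
    mapping: Dict[str, str]       # default input-to-output attribute bindings

# Illustrative registration of a domain-mismatch-like template (expression abbreviated):
domain_mismatch = TemplateActivity(
    name="DomainMismatch",
    parameters=["@FIELD", "@Xlow", "@Xhigh"],
    expression="a_out(...) <- a_in1(...), @Xlow <= @FIELD, @FIELD <= @Xhigh, ...",
    mapping={"A_OUT_$i$": "A_IN1_$i$"},
)
print(domain_mismatch.name, domain_mismatch.parameters)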

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism and the particularities of the template taxonomy.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>]{<loop body>}

where <simple constraint> has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -; valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema, by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation time, with all of them having unique names in the LDL program scope.


Fig. 11. Keywords for templates.

  Keyword: a_out / a_in. Usage: a unique name for the output/input schema of the activity; the predicate that is produced when the template is instantiated has the form <unique_pred_name>_out (or _in, respectively). Example: difference3_out / difference3_in.

  Keyword: A_OUT / A_IN. Usage: used for constructing the names of the a_out/a_in attributes; the names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively). Example: DIFFERENCE3_OUT / DIFFERENCE3_IN.


For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.
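A keyword-renaming step of this kind can be emulated with plain string substitution; the sketch below (hypothetical Python of ours, not the tool's implementation) renames the standard keywords for an activity called dm1, reproducing the dm1_in1/dm1_in2 effect described above.

# Hypothetical sketch: renaming the standard template keywords for one activity.

def rename_keywords(ldl_text: str, activity_name: str) -> str:
    # a_in / a_out become <activity>_in / <activity>_out,
    # A_IN / A_OUT become <ACTIVITY>_IN / <ACTIVITY>_OUT,
    # even when the keyword is part of a longer token (e.g. a_in1, A_IN1_3).
    replacements = [("a_out", f"{activity_name.lower()}_out"),
                    ("a_in",  f"{activity_name.lower()}_in"),
                    ("A_OUT", f"{activity_name.upper()}_OUT"),
                    ("A_IN",  f"{activity_name.upper()}_IN")]
    for keyword, new in replacements:
        ldl_text = ldl_text.replace(keyword, new)
    return ldl_text

rule = "a_out(A_OUT_1) <- a_in1(A_IN1_1), a_in2(A_IN2_1), A_OUT_1=A_IN1_1"
print(rename_keywords(rule, "dm1"))
# dm1_out(DM1_OUT_1) <- dm1_in1(DM1_IN1_1), dm1_in2(DM1_IN2_1), DM1_OUT_1=DM1_IN1_1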

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro mechanism that facilitates attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]{name_theme$iterator$,}
[iterator=maxLimit]{name_theme$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}) <-
  a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}),
  expr([i<arityOf(@PARAM)]{@PARAM[$i$],}[i=arityOf(@PARAM)]{@PARAM[$i$]}),
  [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]{A_IN1_$i$,}
  [i=arityOf(a_in1)]{A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]{A_OUT_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
  [i<arityOf(@PARAM)]{@PARAM[$i$],}
  [i=arityOf(@PARAM)]{@PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
  a_in1(INPUT_SCHEMA),
  expr(PARAM_SCHEMA),
  DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed, by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.
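As an illustration of this order, the following toy pipeline (hypothetical Python of ours, not the ARKTOS II code generator) applies the steps to a fragment of the selection template of Section 3.2.1, assuming a two-attribute input/output schema; the regular expression and helper names are assumptions of the sketch, and steps 2 and 3 are folded into one expansion pass.

# Hypothetical toy implementation of the five instantiation steps (macro expansion,
# loop-boundary evaluation, loop production, parameter instantiation, keyword renaming).
import re

def instantiate(template, macros, arities, params, activity):
    # 1. macro expansion
    for name, body in macros.items():
        template = template.replace(name, body)
    # 2 & 3. evaluate loop boundaries, then produce the loop bodies
    def expand_loop(match):
        var, op, bound, body = match.groups()
        limit = arities.get(bound[len("arityOf("):-1], 0) if bound.startswith("arityOf(") else int(bound)
        values = range(1, limit) if op == "<" else [limit]
        return "".join(body.replace(f"${var}$", str(v)) for v in values)
    loop = re.compile(r"\[(\w+)(<|=)([^\]]+)\]\{([^{}]*)\}")
    while loop.search(template):
        template = loop.sub(expand_loop, template)
    # 4. remaining parameter variables
    for name, value in params.items():
        template = template.replace(name, value)
    # 5. keyword renaming
    for kw, new in [("a_out", f"{activity}_out"), ("a_in", f"{activity}_in"),
                    ("A_OUT", f"{activity.upper()}_OUT"), ("A_IN", f"{activity.upper()}_IN")]:
        template = template.replace(kw, new)
    return template

macros = {"OUTPUT_SCHEMA": "[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}"}
template = "a_out(OUTPUT_SCHEMA) <- a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$})"
print(instantiate(template, macros, {"a_out": 2, "a_in1": 2}, {}, "sel1"))
# sel1_out(SEL1_OUT_1,SEL1_OUT_2) <- sel1_in1(SEL1_IN1_1,SEL1_IN1_2)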

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limits of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
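The effect of the instantiated DM1 activity can be mimicked with an ordinary filter; the sketch below (hypothetical Python of ours, not the generated LDL) keeps the rows whose checked attribute lies within the range [5, 10] and propagates the input attributes unchanged, mirroring the hard-coded mapping of the template. The sample data are invented for illustration.

# Hypothetical sketch of the semantics of the instantiated DM1 activity:
# rows whose checked field lies inside [low, high] pass to the output unchanged.

def domain_mismatch(rows, field_index, low, high):
    out, rejected = [], []
    for row in rows:
        (out if low <= row[field_index] <= high else rejected).append(row)
    return out, rejected

dm1_in = [(101, "2004-01-10", 7, 12.5), (102, "2004-01-11", 42, 9.0)]
dm1_out, dm1_rejected = domain_mismatch(dm1_in, field_index=2, low=5, high=10)
print(dm1_out)        # only the first row passes the range check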

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q∧p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that, during the extraction process, we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are

Fig. 14. Program-based template example: Difference activity.

given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so that we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
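The same semantics can also be phrased procedurally; the sketch below (hypothetical Python of ours) computes Δ<PK>(R_new, R), i.e., the tuples of the new snapshot that have no counterpart in the old snapshot on the key attributes. The sample tuples are invented for illustration.

# Hypothetical sketch of the difference operator used by the Difference template:
# keep the tuples of r_new that have no counterpart in r_old on the key attributes.

def snapshot_difference(r_new, r_old, key_indexes):
    def key(row):
        return tuple(row[i] for i in key_indexes)
    old_keys = {key(row) for row in r_old}
    return [row for row in r_new if key(row) not in old_keys]

r_old = [(1, "bolt", 10), (2, "nut", 20)]
r_new = [(1, "bolt", 12), (3, "screw", 5)]
print(snapshot_difference(r_new, r_old, key_indexes=[0]))   # [(3, 'screw', 5)] -- newly inserted row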

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template, by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.


The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An ongoing activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for the year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of the DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, as well as different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS Designer: a GUI used to interactively design and execute DTS packages.

- DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

- DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures that data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns, in order to produce more suitable formats.

- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

- Clustering transformations group together matching pairs with a high similarity value, by applying given grouping criteria (e.g., by transitive closure).

- Merging transformations are applied to each individual cluster, in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted at providing interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns: the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way, by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on the records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards being an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures, like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally we would like to mention that theliterature reports several efforts (both research andindustrial) for the management of processes andworkflows that operate on data warehouse sys-tems In [39] the authors describe an industrialeffort where the cleaning mechanisms of the datawarehouse are employed in order to avoid thepopulation of the sources with problematic data inthe fist place The described solution is based on aworkflow that employs techniques from the field ofview maintenance The industrial effort at DeutcheBank involving the importexport transformationand cleaning and storage of data in a Terabyte-sizedata warehouse is described in Ref [40] The paperexplains also the usage of metadata managementtechniques which involves a broad spectrum ofapplications from the import of data to themanagement of dimensional data and moreimportantly for the querying of the data ware-house A research effort (and its application in anindustrial application) for the integration andcentral management of the processes that liearound an information system is presented in thework of Jarke et al [41] A metadata managementrepository is employed to store the differentactivities of a large workflow along with impor-tant data that these processes employFinally we should refer the interested reader to

[6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach to modeling and managing ETL processes.

6. Discussion

In this section we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations rather than an interactive session where user

decisions and actions direct the flow (like, for example, in [45]).
Still, is this enough to capture all the aspects of

the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points

[46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.
We believe that an activity in our setting covers

all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.
It is possible, however, that due to the

complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.
Let us mention a recent experience report on the

topic: in [48] the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.
Of course, this kind of workflow approach

possibly suffers on the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
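To make the intuition behind this kind of checkpoint-based resumption concrete, the following Python fragment is a minimal sketch of our own (it is not the algorithm of [49]; the row format and the commit/checkpoint functions are hypothetical): assuming that the input rows arrive ordered on a key and that the last committed key is persisted, a restarted load simply skips every row up to that key.

    # Minimal sketch of checkpoint-based resumption for an ordered load.
    # Assumptions (ours, for illustration): rows arrive sorted on 'key', and the
    # caller supplies commit / read_checkpoint / write_checkpoint callables.
    def load_with_resumption(sorted_rows, commit, read_checkpoint, write_checkpoint,
                             checkpoint_every=1000):
        last_key = read_checkpoint()            # None on a fresh start
        since_checkpoint = 0
        for row in sorted_rows:
            if last_key is not None and row["key"] <= last_key:
                continue                        # already loaded before the failure
            commit(row)                         # push the row to the warehouse
            since_checkpoint += 1
            if since_checkpoint >= checkpoint_every:
                write_checkpoint(row["key"])    # persist progress periodically
                since_checkpoint = 0

Because the output is ordered, the amount of work repeated after a crash is bounded by the checkpoint interval, which is what makes recovery within the loading time-window feasible.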

7. Conclusions

In this paper we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report INRIA 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modelling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.
Page 9: Etl design document


functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4 we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation we merge several instantiation edges, so that the figure does not become too crowded.

Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter, or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP, and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced; (b) SOURCE, which stands for an integer


value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.
In the example of Fig. 5 we can also observe

how the parameters of activity SK1 are populated through regulator relationships. The parameters are mapped to the respective terms through regulator relationships: all the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table, and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider

relationship among the attributes of the involved schemata.
Formally, a provider relationship is defined by the following elements:

Name: A unique identifier for the provider relationship.

Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attributes. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.
In order to achieve the flow of data from the

providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider (or a constant) which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity; basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.
In Fig. 6 we depict the LDL definition of this

part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity; the second rule explains the semantics of the activity (i.e., how the surrogate key is generated); and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

    addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
        ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
        addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
        lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
        addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
        DATE=A_IN1_DATE, QTY=A_IN1_QTY, COST=A_IN1_COST, SOURCE=A_IN1_SOURCE, PKEY=A_IN1_SKEY.

    NOTE: For reasons of readability we do not replace the A in the attribute names with the activity name; i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship, that captures the flow from the input to the respective output attributes.
Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
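Since the rule above is purely mechanical, the derived provider relationships can be computed directly from the regulator edges of an activity. The following Python fragment is a small illustrative sketch (the data structures and names are ours, not part of ARKTOS II): given the regulator relationships that feed the parameters, and the regulator relationships through which parameters populate output attributes, it emits a derived provider edge pr(source, target) for every such pair.

    # Sketch: computing derived provider relationships from regulator edges.
    # rr_in : pairs (source_term, parameter)   -- the parameter is populated by the source
    # rr_out: pairs (parameter, output_attr)   -- the parameter populates an output attribute
    def derived_provider_edges(rr_in, rr_out):
        sources = {src for (src, _param) in rr_in}
        targets = {tgt for (_param, tgt) in rr_out}
        # One derived edge for every (source, target) combination of the same activity.
        return {(src, tgt) for src in sources for tgt in targets}

    # The SK1 fragment of Fig. 7: five parameters are fed, one of them (LSKEY) feeds SKEY.
    rr_in = [("SK1.IN.PKEY", "PKEY"), ("SK1.IN.SOURCE", "SOURCE"),
             ("LOOKUP.PKEY", "LPKEY"), ("LOOKUP.SOURCE", "LSOURCE"),
             ("LOOKUP.SKEY", "LSKEY")]
    rr_out = [("LSKEY", "SK1.OUT.SKEY")]
    print(derived_provider_edges(rr_in, rr_out))   # five edges, all ending in SK1.OUT.SKEY

This reproduces the five derived relationships that are discussed for Fig. 7 below.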


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationships.
Observe Fig. 7, where we depict a small part of

our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.
One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name: A unique identifier for the scenario.

Activities: A finite list of activities. Note that, by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation. For each entity, the model-specific (infinitely countable) domain and its scenario-specific finite subset are listed: Data Types (D^I, D), Function Types (F^I, F) and Constants (C^I, C) are built-in; Attributes (Ω^I, Ω), Functions (Φ^I, Φ), Schemata (S^I, S), RecordSets (RS^I, RS), Activities (A^I, A), Provider (Pr^I, Pr), Part-Of (Po^I, Po), Instance-Of (Io^I, Io), Regulator (Rr^I, Rr) and Derived Provider (Dr^I, Dr) relationships are user-provided.

Recordsets: A finite set of recordsets.

Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).
In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).
Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

    V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A,    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr.

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.
Moreover, we assume the following integrity constraints for a scenario:

Static constraints

All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints

All the attributes of the input schema(ta) of an activity should have a provider.


Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.
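Before moving on, a compact illustration may help: the following Python sketch (an informal toy of ours, not the ARKTOS II implementation) stores an architecture graph as node and edge sets, in the spirit of G(V, E) above, and checks the first data-flow constraint, namely that every input attribute of an activity has a provider.

    # Toy representation of an architecture graph and one integrity check.
    from dataclasses import dataclass, field

    @dataclass
    class ArchitectureGraph:
        nodes: set = field(default_factory=set)      # V: attributes, recordsets, activities, ...
        provider: set = field(default_factory=set)   # Pr: (provider_term, consumer_attribute)
        part_of: set = field(default_factory=set)    # Po: (container, attribute)
        # Io, Rr and Dr edge sets are omitted for brevity.

        def unfed_input_attributes(self, input_attributes):
            """Return the input attributes that violate the 'has a provider' constraint."""
            fed = {consumer for (_provider, consumer) in self.provider}
            return [attr for attr in input_attributes if attr not in fed]

    # Fragment of the running example: SK1.IN.DATE is deliberately left without a provider.
    g = ArchitectureGraph()
    g.provider.add(("DS.PS1.PKEY", "SK1.IN.PKEY"))
    print(g.unfed_input_attributes(["SK1.IN.PKEY", "SK1.IN.DATE"]))   # ['SK1.IN.DATE']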

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment (metamodel layer, template layer and schema layer).

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.
The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.
Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,


Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.
Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).
File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9 the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity, and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.
As far as the class RecordSet is concerned, in

the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of recurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses, whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.
The first group, named filters, provides checks

for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).
Summarizing, the metamodel layer is a set of

generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.
A template activity is formally defined by the following elements:

Name: A unique identifier for the template activity.

Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined, and possibly rearranged, at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism, and the particularities of the template taxonomy.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a special symbol at their beginning, and they are replaced by user-defined values at instantiation time. A list of parameters of arbitrary length is denoted by appending [ ] to the parameter name. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

    [<simple constraint>] {<loop body>}

where <simple constraint> has the form

    <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. The upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.

dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier, and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

    name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

    [iterator<maxLimit]name_theme_$iterator$,
    [iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

    a_out([i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$) <-
        a_in1([i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$),
        expr([i<arityOf(PARAM)] PARAM[$i$], [i=arityOf(PARAM)] PARAM[$i$]),
        [i<arityOf(a_out)] A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$=A_IN1_$i$

As already mentioned in the syntax for loops, the expression

    [i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of such variable-length attribute lists. We employ the following macros:

    DEFINE INPUT_SCHEMA AS
        [i<arityOf(a_in1)] A_IN1_$i$, [i=arityOf(a_in1)] A_IN1_$i$

    DEFINE OUTPUT_SCHEMA AS
        [i<arityOf(a_out)] A_OUT_$i$, [i=arityOf(a_out)] A_OUT_$i$

    DEFINE PARAM_SCHEMA AS
        [i<arityOf(PARAM)] PARAM[$i$], [i=arityOf(PARAM)] PARAM[$i$]

    DEFINE DEFAULT_MAPPING AS
        [i<arityOf(a_out)] A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)] A_OUT_$i$=A_IN1_$i$

Then the template definition is as follows

    a_out(OUTPUT_SCHEMA) <-
        a_in1(INPUT_SCHEMA),
        expr(PARAM_SCHEMA),
        DEFAULT_MAPPING
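The effect of this notation can be emulated with ordinary string processing. The following Python sketch is our own, deliberately simplified emulation (it handles only the [i<bound]body[i=bound]body loop form with the $i$ marker, treats macros as pre-expanded strings, and ignores keyword renaming); it is shown only to make the expansion of the selection template above tangible for a concrete arity.

    # Simplified emulation (ours) of loop expansion in the template language.
    def expand_loop(body, bound):
        """Expand e.g. 'A_OUT_$i$' with bound 3 into 'A_OUT_1,A_OUT_2,A_OUT_3'."""
        return ",".join(body.replace("$i$", str(i)) for i in range(1, bound + 1))

    def expand_selection_template(arity):
        """Produce the LDL rule of the selection template for a given schema arity."""
        out_attrs = expand_loop("A_OUT_$i$", arity)            # OUTPUT_SCHEMA
        in_attrs  = expand_loop("A_IN1_$i$", arity)            # INPUT_SCHEMA
        params    = expand_loop("PARAM[$i$]", arity)           # PARAM_SCHEMA
        mapping   = expand_loop("A_OUT_$i$=A_IN1_$i$", arity)  # DEFAULT_MAPPING
        return (f"a_out({out_attrs}) <- a_in1({in_attrs}), "
                f"expr({params}), {mapping}")

    print(expand_selection_template(3))
    # a_out(A_OUT_1,A_OUT_2,A_OUT_3) <- a_in1(A_IN1_1,A_IN1_2,A_IN1_3),
    #   expr(PARAM[1],PARAM[2],PARAM[3]), A_OUT_1=A_IN1_1,A_OUT_2=A_IN1_2,A_OUT_3=A_IN1_3

(The printed rule is shown wrapped over two comment lines; the actual output is a single line.) In the real mechanism, of course, the arity and the parameter values come from the schemata that the designer supplies at instantiation time, following the order described in the next subsection.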

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.
The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes

Fig. 12. Instantiation procedure.

(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.
In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

    OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of

the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual ranges for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from

Fig. 13. Simple template example: domain mismatch.

the input to the output is hardcoded in the template.
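Since Fig. 13 is not reproduced here, the following is a minimal illustrative sketch of what the instantiated LDL rule for DM1 could look like, based on the description above. The exact comparison syntax and the ordering of the conjuncts are assumptions for the purpose of illustration, not the actual content of the figure:

    dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4) <-
        dm1_in(DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4),
        DM1_IN_3 >= 5,
        DM1_IN_3 <= 10,
        DM1_OUT_1 = DM1_IN_1,
        DM1_OUT_2 = DM1_IN_2,
        DM1_OUT_3 = DM1_IN_3,
        DM1_OUT_4 = DM1_IN_4.

The two comparison conjuncts materialize the template expression over the parameter FIELD (here, the third input attribute), with Xlow=5 and Xhigh=10, while the equality conjuncts realize the default 1:1 mapping from the input to the output schema.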

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation, to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data satisfying it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference.

During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are

Fig. 14. Program-based template example: Difference activity.

given by the following calculus-like definition:

    Δ<A1,…,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ … ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
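As Fig. 14 is not reproduced here, the following is a minimal illustrative sketch of the shape of such an instantiation, assuming two-attribute input schemata and a single primary-key attribute; the attribute and predicate names, as well as the use of ~ for LDL negation, are assumptions for illustration, not the actual content of the figure:

    semijoin(DF1_IN1_PKEY, DF1_IN1_QTY) <-
        dF1_in1(DF1_IN1_PKEY, DF1_IN1_QTY),
        dF1_in2(DF1_IN2_PKEY, DF1_IN2_QTY),
        DF1_IN1_PKEY = DF1_IN2_PKEY.

    dF1_out(DF1_OUT_PKEY, DF1_OUT_QTY) <-
        dF1_in1(DF1_IN1_PKEY, DF1_IN1_QTY),
        ~semijoin(DF1_IN1_PKEY, DF1_IN1_QTY),
        DF1_OUT_PKEY = DF1_IN1_PKEY,
        DF1_OUT_QTY = DF1_IN1_QTY.

The intermediate semijoin predicate collects the tuples of the first input that have a matching key in the second input; negating it in the output rule keeps exactly the newly inserted rows.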

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be

captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already


defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16, we show a part of the scenario of Fig. 15. Observe (a) how part-of

relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is

supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository

(implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II

with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market had reached a size of $667 million for the year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn environment. In terms of technological

aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported

by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they are slowly starting to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major

vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors, through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: A GUI used to interactively design and execute DTS packages.

DTS Export and Import Wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS Programming Interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures that data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies

between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel

system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions to specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming

the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by

the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked

applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the field of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors

mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort, where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow which employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to

[6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of

the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers

all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple

inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the

topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing the data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for their valuable comments, which improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: An extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: Semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: Integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

Page 10: Etl design document


value that characterizes which source's data are processed; (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys; (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter); and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships. Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5, we can also observe

how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship between a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by

the following elements:

Name: A unique identifier for the provider relationship.

Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.

2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.

3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata, or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.


Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values of the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this

part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

    addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE) <-
        ds_ps1(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY) <-
        addSkey_in1(A_IN1_PKEY, A_IN1_DATE, A_IN1_QTY, A_IN1_COST, A_IN1_SOURCE),
        lookup(A_IN1_SOURCE, A_IN1_PKEY, A_OUT_SKEY),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    dw_partsupp(PKEY, DATE, QTY, COST, SOURCE) <-
        addSkey_out(A_OUT_PKEY, A_OUT_DATE, A_OUT_QTY, A_OUT_COST, A_OUT_SOURCE, A_OUT_SKEY),
        DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
        SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

NOTE: For reasons of readability, we do not replace the A in attribute names with the activity name, i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship, which captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.


Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), or (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).
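The definition translates directly into a computation over the regulator edges of a single activity. The following minimal sketch (ours, not part of the model or of ARKTOS II; the names of the running example are used purely for illustration) derives exactly the five provider relationships of Fig. 7.

def derived_provider_relationships(regulator_edges, parameters, output_attrs):
    """Compute the derived provider relationships pr(source, target) of one activity.
    regulator_edges: set of (a, b) regulator relationships (edges touching parameters)
    parameters:      the parameter names of the activity
    output_attrs:    the attributes of the activity's output schema
    pr(source, target) exists iff rr1(source, x) and rr2(y, target) exist,
    with x and y parameters (not necessarily different)."""
    sources = {s for (s, x) in regulator_edges if x in parameters and s not in parameters}
    targets = {t for (y, t) in regulator_edges if y in parameters and t in output_attrs}
    return {(s, t) for s in sources for t in targets}

# Illustrative fragment of the running example (activity SK1):
params = {"PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"}
out_attrs = {"SK1.OUT.PKEY", "SK1.OUT.SOURCE", "SK1.OUT.SKEY"}
rr = {("SK1.IN.PKEY", "PKEY"), ("SK1.IN.SOURCE", "SOURCE"),
      ("LOOKUP.PKEY", "LPKEY"), ("LOOKUP.SOURCE", "LSOURCE"),
      ("LOOKUP.SKEY", "LSKEY"), ("LSKEY", "SK1.OUT.SKEY")}
print(derived_provider_relationships(rr, params, out_attrs))  # five pairs, all ending in SK1.OUT.SKEY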

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

- Name: a unique identifier for the scenario.

- Activities: a finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.


Entity                            Model-specific    Scenario-specific
Built-in
  Data Types                      D^I               D
  Function Types                  F^I               F
  Constants                       C^I               C
User-provided
  Attributes                      Ω^I               Ω
  Functions                       Φ^I               Φ
  Schemata                        S^I               S
  RecordSets                      RS^I              RS
  Activities                      A^I               A
  Provider Relationships          Pr^I              Pr
  Part-Of Relationships           Po^I              Po
  Instance-Of Relationships       Io^I              Io
  Regulator Relationships         Rr^I              Rr
  Derived Provider Relationships  Dr^I              Dr

Fig. 8. Formal definition of domains and notation.


- Recordsets: a finite set of recordsets.

- Targets: a special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

- Provider relationships: a finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr
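For illustration only — this is our sketch and not the metadata structures of ARKTOS II — the architecture graph can be kept as a typed graph, with one kind per node domain and edge domain of Fig. 8.

from dataclasses import dataclass, field

# Node kinds (V) and edge kinds (E) of the architecture graph, following Fig. 8.
NODE_KINDS = {"data_type", "function_type", "constant", "attribute",
              "function", "schema", "recordset", "activity"}
EDGE_KINDS = {"provider", "part_of", "instance_of", "regulator", "derived_provider"}

@dataclass
class ArchitectureGraph:
    nodes: dict = field(default_factory=dict)   # name -> kind
    edges: set = field(default_factory=set)     # (kind, source_name, target_name)

    def add_node(self, name, kind):
        assert kind in NODE_KINDS
        self.nodes[name] = kind

    def add_edge(self, kind, source, target):
        assert kind in EDGE_KINDS and source in self.nodes and target in self.nodes
        self.edges.add((kind, source, target))

# Illustrative fragment of the motivating example:
g = ArchitectureGraph()
g.add_node("DS.PS1", "recordset")
g.add_node("SK1", "activity")
g.add_node("DS.PS1.PKEY", "attribute")
g.add_node("SK1.IN.PKEY", "attribute")
g.add_edge("part_of", "DS.PS1.PKEY", "DS.PS1")
g.add_edge("part_of", "SK1.IN.PKEY", "SK1")
g.add_edge("provider", "DS.PS1.PKEY", "SK1.IN.PKEY")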

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph we employ for the modeling of the data flow of the ETL scenario practically acts as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:

- All the attributes of the input schema(ta) of an activity should have a provider.


- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

- All the attributes of the schemata of the target recordsets should have a data provider (see the sketch right after this list for how such checks can be mechanized).
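The following minimal sketch (ours; the plain dictionary/set structures are hypothetical and not those of ARKTOS II) checks the two provider-related data flow constraints over a scenario.

def check_data_flow_constraints(input_schemas, target_schemas, provider_edges):
    """input_schemas:  {activity_name: [input attribute names]}
       target_schemas: {target recordset name: [attribute names]}
       provider_edges: set of (provider_attr, consumer_attr) provider relationships
       Returns the attributes that violate the 'must have a provider' constraints."""
    consumers_with_provider = {consumer for (_prov, consumer) in provider_edges}
    violations = []
    for _activity, attrs in input_schemas.items():
        violations += [a for a in attrs if a not in consumers_with_provider]
    for _recordset, attrs in target_schemas.items():
        violations += [a for a in attrs if a not in consumers_with_provider]
    return violations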

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,



Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Filters
- Selection (σ)
- Not null (NN)
- Primary key violation (PK)
- Foreign key violation (FK)
- Unique value (UN)
- Domain mismatch (DM)

Unary operations
- Push
- Aggregation (γ)
- Projection (Π)
- Function application (f)
- Surrogate key assignment (SK)
- Tuple normalization (N)
- Tuple denormalization (DN)

Binary operations
- Union (U)
- Join (⋈)
- Diff (∆)
- Update Detection (∆UPD)

File operations
- EBCDIC to ASCII conversion (EB2AS)
- Sort file (Sort)

Transfer operations
- Ftp (FTP)
- Compress/Decompress (Z/dZ)
- Encrypt/Decrypt (Cr/dCr)

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or an RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.
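In rough programming terms — our own analogy, not the actual implementation of ARKTOS II — the three layers behave like a class hierarchy: template classes are subclasses (IsA) of the metamodel classes, and the entities of a concrete scenario are their instances (InstanceOf).

# Metamodel layer: generic classes.
class ElementaryActivity: ...
class RecordSet: ...

# Template layer: specializations (IsA) of the metamodel classes.
class NotNull(ElementaryActivity): ...
class DomainMismatch(ElementaryActivity): ...
class SKAssignment(ElementaryActivity): ...
class SourceTable(RecordSet): ...
class FactTable(RecordSet): ...

# Schema layer: a concrete scenario instantiates (InstanceOf) the template classes.
nn = NotNull()            # activity NN of the example scenario
dm1 = DomainMismatch()    # activity DM1
sk1 = SKAssignment()      # activity SK1
source = SourceTable()    # S1.PARTSUPP
target = FactTable()      # DW.PARTSUPP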

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: a unique identifier for the template activity.

- Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

- Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

- Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined, and possibly rearranged, at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a special symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by <parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression

[iterator<maxLimit] { name_theme_$iterator$, }
[iterator=maxLimit] { name_theme_$iterator$ }

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of employed expressions. For example, observe the

definition of a template for a simple relational selection:

a_out( [i<arityOf(a_out)] { A_OUT_$i$, } [i=arityOf(a_out)] { A_OUT_$i$ } ) <-
   a_in1( [i<arityOf(a_in1)] { A_IN1_$i$, } [i=arityOf(a_in1)] { A_IN1_$i$ } ),
   expr( [i<arityOf(PARAM)] { PARAM[$i$], } [i=arityOf(PARAM)] { PARAM[$i$] } ),
   [i<arityOf(a_out)] { A_OUT_$i$ = A_IN1_$i$, }
   [i=arityOf(a_out)] { A_OUT_$i$ = A_IN1_$i$ }

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)] { A_OUT_$i$, } [i=arityOf(a_out)] { A_OUT_$i$ }

defining the attributes of the output schema a_out, simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)] { A_IN1_$i$, }
  [i=arityOf(a_in1)] { A_IN1_$i$ }

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)] { A_OUT_$i$, }
  [i=arityOf(a_out)] { A_OUT_$i$ }

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)] { PARAM[$i$], }
  [i=arityOf(PARAM)] { PARAM[$i$] }

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)] { A_OUT_$i$ = A_IN1_$i$, }
  [i=arityOf(a_out)] { A_OUT_$i$ = A_IN1_$i$ }

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING.

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.
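As an illustration of this ordering, the following simplified sketch (ours, not the ARKTOS II implementation; it skips macros, handles only a single non-nested iterator i, and merges the boundary evaluation of step 2 into the loop expansion of step 3) mimics the pipeline on a small selection-like template.

import re

def instantiate(template, arities, params, pred_name):
    """Simplified template instantiation: loop boundaries -> loop production ->
    parameter variables -> keyword renaming (a_out/a_in1, A_OUT/A_IN1 only)."""
    text = template
    # Steps (2)+(3): evaluate arityOf() in loop boundaries and expand the loops.
    def expand_loop(m):
        op, schema, body = m.group(1), m.group(2), m.group(3)
        n = arities[schema]
        values = range(1, n) if op == "<" else [n]
        return "".join(body.replace("$i$", str(v)) for v in values)
    text = re.sub(r"\[i(<|=)arityOf\((\w+)\)\]\{(.*?)\}", expand_loop, text)
    # Step (4): instantiate the remaining parameter variables, e.g. PARAM[2].
    text = re.sub(r"(\w+)\[(\d+)\]", lambda m: params[m.group(1)][int(m.group(2)) - 1], text)
    # Step (5): keyword renaming, based on the chosen unique predicate name.
    text = text.replace("a_out", pred_name + "_out").replace("a_in1", pred_name + "_in1")
    text = text.replace("A_OUT", pred_name.upper() + "_OUT").replace("A_IN1", pred_name.upper() + "_IN1")
    return text

selection = ("a_out([i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}) <- "
             "a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}), "
             "expr([i<arityOf(PARAM)]{PARAM[$i$],}[i=arityOf(PARAM)]{PARAM[$i$]}), "
             "[i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}[i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}")
print(instantiate(selection, {"a_out": 3, "a_in1": 3, "PARAM": 2},
                  {"PARAM": ["A_IN1_2", "5"]}, "sel1"))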

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial

template, as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] { A_OUT_$i$, } OUTFIELD, as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.


given by the following calculus-like definition:

Δ<A1...Ak>(R, S) = { x ∈ R | ¬∃ y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak] }

In Fig. 14 we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
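For intuition, a minimal sketch of this key-based difference (ours, not the LDL template of Fig. 14; relation and attribute names are illustrative) detects newly inserted rows by comparing two snapshots only on the key attributes.

def delta(pk, r_new, r_old):
    """Rows of r_new with no row in r_old agreeing with them on the attributes in pk.
    pk: list of attribute names; r_new, r_old: lists of dicts (one dict per tuple)."""
    old_keys = {tuple(row[a] for a in pk) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in pk) not in old_keys]

# Illustrative snapshots of a source table:
previous = [{"PKEY": 1, "QTY": 10}, {"PKEY": 2, "QTY": 5}]
current  = [{"PKEY": 1, "QTY": 12}, {"PKEY": 2, "QTY": 5}, {"PKEY": 3, "QTY": 7}]
print(delta(["PKEY"], current, previous))   # -> the newly inserted row with PKEY=3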

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already

Fig. 15. The motivating example in ARKTOS II.

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the



Fig. 16. A detailed zoom-in view of the motivating example.


model. As far as the provider and instance-of relationships are concerned, they are calculated automatically, and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL Activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as a basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported

from a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors, through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS designer: a GUI used to interactively design and execute DTS packages.

- DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

- DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

- Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

- Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows, (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed, and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39] the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, for the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to

[6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance

metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or

consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach to modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user

decisions and actions direct the flow (like, for example, in [45]).
Still, is this enough to capture all the aspects of

the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof of the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points

[46,47]. Function points is a methodology that tries to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.
We believe that an activity in our setting covers

all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and to other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.
It is possible, however, that due to the complex-

ity of the workflow, a more general approach should be followed, where activities have multiple

inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.
Let us mention a recent experience report on the

topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a 4-h loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.
Of course, this kind of workflow approach

possibly suffers with respect to software stability and, mostly, recovery. Having a large amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist; to our knowledge, the most prominent one is that of Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and therefore outside the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, Power Center, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products: Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (Ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (Ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.



Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that, during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values of the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.
In Fig. 6 we depict the LDL definition of this

part of the motivating example. The three rules correspond to the three categories of provider

Fig. 6. LDL specification of the motivating example. (Note in the figure: for readability, the A in the attribute names is not replaced by the respective activity name.)

relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated), and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived

provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.
Formally, assume that (a) source is a term in

the architecture graph, (b) target is an attribute of the output schema of an activity A, and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
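To make the definition concrete, the following is a minimal, illustrative sketch (in Python, not ARKTOS II code) of how derived provider relationships could be computed from the regulator edges of an activity; the edge-list representation and all parameter/attribute names in the toy data are assumptions made only for this example.

def derived_provider_relationships(regulator_edges, activities):
    """regulator_edges: set of (a, b) pairs, one per regulator edge rr(a, b).
    activities: dict name -> {"params": parameter nodes, "out": output attributes}.
    Returns the derived provider edges pr(source, target)."""
    derived = set()
    for act in activities.values():
        params, outs = act["params"], act["out"]
        # rr1(source, x): sources that regulate some parameter x of the activity
        sources = {a for (a, b) in regulator_edges if b in params}
        # rr2(y, target): output attributes fed by some parameter y of the activity
        targets = {b for (a, b) in regulator_edges if a in params and b in outs}
        derived |= {(s, t) for s in sources for t in targets}
    return derived

# Toy data loosely mirroring Fig. 7 (the parameter node names are hypothetical):
rr = {("SK1.IN.PKEY", "SK1.PAR.PKEY"), ("SK1.IN.SOURCE", "SK1.PAR.SOURCE"),
      ("LOOKUP.SKEY", "SK1.PAR.LSKEY"), ("SK1.PAR.LSKEY", "SK1.OUT.SKEY")}
acts = {"SK1": {"params": {"SK1.PAR.PKEY", "SK1.PAR.SOURCE", "SK1.PAR.LSKEY"},
                "out": {"SK1.OUT.SKEY"}}}
print(derived_provider_relationships(rr, acts))
# SK1.OUT.SKEY ends up derived-provided by SK1.IN.PKEY, SK1.IN.SOURCE and LOOKUP.SKEY,
# i.e., by all the attributes that populate the activity's parameters.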



Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.


Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationships.
Observe Fig. 7, where we depict a small part of

our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend on the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.
One can also assume different variations of

derived provider relationships, such as (a) relation-

ships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name. A unique identifier for the scenario.

Activities. A finite list of activities. Note that, by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation (model-specific set / scenario-specific subset):
- Built-in: Data Types (D^I / D), Function Types (F^I / F), Constants (C^I / C).
- User-provided: Attributes (Ω^I / Ω), Functions (Φ^I / Φ), Schemata (S^I / S), RecordSets (RS^I / RS), Activities (A^I / A), Provider Relationships (Pr^I / Pr), Part-Of Relationships (Po^I / Po), Instance-Of Relationships (Io^I / Io), Regulator Relationships (Rr^I / Rr), Derived Provider Relationships (Dr^I / Dr).

Recordsets. A finite set of recordsets.

Targets. A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships. A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).
In terms of formal modeling of the architecture

graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific

in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-

provided).
Formally, the architecture graph of an ETL

scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A,   E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr.

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines, during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph we employ for the modeling of the data flow of the ETL scenario practically acts as a blueprint of the architecture of this software artifact.
Moreover, we assume the following integrity

constraints for a scenario:

Static constraints.

All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints.

All the attributes of the input schema(ta) of an activity should have a provider.


Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider. (A small sketch of how such constraints can be checked over the architecture graph follows this list.)
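As a rough illustration, the sketch below (Python, assuming a plain edge-list representation of the architecture graph; activity and attribute names are hypothetical) checks the first data flow constraint, i.e., that every attribute of an activity's input schema has a provider.

def unprovided_input_attributes(provider_edges, activities):
    """provider_edges: set of (provider_attr, consumer_attr) edges.
    activities: dict name -> {"in": set of input-schema attributes}.
    Returns, per activity, the input attributes that violate the constraint."""
    provided = {consumer for (_provider, consumer) in provider_edges}
    violations = {}
    for name, act in activities.items():
        missing = {attr for attr in act["in"] if attr not in provided}
        if missing:
            violations[name] = missing
    return violations

pr = {("DS.PS1.PKEY", "SK1.IN.PKEY"), ("DS.PS1.DATE", "SK1.IN.DATE")}
acts = {"SK1": {"in": {"SK1.IN.PKEY", "SK1.IN.DATE", "SK1.IN.QTY"}}}
print(unprovided_input_attributes(pr, acts))   # {'SK1': {'SK1.IN.QTY'}} -> missing provider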

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.
The genericity doctrine was pursued through the

definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.
Observe Fig. 9 for a further explanation of our

framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,



Elementary Activity, RecordSet and Relationship. Thus, as one can see in the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.
Still, we can do better than the simple provision

of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
- Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
- Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
- Binary operations: Union (U), Join (⋈), Diff (∆), Update Detection (∆UPD).
- File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
- Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either; for instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.
As far as the class RecordSet is concerned, in

the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or an RDBMS table, or whether it is a source or a target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships that have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of recurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.
The first group, named filters, provides checks

for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Apart from the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations

(ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).
Summarizing, the metamodel layer is a set of

generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented by the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most

frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.
A template activity is formally defined by the

following elements:

Name. A unique identifier for the template activity.

Parameter list. A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression. A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping. A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop

iterators. Parameter variables are marked with a special symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by <parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>),

which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] {<loop body>}

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema, by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Fig. 11. Keywords for templates:
- a_out / a_in: a unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively). Example: difference3_out / difference3_in.
- A_OUT / A_IN: used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively). Example: DIFFERENCE3_OUT / DIFFERENCE3_IN.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the

definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$,
      [i=arityOf(a_out)]A_OUT_$i$) <-
  a_in1([i<arityOf(a_in1)]A_IN1_$i$,
        [i=arityOf(a_in1)]A_IN1_$i$),
  expr([i<arityOf(PARAM)]PARAM[$i$],
       [i=arityOf(PARAM)]PARAM[$i$]),
  [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
  [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of such variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]A_IN1_$i$,
  [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]A_OUT_$i$,
  [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)]PARAM[$i$],
  [i=arityOf(PARAM)]PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
  [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
  a_in1(INPUT_SCHEMA),
  expr(PARAM_SCHEMA),
  DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specify the schemata of the activity and give concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed. (A small sketch illustrating this order follows the list.)
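The following is a simplified, illustrative sketch (Python; not the actual ARKTOS II engine) of the five-step order above, over a deliberately tiny template. The macro bodies, the brace-delimited loop syntax and all names are assumptions made for the example, and parameter variables are simulated with plain string tokens.

import re

def instantiate(template, macros, arities, params, keyword_map):
    # 1. Replace macro definitions with their expansions.
    for name, body in macros.items():
        template = template.replace(name, body)
    # 2. Evaluate arityOf() calls (used in loop boundaries).
    template = re.sub(r"arityOf\((\w+)\)",
                      lambda m: str(arities[m.group(1)]), template)
    # 3. Loop production: expand [i<N]{body} / [i=N]{body}, substituting $i$.
    def expand(match):
        op, bound, body = match.group(1), int(match.group(2)), match.group(3)
        values = range(1, bound) if op == "<" else [bound]
        return "".join(body.replace("$i$", str(v)) for v in values)
    template = re.sub(r"\[i(<|=)(\d+)\]\{([^}]*)\}", expand, template)
    # 4. Instantiate the remaining parameter variables.
    for name, value in params.items():
        template = template.replace(name, value)
    # 5. Keyword renaming (attribute keywords first, then predicate names).
    for keyword, concrete in keyword_map.items():
        template = template.replace(keyword, concrete)
    return template

tmpl = "a_out(OUT_LIST) <- a_in1(IN_LIST), FIELD >= Xlow, DEF_MAP."
macros = {
    "OUT_LIST": "[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}",
    "IN_LIST":  "[i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}",
    "DEF_MAP":  "[i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}[i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}",
}
print(instantiate(tmpl, macros,
                  arities={"a_out": 3, "a_in1": 3},
                  params={"FIELD": "A_IN1_2", "Xlow": "5"},
                  keyword_map={"A_OUT": "DM1_OUT", "A_IN1": "DM1_IN",
                               "a_out": "dm1_out", "a_in1": "dm1_in"}))
# -> dm1_out(DM1_OUT_1,DM1_OUT_2,DM1_OUT_3) <- dm1_in(DM1_IN_1,DM1_IN_2,DM1_IN_3),
#    DM1_IN_2 >= 5, DM1_OUT_1=DM1_IN_1,DM1_OUT_2=DM1_IN_2,DM1_OUT_3=DM1_IN_3.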

We will briefly explain the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
Fig. 12 shows a simple example of template

instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.
The first row, template, shows the initial

template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro

expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are

also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword

renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities into two categories: simple templates, which cover single-predicate templates, and program-based tem-

plates, where many predicates are used in the template definition.
In the case of simple templates, the output

predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by

specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.
To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values of a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and are propagated to the output.
Observe Fig. 13, where we present an example of

the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limits of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.
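For illustration only, the following sketch (Python, with hypothetical row data; not the generated LDL) mimics what the instantiated DM1 activity computes: rows whose checked attribute lies within [Xlow, Xhigh] are propagated, while the rest are rejected.

def domain_mismatch(rows, field, x_low, x_high):
    """Split rows into (propagated, rejected) according to the range check."""
    propagated = [r for r in rows if x_low <= r[field] <= x_high]
    rejected = [r for r in rows if not (x_low <= r[field] <= x_high)]
    return propagated, rejected

rows = [{"A_IN_3": 4}, {"A_IN_3": 7}, {"A_IN_3": 12}]
print(domain_mismatch(rows, "A_IN_3", 5, 10))
# -> ([{'A_IN_3': 7}], [{'A_IN_3': 4}, {'A_IN_3': 12}])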

Program-based templates. The case of program-

based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case for operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in such a way that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each negating just one property, according to the logical rule ¬(q∧p) ≡ ¬q∨¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong rela-

tion, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.
Let us see in more detail the case of Differ-

ence. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of the attributes of the input records. Assume that, during the extraction process, we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression ∆<PK>(Rnew, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.


given by the following calculus-like definition: ∆<A1,...,Ak>(R, S) = {x∈R | ¬∃y∈S: x[A1]=y[A1] ∧ ... ∧ x[Ak]=y[Ak]}.
In Fig. 14, we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
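As a concrete reading of the calculus-like definition above, the following small sketch (Python; relation and attribute names are illustrative, and this is not the LDL template of Fig. 14) detects the newly inserted rows of a snapshot by comparing it against the previous one on the key attributes.

def difference(r_new, r_old, key):
    """Return the tuples of r_new with no r_old tuple agreeing on all key
    attributes, i.e., the newly inserted rows Delta<key>(r_new, r_old)."""
    old_keys = {tuple(t[a] for a in key) for t in r_old}
    return [t for t in r_new if tuple(t[a] for a in key) not in old_keys]

r_old = [{"PKEY": 1, "QTY": 10}, {"PKEY": 2, "QTY": 5}]
r_new = [{"PKEY": 1, "QTY": 10}, {"PKEY": 3, "QTY": 7}]
print(difference(r_new, r_old, key=["PKEY"]))   # [{'PKEY': 3, 'QTY': 7}]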

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.
All the details defining an activity can be

captured through forms andor simple point andclick operations More specifically the user mayexplore the data sources and the activities already

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" into the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed from the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects.


Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market had reached a size of $667 million for the year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse.


The Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: A GUI used to interactively design and execute DTS packages.

DTS export and import wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow.


Still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular, on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application.


Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized views maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data warehouse manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data transformation services, available at http://www.microsoft.com
[4] Oracle, Oracle warehouse builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.


Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left, and the derived provider relationships on the right.

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend on the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term), (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies), and (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).
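As a simple illustration of how such edges could be produced, the minimal Python sketch below derives one provider edge from every attribute that populates a parameter of the activity towards the computed output attribute; the representation and the attribute names are illustrative assumptions, not part of the formal model.

```python
# Sketch: every attribute populating a parameter of an activity becomes a derived
# provider of the output attribute that the activity computes.

def derived_provider_edges(parameter_providers, computed_output):
    """Return (source attribute, computed output attribute) edges."""
    return {(attr, computed_output) for attr in parameter_providers}

# Example for the surrogate-key activity SK1 of the running example (names assumed):
# derived_provider_edges(
#     {"SK1.IN.PKEY", "SK1.IN.SOURCE", "LOOKUP.PKEY", "LOOKUP.SOURCE", "LOOKUP.SKEY"},
#     "SK1.OUT.SKEY")
```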

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

Name: A unique identifier for the scenario.

Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.

Fig. 8. Formal definition of domains and notation. For each entity, the model-specific set (superscript I) and the scenario-specific set are listed:
- Built-in: Data Types (D^I, D), Function Types (F^I, F), Constants (C^I, C)
- User-provided: Attributes (Ω^I, Ω), Functions (Φ^I, Φ), Schemata (S^I, S), RecordSets (RS^I, RS), Activities (A^I, A), Provider Relationships (Pr^I, Pr), Part-Of Relationships (Po^I, Po), Instance-Of Relationships (Io^I, Io), Regulator Relationships (Rr^I, Rr), Derived Provider Relationships (Dr^I, Dr)

Recordsets: A finite set of recordsets.

Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:

- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints (a simple check of the first of these is sketched right after this list):

- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.
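The following is a minimal sketch of how the architecture graph could be represented and the first data-flow constraint checked; the edge representation and the names used are assumptions made for illustration, not part of the formal definition above.

```python
# Sketch: the architecture graph as a set of typed edges, plus a check that every
# attribute of an activity's input schema is fed by a provider (Pr) edge.

from collections import namedtuple

# kind is one of "Pr", "Po", "Io", "Rr", "Dr"
# (provider, part-of, instance-of, regulator, derived provider)
Edge = namedtuple("Edge", ["kind", "source", "target"])

def unprovided_inputs(input_attributes, edges):
    """Return the input attributes that no provider edge feeds."""
    provided = {e.target for e in edges if e.kind == "Pr"}
    return [a for a in input_attributes if a not in provided]

# Example (attribute names are illustrative):
# edges = [Edge("Pr", "S1.PARTSUPP.PKEY", "SK1.IN.PKEY")]
# unprovided_inputs(["SK1.IN.PKEY", "SK1.IN.SOURCE"], edges)  # -> ["SK1.IN.SOURCE"]
```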

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship.


Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
- Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM)
- Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN)
- Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (Δ_UPD)
- File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort)
- Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr)

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it into several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, the class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns' not included in the palette of the template layer occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: A unique identifier for the template activity.
- Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mapping at instantiation time, etc.
- Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.
- Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
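To make the above concrete, a minimal sketch of how such a template registry entry could be represented in code follows; the class and field names are our own illustrative choices, not part of the ARKTOS II implementation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TemplateActivity:
    """Illustrative container for the four elements of a template activity."""
    name: str                # unique identifier, e.g. "selection"
    parameters: List[str]    # regulators, e.g. ["PARAM"] or ["FIELD", "Xlow", "Xhigh"]
    expression: str          # LDL statement written in the template notation
    mapping: Dict[str, str] = field(default_factory=dict)  # default input-to-output bindings

# A hypothetical registration of the relational selection template of Section 3.2.1
selection = TemplateActivity(
    name="selection",
    parameters=["PARAM"],
    expression="a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING.",
    mapping={"A_OUT_$i$": "A_IN1_$i$"},
)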

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a special symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of parameters of arbitrary length is denoted by <parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
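As an illustration of how such a loop could be expanded mechanically, the following sketch (our own simplification, not the tool's actual code) handles a single, non-nested loop whose constraint has the form i<arityOf(x) or i=arityOf(x), with the loop body enclosed in braces as in the general form above:

import re

def expand_loop(text: str, arities: dict) -> str:
    """Expand loops of the simplified forms [i<arityOf(x)]{body} and [i=arityOf(x)]{body}.

    Assumes step 1, an implicit lower bound of 1 and a single iterator named i;
    'arities' maps schema names to their arity, known at instantiation time.
    """
    pattern = re.compile(r"\[i(<|=)arityOf\((\w+)\)\]\{(.*?)\}")

    def repl(match):
        op, schema, body = match.groups()
        n = arities[schema]
        # '<' iterates from 1 to n-1; '=' produces only the last element (i = n)
        values = range(1, n) if op == "<" else [n]
        return "".join(body.replace("$i$", str(i)) for i in values)

    return pattern.sub(repl, text)

print(expand_loop("a_out([i<arityOf(a_out)]{A_OUT_$i$, }[i=arityOf(a_out)]{A_OUT_$i$})",
                  {"a_out": 3}))
# -> a_out(A_OUT_1, A_OUT_2, A_OUT_3)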

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT / A_IN is used for constructing the names of the a_out / a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.
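A rough sketch of this renaming step, in the spirit of the rules of Fig. 11, is shown below; the plain substring replacement mirrors the fact that keywords are recognized even inside longer strings, and the unique activity name (here "dm1") is an assumption chosen for the example:

def rename_keywords(ldl: str, unique_name: str) -> str:
    """Rename template keywords to activity-specific names at instantiation time."""
    replacements = {
        "a_out": unique_name + "_out",
        "a_in": unique_name + "_in",
        "A_OUT": unique_name.upper() + "_OUT",
        "A_IN": unique_name.upper() + "_IN",
    }
    for keyword, new_name in replacements.items():
        ldl = ldl.replace(keyword, new_name)
    return ldl

print(rename_keywords("a_out(A_OUT_1) <- a_in(A_IN_1), A_OUT_1 = A_IN_1.", "dm1"))
# -> dm1_out(DM1_OUT_1) <- dm1_in(DM1_IN_1), DM1_OUT_1 = DM1_IN_1.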

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin-down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$) <-
    a_in1([i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$),
    expr([i<arityOf(PARAM)]PARAM[$i$], [i=arityOf(PARAM)]PARAM[$i$]),
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$.

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
    [i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
    [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
    [i<arityOf(PARAM)]PARAM[$i$], [i=arityOf(PARAM)]PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
    a_in1(INPUT_SCHEMA),
    expr(PARAM_SCHEMA),
    DEFAULT_MAPPING.
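A rough sketch of how such DEFINE macros could be expanded textually, before any other processing step, follows; the helper name and the plain string substitution are our own simplification (reusing the braced loop form assumed in the earlier loop-expansion sketch), not the actual ARKTOS II code:

def expand_macros(template: str, macros: dict) -> str:
    """Replace every macro name with its definition by plain textual substitution.

    Macro expansion is the first step of instantiation, applied before loop
    production, variable instantiation and keyword renaming.
    """
    for name, definition in macros.items():
        template = template.replace(name, definition)
    return template

macros = {
    "OUTPUT_SCHEMA": "[i<arityOf(a_out)]{A_OUT_$i$, }[i=arityOf(a_out)]{A_OUT_$i$}",
    "INPUT_SCHEMA": "[i<arityOf(a_in1)]{A_IN1_$i$, }[i=arityOf(a_in1)]{A_IN1_$i$}",
}
print(expand_macros("a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA).", macros))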

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes

template as it has been registered by the designerFUNCTION holds the name of the function to beused subtract in our case and the PARAM[ ]holds the inputs of the function which in our caseare the two attributes of the input schema Theproblem we have to face is that all input outputand function schemata have a variable number ofparameters To abstract from the complexity ofthis problem we define four macro definitions onefor each schema (INPUT_SCHEMA OUTPUT_SCHEMA FUNCTION_INPUT) along with a macrofor the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate

the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual ranges for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.
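Procedurally, the check performed by the instantiated DM1 activity corresponds to something like the following sketch; this is a plain Python rendering of the semantics under the four-attribute schema described above (not code produced by the tool), and the inclusive range check is an assumption:

def domain_mismatch(rows, field_index=2, x_low=5, x_high=10):
    """Propagate only the rows whose checked attribute lies within [x_low, x_high].

    field_index=2 addresses the third attribute (DM1_IN_3) of the four-attribute
    input schema of the example; rows failing the check would be routed to the
    activity's rejection schema instead.
    """
    for row in rows:
        if x_low <= row[field_index] <= x_high:
            yield row

rows = [(1, "a", 7, "x"), (2, "b", 12, "y")]
print(list(domain_mismatch(rows)))   # only the first row passes the range check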

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case for operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: Difference activity.



given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
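The definition reads as a key-based anti-join; the following small sketch (ours, purely for illustration) computes Δ over lists of dictionaries, given the key attributes A1,...,Ak:

def difference(r_new, r_old, key_attrs):
    """Return the rows of r_new that have no counterpart in r_old on key_attrs.

    Mirrors the calculus-like definition above: keep x in r_new unless some y
    in r_old agrees with x on every attribute in key_attrs.
    """
    seen = {tuple(y[a] for a in key_attrs) for y in r_old}
    return [x for x in r_new if tuple(x[a] for a in key_attrs) not in seen]

r_old = [{"pkey": 1, "qty": 5}]
r_new = [{"pkey": 1, "qty": 7}, {"pkey": 2, "qty": 3}]
print(difference(r_new, r_old, ["pkey"]))   # -> [{'pkey': 2, 'qty': 3}]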

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already


defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the model.

Fig. 15. The motivating example in ARKTOS II.


Fig. 16. A detailed zoom-in view of the motivating example.


As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system involving the scenario configuration, the employed templates and their constituents are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides a metadata management repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS Designer: A GUI used to interactively design and execute DTS packages.
- DTS Export and Import wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.
- DTS programming interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures that data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
- Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).
- Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures, like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics and, in particular, on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible that we apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products – Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: An extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: Semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: Integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

• A generic and customizable framework for the design of ETL scenarios
  • Introduction
  • Generic model of ETL activities
    • Graphical notation and motivating example
    • Preliminaries
    • Activities
    • Relationships in the architecture graph
    • Scenarios
  • Templates for ETL activities
    • General framework
    • Formal definition and usage of template activities
      • Notation
      • Instantiation
      • Taxonomy: simple and program-based templates
  • Implementation
  • Related work
    • Commercial studies and tools
    • Research efforts
    • Applications of ETL workflows in data warehouses
  • Discussion
  • Conclusions
  • Acknowledgments
  • References

[Fig. 8. Formal definition of domains and notation. For each entity, the model-specific (infinitely countable) set and its scenario-specific finite subset are listed: Data Types (D^I, D), Function Types (F^I, F), Constants (C^I, C), Attributes (Ω^I, Ω), Functions (Φ^I, Φ), Schemata (S^I, S), RecordSets (RS^I, RS), Activities (A^I, A), Provider Relationships (Pr^I, Pr), Part-Of Relationships (Po^I, Po), Instance-Of Relationships (Io^I, Io), Regulator Relationships (Rr^I, Rr), Derived Provider Relationships (Dr^I, Dr). Data types, function types and constants are built-in; the remaining entities are user-provided.]

Recordsets: A finite set of recordsets.

Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).

Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V, E), defined as follows:

V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A,    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design, and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact; a minimal programmatic encoding of such a graph is sketched below.
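As an illustration only (and not part of the ARKTOS II implementation), the following Python sketch encodes the node and edge sets of an architecture graph under the assumption that nodes and edges are simply tagged with their kind; the class name and the small fragment at the end are hypothetical.

    from dataclasses import dataclass, field

    # Node kinds: data types (D), function types (F), constants (C), attributes (Omega),
    # functions (Phi), schemata (S), recordsets (RS) and activities (A).
    NODE_KINDS = {"datatype", "functiontype", "constant", "attribute",
                  "function", "schema", "recordset", "activity"}
    # Edge kinds: provider (Pr), part-of (Po), instance-of (Io), regulator (Rr),
    # derived provider (Dr).
    EDGE_KINDS = {"provider", "part_of", "instance_of", "regulator", "derived_provider"}

    @dataclass
    class ArchitectureGraph:
        nodes: dict = field(default_factory=dict)   # node name -> node kind
        edges: list = field(default_factory=list)   # (source, target, edge kind)

        def add_node(self, name, kind):
            assert kind in NODE_KINDS
            self.nodes[name] = kind

        def add_edge(self, source, target, kind):
            assert kind in EDGE_KINDS and source in self.nodes and target in self.nodes
            self.edges.append((source, target, kind))

    # Hypothetical fragment: an attribute of a source recordset feeding an activity input.
    g = ArchitectureGraph()
    g.add_node("S1.PARTSUPP", "recordset")
    g.add_node("S1.PARTSUPP.COST", "attribute")
    g.add_node("NN.in.COST", "attribute")
    g.add_edge("S1.PARTSUPP.COST", "S1.PARTSUPP", "part_of")
    g.add_edge("S1.PARTSUPP.COST", "NN.in.COST", "provider")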

Moreover, we assume the following integrity constraints for a scenario:

Static constraints

All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).

All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints

All the attributes of the input schema(ta) of an activity should have a provider.


Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider. (A minimal check of the two data-flow constraints is sketched right after this list.)
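A minimal sketch of how the two data-flow constraints could be checked over such a graph representation is given below; it reuses the ArchitectureGraph sketch above and assumes, for illustration, that the activities' input-schema attributes and the target recordsets' attributes are supplied as plain dictionaries.

    def check_data_flow_constraints(graph, activity_inputs, target_attributes):
        # Attributes that receive data through a provider or derived provider edge.
        provided = {target for (_source, target, kind) in graph.edges
                    if kind in ("provider", "derived_provider")}
        violations = []
        for activity, attrs in activity_inputs.items():
            for attr in attrs:
                if attr not in provided:
                    violations.append((activity, attr, "input attribute without provider"))
        for recordset, attrs in target_attributes.items():
            for attr in attrs:
                if attr not in provided:
                    violations.append((recordset, attr, "target attribute without provider"))
        return violations

    # Hypothetical usage, with the fragment built in the previous sketch.
    print(check_data_flow_constraints(g, activity_inputs={"NN": ["NN.in.COST"]},
                                      target_attributes={}))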

Summarizing, in this section we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

[Fig. 9. The metamodel for the logical entities of the ETL environment. The metamodel layer comprises the classes Data Types, Functions, Elementary Activity, RecordSet and Relationships; the template layer comprises their specializations (e.g., NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Relationship), linked to the metamodel layer through IsA relationships; the schema layer comprises a specific scenario (e.g., S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP), linked to the upper layers through InstanceOf relationships.]

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,


Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

of a metalayer and an instance layer In order tomake our metamodel truly useful for practi-cal cases of ETL activities we enrich it with a setof ETL-specific constructs which constitute asubset of the larger metamodel layer namelythe template layer The constructs in the templatelayer are also meta-classes but they arequite customized for the regular cases of ETLactivities Thus the classes of the template layerare specializations (ie subclasses) of the genericclasses of the metamodel layer (depicted asIsA relationships in Fig 9) Through this custo-mization mechanism the designer can pick theinstances of the schema layer from a muchricher palette of constructs in this setting theentities of the schema layer are instantiations notonly of the respective classes of the metamodellayer but also of their subclasses in the templatelayer

[Fig. 10. Template activities, along with their graphical notation symbols, grouped by category: Filters (selection σ, not null NN, primary key violation PK, foreign key violation FK, unique value UN, domain mismatch DM); Unary operations (push, aggregation γ, projection Π, function application f, surrogate key assignment SK, tuple normalization N, tuple denormalization DN); Binary operations (union U, join, diff Δ, update detection ΔUPD); File operations (EBCDIC to ASCII conversion EB2AS, sort file Sort); Transfer operations (ftp FTP, compress/decompress Z/dZ, encrypt/decrypt Cr/dCr).]

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9 we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed. A template activity is formally defined by the following elements:

Name: A unique identifier for the template activity.

Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time. (A compact data-structure view of these four elements is sketched right after this list.)
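For illustration only, the four elements above could be captured by a record structure along the following lines; this is a sketch, not the actual ARKTOS II registration format, and the example values are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TemplateActivity:
        name: str         # unique identifier of the template
        parameters: list  # parameter names acting as regulators, e.g. ["@FIELD", "@Xlow", "@Xhigh"]
        expression: str   # declarative (LDL) statement, written in the template notation
        mapping: dict     # default bindings from input to output attributes

    # Hypothetical registration of a selection-like template.
    selection = TemplateActivity(
        name="selection",
        parameters=["@PARAM[]"],
        expression="a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING.",
        mapping={"A_OUT_$i$": "A_IN1_$i$"},
    )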

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables: We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions: We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>]{<loop body>}

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords: Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to

[Fig. 11. Keywords for templates:
- a_out / a_in: a unique name for the output/input schema of the activity; the predicate that is produced when the template is instantiated has the form <unique_pred_name>_out (or _in, respectively); example: difference3_out / difference3_in.
- A_OUT / A_IN: used for constructing the names of the a_out/a_in attributes; the names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively); example: DIFFERENCE3_OUT / DIFFERENCE3_IN.]

dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Macros: To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]{name_theme_$iterator$,}
[iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}) <-
   a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}),
   expr([i<arityOf(@PARAM)]{@PARAM[$i$],}[i=arityOf(@PARAM)]{@PARAM[$i$]}),
   [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned at the syntax for loops, the expression

[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out, simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
   [i<arityOf(a_in1)]{A_IN1_$i$,}
   [i=arityOf(a_in1)]{A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
   [i<arityOf(a_out)]{A_OUT_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
   [i<arityOf(@PARAM)]{@PARAM[$i$],}
   [i=arityOf(@PARAM)]{@PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
   [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING.

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes. (A miniature of this instantiation pipeline is sketched below.)
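The following Python sketch mimics the instantiation order above on a deliberately simplified version of the notation (loops written as [i<N]{body} with $i$ as the iterator marker, parameter variables prefixed with @); it is an illustration of the mechanism, not the actual ARKTOS II code, and the selection example at the end is hypothetical.

    import re

    def instantiate(template, macros, arities, params, keyword_map):
        text = template
        # 1. Replace macro definitions with their expansions.
        for name, body in macros.items():
            text = text.replace(name, body)
        # 2. Resolve arityOf() calls (simplified: everywhere, including loop boundaries).
        text = re.sub(r"arityOf\((\w+)\)", lambda m: str(arities[m.group(1)]), text)
        # 3. Loop production: repeat the loop body for i = 1..bound, instantiating $i$.
        def expand_loop(match):
            op, bound, body = match.group(1), int(match.group(2)), match.group(3)
            rng = range(1, bound) if op == "<" else range(bound, bound + 1)
            return "".join(body.replace("$i$", str(i)) for i in rng)
        text = re.sub(r"\[i(<|=)(\d+)\]\{([^}]*)\}", expand_loop, text)
        # 4. Instantiate the remaining parameter variables (marked with '@').
        for var, value in params.items():
            text = text.replace(var, value)
        # 5. Rename keywords (a_out, a_in1, ...) to activity-specific predicate names.
        for kw, concrete in keyword_map.items():
            text = text.replace(kw, concrete)
        return text

    # Hypothetical instantiation of a selection template over a 2-attribute input.
    template = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(@PARAM)."
    macros = {
        "OUTPUT_SCHEMA": "[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}",
        "INPUT_SCHEMA": "[i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}",
    }
    print(instantiate(template, macros,
                      arities={"a_out": 2, "a_in1": 2},
                      params={"@PARAM": "A_IN1_1>5"},
                      keyword_map={"a_out": "sel1_out", "a_in1": "sel1_in",
                                   "A_OUT": "SEL1_OUT", "A_IN1": "SEL1_IN"}))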

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes

Fig. 12. Instantiation procedure.

(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}@OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates: Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by

specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from

Fig. 13. Simple template example: domain mismatch.

the input to the output is hardcoded in the template.
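Stripped of the template machinery, the instantiated DM1 activity performs a plain range check. A minimal sketch of the equivalent computation, with hypothetical data, is the following (the rejection schema is not modeled here).

    def domain_mismatch(rows, field_index, x_low, x_high):
        # Rows whose checked field falls within [x_low, x_high] pass to the output.
        return [row for row in rows if x_low <= row[field_index] <= x_high]

    # Hypothetical input with four attributes; the third one (index 2) is checked against [5, 10].
    rows_in = [(1, "a", 7, "x"), (2, "b", 12, "y")]
    rows_out = domain_mismatch(rows_in, field_index=2, x_low=5, x_high=10)  # keeps only the first row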

Program-based templates: The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data satisfying it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and, possibly, updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that, during the extraction process, we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ_<PK>(R_new, R). The formal semantics of the difference operator are

Fig. 14. Program-based template example: the Difference activity.

given by the following calculus-like definition: Δ_<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}.

In Fig. 14 we can see the template of the

Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
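The semantics of this operator can be illustrated with a small sketch that computes the newly inserted rows of a snapshot with respect to a set of key positions; the relation contents below are hypothetical.

    def key_difference(r_new, r_old, key_positions):
        # Delta over a subset of attributes: rows of r_new with no r_old row agreeing
        # on all key positions (e.g. the primary key) are reported as newly inserted.
        old_keys = {tuple(row[i] for i in key_positions) for row in r_old}
        return [row for row in r_new
                if tuple(row[i] for i in key_positions) not in old_keys]

    # Hypothetical snapshots keyed on the first attribute.
    previous = [(10, "widget", 5.0)]
    current = [(10, "widget", 5.5), (11, "gadget", 7.0)]
    new_rows = key_difference(current, previous, key_positions=[0])  # [(11, "gadget", 7.0)]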

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already


defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the model.

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16, we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 millions for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts, which involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools, and we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM: DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of the DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft: The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: A GUI used to interactively design and execute DTS packages.

DTS export and import wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle: Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software: The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica: Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL: The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value, by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster, in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning: An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35-37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36] the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the WfMC standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39] the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48] the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a loading window of 4 h for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52-61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381-390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79-94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262-279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326-339, Bratislava, Slovakia, September 8-11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431-445, Stockholm, Sweden, June 5-9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.
[40] E. Schäfer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14-21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46-57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247-262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.



Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.

All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section we have presented a generic model for the modeling of the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

Fig. 9. The metamodel for the logical entities of the ETL environment. (The figure shows the metamodel layer with the classes Data Types, Functions, Elementary Activity, RecordSet and Relationships; the template layer with specializations such as NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table and Provider Rel.; and the schema layer with the scenario entities S1.PARTSUPP, NN, DM1, SK1 and DW.PARTSUPP, connected to the upper layers through IsA and InstanceOf links.)

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type,



Elementary Activity, RecordSet and Relationship. Thus, as one can see in the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
Filters - Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
Unary operations - Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
Binary operations - Union (U), Join (⋈), Diff (Δ), Update detection (Δ_UPD).
File operations - EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
Transfer operations - Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either; for instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or an RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see in the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented by the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several patterns not included in the palette of the template layer occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.
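To make the layering more concrete, the following sketch (our own illustration, not part of the actual ARKTOS II implementation, and with invented attribute values) mirrors the three layers in Python: template-layer entities are subclasses (IsA) of the metamodel classes, and the entities of a concrete scenario are instances (InstanceOf) of those classes.

# A minimal sketch of the metamodel / template / schema layering.
# The class names follow the paper; the attribute values are illustrative only.

class RecordSet:                          # metamodel layer
    def __init__(self, name, attributes):
        self.name, self.attributes = name, attributes

class ElementaryActivity:                 # metamodel layer
    def __init__(self, name, inputs, outputs, parameters=()):
        self.name, self.inputs = name, inputs
        self.outputs, self.parameters = outputs, parameters

# template layer: specializations (IsA) of the metamodel classes
class SourceTable(RecordSet): pass
class FactTable(RecordSet): pass
class NotNull(ElementaryActivity): pass
class DomainMismatch(ElementaryActivity): pass
class SKAssignment(ElementaryActivity): pass

# schema layer: a fragment of the example scenario; InstanceOf is simply the
# host language's class-instance relationship
src = SourceTable("S1.PARTSUPP", ["PKEY", "QTY", "COST"])
dw = FactTable("DW.PARTSUPP", ["PKEY", "SUPPKEY", "QTY", "COST"])
nn = NotNull("NN", inputs=[src], outputs=[], parameters=("COST",))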

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

Name: a unique identifier for the template activity.

Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
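Read operationally, the four elements above form a simple record. The following sketch (ours, purely illustrative; the field values are hypothetical) shows one way such a registered template could be held before instantiation.

from dataclasses import dataclass, field

@dataclass
class TemplateActivity:
    name: str            # unique identifier of the template
    parameters: list     # regulator names, e.g. ["@PARAM[]"]
    expression: str      # LDL statement describing the operation
    mapping: dict = field(default_factory=dict)  # default input-to-output bindings

selection = TemplateActivity(
    name="Selection",
    parameters=["@PARAM[]"],
    expression="a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING",
    mapping={"A_IN1_$i$": "A_OUT_$i$"},
)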

The template mechanism we use is a substitutionmechanism based on macros that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism and the particularities of the template taxonomy.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with an @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -; valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
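As an illustration of the loop mechanism, the following sketch (ours, a deliberately simplified re-implementation in Python, not the actual template engine) expands a loop expression for an assumed arity of 3 for a_out.

import re

def expand_loops(text, arities):
    """Expand [i<bound]{body} / [i=bound]{body} loops; bound is a number or
    arityOf(<schema>), and $i$ in the body is replaced by the iterator value."""
    def bound_value(expr):
        m = re.fullmatch(r"arityOf\((\w+)\)", expr)
        return arities[m.group(1)] if m else int(expr)
    def unfold(match):
        op, bound, body = match.group(1), bound_value(match.group(2)), match.group(3)
        values = range(1, bound) if op == "<" else [bound]
        return "".join(body.replace("$i$", str(v)) for v in values)
    return re.sub(r"\[i(<|=)([^\]]+)\]\{(.*?)\}", unfold, text)

template = "a_out([i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$})"
print(expand_loops(template, {"a_out": 3}))   # a_out(A_OUT_1,A_OUT_2,A_OUT_3)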

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to


Fig. 11. Keywords for templates.

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]{name_theme_$iterator$,}
[iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]{A_OUT_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$}) <-
   a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
         [i=arityOf(a_in1)]{A_IN1_$i$}),
   expr([i<arityOf(@PARAM)]{@PARAM[$i$],}
        [i=arityOf(@PARAM)]{@PARAM[$i$]}),
   [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]{A_OUT_$i$,} [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]{A_IN1_$i$,}
  [i=arityOf(a_in1)]{A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]{A_OUT_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
  [i<arityOf(@PARAM)]{@PARAM[$i$],}
  [i=arityOf(@PARAM)]{@PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
   a_in1(INPUT_SCHEMA),
   expr(PARAM_SCHEMA),
   DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template

instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,} OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
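As a summary of the mechanics, the following sketch (our own toy re-implementation in Python, not the ARKTOS II engine; the helper names and the example values are ours) applies the five steps in the prescribed order as successive textual passes, and reproduces the flavor of the selection template of Section 3.2.1 for a two-attribute schema.

import re

def instantiate(template, macros, arities, params, activity):
    """Toy version of the five-step instantiation order described above."""
    text = template
    for name, body in macros.items():            # 1. macro expansion
        text = text.replace(name, body)
    def unfold(m):                                # 2-3. loop boundaries + loop production
        op, bound, body = m.group(1), m.group(2), m.group(3)
        n = arities[bound[len("arityOf("):-1]] if bound.startswith("arityOf(") else int(bound)
        values = range(1, n) if op == "<" else [n]
        return "".join(body.replace("$i$", str(v)) for v in values)
    text = re.sub(r"\[i(<|=)([^\]]+)\]\{(.*?)\}", unfold, text)
    for name, value in params.items():            # 4. remaining parameter variables
        text = text.replace(name, str(value))
    # 5. keyword renaming: a_out/a_in -> <activity>_out/_in, A_OUT/A_IN -> upper case
    text = text.replace("a_out", activity + "_out").replace("a_in", activity + "_in")
    text = text.replace("A_OUT", activity.upper() + "_OUT").replace("A_IN", activity.upper() + "_IN")
    return text

macros = {
    "OUTPUT_SCHEMA": "[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}",
    "INPUT_SCHEMA": "[i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}",
    "PARAM_SCHEMA": "[i<arityOf(@PARAM)]{@PARAM[$i$],}[i=arityOf(@PARAM)]{@PARAM[$i$]}",
    "DEFAULT_MAPPING": "[i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}[i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}",
}
rule = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING."
print(instantiate(rule, macros, {"a_out": 2, "a_in1": 2, "@PARAM": 1},
                  {"@PARAM[1]": "SEL1_IN1_1>0"}, "sel1"))
# (output wrapped for readability)
# sel1_out(SEL1_OUT_1,SEL1_OUT_2) <- sel1_in1(SEL1_IN1_1,SEL1_IN1_2),
#   expr(SEL1_IN1_1>0), SEL1_OUT_1=SEL1_IN1_1,SEL1_OUT_2=SEL1_IN1_2.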

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from


Fig. 13. Simple template example: domain mismatch.


the input to the output is hardcoded in the template.
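Since Fig. 13 is not reproduced here, the following sketch (ours; the example data and attribute positions are invented) gives the operational reading of the instantiated DM1 activity: rows whose third attribute lies within [5, 10] are propagated to the output, while the rest would populate the rejection schema.

def domain_mismatch(rows, field_index, x_low, x_high):
    """Toy reading of the DM1 activity: keep rows whose checked attribute
    falls within [x_low, x_high]; the remaining rows violate the domain."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if x_low <= row[field_index] <= x_high else rejected).append(row)
    return accepted, rejected

dm1_in = [(1, "a", 7, "x"), (2, "b", 12, "y")]
dm1_out, dm1_rejected = domain_mismatch(dm1_in, field_index=2, x_low=5, x_high=10)
# dm1_out == [(1, 'a', 7, 'x')], dm1_rejected == [(2, 'b', 12, 'y')]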

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q∧p) ≡ ¬q∨¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ⟨PK⟩(R_new, R). The formal semantics of the difference operator are


Fig. 14. Program-based template example: the Difference activity.


given by the following calculus-like definition:

Δ⟨A1,...,Ak⟩(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
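For an operational view of the Δ operator, the following sketch (ours, not the LDL of Fig. 14) computes the newly inserted rows of a snapshot comparison by checking for agreement on the key attributes, which is exactly the check that the negated semijoin predicate performs in the Difference template.

def difference(r_new, r_old, key):
    """Delta_<key>(R_new, R_old): rows of r_new that agree with no r_old row
    on all attributes in key, i.e. the newly inserted records."""
    old_keys = {tuple(row[a] for a in key) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in key) not in old_keys]

r_old = [{"PKEY": 1, "QTY": 10}, {"PKEY": 2, "QTY": 5}]
r_new = [{"PKEY": 1, "QTY": 12}, {"PKEY": 3, "QTY": 7}]
print(difference(r_new, r_old, key=["PKEY"]))   # [{'PKEY': 3, 'QTY': 7}]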

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already

Fig. 15. The motivating example in ARKTOS II.

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" into the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed by the



Fig. 16. A detailed zoom-in view of the motivating example.


model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in ARKTOS II repository

(implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as basis for our repository and Ms Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II

with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological

aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's Powercenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported

by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major

vendors in the area of the commercial ETL tools, and we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse

schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows users to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data

Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC, and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS designer: a GUI used to interactively design and execute DTS packages.

- DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

- DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the

runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager), and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules: Manager, Designer, Director and

Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX

system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies

between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
- Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).
- Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific

primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel

system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format

(application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming

the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35,36,37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by

the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications

and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the field of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors

mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to

[6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance

metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or

consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user

decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of

the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points

[46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers

all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible that we apply this modeling for the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity

of the workflow, a more general approach should be followed, where activities have multiple

inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the

topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach

possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and therefore the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized views maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at tmit: http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

Page 15: Etl design document


Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision

of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations, not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category. Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM). Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN). Binary operations: Union (U), Join (⋈), Diff (Δ), Update Detection (ΔUPD). File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort). Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and specifically of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses depicted in Fig. 9. Relationships do not escape this rule either: for instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class Recordset is concerned, in

the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks

for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations

(ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of

generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most

frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.
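To give a first flavor of how such a template-layer specialization materializes at the schema layer, the following is a minimal sketch of our own (not an excerpt from the tool or from the paper's scenario): it indicates what an instantiated Not Null (NN) check over a hypothetical two-attribute recordset could look like in LDL, assuming an activity named nn1, hypothetical attribute names, and the convention that missing values are represented by the constant 'null'.

nn1_out(NN1_OUT_PKEY, NN1_OUT_COST) <-
    nn1_in1(NN1_IN1_PKEY, NN1_IN1_COST),
    NN1_IN1_COST ~= 'null',
    NN1_OUT_PKEY = NN1_IN1_PKEY,
    NN1_OUT_COST = NN1_IN1_COST

The exact notation and naming conventions that the template mechanism uses to produce such rules are detailed in the subsection that follows.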

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the

following elements:

- Name: a unique identifier for the template activity.

- Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

- Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

- Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop

iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #define statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates, by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[<simple constraint>]{<loop body>}

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
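As a small, self-contained illustration of the loop mechanism (our own example, using the standard keyword names introduced below), consider the expression that lists the attributes of an output schema:

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

Assuming that arityOf(a_out) evaluates to 3 at instantiation time, the loop production unfolds this expression into:

A_OUT_1, A_OUT_2, A_OUT_3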

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to


Fig. 11. Keywords for templates.

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.
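As a sketch of this renaming (our own illustration, assuming hypothetical two-attribute schemata), a rule written at the template level over two input schemata, such as

a_out(A_OUT_1, A_OUT_2) <- a_in1(A_IN1_1, A_IN1_2), a_in2(A_IN2_1, A_IN2_2), ...

would, for an activity whose unique predicate name is dm1, be rewritten at instantiation time as

dm1_out(DM1_OUT_1, DM1_OUT_2) <- dm1_in1(DM1_IN1_1, DM1_IN1_2), dm1_in2(DM1_IN2_1, DM1_IN2_2), ...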

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the

definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$) <-
    a_in1([i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$),
    expr([i<arityOf(@PARAM)]@PARAM[$i$], [i=arityOf(@PARAM)]@PARAM[$i$]),
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
    [i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
    [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
    [i<arityOf(@PARAM)]@PARAM[$i$], [i=arityOf(@PARAM)]@PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
    a_in1(INPUT_SCHEMA),
    expr(PARAM_SCHEMA),
    DEFAULT_MAPPING
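To indicate the outcome of instantiating such a template (this is our own sketch rather than output of the tool; the activity name sel1, the arity of 2 and the choice of parameter are hypothetical, and the condition predicate is kept abstract as expr, which at instantiation time would be bound to a concrete comparison), the produced LDL code could have the following shape:

sel1_out(SEL1_OUT_1, SEL1_OUT_2) <-
    sel1_in1(SEL1_IN1_1, SEL1_IN1_2),
    expr(SEL1_IN1_2),
    SEL1_OUT_1 = SEL1_IN1_1, SEL1_OUT_2 = SEL1_IN1_2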

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity.

instantiation for the function application activityTo understand the overall process better firstobserve the outcome of it ie the specific activitywhich is produced as depicted in the final row ofFig 12 labeled keyword renaming The outputschema of the activity fa12_out is the head ofthe LDL rule that specifies the activity The bodyof the rule says that the output records arespecified by the conjunction of the followingclauses (a) the input schema myFunc_in (b)the application of function subtract over theattributes COST_IN PRICE_IN and the produc-tion of a value PROFIT and (c) the mapping ofthe input to the respective output attributes asspecified in the last three conjuncts of the ruleThe first row template shows the initial

template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] {A_OUT_$i$,} OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As it can easily be seen, these expansions must be done before @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed.


Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single-predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual ranges for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants.

Fig. 13. Simple template example: domain mismatch.

The mapping from the input to the output is hardcoded in the template.
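To make the semantics of the instantiated DM1 activity concrete, here is a small, purely illustrative Python equivalent (ours, not code generated by ARKTOS II): rows whose third attribute falls within [5, 10] are propagated to the output, mirroring the check that the instantiated LDL rule performs. The tuples used below are invented sample data.

# Illustrative Python rendering of the DM1 domain-mismatch check.
def dm1(rows, field_index=2, x_low=5, x_high=10):
    """Propagate only the rows whose checked attribute (here the third one,
    corresponding to A_IN_3 / DM1_IN_3) lies in [x_low, x_high]."""
    return [row for row in rows if x_low <= row[field_index] <= x_high]

dm1_in = [(1, "a", 7, "x"), (2, "b", 12, "y"), (3, "c", 5, "z")]
dm1_out = dm1(dm1_in)   # [(1, "a", 7, "x"), (3, "c", 5, "z")]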

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that it requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(Rnew, R).

Fig. 14. Program-based template example: Difference activity.

The formal semantics of the difference operator are given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1]=y[A1] ∧ ... ∧ x[Ak]=y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
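A hedged illustration of these semantics (our own sketch, not the template's LDL code): the snapshot difference can be computed by keeping the rows of the new snapshot whose primary-key projection has no match in the old one. The relation contents below are invented sample data.

# Illustrative only: Delta<PK>(r_new, r_old) keeps the rows of r_new whose
# PK-projection does not appear in r_old, i.e., the newly inserted rows.
def key_difference(r_new, r_old, pk_indexes):
    project = lambda row: tuple(row[i] for i in pk_indexes)
    old_keys = {project(row) for row in r_old}
    return [row for row in r_new if project(row) not in old_keys]

r_old = [(10, "Smith", 100), (20, "Jones", 200)]
r_new = [(10, "Smith", 120), (30, "Brown", 300)]
print(key_difference(r_new, r_old, pk_indexes=[0]))   # [(30, 'Brown', 300)]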

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.


The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 millions for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's Powercenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit to aim at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse.


The Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: a GUI used to interactively design and execute DTS packages.

DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35-37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures, like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling, and do not deal with data-centric issues any further.


It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application.


Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible that we apply this modeling for the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized views maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52-61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June 2003.
[8] R. Kimbal, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, AJAX: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381-390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (Ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (Ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79-94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262-279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326-339, Bratislava, Slovakia, September 8-11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431-445, Stockholm, Sweden, June 5-9, 2000.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14-21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering Workshop (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46-57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247-262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.



Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements (an illustrative sketch is given right after the list):

Name: a unique identifier for the template activity.

Parameter list: a set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.

Expression: a declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.

Mapping: a set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
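The following minimal Python sketch (ours, purely illustrative) shows how these four elements could be captured as a record; the field contents below form a toy domain-mismatch template in the style of the notation of Section 3.2.1, not the actual registry format of ARKTOS II.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TemplateActivity:
    name: str               # unique identifier of the template
    parameters: List[str]   # regulators used in the semantics
    expression: str         # declarative (LDL-style) statement
    mapping: Dict[str, str] # default input-to-output attribute bindings

domain_mismatch = TemplateActivity(
    name="DomainMismatch",
    parameters=["FIELD", "Xlow", "Xhigh"],
    expression="a_out(OUTPUT_SCHEMA) <- a_in(INPUT_SCHEMA), "
               "@FIELD >= @Xlow, @FIELD <= @Xhigh",
    mapping={"A_OUT_$i$": "A_IN_$i$"},
)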

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code.


This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanisms and the template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables, that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords, to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a special symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by ⟨parameter name⟩[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function arityOf(⟨input/output/parameter schema⟩), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[⟨simple constraint⟩] {⟨loop body⟩}

where ⟨simple constraint⟩ has the form

⟨lower bound⟩ ⟨comparison operator⟩ ⟨iterator⟩ ⟨comparison operator⟩ ⟨upper bound⟩

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -; valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
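To illustrate how such a loop can be mechanically unfolded, the sketch below (Python, purely illustrative and not the actual ARKTOS II implementation) expands a loop body by replacing the marked iterator with its successive values; the two-branch call at the end mimics the usual pattern of a comma-terminated body followed by a comma-free last element:

```python
import re

def expand_loop(body: str, iterator: str, lower: int, upper: int) -> str:
    """Repeat the loop body once per iteration, replacing every marked
    appearance of the iterator ($i$) by its current value."""
    marker = re.escape(f"${iterator}$")
    return "".join(re.sub(marker, str(i), body) for i in range(lower, upper + 1))

arity = 4  # e.g. the value returned by arityOf(a_out) at instantiation time
print(expand_loop("A_OUT_$i$,", "i", 1, arity - 1)
      + expand_loop("A_OUT_$i$", "i", arity, arity))
# -> A_OUT_1,A_OUT_2,A_OUT_3,A_OUT_4
```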

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]{name_theme_$iterator$,}
[iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]{A_OUT_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$}) <-
  a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
        [i=arityOf(a_in1)]{A_IN1_$i$}),
  expr([i<arityOf(PARAM)]{PARAM[$i$],}
       [i=arityOf(PARAM)]{PARAM[$i$]}),
  [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned at the syntax for loops, the expression

[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out, simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]{A_IN1_$i$,}
  [i=arityOf(a_in1)]{A_IN1_$i$}

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]{A_OUT_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$}

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)]{PARAM[$i$],}
  [i=arityOf(PARAM)]{PARAM[$i$]}

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
  [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
  a_in1(INPUT_SCHEMA),
  expr(PARAM_SCHEMA),
  DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3), because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial

template, as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes


Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before the PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
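The following sketch mimics this ordering as a sequence of string-rewriting passes over a toy template (Python; the helper names, the toy template and the renaming map are invented for illustration and do not reproduce the actual ARKTOS II code generator):

```python
import re

def instantiate(template, macros, arities, params, renaming):
    # 1. Replace macro definitions with their expansions.
    for name, body in macros.items():
        template = template.replace(name, body)
    # 2. Evaluate arityOf() calls appearing in loop boundaries.
    template = re.sub(r"arityOf\((\w+)\)", lambda m: str(arities[m.group(1)]), template)
    # 3. Perform loop productions for [i<N]{body} and [i=N]{body}, replacing $i$.
    def unfold(m):
        op, bound, body = m.group(1), int(m.group(2)), m.group(3)
        rng = range(1, bound) if op == "<" else range(bound, bound + 1)
        return "".join(body.replace("$i$", str(i)) for i in rng)
    template = re.sub(r"\[i(<|=)(\d+)\]\{([^}]*)\}", unfold, template)
    # 4. Instantiate the remaining parameter variables.
    for name, value in params.items():
        template = template.replace(name, value)
    # 5. Recognize and rename keywords (a_out, a_in1, A_OUT, A_IN1, ...).
    for kw, concrete in renaming.items():
        template = template.replace(kw, concrete)
    return template

toy = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM)."
macros = {"OUTPUT_SCHEMA": "[i<arityOf(a_out)]{A_OUT_$i$,}[i=arityOf(a_out)]{A_OUT_$i$}",
          "INPUT_SCHEMA":  "[i<arityOf(a_in1)]{A_IN1_$i$,}[i=arityOf(a_in1)]{A_IN1_$i$}"}
print(instantiate(toy, macros, {"a_out": 2, "a_in1": 2}, {"PARAM": "A_IN1_2>5"},
                  {"a_out": "sel1_out", "a_in1": "sel1_in",
                   "A_OUT": "SEL1_OUT", "A_IN1": "SEL1_IN"}))
# -> sel1_out(SEL1_OUT_1,SEL1_OUT_2) <- sel1_in(SEL1_IN_1,SEL1_IN_2), expr(SEL1_IN_2>5).
```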

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of a LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from

Fig. 13. Simple template example: domain mismatch.

the input to the output is hardcoded in the template.
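For intuition, the net effect of the instantiated DM1 activity corresponds to a simple range filter of the following kind (a Python sketch of the semantics only, assuming rows arrive as tuples whose third attribute is the checked one; it is not code produced by the tool):

```python
def domain_mismatch(rows, field_index=2, xlow=5, xhigh=10):
    """Propagate only the rows whose checked attribute lies in [xlow, xhigh];
    the remaining rows are not propagated to the output."""
    accepted = [r for r in rows if xlow <= r[field_index] <= xhigh]
    rejected = [r for r in rows if not (xlow <= r[field_index] <= xhigh)]
    return accepted, rejected

rows = [(1, "a", 7, 100), (2, "b", 12, 200)]
ok, bad = domain_mismatch(rows)
print(ok)   # [(1, 'a', 7, 100)]  -> propagated (value 7 lies in [5, 10])
print(bad)  # [(2, 'b', 12, 200)] -> filtered out
```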

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of a LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ⟨PK⟩(R_new, R). The formal semantics of the difference operator are

Fig. 14. Program-based template example: Difference activity.

given by the following calculus-like definition:

Δ⟨A1,…,Ak⟩(R, S) = {x ∈ R | ¬∃ y ∈ S: x[A1] = y[A1] ∧ … ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
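A minimal sketch of this semantics (Python, with illustrative record and key names) that detects the newly inserted rows of a snapshot by comparing the two recordsets only on the primary-key attributes:

```python
def difference(r_new, r_old, pk):
    """Δ_<PK>(R_new, R): keep each row of r_new for which no row of r_old
    agrees on all the attributes in pk (a "does-not-exist" check)."""
    old_keys = {tuple(row[a] for a in pk) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in pk) not in old_keys]

r_old = [{"pkey": 1, "cost": 10}, {"pkey": 2, "cost": 20}]
r_new = [{"pkey": 1, "cost": 10}, {"pkey": 3, "cost": 30}]
print(difference(r_new, r_old, pk=["pkey"]))   # [{'pkey': 3, 'cost': 30}]
```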

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already

Fig. 15. The motivating example in ARKTOS II.

defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed from the


Fig. 16. A detailed zoom-in view of the motivating example.

model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An ongoing activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in the academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools, and we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows users to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: A GUI used to interactively design and execute DTS packages.

DTS export and import wizards: Wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: A set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value, by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible that we apply this modeling for the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting, we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
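For illustration, the intuition behind such checkpoint-based resumption can be sketched as follows (Python; the function and field names are our own assumptions and are not taken from [49]), assuming the activity emits its output in key order and remembers the last key it safely flushed:

```python
def resume_load(sorted_rows, last_checkpoint_key, process):
    """Skip the prefix already loaded before the failure and continue
    processing from the first row after the last checkpointed key."""
    for row in sorted_rows:
        if last_checkpoint_key is not None and row["key"] <= last_checkpoint_key:
            continue  # this row was already safely loaded before the crash
        process(row)

rows = [{"key": k, "value": k * 10} for k in range(1, 6)]
resume_load(rows, last_checkpoint_key=3, process=lambda r: print("loading", r))
# loading {'key': 4, 'value': 40}
# loading {'key': 5, 'value': 50}
```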

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and therefore the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized views maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimbal, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products: Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

• A generic and customizable framework for the design of ETL scenarios
  • Introduction
  • Generic model of ETL activities
    • Graphical notation and motivating example
    • Preliminaries
    • Activities
    • Relationships in the architecture graph
    • Scenarios
  • Templates for ETL activities
    • General framework
    • Formal definition and usage of template activities
      • Notation
      • Instantiation
      • Taxonomy: simple and program-based templates
  • Implementation
  • Related work
    • Commercial studies and tools
    • Research efforts
    • Applications of ETL workflows in data warehouses
  • Discussion
  • Conclusions
  • Acknowledgments
  • References


automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism and the particularities of the template taxonomy.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables. We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions. We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops. Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

[ <simple constraint> ] { <loop body> }

where simple constraint has the form

<lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. The upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are + and -, and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
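To make the loop mechanism concrete, the following small Python sketch (ours, not part of the original tool) mimics how a loop of the form [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$ could be expanded into a comma-separated attribute list; the schema contents and helper names are hypothetical.

def arity_of(schema):
    # The arity of a schema is simply its number of attributes (here, a list length).
    return len(schema)

def expand_attribute_list(prefix, arity):
    # Reproduce the loop body once per iteration, replacing the marked iterator $i$.
    body = prefix + "_$i$"
    return ", ".join(body.replace("$i$", str(i)) for i in range(1, arity + 1))

a_out = ["PKEY", "DATE", "AMOUNT"]  # hypothetical output schema with arity 3
print(expand_attribute_list("A_OUT", arity_of(a_out)))
# prints: A_OUT_1, A_OUT_2, A_OUT_3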

Keywords. Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogenous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata, a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.

Fig. 11. Keywords for templates.

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN
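As a rough illustration of this renaming, the Python sketch below (our own simplification, not the actual tool code) rewrites the standard keywords into activity-specific names; the activity name dm1 and the sample rule are hypothetical.

def rename_keywords(text, unique_name):
    # Rename the standard keywords so that predicate and attribute names
    # become unique in the scope of the produced LDL program.
    replacements = {
        "a_out": unique_name + "_out",
        "a_in": unique_name + "_in",
        "A_OUT": unique_name.upper() + "_OUT",
        "A_IN": unique_name.upper() + "_IN",
    }
    for keyword, value in replacements.items():
        text = text.replace(keyword, value)
    return text

print(rename_keywords("a_out(A_OUT_1) <- a_in1(A_IN1_1), A_OUT_1=A_IN1_1.", "dm1"))
# prints: dm1_out(DM1_OUT_1) <- dm1_in1(DM1_IN1_1), DM1_OUT_1=DM1_IN1_1.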

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, …, name_theme_k

we need to give the following expression

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$) <-
    a_in1([i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$),
    expr([i<arityOf(@PARAM)]@PARAM[$i$], [i=arityOf(@PARAM)]@PARAM[$i$]),
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$.

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
    [i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
    [i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
    [i<arityOf(@PARAM)]@PARAM[$i$], [i=arityOf(@PARAM)]@PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
    [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then the template definition is as follows

a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING.
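The textual effect of such macros can be pictured as a plain search-and-replace pass over the template; the Python fragment below is only an illustration under that assumption, with the macro bodies copied from the definitions above and all function names being our own.

MACROS = {
    "INPUT_SCHEMA": "[i<arityOf(a_in1)]A_IN1_$i$, [i=arityOf(a_in1)]A_IN1_$i$",
    "OUTPUT_SCHEMA": "[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$",
    "PARAM_SCHEMA": "[i<arityOf(@PARAM)]@PARAM[$i$], [i=arityOf(@PARAM)]@PARAM[$i$]",
    "DEFAULT_MAPPING": "[i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$, "
                       "[i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$",
}

def expand_macros(text, macros=MACROS):
    # First step of instantiation: replace each macro name by its body.
    for name, body in macros.items():
        text = text.replace(name, body)
    return text

template = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING."
print(expand_macros(template))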

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.
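As a rough executable reading of steps (2)-(4), the Python sketch below evaluates arityOf() calls, produces the loops (for the restricted loop form used in the selection template, with lower bound 1 and step 1) and then substitutes the remaining parameter variables. The helper names, the regular expression and the sample values are our own assumptions, not part of the tool.

import re

def instantiate_fragment(text, arities, parameters):
    # Step 2: evaluate arityOf() calls that appear in loop boundaries.
    text = re.sub(r"arityOf\((\w+)\)", lambda m: str(arities[m.group(1)]), text)
    # Step 3: loop production for the pattern [i<n]BODY, [i=n]BODY,
    # replacing the marked iterator $i$ by its current value.
    def expand(match):
        bound, body = int(match.group(1)), match.group(2)
        return ", ".join(body.replace("$i$", str(i)) for i in range(1, bound + 1))
    text = re.sub(r"\[i<(\d+)\]([^,\[]*), \[i=\1\]\2", expand, text)
    # Step 4: replace the remaining parameter variables (marked with @).
    for name, value in parameters.items():
        text = text.replace("@" + name, value)
    return text

fragment = "a_out([i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$), @FIELD >= @Xlow"
print(instantiate_fragment(fragment, {"a_out": 3}, {"FIELD": "A_IN1_2", "Xlow": "5"}))
# prints: a_out(A_OUT_1, A_OUT_2, A_OUT_3), A_IN1_2 >= 5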

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

Fig. 12. Instantiation procedure.

The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and the @PARAM[] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before @PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
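Read at the record level, the instantiated activity of this example behaves like the following Python sketch: each input row is copied to the output (the default mapping) and extended with the computed attribute. The row contents, field names and the exact argument order of subtract are assumptions made for illustration only.

def function_application(rows, func, in_fields, out_field):
    # Propagate every input row and add one attribute computed by the function.
    for row in rows:
        out = dict(row)  # default input-to-output mapping
        out[out_field] = func(*(row[f] for f in in_fields))
        yield out

def subtract(cost, price):
    # Assumed semantics of the subtract function used in the example.
    return price - cost

rows = [{"PKEY": 1, "COST": 12.0, "PRICE": 20.0}]  # hypothetical input records
for r in function_application(rows, subtract, ["COST", "PRICE"], "PROFIT"):
    print(r)  # {'PKEY': 1, 'COST': 12.0, 'PRICE': 20.0, 'PROFIT': 8.0}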

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: @FIELD, for the field that will be checked against the expression, and @Xlow and @Xhigh, for the lower and upper limit of acceptable values for attribute @FIELD. The expression of the template activity is a simple expression guaranteeing that @FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter @FIELD implements a dynamic internal mapping in the template, whereas the @Xlow, @Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
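In row-level terms, the check performed by the instantiated activity can be pictured with the Python sketch below; the attribute name, the range limits and the inclusive bounds are assumptions based on the example, not the tool's code.

def domain_mismatch(rows, field, x_low, x_high):
    # Rows whose value for the checked attribute lies in [x_low, x_high] pass the
    # check and are propagated to the output; the rest go to the rejection schema.
    return [row for row in rows if x_low <= row[field] <= x_high]

rows = [{"DM1_IN_3": 7}, {"DM1_IN_3": 12}]  # hypothetical input rows
print(domain_mismatch(rows, "DM1_IN_3", 5, 10))  # only the first row is kept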

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R).

Fig. 14. Program-based template example: Difference activity.

The formal semantics of the difference operator are given by the following calculus-like definition: Δ<A1,…,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ … ∧ x[Ak] = y[Ak]}.
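A direct operational reading of this definition is sketched below in Python: the newly inserted rows are those of the new snapshot whose key values do not appear in the old one. The relation contents and attribute names are hypothetical.

def difference(r_new, r_old, key):
    # Keep the rows of r_new for which no row of r_old agrees on all key attributes.
    old_keys = {tuple(row[a] for a in key) for row in r_old}
    return [row for row in r_new if tuple(row[a] for a in key) not in old_keys]

r_new = [{"PKEY": 1, "NAME": "a"}, {"PKEY": 2, "NAME": "b"}]  # current snapshot
r_old = [{"PKEY": 1, "NAME": "a"}]                            # previous snapshot
print(difference(r_new, r_old, ["PKEY"]))  # the row with PKEY 2 is newly inserted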

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed from the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's


design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit to aim at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. The Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

- DTS designer: a GUI used to interactively design and execute DTS packages.
- DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.
- DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, and the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

- Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
- Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.
- Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).
- Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling, and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involve a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics and, in particular, on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM IBM Data warehouse manager available at http

www-3ibmcomsoftwaredatadb2datawarehouse

[2] Informatica Power Center available at httpwww

informaticacomproductsdata+integrationpowercenter

defaulthtm

[3] Microsoft Data transformation services available at

httpwwwmicrosoftcom

[4] Oracle Oracle warehouse builder product page available at

httpotnoraclecomproductswarehousecontenthtml

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525524

[5] WMP van der Aalst AHM ter Hofstede B Kiepus-

zewski AP Barros Workflow Patterns BETA Working

Paper Series WP 47 Eindhoven University of Technology

Eindhoven 2000 available at the Workflow Patterns

web site at tmit httpwwwtmtuenlresearchpatterns

documentationhtm

[6] P Vassiliadis A Simitsis S Skiadopoulos Modeling ETL

activities as graphs in Proceedings of the Fourth

International Workshop on Design and Management of

Data Warehouses (DMDW) pp 52ndash61 Toronto Canada

2002

[7] P Vassiliadis A Simitsis P Georgantas M Terrovitis A

framework for the design of ETL scenarios in Proceed-

ings of the 15th Conference on Advanced Information

Systems Engineering (CAiSE lsquo03) pp 520ndash535 Klagen-

furtVelden Austria 16ndash20 June 2003

[8] R Kimbal L Reeves M Ross W Thornthwaite The

Data Warehouse Lifecycle Toolkit Expert Methods for

Designing Developing and Deploying Data Warehouses

Wiley New York 1998

[9] Workflow Management Coalition Interface 1 Process

Definition Interchange Process Model Document no

WfMC TC-1016-P 1998 available at httpwww

wfmcorg

[10] S Naqvi S Tsur A Logical Language for Data and

Knowledge Bases Computer Science Press Rockville

MD 1989

[11] C Zaniolo LDL++ Tutorial UCLA httppikecs

uclaeduldl December 1998

[12] D Dori Conceptual modeling and system architecting

Commun ACM 46 (10) (2003) 62ndash65

[13] P Vassiliadis A Simitsis P Georgantas M Terrovitis

S Skiadopoulos A generic and customizable frame-

work for the design of ETL scenarios (long version)

Technical Report TR-2004-1 Knowledge and Data-

base Systems Laboratory National Technical University

of Athens available at httpwwwdbnetecentuagr

pubs

[14] Giga Information Group Market Overview Update

ETL Technical Report RPA-032002-00021 March

2002

[15] Ascential Software Inc available at httpwwwascen-

tialsoftwarecom

[16] Ascential Software ProductsmdashData Warehousing Tech-

nology available at httpwwwascentialsoftwarecom

productsdatastagehtml

[17] Gartner Inc ETL magic quadrant update market

pressure increases Gartnerrsquos Strategic Data Management

Research Note M-19-1108 January 2003

[18] PA Bernstein T Bergstraesser Meta-data support for

data transformations using Microsoft repository Special

issue on data transformations Bull Tech Committee

Data Eng 22 (1) (1999) 9ndash14

[19] Microsoft Corp OLEDB specification available at http

wwwmicrosoftcomdataoledb

[20] C Graves M Scott M Benkovich P Turley R

Skoglund R Dewson S Youness D Lee S Ferguson

T Bain T Joubert Professional SQL Server 2000 data

warehousing with analysis services 1st ed Wrox Press

Ltd 2001

[21] Oracle Oracle 9i Warehouse Builder Architectural White

paper April 2002

[22] H Galhardas D Florescu D Shasha E Simon Ajax An

extensible data cleaning tool in Proceedings of the ACM

SIGMOD International Conference on the Management

of Data pp 590 Dallas TX 2000

[23] W Cohen Some practical observations on integration of

Web information in WebDBrsquo99 Workshop in conj with

ACM SIGMOD 1999

[24] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning Technical Report

INRIA 1999 (RR-3742)

[25] V Raman J Hellerstein Potters Wheel an interactive

framework for data cleaning and transformation Techni-

cal Report University of California at Berkeley Computer

Science Division 2000 available at httpwwwcs

berkeleyedurshankarpaperspwheelpdf

[26] V Raman J Hellerstein Potterrsquos Wheel an interactive

data cleaning system in Proceedings of 27th Inter-

national Conference on Very Large Data Bases (VLDB)

pp 381ndash390 Roma Italy 2001

[27] M Jarke M Lenzerini Y Vassiliou P Vassiliadis

Springer New York 2000

[28] E Rundensteiner Special issue on data transformations

Bull Tech Committee Data Eng 22 (1) (1999)

[29] S Sarawagi Special issue on data cleaning Bull Tech

Committee Data Eng 23 (4) (2000)

[30] E Rahm H Hai Do Data cleaning problems and current

approaches Bull Tech Committee Data Eng 23 (4)

(2000)

[31] V Borkar K Deshmuk S Sarawagi Automatically

extracting structure form free text Addresses Bull Tech

Committee Data Eng 23 (4) (2000)

[32] A Monge Matching algorithms within a duplicate

detection system Bull Tech Committee Data Eng 23

(4) (2000)

[33] A Calı D Calvanese G De Giacomo M Lenzerini P

Naggar F Vernacotola IBIS Semantic data integration

at work in Proceedings of the 15th International

Conference on Advanced Information Systems Engineer-

ing (CAiSE 2003) vol 2681 of Lecture Notes in Computer

Science pp 79ndash94 Springer 2003

[34] A Calı D Calvanese G De Giacomo M Lenzerini

Data integration under integrity constraints in Proceed-

ings of the 14th International Conference on Advanced

Information Systems Engineering (CAiSE 2002) vol 2348

of Lecture Notes in Computer Science pp 262ndash279

Springer 2002

[35] J Eder W Gruber A meta model for structured work-

flows supporting workflow transformations in Proceed-

ings of the Sixth East European Conference on Advances

in Databases and Information Systems (ADBIS 2002)

pp 326ndash339 Bratislava Slovakia September 8ndash11

2002

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 525

[36] W Sadiq ME Orlowska On business process model

transformations 19th International Conference on Con-

ceptual Modeling (ER 2000) Salt Lake City UT USA

October 9ndash12 2000 pp 267ndash280

[37] B Kiepuszewski AHM ter Hofstede C Bussler On

structured workflow modeling in Proceedings of the 12th

International Conference on Advanced Information Sys-

tems Engineering (CAiSE 2000) pp 431ndash445 Stockholm

Sweden June 5ndash9 2000

[38] P Dadam M Reichert (eds) Enterprise-wide and cross-

enterprise workflow management concepts systems

applications GI Workshop Informatikrsquo99 1999 available

at httpwwwinformatikuni-ulmdedbisveranstaltungen

Workshop-Informatik99-Proceedingspdf

[39] M Jarke C Quix G Blees D Lehmann G Michalk S

Stierl Improving OLTP Data Quality Using Data Ware-

house Mechanisms Proceedings of 1999 ACM SIGMOD

International Conference on Management of Data Phila-

delphia USA June 1999 pp 537ndash538

[40] E Schafer J-D Becker M Jarke DB-Prism Integrated

data warehouses and knowledge networks for bank

controlling Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[41] M Jarke T List J Koller The challenge of process

warehousing Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[42] P Vassiliadis A Simitsis S Skiadopoulos Conceptual

modeling for ETL processes in Proceedings of the Fifth

ACM International Workshop on Data Warehousing and

OLAP (DOLAP) pp 14ndash21 McLean VA USA 2002

[43] A Simitsis P Vassiliadis A methodology for the

conceptual modeling of ETL processes in Proceedings

of the Decision Systems Engineering (DSE lsquo03) Velden

Austria June 17 2003

[44] A Simitsis Modeling and managing ETL processes in

Proceedings of the VLDB 2003 PhD Workshop Berlin

Germany September 12ndash13 2003

[45] F Casati S Ceri B Pernici G Pozzi Conceptual

Modeling of Workflows in Proceedings of the OO-ER

Conference Australia 1995

[46] AJ Albrecht Measuring Application Development Pro-

ductivity in IBM Applications Development Symposium

Monterey CA 1979 pp 83ndash92

[47] RS Pressman Software Engineering A Practitionerrsquos

Approach 5th ed McGraw-Hill New York 2000

[48] J Adzic V Fiore Data Warehouse Population Platform

in Proceedings of the Fifth International Workshop on the

Design and Management of Data Warehouses

(DMDWrsquo03) Berlin Germany September 2003

[49] W Labio JL Wiener H Garcia-Molina V Gorelik

Efficient resumption of interrupted warehouse loads in

Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data (SIGMOD 2000)

pp 46ndash57 Dallas TX USA 2000

[50] J Chen S Chen EA Rundensteiner A Transactional

Model for Data Warehouse Maintenance in Proceedings

of the of ER 2002 LNCS 2503 pp 247ndash262 2002

[51] B Liu S Chen EA Rundensteiner A transactional

approach to parallel data warehouse maintenance in

Proceedings of DaWaK 2002 LNCS 2454 2002 pp 307ndash316



Fig. 11. Keywords for templates.

Keyword: a_out / a_in
Usage: a unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form <unique_pred_name>_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names produced have the form <predicate unique name in upper case>_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN


dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11 we depict the way the renaming is performed at instantiation time.
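To make the renaming mechanics concrete, the following small Python sketch is our own illustration (it is not part of ARKTOS II; rename_keywords and the substitution table are names we introduce here). It applies the keyword substitutions of Fig. 11 to a template body once the unique name of the activity, e.g. difference3, is known.

    # Illustrative only: substitute the reserved template keywords of Fig. 11
    # with activity-specific names at instantiation time.
    def rename_keywords(template_text, activity_name):
        substitutions = [
            ("A_OUT", activity_name.upper() + "_OUT"),  # attribute-name prefix, output side
            ("A_IN", activity_name.upper() + "_IN"),    # attribute-name prefix, input side
            ("a_out", activity_name + "_out"),          # output schema predicate
            ("a_in", activity_name + "_in"),            # input schema predicate
        ]
        for keyword, concrete in substitutions:
            template_text = template_text.replace(keyword, concrete)
        return template_text

    print(rename_keywords("a_out(A_OUT_1) <- a_in1(A_IN1_1), A_OUT_1=A_IN1_1", "difference3"))
    # prints: difference3_out(DIFFERENCE3_OUT_1) <- difference3_in1(DIFFERENCE3_IN1_1),
    #         DIFFERENCE3_OUT_1=DIFFERENCE3_IN1_1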

Macros. To make the definition of templates easier and to improve their readability, we introduce a macro to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following:

name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

[iterator<maxLimit]name_theme_$iterator$,
[iterator=maxLimit]name_theme_$iterator$

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of employed expressions. For example, observe the definition of a template for a simple relational selection:

a_out([i<arityOf(a_out)]A_OUT_$i$,
      [i=arityOf(a_out)]A_OUT_$i$) <-
   a_in1([i<arityOf(a_in1)]A_IN1_$i$,
         [i=arityOf(a_in1)]A_IN1_$i$),
   expr([i<arityOf(PARAM)]PARAM[$i$],
        [i=arityOf(PARAM)]PARAM[$i$]),
   [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
   [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

As already mentioned in the syntax for loops, the expression

[i<arityOf(a_out)]A_OUT_$i$, [i=arityOf(a_out)]A_OUT_$i$

defining the attributes of the output schema a_out simply wants to list a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer


to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)]A_IN1_$i$,
  [i=arityOf(a_in1)]A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)]A_OUT_$i$,
  [i=arityOf(a_out)]A_OUT_$i$

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)]PARAM[$i$],
  [i=arityOf(PARAM)]PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)]A_OUT_$i$=A_IN1_$i$,
  [i=arityOf(a_out)]A_OUT_$i$=A_IN1_$i$

Then the template definition is as follows:

a_out(OUTPUT_SCHEMA) <-
  a_in1(INPUT_SCHEMA),
  expr(PARAM_SCHEMA),
  DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe the outcome of it, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of the function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and the PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
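Since the order of these steps matters, a compact, runnable Python sketch may help fix the idea. It is our own illustration, not the ARKTOS II implementation, and it deliberately simplifies the notation: loops are written [i<=N]BODY$i$ instead of the two-part form used above, parameters carry a leading '@', and the activity name sel1 is hypothetical.

    import re

    # Toy illustration (ours) of the five-step instantiation order described above,
    # applied to a simplified selection-like template.
    def instantiate(template, macros, arities, params, keywords):
        # Step 1: replace macro definitions with their expansions.
        for name, body in macros.items():
            template = template.replace(name, body)
        # Step 2: evaluate arityOf() calls appearing in loop boundaries.
        template = re.sub(r"arityOf\((\w+)\)",
                          lambda m: str(arities[m.group(1)]), template)
        # Step 3: perform the loop productions, yielding explicit attribute lists.
        def expand(m):
            n, body = int(m.group(1)), m.group(2)
            return ", ".join(body.replace("$i$", str(i)) for i in range(1, n + 1))
        template = re.sub(r"\[i<=(\d+)\](\S+\$i\$)", expand, template)
        # Step 4: instantiate the remaining parameter variables.
        for name, value in params.items():
            template = template.replace(name, value)
        # Step 5: recognize and rename the reserved keywords.
        for keyword, concrete in keywords.items():
            template = template.replace(keyword, concrete)
        return template

    macros = {"OUTPUT_SCHEMA": "[i<=arityOf(a_out)]A_OUT_$i$",
              "INPUT_SCHEMA": "[i<=arityOf(a_in1)]A_IN1_$i$"}
    template = "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(@FIELD)"
    print(instantiate(template, macros,
                      arities={"a_out": 2, "a_in1": 2},
                      params={"@FIELD": "A_IN1_1"},
                      keywords={"a_out": "sel1_out", "a_in1": "sel1_in1",
                                "A_OUT": "SEL1_OUT", "A_IN1": "SEL1_IN1"}))
    # prints: sel1_out(SEL1_OUT_1, SEL1_OUT_2) <- sel1_in1(SEL1_IN1_1, SEL1_IN1_2), expr(SEL1_IN1_1)

Reversing steps 4 and 5, for instance, would rename A_IN1 before the parameter @FIELD is substituted, leaving the stale name A_IN1_1 in the rule; this illustrates why variables must be instantiated before keywords are renamed.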

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
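As a minimal illustration of the check that the instantiated DM1 activity performs, consider the following Python sketch; it is ours (not code produced by ARKTOS II) and it assumes an inclusive range, four-attribute rows and the third attribute as the checked field, as in the example above.

    # Rows whose third attribute falls within [Xlow, Xhigh] = [5, 10] are propagated
    # to the output; the remaining rows go to the rejection schema.
    rows = [
        (10, "a", 7, "x"),   # within range: propagated
        (11, "b", 12, "y"),  # out of range: rejected
        (12, "c", 5, "z"),   # boundary value: propagated (inclusive range assumed)
    ]
    X_LOW, X_HIGH = 5, 10
    FIELD = 2  # zero-based index of the checked (third) attribute

    output = [row for row in rows if X_LOW <= row[FIELD] <= X_HIGH]
    rejected = [row for row in rows if not (X_LOW <= row[FIELD] <= X_HIGH)]
    print(output)    # [(10, 'a', 7, 'x'), (12, 'c', 5, 'z')]
    print(rejected)  # [(11, 'b', 12, 'y')]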

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, where each rule negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R).

Fig. 14. Program-based template example: Difference activity.

The formal semantics of the difference operator are given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = {x ∈ R | ¬∃y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate, so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
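The definition above amounts to an anti-join of the two snapshots on the PK attributes. A brief Python illustration (ours, with hypothetical data) of detecting the newly inserted rows follows.

    # Keep the rows of the new snapshot that have no match in the old snapshot
    # on the key attributes; these are the newly inserted rows.
    def difference(r_new, r_old, key):
        old_keys = {key(y) for y in r_old}
        return [x for x in r_new if key(x) not in old_keys]

    r_old = [(10, "Smith", 100.0), (20, "Jones", 200.0)]
    r_new = [(10, "Smith", 100.0), (30, "Brown", 300.0)]
    # PK is the first attribute; the row (30, 'Brown', 300.0) is newly inserted
    print(difference(r_new, r_old, key=lambda row: row[0]))  # [(30, 'Brown', 300.0)]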

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting its values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related research efforts in academia, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. We stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. The Warehouse Manager is used to


schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of the DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.
DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.
DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies


[14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
Matching transformations find pairs of records that most probably refer to the same object; these pairs are called matching pairs and each such pair is assigned a similarity value (a generic illustration follows this list).
Clustering transformations group together matching pairs with a high similarity value, by applying given grouping criteria (e.g., by transitive closure).
Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
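For illustration only, the following Python fragment is a generic sketch of what a matching step computes (record pairs above a similarity threshold); it is not AJAX's actual operator set or API, and the similarity measure used here is an arbitrary choice.

    from difflib import SequenceMatcher

    # Generic illustration of a matching transformation: pair records that most
    # probably refer to the same object, together with a similarity value.
    def matching(records_a, records_b, threshold=0.7):
        pairs = []
        for a in records_a:
            for b in records_b:
                similarity = SequenceMatcher(None, a, b).ratio()
                if similarity >= threshold:
                    pairs.append((a, b, round(similarity, 2)))
        return pairs

    print(matching(["John Smith", "Mary Jones"], ["Jon Smith", "M. Jones"]))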

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns: the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way: they gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface, and the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35,36,37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the field of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of


workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s,


the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a 4 h loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymousreviewers of this paper for valuable commentsthat improved the overall quality of the paper

to define certain macros that simplify the management of temporary length attribute lists. We employ the following macros:

DEFINE INPUT_SCHEMA AS
  [i<arityOf(a_in1)] A_IN1_$i$,
  [i=arityOf(a_in1)] A_IN1_$i$

DEFINE OUTPUT_SCHEMA AS
  [i<arityOf(a_out)] A_OUT_$i$,
  [i=arityOf(a_out)] A_OUT_$i$

DEFINE PARAM_SCHEMA AS
  [i<arityOf(PARAM)] PARAM[$i$],
  [i=arityOf(PARAM)] PARAM[$i$]

DEFINE DEFAULT_MAPPING AS
  [i<arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$,
  [i=arityOf(a_out)] A_OUT_$i$ = A_IN1_$i$

Then, the template definition is as follows:

a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), expr(PARAM_SCHEMA), DEFAULT_MAPPING.
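To make the loop notation concrete, the following Python sketch unrolls the macros above into the comma-separated attribute lists they stand for. It is an illustration of the mechanism only, not part of ARKTOS II; the arity of three for the input/output schemata and of two for the expression parameters is assumed for the example.

# Unroll "[i<arityOf(s)] body, [i=arityOf(s)] body" into a comma-separated list;
# the last element simply carries no trailing comma.
def unroll(body, arity):
    # body is a pattern over the iterator $i$, e.g. "A_IN1_{i}"
    return ", ".join(body.format(i=i) for i in range(1, arity + 1))

input_schema  = unroll("A_IN1_{i}", 3)              # A_IN1_1, A_IN1_2, A_IN1_3
output_schema = unroll("A_OUT_{i}", 3)              # A_OUT_1, A_OUT_2, A_OUT_3
param_schema  = unroll("PARAM[{i}]", 2)             # PARAM[1], PARAM[2]
mapping       = unroll("A_OUT_{i} = A_IN1_{i}", 3)  # default input-to-output mapping

# Assemble the generic, LDL-like rule of the template definition above.
rule = (f"a_out({output_schema}) <- a_in1({input_schema}), "
        f"expr({param_schema}), {mapping}.")
print(rule)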

3.2.2. Instantiation

Template instantiation is the process where the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism, since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Replacement of macro definitions with their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the rest of the parameter variables are instantiated.
5. Keywords are recognized and renamed.

We will try to explain briefly the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception to this is the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of the function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.

The first row, template, shows the initial template as it has been registered by the designer. FUNCTION holds the name of the function to be used, subtract in our case, and PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1] A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
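As an indication of what the final step amounts to, the following Python sketch performs a keyword renaming of the kind described above over the generic rule produced at the variable instantiation step. The activity names fa12_out and myFunc_in, the function subtract and the attributes COST_IN, PRICE_IN and PROFIT follow the example of Fig. 12; the third input attribute PKEY and the exact renaming map are assumptions made for the illustration.

import re

# Generic rule after variable instantiation: loop productions and parameter
# values are in place, but schema and attribute names are still generic keywords.
generic_rule = ("a_out(A_OUT_1, A_OUT_2, A_OUT_3, OUTFIELD) <- "
                "a_in1(A_IN1_1, A_IN1_2, A_IN1_3), "
                "subtract(A_IN1_2, A_IN1_3, OUTFIELD), "
                "A_OUT_1 = A_IN1_1, A_OUT_2 = A_IN1_2, A_OUT_3 = A_IN1_3.")

# Renaming map from generic keywords to the schemata chosen by the user.
renaming = {
    "a_out": "fa12_out", "a_in1": "myFunc_in",
    "A_OUT_1": "PKEY_OUT", "A_OUT_2": "COST_OUT", "A_OUT_3": "PRICE_OUT",
    "A_IN1_1": "PKEY_IN", "A_IN1_2": "COST_IN", "A_IN1_3": "PRICE_IN",
    "OUTFIELD": "PROFIT",
}

# Try longer keywords first, so that a name is never matched as a prefix of a longer one.
pattern = re.compile("|".join(sorted(map(re.escape, renaming), key=len, reverse=True)))
concrete_rule = pattern.sub(lambda m: renaming[m.group(0)], generic_rule)
print(concrete_rule)
# fa12_out(PKEY_OUT, COST_OUT, PRICE_OUT, PROFIT) <- myFunc_in(PKEY_IN, COST_IN, PRICE_IN),
#   subtract(COST_IN, PRICE_IN, PROFIT), PKEY_OUT = PKEY_IN, COST_OUT = COST_IN, PRICE_OUT = PRICE_IN.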

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT( ) <- INPUT( ), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
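The semantics of the instantiated activity are simple: a row is propagated only if the checked attribute lies within [Xlow, Xhigh]. The following Python sketch reproduces this behavior for the DM1 example; the sample rows and the tuple layout (four positional attributes, the third of which is checked against the range 5-10) are assumptions made for the illustration.

def domain_mismatch(rows, field_index, x_low, x_high):
    """Propagate only the rows whose checked attribute falls within [x_low, x_high]."""
    return [row for row in rows if x_low <= row[field_index] <= x_high]

dm1_in = [
    (1, "a", 7, "x"),   # DM1_IN_3 = 7  -> within [5, 10], propagated
    (2, "b", 12, "y"),  # DM1_IN_3 = 12 -> outside the range, filtered out
]
dm1_out = domain_mismatch(dm1_in, field_index=2, x_low=5, x_high=10)
print(dm1_out)          # [(1, 'a', 7, 'x')]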

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data that satisfy it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each negating just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates, in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ_<PK>(R_new, R).

Fig. 14. Program-based template example: Difference activity.

The formal semantics of the difference operator are given by the following calculus-like definition:

Δ_<A1,...,Ak>(R, S) = {x ∈ R | ¬∃ y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so that we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.
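In plain set terms, the operator keeps the rows of the new snapshot whose key values do not appear in the old one. The following Python sketch mirrors this definition; the sample snapshots and the choice of the first attribute as the primary key are assumptions made for the illustration.

def difference(r_new, r_old, key):
    """Delta_<key>(r_new, r_old): rows of r_new with no r_old row agreeing on the key attributes."""
    old_keys = {tuple(row[i] for i in key) for row in r_old}
    return [row for row in r_new if tuple(row[i] for i in key) not in old_keys]

r_old = [(10, "Athens", 5), (20, "Ioannina", 7)]
r_new = [(10, "Athens", 6), (30, "Patras", 2)]    # key 10 updated, key 30 newly inserted
print(difference(r_new, r_old, key=[0]))           # [(30, 'Patras', 2)] -- the newly inserted row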

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" into the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed by the model.

Fig. 15. The motivating example in ARKTOS II.

Fig. 16. A detailed zoom-in view of the motivating example.

As far as the provider and instance-of relationships are concerned, they are calculated automatically, and their display can be turned on or off from the application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible, and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II: there are templates for practically every aspect of the model, i.e., data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.
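As noted above, connectivity to the data stores is achieved through ODBC and their schemata are reverse engineered automatically. Purely as an illustration of the kind of ODBC catalog call this relies on, the following Python sketch uses the pyodbc module to list the columns of a source table; the DSN, the credentials and the table name are hypothetical, and the tool itself is implemented in Visual Basic rather than Python.

import pyodbc

# Connect to a source data store through ODBC; the DSN and credentials are hypothetical.
conn = pyodbc.connect("DSN=SourceDW;UID=etl;PWD=secret")
cursor = conn.cursor()

# ODBC exposes catalog metadata, so a table's schema can be reverse engineered
# without vendor-specific SQL: one result row per column, with its type and size.
for col in cursor.columns(table="LINEITEM"):
    print(col.column_name, col.type_name, col.column_size)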

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object called a package, which stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.

DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures that data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
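As a rough illustration of the matching and clustering steps, the following Python sketch pairs up similar records and then groups the pairs by transitive closure. It is a toy sketch, not AJAX's declarative language; the similarity function, the threshold and the sample records are assumptions made for the illustration.

from difflib import SequenceMatcher

def similarity(a, b):
    """A simple string similarity in [0, 1]; AJAX lets the user plug in such functions."""
    return SequenceMatcher(None, a, b).ratio()

def matching(records, threshold=0.8):
    """Matching step: pairs of records that probably refer to the same object."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            s = similarity(records[i], records[j])
            if s >= threshold:
                pairs.append((i, j, s))
    return pairs

def clustering(pairs, n):
    """Clustering step: group matching pairs by transitive closure (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for k in range(n):
        clusters.setdefault(find(k), []).append(k)
    return list(clusters.values())

records = ["ETL Design Doc", "ETL design doc.", "Data warehouse loader"]
pairs = matching(records)
print(pairs)                            # records 0 and 1 form a matching pair
print(clustering(pairs, len(records)))  # [[0, 1], [2]]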

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35-37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35-37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35-37], where the authors are concerned with correctness issues in the evolution of a workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the field of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35-37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43], a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end; on the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing/loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities. Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
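As a rough sketch of the idea (not the algorithm of [49] itself; the record layout, the ordered key and the notion of a checkpointed key are assumptions made for the illustration), resumption over an ordered output can skip everything up to the last checkpoint and reprocess only the remaining suffix:

def resume_load(ordered_rows, last_checkpoint_key, load):
    """Re-run an interrupted, order-preserving load: rows up to the last
    checkpointed key are already in the warehouse and are skipped."""
    for key, payload in ordered_rows:
        if last_checkpoint_key is not None and key <= last_checkpoint_key:
            continue                       # already loaded before the failure
        load(key, payload)                 # reprocess only the remaining suffix

loaded = []
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
resume_load(rows, last_checkpoint_key=2, load=lambda k, p: loaded.append((k, p)))
print(loaded)   # [(3, 'c'), (4, 'd')] -- recovery restarts right after the checkpoint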

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52-61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June 2003.
[8] R. Kimbal, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381-390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[33] A Calı D Calvanese G De Giacomo M Lenzerini P

Naggar F Vernacotola IBIS Semantic data integration

at work in Proceedings of the 15th International

Conference on Advanced Information Systems Engineer-

ing (CAiSE 2003) vol 2681 of Lecture Notes in Computer

Science pp 79ndash94 Springer 2003

[34] A Calı D Calvanese G De Giacomo M Lenzerini

Data integration under integrity constraints in Proceed-

ings of the 14th International Conference on Advanced

Information Systems Engineering (CAiSE 2002) vol 2348

of Lecture Notes in Computer Science pp 262ndash279

Springer 2002

[35] J Eder W Gruber A meta model for structured work-

flows supporting workflow transformations in Proceed-

ings of the Sixth East European Conference on Advances

in Databases and Information Systems (ADBIS 2002)

pp 326ndash339 Bratislava Slovakia September 8ndash11

2002

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 525

[36] W Sadiq ME Orlowska On business process model

transformations 19th International Conference on Con-

ceptual Modeling (ER 2000) Salt Lake City UT USA

October 9ndash12 2000 pp 267ndash280

[37] B Kiepuszewski AHM ter Hofstede C Bussler On

structured workflow modeling in Proceedings of the 12th

International Conference on Advanced Information Sys-

tems Engineering (CAiSE 2000) pp 431ndash445 Stockholm

Sweden June 5ndash9 2000

[38] P Dadam M Reichert (eds) Enterprise-wide and cross-

enterprise workflow management concepts systems

applications GI Workshop Informatikrsquo99 1999 available

at httpwwwinformatikuni-ulmdedbisveranstaltungen

Workshop-Informatik99-Proceedingspdf

[39] M Jarke C Quix G Blees D Lehmann G Michalk S

Stierl Improving OLTP Data Quality Using Data Ware-

house Mechanisms Proceedings of 1999 ACM SIGMOD

International Conference on Management of Data Phila-

delphia USA June 1999 pp 537ndash538

[40] E Schafer J-D Becker M Jarke DB-Prism Integrated

data warehouses and knowledge networks for bank

controlling Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[41] M Jarke T List J Koller The challenge of process

warehousing Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[42] P Vassiliadis A Simitsis S Skiadopoulos Conceptual

modeling for ETL processes in Proceedings of the Fifth

ACM International Workshop on Data Warehousing and

OLAP (DOLAP) pp 14ndash21 McLean VA USA 2002

[43] A Simitsis P Vassiliadis A methodology for the

conceptual modeling of ETL processes in Proceedings

of the Decision Systems Engineering (DSE lsquo03) Velden

Austria June 17 2003

[44] A Simitsis Modeling and managing ETL processes in

Proceedings of the VLDB 2003 PhD Workshop Berlin

Germany September 12ndash13 2003

[45] F Casati S Ceri B Pernici G Pozzi Conceptual

Modeling of Workflows in Proceedings of the OO-ER

Conference Australia 1995

[46] AJ Albrecht Measuring Application Development Pro-

ductivity in IBM Applications Development Symposium

Monterey CA 1979 pp 83ndash92

[47] RS Pressman Software Engineering A Practitionerrsquos

Approach 5th ed McGraw-Hill New York 2000

[48] J Adzic V Fiore Data Warehouse Population Platform

in Proceedings of the Fifth International Workshop on the

Design and Management of Data Warehouses

(DMDWrsquo03) Berlin Germany September 2003

[49] W Labio JL Wiener H Garcia-Molina V Gorelik

Efficient resumption of interrupted warehouse loads in

Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data (SIGMOD 2000)

pp 46ndash57 Dallas TX USA 2000

[50] J Chen S Chen EA Rundensteiner A Transactional

Model for Data Warehouse Maintenance in Proceedings

of the of ER 2002 LNCS 2503 pp 247ndash262 2002

[51] B Liu S Chen EA Rundensteiner A transactional

approach to parallel data warehouse maintenance in

Proceedings of DaWaK 2002 LNCS 2454 2002 pp 307ndash316

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 20: Etl design document

ARTICLE IN PRESS

Fig. 12. Instantiation procedure.


(DEFAULT_MAPPING). The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]A_OUT_$i$, OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before PARAM[] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation is done on the basis of the schemata and the respective attributes of the activity that the user chooses.
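To make the staged nature of the procedure more concrete, the following small Python sketch mimics the last three steps (loop production, variable instantiation and keyword renaming) over a toy template string. It is only an illustration under our own assumptions: the template text, the $i$, arityOf() and PARAM[] markers, and the helper names are ours and do not reproduce the actual ARKTOS II implementation.

    import re

    def instantiate(template, arity, params, keyword_map):
        # Toy re-creation of the last steps of template instantiation.
        # 1. Loop production: expand "[i<N]BODY_$i$" into BODY_1, ..., BODY_{N-1};
        #    this happens before any PARAM[] variable is replaced by its value.
        text = template.replace('arityOf(a_in)+1', str(arity + 1))
        def expand(match):
            bound, body = int(match.group(1)), match.group(2)
            return ', '.join(body.replace('$i$', str(i)) for i in range(1, bound))
        text = re.sub(r'\[i<(\d+)\](\S*\$i\$)', expand, text)
        # 2. Variable instantiation: replace the parameter variables with the
        #    values chosen by the designer.
        for name, value in params.items():
            text = text.replace('PARAM[%s]' % name, value)
        # 3. Keyword renaming: rename the generic schema keywords to the names
        #    used by the concrete activity.
        for generic, concrete in keyword_map.items():
            text = text.replace(generic, concrete)
        return text

    print(instantiate(
        'a_out([i<arityOf(a_in)+1]A_OUT_$i$) <- a_in(...), PARAM[FIELD]=5.',
        arity=3,
        params={'FIELD': 'A_IN_3'},
        keyword_map={'a_out': 'dm1_out', 'a_in': 'dm1_in',
                     'A_OUT': 'DM1_OUT', 'A_IN': 'DM1_IN'}))
    # dm1_out(DM1_OUT_1, DM1_OUT_2, DM1_OUT_3) <- dm1_in(...), DM1_IN_3=5.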

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT() and OUTPUT() denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT() expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output and expression attributes. A default mapping can be explicitly done at the template level, by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.
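As a small illustration of this positional default mapping, the sketch below (our own, using the A_IN_i/A_OUT_i attribute naming of the examples that follow) generates the equalities a template would emit for a given arity.

    def default_mapping(arity, in_prefix='A_IN', out_prefix='A_OUT'):
        # Positional default mapping: the i-th output attribute is equated
        # with the i-th input attribute.
        return ['%s_%d=%s_%d' % (out_prefix, i, in_prefix, i)
                for i in range(1, arity + 1)]

    print(', '.join(default_mapping(3)))
    # A_OUT_1=A_IN_1, A_OUT_2=A_IN_2, A_OUT_3=A_IN_3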

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., it specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
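The semantics that the instantiated activity DM1 enforces can be sketched as follows; this is a minimal illustration under our own assumptions (four-attribute tuples, the checked attribute in the third position and an inclusive range), not the LDL code produced by the tool.

    def domain_mismatch(rows, field_index, x_low, x_high):
        # Rows whose checked attribute falls within [x_low, x_high] are
        # propagated to the output; the rest go to the rejection output.
        out, rejected = [], []
        for row in rows:
            (out if x_low <= row[field_index] <= x_high else rejected).append(row)
        return out, rejected

    accepted, rejected = domain_mismatch(
        [(1, 'a', 7, 'x'), (2, 'b', 12, 'y')], field_index=2, x_low=5, x_high=10)
    # accepted == [(1, 'a', 7, 'x')], rejected == [(2, 'b', 12, 'y')]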

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data satisfying it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each of which negates just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ_PK(R_new, R). The formal semantics of the difference operator are given by the following calculus-like definition:

Δ_{A1,...,Ak}(R, S) = {x ∈ R | ¬∃ y ∈ S: x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak]}

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.

Fig. 14. Program-based template example: Difference activity.
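As a minimal sketch of the Δ_PK semantics (our own illustration, with rows as tuples and PK given as a list of attribute positions), the newly inserted rows can be computed as follows.

    def difference(r_new, r_old, key):
        # Keep the rows of r_new that have no match in r_old on the key
        # attributes, i.e. the newly inserted rows of the current snapshot.
        seen = {tuple(row[i] for i in key) for row in r_old}
        return [row for row in r_new if tuple(row[i] for i in key) not in seen]

    new_rows = difference(
        r_new=[(1, 'a', 10), (2, 'b', 20), (3, 'c', 30)],
        r_old=[(1, 'a', 10), (2, 'b', 25)],
        key=[0])
    # new_rows == [(3, 'c', 30)]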

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario, by allowing the user to draw only relationships respecting the restrictions imposed by the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4], Microsoft, with Data Transformation Services [3], and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.
DTS Export and Import Wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.
DTS Programming Interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., transitive closure).
Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way: they gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface, and the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.
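To give a flavor of two of the operations listed above, the following sketch shows split (on a regular expression) and fold over toy tuples; it is our own illustration of the semantics and not the actual Potter's Wheel interface.

    import re

    def split_column(rows, col, pattern):
        # Split one column into two around the first match of the pattern.
        out = []
        for row in rows:
            m = re.search(pattern, row[col])
            left, right = (row[col][:m.start()], row[col][m.end():]) if m else (row[col], '')
            out.append(row[:col] + (left, right) + row[col + 1:])
        return out

    def fold(rows, cols):
        # Fold a set of columns: each input row yields one output row per
        # folded column.
        return [row[:cols[0]] + (row[c],) for row in rows for c in cols]

    print(split_column([('Smith, John', '10')], 0, r',\s*'))
    # [('Smith', 'John', '10')]
    print(fold([('r1', 'a', 'b')], [1, 2]))
    # [('r1', 'a'), ('r1', 'b')]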

We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the field of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in [40]. The paper also explains the usage of metadata management techniques, which involve a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43], a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling for the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting, we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.

Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
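As a toy rendering of this resumption idea (our own illustration, not the actual algorithm of [49]), an activity whose output is ordered on a key can skip everything up to the last checkpointed key and continue from there after a failure.

    def resume_load(source_rows, target, last_checkpoint_key=None):
        # source_rows are assumed ordered on the key; rows up to the last
        # checkpointed key were already safely loaded before the failure.
        checkpoint = last_checkpoint_key
        for key, payload in source_rows:
            if checkpoint is not None and key <= checkpoint:
                continue                     # already loaded, skip on resumption
            target.append((key, payload))    # the actual 'load' step
            checkpoint = key                 # remember progress after each row
        return checkpoint

    warehouse = [(1, 'a'), (2, 'b')]         # rows loaded before the crash
    resume_load([(1, 'a'), (2, 'b'), (3, 'c')], warehouse, last_checkpoint_key=2)
    # warehouse == [(1, 'a'), (2, 'b'), (3, 'c')]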

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, outside the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site: http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLEDB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 21: Etl design document


is done on the basis of the schemata and the respective attributes of the activity that the user chooses.

3.2.3. Taxonomy: simple and program-based templates

The most commonly used activities can be easily expressed by a single predicate template; it is obvious, though, that it would be very inconvenient to restrict activity templates to single predicates. Thus, we separate template activities in two categories: simple templates, which cover single-predicate templates, and program-based templates, where many predicates are used in the template definition.

In the case of simple templates, the output predicate is bound to the input through a mapping and an expression. Each of the rules for obtaining the output is expressed in terms of the input schemata and the parameters of the activity. In the case of program templates, the output of the activity is expressed in terms of its intermediate predicate schemata, as well as its input schemata and its parameters. Program-based templates are often used to define activities that employ constraints like does-not-belong or does-not-exist, which need an intermediate negated predicate to be expressed intuitively. This predicate usually describes the conjunction of properties we want to avoid, and then it appears negated in the output predicate. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. This classification is orthogonal to the logical one of Section 3.1.

Simple templates. Formally, the expression of an activity which is based on a certain simple template is produced by a set of rules of the following form:

OUTPUT() <- INPUT(), EXPRESSION, MAPPING

where INPUT( ) and OUTPUT( ) denote the full expression of the respective schemata; in the case of multiple input schemata, INPUT( ) expresses the conjunction of the input schemata. MAPPING denotes any mapping between the input, output, and expression attributes. A default mapping can be explicitly done at the template level by specifying equalities between attributes, where the first attribute of the input schema is mapped to the first attribute of the output schema, the second to the respective second one, and so on. At instantiation time, the user can change these mappings easily, especially in the presence of the graphical interface. Note also that, despite the fact that LDL allows implicit mappings by giving identical names to attributes that must be equal, our design choice was to give explicit equalities, in order to support the preservation of the names of the attributes of the input and output schemata at instantiation time.

To make ourselves clear, we will demonstrate the usage of simple template activities through an example. Suppose, thus, the case of the Domain Mismatch template activity, checking whether the values for a certain attribute fall within a particular range. The rows that abide by the rule pass the check performed by the activity and they are propagated to the output.

Observe Fig. 13, where we present an example of the definition of a template activity and its instantiation in a concrete activity. The first row in Fig. 13 describes the definition of the template activity. There are three parameters: FIELD, for the field that will be checked against the expression, and Xlow and Xhigh, for the lower and upper limit of acceptable values for attribute FIELD. The expression of the template activity is a simple expression guaranteeing that FIELD will be within the specified range. The second row of Fig. 13 shows the template after the macros are expanded. Let us suppose that the activity named DM1 materializes the template's parameters that appear in the third row of Fig. 13, i.e., specifies the attribute over which the check will be performed (A_IN_3) and the actual range for this check (5, 10). The fourth row of Fig. 13 shows the resulting instantiation after keyword renaming is done. The activity includes an input schema dm1_in, with attributes DM1_IN_1, DM1_IN_2, DM1_IN_3, DM1_IN_4, and an output schema dm1_out, with attributes DM1_OUT_1, DM1_OUT_2, DM1_OUT_3, DM1_OUT_4. In this case, the parameter FIELD implements a dynamic internal mapping in the template, whereas the Xlow, Xhigh parameters provide values for constants. The mapping from the input to the output is hardcoded in the template.

Fig. 13. Simple template example: domain mismatch.
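Since Fig. 13 is reproduced here only as an image, the following minimal Python sketch illustrates the same idea under stated assumptions: a range-check "template" parameterized by FIELD, Xlow and Xhigh, whose parameters are fixed at instantiation time, with the input-to-output mapping hardcoded as a copy of the row. The attribute names DM1_IN_* and the range (5, 10) come from the text; the function names and sample data are illustrative, not the paper's actual LDL encoding.

    # Illustrative sketch only: the paper expresses this activity as an LDL rule (Fig. 13);
    # the Python below merely mimics its semantics.

    def domain_mismatch_template(field, x_low, x_high):
        def activity(rows):
            for row in rows:                       # INPUT()
                if x_low <= row[field] <= x_high:  # EXPRESSION: Xlow <= FIELD <= Xhigh
                    yield dict(row)                # MAPPING: output copies the input attributes
        return activity

    # Instantiation of activity DM1: check attribute DM1_IN_3 against the range (5, 10),
    # as described in the text (keyword renaming has already produced the DM1_IN_* names).
    dm1 = domain_mismatch_template("DM1_IN_3", 5, 10)

    dm1_in = [
        {"DM1_IN_1": 1, "DM1_IN_2": "a", "DM1_IN_3": 7,  "DM1_IN_4": "x"},
        {"DM1_IN_1": 2, "DM1_IN_2": "b", "DM1_IN_3": 42, "DM1_IN_4": "y"},
    ]
    dm1_out = list(dm1(dm1_in))   # only the first row passes the check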

Program-based templates. The case of program-based templates is somewhat more complex, since the designer who records the template creates more than one predicate to describe the activity. This is usually the case of operations where we want to verify that some data do not have a conjunction of certain properties. Such constraints employ negation to assert that a tuple does not satisfy a predicate, which is defined in a way that requires that the data satisfying it have the properties we want to avoid. Such negations can be expressed by more than one rule for the same predicate, each negating just one property, according to the logical rule ¬(q ∧ p) ≡ ¬q ∨ ¬p. Thus, in general, we allow the construction of an LDL program with intermediate predicates in order to enhance intuition. For example, the does-not-belong relation, which is needed in the Difference activity template, needs a second predicate to be expressed intuitively.

Let us see in more detail the case of Difference. During the ETL process, one of the very first tasks that we perform is the detection of newly inserted and possibly updated records. Usually, this is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). To capture this process, we introduce a variation of the classical relational difference operator, which checks for equality only on a certain subset of attributes of the input records. Assume that during the extraction process we want to detect the newly inserted rows. Then, if PK is the set of attributes that uniquely identify rows (in the role of a primary key), the newly inserted rows can be found from the expression Δ<PK>(R_new, R). The formal semantics of the difference operator are given by the following calculus-like definition:

Δ<A1,...,Ak>(R, S) = { x ∈ R | ¬∃ y ∈ S : x[A1] = y[A1] ∧ ... ∧ x[Ak] = y[Ak] }

In Fig. 14 we can see the template of the Difference activity and a resulting instantiation for an activity named dF1. As we can see, we need the semijoin predicate so we can exclude all tuples that satisfy it. Note also that we have two different inputs, which are denoted as distinct by adding a number at the end of the keyword a_in.

Fig. 14. Program-based template example: Difference activity.
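As a concrete illustration of the operator's semantics (the paper's actual encoding is the two-predicate LDL program of Fig. 14, with a semijoin predicate and its negation), the short Python sketch below computes Δ<PK>(R_new, R) for rows represented as dictionaries. The relation and attribute names are hypothetical examples, not taken from the paper.

    # Sketch of Delta_<PK>(R_new, R): keep the rows of R_new for which no row of R
    # agrees on all the key attributes in PK (the "does-not-belong" check).

    def difference(r_new, r_old, pk):
        old_keys = {tuple(row[a] for a in pk) for row in r_old}   # keys of the previous snapshot
        return [row for row in r_new
                if tuple(row[a] for a in pk) not in old_keys]

    r_old = [{"PKEY": 1, "NAME": "a"}, {"PKEY": 2, "NAME": "b"}]
    r_new = [{"PKEY": 1, "NAME": "a"}, {"PKEY": 2, "NAME": "b2"}, {"PKEY": 3, "NAME": "c"}]

    newly_inserted = difference(r_new, r_old, pk=["PKEY"])        # [{"PKEY": 3, "NAME": "c"}]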

4. Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed from the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario in two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers, at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is in the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.

5. Related work

In this section we will report (a) on related commercial studies and tools in the field of ETL, (b) on related efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14] the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market has reached a size of $667 million for year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle, with Oracle Warehouse Builder [4]; Microsoft, with Data Transformation Services [3]; and IBM, with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of the commercial ETL tools, and we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered as best sellers. But we stress the fact that the former three have the benefit of the minimum cost, because they are shipped with the database, while the latter two have the benefit to aim at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. Data Warehouse Center is used to define the processes that move and transform data for the warehouse. Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schema associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows users to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: a GUI used to interactively design and execute DTS packages.

DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages) in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user, in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage for certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found and clean the data without writing complex programs or enduring long delays.
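To make one of the listed operations concrete, the sketch below shows a column split driven by a regular expression, in the spirit of the Potter's Wheel operations just described. This is not Potter's Wheel's actual interface; the function name, its signature and the sample data are assumptions made purely for illustration.

    import re

    # Illustrative sketch (assumed interface): split one column into several
    # on the basis of a regular expression with capturing groups.
    def split_column(rows, column, pattern, new_columns):
        compiled = re.compile(pattern)
        result = []
        for row in rows:
            match = compiled.match(str(row[column]))
            parts = match.groups() if match else (None,) * len(new_columns)
            new_row = {k: v for k, v in row.items() if k != column}
            new_row.update(dict(zip(new_columns, parts)))
            result.append(new_row)
        return result

    # Hypothetical example: split a DATE column of the form dd/mm/yyyy into three columns.
    rows = [{"ID": 1, "DATE": "03/11/2004"}]
    print(split_column(rows, "DATE", r"(\d{2})/(\d{2})/(\d{4})", ["DAY", "MONTH", "YEAR"]))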


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intentional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following reoccurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows, (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed, and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39] the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, for the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48] the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/hour and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing the loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities. Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issue of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, available at http://pike.cs.ucla.edu/ldl, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 22: Etl design document

ARTICLE IN PRESS

Fig 13 Simple template example domain mismatch

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 513

the input to the output is hardcoded in thetemplate

Program-based templates The case of program-

based templates is somewhat more complex sincethe designer who records the template creates morethan one predicate to describe the activity This isusually the case of operations where we want toverify that some data do not have a conjunction ofcertain properties Such constraints employ nega-tion to assert that a tuple does not satisfy apredicate which is defined in a way that it requiresthat the data that satisfy it have the properties wewant to avoid Such negations can be expressed bymore than one rules for the same predicate thateach negates just one property according to thelogical rule (q4p)q3p Thus in generalwe allow the construction of a LDL program withintermediate predicates in order to enhanceintuition For example the does-not-belong rela-

tion which is needed in the Difference activitytemplate needs a second predicate to be expressedintuitivelyLet us see in more detail the case of Differ-

ence During the ETL process one of the veryfirst tasks that we perform is the detection of newlyinserted and possibly updated records Usuallythis is physically performed by the comparison oftwo snapshots (one corresponding to the previousextraction and the other to the current one) Tocapture this process we introduce a variation ofthe classical relational difference operator whichchecks for equality only on a certain subset ofattributes of the input records Assume that duringthe extraction process we want to detect the newlyinserted rows Then if PK is the set of attributesthat uniquely identify rows (in the role of aprimary key) the newly inserted rows can befound from the expression DPKS4(Rnew R) Theformal semantics of the difference operator are

ARTICLE IN PRESS

Fig 14 Program-based template example Difference activity

P Vassiliadis et al Information Systems 30 (2005) 492ndash525514

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 515

given by the following calculus-like definitionDA1yAkS(R S)frac14 xAR|(yAS x[A1]frac14 y[A1]4y4x[Ak]frac14 y[Ak]In Fig 14 we can see the template of the

Difference activity and a resulting instantiationfor an activity named dF1 As we can see we needthe semijoin predicate so we can exclude alltuples that satisfy it Note also that we have twodifferent inputs which are denoted as distinct byadding a number at the end of the keyword a_in

4 Implementation

In the context of the aforementioned framework, we have implemented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model. In order to design a scenario, the user defines the source and target data stores, the participating activities and the flow of the data in the scenario. These tasks are greatly assisted (a) by a friendly GUI and (b) by a set of reusability templates.

All the details defining an activity can be captured through forms and/or simple point-and-click operations. More specifically, the user may explore the data sources and the activities already defined in the scenario, along with their schemata (input, output and parameter). Attributes belonging to an output schema of an activity or a recordset can be "drag'n'dropped" in the input schema of a subsequent activity or recordset, in order to create the equivalent data flow in the scenario. In a similar design manner, one can also set the parameters of an activity. By default, the output schema of the activity is instantiated as a copy of the input schema. Then, the user has the ability to modify this setting according to his demands, e.g., by deleting or renaming the proper attributes. The rejection schema of an activity is considered to be a copy of the input schema of the respective activity, and the user may determine its physical location, e.g., the physical location of a log file that maintains the rejected rows of the specified activity. Apart from these features, the user can (a) draw the desirable attributes or parameters, (b) define their name and data type, (c) connect them to their schemata, (d) create provider and regulator relationships between them, and (e) draw the proper edges from one node of the architecture graph to another. The system assures the consistency of a scenario by allowing the user to draw only relationships respecting the restrictions imposed by the model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from an application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

Fig. 15. The motivating example in ARKTOS II.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16, we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.

Fig. 16. A detailed zoom-in view of the motivating example.

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario's objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related research efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for the year 2001; still, the growth rate has reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they are slowly starting to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress, however, that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of the DB2 Data Warehouse Center. Additionally, it provides metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: a GUI used to interactively design and execute DTS packages.

DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules (Manager, Designer, Director and Administrator), as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages), in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying given grouping criteria (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
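As a concrete illustration of the kind of transform listed above, the following Python sketch splits a column on the basis of a regular expression, in the spirit of the Potter's Wheel operations just described. This is not Potter's Wheel's actual interface; the function, column and pattern names are illustrative assumptions only.

```python
import re

# Hypothetical sketch of a "split column on a regular expression" transform.
def split_column(rows, column, pattern, new_columns):
    """Split `column` of each row into `new_columns` using regex `pattern`."""
    out = []
    for row in rows:
        match = re.match(pattern, row[column])
        new_row = {k: v for k, v in row.items() if k != column}
        groups = match.groups() if match else [None] * len(new_columns)
        for name, value in zip(new_columns, groups):
            new_row[name] = value
        out.append(new_row)
    return out

# Example: split a "NAME" column of the form "Last, First" into two columns.
people = [{"NAME": "Doe, John"}, {"NAME": "Smith, Jane"}]
print(split_column(people, "NAME", r"(\w+),\s*(\w+)", ["LAST", "FIRST"]))
```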


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the [9] standard is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow-related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42], we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.
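To give a flavour of what such graph-based importance metrics can look like, the sketch below counts how many nodes transitively feed a given node through provider edges. The actual metrics are defined in [6]; this is only an illustrative simplification in Python, with a toy graph and hypothetical node names.

```python
from collections import deque

# Illustrative simplification: the "dependence" of a node as the number of
# nodes that transitively provide data to it through provider edges.
def transitive_providers(providers, node):
    """providers: dict mapping each node to the list of its direct providers."""
    seen, queue = set(), deque(providers.get(node, []))
    while queue:
        p = queue.popleft()
        if p not in seen:
            seen.add(p)
            queue.extend(providers.get(p, []))
    return seen

# Toy architecture graph: recordset R feeds activity a1, which feeds a2, which feeds DW.
providers = {"a1": ["R"], "a2": ["a1"], "DW": ["a2"]}
print(len(transitive_providers(providers, "DW")))   # -> 3
```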

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section, we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all of the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata, in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers on the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
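The general idea of checkpoint-based resumption over ordered output can be illustrated with the Python sketch below. The technique itself is the one discussed in [49]; the code is only a hedged illustration with hypothetical function and field names, not the algorithm of that paper.

```python
# Hedged sketch: resuming an interrupted, ordered load from the last checkpoint.
# `extract_sorted` is assumed to yield records ordered by their key.

def load_with_checkpoints(extract_sorted, load, save_checkpoint,
                          last_checkpoint=None, every=10_000):
    """Skip records already loaded (key <= last_checkpoint), then load and
    periodically record the highest key written so far."""
    written = 0
    for record in extract_sorted():
        if last_checkpoint is not None and record["key"] <= last_checkpoint:
            continue                        # already loaded before the failure
        load(record)
        written += 1
        if written % every == 0:
            save_checkpoint(record["key"])  # recovery can restart from here
    return written
```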

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52–61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16–20 June 2003, pp. 520–535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, AJAX: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381–390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: Semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8–11, 2002, pp. 326–339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5–9, 2000, pp. 431–445.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: Integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14–21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46–57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 23: Etl design document

ARTICLE IN PRESS

Fig 14 Program-based template example Difference activity

P Vassiliadis et al Information Systems 30 (2005) 492ndash525514

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 515

given by the following calculus-like definitionDA1yAkS(R S)frac14 xAR|(yAS x[A1]frac14 y[A1]4y4x[Ak]frac14 y[Ak]In Fig 14 we can see the template of the

Difference activity and a resulting instantiationfor an activity named dF1 As we can see we needthe semijoin predicate so we can exclude alltuples that satisfy it Note also that we have twodifferent inputs which are denoted as distinct byadding a number at the end of the keyword a_in

4 Implementation

In the context of the aforementioned frame-work we have implemented a graphical designtool ARKTOS II with the goal of facilitating thedesign of ETL scenarios based on our model Inorder to design a scenario the user defines thesource and target data stores the participatingactivities and the flow of the data in the scenarioThese tasks are greatly assisted (a) by a friendlyGUI and (b) by a set of reusability templatesAll the details defining an activity can be

captured through forms andor simple point andclick operations More specifically the user mayexplore the data sources and the activities already

Fig 15 The motivating e

defined in the scenario along with their schemata(input output and parameter) Attributes belong-ing to an output schema of an activity or arecordset can be lsquolsquodragrsquonrsquodroppedrsquorsquo in the inputschema of a subsequent activity or recordset inorder to create the equivalent data flow in thescenario In a similar design manner one can alsoset the parameters of an activity By default theoutput schema of the activity is instantiated as acopy of the input schema Then the user has theability to modify this setting according to hisdemands eg by deleting or renaming the properattributes The rejection schema of an activity isconsidered to be a copy of the input schema of therespective activity and the user may determine itsphysical location eg the physical location of alog file that maintains the rejected rows of thespecified activity Apart from these features theuser can (a) draw the desirable attributes orparameters (b) define their name and data type(c) connect them to their schemata (d) createprovider and regulator relationships betweenthem and (e) draw the proper edges from onenode of the architecture graph to another Thesystem assures the consistency of a scenario byallowing the user to draw only relationshipsrespecting the restrictions imposed from the

xample in ARKTOS II

ARTICLE IN PRESS

Fig 16 A detailed zoom-in view of the motivaing example

P Vassiliadis et al Information Systems 30 (2005) 492ndash525516

model As far as the provider and instance-ofrelationships are concerned they are calculatedautomatically and their display can be turned onor off from an applicationrsquos menu Moreover thesystem allows the designer to define activitiesthrough a form-based interface instead of definingthem through the point-and-click interface Natu-rally the form automatically provides lists withthe available recordsets their attributes etc Fig15 shows the design canvas of our GUI where ourmotivating example is depicted

ARKTOS II offers zoom-inzoom-out capabilitiesa particularly useful feature in the construction ofthe data flow of the scenario through inter-attribute lsquolsquoproviderrsquorsquo mappings The designer candeal with a scenario in two levels of granularity (a)at the entity or zoom-out level where only theparticipating recordsets and activities are visibleand their provider relationships are abstracted asedges between the respective entities or (b) at theattribute or zoom-in level where the user can seeand manipulate the constituent parts of anactivity along with their respective providers atthe attribute level In Fig 16 we show a part of thescenario of Fig 15 Observe (a) how part-of

relationships are expanded to link attributes totheir corresponding entities (b) how providerrelationships link attributes to each other (c)how regulator relationships populate activityparameters and (d) how instance-of relationshipsrelate attributes with their respective data typesthat are depicted at the lower right part of thefigureIn ARKTOS II the customization principle is

supported by the reusability templates The notionof template is in the heart of ARKTOS II There aretemplates for practically every aspect of the modeldata types functions and activities Templates areextensible thus providing the user with thepossibility of customizing the environment accord-ing to hisher own needs Especially for activitieswhich form the core of our model a specific menuwith a set of frequently used ETL Activities isprovided The system has a built-in mechanismresponsible for the instantiation of the LDLtemplates supported by a graphical form thathelps the user define the variables of the templateby selecting its values among the appropriatescenariorsquos objects Another distinctive feature ofARKTOS II is the computation of the scenariorsquos

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 517

design quality by employing a set of metrics thatare presented in [6] either for the whole scenarioor for each activity of itThe scenarios are stored in ARKTOS II repository

(implemented in a relational DBMS) the systemallows the user to store retrieve and reuse existingscenarios All the metadata of the system involvingthe scenario configuration the employed templatesand their constituents are stored in the repositoryThe choice of a relational DBMS for our metadatarepository allows its efficient querying as well asthe smooth integration with external systems andor future extensions of ARKTOS II The connectivityto source and target data stores is achievedthrough ODBC connections and the tool offersan automatic reverse engineering of their schema-ta We have implemented ARKTOS II with Oracle817 as basis for our repository and Ms VisualBasic (Release 6) for developing our GUIAn on-going activity is the coupling of ARKTOS II

with state-of-the-art algorithms for individualETL tasks (eg duplicate removal or surrogatekey assignment) and with scheduling and monitor-ing facilities Future plans for ARKTOS II involve theextension of data sources to more sophisticateddata formats outside the relational domain likeobject-oriented or XML data

5 Related work

In this section we will report (a) on relatedcommercial studies and tools in the field of ETL(b) on related efforts in the academia in the issueand (c) applications of workflow technology in thefield of data warehousing

51 Commercial studies and tools

In a recent study [14] the authors report thatdue to the diversity and heterogeneity of datasources ETL is unlikely to become an opencommodity market The ETL market has reacheda size of $667 millions for year 2001 still thegrowth rate has reached a rather low 11 (ascompared with a rate of 60 growth for year2000) This is explained by the overall economicdownturn environment In terms of technological

aspects the main characteristic of the area is theinvolvement of traditional database vendors withETL solutions built in the DBMSs The threemajor database vendors that practically ship ETLsolutions lsquolsquoat no extra chargersquorsquo are pinpointedOracle with Oracle Warehouse Builder [4] Micro-soft with Data Transformation Services [3] andIBM with the Data Warehouse Center [1] Still themajor vendors in the area are InformaticarsquosPowercenter [2] and Ascentialrsquos DataStage suites[1516] (the latter being part of the IBM recom-mendations for ETL solutions) The study goes onto propose future technological challengesfore-casts that involve the integration of ETL with (a)XML adapters (b) enterprise application integra-tion (EAI) tools (eg MQ-Series) (c) customizeddata quality tools and (d) the move towardsparallel processing of the ETL workflowsThe aforementioned discussion is supported

from a second recent study [17] where the authorsnote the decline in license revenue for pure ETLtools mainly due to the crisis of IT spending andthe appearance of ETL solutions from traditionaldatabase and business intelligence vendors TheGartner study discusses the role of the three majordatabase vendors (IBM Microsoft Oracle) andpoints that they slowly start to take a portion ofthe ETL market through their DBMS-built-insolutionsIn the sequel we elaborate more on the major

vendors in the area of the commercial ETL toolsand we discuss three tools that the major databasevendors provide as such two ETL tools that areconsidered as best sellers But we stress the factthat the former three have the benefit of theminimum cost because they are shipped with thedatabase while the latter two have the benefit toaim at complex and deep solutions not envisionedby the generic products

IBM DB2 Universal Database offers the DataWarehouse Center [1] a component that auto-mates data warehouse processing and the DB2Warehouse Manager that extends the capabilitiesof the Data Warehouse Center with additionalagents transforms and metadata capabilitiesData Warehouse Center is used to define theprocesses that move and transform data for thewarehouse Warehouse Manager is used to

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525518

schedule maintain and monitor these processesWithin the Data Warehouse Center the warehouse

schema modeler is a specialized tool for generatingand storing schema associated with a data ware-house Any schema resulting from this process canbe passed as metadata to an OLAP tool Theprocess modeler allows user to graphically link thesteps needed to build and maintain data ware-houses and dependent data marts DB2 Ware-house Manager includes enhanced ETL functionover and above the base capabilities of DB2 DataWarehouse Center Additionally it provides me-tadata management repository function as suchan integration point for third-party independentsoftware vendors through the information catalog

Microsoft The tool that is offered by Microsoftto implement its proposal for the Open Informa-tion Model is presented under the name of Data

Transformation Services(DTS) [318] DTS are thedata-manipulation utility services in SQL Server(from version 70) that provide import export anddata-manipulating services between OLE DB [19]ODBC and ASCII data stores DTS are char-acterized by a basic object called a package thatstores information on the aforementioned tasksand the order in which they need to be launched Apackage can include one or more connections todifferent data sources and different tasks andtransformations that are executed as steps thatdefine a workflow process [20] The softwaremodules that support DTS are shipped with MSSQL Server These modules include

DTS designer A GUI used to interactivelydesign and execute DTS packages

DTS export and import wizards Wizards thatease the process of defining DTS packages forthe import export and transformation of data

DTS programming interfaces A set of OLEAutomation and a set of COM interfaces tocreate customized transformation applicationsfor any system supporting OLE automation orCOM

Oracle Oracle Warehouse Builder [421] is arepository-based tool for ETL and data ware-housing The basic architecture comprises twocomponents the design environment and the

runtime environment Each of these componentshandles a different aspect of the system the designenvironment handles metadata the runtime en-vironment handles physical data The metadatacomponent revolves around the metadata reposi-tory and the design tool The repository is basedon the Common Warehouse Model (CWM)standard and consists of a set of tables in anOracle database that are accessed via a Java-basedaccess layer The front-end of the tool (entirelywritten in Java) features wizards and graphicaleditors for logging onto the repository The datacomponent revolves around the runtime environ-ment and the warehouse database The WarehouseBuilder runtime is a set of tables sequencespackages and triggers that are installed in thetarget schema The code generator that bases onthe definitions stores in the repository it createsthe code necessary to implement the warehouseWarehouse Builder generates extraction specificlanguages (SQLLoader control files for flat filesABAP for SAPR3 extraction and PLSQL for allother systems) for the ETL processes and SQLDDL statements for the database objects Thegenerated code is deployed either to the file systemor into the database

Ascential software DataStage XE suite fromAscential Software [1516] (formerly InformixBusiness Solutions) is an integrated data ware-house development toolset that includes an ETLtool (DataStage) a data quality tool (QualityManager) and a metadata management tool(MetaStage) The DataStage ETL componentconsists of four design and administration mod-ules Manager Designer Director and Adminis-

trator as such a metadata repository and a serverThe DataStage Manager is the basic metadatamanagement tool In the Designer module ofDataStage ETL tasks execute within individuallsquolsquostagersquorsquo objects (source target and transformationstages) in order to create ETL tasks The Directoris DataStagersquos job validation and schedulingmodule The DataStage Administrator is primarilyfor controlling security functions The DataStageServer is the engine that moves data from source totarget

Informatica Informatica PowerCenter [2] is theindustry-leading (according to recent studies

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 519

[1417]) data integration platform for buildingdeploying and managing enterprise data ware-houses and other data integration projects Theworkhorse of Informatica PowerCenter is a dataintegration engine that executes all data extrac-tion transformation migration and loading func-tions in-memory without generating code orrequiring developers to hand-code these proce-dures The PowerCenter data integration engine ismetadata driven creating a repository-and-enginepartnership that ensures data integration processesare optimally executed

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
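
Read in sequence, the four transformation types compose into a cleaning pipeline (map, then match, then cluster, then merge). The following sketch only illustrates this composition; the helper functions and the similarity threshold are hypothetical and are not part of AJAX.

from itertools import combinations
from typing import Callable, Dict, List

Record = Dict[str, str]

def clean(records: List[Record],
          standardize: Callable[[Record], Record],        # mapping step
          similarity: Callable[[Record, Record], float],  # matching step
          merge: Callable[[List[Record]], Record],        # merging step
          threshold: float = 0.8) -> List[Record]:
    mapped = [standardize(r) for r in records]
    # Matching: keep pairs whose similarity exceeds the threshold.
    pairs = [(i, j) for i, j in combinations(range(len(mapped)), 2)
             if similarity(mapped[i], mapped[j]) >= threshold]
    # Clustering by transitive closure over matching pairs (naive union-find).
    parent = list(range(len(mapped)))
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters: Dict[int, List[Record]] = {}
    for idx, rec in enumerate(mapped):
        clusters.setdefault(find(idx), []).append(rec)
    # Merging: collapse each cluster into a single record.
    return [merge(group) for group in clusters.values()]

The union-find step realizes the "grouping by transitive closure" criterion mentioned above; any other grouping criterion could be substituted at that point.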

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering, and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretical foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
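
A few of the operations listed above are easy to picture on a tabular data set. The sketch below illustrates split, divide, and fold over rows represented as dictionaries; it is an illustration of the operations' semantics under our own simplifying assumptions, not of Potter's Wheel itself.

import re
from typing import Callable, Dict, List

Row = Dict[str, str]

def split_column(rows: List[Row], col: str, pattern: str, left: str, right: str) -> List[Row]:
    """Split col into two new columns at the first match of a regular expression."""
    out = []
    for r in rows:
        m = re.search(pattern, r[col])
        cut = m.start() if m else len(r[col])
        out.append({**r, left: r[col][:cut], right: r[col][cut:]})
    return out

def divide_column(rows: List[Row], col: str, predicate: Callable[[str], bool],
                  left: str, right: str) -> List[Row]:
    """Divide col into two columns: values satisfying the predicate go to the first,
    the remaining values go to the second."""
    return [{**r,
             left: r[col] if predicate(r[col]) else "",
             right: "" if predicate(r[col]) else r[col]}
            for r in rows]

def fold_columns(rows: List[Row], cols: List[str], key: str, value: str) -> List[Row]:
    """Fold a set of attributes into several rows (one output row per folded attribute)."""
    return [{key: c, value: r[c]} for r in rows for c in cols]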


We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].
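
The trade of consistency for completeness can be illustrated with a toy example; the relation and attribute names below are invented for the illustration and are not taken from IBIS.

# Orders reference a customer id; one id (42) has no matching customer row.
customers = {1: "Smith", 2: "Jones"}
orders = [{"order_id": 100, "cust_id": 1}, {"order_id": 101, "cust_id": 42}]

# A strict, consistency-first view would reject order 101 as a foreign key
# violation. The completeness-first reading instead assumes the lookup row is
# merely missing and keeps the order, leaving the customer name unknown.
answer = [{"order_id": o["order_id"],
           "customer": customers.get(o["cust_id"], None)}   # None = unknown, not an error
          for o in orders]
print(answer)   # [{'order_id': 100, 'customer': 'Smith'}, {'order_id': 101, 'customer': None}]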

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications, and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution, and adaptive workflows. Still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36] the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).
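
The contrast drawn here can be made concrete with a minimal, control-flow-centric process description in which the business data remain opaque. The structure below is a deliberately simplified illustration, not the WfMC metamodel itself; all names are our own.

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Transition:
    source: str                  # activity names
    target: str
    condition: str = "true"      # evaluated over workflow related data

@dataclass
class Process:
    activities: List[str]
    transitions: List[Transition]
    relevant_data: Dict[str, Any] = field(default_factory=dict)
    # The data are an opaque bag of values used to route control flow; there are
    # no schemata and no inter-attribute provider mappings, which is exactly the
    # data-centric part our model adds.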

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39] the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation, cleaning, and storage of data in a terabyte-size data warehouse, is described in [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach to modeling and managing ETL processes.

6. Discussion

In this section we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we covered all possible problems around the topic. Therefore, in this section we will explore three issues as an overall assessment of our proposal. First, we will discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we will discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we will exit the domain of the logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii, and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48] the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a loading window of 4 h for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
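
To picture the pipelined, data-feeding paradigm advocated above, together with checkpoint-based resumption over ordered output, consider the following sketch. It is a toy illustration under our own simplifying assumptions and is not the technique of [49]; all names are hypothetical.

from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Tuple[int, str]          # (ordering key, payload)

def activity(rows: Iterable[Row], transform: Callable[[str], str]) -> Iterator[Row]:
    """An activity consumes rows from its provider and feeds them to its consumer."""
    for key, payload in rows:
        yield key, transform(payload)

def load(rows: Iterable[Row], checkpoint: Dict[str, int], every: int = 1000) -> None:
    """Loader that remembers the last ordered key written, so that a restart can
    skip everything up to that key instead of redoing the whole flow."""
    for count, (key, payload) in enumerate(rows, start=1):
        # ... write payload to the warehouse here ...
        if count % every == 0:
            checkpoint["last_key"] = key

def resume(source: Iterable[Row], checkpoint: Dict[str, int]) -> Iterator[Row]:
    """Skip rows already covered by the last checkpoint (requires ordered output)."""
    last = checkpoint.get("last_key", -1)
    return (row for row in source if row[0] > last)

# Hypothetical usage, where extract() is some ordered source of (key, payload) rows:
#     ckpt: Dict[str, int] = {}
#     load(activity(resume(extract(), ckpt), str.upper), ckpt)

The resumption shortcut only works because the output is ordered on the key, which is exactly the observation exploited in [49].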

7. Conclusions

In this paper we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities, and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL Magic Quadrant Update: Market Pressure Increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop (in conjunction with ACM SIGMOD), 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report RR-3742, INRIA, 1999.
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), Lecture Notes in Computer Science, vol. 2681, Springer, 2003, pp. 79–94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), Lecture Notes in Computer Science, vol. 2348, Springer, 2002, pp. 262–279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247–262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.


ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 523

Based on the above we believe that the quest for aworkflow rather than a push-and-store paradigmis quite often the only way to followOf course this kind of workflow approach

possibly suffers in the issue of software stabilityand mostly recovery Having a big amount oftransient data processed through a large set ofactivities in main memory is clearly vulnerable toboth software and hardware failures Moreoveronce a failure has occurred rapid recovery ifpossible within the loading time-window is also astrong desideratum Techniques to handle the issueof recovery already exist To our knowledge themost prominent one is the one by Labio et al [49]where the ordering of data is taken into considera-tion Checkpoint techniques guarantee that oncethe activity output is ordered recovery can startright at the point where the activity did the lastcheckpoint thus speeding up the whole processsignificantly

7 Conclusions

In this paper we have focused on the data-centric part of logical design of the ETL scenarioof a data warehouse First we have defined aformal logical metamodel as a logical abstractionof ETL processes The data stores activities andtheir constituent parts as well as the providerrelationships that map data producers to dataconsumers have formally been defined We havealso employed a declarative database program-ming language LDL to define the semantics ofeach activity Then we have provided a reusabilityframework that complements the genericity of theaforementioned metamodel Practically this isachieved from an extensible set of specializationsof the entities of the metamodel layer specificallytailored for the most frequent elements of ETLscenarios which we call template activities In thecontext of template materialization we have dealtwith specific language issues in terms of themechanics of template instantiation to concreteactivities Finally we have presented a graphicaldesign tool ARKTOS II with the goal of facilitatingthe design of ETL scenarios based on our model

Still several research issues are still left open onthe grounds of this work A broad area of researchinvolves the efficient and reliable execution of anETL scenario In this context an obvious issue isthe optimization of ETL scenarios under time andthroughput constraints The topic appears inter-esting since the frequent usage of functions inETL scenarios drives the problem outside theexpressive power of relational algebra (and there-fore the traditional optimization techniques usedin the context of relational query optimizers) Theproblem becomes even more complex if oneconsiders issues of reliability and recovery in thepresence of failures or even issues of softwarequality (eg resilience to changes in the underlyingdata stores) Similar results already exist in thecontext of materialized views maintenance [5051]Of course the issue of providing optimal algo-rithms for individual ETL tasks (eg duplicatedetection surrogate key assignment or identifica-tion of differentials) is also very interesting In adifferent line of research one could also worktowards providing a general model for the dataflow of data-centric business workflows involvingissues of transactions alternative interfaces in thecontext of control flow decisions and contingencyscenarios Finally the extension of ETL techni-ques for streaming or XML-formatted data is alsoanother interesting topic of future research

Acknowledgments

We would like to thank the anonymousreviewers of this paper for valuable commentsthat improved the overall quality of the paper

References

[1] IBM IBM Data warehouse manager available at http

www-3ibmcomsoftwaredatadb2datawarehouse

[2] Informatica Power Center available at httpwww

informaticacomproductsdata+integrationpowercenter

defaulthtm

[3] Microsoft Data transformation services available at

httpwwwmicrosoftcom

[4] Oracle Oracle warehouse builder product page available at

httpotnoraclecomproductswarehousecontenthtml

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525524

[5] WMP van der Aalst AHM ter Hofstede B Kiepus-

zewski AP Barros Workflow Patterns BETA Working

Paper Series WP 47 Eindhoven University of Technology

Eindhoven 2000 available at the Workflow Patterns

web site at tmit httpwwwtmtuenlresearchpatterns

documentationhtm

[6] P Vassiliadis A Simitsis S Skiadopoulos Modeling ETL

activities as graphs in Proceedings of the Fourth

International Workshop on Design and Management of

Data Warehouses (DMDW) pp 52ndash61 Toronto Canada

2002

[7] P Vassiliadis A Simitsis P Georgantas M Terrovitis A

framework for the design of ETL scenarios in Proceed-

ings of the 15th Conference on Advanced Information

Systems Engineering (CAiSE lsquo03) pp 520ndash535 Klagen-

furtVelden Austria 16ndash20 June 2003

[8] R Kimbal L Reeves M Ross W Thornthwaite The

Data Warehouse Lifecycle Toolkit Expert Methods for

Designing Developing and Deploying Data Warehouses

Wiley New York 1998

[9] Workflow Management Coalition Interface 1 Process

Definition Interchange Process Model Document no

WfMC TC-1016-P 1998 available at httpwww

wfmcorg

[10] S Naqvi S Tsur A Logical Language for Data and

Knowledge Bases Computer Science Press Rockville

MD 1989

[11] C Zaniolo LDL++ Tutorial UCLA httppikecs

uclaeduldl December 1998

[12] D Dori Conceptual modeling and system architecting

Commun ACM 46 (10) (2003) 62ndash65

[13] P Vassiliadis A Simitsis P Georgantas M Terrovitis

S Skiadopoulos A generic and customizable frame-

work for the design of ETL scenarios (long version)

Technical Report TR-2004-1 Knowledge and Data-

base Systems Laboratory National Technical University

of Athens available at httpwwwdbnetecentuagr

pubs

[14] Giga Information Group Market Overview Update

ETL Technical Report RPA-032002-00021 March

2002

[15] Ascential Software Inc available at httpwwwascen-

tialsoftwarecom

[16] Ascential Software ProductsmdashData Warehousing Tech-

nology available at httpwwwascentialsoftwarecom

productsdatastagehtml

[17] Gartner Inc ETL magic quadrant update market

pressure increases Gartnerrsquos Strategic Data Management

Research Note M-19-1108 January 2003

[18] PA Bernstein T Bergstraesser Meta-data support for

data transformations using Microsoft repository Special

issue on data transformations Bull Tech Committee

Data Eng 22 (1) (1999) 9ndash14

[19] Microsoft Corp OLEDB specification available at http

wwwmicrosoftcomdataoledb

[20] C Graves M Scott M Benkovich P Turley R

Skoglund R Dewson S Youness D Lee S Ferguson

T Bain T Joubert Professional SQL Server 2000 data

warehousing with analysis services 1st ed Wrox Press

Ltd 2001

[21] Oracle Oracle 9i Warehouse Builder Architectural White

paper April 2002

[22] H Galhardas D Florescu D Shasha E Simon Ajax An

extensible data cleaning tool in Proceedings of the ACM

SIGMOD International Conference on the Management

of Data pp 590 Dallas TX 2000

[23] W Cohen Some practical observations on integration of

Web information in WebDBrsquo99 Workshop in conj with

ACM SIGMOD 1999

[24] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning Technical Report

INRIA 1999 (RR-3742)

[25] V Raman J Hellerstein Potters Wheel an interactive

framework for data cleaning and transformation Techni-

cal Report University of California at Berkeley Computer

Science Division 2000 available at httpwwwcs

berkeleyedurshankarpaperspwheelpdf

[26] V Raman J Hellerstein Potterrsquos Wheel an interactive

data cleaning system in Proceedings of 27th Inter-

national Conference on Very Large Data Bases (VLDB)

pp 381ndash390 Roma Italy 2001

[27] M Jarke M Lenzerini Y Vassiliou P Vassiliadis

Springer New York 2000

[28] E Rundensteiner Special issue on data transformations

Bull Tech Committee Data Eng 22 (1) (1999)

[29] S Sarawagi Special issue on data cleaning Bull Tech

Committee Data Eng 23 (4) (2000)

[30] E Rahm H Hai Do Data cleaning problems and current

approaches Bull Tech Committee Data Eng 23 (4)

(2000)

[31] V Borkar K Deshmuk S Sarawagi Automatically

extracting structure form free text Addresses Bull Tech

Committee Data Eng 23 (4) (2000)

[32] A Monge Matching algorithms within a duplicate

detection system Bull Tech Committee Data Eng 23

(4) (2000)

[33] A Calı D Calvanese G De Giacomo M Lenzerini P

Naggar F Vernacotola IBIS Semantic data integration

at work in Proceedings of the 15th International

Conference on Advanced Information Systems Engineer-

ing (CAiSE 2003) vol 2681 of Lecture Notes in Computer

Science pp 79ndash94 Springer 2003

[34] A Calı D Calvanese G De Giacomo M Lenzerini

Data integration under integrity constraints in Proceed-

ings of the 14th International Conference on Advanced

Information Systems Engineering (CAiSE 2002) vol 2348

of Lecture Notes in Computer Science pp 262ndash279

Springer 2002

[35] J Eder W Gruber A meta model for structured work-

flows supporting workflow transformations in Proceed-

ings of the Sixth East European Conference on Advances

in Databases and Information Systems (ADBIS 2002)

pp 326ndash339 Bratislava Slovakia September 8ndash11

2002

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 525

[36] W Sadiq ME Orlowska On business process model

transformations 19th International Conference on Con-

ceptual Modeling (ER 2000) Salt Lake City UT USA

October 9ndash12 2000 pp 267ndash280

[37] B Kiepuszewski AHM ter Hofstede C Bussler On

structured workflow modeling in Proceedings of the 12th

International Conference on Advanced Information Sys-

tems Engineering (CAiSE 2000) pp 431ndash445 Stockholm

Sweden June 5ndash9 2000

[38] P Dadam M Reichert (eds) Enterprise-wide and cross-

enterprise workflow management concepts systems

applications GI Workshop Informatikrsquo99 1999 available

at httpwwwinformatikuni-ulmdedbisveranstaltungen

Workshop-Informatik99-Proceedingspdf

[39] M Jarke C Quix G Blees D Lehmann G Michalk S

Stierl Improving OLTP Data Quality Using Data Ware-

house Mechanisms Proceedings of 1999 ACM SIGMOD

International Conference on Management of Data Phila-

delphia USA June 1999 pp 537ndash538

[40] E Schafer J-D Becker M Jarke DB-Prism Integrated

data warehouses and knowledge networks for bank

controlling Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[41] M Jarke T List J Koller The challenge of process

warehousing Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[42] P Vassiliadis A Simitsis S Skiadopoulos Conceptual

modeling for ETL processes in Proceedings of the Fifth

ACM International Workshop on Data Warehousing and

OLAP (DOLAP) pp 14ndash21 McLean VA USA 2002

[43] A Simitsis P Vassiliadis A methodology for the

conceptual modeling of ETL processes in Proceedings

of the Decision Systems Engineering (DSE lsquo03) Velden

Austria June 17 2003

[44] A Simitsis Modeling and managing ETL processes in

Proceedings of the VLDB 2003 PhD Workshop Berlin

Germany September 12ndash13 2003

[45] F Casati S Ceri B Pernici G Pozzi Conceptual

Modeling of Workflows in Proceedings of the OO-ER

Conference Australia 1995

[46] AJ Albrecht Measuring Application Development Pro-

ductivity in IBM Applications Development Symposium

Monterey CA 1979 pp 83ndash92

[47] RS Pressman Software Engineering A Practitionerrsquos

Approach 5th ed McGraw-Hill New York 2000

[48] J Adzic V Fiore Data Warehouse Population Platform

in Proceedings of the Fifth International Workshop on the

Design and Management of Data Warehouses

(DMDWrsquo03) Berlin Germany September 2003

[49] W Labio JL Wiener H Garcia-Molina V Gorelik

Efficient resumption of interrupted warehouse loads in

Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data (SIGMOD 2000)

pp 46ndash57 Dallas TX USA 2000

[50] J Chen S Chen EA Rundensteiner A Transactional

Model for Data Warehouse Maintenance in Proceedings

of the of ER 2002 LNCS 2503 pp 247ndash262 2002

[51] B Liu S Chen EA Rundensteiner A transactional

approach to parallel data warehouse maintenance in

Proceedings of DaWaK 2002 LNCS 2454 2002 pp 307ndash316

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 25: Etl design document

Fig. 16. A detailed zoom-in view of the motivating example.

model. As far as the provider and instance-of relationships are concerned, they are calculated automatically and their display can be turned on or off from the application's menu. Moreover, the system allows the designer to define activities through a form-based interface, instead of defining them through the point-and-click interface. Naturally, the form automatically provides lists with the available recordsets, their attributes, etc. Fig. 15 shows the design canvas of our GUI, where our motivating example is depicted.

ARKTOS II offers zoom-in/zoom-out capabilities, a particularly useful feature in the construction of the data flow of the scenario through inter-attribute "provider" mappings. The designer can deal with a scenario at two levels of granularity: (a) at the entity or zoom-out level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute or zoom-in level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level. In Fig. 16 we show a part of the scenario of Fig. 15. Observe (a) how part-of relationships are expanded to link attributes to their corresponding entities, (b) how provider relationships link attributes to each other, (c) how regulator relationships populate activity parameters, and (d) how instance-of relationships relate attributes with their respective data types, which are depicted at the lower right part of the figure.
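For illustration only, the following Python sketch shows one way the relationships visible at the zoom-in level (part-of, provider, regulator, instance-of) could be represented as a graph of nodes and typed edges; the class names and the example identifiers are ours and do not reflect the actual ARKTOS II internals.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str
    kind: str            # "recordset", "activity", "attribute", "data type"

@dataclass
class ArchitectureGraph:
    edges: list = field(default_factory=list)

    def add(self, relation, source, target):
        # relation is one of: "part-of", "provider", "regulator", "instance-of"
        self.edges.append((relation, source, target))

    def providers_of(self, attribute):
        return [s for rel, s, t in self.edges if rel == "provider" and t == attribute]

g = ArchitectureGraph()
ds_partsupp = Node("DS.PS1", "recordset")          # illustrative names only
pkey_in = Node("DS.PS1.PKEY", "attribute")
sk_activity = Node("SK1", "activity")
pkey_out = Node("SK1.out.PKEY", "attribute")
integer_type = Node("Integer", "data type")

g.add("part-of", ds_partsupp, pkey_in)        # attribute belongs to its recordset
g.add("provider", pkey_in, pkey_out)          # data flows between attributes
g.add("regulator", pkey_in, sk_activity)      # an attribute populates a parameter
g.add("instance-of", pkey_in, integer_type)   # attribute is typed by a data type

print(g.providers_of(pkey_out))               # -> [Node(name='DS.PS1.PKEY', ...)]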

In ARKTOS II, the customization principle is supported by the reusability templates. The notion of template is at the heart of ARKTOS II. There are templates for practically every aspect of the model: data types, functions and activities. Templates are extensible, thus providing the user with the possibility of customizing the environment according to his/her own needs. Especially for activities, which form the core of our model, a specific menu with a set of frequently used ETL activities is provided. The system has a built-in mechanism responsible for the instantiation of the LDL templates, supported by a graphical form that helps the user define the variables of the template by selecting their values among the appropriate scenario objects. Another distinctive feature of ARKTOS II is the computation of the scenario's design quality, by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.
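As an illustration of the instantiation mechanism described above, the following sketch fills the variables of an LDL-like template body by simple parameter substitution; the template text and the parameter names are invented for the example and are not taken from the actual template library.

import re

TEMPLATE = """
a_out(@OUT_SCHEMA) <- a_in(@IN_SCHEMA), @FIELD >= @LOW_VALUE.
"""

def instantiate(template, bindings):
    # Replace every @VARIABLE with its binding; fail loudly on unbound ones.
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise ValueError(f"unbound template variable: {name}")
        return bindings[name]
    return re.sub(r"@([A-Z_]+)", substitute, template)

concrete_activity = instantiate(TEMPLATE, {
    "OUT_SCHEMA": "PKEY, DATE, QTY, COST",
    "IN_SCHEMA": "PKEY, DATE, QTY, COST",
    "FIELD": "COST",
    "LOW_VALUE": "0",
})
print(concrete_activity)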

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats outside the relational domain, like object-oriented or XML data.

5. Related work

In this section, we report (a) on related commercial studies and tools in the field of ETL, (b) on related research efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for year 2001; still, the growth rate was a rather low 11% (as compared with a rate of 60% growth for year 2000). This is explained by the overall economic downturn. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built in the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate on the major vendors in the area of commercial ETL tools: we discuss three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse. The Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows users to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides a metadata management and repository function, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulation services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps defining a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS Designer: a GUI used to interactively design and execute DTS packages.

DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages) in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. The tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value, by applying a given grouping criterion (e.g., by transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
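To give a concrete flavor of such a transformation graph, the sketch below chains the four transformation types over a toy list of records; the record layout, the similarity measure and the thresholds are our own illustrative assumptions and not part of the AJAX API.

from itertools import combinations

records = [
    {"id": 1, "name": "J. Smith ", "city": "Athens"},
    {"id": 2, "name": "John Smith", "city": "Athens"},
    {"id": 3, "name": "M. Jones", "city": "Ioannina"},
]

def mapping(recs):
    # Standardize formats: trim and lower-case the name column.
    return [dict(r, name=r["name"].strip().lower()) for r in recs]

def similarity(a, b):
    # Toy similarity: shared name tokens over total tokens (illustrative only).
    ta, tb = set(a["name"].split()), set(b["name"].split())
    return len(ta & tb) / len(ta | tb)

def matching(recs, threshold=0.3):
    # Emit matching pairs whose similarity value exceeds the threshold.
    return [(a, b, similarity(a, b))
            for a, b in combinations(recs, 2)
            if similarity(a, b) >= threshold]

def clustering(pairs):
    # Group matching pairs by (naive) transitive closure.
    clusters = []
    for a, b, _ in pairs:
        for c in clusters:
            if a["id"] in c or b["id"] in c:
                c.update({a["id"], b["id"]})
                break
        else:
            clusters.append({a["id"], b["id"]})
    return clusters

def merging(recs, clusters):
    # Keep one representative per cluster, plus all unclustered records.
    clustered_ids = set().union(*clusters) if clusters else set()
    survivors = [r for r in recs if r["id"] not in clustered_ids]
    by_id = {r["id"]: r for r in recs}
    survivors += [by_id[min(c)] for c in clusters]
    return survivors

mapped = mapping(records)
clean = merging(mapped, clustering(matching(mapped)))
print(clean)   # two records survive: the merged "smith" cluster and "m. jones"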

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The theoretical foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.

Raman et al. [25,26] present the Potter's Wheel system, which is targeted at providing interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way: they gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface, and the effect of a transform is shown at once on the records visible on screen. These transforms are specified either through simple graphical operations or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
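The following sketch illustrates two of the listed operations (split on a regular expression, fold columns) over a toy table; the function names and the table representation are ours and do not correspond to Potter's Wheel's actual interface.

import re

table = [
    {"name": "Smith, John", "q1": 10, "q2": 12},
    {"name": "Jones, Mary", "q1": 7, "q2": 9},
]

def split_column(rows, col, pattern, new_cols):
    # Split 'col' into new columns on the first match of 'pattern'.
    out = []
    for r in rows:
        parts = re.split(pattern, r[col], maxsplit=len(new_cols) - 1)
        new_row = {k: v for k, v in r.items() if k != col}
        new_row.update(dict(zip(new_cols, parts)))
        out.append(new_row)
    return out

def fold(rows, cols, key_col, value_col):
    # Fold a set of columns into (key, value) rows, one row per folded column.
    out = []
    for r in rows:
        base = {k: v for k, v in r.items() if k not in cols}
        for c in cols:
            out.append(dict(base, **{key_col: c, value_col: r[c]}))
    return out

step1 = split_column(table, "name", r",\s*", ["last", "first"])
step2 = fold(step1, ["q1", "q2"], "quarter", "value")
for row in step2:
    print(row)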

We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions to specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].
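A minimal sketch of the described behavior follows: when a foreign key has no matching lookup row, the row is kept with an assumed-absent lookup value instead of being rejected. The table layout and the placeholder policy are our own assumptions, not IBIS's implementation.

orders = [
    {"order_id": 1, "cust_id": 10, "amount": 99.0},
    {"order_id": 2, "cust_id": 77, "amount": 15.0},  # 77 has no customer row
]
customers = {10: {"cust_id": 10, "name": "ACME"}}

def answer_with_assumed_absence(orders, customers):
    answers = []
    for o in orders:
        cust = customers.get(o["cust_id"])
        if cust is None:
            # Consistency traded for completeness: assume the customer exists
            # but is simply not (yet) known, rather than dropping the order.
            cust = {"cust_id": o["cust_id"], "name": None}
        answers.append({**o, "customer_name": cust["name"]})
    return answers

for row in answer_with_assumed_absence(orders, customers):
    print(row)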

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38] several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5] the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37] the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36] the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works the authors quickly move on to assume that control flow is the primary aspect of workflow modeling and do not deal with data-centric issues any further. It is particularly interesting that the WfMC standard [9] is not concerned with the role of business data at all. The primary focus of the standard is the interfaces that connect the different parts of a workflow engine and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach to modeling and managing ETL processes.

6. Discussion

In this section, we briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology that tries to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.
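As a toy illustration of this mapping, the following sketch tallies a naive, unadjusted function-point-like count over the five characteristics for a single activity description; the weights and the activity layout are illustrative assumptions and are not part of our metamodel or of [46,47].

activity = {
    "input_schemata":  ["DS.PS1.in"],          # ~ user inputs
    "output_schemata": ["DS.PS1.out"],         # ~ user outputs
    "parameters":      ["SOURCE", "LOOKUP"],   # ~ inquiries / interfaces
    "recordsets":      ["DS.PS1", "LOOKUP"],   # ~ employed files
    "linked_activities": ["SK1"],              # ~ external interfaces
}

WEIGHTS = {"inputs": 4, "outputs": 5, "inquiries": 4, "files": 10, "interfaces": 7}

def naive_function_points(a):
    counts = {
        "inputs": len(a["input_schemata"]),
        "outputs": len(a["output_schemata"]),
        "inquiries": len(a["parameters"]),
        "files": len(a["recordsets"]),
        "interfaces": len(a["linked_activities"]),
    }
    return sum(WEIGHTS[k] * v for k, v in counts.items())

print(naive_function_points(activity))  # 4 + 5 + 8 + 20 + 7 = 44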

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.
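A back-of-the-envelope check of these figures (our own arithmetic, not reported in [48]) shows why pipelining, rather than storing and reloading, is essential: the daily volume fits comfortably within the window, but a full reload of the fact table clearly would not.

rows_per_hour = 80e6          # reported throughput
daily_rows = 100e6            # reported daily volume
window_hours = 4              # reported loading window
fact_table_rows = 3e9         # reported fact table size

print(daily_rows / rows_per_hour)        # 1.25 h needed per day, within the 4 h window
print(window_hours * rows_per_hour)      # 320M rows: capacity of one window
print(fact_table_rows / rows_per_hour)   # 37.5 h for a full fact-table reload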

Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did its last checkpoint, thus speeding up the whole process significantly.
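The following sketch conveys the idea in a minimal form, assuming ordered output and a recorded checkpoint key; it is not the actual algorithm of [49], and the file name and batching policy are illustrative assumptions.

import json, os

CHECKPOINT_FILE = "activity.ckpt"   # illustrative file name

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        return json.load(open(CHECKPOINT_FILE))["last_key"]
    return None

def save_checkpoint(key):
    json.dump({"last_key": key}, open(CHECKPOINT_FILE, "w"))

def run_activity(ordered_input, emit, checkpoint_every=1000):
    # On restart, skip everything at or below the last checkpointed key;
    # this is safe only because the activity output is ordered by that key.
    last_key = load_checkpoint()
    emitted = 0
    for row in ordered_input:
        if last_key is not None and row["key"] <= last_key:
            continue                      # already emitted before the failure
        emit(row)
        emitted += 1
        if emitted % checkpoint_every == 0:
            save_checkpoint(row["key"])

# Example run over a toy ordered input.
rows = [{"key": k, "value": k * k} for k in range(1, 5001)]
run_activity(rows, emit=lambda r: None, checkpoint_every=1000)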

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have been formally defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios, based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, of the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques to streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, available at http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop (in conjunction with ACM SIGMOD), 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmukh, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 26: Etl design document

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 517

design quality by employing a set of metrics that are presented in [6], either for the whole scenario or for each activity of it.

The scenarios are stored in the ARKTOS II repository (implemented in a relational DBMS); the system allows the user to store, retrieve and reuse existing scenarios. All the metadata of the system, involving the scenario configuration, the employed templates and their constituents, are stored in the repository. The choice of a relational DBMS for our metadata repository allows its efficient querying, as well as the smooth integration with external systems and/or future extensions of ARKTOS II. The connectivity to source and target data stores is achieved through ODBC connections, and the tool offers an automatic reverse engineering of their schemata. We have implemented ARKTOS II with Oracle 8.1.7 as the basis for our repository and MS Visual Basic (Release 6) for developing our GUI.

An on-going activity is the coupling of ARKTOS II with state-of-the-art algorithms for individual ETL tasks (e.g., duplicate removal or surrogate key assignment) and with scheduling and monitoring facilities. Future plans for ARKTOS II involve the extension of data sources to more sophisticated data formats, outside the relational domain, like object-oriented or XML data.
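The schema reverse engineering mentioned above relies on the ODBC catalog interface. As a rough illustration only (not the actual ARKTOS II code, which was written in Visual Basic 6 against an Oracle 8.1.7 repository), the following Python sketch shows how a source schema could be read through ODBC metadata calls; the DSN name is a hypothetical placeholder.

# Illustrative sketch: reverse engineering a source schema via ODBC catalog calls.
# The DSN name "source_dsn" is hypothetical; this is not the tool's actual code.
import pyodbc

def reverse_engineer(dsn):
    """Return {table_name: [(column, type, size), ...]} for every table behind the DSN."""
    conn = pyodbc.connect(f"DSN={dsn}")
    cur = conn.cursor()
    # Materialize the table list first, because the same cursor is reused for columns().
    table_names = [t.table_name for t in cur.tables(tableType="TABLE")]
    schemata = {}
    for name in table_names:
        schemata[name] = [(c.column_name, c.type_name, c.column_size)
                          for c in cur.columns(table=name)]
    conn.close()
    return schemata

if __name__ == "__main__":
    for table, columns in reverse_engineer("source_dsn").items():
        print(table, columns)

The column names and types harvested this way can then be registered as recordset schemata in the metadata repository.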

5. Related work

In this section we report (a) on related commercial studies and tools in the field of ETL, (b) on related research efforts in academia on the issue, and (c) on applications of workflow technology in the field of data warehousing.

5.1. Commercial studies and tools

In a recent study [14], the authors report that, due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. The ETL market reached a size of $667 million for the year 2001; still, the growth rate reached a rather low 11% (as compared with a rate of 60% growth for the year 2000). This is explained by the overall economic downturn environment. In terms of technological aspects, the main characteristic of the area is the involvement of traditional database vendors with ETL solutions built into the DBMSs. The three major database vendors that practically ship ETL solutions "at no extra charge" are pinpointed: Oracle with Oracle Warehouse Builder [4], Microsoft with Data Transformation Services [3] and IBM with the Data Warehouse Center [1]. Still, the major vendors in the area are Informatica's PowerCenter [2] and Ascential's DataStage suites [15,16] (the latter being part of the IBM recommendations for ETL solutions). The study goes on to propose future technological challenges/forecasts that involve the integration of ETL with (a) XML adapters, (b) enterprise application integration (EAI) tools (e.g., MQ-Series), (c) customized data quality tools, and (d) the move towards parallel processing of the ETL workflows.

The aforementioned discussion is supported by a second recent study [17], where the authors note the decline in license revenue for pure ETL tools, mainly due to the crisis of IT spending and the appearance of ETL solutions from traditional database and business intelligence vendors. The Gartner study discusses the role of the three major database vendors (IBM, Microsoft, Oracle) and points out that they slowly start to take a portion of the ETL market through their DBMS-built-in solutions.

In the sequel, we elaborate more on the major vendors in the area of commercial ETL tools: we discuss the three tools that the major database vendors provide, as well as two ETL tools that are considered best sellers. We stress the fact that the former three have the benefit of minimum cost, because they are shipped with the database, while the latter two have the benefit of aiming at complex and deep solutions not envisioned by the generic products.

IBM. DB2 Universal Database offers the Data Warehouse Center [1], a component that automates data warehouse processing, and the DB2 Warehouse Manager, which extends the capabilities of the Data Warehouse Center with additional agents, transforms and metadata capabilities. The Data Warehouse Center is used to define the processes that move and transform data for the warehouse; the Warehouse Manager is used to schedule, maintain and monitor these processes. Within the Data Warehouse Center, the warehouse schema modeler is a specialized tool for generating and storing schemata associated with a data warehouse. Any schema resulting from this process can be passed as metadata to an OLAP tool. The process modeler allows the user to graphically link the steps needed to build and maintain data warehouses and dependent data marts. DB2 Warehouse Manager includes enhanced ETL function over and above the base capabilities of DB2 Data Warehouse Center. Additionally, it provides metadata management and repository functions, as well as an integration point for third-party independent software vendors through the information catalog.

Microsoft. The tool that is offered by Microsoft to implement its proposal for the Open Information Model is presented under the name of Data Transformation Services (DTS) [3,18]. DTS are the data-manipulation utility services in SQL Server (from version 7.0) that provide import, export and data-manipulating services between OLE DB [19], ODBC and ASCII data stores. DTS are characterized by a basic object, called a package, that stores information on the aforementioned tasks and the order in which they need to be launched. A package can include one or more connections to different data sources, and different tasks and transformations that are executed as steps that define a workflow process [20]. The software modules that support DTS are shipped with MS SQL Server. These modules include:

DTS designer: a GUI used to interactively design and execute DTS packages.

DTS export and import wizards: wizards that ease the process of defining DTS packages for the import, export and transformation of data.

DTS programming interfaces: a set of OLE Automation and a set of COM interfaces to create customized transformation applications for any system supporting OLE Automation or COM.

Oracle. Oracle Warehouse Builder [4,21] is a repository-based tool for ETL and data warehousing. The basic architecture comprises two components, the design environment and the runtime environment. Each of these components handles a different aspect of the system: the design environment handles metadata, the runtime environment handles physical data. The metadata component revolves around the metadata repository and the design tool. The repository is based on the Common Warehouse Model (CWM) standard and consists of a set of tables in an Oracle database that are accessed via a Java-based access layer. The front-end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. The data component revolves around the runtime environment and the warehouse database. The Warehouse Builder runtime is a set of tables, sequences, packages and triggers that are installed in the target schema. The code generator, based on the definitions stored in the repository, creates the code necessary to implement the warehouse. Warehouse Builder generates extraction-specific languages (SQL*Loader control files for flat files, ABAP for SAP/R3 extraction and PL/SQL for all other systems) for the ETL processes, and SQL DDL statements for the database objects. The generated code is deployed either to the file system or into the database.

Ascential Software. The DataStage XE suite from Ascential Software [15,16] (formerly Informix Business Solutions) is an integrated data warehouse development toolset that includes an ETL tool (DataStage), a data quality tool (Quality Manager) and a metadata management tool (MetaStage). The DataStage ETL component consists of four design and administration modules, Manager, Designer, Director and Administrator, as well as a metadata repository and a server. The DataStage Manager is the basic metadata management tool. In the Designer module of DataStage, ETL tasks execute within individual "stage" objects (source, target and transformation stages) in order to create ETL tasks. The Director is DataStage's job validation and scheduling module. The DataStage Administrator is primarily for controlling security functions. The DataStage Server is the engine that moves data from source to target.

Informatica. Informatica PowerCenter [2] is the industry-leading (according to recent studies [14,17]) data integration platform for building, deploying and managing enterprise data warehouses and other data integration projects. The workhorse of Informatica PowerCenter is a data integration engine that executes all data extraction, transformation, migration and loading functions in-memory, without generating code or requiring developers to hand-code these procedures. The PowerCenter data integration engine is metadata driven, creating a repository-and-engine partnership that ensures data integration processes are optimally executed.

5.2. Research efforts

Research focused specifically on ETL. The AJAX system [22] is a data cleaning tool developed at INRIA, France. It deals with typical data quality problems, such as the object identity problem [23], errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.

Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs, and each such pair is assigned a similarity value.

Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., transitive closure).

Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled, and to support a stepwise refinement design of data cleaning programs. The theoretic foundations of this tool can be found in [24], where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.
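To make the four transformation types more concrete, the following Python sketch mimics a mapping, matching, clustering and merging pipeline on a toy customer list. It is an illustration of the concepts only; it does not use AJAX's actual declarative syntax, and the field names, similarity measure and 0.8 threshold are assumptions.

# Illustrative only: the four AJAX-style transformation types on a toy data set.
# Field names, similarity measure and the 0.8 threshold are assumptions, not AJAX syntax.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": " John Smith "},
    {"id": 2, "name": "john smith"},
    {"id": 3, "name": "Jane Doe"},
]

# Mapping: standardize formats (here, trim and lowercase the name column).
mapped = [{**r, "name": r["name"].strip().lower()} for r in records]

# Matching: find pairs that most probably refer to the same object, with a similarity value.
def sim(a, b):
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

matching_pairs = [(a["id"], b["id"], sim(a, b))
                  for a, b in combinations(mapped, 2) if sim(a, b) >= 0.8]

# Clustering: group matching pairs by transitive closure (a small union-find).
parent = {r["id"]: r["id"] for r in mapped}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for a, b, _ in matching_pairs:
    parent[find(a)] = find(b)

clusters = {}
for r in mapped:
    clusters.setdefault(find(r["id"]), []).append(r)

# Merging: collapse each cluster into a single record for the integrated data source.
merged = [{"ids": [m["id"] for m in members], "name": members[0]["name"]}
          for members in clusters.values()]
print(merged)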

Raman et al. [25,26] present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way: they gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface, and the effect of a transform is shown at once on the records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
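As a rough illustration of a few of these algebraic operations (not Potter's Wheel's actual implementation or syntax), the following Python sketch applies a split, a divide and a fold to a small table; the column names and the splitting rule are assumptions.

# Illustrative sketch of three Potter's Wheel-style operations: split, divide and fold.
# Column names and rules are hypothetical; this is not the tool's actual code.
import re

rows = [
    {"name": "Smith, John", "q1": 10, "q2": 15},
    {"name": "Doe, Jane",   "q1": 7,  "q2": 3},
]

# Split: break a column on a regular expression (here "last, first" into two columns).
split_rows = []
for r in rows:
    last, first = re.split(r",\s*", r["name"], maxsplit=1)
    split_rows.append({"first": first, "last": last, "q1": r["q1"], "q2": r["q2"]})

# Divide: partition on a predicate, yielding the rows that satisfy it and the rest.
high = [r for r in split_rows if r["q1"] > 8]
low  = [r for r in split_rows if not r["q1"] > 8]

# Fold: turn a set of attributes of a record into several rows (one per quarter).
folded = [{"first": r["first"], "last": r["last"], "quarter": q, "value": r[q]}
          for r in split_rows for q in ("q1", "q2")]

print(high, low, folded, sep="\n")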

We believe that the AJAX tool is mostly oriented towards the integration of web data (which is also supported by the ontology of its algebraic transformations); at the same time, Potter's Wheel is mostly oriented towards an interactive data cleaning tool, where the users interactively choose data. With respect to these approaches, we believe that our technique contributes (a) by offering an extensible framework through a uniform extensibility mechanism, and (b) by providing formal foundations to allow the reasoning over the constructed ETL scenarios. Clearly, ARKTOS II is a design tool for traditional data warehouse flows; therefore, we find the aforementioned approaches complementary (especially Potter's Wheel). At the same time, when contrasted with the industrial tools, it is evident that, although ARKTOS II is only a design environment for the moment, the industrial tools lack the logical abstraction that our model, implemented in ARKTOS II, offers; on the contrary, industrial tools are concerned directly with the physical perspective (at least to the best of our knowledge).

Data quality and cleaning. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in [27]. A collection of articles on data transformations [28] offers a discussion on various aspects of this research area. A collection of articles on data cleaning [29] (including a survey [30]) provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., [31,32]. In a related, still different, context, we would like to mention the IBIS tool [33]. IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Departing from the traditional data integration literature, though, IBIS brings the issue of data quality into the integration process. The system takes advantage of the definition of constraints at the intensional level (e.g., foreign key constraints) and tries to provide answers that resolve semantic conflicts (e.g., the violation of a foreign key constraint). The interesting aspect here is that consistency is traded for completeness. For example, whenever an offending row is detected over a foreign key constraint, instead of assuming the violation of consistency, the system assumes the absence of the appropriate lookup value and adjusts its answers to queries accordingly [34].
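A minimal sketch of the behavior just described, under assumed table and column names (this is not IBIS code): when a fact row references a key that is missing from the lookup table, the row is kept and a placeholder lookup entry is assumed, rather than discarding the row as inconsistent.

# Illustrative only: trading consistency for completeness on a foreign key violation.
# Table and column names are hypothetical.
customers = {10: {"cust_id": 10, "name": "ACME"}}           # lookup (referenced) table
orders = [{"order_id": 1, "cust_id": 10, "amount": 500.0},
          {"order_id": 2, "cust_id": 99, "amount": 120.0}]  # cust_id 99 violates the FK

def answer_join(orders, customers):
    """Join orders to customers, assuming a missing lookup value instead of rejecting."""
    result = []
    for o in orders:
        cust = customers.get(o["cust_id"])
        if cust is None:
            # Instead of treating the row as inconsistent, assume the lookup value
            # exists but is unknown, and keep the order in the answer.
            cust = {"cust_id": o["cust_id"], "name": None}
        result.append({**o, "customer_name": cust["name"]})
    return result

print(answer_join(orders, customers))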

Workflows. To the best of our knowledge, research on workflows is focused around the following recurring themes: (a) modeling [5,9,35–37], where the authors are primarily concerned with providing a metamodel for workflows; (b) correctness issues [35–37], where criteria are established to determine whether a workflow is well formed; and (c) workflow transformations [35–37], where the authors are concerned with correctness issues in the evolution of the workflow from a certain plan to another.

In the literature, there is a standard proposed by the Workflow Management Coalition (WfMC) [9]. The standard includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process, and other information about participants, invoked applications and relevant data. Also, several other kinds of entities which are external to the workflow, such as system and environmental data or the organizational model, are roughly described. In [38], several interesting research results on workflow management are presented, in the fields of electronic commerce, distributed execution and adaptive workflows; still, there is no reference to data flow modeling efforts. In [5], the authors provide an overview of the most frequent control flow patterns in workflows. The patterns refer explicitly to control flow structures, like activity sequence, AND/XOR/OR split/join, and so on. Several commercial tools are evaluated against the 26 patterns presented. In [35–37], the authors, based on minimal metamodels, try to provide correctness criteria in order to derive equivalent plans for the same workflow scenario.

In more than one work [5,36], the authors mention the necessity for the perspectives already discussed in the introduction of the paper. Data flow or data dependencies are listed within the components of the definition of a workflow; still, in all these works, the authors quickly move on to assume that control flow is the primary aspect of workflow modeling, and do not deal with data-centric issues any further.

It is particularly interesting that the WfMC standard [9] is not concerned with the role of business data at all. The primary focus of the WfMC standard is the interfaces that connect the different parts of a workflow engine, and the transitions between the states of a workflow. No reference is made to business data (although the standard refers to data which are relevant for the transition from one state to another, under the name workflow related data).

5.3. Applications of ETL workflows in data warehouses

Finally, we would like to mention that the literature reports several efforts (both research and industrial) for the management of processes and workflows that operate on data warehouse systems. In [39], the authors describe an industrial effort where the cleaning mechanisms of the data warehouse are employed in order to avoid the population of the sources with problematic data in the first place. The described solution is based on a workflow that employs techniques from the field of view maintenance. The industrial effort at Deutsche Bank, involving the import/export, transformation and cleaning, and storage of data in a Terabyte-size data warehouse, is described in Ref. [40]. The paper also explains the usage of metadata management techniques, which involves a broad spectrum of applications, from the import of data to the management of dimensional data and, more importantly, to the querying of the data warehouse. A research effort (and its application in an industrial setting) for the integration and central management of the processes that lie around an information system is presented in the work of Jarke et al. [41]. A metadata management repository is employed to store the different activities of a large workflow, along with important data that these processes employ.

Finally, we should refer the interested reader to [6] for a detailed presentation of the ARKTOS II model. The model is accompanied by a set of importance metrics, where we exploit the graph structure to measure the degree to which activities/recordsets/attributes are bound to their data providers or consumers. In [42] we propose a complementary conceptual model for ETL scenarios, and in [43] a methodology for constructing it. Ref. [44] abstractly describes our approach of modeling and managing ETL processes.

6. Discussion

In this section, we would like to briefly discuss some comments on the overall evaluation of our approach. Our proposal involves the data modeling part of ETL activities, which are modeled as workflows in our setting; nevertheless, it is not clear whether we have covered all possible problems around the topic. Therefore, in this section we explore three issues as an overall assessment of our proposal. First, we discuss its completeness, i.e., whether there are parts of the data modeling that we have missed. Second, we discuss the possibility of further generalizing our approach to the general case of workflows. Finally, we exit the domain of logical design and deal with performance and stability concerns around ETL workflows.

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement.

In any case, function points compute the measurement values based on the five following characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology, but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.
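As a back-of-the-envelope illustration of how the five function point characteristics could be mapped to an ETL activity, consider the following Python sketch. The counts, the "average complexity" weights and the hypothetical activity are assumptions made for the sake of the example, not figures from the paper or from the function point standard as applied here.

# Illustrative unadjusted function point count for a hypothetical ETL activity
# (e.g., a surrogate key assignment activity with one input schema, one output
# schema, one parameter set and one lookup file). Counts and the commonly cited
# "average complexity" weights are assumptions for the sake of the example.
weights = {
    "user_inputs": 4,         # external inputs        (input schema)
    "user_outputs": 5,        # external outputs       (output schema)
    "user_inquiries": 4,      # external inquiries     (parameter set)
    "employed_files": 10,     # internal logical files (lookup table)
    "external_interfaces": 7, # external interfaces    (link to the next activity)
}
counts = {
    "user_inputs": 1,
    "user_outputs": 1,
    "user_inquiries": 1,
    "employed_files": 1,
    "external_interfaces": 1,
}

unadjusted_fp = sum(counts[c] * weights[c] for c in counts)
print("Unadjusted function points:", unadjusted_fp)   # 4 + 5 + 4 + 10 + 7 = 30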

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to the ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that, due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting, we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, to our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80 M rows/h and 100 M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a 4 h loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.

Based on the above, we believe that the quest for a workflow, rather than a push-and-store, paradigm is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers in the issues of software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did the last checkpoint, thus speeding up the whole process significantly.
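The following Python sketch conveys the intuition behind checkpoint-based resumption for an activity with ordered output. It is a simplification for illustration only, not the algorithm of [49]; the checkpoint file name and the record layout are assumptions.

# Illustrative sketch of checkpoint-based resumption over an ordered output.
# Not the algorithm of Labio et al.; file name and record layout are hypothetical.
import json, os

CHECKPOINT = "load_checkpoint.json"   # hypothetical checkpoint file

def load(ordered_rows, target, checkpoint_every=1000):
    """Load rows ordered by 'key', checkpointing the last key written to the target."""
    last_key = None
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            last_key = json.load(f)["last_key"]   # resume: skip what was already loaded
    for i, row in enumerate(ordered_rows, 1):
        if last_key is not None and row["key"] <= last_key:
            continue
        target.append(row)                        # stand-in for the actual warehouse insert
        if i % checkpoint_every == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump({"last_key": row["key"]}, f)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)                     # a completed load needs no checkpoint

target_table = []
load([{"key": k, "value": 10 * k} for k in range(1, 5001)], target_table)
print(len(target_table), "rows loaded")

Because the output is ordered by key, any row at or below the checkpointed key is known to be already in the target, so a restart can skip straight past it instead of redoing the whole load.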

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is also another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse

[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm

[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com

[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html

[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site at http://www.tm.tue.nl/research/patterns/documentation.htm

[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52-61, Toronto, Canada, 2002.

[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June 2003.

[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.

[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org

[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.

[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl, December 1998.

[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.

[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs

[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.

[15] Ascential Software Inc., available at http://www.ascentialsoftware.com

[16] Ascential Software, Products—Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html

[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.

[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.

[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb

[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.

[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.

[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.

[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop (in conj. with ACM SIGMOD), 1999.

[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).

[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf

[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381-390, Roma, Italy, 2001.

[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.

[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).

[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).

[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79-94, Springer, 2003.

[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262-279, Springer, 2002.

[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326-339, Bratislava, Slovakia, September 8-11, 2002.

[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.

[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431-445, Stockholm, Sweden, June 5-9, 2000.

[38] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf

[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.

[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14-21, McLean, VA, USA, 2002.

[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.

[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.

[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.

[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.

[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.

[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.

[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46-57, Dallas, TX, USA, 2000.

[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247-262, 2002.

[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 27: Etl design document

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525518

schedule maintain and monitor these processesWithin the Data Warehouse Center the warehouse

schema modeler is a specialized tool for generatingand storing schema associated with a data ware-house Any schema resulting from this process canbe passed as metadata to an OLAP tool Theprocess modeler allows user to graphically link thesteps needed to build and maintain data ware-houses and dependent data marts DB2 Ware-house Manager includes enhanced ETL functionover and above the base capabilities of DB2 DataWarehouse Center Additionally it provides me-tadata management repository function as suchan integration point for third-party independentsoftware vendors through the information catalog

Microsoft The tool that is offered by Microsoftto implement its proposal for the Open Informa-tion Model is presented under the name of Data

Transformation Services(DTS) [318] DTS are thedata-manipulation utility services in SQL Server(from version 70) that provide import export anddata-manipulating services between OLE DB [19]ODBC and ASCII data stores DTS are char-acterized by a basic object called a package thatstores information on the aforementioned tasksand the order in which they need to be launched Apackage can include one or more connections todifferent data sources and different tasks andtransformations that are executed as steps thatdefine a workflow process [20] The softwaremodules that support DTS are shipped with MSSQL Server These modules include

DTS designer A GUI used to interactivelydesign and execute DTS packages

DTS export and import wizards Wizards thatease the process of defining DTS packages forthe import export and transformation of data

DTS programming interfaces A set of OLEAutomation and a set of COM interfaces tocreate customized transformation applicationsfor any system supporting OLE automation orCOM

Oracle Oracle Warehouse Builder [421] is arepository-based tool for ETL and data ware-housing The basic architecture comprises twocomponents the design environment and the

runtime environment Each of these componentshandles a different aspect of the system the designenvironment handles metadata the runtime en-vironment handles physical data The metadatacomponent revolves around the metadata reposi-tory and the design tool The repository is basedon the Common Warehouse Model (CWM)standard and consists of a set of tables in anOracle database that are accessed via a Java-basedaccess layer The front-end of the tool (entirelywritten in Java) features wizards and graphicaleditors for logging onto the repository The datacomponent revolves around the runtime environ-ment and the warehouse database The WarehouseBuilder runtime is a set of tables sequencespackages and triggers that are installed in thetarget schema The code generator that bases onthe definitions stores in the repository it createsthe code necessary to implement the warehouseWarehouse Builder generates extraction specificlanguages (SQLLoader control files for flat filesABAP for SAPR3 extraction and PLSQL for allother systems) for the ETL processes and SQLDDL statements for the database objects Thegenerated code is deployed either to the file systemor into the database

Ascential software DataStage XE suite fromAscential Software [1516] (formerly InformixBusiness Solutions) is an integrated data ware-house development toolset that includes an ETLtool (DataStage) a data quality tool (QualityManager) and a metadata management tool(MetaStage) The DataStage ETL componentconsists of four design and administration mod-ules Manager Designer Director and Adminis-

trator as such a metadata repository and a serverThe DataStage Manager is the basic metadatamanagement tool In the Designer module ofDataStage ETL tasks execute within individuallsquolsquostagersquorsquo objects (source target and transformationstages) in order to create ETL tasks The Directoris DataStagersquos job validation and schedulingmodule The DataStage Administrator is primarilyfor controlling security functions The DataStageServer is the engine that moves data from source totarget

Informatica Informatica PowerCenter [2] is theindustry-leading (according to recent studies

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 519

[1417]) data integration platform for buildingdeploying and managing enterprise data ware-houses and other data integration projects Theworkhorse of Informatica PowerCenter is a dataintegration engine that executes all data extrac-tion transformation migration and loading func-tions in-memory without generating code orrequiring developers to hand-code these proce-dures The PowerCenter data integration engine ismetadata driven creating a repository-and-enginepartnership that ensures data integration processesare optimally executed

52 Research efforts

Research focused specifically on ETL The AJAX

system [22] is a data cleaning tool developed atINRIA France It deals with typical data qualityproblems such as the object identity problem [23]errors due to mistyping and data inconsistencies

between matching records This tool can be usedeither for a single source or for integratingmultiple data sources AJAX provides a frame-work wherein the logic of a data cleaning programis modeled as a directed graph of data transforma-tions that start from some input source data Fourtypes of data transformations are supported

Mapping transformations standardize data for-mats (eg date format) or simply merge or splitcolumns in order to produce more suitableformatsMatching transformations find pairs of recordsthat most probably refer to same object Thesepairs are called matching pairs and each suchpair is assigned a similarity valueClustering transformations group togethermatching pairs with a high similarity value byapplying a given grouping criteria (eg bytransitive closure)Merging transformations are applied to eachindividual cluster in order to eliminate dupli-cates or produce new records for the resultingintegrated data source

AJAX also provides a declarative language forspecifying data cleaning programs which consistsof SQL statements enriched with a set of specific

primitives to express mapping matching cluster-ing and merging transformations Finally ainteractive environment is supplied to the user inorder to resolve errors and inconsistencies thatcannot be automatically handled and support astepwise refinement design of data cleaningprograms The theoretic foundations of this toolcan be found in [24] where apart from thepresentation of a general framework for the datacleaning process specific optimization techniquestailored for data cleaning applications arediscussedRaman et al [2526] present the Potterrsquos Wheel

system which is targeted to provide interactivedata cleaning to its users The system offers thepossibility of performing several algebraic opera-tions over an underlying data set including format

(application of a function) drop copy add acolumn merge delimited columns split a columnon the basis of a regular expression or a position ina string divide a column on the basis of a predicate(resulting in two columns the first involving therows satisfying the condition of the predicate andthe second involving the rest) selection of rows onthe basis of a condition folding columns (where aset of attributes of a record is split into severalrows) and unfolding Optimization algorithms arealso provided for the CPU usage for certain classesof operators The general idea behind PotterrsquosWheel is that users build data transformations initerative and interactive way In the backgroundPotterrsquos Wheel automatically infers structures fordata values in terms of user-defined domains andaccordingly checks for constraint violations Usersgradually build transformations to clean the databy adding or undoing transforms on a spread-sheet-like interface the effect of a transform isshown at once on records visible on screen Thesetransforms are specified either through simplegraphical operations or by showing the desiredeffects on example data values In the backgroundPotterrsquos Wheel automatically infers structures fordata values in terms of user-defined domains andaccordingly checks for constraint violations Thususers can gradually build a transformation asdiscrepancies are found and clean the data with-out writing complex programs or enduring longdelays

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525520

We believe that the AJAX tool is mostlyoriented towards the integration of web data(which is also supported by the ontology of itsalgebraic transformations) at the same timePotterrsquos wheel is mostly oriented towards aninteractive data cleaning tool where the usersinteractively choose data With respect to theseapproaches we believe that our technique con-tributes (a) by offering an extensible frameworkthough a uniform extensibility mechanism and (b)by providing formal foundations to allow thereasoning over the constructed ETL scenariosClearly ARKTOS II is a design tool for traditionaldata warehouse flows therefore we find theaforementioned approaches complementary (espe-cially Potterrsquos Wheel) At the same time whencontrasted with the industrial tools it is evidentthat although ARKTOS II is only a design environ-ment for the moment the industrial tools lack thelogical abstraction that our model implemented inARKTOS II offers on the contrary industrial toolsare concerned directly with the physical perspec-tive (at least to the best of our knowledge)

Data quality and cleaning An extensive reviewof data quality problems and related literaturealong with quality management methodologiescan be found in [27] A collection of articles ondata transformations [28] offers a discussion onvarious aspects of this research area A collectionof articles on data cleaning [29] (including a survey[30]) provides an extensive overview of the fieldalong with research issues and a review of somecommercial tools and solutions on specific pro-blems eg [3132] In a related still differentcontext we would like to mention the IBIS tool[33] IBIS is an integration tool following theglobal-as-view approach to answer queries in amediated system Departing from the traditionaldata integration literature though IBIS brings theissue of data quality in the integration process Thesystem takes advantage of the definition ofconstraints at the intentional level (eg foreignkey constraints) and tries to provide answers thatresolve semantic conflicts (eg the violation of aforeign key constraint) The interesting aspect hereis that consistency is traded for completeness Forexample whenever an offending row is detectedover a foreign key constraint instead of assuming

the violation of consistency the system assumesthe absence of the appropriate lookup value andadjusts its answers to queries accordingly [34]

Workflows To the best of our knowledgeresearch on workflows is focused around thefollowing reoccurring themes (a) modeling[59353637] where the authors are primarilyconcerned in providing a metamodel for work-flows (b) correctness issues [35ndash37] where criteriaare established to determine whether a workflow iswell formed and (c) workflow transformations[35ndash37] where the authors are concerned oncorrectness issues in the evolution of the workflowfrom a certain plan to anotherIn the literature there is a standard proposed by

the workflow management coalition (WfMC) [9]The standard includes a metamodel for thedescription of a workflow process specificationand a textual grammar for the interchange ofprocess definitions A workflow process comprisesof a network of activities their interrelationshipscriteria for staringending a process and otherinformation about participants invoked applica-

tions and relevant data Also several other kindsof entities which are external to the workflow suchas system and environmental data or the organiza-tional model are roughly described In [38] severalinteresting research results on workflow manage-ment are presented in the field of electroniccommerce distributed execution and adaptiveworkflows Still there is no reference to data flowmodeling efforts In [5] the authors provide anoverview of the most frequent control flowpatterns in workflows The patterns refer explicitlyto control flow structures like activity sequenceANDXOROR splitjoin and so on Severalcommercial tools are evaluated against the 26patterns presented In [35ndash37] the authors basedon minimal metamodels try to provide correctnesscriteria in order to derive equivalent plans for thesame workflow scenarioIn more than one work [536] the authors

mention the necessity for the perspectives alreadydiscussed in the introduction of the paper Dataflow or data dependencies are listed within thecomponents of the definition of a workflow still inall these works the authors quickly move on toassume that control flow is the primary aspect of

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 521

workflow modeling and do not deal with data-centric issues any further It is particularly inter-esting that the [9] standard is not concerned withthe role of business data at all The primary focusof the WfMC standard is the interfaces thatconnect the different parts of a workflow engineand the transitions between the states of a work-flow No reference is made to business data(although the standard refers to data which arerelevant for the transition from one state toanother under the name workflow related data)

53 Applications of ETL workflows in data

warehouses

Finally we would like to mention that theliterature reports several efforts (both research andindustrial) for the management of processes andworkflows that operate on data warehouse sys-tems In [39] the authors describe an industrialeffort where the cleaning mechanisms of the datawarehouse are employed in order to avoid thepopulation of the sources with problematic data inthe fist place The described solution is based on aworkflow that employs techniques from the field ofview maintenance The industrial effort at DeutcheBank involving the importexport transformationand cleaning and storage of data in a Terabyte-sizedata warehouse is described in Ref [40] The paperexplains also the usage of metadata managementtechniques which involves a broad spectrum ofapplications from the import of data to themanagement of dimensional data and moreimportantly for the querying of the data ware-house A research effort (and its application in anindustrial application) for the integration andcentral management of the processes that liearound an information system is presented in thework of Jarke et al [41] A metadata managementrepository is employed to store the differentactivities of a large workflow along with impor-tant data that these processes employFinally we should refer the interested reader to

[6] for a detailed presentation of ARKTOS II modelThe model is accompanied by a set of importance

metrics where we exploit the graph structure tomeasure the degree to which activitiesrecordsetsattributes are bound to their data providers or

consumers In [42] we propose a complementaryconceptual model for ETL scenarios and in [43] amethodology for constructing it Ref [44] ab-stractly describes our approach of modeling andmanaging ETL processes

6 Discussion

In this section we would like to briefly discusssome comments on the overall evaluation of ourapproach Our proposal involves the data model-ing part of ETL activities which are modeled asworkflows in our setting nevertheless it is notclear whether we covered all possible problemsaround the topic Therefore in this section we willexplore three issues as an overall assessment of ourproposal First we will discuss its completenessie whether there are parts of the data modelingthat we have missed Second we will discuss thepossibility of further generalizing our approach tothe general case of workflows Finally we will exitthe domain of the logical design and deal withperformance and stability concerns around ETLworkflows

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology that tries to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.
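As a rough illustration of this correspondence, the sketch below (in Python, with invented names and the classic average-complexity weights used in function point analysis [46]) shows how an activity described by its input/output schemata, parameters, employed recordsets and linked activities could be mapped onto the five characteristics and given an unadjusted function point count. It is a sketch of the argument, not part of our metamodel or tool.

# A minimal sketch of the argument above: an ETL activity described by its
# input/output schemata, parameters and neighbouring recordsets/activities
# already exposes the five characteristics that function points count.
# The weights are the classic average-complexity weights of function point
# analysis; the activity below and its counts are purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    input_schemata: list = field(default_factory=list)     # (i) user inputs
    output_schemata: list = field(default_factory=list)    # (ii) user outputs
    parameters: dict = field(default_factory=dict)         # related to (iii)/(v)
    recordsets: list = field(default_factory=list)         # (iv) employed files
    linked_activities: list = field(default_factory=list)  # (v) external interfaces

AVG_WEIGHTS = {"inputs": 4, "outputs": 5, "inquiries": 4, "files": 10, "interfaces": 7}

def unadjusted_function_points(a: Activity) -> int:
    counts = {
        "inputs": len(a.input_schemata),
        "outputs": len(a.output_schemata),
        "inquiries": len(a.parameters),      # parameters treated as inquiries here
        "files": len(a.recordsets),
        "interfaces": len(a.linked_activities),
    }
    return sum(counts[k] * AVG_WEIGHTS[k] for k in counts)

sk = Activity("SK1", ["a_in"], ["a_out"], {"lookup": "LOOKUP_PS"}, ["LOOKUP_PS"], ["NN1"])
print(unadjusted_function_points(sk))   # 1*4 + 1*5 + 1*4 + 1*10 + 1*7 = 30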

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the context of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, in our view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With 4 h of loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.
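A simple back-of-envelope computation clarifies the pressure that such a loading window creates. Reading the figures quoted above as roughly 100M rows arriving per day, a 4-hour window, and a sustained rate of about 80M rows/h, the required and achieved throughputs compare as follows (the calculation is ours, not part of [48]):

# Back-of-envelope check of the loading-window pressure reported in [48]
# (figures as quoted above; the reading and the calculation are ours).
daily_rows = 100_000_000          # ~100M rows/day arriving
window_hours = 4                  # nightly loading window
achieved_throughput = 80_000_000  # ~80M rows/h sustained by the platform

required_throughput = daily_rows / window_hours          # rows/h needed
headroom = achieved_throughput / required_throughput

print(f"required: {required_throughput:,.0f} rows/h")    # 25,000,000 rows/h
print(f"achieved/required ratio: {headroom:.1f}x")       # 3.2x headroom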


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers with respect to software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did its last checkpoint, thus speeding up the whole process significantly.
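A minimal sketch of the underlying idea, in Python: if an activity emits its output ordered on some key and the loader remembers the last key it committed, a restarted load can simply skip the already-persisted prefix. This only illustrates why ordered output makes resumption cheap; the resumption algorithms of [49] are considerably more elaborate, and the names below are invented.

# Order-based resumption sketch (illustrative only, in the spirit of [49]).
def resume_load(ordered_rows, target, last_committed_key=None):
    """ordered_rows: iterable of (key, payload) sorted on key;
    target: object exposing .write(row) and .commit(key)."""
    for key, payload in ordered_rows:
        if last_committed_key is not None and key <= last_committed_key:
            continue                      # already loaded before the failure
        target.write((key, payload))
        target.commit(key)                # checkpoint: remember the progress

class PrintTarget:
    def write(self, row): print("load", row)
    def commit(self, key): self.last = key

# Resuming after a failure that had committed up to key 1 loads only keys 2 and 3.
resume_load([(1, "a"), (2, "b"), (3, "c")], PrintTarget(), last_committed_key=1)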

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, outside the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse.
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm.
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com.
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html.
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm.
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), Toronto, Canada, 2002, pp. 52-61.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), Klagenfurt/Velden, Austria, 16-20 June 2003, pp. 520-535.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org.
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62-65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs.
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com.
[16] Ascential Software, Products - Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html.
[17] Gartner Inc., ETL Magic Quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9-14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb.
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conj. with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf.
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, 2001, pp. 381-390.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmuk, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), Lecture Notes in Computer Science, vol. 2681, Springer, 2003, pp. 79-94.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), Lecture Notes in Computer Science, vol. 2348, Springer, 2002, pp. 262-279.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), Bratislava, Slovakia, September 8-11, 2002, pp. 326-339.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9-12, 2000, pp. 267-280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, June 5-9, 2000, pp. 431-445.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf.
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537-538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), McLean, VA, USA, 2002, pp. 14-21.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83-92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas, TX, USA, 2000, pp. 46-57.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, 2002, pp. 247-262.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307-316.

tions and relevant data Also several other kindsof entities which are external to the workflow suchas system and environmental data or the organiza-tional model are roughly described In [38] severalinteresting research results on workflow manage-ment are presented in the field of electroniccommerce distributed execution and adaptiveworkflows Still there is no reference to data flowmodeling efforts In [5] the authors provide anoverview of the most frequent control flowpatterns in workflows The patterns refer explicitlyto control flow structures like activity sequenceANDXOROR splitjoin and so on Severalcommercial tools are evaluated against the 26patterns presented In [35ndash37] the authors basedon minimal metamodels try to provide correctnesscriteria in order to derive equivalent plans for thesame workflow scenarioIn more than one work [536] the authors

mention the necessity for the perspectives alreadydiscussed in the introduction of the paper Dataflow or data dependencies are listed within thecomponents of the definition of a workflow still inall these works the authors quickly move on toassume that control flow is the primary aspect of

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 521

workflow modeling and do not deal with data-centric issues any further It is particularly inter-esting that the [9] standard is not concerned withthe role of business data at all The primary focusof the WfMC standard is the interfaces thatconnect the different parts of a workflow engineand the transitions between the states of a work-flow No reference is made to business data(although the standard refers to data which arerelevant for the transition from one state toanother under the name workflow related data)

53 Applications of ETL workflows in data

warehouses

Finally we would like to mention that theliterature reports several efforts (both research andindustrial) for the management of processes andworkflows that operate on data warehouse sys-tems In [39] the authors describe an industrialeffort where the cleaning mechanisms of the datawarehouse are employed in order to avoid thepopulation of the sources with problematic data inthe fist place The described solution is based on aworkflow that employs techniques from the field ofview maintenance The industrial effort at DeutcheBank involving the importexport transformationand cleaning and storage of data in a Terabyte-sizedata warehouse is described in Ref [40] The paperexplains also the usage of metadata managementtechniques which involves a broad spectrum ofapplications from the import of data to themanagement of dimensional data and moreimportantly for the querying of the data ware-house A research effort (and its application in anindustrial application) for the integration andcentral management of the processes that liearound an information system is presented in thework of Jarke et al [41] A metadata managementrepository is employed to store the differentactivities of a large workflow along with impor-tant data that these processes employFinally we should refer the interested reader to

[6] for a detailed presentation of ARKTOS II modelThe model is accompanied by a set of importance

metrics where we exploit the graph structure tomeasure the degree to which activitiesrecordsetsattributes are bound to their data providers or

consumers In [42] we propose a complementaryconceptual model for ETL scenarios and in [43] amethodology for constructing it Ref [44] ab-stractly describes our approach of modeling andmanaging ETL processes

6 Discussion

In this section we would like to briefly discusssome comments on the overall evaluation of ourapproach Our proposal involves the data model-ing part of ETL activities which are modeled asworkflows in our setting nevertheless it is notclear whether we covered all possible problemsaround the topic Therefore in this section we willexplore three issues as an overall assessment of ourproposal First we will discuss its completenessie whether there are parts of the data modelingthat we have missed Second we will discuss thepossibility of further generalizing our approach tothe general case of workflows Finally we will exitthe domain of the logical design and deal withperformance and stability concerns around ETLworkflows

Completeness A first concern that arisesinvolves the completeness of our approach Webelieve that the different layers of Fig 1 fully coverthe different aspects of workflow modeling Wewould like to make clear that we focus on the data-oriented part of the modeling since ETL activitiesare mostly concerned with a well-establishedautomated flow of cleanings and transformationsrather than an interactive session where user

decisions and actions direct the flow (like forexample in [45])Still is this enough to capture all the aspects of

the data-centric part of ETL activities Clearly wedo not provide any lsquolsquoformalrsquorsquo proof for thecompleteness of our approach Nevertheless wecan justify our basic assumptions based on therelated literature in the field of software metricsand in particular on the method of function points

[4647] Function points is a methodology tryingto quantify the functionality (and thus the re-quired development effort) of an applicationAlthough based on assumptions that pertain tothe technological environment of the late 1970s

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525522

the methodology is still one of the most cited in thefield of software measurement In any casefunction points compute the measurement valuesbased on the five following characteristics (i) userinputs (ii) user outputs (iii) user inquiries (iv)employed files and (v) external interfacesWe believe that an activity in our setting covers

all the above quite successfully since (a) it employsinput and output schemata to obtain and forwarddata (characteristics i ii and iii) (b) communicateswith files (characteristic iv) and other activities(practically characteristic v) Moreover it is tunedby some user-provided parameters which are notexplicitly captured by the overall methodology butare quite related to characteristics (iii) and (v) Asa more general view on the topic we could claimthat it is sufficient to characterize activities withinput and output schemata in order to denotetheir linkage to data (and other activities too)while treating parameters as part of the input andor output of the activity depending on theirnature We follow a more elaborate approachtreating parameters separately mainly becausethey are instrumental in defining our templateactivities

Generality of the results A second issue that wewould like to bring up is the general applicabilityof our approach Is it possible that we apply thismodeling for the general case of workflowsinstead of applying it simply to the ETL onesAs already mentioned to the best of our knowl-edge typical research efforts in the context ofworkflow management are concerned with themanagement of the control flow in a workflowenvironment This is clearly due to the complexityof the problem and its practical application tosemi-automated decision-based interactive work-flows where user choices play a crucial roleTherefore our proposal for a structured manage-ment of the data flow concerning both theinterfaces and the internals of activities appearsto be complementary to existing approaches forthe case of workflows that need to accessstructured data in some kind of data store or toexchange structured data between activitiesIt is possible however that due to the complex-

ity of the workflow a more general approachshould be followed where activities have multiple

inputs and outputs covering all the cases ofdifferent interactions due to the control flow Weanticipate that a general model for businessworkflows will employ activities with inputs andoutputs internal processing and communicationwith files and other activities (along with all thenecessary information on control flow resourcemanagement etc) nevertheless we find this to beoutside the context of this paper

Execution characteristics A third concern in-volves performance Is it possible to model ETLactivities with workflow technology Typically theback-stage of the data warehouse operates understrict performance requirements where a loadingtime-window dictates how much time is assignedto the overall ETL process to refresh the contentsof the data warehouse Therefore performance isreally a major concern in such an environmentClearly in our setting we do not have in mind EAIor other message-oriented technologies to bringthe loading task to a successful end On thecontrary we strongly believe that the volume ofdata is the major factor of the overall process (andnot for example any user-oriented decisions)Nevertheless to our point of view the paradigm ofactivities that feed one another with data duringthe overall process is more than suitableLet us mention a recent experience report on the

topic in [48] the authors report on their datawarehouse population system The architecture ofthe system is discussed in the paper withparticular interest (a) in a lsquolsquoshared data arearsquorsquowhich is an in-memory area for data transforma-tions with a specialized area for rapid access tolookup tables and (b) the pipelining of the ETLprocesses A case study for mobile network trafficdata is also discussed involving around 30 dataflows 10 sources and around 2TB of data with 3billion rows for the major fact table In order toachieve a throughput of 80M rowh and 100Mrowday the designers of the system were practi-cally obliged to exploit low-level OCI calls inorder to avoid storing loading data to files andthen loading them through loading tools With 4 hof loading window for all this workload the mainissues identified involve (a) performance (b)recovery (c) day-by-day maintenance of ETLactivities and (d) adaptable and flexible activities

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 523

Based on the above we believe that the quest for aworkflow rather than a push-and-store paradigmis quite often the only way to followOf course this kind of workflow approach

possibly suffers in the issue of software stabilityand mostly recovery Having a big amount oftransient data processed through a large set ofactivities in main memory is clearly vulnerable toboth software and hardware failures Moreoveronce a failure has occurred rapid recovery ifpossible within the loading time-window is also astrong desideratum Techniques to handle the issueof recovery already exist To our knowledge themost prominent one is the one by Labio et al [49]where the ordering of data is taken into considera-tion Checkpoint techniques guarantee that oncethe activity output is ordered recovery can startright at the point where the activity did the lastcheckpoint thus speeding up the whole processsignificantly

7 Conclusions

In this paper we have focused on the data-centric part of logical design of the ETL scenarioof a data warehouse First we have defined aformal logical metamodel as a logical abstractionof ETL processes The data stores activities andtheir constituent parts as well as the providerrelationships that map data producers to dataconsumers have formally been defined We havealso employed a declarative database program-ming language LDL to define the semantics ofeach activity Then we have provided a reusabilityframework that complements the genericity of theaforementioned metamodel Practically this isachieved from an extensible set of specializationsof the entities of the metamodel layer specificallytailored for the most frequent elements of ETLscenarios which we call template activities In thecontext of template materialization we have dealtwith specific language issues in terms of themechanics of template instantiation to concreteactivities Finally we have presented a graphicaldesign tool ARKTOS II with the goal of facilitatingthe design of ETL scenarios based on our model

Still several research issues are still left open onthe grounds of this work A broad area of researchinvolves the efficient and reliable execution of anETL scenario In this context an obvious issue isthe optimization of ETL scenarios under time andthroughput constraints The topic appears inter-esting since the frequent usage of functions inETL scenarios drives the problem outside theexpressive power of relational algebra (and there-fore the traditional optimization techniques usedin the context of relational query optimizers) Theproblem becomes even more complex if oneconsiders issues of reliability and recovery in thepresence of failures or even issues of softwarequality (eg resilience to changes in the underlyingdata stores) Similar results already exist in thecontext of materialized views maintenance [5051]Of course the issue of providing optimal algo-rithms for individual ETL tasks (eg duplicatedetection surrogate key assignment or identifica-tion of differentials) is also very interesting In adifferent line of research one could also worktowards providing a general model for the dataflow of data-centric business workflows involvingissues of transactions alternative interfaces in thecontext of control flow decisions and contingencyscenarios Finally the extension of ETL techni-ques for streaming or XML-formatted data is alsoanother interesting topic of future research

Acknowledgments

We would like to thank the anonymousreviewers of this paper for valuable commentsthat improved the overall quality of the paper

References

[1] IBM IBM Data warehouse manager available at http

www-3ibmcomsoftwaredatadb2datawarehouse

[2] Informatica Power Center available at httpwww

informaticacomproductsdata+integrationpowercenter

defaulthtm

[3] Microsoft Data transformation services available at

httpwwwmicrosoftcom

[4] Oracle Oracle warehouse builder product page available at

httpotnoraclecomproductswarehousecontenthtml

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525524

[5] WMP van der Aalst AHM ter Hofstede B Kiepus-

zewski AP Barros Workflow Patterns BETA Working

Paper Series WP 47 Eindhoven University of Technology

Eindhoven 2000 available at the Workflow Patterns

web site at tmit httpwwwtmtuenlresearchpatterns

documentationhtm

[6] P Vassiliadis A Simitsis S Skiadopoulos Modeling ETL

activities as graphs in Proceedings of the Fourth

International Workshop on Design and Management of

Data Warehouses (DMDW) pp 52ndash61 Toronto Canada

2002

[7] P Vassiliadis A Simitsis P Georgantas M Terrovitis A

framework for the design of ETL scenarios in Proceed-

ings of the 15th Conference on Advanced Information

Systems Engineering (CAiSE lsquo03) pp 520ndash535 Klagen-

furtVelden Austria 16ndash20 June 2003

[8] R Kimbal L Reeves M Ross W Thornthwaite The

Data Warehouse Lifecycle Toolkit Expert Methods for

Designing Developing and Deploying Data Warehouses

Wiley New York 1998

[9] Workflow Management Coalition Interface 1 Process

Definition Interchange Process Model Document no

WfMC TC-1016-P 1998 available at httpwww

wfmcorg

[10] S Naqvi S Tsur A Logical Language for Data and

Knowledge Bases Computer Science Press Rockville

MD 1989

[11] C Zaniolo LDL++ Tutorial UCLA httppikecs

uclaeduldl December 1998

[12] D Dori Conceptual modeling and system architecting

Commun ACM 46 (10) (2003) 62ndash65

[13] P Vassiliadis A Simitsis P Georgantas M Terrovitis

S Skiadopoulos A generic and customizable frame-

work for the design of ETL scenarios (long version)

Technical Report TR-2004-1 Knowledge and Data-

base Systems Laboratory National Technical University

of Athens available at httpwwwdbnetecentuagr

pubs

[14] Giga Information Group Market Overview Update

ETL Technical Report RPA-032002-00021 March

2002

[15] Ascential Software Inc available at httpwwwascen-

tialsoftwarecom

[16] Ascential Software ProductsmdashData Warehousing Tech-

nology available at httpwwwascentialsoftwarecom

productsdatastagehtml

[17] Gartner Inc ETL magic quadrant update market

pressure increases Gartnerrsquos Strategic Data Management

Research Note M-19-1108 January 2003

[18] PA Bernstein T Bergstraesser Meta-data support for

data transformations using Microsoft repository Special

issue on data transformations Bull Tech Committee

Data Eng 22 (1) (1999) 9ndash14

[19] Microsoft Corp OLEDB specification available at http

wwwmicrosoftcomdataoledb

[20] C Graves M Scott M Benkovich P Turley R

Skoglund R Dewson S Youness D Lee S Ferguson

T Bain T Joubert Professional SQL Server 2000 data

warehousing with analysis services 1st ed Wrox Press

Ltd 2001

[21] Oracle Oracle 9i Warehouse Builder Architectural White

paper April 2002

[22] H Galhardas D Florescu D Shasha E Simon Ajax An

extensible data cleaning tool in Proceedings of the ACM

SIGMOD International Conference on the Management

of Data pp 590 Dallas TX 2000

[23] W Cohen Some practical observations on integration of

Web information in WebDBrsquo99 Workshop in conj with

ACM SIGMOD 1999

[24] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning Technical Report

INRIA 1999 (RR-3742)

[25] V Raman J Hellerstein Potters Wheel an interactive

framework for data cleaning and transformation Techni-

cal Report University of California at Berkeley Computer

Science Division 2000 available at httpwwwcs

berkeleyedurshankarpaperspwheelpdf

[26] V Raman J Hellerstein Potterrsquos Wheel an interactive

data cleaning system in Proceedings of 27th Inter-

national Conference on Very Large Data Bases (VLDB)

pp 381ndash390 Roma Italy 2001

[27] M Jarke M Lenzerini Y Vassiliou P Vassiliadis

Springer New York 2000

[28] E Rundensteiner Special issue on data transformations

Bull Tech Committee Data Eng 22 (1) (1999)

[29] S Sarawagi Special issue on data cleaning Bull Tech

Committee Data Eng 23 (4) (2000)

[30] E Rahm H Hai Do Data cleaning problems and current

approaches Bull Tech Committee Data Eng 23 (4)

(2000)

[31] V Borkar K Deshmuk S Sarawagi Automatically

extracting structure form free text Addresses Bull Tech

Committee Data Eng 23 (4) (2000)

[32] A Monge Matching algorithms within a duplicate

detection system Bull Tech Committee Data Eng 23

(4) (2000)

[33] A Calı D Calvanese G De Giacomo M Lenzerini P

Naggar F Vernacotola IBIS Semantic data integration

at work in Proceedings of the 15th International

Conference on Advanced Information Systems Engineer-

ing (CAiSE 2003) vol 2681 of Lecture Notes in Computer

Science pp 79ndash94 Springer 2003

[34] A Calı D Calvanese G De Giacomo M Lenzerini

Data integration under integrity constraints in Proceed-

ings of the 14th International Conference on Advanced

Information Systems Engineering (CAiSE 2002) vol 2348

of Lecture Notes in Computer Science pp 262ndash279

Springer 2002

[35] J Eder W Gruber A meta model for structured work-

flows supporting workflow transformations in Proceed-

ings of the Sixth East European Conference on Advances

in Databases and Information Systems (ADBIS 2002)

pp 326ndash339 Bratislava Slovakia September 8ndash11

2002

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 525

[36] W Sadiq ME Orlowska On business process model

transformations 19th International Conference on Con-

ceptual Modeling (ER 2000) Salt Lake City UT USA

October 9ndash12 2000 pp 267ndash280

[37] B Kiepuszewski AHM ter Hofstede C Bussler On

structured workflow modeling in Proceedings of the 12th

International Conference on Advanced Information Sys-

tems Engineering (CAiSE 2000) pp 431ndash445 Stockholm

Sweden June 5ndash9 2000

[38] P Dadam M Reichert (eds) Enterprise-wide and cross-

enterprise workflow management concepts systems

applications GI Workshop Informatikrsquo99 1999 available

at httpwwwinformatikuni-ulmdedbisveranstaltungen

Workshop-Informatik99-Proceedingspdf

[39] M Jarke C Quix G Blees D Lehmann G Michalk S

Stierl Improving OLTP Data Quality Using Data Ware-

house Mechanisms Proceedings of 1999 ACM SIGMOD

International Conference on Management of Data Phila-

delphia USA June 1999 pp 537ndash538

[40] E Schafer J-D Becker M Jarke DB-Prism Integrated

data warehouses and knowledge networks for bank

controlling Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[41] M Jarke T List J Koller The challenge of process

warehousing Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[42] P Vassiliadis A Simitsis S Skiadopoulos Conceptual

modeling for ETL processes in Proceedings of the Fifth

ACM International Workshop on Data Warehousing and

OLAP (DOLAP) pp 14ndash21 McLean VA USA 2002

[43] A Simitsis P Vassiliadis A methodology for the

conceptual modeling of ETL processes in Proceedings

of the Decision Systems Engineering (DSE lsquo03) Velden

Austria June 17 2003

[44] A Simitsis Modeling and managing ETL processes in

Proceedings of the VLDB 2003 PhD Workshop Berlin

Germany September 12ndash13 2003

[45] F Casati S Ceri B Pernici G Pozzi Conceptual

Modeling of Workflows in Proceedings of the OO-ER

Conference Australia 1995

[46] AJ Albrecht Measuring Application Development Pro-

ductivity in IBM Applications Development Symposium

Monterey CA 1979 pp 83ndash92

[47] RS Pressman Software Engineering A Practitionerrsquos

Approach 5th ed McGraw-Hill New York 2000

[48] J Adzic V Fiore Data Warehouse Population Platform

in Proceedings of the Fifth International Workshop on the

Design and Management of Data Warehouses

(DMDWrsquo03) Berlin Germany September 2003

[49] W Labio JL Wiener H Garcia-Molina V Gorelik

Efficient resumption of interrupted warehouse loads in

Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data (SIGMOD 2000)

pp 46ndash57 Dallas TX USA 2000

[50] J Chen S Chen EA Rundensteiner A Transactional

Model for Data Warehouse Maintenance in Proceedings

of the of ER 2002 LNCS 2503 pp 247ndash262 2002

[51] B Liu S Chen EA Rundensteiner A transactional

approach to parallel data warehouse maintenance in

Proceedings of DaWaK 2002 LNCS 2454 2002 pp 307ndash316

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 29: Etl design document

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525520

We believe that the AJAX tool is mostlyoriented towards the integration of web data(which is also supported by the ontology of itsalgebraic transformations) at the same timePotterrsquos wheel is mostly oriented towards aninteractive data cleaning tool where the usersinteractively choose data With respect to theseapproaches we believe that our technique con-tributes (a) by offering an extensible frameworkthough a uniform extensibility mechanism and (b)by providing formal foundations to allow thereasoning over the constructed ETL scenariosClearly ARKTOS II is a design tool for traditionaldata warehouse flows therefore we find theaforementioned approaches complementary (espe-cially Potterrsquos Wheel) At the same time whencontrasted with the industrial tools it is evidentthat although ARKTOS II is only a design environ-ment for the moment the industrial tools lack thelogical abstraction that our model implemented inARKTOS II offers on the contrary industrial toolsare concerned directly with the physical perspec-tive (at least to the best of our knowledge)

Data quality and cleaning An extensive reviewof data quality problems and related literaturealong with quality management methodologiescan be found in [27] A collection of articles ondata transformations [28] offers a discussion onvarious aspects of this research area A collectionof articles on data cleaning [29] (including a survey[30]) provides an extensive overview of the fieldalong with research issues and a review of somecommercial tools and solutions on specific pro-blems eg [3132] In a related still differentcontext we would like to mention the IBIS tool[33] IBIS is an integration tool following theglobal-as-view approach to answer queries in amediated system Departing from the traditionaldata integration literature though IBIS brings theissue of data quality in the integration process Thesystem takes advantage of the definition ofconstraints at the intentional level (eg foreignkey constraints) and tries to provide answers thatresolve semantic conflicts (eg the violation of aforeign key constraint) The interesting aspect hereis that consistency is traded for completeness Forexample whenever an offending row is detectedover a foreign key constraint instead of assuming

the violation of consistency the system assumesthe absence of the appropriate lookup value andadjusts its answers to queries accordingly [34]

Workflows To the best of our knowledgeresearch on workflows is focused around thefollowing reoccurring themes (a) modeling[59353637] where the authors are primarilyconcerned in providing a metamodel for work-flows (b) correctness issues [35ndash37] where criteriaare established to determine whether a workflow iswell formed and (c) workflow transformations[35ndash37] where the authors are concerned oncorrectness issues in the evolution of the workflowfrom a certain plan to anotherIn the literature there is a standard proposed by

the workflow management coalition (WfMC) [9]The standard includes a metamodel for thedescription of a workflow process specificationand a textual grammar for the interchange ofprocess definitions A workflow process comprisesof a network of activities their interrelationshipscriteria for staringending a process and otherinformation about participants invoked applica-

tions and relevant data Also several other kindsof entities which are external to the workflow suchas system and environmental data or the organiza-tional model are roughly described In [38] severalinteresting research results on workflow manage-ment are presented in the field of electroniccommerce distributed execution and adaptiveworkflows Still there is no reference to data flowmodeling efforts In [5] the authors provide anoverview of the most frequent control flowpatterns in workflows The patterns refer explicitlyto control flow structures like activity sequenceANDXOROR splitjoin and so on Severalcommercial tools are evaluated against the 26patterns presented In [35ndash37] the authors basedon minimal metamodels try to provide correctnesscriteria in order to derive equivalent plans for thesame workflow scenarioIn more than one work [536] the authors

mention the necessity for the perspectives alreadydiscussed in the introduction of the paper Dataflow or data dependencies are listed within thecomponents of the definition of a workflow still inall these works the authors quickly move on toassume that control flow is the primary aspect of

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 521

workflow modeling and do not deal with data-centric issues any further It is particularly inter-esting that the [9] standard is not concerned withthe role of business data at all The primary focusof the WfMC standard is the interfaces thatconnect the different parts of a workflow engineand the transitions between the states of a work-flow No reference is made to business data(although the standard refers to data which arerelevant for the transition from one state toanother under the name workflow related data)

53 Applications of ETL workflows in data

warehouses

Finally we would like to mention that theliterature reports several efforts (both research andindustrial) for the management of processes andworkflows that operate on data warehouse sys-tems In [39] the authors describe an industrialeffort where the cleaning mechanisms of the datawarehouse are employed in order to avoid thepopulation of the sources with problematic data inthe fist place The described solution is based on aworkflow that employs techniques from the field ofview maintenance The industrial effort at DeutcheBank involving the importexport transformationand cleaning and storage of data in a Terabyte-sizedata warehouse is described in Ref [40] The paperexplains also the usage of metadata managementtechniques which involves a broad spectrum ofapplications from the import of data to themanagement of dimensional data and moreimportantly for the querying of the data ware-house A research effort (and its application in anindustrial application) for the integration andcentral management of the processes that liearound an information system is presented in thework of Jarke et al [41] A metadata managementrepository is employed to store the differentactivities of a large workflow along with impor-tant data that these processes employFinally we should refer the interested reader to

[6] for a detailed presentation of ARKTOS II modelThe model is accompanied by a set of importance

metrics where we exploit the graph structure tomeasure the degree to which activitiesrecordsetsattributes are bound to their data providers or

consumers In [42] we propose a complementaryconceptual model for ETL scenarios and in [43] amethodology for constructing it Ref [44] ab-stractly describes our approach of modeling andmanaging ETL processes

6 Discussion

In this section we would like to briefly discusssome comments on the overall evaluation of ourapproach Our proposal involves the data model-ing part of ETL activities which are modeled asworkflows in our setting nevertheless it is notclear whether we covered all possible problemsaround the topic Therefore in this section we willexplore three issues as an overall assessment of ourproposal First we will discuss its completenessie whether there are parts of the data modelingthat we have missed Second we will discuss thepossibility of further generalizing our approach tothe general case of workflows Finally we will exitthe domain of the logical design and deal withperformance and stability concerns around ETLworkflows

Completeness. A first concern that arises involves the completeness of our approach. We believe that the different layers of Fig. 1 fully cover the different aspects of workflow modeling. We would like to make clear that we focus on the data-oriented part of the modeling, since ETL activities are mostly concerned with a well-established, automated flow of cleanings and transformations, rather than an interactive session where user decisions and actions direct the flow (like, for example, in [45]).

Still, is this enough to capture all the aspects of the data-centric part of ETL activities? Clearly, we do not provide any "formal" proof for the completeness of our approach. Nevertheless, we can justify our basic assumptions based on the related literature in the field of software metrics, and in particular on the method of function points [46,47]. Function points is a methodology trying to quantify the functionality (and thus the required development effort) of an application. Although based on assumptions that pertain to the technological environment of the late 1970s, the methodology is still one of the most cited in the field of software measurement. In any case, function points compute the measurement values based on the following five characteristics: (i) user inputs, (ii) user outputs, (iii) user inquiries, (iv) employed files, and (v) external interfaces.

We believe that an activity in our setting covers all the above quite successfully, since (a) it employs input and output schemata to obtain and forward data (characteristics i, ii and iii), and (b) it communicates with files (characteristic iv) and other activities (practically, characteristic v). Moreover, it is tuned by some user-provided parameters, which are not explicitly captured by the overall methodology but are quite related to characteristics (iii) and (v). As a more general view on the topic, we could claim that it is sufficient to characterize activities with input and output schemata in order to denote their linkage to data (and other activities, too), while treating parameters as part of the input and/or output of the activity, depending on their nature. We follow a more elaborate approach, treating parameters separately, mainly because they are instrumental in defining our template activities.
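To make this correspondence concrete, the following Python sketch (an illustration of ours, not part of the ARKTOS II metamodel; all class, field, and activity names are invented for the example, loosely modeled on a surrogate-key assignment step) treats an activity as nothing more than its input/output schemata, parameters, recordsets, and neighboring activities, and reads the five function-point characteristics off that description.

from dataclasses import dataclass
from typing import List

@dataclass
class Activity:
    """Illustrative stand-in for a logical ETL activity: schemata, parameters,
    and the recordsets/activities it communicates with."""
    name: str
    input_schemata: List[List[str]]    # one attribute list per input
    output_schemata: List[List[str]]   # one attribute list per output
    parameters: List[str]              # user-provided tuning, e.g. lookup keys
    recordsets: List[str]              # files/tables read or written
    neighbors: List[str]               # other activities it feeds or is fed by

def function_point_view(a: Activity) -> dict:
    """Map the activity description onto the five function-point
    characteristics [46]; a rough correspondence, not an official FP count."""
    return {
        "user_inputs": len(a.input_schemata),
        "user_outputs": len(a.output_schemata),
        "user_inquiries": len(a.parameters),        # parameters ~ inquiries
        "employed_files": len(a.recordsets),
        "external_interfaces": len(a.neighbors),
    }

# Example: a surrogate-key assignment activity with one input, one output,
# a lookup table and two parameters (all names hypothetical).
sk = Activity("SK1",
              input_schemata=[["PKEY", "QTY", "DATE"]],
              output_schemata=[["SKEY", "QTY", "DATE"]],
              parameters=["PKEY", "LOOKUP.SKEY"],
              recordsets=["LOOKUP"],
              neighbors=["NotNull1"])
print(function_point_view(sk))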

Generality of the results. A second issue that we would like to bring up is the general applicability of our approach. Is it possible to apply this modeling to the general case of workflows, instead of applying it simply to ETL ones? As already mentioned, to the best of our knowledge, typical research efforts in the context of workflow management are concerned with the management of the control flow in a workflow environment. This is clearly due to the complexity of the problem and its practical application to semi-automated, decision-based, interactive workflows, where user choices play a crucial role. Therefore, our proposal for a structured management of the data flow, concerning both the interfaces and the internals of activities, appears to be complementary to existing approaches for the case of workflows that need to access structured data in some kind of data store, or to exchange structured data between activities.

It is possible, however, that due to the complexity of the workflow, a more general approach should be followed, where activities have multiple inputs and outputs, covering all the cases of different interactions due to the control flow. We anticipate that a general model for business workflows will employ activities with inputs and outputs, internal processing, and communication with files and other activities (along with all the necessary information on control flow, resource management, etc.); nevertheless, we find this to be outside the scope of this paper.

Execution characteristics. A third concern involves performance. Is it possible to model ETL activities with workflow technology? Typically, the back-stage of the data warehouse operates under strict performance requirements, where a loading time-window dictates how much time is assigned to the overall ETL process to refresh the contents of the data warehouse. Therefore, performance is really a major concern in such an environment. Clearly, in our setting we do not have in mind EAI or other message-oriented technologies to bring the loading task to a successful end. On the contrary, we strongly believe that the volume of data is the major factor of the overall process (and not, for example, any user-oriented decisions). Nevertheless, from our point of view, the paradigm of activities that feed one another with data during the overall process is more than suitable.

Let us mention a recent experience report on the topic: in [48], the authors report on their data warehouse population system. The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) in the pipelining of the ETL processes. A case study for mobile network traffic data is also discussed, involving around 30 data flows, 10 sources, and around 2 TB of data, with 3 billion rows for the major fact table. In order to achieve a throughput of 80M rows/h and 100M rows/day, the designers of the system were practically obliged to exploit low-level OCI calls, in order to avoid storing loading data to files and then loading them through loading tools. With a 4-h loading window for all this workload, the main issues identified involve (a) performance, (b) recovery, (c) day-by-day maintenance of ETL activities, and (d) adaptable and flexible activities.
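As a back-of-the-envelope check (our own arithmetic, using only the figures quoted above from [48]), the short Python snippet below shows that the reported daily increment fits the 4-h window, but with limited slack for failures and reprocessing, which helps explain why recovery and maintenance dominate the list of concerns.

# Rough feasibility check based on the figures reported in [48]
# (the arithmetic is ours, not taken from that paper).
daily_rows = 100_000_000           # ~100M rows/day for the major fact table
throughput_per_hour = 80_000_000   # ~80M rows/h sustained throughput
window_hours = 4                   # nightly loading time-window

hours_needed = daily_rows / throughput_per_hour
slack = window_hours - hours_needed
print(f"load time: {hours_needed:.2f} h, slack in window: {slack:.2f} h")
# load time: 1.25 h, slack in window: 2.75 h -> little room for repeated
# reruns after a failure, hence the emphasis on recovery.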


Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers with respect to software stability and, mostly, recovery. Having a big amount of transient data processed through a large set of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is also a strong desideratum. Techniques to handle the issue of recovery already exist. To our knowledge, the most prominent one is the one by Labio et al. [49], where the ordering of data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can start right at the point where the activity did its last checkpoint, thus speeding up the whole process significantly.
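To illustrate the flavor of this technique (a simplified sketch of the idea, not the actual algorithms of [49]; all names are ours), the following Python fragment resumes a key-ordered load from the last persisted key, skipping everything already known to be safely stored.

class InMemoryCheckpoint:
    """Toy checkpoint store; a real one would persist to durable storage."""
    def __init__(self):
        self.key = None
    def read(self):
        return self.key
    def write(self, key):
        self.key = key
    def clear(self):
        self.key = None

def load_with_checkpoints(ordered_rows, write_row, checkpoint, every=10_000):
    """Resumable load of a key-ordered stream. Because the input is ordered,
    remembering the last safely stored key is enough to resume after a crash."""
    last_done = checkpoint.read()          # None on a fresh start
    loaded = 0
    for key, row in ordered_rows:
        if last_done is not None and key <= last_done:
            continue                       # already loaded before the failure
        write_row(key, row)                # assumed durable once it returns
        loaded += 1
        if loaded % every == 0:
            checkpoint.write(key)          # restart point moves forward
    checkpoint.clear()                     # load finished; nothing to resume
    return loaded

# Usage: rows sorted by key; re-running with the same (re-read) checkpoint
# after a failure skips all rows up to the last checkpointed key.
ckpt = InMemoryCheckpoint()
rows = ((i, {"qty": i}) for i in range(1, 50_001))
print(load_with_checkpoints(rows, lambda k, r: None, ckpt, every=5_000))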

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have formally been defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.

Several research issues are still left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products: Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note, M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB'99 Workshop (in conjunction with ACM SIGMOD), 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report, INRIA, 1999 (RR-3742).
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner (ed.), Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi (ed.), Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmukh, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.
[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW'03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 30: Etl design document

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 521

workflow modeling and do not deal with data-centric issues any further It is particularly inter-esting that the [9] standard is not concerned withthe role of business data at all The primary focusof the WfMC standard is the interfaces thatconnect the different parts of a workflow engineand the transitions between the states of a work-flow No reference is made to business data(although the standard refers to data which arerelevant for the transition from one state toanother under the name workflow related data)

53 Applications of ETL workflows in data

warehouses

Finally we would like to mention that theliterature reports several efforts (both research andindustrial) for the management of processes andworkflows that operate on data warehouse sys-tems In [39] the authors describe an industrialeffort where the cleaning mechanisms of the datawarehouse are employed in order to avoid thepopulation of the sources with problematic data inthe fist place The described solution is based on aworkflow that employs techniques from the field ofview maintenance The industrial effort at DeutcheBank involving the importexport transformationand cleaning and storage of data in a Terabyte-sizedata warehouse is described in Ref [40] The paperexplains also the usage of metadata managementtechniques which involves a broad spectrum ofapplications from the import of data to themanagement of dimensional data and moreimportantly for the querying of the data ware-house A research effort (and its application in anindustrial application) for the integration andcentral management of the processes that liearound an information system is presented in thework of Jarke et al [41] A metadata managementrepository is employed to store the differentactivities of a large workflow along with impor-tant data that these processes employFinally we should refer the interested reader to

[6] for a detailed presentation of ARKTOS II modelThe model is accompanied by a set of importance

metrics where we exploit the graph structure tomeasure the degree to which activitiesrecordsetsattributes are bound to their data providers or

consumers In [42] we propose a complementaryconceptual model for ETL scenarios and in [43] amethodology for constructing it Ref [44] ab-stractly describes our approach of modeling andmanaging ETL processes

6 Discussion

In this section we would like to briefly discusssome comments on the overall evaluation of ourapproach Our proposal involves the data model-ing part of ETL activities which are modeled asworkflows in our setting nevertheless it is notclear whether we covered all possible problemsaround the topic Therefore in this section we willexplore three issues as an overall assessment of ourproposal First we will discuss its completenessie whether there are parts of the data modelingthat we have missed Second we will discuss thepossibility of further generalizing our approach tothe general case of workflows Finally we will exitthe domain of the logical design and deal withperformance and stability concerns around ETLworkflows

Completeness A first concern that arisesinvolves the completeness of our approach Webelieve that the different layers of Fig 1 fully coverthe different aspects of workflow modeling Wewould like to make clear that we focus on the data-oriented part of the modeling since ETL activitiesare mostly concerned with a well-establishedautomated flow of cleanings and transformationsrather than an interactive session where user

decisions and actions direct the flow (like forexample in [45])Still is this enough to capture all the aspects of

the data-centric part of ETL activities Clearly wedo not provide any lsquolsquoformalrsquorsquo proof for thecompleteness of our approach Nevertheless wecan justify our basic assumptions based on therelated literature in the field of software metricsand in particular on the method of function points

[4647] Function points is a methodology tryingto quantify the functionality (and thus the re-quired development effort) of an applicationAlthough based on assumptions that pertain tothe technological environment of the late 1970s

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525522

the methodology is still one of the most cited in thefield of software measurement In any casefunction points compute the measurement valuesbased on the five following characteristics (i) userinputs (ii) user outputs (iii) user inquiries (iv)employed files and (v) external interfacesWe believe that an activity in our setting covers

all the above quite successfully since (a) it employsinput and output schemata to obtain and forwarddata (characteristics i ii and iii) (b) communicateswith files (characteristic iv) and other activities(practically characteristic v) Moreover it is tunedby some user-provided parameters which are notexplicitly captured by the overall methodology butare quite related to characteristics (iii) and (v) Asa more general view on the topic we could claimthat it is sufficient to characterize activities withinput and output schemata in order to denotetheir linkage to data (and other activities too)while treating parameters as part of the input andor output of the activity depending on theirnature We follow a more elaborate approachtreating parameters separately mainly becausethey are instrumental in defining our templateactivities

Generality of the results A second issue that wewould like to bring up is the general applicabilityof our approach Is it possible that we apply thismodeling for the general case of workflowsinstead of applying it simply to the ETL onesAs already mentioned to the best of our knowl-edge typical research efforts in the context ofworkflow management are concerned with themanagement of the control flow in a workflowenvironment This is clearly due to the complexityof the problem and its practical application tosemi-automated decision-based interactive work-flows where user choices play a crucial roleTherefore our proposal for a structured manage-ment of the data flow concerning both theinterfaces and the internals of activities appearsto be complementary to existing approaches forthe case of workflows that need to accessstructured data in some kind of data store or toexchange structured data between activitiesIt is possible however that due to the complex-

ity of the workflow a more general approachshould be followed where activities have multiple

inputs and outputs covering all the cases ofdifferent interactions due to the control flow Weanticipate that a general model for businessworkflows will employ activities with inputs andoutputs internal processing and communicationwith files and other activities (along with all thenecessary information on control flow resourcemanagement etc) nevertheless we find this to beoutside the context of this paper

Execution characteristics A third concern in-volves performance Is it possible to model ETLactivities with workflow technology Typically theback-stage of the data warehouse operates understrict performance requirements where a loadingtime-window dictates how much time is assignedto the overall ETL process to refresh the contentsof the data warehouse Therefore performance isreally a major concern in such an environmentClearly in our setting we do not have in mind EAIor other message-oriented technologies to bringthe loading task to a successful end On thecontrary we strongly believe that the volume ofdata is the major factor of the overall process (andnot for example any user-oriented decisions)Nevertheless to our point of view the paradigm ofactivities that feed one another with data duringthe overall process is more than suitableLet us mention a recent experience report on the

topic in [48] the authors report on their datawarehouse population system The architecture ofthe system is discussed in the paper withparticular interest (a) in a lsquolsquoshared data arearsquorsquowhich is an in-memory area for data transforma-tions with a specialized area for rapid access tolookup tables and (b) the pipelining of the ETLprocesses A case study for mobile network trafficdata is also discussed involving around 30 dataflows 10 sources and around 2TB of data with 3billion rows for the major fact table In order toachieve a throughput of 80M rowh and 100Mrowday the designers of the system were practi-cally obliged to exploit low-level OCI calls inorder to avoid storing loading data to files andthen loading them through loading tools With 4 hof loading window for all this workload the mainissues identified involve (a) performance (b)recovery (c) day-by-day maintenance of ETLactivities and (d) adaptable and flexible activities

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 523

Based on the above we believe that the quest for aworkflow rather than a push-and-store paradigmis quite often the only way to followOf course this kind of workflow approach

possibly suffers in the issue of software stabilityand mostly recovery Having a big amount oftransient data processed through a large set ofactivities in main memory is clearly vulnerable toboth software and hardware failures Moreoveronce a failure has occurred rapid recovery ifpossible within the loading time-window is also astrong desideratum Techniques to handle the issueof recovery already exist To our knowledge themost prominent one is the one by Labio et al [49]where the ordering of data is taken into considera-tion Checkpoint techniques guarantee that oncethe activity output is ordered recovery can startright at the point where the activity did the lastcheckpoint thus speeding up the whole processsignificantly

7 Conclusions

In this paper we have focused on the data-centric part of logical design of the ETL scenarioof a data warehouse First we have defined aformal logical metamodel as a logical abstractionof ETL processes The data stores activities andtheir constituent parts as well as the providerrelationships that map data producers to dataconsumers have formally been defined We havealso employed a declarative database program-ming language LDL to define the semantics ofeach activity Then we have provided a reusabilityframework that complements the genericity of theaforementioned metamodel Practically this isachieved from an extensible set of specializationsof the entities of the metamodel layer specificallytailored for the most frequent elements of ETLscenarios which we call template activities In thecontext of template materialization we have dealtwith specific language issues in terms of themechanics of template instantiation to concreteactivities Finally we have presented a graphicaldesign tool ARKTOS II with the goal of facilitatingthe design of ETL scenarios based on our model

Still several research issues are still left open onthe grounds of this work A broad area of researchinvolves the efficient and reliable execution of anETL scenario In this context an obvious issue isthe optimization of ETL scenarios under time andthroughput constraints The topic appears inter-esting since the frequent usage of functions inETL scenarios drives the problem outside theexpressive power of relational algebra (and there-fore the traditional optimization techniques usedin the context of relational query optimizers) Theproblem becomes even more complex if oneconsiders issues of reliability and recovery in thepresence of failures or even issues of softwarequality (eg resilience to changes in the underlyingdata stores) Similar results already exist in thecontext of materialized views maintenance [5051]Of course the issue of providing optimal algo-rithms for individual ETL tasks (eg duplicatedetection surrogate key assignment or identifica-tion of differentials) is also very interesting In adifferent line of research one could also worktowards providing a general model for the dataflow of data-centric business workflows involvingissues of transactions alternative interfaces in thecontext of control flow decisions and contingencyscenarios Finally the extension of ETL techni-ques for streaming or XML-formatted data is alsoanother interesting topic of future research

Acknowledgments

We would like to thank the anonymousreviewers of this paper for valuable commentsthat improved the overall quality of the paper

References

[1] IBM IBM Data warehouse manager available at http

www-3ibmcomsoftwaredatadb2datawarehouse

[2] Informatica Power Center available at httpwww

informaticacomproductsdata+integrationpowercenter

defaulthtm

[3] Microsoft Data transformation services available at

httpwwwmicrosoftcom

[4] Oracle Oracle warehouse builder product page available at

httpotnoraclecomproductswarehousecontenthtml

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525524

[5] WMP van der Aalst AHM ter Hofstede B Kiepus-

zewski AP Barros Workflow Patterns BETA Working

Paper Series WP 47 Eindhoven University of Technology

Eindhoven 2000 available at the Workflow Patterns

web site at tmit httpwwwtmtuenlresearchpatterns

documentationhtm

[6] P Vassiliadis A Simitsis S Skiadopoulos Modeling ETL

activities as graphs in Proceedings of the Fourth

International Workshop on Design and Management of

Data Warehouses (DMDW) pp 52ndash61 Toronto Canada

2002

[7] P Vassiliadis A Simitsis P Georgantas M Terrovitis A

framework for the design of ETL scenarios in Proceed-

ings of the 15th Conference on Advanced Information

Systems Engineering (CAiSE lsquo03) pp 520ndash535 Klagen-

furtVelden Austria 16ndash20 June 2003

[8] R Kimbal L Reeves M Ross W Thornthwaite The

Data Warehouse Lifecycle Toolkit Expert Methods for

Designing Developing and Deploying Data Warehouses

Wiley New York 1998

[9] Workflow Management Coalition Interface 1 Process

Definition Interchange Process Model Document no

WfMC TC-1016-P 1998 available at httpwww

wfmcorg

[10] S Naqvi S Tsur A Logical Language for Data and

Knowledge Bases Computer Science Press Rockville

MD 1989

[11] C Zaniolo LDL++ Tutorial UCLA httppikecs

uclaeduldl December 1998

[12] D Dori Conceptual modeling and system architecting

Commun ACM 46 (10) (2003) 62ndash65

[13] P Vassiliadis A Simitsis P Georgantas M Terrovitis

S Skiadopoulos A generic and customizable frame-

work for the design of ETL scenarios (long version)

Technical Report TR-2004-1 Knowledge and Data-

base Systems Laboratory National Technical University

of Athens available at httpwwwdbnetecentuagr

pubs

[14] Giga Information Group Market Overview Update

ETL Technical Report RPA-032002-00021 March

2002

[15] Ascential Software Inc available at httpwwwascen-

tialsoftwarecom

[16] Ascential Software ProductsmdashData Warehousing Tech-

nology available at httpwwwascentialsoftwarecom

productsdatastagehtml

[17] Gartner Inc ETL magic quadrant update market

pressure increases Gartnerrsquos Strategic Data Management

Research Note M-19-1108 January 2003

[18] PA Bernstein T Bergstraesser Meta-data support for

data transformations using Microsoft repository Special

issue on data transformations Bull Tech Committee

Data Eng 22 (1) (1999) 9ndash14

[19] Microsoft Corp OLEDB specification available at http

wwwmicrosoftcomdataoledb

[20] C Graves M Scott M Benkovich P Turley R

Skoglund R Dewson S Youness D Lee S Ferguson

T Bain T Joubert Professional SQL Server 2000 data

warehousing with analysis services 1st ed Wrox Press

Ltd 2001

[21] Oracle Oracle 9i Warehouse Builder Architectural White

paper April 2002

[22] H Galhardas D Florescu D Shasha E Simon Ajax An

extensible data cleaning tool in Proceedings of the ACM

SIGMOD International Conference on the Management

of Data pp 590 Dallas TX 2000

[23] W Cohen Some practical observations on integration of

Web information in WebDBrsquo99 Workshop in conj with

ACM SIGMOD 1999

[24] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning Technical Report

INRIA 1999 (RR-3742)

[25] V Raman J Hellerstein Potters Wheel an interactive

framework for data cleaning and transformation Techni-

cal Report University of California at Berkeley Computer

Science Division 2000 available at httpwwwcs

berkeleyedurshankarpaperspwheelpdf

[26] V Raman J Hellerstein Potterrsquos Wheel an interactive

data cleaning system in Proceedings of 27th Inter-

national Conference on Very Large Data Bases (VLDB)

pp 381ndash390 Roma Italy 2001

[27] M Jarke M Lenzerini Y Vassiliou P Vassiliadis

Springer New York 2000

[28] E Rundensteiner Special issue on data transformations

Bull Tech Committee Data Eng 22 (1) (1999)

[29] S Sarawagi Special issue on data cleaning Bull Tech

Committee Data Eng 23 (4) (2000)

[30] E Rahm H Hai Do Data cleaning problems and current

approaches Bull Tech Committee Data Eng 23 (4)

(2000)

[31] V Borkar K Deshmuk S Sarawagi Automatically

extracting structure form free text Addresses Bull Tech

Committee Data Eng 23 (4) (2000)

[32] A Monge Matching algorithms within a duplicate

detection system Bull Tech Committee Data Eng 23

(4) (2000)

[33] A Calı D Calvanese G De Giacomo M Lenzerini P

Naggar F Vernacotola IBIS Semantic data integration

at work in Proceedings of the 15th International

Conference on Advanced Information Systems Engineer-

ing (CAiSE 2003) vol 2681 of Lecture Notes in Computer

Science pp 79ndash94 Springer 2003

[34] A Calı D Calvanese G De Giacomo M Lenzerini

Data integration under integrity constraints in Proceed-

ings of the 14th International Conference on Advanced

Information Systems Engineering (CAiSE 2002) vol 2348

of Lecture Notes in Computer Science pp 262ndash279

Springer 2002

[35] J Eder W Gruber A meta model for structured work-

flows supporting workflow transformations in Proceed-

ings of the Sixth East European Conference on Advances

in Databases and Information Systems (ADBIS 2002)

pp 326ndash339 Bratislava Slovakia September 8ndash11

2002

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 525

[36] W Sadiq ME Orlowska On business process model

transformations 19th International Conference on Con-

ceptual Modeling (ER 2000) Salt Lake City UT USA

October 9ndash12 2000 pp 267ndash280

[37] B Kiepuszewski AHM ter Hofstede C Bussler On

structured workflow modeling in Proceedings of the 12th

International Conference on Advanced Information Sys-

tems Engineering (CAiSE 2000) pp 431ndash445 Stockholm

Sweden June 5ndash9 2000

[38] P Dadam M Reichert (eds) Enterprise-wide and cross-

enterprise workflow management concepts systems

applications GI Workshop Informatikrsquo99 1999 available

at httpwwwinformatikuni-ulmdedbisveranstaltungen

Workshop-Informatik99-Proceedingspdf

[39] M Jarke C Quix G Blees D Lehmann G Michalk S

Stierl Improving OLTP Data Quality Using Data Ware-

house Mechanisms Proceedings of 1999 ACM SIGMOD

International Conference on Management of Data Phila-

delphia USA June 1999 pp 537ndash538

[40] E Schafer J-D Becker M Jarke DB-Prism Integrated

data warehouses and knowledge networks for bank

controlling Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[41] M Jarke T List J Koller The challenge of process

warehousing Proceedings of the 26th International Con-

ference on Very Large Databases Cairo Egypt 2000

[42] P Vassiliadis A Simitsis S Skiadopoulos Conceptual

modeling for ETL processes in Proceedings of the Fifth

ACM International Workshop on Data Warehousing and

OLAP (DOLAP) pp 14ndash21 McLean VA USA 2002

[43] A Simitsis P Vassiliadis A methodology for the

conceptual modeling of ETL processes in Proceedings

of the Decision Systems Engineering (DSE lsquo03) Velden

Austria June 17 2003

[44] A Simitsis Modeling and managing ETL processes in

Proceedings of the VLDB 2003 PhD Workshop Berlin

Germany September 12ndash13 2003

[45] F Casati S Ceri B Pernici G Pozzi Conceptual

Modeling of Workflows in Proceedings of the OO-ER

Conference Australia 1995

[46] AJ Albrecht Measuring Application Development Pro-

ductivity in IBM Applications Development Symposium

Monterey CA 1979 pp 83ndash92

[47] RS Pressman Software Engineering A Practitionerrsquos

Approach 5th ed McGraw-Hill New York 2000

[48] J Adzic V Fiore Data Warehouse Population Platform

in Proceedings of the Fifth International Workshop on the

Design and Management of Data Warehouses

(DMDWrsquo03) Berlin Germany September 2003

[49] W Labio JL Wiener H Garcia-Molina V Gorelik

Efficient resumption of interrupted warehouse loads in

Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data (SIGMOD 2000)

pp 46ndash57 Dallas TX USA 2000

[50] J Chen S Chen EA Rundensteiner A Transactional

Model for Data Warehouse Maintenance in Proceedings

of the of ER 2002 LNCS 2503 pp 247ndash262 2002

[51] B Liu S Chen EA Rundensteiner A transactional

approach to parallel data warehouse maintenance in

Proceedings of DaWaK 2002 LNCS 2454 2002 pp 307ndash316

  • A generic and customizable framework for the design of ETL scenarios
    • Introduction
    • Generic model of ETL activities
      • Graphical notation and motivating example
      • Preliminaries
      • Activities
      • Relationships in the architecture graph
      • Scenarios
        • Templates for ETL activities
          • General framework
          • Formal definition and usage of template activities
            • Notation
            • Instantiation
            • Taxonomy simple and program-based templates
                • Implementation
                • Related work
                  • Commercial studies and tools
                  • Research efforts
                  • Applications of ETL workflows in data warehouses
                    • Discussion
                    • Conclusions
                    • Acknowledgments
                    • References
Page 31: Etl design document

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525522

the methodology is still one of the most cited in thefield of software measurement In any casefunction points compute the measurement valuesbased on the five following characteristics (i) userinputs (ii) user outputs (iii) user inquiries (iv)employed files and (v) external interfacesWe believe that an activity in our setting covers

all the above quite successfully since (a) it employsinput and output schemata to obtain and forwarddata (characteristics i ii and iii) (b) communicateswith files (characteristic iv) and other activities(practically characteristic v) Moreover it is tunedby some user-provided parameters which are notexplicitly captured by the overall methodology butare quite related to characteristics (iii) and (v) Asa more general view on the topic we could claimthat it is sufficient to characterize activities withinput and output schemata in order to denotetheir linkage to data (and other activities too)while treating parameters as part of the input andor output of the activity depending on theirnature We follow a more elaborate approachtreating parameters separately mainly becausethey are instrumental in defining our templateactivities

Generality of the results A second issue that wewould like to bring up is the general applicabilityof our approach Is it possible that we apply thismodeling for the general case of workflowsinstead of applying it simply to the ETL onesAs already mentioned to the best of our knowl-edge typical research efforts in the context ofworkflow management are concerned with themanagement of the control flow in a workflowenvironment This is clearly due to the complexityof the problem and its practical application tosemi-automated decision-based interactive work-flows where user choices play a crucial roleTherefore our proposal for a structured manage-ment of the data flow concerning both theinterfaces and the internals of activities appearsto be complementary to existing approaches forthe case of workflows that need to accessstructured data in some kind of data store or toexchange structured data between activitiesIt is possible however that due to the complex-

ity of the workflow a more general approachshould be followed where activities have multiple

inputs and outputs covering all the cases ofdifferent interactions due to the control flow Weanticipate that a general model for businessworkflows will employ activities with inputs andoutputs internal processing and communicationwith files and other activities (along with all thenecessary information on control flow resourcemanagement etc) nevertheless we find this to beoutside the context of this paper

Execution characteristics A third concern in-volves performance Is it possible to model ETLactivities with workflow technology Typically theback-stage of the data warehouse operates understrict performance requirements where a loadingtime-window dictates how much time is assignedto the overall ETL process to refresh the contentsof the data warehouse Therefore performance isreally a major concern in such an environmentClearly in our setting we do not have in mind EAIor other message-oriented technologies to bringthe loading task to a successful end On thecontrary we strongly believe that the volume ofdata is the major factor of the overall process (andnot for example any user-oriented decisions)Nevertheless to our point of view the paradigm ofactivities that feed one another with data duringthe overall process is more than suitableLet us mention a recent experience report on the

topic in [48] the authors report on their datawarehouse population system The architecture ofthe system is discussed in the paper withparticular interest (a) in a lsquolsquoshared data arearsquorsquowhich is an in-memory area for data transforma-tions with a specialized area for rapid access tolookup tables and (b) the pipelining of the ETLprocesses A case study for mobile network trafficdata is also discussed involving around 30 dataflows 10 sources and around 2TB of data with 3billion rows for the major fact table In order toachieve a throughput of 80M rowh and 100Mrowday the designers of the system were practi-cally obliged to exploit low-level OCI calls inorder to avoid storing loading data to files andthen loading them through loading tools With 4 hof loading window for all this workload the mainissues identified involve (a) performance (b)recovery (c) day-by-day maintenance of ETLactivities and (d) adaptable and flexible activities

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525 523

Based on the above we believe that the quest for aworkflow rather than a push-and-store paradigmis quite often the only way to followOf course this kind of workflow approach

possibly suffers in the issue of software stabilityand mostly recovery Having a big amount oftransient data processed through a large set ofactivities in main memory is clearly vulnerable toboth software and hardware failures Moreoveronce a failure has occurred rapid recovery ifpossible within the loading time-window is also astrong desideratum Techniques to handle the issueof recovery already exist To our knowledge themost prominent one is the one by Labio et al [49]where the ordering of data is taken into considera-tion Checkpoint techniques guarantee that oncethe activity output is ordered recovery can startright at the point where the activity did the lastcheckpoint thus speeding up the whole processsignificantly

7 Conclusions

In this paper we have focused on the data-centric part of logical design of the ETL scenarioof a data warehouse First we have defined aformal logical metamodel as a logical abstractionof ETL processes The data stores activities andtheir constituent parts as well as the providerrelationships that map data producers to dataconsumers have formally been defined We havealso employed a declarative database program-ming language LDL to define the semantics ofeach activity Then we have provided a reusabilityframework that complements the genericity of theaforementioned metamodel Practically this isachieved from an extensible set of specializationsof the entities of the metamodel layer specificallytailored for the most frequent elements of ETLscenarios which we call template activities In thecontext of template materialization we have dealtwith specific language issues in terms of themechanics of template instantiation to concreteactivities Finally we have presented a graphicaldesign tool ARKTOS II with the goal of facilitatingthe design of ETL scenarios based on our model

Still several research issues are still left open onthe grounds of this work A broad area of researchinvolves the efficient and reliable execution of anETL scenario In this context an obvious issue isthe optimization of ETL scenarios under time andthroughput constraints The topic appears inter-esting since the frequent usage of functions inETL scenarios drives the problem outside theexpressive power of relational algebra (and there-fore the traditional optimization techniques usedin the context of relational query optimizers) Theproblem becomes even more complex if oneconsiders issues of reliability and recovery in thepresence of failures or even issues of softwarequality (eg resilience to changes in the underlyingdata stores) Similar results already exist in thecontext of materialized views maintenance [5051]Of course the issue of providing optimal algo-rithms for individual ETL tasks (eg duplicatedetection surrogate key assignment or identifica-tion of differentials) is also very interesting In adifferent line of research one could also worktowards providing a general model for the dataflow of data-centric business workflows involvingissues of transactions alternative interfaces in thecontext of control flow decisions and contingencyscenarios Finally the extension of ETL techni-ques for streaming or XML-formatted data is alsoanother interesting topic of future research

Acknowledgments

We would like to thank the anonymousreviewers of this paper for valuable commentsthat improved the overall quality of the paper

References

[1] IBM IBM Data warehouse manager available at http

www-3ibmcomsoftwaredatadb2datawarehouse

[2] Informatica Power Center available at httpwww

informaticacomproductsdata+integrationpowercenter

defaulthtm

[3] Microsoft Data transformation services available at

httpwwwmicrosoftcom

[4] Oracle Oracle warehouse builder product page available at

httpotnoraclecomproductswarehousecontenthtml

ARTICLE IN PRESS

P Vassiliadis et al Information Systems 30 (2005) 492ndash525524

[5] WMP van der Aalst AHM ter Hofstede B Kiepus-

zewski AP Barros Workflow Patterns BETA Working

Paper Series WP 47 Eindhoven University of Technology

Eindhoven 2000 available at the Workflow Patterns

web site at tmit httpwwwtmtuenlresearchpatterns

documentationhtm

[6] P Vassiliadis A Simitsis S Skiadopoulos Modeling ETL

activities as graphs in Proceedings of the Fourth

International Workshop on Design and Management of

Data Warehouses (DMDW) pp 52ndash61 Toronto Canada

2002

[7] P Vassiliadis A Simitsis P Georgantas M Terrovitis A

framework for the design of ETL scenarios in Proceed-

ings of the 15th Conference on Advanced Information

Systems Engineering (CAiSE lsquo03) pp 520ndash535 Klagen-

furtVelden Austria 16ndash20 June 2003

[8] R Kimbal L Reeves M Ross W Thornthwaite The

Data Warehouse Lifecycle Toolkit Expert Methods for

Designing Developing and Deploying Data Warehouses

Wiley New York 1998

[9] Workflow Management Coalition Interface 1 Process

Definition Interchange Process Model Document no

WfMC TC-1016-P 1998 available at httpwww

wfmcorg

[10] S Naqvi S Tsur A Logical Language for Data and

Knowledge Bases Computer Science Press Rockville

MD 1989

[11] C Zaniolo LDL++ Tutorial UCLA httppikecs

uclaeduldl December 1998

[12] D Dori Conceptual modeling and system architecting

Commun ACM 46 (10) (2003) 62ndash65

[13] P Vassiliadis A Simitsis P Georgantas M Terrovitis

S Skiadopoulos A generic and customizable frame-

work for the design of ETL scenarios (long version)

Technical Report TR-2004-1 Knowledge and Data-

base Systems Laboratory National Technical University

of Athens available at httpwwwdbnetecentuagr

pubs

[14] Giga Information Group Market Overview Update

ETL Technical Report RPA-032002-00021 March

2002

[15] Ascential Software Inc available at httpwwwascen-

tialsoftwarecom

[16] Ascential Software ProductsmdashData Warehousing Tech-

nology available at httpwwwascentialsoftwarecom

productsdatastagehtml

[17] Gartner Inc ETL magic quadrant update market

pressure increases Gartnerrsquos Strategic Data Management

Research Note M-19-1108 January 2003

[18] PA Bernstein T Bergstraesser Meta-data support for

data transformations using Microsoft repository Special

issue on data transformations Bull Tech Committee

Data Eng 22 (1) (1999) 9ndash14

[19] Microsoft Corp OLEDB specification available at http

wwwmicrosoftcomdataoledb

[20] C Graves M Scott M Benkovich P Turley R

Skoglund R Dewson S Youness D Lee S Ferguson

T Bain T Joubert Professional SQL Server 2000 data

warehousing with analysis services 1st ed Wrox Press

Ltd 2001

[21] Oracle Oracle 9i Warehouse Builder Architectural White

paper April 2002

[22] H Galhardas D Florescu D Shasha E Simon Ajax An

extensible data cleaning tool in Proceedings of the ACM

SIGMOD International Conference on the Management

of Data pp 590 Dallas TX 2000

[23] W Cohen Some practical observations on integration of

Web information in WebDBrsquo99 Workshop in conj with

ACM SIGMOD 1999

[24] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning Technical Report

INRIA 1999 (RR-3742)

[25] V Raman J Hellerstein Potters Wheel an interactive

framework for data cleaning and transformation Techni-

cal Report University of California at Berkeley Computer

Science Division 2000 available at httpwwwcs

berkeleyedurshankarpaperspwheelpdf

[26] V Raman J Hellerstein Potterrsquos Wheel an interactive

data cleaning system in Proceedings of 27th Inter-

national Conference on Very Large Data Bases (VLDB)

pp 381ndash390 Roma Italy 2001

[27] M Jarke M Lenzerini Y Vassiliou P Vassiliadis

Springer New York 2000

[28] E Rundensteiner Special issue on data transformations

Bull Tech Committee Data Eng 22 (1) (1999)

Based on the above, we believe that the quest for a workflow, rather than a push-and-store paradigm, is quite often the only way to follow.

Of course, this kind of workflow approach possibly suffers with respect to software stability and, mostly, recovery. Processing a large amount of transient data through a long chain of activities in main memory is clearly vulnerable to both software and hardware failures. Moreover, once a failure has occurred, rapid recovery, if possible within the loading time-window, is a strong desideratum. Techniques to handle the issue of recovery already exist; to our knowledge, the most prominent one is that of Labio et al. [49], where the ordering of the data is taken into consideration. Checkpoint techniques guarantee that, once the activity output is ordered, recovery can restart right at the point where the activity took its last checkpoint, thus speeding up the whole process significantly.
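To make the checkpoint idea concrete, the following minimal sketch shows one possible way to resume an interrupted load when the records are ordered on a key, in the spirit of [49]. It assumes Python 3.8+; the names (load_ordered, load.checkpoint, the pkey default) are illustrative assumptions for this sketch, not part of the original technique.

import json
from pathlib import Path
from typing import Callable, Iterable

CHECKPOINT = Path("load.checkpoint")     # hypothetical location of the checkpoint file

def load_ordered(records: Iterable[dict],
                 write: Callable[[dict], None],
                 key: str = "pkey",
                 every: int = 10_000) -> None:
    # Resume point: the largest key value known to be safely loaded before a failure.
    last = json.loads(CHECKPOINT.read_text())["last_key"] if CHECKPOINT.exists() else None
    n = 0
    for rec in records:                   # records are assumed sorted on `key`
        if last is not None and rec[key] <= last:
            continue                      # already loaded before the failure, skip it
        write(rec)                        # push the row towards the target table
        n += 1
        if n % every == 0:                # periodically persist the last loaded key
            CHECKPOINT.write_text(json.dumps({"last_key": rec[key]}))
    CHECKPOINT.unlink(missing_ok=True)    # a completed load needs no checkpoint

The only point of the sketch is that an ordered output lets a restarted load skip everything up to the last persisted key instead of reprocessing the whole input.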

7. Conclusions

In this paper, we have focused on the data-centric part of the logical design of the ETL scenario of a data warehouse. First, we have defined a formal logical metamodel as a logical abstraction of ETL processes. The data stores, activities and their constituent parts, as well as the provider relationships that map data producers to data consumers, have been formally defined. We have also employed a declarative database programming language, LDL, to define the semantics of each activity. Then, we have provided a reusability framework that complements the genericity of the aforementioned metamodel. Practically, this is achieved through an extensible set of specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios, which we call template activities. In the context of template materialization, we have dealt with specific language issues, in terms of the mechanics of template instantiation to concrete activities. Finally, we have presented a graphical design tool, ARKTOS II, with the goal of facilitating the design of ETL scenarios based on our model.
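As a rough illustration of what template instantiation amounts to in practice, the sketch below binds the parameters of a generic not-null template to obtain a concrete activity definition. It is not the paper's actual mechanism; the names, the instantiate helper and the simplified LDL-like rule are assumptions made only for exposition.

from string import Template

# A simplified, hypothetical not-null template: the rule body is fixed,
# while the schema and the checked attribute are parameters.
NOT_NULL_TEMPLATE = Template("a_out($fields) <- a_in($fields), $field ~= 'null'.")

def instantiate(template: Template, **params: str) -> str:
    # Binding the parameters yields the definition of a concrete activity.
    return template.substitute(**params)

# Example: a NotNull check on attribute COST of a line-items recordset.
activity = instantiate(NOT_NULL_TEMPLATE, fields="PKEY, DATE, COST", field="COST")
print(activity)   # a_out(PKEY, DATE, COST) <- a_in(PKEY, DATE, COST), COST ~= 'null'.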

Still, several research issues are left open on the grounds of this work. A broad area of research involves the efficient and reliable execution of an ETL scenario. In this context, an obvious issue is the optimization of ETL scenarios under time and throughput constraints. The topic appears interesting, since the frequent usage of functions in ETL scenarios drives the problem outside the expressive power of relational algebra (and, therefore, the traditional optimization techniques used in the context of relational query optimizers). The problem becomes even more complex if one considers issues of reliability and recovery in the presence of failures, or even issues of software quality (e.g., resilience to changes in the underlying data stores). Similar results already exist in the context of materialized view maintenance [50,51]. Of course, the issue of providing optimal algorithms for individual ETL tasks (e.g., duplicate detection, surrogate key assignment, or identification of differentials) is also very interesting. In a different line of research, one could also work towards providing a general model for the data flow of data-centric business workflows, involving issues of transactions, alternative interfaces in the context of control flow decisions, and contingency scenarios. Finally, the extension of ETL techniques for streaming or XML-formatted data is another interesting topic of future research.

Acknowledgments

We would like to thank the anonymous reviewers of this paper for valuable comments that improved the overall quality of the paper.

References

[1] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse
[2] Informatica, Power Center, available at http://www.informatica.com/products/data+integration/powercenter/default.htm
[3] Microsoft, Data Transformation Services, available at http://www.microsoft.com
[4] Oracle, Oracle Warehouse Builder product page, available at http://otn.oracle.com/products/warehouse/content.html


[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns web site, http://www.tm.tue.nl/research/patterns/documentation.htm
[6] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Modeling ETL activities as graphs, in: Proceedings of the Fourth International Workshop on Design and Management of Data Warehouses (DMDW), pp. 52–61, Toronto, Canada, 2002.
[7] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, A framework for the design of ETL scenarios, in: Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE '03), pp. 520–535, Klagenfurt/Velden, Austria, 16–20 June 2003.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley, New York, 1998.
[9] Workflow Management Coalition, Interface 1: Process Definition Interchange, Process Model, Document no. WfMC TC-1016-P, 1998, available at http://www.wfmc.org
[10] S. Naqvi, S. Tsur, A Logical Language for Data and Knowledge Bases, Computer Science Press, Rockville, MD, 1989.
[11] C. Zaniolo, LDL++ Tutorial, UCLA, http://pike.cs.ucla.edu/ldl/, December 1998.
[12] D. Dori, Conceptual modeling and system architecting, Commun. ACM 46 (10) (2003) 62–65.
[13] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios (long version), Technical Report TR-2004-1, Knowledge and Database Systems Laboratory, National Technical University of Athens, available at http://www.dbnet.ece.ntua.gr/pubs
[14] Giga Information Group, Market Overview Update: ETL, Technical Report RPA-032002-00021, March 2002.
[15] Ascential Software Inc., available at http://www.ascentialsoftware.com
[16] Ascential Software, Products: Data Warehousing Technology, available at http://www.ascentialsoftware.com/products/datastage.html
[17] Gartner Inc., ETL magic quadrant update: market pressure increases, Gartner's Strategic Data Management Research Note M-19-1108, January 2003.
[18] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft repository, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[19] Microsoft Corp., OLE DB specification, available at http://www.microsoft.com/data/oledb
[20] C. Graves, M. Scott, M. Benkovich, P. Turley, R. Skoglund, R. Dewson, S. Youness, D. Lee, S. Ferguson, T. Bain, T. Joubert, Professional SQL Server 2000 Data Warehousing with Analysis Services, 1st ed., Wrox Press Ltd., 2001.
[21] Oracle, Oracle 9i Warehouse Builder, Architectural White Paper, April 2002.
[22] H. Galhardas, D. Florescu, D. Shasha, E. Simon, Ajax: an extensible data cleaning tool, in: Proceedings of the ACM SIGMOD International Conference on the Management of Data, p. 590, Dallas, TX, 2000.
[23] W. Cohen, Some practical observations on integration of Web information, in: WebDB '99 Workshop, in conjunction with ACM SIGMOD, 1999.
[24] H. Galhardas, D. Florescu, D. Shasha, E. Simon, An extensible framework for data cleaning, Technical Report RR-3742, INRIA, 1999.
[25] V. Raman, J. Hellerstein, Potter's Wheel: an interactive framework for data cleaning and transformation, Technical Report, University of California at Berkeley, Computer Science Division, 2000, available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
[26] V. Raman, J. Hellerstein, Potter's Wheel: an interactive data cleaning system, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 381–390, Roma, Italy, 2001.
[27] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, New York, 2000.
[28] E. Rundensteiner, Special issue on data transformations, Bull. Tech. Committee Data Eng. 22 (1) (1999).
[29] S. Sarawagi, Special issue on data cleaning, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[30] E. Rahm, H. Hai Do, Data cleaning: problems and current approaches, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[31] V. Borkar, K. Deshmukh, S. Sarawagi, Automatically extracting structure from free text addresses, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[32] A. Monge, Matching algorithms within a duplicate detection system, Bull. Tech. Committee Data Eng. 23 (4) (2000).
[33] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, P. Naggar, F. Vernacotola, IBIS: Semantic data integration at work, in: Proceedings of the 15th International Conference on Advanced Information Systems Engineering (CAiSE 2003), vol. 2681 of Lecture Notes in Computer Science, pp. 79–94, Springer, 2003.
[34] A. Calì, D. Calvanese, G. De Giacomo, M. Lenzerini, Data integration under integrity constraints, in: Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), vol. 2348 of Lecture Notes in Computer Science, pp. 262–279, Springer, 2002.
[35] J. Eder, W. Gruber, A meta model for structured workflows supporting workflow transformations, in: Proceedings of the Sixth East European Conference on Advances in Databases and Information Systems (ADBIS 2002), pp. 326–339, Bratislava, Slovakia, September 8–11, 2002.


[36] W. Sadiq, M.E. Orlowska, On business process model transformations, in: 19th International Conference on Conceptual Modeling (ER 2000), Salt Lake City, UT, USA, October 9–12, 2000, pp. 267–280.
[37] B. Kiepuszewski, A.H.M. ter Hofstede, C. Bussler, On structured workflow modeling, in: Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), pp. 431–445, Stockholm, Sweden, June 5–9, 2000.
[38] P. Dadam, M. Reichert (eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik '99, 1999, available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf
[39] M. Jarke, C. Quix, G. Blees, D. Lehmann, G. Michalk, S. Stierl, Improving OLTP data quality using data warehouse mechanisms, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, June 1999, pp. 537–538.
[40] E. Schafer, J.-D. Becker, M. Jarke, DB-Prism: Integrated data warehouses and knowledge networks for bank controlling, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[41] M. Jarke, T. List, J. Koller, The challenge of process warehousing, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[42] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes, in: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 14–21, McLean, VA, USA, 2002.
[43] A. Simitsis, P. Vassiliadis, A methodology for the conceptual modeling of ETL processes, in: Proceedings of the Decision Systems Engineering (DSE '03), Velden, Austria, June 17, 2003.
[44] A. Simitsis, Modeling and managing ETL processes, in: Proceedings of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12–13, 2003.
[45] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, in: Proceedings of the OO-ER Conference, Australia, 1995.
[46] A.J. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, 1979, pp. 83–92.
[47] R.S. Pressman, Software Engineering: A Practitioner's Approach, 5th ed., McGraw-Hill, New York, 2000.
[48] J. Adzic, V. Fiore, Data warehouse population platform, in: Proceedings of the Fifth International Workshop on the Design and Management of Data Warehouses (DMDW '03), Berlin, Germany, September 2003.
[49] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik, Efficient resumption of interrupted warehouse loads, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46–57, Dallas, TX, USA, 2000.
[50] J. Chen, S. Chen, E.A. Rundensteiner, A transactional model for data warehouse maintenance, in: Proceedings of ER 2002, LNCS 2503, pp. 247–262, 2002.
[51] B. Liu, S. Chen, E.A. Rundensteiner, A transactional approach to parallel data warehouse maintenance, in: Proceedings of DaWaK 2002, LNCS 2454, 2002, pp. 307–316.
