ETL Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani
  • ETL
    Prof. Navneet Goyal
    Department of Computer Science & Information Systems
    BITS, Pilani

  • Topics
    Requirements
    Build or Buy?
    ETL Data Structures
    Data Flow
    Extract
    Clean & Conform
    Deliver
    Dimension Tables
    Fact Tables
    Implementation & Operations

  • Introduction
    The ETL system is the foundation of any DW system
    Still the most underrated system!
    An ETL system:
        Extracts data from source systems
        Enforces data quality & consistency
        Conforms data so that separate source systems can be used together
        Delivers data in a presentation-ready format that can be used by end users

  • Introduction
    The ETL system makes or breaks a DW
    Building an ETL system is a back room activity, not visible to end users
    Consumes almost 70-80% of the resources needed for implementation & maintenance of a DW
    Mission of the ETL system: get data out of the source systems & load it into the DW

  • Introduction
    Extract
        Extract relevant data
    Transform
        Transform data to DW format
        Build keys, etc.
        Cleansing of data
    Load
        Load data into DW
        Build aggregates, etc.
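
    The three steps on this slide can be sketched as a tiny in-memory pipeline. This is only an illustration under assumptions: the row layout, the function names (extract, transform, load) and the rule that a row is "relevant" when it has an amount are all hypothetical and do not come from the slides.

```python
def extract(source_rows):
    """Extract: keep only the relevant rows (assumed here: rows with an amount)."""
    return [r for r in source_rows if r.get("amount") is not None]


def transform(rows, key_map):
    """Transform: cleanse names, build surrogate keys, convert to DW format."""
    out = []
    for r in rows:
        name = r["customer"].strip().title()               # cleansing
        key = key_map.setdefault(name, len(key_map) + 1)   # build keys
        out.append({"customer_key": key, "customer": name,
                    "amount": float(r["amount"])})
    return out


def load(dw, rows):
    """Load: append the rows to the DW and rebuild a simple aggregate."""
    dw["facts"].extend(rows)
    totals = {}
    for fact in dw["facts"]:
        key = fact["customer_key"]
        totals[key] = totals.get(key, 0.0) + fact["amount"]
    dw["aggregates"] = totals


# Hypothetical source data
source = [
    {"customer": " alice ", "amount": "10.5"},
    {"customer": "ALICE", "amount": "4.5"},
    {"customer": "bob", "amount": None},     # not relevant: no amount
    {"customer": "Bob", "amount": "3"},
]
dw = {"facts": [], "aggregates": {}}
key_map = {}
load(dw, transform(extract(source), key_map))
```

    After the run, dw["aggregates"] holds one total per surrogate key, and the row without an amount never reaches the DW.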

  • Choice of Architecture: Tool-Based ETL
    Simpler, cheaper & faster development
    People with business skills but not much technical skill can use it
    Automatically generates metadata
    Automatically generates data lineage & data dependency analysis
    Offers in-line encryption & compression capabilities
    Manages complex load balancing across servers

  • Choice of Architecture: Hand-Coded ETL
    Quality assured by exhaustive unit testing
    Better metadata
    Requirement may be just file-based processes, not database stored procedures
    Use of existing legacy routines
    Use of in-house programmers
    Unlimited flexibility

  • Middleware & Connectivity Tools
    Provide transparent access to source systems in heterogeneous computing environments
    Expensive, but often prove invaluable because they provide transparent access to DBs of different types, residing on different platforms
    Examples:
        IBM: Data Joiner
        Oracle: Transparent Gateway
        SAS: SAS/Connect
        Sybase: Enterprise Connect

  • Extraction Tools
    A lot of tools are available in the market
    Tool selection is tedious
    Choice of tool depends on the following factors:
        Source system platform and DB
        Built-in extraction or duplication functionality
        Batch windows of the operational systems

  • Extraction Methods
    Bulk Extractions
        Entire DW is refreshed periodically
        Heavily taxes the network connections between the source & target DBs
        Easier to set up & maintain
    Change-based Extractions
        Only data that have been newly inserted or updated in the source systems are extracted & loaded into the DW
        Places less stress on the network, but requires more complex programming to determine when a new DW record must be inserted or when an existing DW record must be updated
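
    A minimal sketch of change-based extraction, assuming each source row carries a last-modified timestamp and the previous extract time is remembered as a watermark. The field names (id, modified_at), the function names and the dictionary standing in for the DW are hypothetical; real sources may need triggers, log readers or row comparison instead.

```python
from datetime import datetime


def extract_changes(source_rows, last_extract_time):
    """Return only rows inserted or updated since the previous extract."""
    return [r for r in source_rows if r["modified_at"] > last_extract_time]


def apply_changes(dw, changed_rows):
    """Insert rows with new keys, update rows whose key already exists.

    Returns (inserted, updated) counts.
    """
    inserted = updated = 0
    for row in changed_rows:
        if row["id"] in dw:
            updated += 1
        else:
            inserted += 1
        dw[row["id"]] = row
    return inserted, updated


# Hypothetical source table and DW contents
source = [
    {"id": 1, "name": "Alice", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Bob", "modified_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "Carol", "modified_at": datetime(2024, 1, 6)},
]
dw = {
    1: {"id": 1, "name": "Alice", "modified_at": datetime(2024, 1, 1)},
    2: {"id": 2, "name": "Bobby", "modified_at": datetime(2023, 12, 1)},
}

watermark = datetime(2024, 1, 2)              # time of the previous extract
changes = extract_changes(source, watermark)  # rows 2 and 3 only
counts = apply_changes(dw, changes)           # one insert (3), one update (2)
```

    Only two of the three source rows cross the network, at the cost of the extra insert-versus-update logic the slide mentions. A bulk extraction would simply replace dw with all of source.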

  • Transformation Tools
    Transform extracted data into the appropriate format, data structure, and values that are required by the DW
    Features provided:
        Field splitting & consolidation
        Standardization
            Abbreviations, date formats, data types, character formats, etc.
        Deduplication

  • Transformation examples (Source System -> Type of transformation -> DW)
    Field Splitting
        Source System: Address Field: #123 ABC Street, XYZ City 1000, Republic of MN
        DW: No: 123; Street: ABC; City: XYZ; Country: Republic of MN; Postal Code: 1000
    Field Consolidation
        Source System: System A, Customer title: President; System B, Customer title: CEO
        DW: Customer title: President & CEO
    Standardization
        Source System: Order Date: 05 August 1998; Order Date: 08/08/98
        DW: Order Date: 05 August 1998; Order Date: 08 August 1998
    Deduplication
        Source System: System A, Customer Name: John W. Smith; System B, Customer Name: John William Smith
        DW: Customer Name: John William Smith
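
    The four transformations in the examples above can be sketched in a few lines of Python. This is a rough illustration, not how a transformation tool works internally: the function names, the three-line address layout, the two accepted date formats and the name-matching rule (same first name, last name and middle initial) are all assumptions made for the example.

```python
import re
from datetime import datetime


def split_address(lines):
    """Field splitting: break a three-line address into separate fields."""
    street_line, city_line, country = lines
    number, street = re.fullmatch(r"#(\d+)\s+(.+)\s+Street", street_line).groups()
    city, postal_code = re.fullmatch(r"(.+)\s+City\s+(\d+)", city_line).groups()
    return {"No": number, "Street": street, "City": city,
            "Country": country, "Postal Code": postal_code}


def consolidate_titles(titles):
    """Field consolidation: merge values from several systems into one field."""
    return " & ".join(titles)


def standardize_date(text):
    """Standardization: accept two assumed input formats, emit a single one."""
    for fmt in ("%d %B %Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).strftime("%d %B %Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def dedup_names(names):
    """Deduplication: treat names with the same first name, last name and
    middle initial as one customer and keep the most complete spelling."""
    best = {}
    for name in names:
        parts = name.replace(".", "").split()
        middle = parts[1][0].lower() if len(parts) > 2 else ""
        key = (parts[0].lower(), parts[-1].lower(), middle)
        if key not in best or len(name) > len(best[key]):
            best[key] = name
    return list(best.values())


address = split_address(["#123 ABC Street", "XYZ City 1000", "Republic of MN"])
title = consolidate_titles(["President", "CEO"])
dates = [standardize_date(d) for d in ("05 August 1998", "08/08/98")]
customers = dedup_names(["John W. Smith", "John William Smith"])
```

    Real deduplication usually needs fuzzier matching than this key-based rule; the point here is only to show one input and one output per transformation type.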

  • Mission of ETL team
    To build the back room of the DW
        Deliver data most effectively to end user tools
        Add value to the data in the cleaning & conforming steps
        Protect & document the lineage of data

  • Mission of ETL team
    The back room must support 4 key steps:
        Extracting data from original sources
        Quality assuring & cleaning data
        Conforming the labels & measures in the data to achieve consistency across the original sources
        Delivering the data in a physical format that can be used by query tools and report writers

  • ETL Data Structures
    Data flow: Extract -> Clean -> Conform -> Deliver
    The back room of a DW is often called the data staging area
    Staging means writing to disk
    The ETL team needs a number of different data structures for all kinds of staging needs

  • To stage or not to stage
    The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect

  • To stage or not to stage
    A conflict between:
        getting the data from the operational systems as fast as possible
        having the ability to restart without repeating the process from the beginning
    Reasons for staging:
        Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after major processing (cleaning, transformation, etc.)
        Backup: can reload the data warehouse from the staging tables without going to the sources
        Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse
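
    The recoverability argument above can be sketched as follows: the extract is written to disk once, and a restarted job reads the staged file instead of going back to the source. The function name staged_extract, the CSV layout and the file name are hypothetical, and the sketch assumes the extract returns at least one row of string values.

```python
import csv
import os
import tempfile


def staged_extract(extract_fn, stage_path):
    """Return (rows, origin). Reuse the staged file if present; otherwise
    call the source extract and stage its result to disk."""
    if os.path.exists(stage_path):
        with open(stage_path, newline="") as f:
            return list(csv.DictReader(f)), "stage"

    rows = extract_fn()
    tmp_path = stage_path + ".tmp"
    with open(tmp_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    # Rename only after a complete write, so a crash never leaves a
    # half-written file that a restart would mistake for a full extract.
    os.replace(tmp_path, stage_path)
    return rows, "source"


source_calls = []


def extract_orders():
    """Stand-in for a slow query against the operational system."""
    source_calls.append(1)
    return [{"order_id": "1", "amount": "10.50"},
            {"order_id": "2", "amount": "7.25"}]


stage_path = os.path.join(tempfile.mkdtemp(), "orders_extract.csv")
first_rows, first_origin = staged_extract(extract_orders, stage_path)
# Simulated restart after a failure later in the job:
second_rows, second_origin = staged_extract(extract_orders, stage_path)
```

    The second call returns the same rows without touching the source, which is the trade-off the slide describes: one extra disk write in exchange for a restart point.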

  • Designing the staging area
    The staging area is owned by the ETL team
        no indexes, no aggregations, no presentation access, no querying, no service level agreements
    Users are not allowed in the staging area for any reason
        staging is a construction site
    Reports cannot access data in the staging area
        tables can be added or dropped without notifying the user community
        Controlled environment

  • Designing the staging area (contd)
    Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area)

    The staging area consists of both RDBMS tables and data files

  • Staging Tables Volumetric Worksheet
    Lists each table in the staging area with the following information:
        Table Name: name of the table or file in the data staging area (DSA). One row in the worksheet for each staging table
        Update Strategy: Indicates how a table is maintained. Persistent tables have data appended, updated, or perhaps deleted. Transient tables are truncated and reloaded with each process
        Load Frequency: How often the table is loaded or changed by the ETL process. In a real-time environment, continuously
        ETL Jobs
        Initial Row count:
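
    One possible way to keep such a worksheet in code is one record per staging table, restricted to the columns named on the slide. The class name, the strategy strings and every sample value below are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StagingTableEntry:
    """One worksheet row; the fields mirror the columns named on the slide."""
    table_name: str
    update_strategy: str          # assumed values: "append", "update", "delete", "truncate/reload"
    load_frequency: str           # e.g. "daily", or "continuous" for real-time loads
    etl_jobs: List[str] = field(default_factory=list)
    initial_row_count: Optional[int] = None

    @property
    def is_transient(self) -> bool:
        """Transient tables are truncated and reloaded with each process."""
        return self.update_strategy == "truncate/reload"


# Hypothetical worksheet contents
worksheet = [
    StagingTableEntry("stg_orders", "truncate/reload", "daily",
                      ["load_orders"], 50000),
    StagingTableEntry("stg_customer_history", "append", "daily",
                      ["load_customers"], 12000),
]

transient_tables = [e.table_name for e in worksheet if e.is_transient]
persistent_tables = [e.table_name for e in worksheet if not e.is_transient]
```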

  • Staging Area Data Structures in the ETL System
    Flat files
        fast to write, append to, sort and filter (grep), but slow to update, access or join
        enable restart without going to the sources
    XML Data Sets
        Used as a medium of data transfer between incompatible data sources
        Give enough information to create tables using CREATE TABLE
    Relational Tables
        Metadata, SQL interface, DBA support
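
    To illustrate the point that an XML data set can carry enough information to create a table, here is a small sketch that derives a CREATE TABLE statement from an XML document and loads its rows into an in-memory SQLite database. The XML layout (table, columns, column, rows, row elements) is invented for this example, as are the table and column names; real XML data sets typically describe structure through a schema (DTD or XSD) instead.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML data set: structure and data travel together.
XML_DOC = """
<table name="customer_stage">
  <columns>
    <column name="customer_id" type="INTEGER"/>
    <column name="customer_name" type="TEXT"/>
  </columns>
  <rows>
    <row><customer_id>1</customer_id><customer_name>John William Smith</customer_name></row>
  </rows>
</table>
"""

ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL"}


def build_create_table(root):
    """Derive a CREATE TABLE statement from the column descriptions."""
    table = root.get("name")
    columns = [(c.get("name"), c.get("type")) for c in root.find("columns")]
    # Names are spliced into SQL text, so accept only plain identifiers
    # and a small set of known types.
    for ident in [table] + [name for name, _ in columns]:
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    for _, col_type in columns:
        if col_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {col_type!r}")
    cols_sql = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    return f"CREATE TABLE {table} ({cols_sql})", [name for name, _ in columns]


root = ET.fromstring(XML_DOC.strip())
ddl, column_names = build_create_table(root)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
placeholders = ", ".join("?" for _ in column_names)
for row in root.find("rows"):
    values = [row.findtext(name) for name in column_names]
    conn.execute(f"INSERT INTO {root.get('name')} VALUES ({placeholders})", values)

loaded = conn.execute(
    f"SELECT customer_id, customer_name FROM {root.get('name')}"
).fetchall()
```

    Once loaded, the data sits in a relational staging table and gains the SQL interface listed under Relational Tables.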