5/29/2009
1
Data Warehousing & Data MiningWolf-Tilo BalkeSilviu HomoceanuInstitut für InformationssystemeTechnische Universität Braunschweighttp://www.ifis.cs.tu-bs.de
8. Building the DW
8.1 The DW Project
8.2 Data Extract/Transform/Load (ETL)
8.3 Metadata
DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
8. Building the DW
• Building a DW is a complex IT project
– A mid-size DW project comprises 500-1,000 activities
• DW-Project organization
– Project roles and corresponding tasks, e.g.:
8.1 The DW Project
Roles | Tasks
DW-PM | Project management
DW-Architect | Methods, concepts, modeling
DW-Miner | Concepts, analysis (non-standard)
Domain Expert | Domain knowledge
DW-System Developer | System and metadata management, ETL
DW User | Analysis (standard)
• DW-Project usual tasks
– Communication, as a process of information exchange between team members
– Conflict management
• The magical triangle, compromise between time, costs and quality
– Quality assurance
• Performance, reliability, scalability, robustness, etc.
– Documentation
8.1 The DW Project
• Software choice
– Database system for the DW
• Usually the choice is to use the same technology provider as for the operational data
• MDB vs. RDB
– ETL tools
• Differentiated by the data cleansing needs they cover
– Analysis tools
• Varying from data mining to OLAP products, with a focus on reporting functionality
– Repository
• Not used very often
• Helpful for metadata management
8.1 The DW Project
• Hardware choice
– Data storage
• RAID systems, SANs, NASs
– Processing
• Multi-CPU systems, SMP, Clusters
– Failure tolerance
• Data replication, RAID mirroring, backup strategies
– Other factors
• Data access times, transfer rates, memory bandwidth, network throughput and latency
8.1 The DW Project
• Project timeline depends on the development methodology, but usually:
– Phase I – Proof of Concept
– Phase II – Controlled Release
• Iterative process of building subject-oriented data marts
– Phase III – General Availability
• Ongoing operations, support and training, maintenance and growth
• The most important part of the DW building project is defining the ETL process
8.1 The DW Project
• What is ETL?
– Short for extract, transform, and load
– Three database functions that are combined into one tool to pull data out of production databases and place it into the DW
• Migrate data from one database to another, to form data marts and data warehouses
• Convert databases from one format or type to another
8.2 ETL
• When should we ETL?
– Periodically (e.g., every night, every week) or after significant events
– Refresh policy set by administrator based on user needs and traffic
– Possibly different policies for different sources
– Rarely, on every update (real-time DW)
• Not warranted unless applications require up-to-the-minute data (e.g., stock quotes)
8.2 ETL
• ETL is used to integrate heterogeneous systems
– With different DBMS, operating system, hardware, communication protocols
• ETL challenges
– Getting the data from the source to target as fast as possible
– Allow recovery from failure without restarting the whole process
• This leads to a trade-off between writing data to staging tables and keeping it in memory
8.2 ETL
• Staging area, basic rules
– Data in the staging area is owned by the ETL team
• Users are not allowed in the staging area at any time
– Reports cannot access data from the staging area
– Only ETL processes can write to and read from the staging area
8.2 ETL
• ETL input/output example
8.2 ETL
• Staging area structures for holding data
– Flat files
– XML data sets
– Relational tables
8.2 Staging area data structures
• Flat files
– ETL tools based on scripts, such as Perl, VBScript or JavaScript
– Advantages
• No overhead of maintaining metadata, as a DBMS has
• Sorting, merging, deleting, replacing and other data-migration functions are much faster outside the DBMS
– Disadvantages
• No concept of updating
• Queries and random-access lookups are not well supported by the operating system
• Flat files cannot be indexed for fast lookups
8.2 Staging area data structures
• When should flat files be used?
– Staging source data for safekeeping and recovery
• The best approach to restarting a failed process is to have the data dumped in a flat file
– Sorting data
• Sorting data in the file system may be more efficient than performing it in a DBMS with an ORDER BY clause
• Sorting is important: a huge portion of ETL processing cycles goes into sorting
8.2 Staging area data structures
• When should flat files be used?
– Filtering
• Using grep-like functionality
– Replacing text strings
• Sequential file processing is much faster at the system level than it is with a database
8.2 Staging area data structures
• XML Data sets
– Used as a common format for both input and output of the ETL system
– Generally not used for persistent staging
– Useful mechanisms
• XML schema (successor of DTD)
• XQuery, XPath
• XSLT
8.2 Staging area data structures
• Relational tables
– Using tables is most appropriate, especially when there are no dedicated ETL tools
– Advantages
• Apparent metadata: column names, data types and lengths, cardinality, etc.
• Relational abilities: data integrity as well as normalized staging
• Open repository/SQL interface: easy to access by any SQL-compliant tool
– Disadvantages
• Sometimes slower than the operating file system
8.2 Staging area data structures
• How is the staging area designed?
– The staging database, file system, and directory structures are set up by the DB and OS administrators based on the ETL architect's estimations, e.g., a table volumetric worksheet
8.2 Staging area storage
Table Name | Update strategy | Load frequency | ETL Job | Initial row count | Avg row length | Grows with | Expected rows/mo | Expected bytes/mo | Initial table size | Table size 6 mo. (MB)
S_ACC | Truncate/Reload | Daily | SAcc | 39,933 | 27 | New account | 9,983 | 269,548 | 1,078,191 | 2.57
S_ASSETS | Insert/Delete | Daily | SAssets | 771,500 | 75 | New assets | 192,875 | 15,044,250 | 60,177,000 | 143.47
S_DEFECT | Truncate/Reload | On demand | SDefect | 84 | 27 | New defect | 21 | 567 | 2,268 | 0.01
• ETL
– Data extraction
– Data transformation
– Data loading
8.2 ETL
• Data Extraction
– Data needs to be taken from a data source so that it can be put into the DW
• Internal scripts/tools at the data source, which export the data to be used
• External programs, which extract the data from the source
– If the data is exported, it is typically exported into a text file that can then be brought into an intermediary database
– If the data is extracted from the source, it is typically transferred directly into an intermediary database
8.2 Data Extraction
• Steps in data extraction
– Initial extraction
• Preparing the logical map
• First time data extraction
– Ongoing extraction
• Just new data
• Changed data
• Or even deleted data
8.2 Data Extraction
• The logical map connects the original source data to the final data
– The most important part is the description of the transformation rules
8.2 Data Extraction
Target Table | Target Column | Target Data Type | Table Type | Source DB | Source Table | Source Column | Source Data Type | Transformation
EMPL_DIM | E_KEY | NUMBER | DIMENSION | | | | NUMBER | Surrogate key
EMPL_DIM | COUNTRY | VARCHAR(75) | DIMENSION | HR_SYS | COUNTRIES | NAME | VARCHAR(75) | select c.name from employees e, states s, countries c where e.state_id = s.state_id and s.country_id = c.country_id
…
• Building the logical map: first identify the data sources
– Data discovery phase
• Collecting and documenting source systems: databases, tables, relations, cardinality, keys, data types, etc.
– Anomaly detection phase
• NULL values can destroy any ETL process, e.g., if a foreign key is NULL, joining tables on a NULL column results in data loss, because in an RDB NULL ≠ NULL
– If NULL occurs in a foreign key, use outer joins
– If NULL occurs in other columns, create a business rule to replace the NULLs while loading data into the DW
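The NULL-join pitfall above can be sketched with a toy example (table and column names are invented for illustration):

```python
import sqlite3

# Toy schema: one employee has a NULL state_id (a NULL foreign key)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (emp_id INTEGER, state_id INTEGER);
    CREATE TABLE states (state_id INTEGER, name TEXT);
    INSERT INTO states VALUES (1, 'Niedersachsen');
    INSERT INTO employees VALUES (10, 1);
    INSERT INTO employees VALUES (11, NULL);
""")

# An inner join silently drops the row whose foreign key is NULL ...
inner = con.execute(
    "SELECT e.emp_id FROM employees e JOIN states s ON e.state_id = s.state_id"
).fetchall()

# ... while a LEFT OUTER JOIN keeps it, so no source rows are lost
outer = con.execute(
    "SELECT e.emp_id FROM employees e LEFT JOIN states s ON e.state_id = s.state_id"
).fetchall()

print(len(inner), len(outer))  # 1 2
```

The same data loss happens in any RDBMS, since NULL never compares equal to anything, including another NULL.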
8.2 Data Extraction
• Data needs to be maintained in the DW also after the initial load
– Extraction is performed on a regular basis
– Only changes are extracted after the first time
8.2 Ongoing Extraction
• Detecting changes (new/changed data)
– Using audit columns
– Database log scraping or sniffing
– Process of elimination
8.2 Ongoing Extraction
• Detecting changes (new/changed data)
– Using audit columns
• Store date and time a record has been added or modified
• Detect changes based on date stamps higher than the last extraction date
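A minimal sketch of audit-column extraction; the `orders` table and its `last_modified` audit column are invented for illustration:

```python
import sqlite3

# Source table with an audit column recording when each row was modified
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2009-05-27 10:00:00"),
    (2, 20.0, "2009-05-29 08:30:00"),  # modified after the last ETL run
])

last_extraction = "2009-05-28 00:00:00"  # date stamp stored by the previous run

# Extract only rows whose audit column is newer than the last extraction date
changed = con.execute(
    "SELECT id, amount FROM orders WHERE last_modified > ?", (last_extraction,)
).fetchall()
print(changed)  # [(2, 20.0)]
```

This only works if the source application reliably maintains the audit column; rows updated without touching it are missed.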
8.2 Ongoing Extraction
– Log scraping
• Takes a snapshot of the database redo log at a certain time (e.g., midnight) and finds the transactions affecting the tables the ETL process is interested in
• It can be problematic when the redo log gets full and is emptied by the DBA
– Log sniffing
• Polling the redo log, capturing the transactions on the fly
• The better choice: suitable also for real-time ETL
8.2 Detecting changes
– Process of Elimination
• Preserves exactly one copy of each previous extraction
• During next run, it compares the entire source tables against the extraction copy
• Only differences are sent to the DW
• Advantages
– Because the process makes row by row comparisons, it is impossible to miss data
– It can also detect deleted rows
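The process of elimination amounts to a keyed diff between the previous extraction copy and the current source state; a sketch with in-memory dicts standing in for the two tables:

```python
# Previous extraction copy and current source rows, keyed on the primary key
previous = {1: ("Alice", 100), 2: ("Bob", 200), 3: ("Carol", 300)}
current  = {1: ("Alice", 100), 2: ("Bob", 250), 4: ("Dave", 400)}

# Row-by-row comparison: only the differences are sent to the DW
inserts = {k: v for k, v in current.items() if k not in previous}
updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
deletes = {k: v for k, v in previous.items() if k not in current}

print(inserts)  # {4: ('Dave', 400)}
print(updates)  # {2: ('Bob', 250)}
print(deletes)  # {3: ('Carol', 300)}
```

Note the cost: every run touches every source row, which is why this method is reliable but slow on large tables.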
8.2 Detecting changes
• Detecting deleted or overwritten fact records
– If records with incorrect values are inserted by mistake, the corresponding records in the ODS get deleted or overwritten
– If the mistakes have already been loaded in the DW, corrections have to be made
– In such cases the solution is not to modify or delete data in the DW, but to insert an additional record which corrects or even cancels the mistake by negating it
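A toy illustration of the negating-record approach (the fact columns are invented):

```python
# A fact row was loaded by mistake alongside a correct one
fact_sales = [
    {"day": "2009-05-28", "store": "BS", "amount": 500.0},  # correct
    {"day": "2009-05-28", "store": "BS", "amount": 120.0},  # loaded by mistake
]

# Instead of deleting the wrong row, append a record that negates it;
# a corrected record could then be inserted as a further row if needed
fact_sales.append({"day": "2009-05-28", "store": "BS", "amount": -120.0})

# Aggregations now behave as if the mistake had never been loaded
total = sum(row["amount"] for row in fact_sales)
print(total)  # 500.0
```

Keeping the wrong row plus its negation preserves the full load history, which matters for auditability.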
8.2 Detecting changes
• Data transformation
– Uses rules or lookup tables, or creates combinations with other data, to convert source data to the desired state
• 2 major steps
– Data Cleaning
• Mostly involves manual work
• Assisted by artificial intelligence algorithms and pattern recognition
– Data Integration
8.2 Data Transformation
• Extracted data can be dirty. What does clean data look like?
• Data Quality characteristics:
– Correct: values and descriptions in data represent their associated objects truthfully
• E.g., if the city in which store 1 is located is Braunschweig, then the address should not report Paris.
– Unambiguous: the values and descriptions in data can be taken to have only one meaning
8.2 Data Cleaning
– Consistent: values and descriptions in data use one constant notational convention
• E.g., Braunschweig can be expressed as BS or Brunswick by our employees in the USA. Consistency means using just BS in all our data
– Complete
• Individual values and descriptors in data have a value (not null)
• Aggregate number of records is complete
8.2 Data Cleaning
• The data cleaning engine produces 3 main deliverables:
– Data-profiling results:
• Meta-data repository describing schema definitions, business objects, domains, data sources, table definitions, data rules, value rules, etc.
• Represents a quantitative assessment of original data sources
8.2 Data Cleaning
– Error event table
• Structured as a dimensional star schema
• Each data quality error identified by the cleaning subsystem is inserted as a row in the error event fact table
8.2 Cleaning Deliverables
Data quality screen:
- Status report on data quality
- A gateway which lets only clean data go through
– Audit dimension
• Describes the data-quality context of a fact table record being loaded into the DW
• Attached to each fact record
• Aggregates the information from the error event table on a per record basis
8.2 Cleaning Deliverables
Example audit dimension attributes:
– Audit key (PK)
– Completeness category (text)
– Completeness score (integer)
– Number screens failed
– Max severity score
– Extract timestamp
– Clean timestamp
– …
• Core of the data cleaning engine
– Break data into atomic units
• E.g., breaking the address into street, number, city, zip and country
– Standardizing
• E.g., encoding of the sex: 0/1, M/F, m/f, male/female
– Verification
• E.g., does zip code 38106 belong to Braunschweig?
– Matching
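Two of the steps above, standardizing and breaking data into atomic units, could look like the following sketch (the encodings and the address format are assumptions for illustration):

```python
import re

# Standardizing: map all known sex encodings onto one convention
SEX_CODES = {"0": "F", "1": "M", "f": "F", "m": "M", "female": "F", "male": "M"}

def standardize_sex(value):
    return SEX_CODES.get(str(value).strip().lower(), "UNKNOWN")

# Atomizing: break a German-style address line into street, number, zip, city
def split_address(addr):
    m = re.match(r"^(.*?)\s+(\d+\w?),\s*(\d{5})\s+(.+)$", addr)
    return m.groups() if m else None

print(standardize_sex("female"))  # F
print(split_address("Muehlenpfordtstr. 23, 38106 Braunschweig"))
```

Any value that does not fit the expected pattern (here: `None` or `UNKNOWN`) would be routed to the error event table rather than silently loaded.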
8.2 Cleaning Engine
• Types of enforcement
– Column property enforcement
• Ensures that incoming data contains expected values
• NULL values in required columns
• Numeric values outside the expected high/low ranges
• Columns whose lengths are unexpectedly short or long
• Columns that contain values outside of valid value sets
• Adherence to a required pattern
• Hits against a list of known wrong values (if the list of acceptable values is too long)
• Spell-checker rejects
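The column-property checks above can be bundled into a small screening function; the column names, ranges and valid value sets here are illustrative assumptions:

```python
import re

def screen_row(row, valid_cities=("BS", "WOB", "HAN")):
    """Return a list of data quality errors found in one incoming row."""
    errors = []
    if row.get("customer_id") is None:                        # NULL in required column
        errors.append("customer_id is NULL")
    if not (0 <= row.get("age", -1) <= 120):                  # outside expected range
        errors.append("age outside [0, 120]")
    if not re.fullmatch(r"\d{5}", str(row.get("zip", ""))):   # required pattern
        errors.append("zip does not match #####")
    if row.get("city") not in valid_cities:                   # outside valid value set
        errors.append("city not in valid value set")
    return errors

good = {"customer_id": 7, "age": 34, "zip": "38106", "city": "BS"}
bad  = {"customer_id": None, "age": 250, "zip": "381", "city": "Paris"}
print(screen_row(good))  # []
print(screen_row(bad))   # four error descriptions
```

In a full cleaning engine each returned error would become one row in the error event fact table.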
8.2 Data quality checks
• Structure enforcement
– Focus on the relationship of columns to each other
– Proper primary and foreign keys
– Explicit and implicit hierarchies and relationships among groups of fields, e.g., a valid postal address
8.2 Data quality checks
• Data and Value rule enforcement
– E.g., a commercial customer cannot simultaneously be a limited company and a corporation
– Value rules can also provide probabilistic warnings that the data might be incorrect
• E.g., a boy named ‘Sue’ might be correct, but most probably it is a gender or name error, and such a record should be flagged for inspection
8.2 Data quality checks
• Overall Process Flow
8.2 Data cleaning
• Sometimes data is just garbage
– We shouldn’t load garbage into the DW
• Cleaning data manually takes just… forever!!!
8.2 Data Quality
• Use tools to clean data semi-automatically
– Open source tools
• E.g., Eobjects DataCleaner, Talend Open Profiler
– Non-open source
• Firstlogic, bought by Business Objects (now SAP)
• Vality, bought by Ascential (now IBM)
• Oracle Data Quality and Oracle Data Profiling
8.2 Data Quality
• Data cleaning process
– Use of regular expressions
8.2 Data Quality
• Regular expressions for date/time data
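A sketch of a regular-expression screen for ISO-style dates (the pattern is an illustrative assumption; it checks the shape only, so a real screen would add a calendar check, e.g., via `datetime.strptime`, on top):

```python
import re

# YYYY-MM-DD with month 01-12 and day 01-31
DATE_RE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def looks_like_date(value):
    return DATE_RE.match(value) is not None

print(looks_like_date("2009-05-29"))  # True
print(looks_like_date("29/05/2009"))  # False: wrong notation
print(looks_like_date("2009-13-01"))  # False: month 13 rejected by the pattern
```

Values failing the screen can either be rejected or normalized into the target convention before loading.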
8.2 Data Quality
• Core of the data cleaning engine
– Anomaly detection phase:
• A data anomaly is a piece of data which doesn’t fit into the domain of the rest of the data it is stored with
• “What is wrong with this picture?”
8.2 Cleaning Engine
• Anomaly detection
– Count the rows in a table while grouping on the column in question e.g.,
• SELECT city, count(*) FROM order_detail GROUP BY city
8.2 Anomaly detection
City | Count(*)
Bremen | 2
Berlin | 3
WOB | 4,500
BS | 12,000
HAN | 46,000
…
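The grouping query makes suspicious values stand out as abnormally small (or large) groups; a runnable sketch on a toy `order_detail` table:

```python
import sqlite3

# Toy order_detail table: 50 normal rows plus 2 outliers
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_detail (id INTEGER, city TEXT)")
rows = [(i, "BS") for i in range(50)] + [(90, "Bremen"), (91, "Bremen")]
con.executemany("INSERT INTO order_detail VALUES (?, ?)", rows)

# Group on the column in question; tiny groups are anomaly candidates
counts = con.execute(
    "SELECT city, COUNT(*) FROM order_detail GROUP BY city ORDER BY 2"
).fetchall()
print(counts)  # [('Bremen', 2), ('BS', 50)]
```

In the example table above, the 2 Bremen and 3 Berlin orders are exactly such suspiciously small groups compared to the tens of thousands for BS or HAN.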
• What if our table has 100 million rows with 250,000 distinct values?
– Use data sampling, e.g.,
• Divide the whole data into 1,000 pieces, and choose 1 record from each
• Add a random number column to the data, sort it and take the first 1,000 records
• Etc.
– A common mistake is to select a range of dates
• Most anomalies happen temporarily
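The two sampling strategies above can be sketched as follows (a plain list stands in for the large table; the fixed seed is only there to make the sketch reproducible):

```python
import random

rows = list(range(100_000))  # stand-in for a table far too large to profile fully

# (1) Systematic sampling: split into 1,000 pieces, take one record from each
step = len(rows) // 1000
systematic = rows[::step][:1000]

# (2) Random sampling: attach a random key, sort by it, take the first 1,000
random.seed(42)
shuffled = sorted(rows, key=lambda _: random.random())
random_sample = shuffled[:1000]

print(len(systematic), len(random_sample))  # 1000 1000
```

Both avoid the date-range mistake: the sample is spread over the whole table instead of one time window where anomalies may be absent (or overrepresented).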
8.2 Anomaly detection
• Data profiling
– E.g., observe name anomalies
8.2 Data Quality
• Data profiling
– Take a closer look at strange values
– Observe data distribution pattern
• Gaussian distribution
• Conflicting integrity conditions, e.g., different data types or boundary limitations
– Table/Attribute conflicts
8.2 Schema integration
• The basic goal is to make schemas compatible for integration
• Conformation usually needs manual interaction
– Conflicts need to be resolved semantically
– Rename entities/attributes
– Convert differing types, e.g., convert an entity to an attribute or a relationship
– Align cardinalities/functionalities
– Align different datatypes
8.2 Schema integration
• Schema integration is a semantic process
– This usually means a lot of manual work
– Computers can support the process by matching some (parts of) schemas
• There have been some approaches towards (semi-)automatic matching of schemas
– Matching is a complex process and usually only focuses on simple constructs like ‘Are two entities semantically equivalent?’
– The result is still rather error-prone…
8.2 Schema integration
• Schema Matching
– Label-based matching
• For each label in one schema, consider all labels of the other schema and every time gauge their semantic similarity
– Instance-based matching
• Looking at the instances (of entities or relationships) one can, e.g., find correlations between attributes, like ‘Are there duplicate tuples?’ or ‘Are the data distributions in their respective domains similar?’
– Structure-based matching
• Abstracting from the actual labels, only the structure of the schema is evaluated, e.g., regarding element types, depths in hierarchies, number and type of relationships, etc.
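A toy sketch of label-based matching: gauge the similarity between every label pair of two schemas and keep the best candidate per label. Real matchers use synonym dictionaries and ontologies; plain string similarity is only a stand-in here, and the schema labels are invented:

```python
from difflib import SequenceMatcher

schema_a = ["CustomerName", "ZipCode", "OrderDate"]
schema_b = ["CUST_NAME", "POSTAL_CODE", "ORDER_DT"]

def similarity(x, y):
    # Case-insensitive string similarity in [0, 1]
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# For each label of schema A, pick the most similar label of schema B
matches = {a: max(schema_b, key=lambda b: similarity(a, b)) for a in schema_a}
print(matches)
```

Even this naive matcher pairs the labels correctly here, but as the lecture notes, the result of automatic matching is still rather error-prone and needs human review.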
8.2 Schema integration
• If integration is query-driven only Schema Mapping is needed
– Mapping from one or more source schemas to a target schema
8.2 Schema integration
[Figure: a mapping compiler takes the high-level mapping (correspondences between source schema S and target schema T) and compiles it into low-level mappings that move the data]
• Schema Mapping
– Abstracting from the actual labels, regarding element types, depths in hierarchies, number and type of relationships, etc.
8.2 Schema integration
[Figure: two Product schemas to be mapped]
Product(ProdID: Decimal, Product: VARCHAR(50), Group: VARCHAR(50), Categ: VARCHAR(50))
Product(ID: Decimal, Product: VARCHAR(50), GroupID: Decimal)
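A sketch of the transformation such a mapping would generate, assuming the mapping goes from the four-column schema to the three-column one; the group-name-to-ID lookup table is a hypothetical dimension lookup:

```python
# Assumed lookup resolving group names to surrogate GroupIDs
group_ids = {"Beverages": 1, "Snacks": 2}

source_rows = [
    {"ProdID": 100, "Product": "Cola", "Group": "Beverages", "Categ": "Food"},
    {"ProdID": 101, "Product": "Chips", "Group": "Snacks", "Categ": "Food"},
]

# Rename ProdID -> ID, replace the inline group name by its GroupID,
# and drop the Categ column, which has no counterpart in the target
target_rows = [
    {"ID": r["ProdID"], "Product": r["Product"], "GroupID": group_ids[r["Group"]]}
    for r in source_rows
]
print(target_rows)
```

In practice this transformation would be emitted as a query (e.g., SQL or XQuery) by the mapping compiler rather than hand-written.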
• Schema mapping automation
– Complex problem, based on heuristics
– Idea:
• Based on given schemas and a high-level mapping between them
• Generate a set of queries that transform and integrate data from the sources to conform to the target schema
– Problems:
• Generation of the correct query considering the schemas and the mappings
• Guaranteeing that the transformed data corresponds to the target schema
8.2 Schema integration
• Schema integration in praxis
– BEA AquaLogic Data Services
• Special feature: easy-to-use modeling: “Mappings and transformations can be designed in an easy-to-use GUI tool using a library of over 200 functions. For complex mappings and transformations, architects and developers can bypass the GUI tool and use an XQuery source code editor to define or edit services.”
8.2 Schema integration
www.bea.com
• What tools are actually given to support integration?
– Data translation tool
• Transforms binary data into XML
• Transforms XML to binary data
– Data transformation tool
• Transforms one XML into another XML
– Base idea
• Transform data to application-specific XML → transform to XML specific to the other application / a general schema → transform back to binary
• Note: the integration work still has to be done manually
8.2 Schema integration
• “I can’t afford expensive BEA consultants and the AquaLogic Integration Suite, what now??”
– Do it completely yourself
• Most used technologies can be found as open source projects (data mappers, XSL engines, XSL editors, etc.)
– Do it yourself with specialized tools
• Many companies and open source projects are specialized in developing data integration and transformation tools
– CloverETL
– Altova MapForce
– BusinessObjects Data Integrator
– etc.
8.2 Schema integration
• Altova MapForce
– Same idea as the BEA integrator
• Also based on XSLT and a data description language
– Editors for binary/DB-to-XML mapping
– Editor for XSL transformation
– Automatic generation of data sources, web services, and transformation modules in Java, C#, C++
8.2 Schema integration
• The loading process can be broken down into 2 different types:
– Initial load
– Continuous load (loading over time)
8.2 Loading
• Issues
– Huge volumes of data to be loaded
– Small time window available when the warehouse can be taken offline (usually nights)
– When to build index and summary tables
– Allow system administrators to monitor, cancel, resume, change load rates
– Recover gracefully: restart after failure from where you were and without loss of data integrity
8.2 Loading
• Initial load
– Deliver dimension tables
• Create and assign surrogate keys each time a new cleaned and conformed dimension record has to be loaded
• Write dimensions to disk as physical tables, in the proper dimensional format
8.2 Loading
• Continuous load (loading over time)
– Must be scheduled and processed in a specific order to maintain integrity, completeness, and a satisfactory level of trust
– Should be the most carefully planned step in data warehousing, or it can lead to:
• Error duplication
• Exaggeration of inconsistencies in data
8.2 Loading
• Continuous load of facts
– Separate updates from inserts
– Drop any indexes not required to support updates
– Load updates
– Drop all remaining indexes
– Load inserts through bulk loaders
– Rebuild indexes
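The load order above can be sketched on a toy fact table (the table, index, and values are invented; a real warehouse would use its DBMS bulk loader instead of plain inserts):

```python
import sqlite3

# Toy fact table with an index that would slow down the bulk load
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
con.execute("CREATE INDEX idx_amount ON fact_sales (amount)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

con.execute("DROP INDEX idx_amount")                             # drop unneeded indexes
con.execute("UPDATE fact_sales SET amount = 25.0 WHERE id = 2")  # load updates first
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",          # then bulk inserts
                [(3, 30.0), (4, 40.0)])
con.execute("CREATE INDEX idx_amount ON fact_sales (amount)")    # rebuild indexes

count = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # 4
```

Dropping and rebuilding the index once is usually far cheaper than maintaining it row by row during a large insert batch.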
8.2 Loading
• Metadata - data about data
– In DW, metadata describe the contents of a data warehouse and how to use it
• What information exists in a data warehouse, what the information means, how it was derived, from what source systems it comes, when it was created, what pre-built reports and analyses exist for manipulating the information, etc.
8.3 Metadata
• Types of metadata in DW
– Source system metadata
– Data staging metadata
– DBMS metadata
8.3 Metadata
• Source system metadata
– Source specifications
• E.g., repositories, and source logical schemas
– Source descriptive information
• E.g., ownership descriptions, update frequencies and access methods
– Process information
• E.g., job schedules and extraction code
8.3 Metadata
• Data staging metadata
– Data acquisition information, such as data transmission scheduling and results, and file usage
– Dimension table management, such as definitions of dimensions, and surrogate key assignments
– Transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
– Audit, job logs and documentation, such as data lineage records, data transform logs
8.3 Metadata
– E.g., Cube description metadata
8.3 Metadata
• Business Intelligence (BI)
– Principles of Data Mining
– Association Rule Mining