Previews of TDWI course books are provided as an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended. This preview shows selected pages that are representative of the entire course book. The pages shown are not consecutive. The page numbers as they appear in the actual course material are shown at the bottom of each page. All table-of-contents pages are included to illustrate all of the topics covered by a course.
TDWI Data Integration Basics for Business and IT Professionals
The Data Warehousing Institute takes pride in the educational soundness and technical accuracy of all of our courses. Please give us your comments – we’d like to hear from you. Address your feedback to:
Dave Wells at TDWI defines data integration as “the process of combining data from two or more disparate but related data sources in such a way that data from each source increases the overall information value of the resulting body of data.” Consider these key points from the definition:
• Data integration is a process. As with all processes, data integration has inputs, events and activities that lead to production of a product.
• Data integration combines data from multiple related data sources.
• The goal of data integration is increased information value from a body of data.
INTEGRATION ACTIVITIES
The activities of the data integration process are those steps necessary to acquire data from sources, transform the data to achieve desirable properties of integrated data, and store integrated data so it is available for use. Data transformation steps – those that change the data – are the most complex of all integration activities. The goals when combining data include removing conflict, establishing data relationships, improving consistency of representation, and ensuring data quality. Business rules provide the foundation for data transformation logic. Transformation based on business rules serves to align data structure and content with real things in the business – an essential part of increasing information value of the data.
INTEGRATION RESULTS
The product of a data integration process is a database that contains integrated data. Desirable characteristics of integrated data include:
• Every data element is connected with and related to other data elements.
• Each data element complements the surrounding data by collecting a related business fact, adding clarity, and providing added context.
• Each data element contains a unique and non-redundant business fact, or if redundant avoids conflict and uncertainty of multiple values for the same business fact.
• The lineage of each data element is known and recorded; every data element is traceable to the source from which it was obtained.
Data Integration Concepts TDWI Data Integration Basics
Data Integration Defined What ISN’T Data Integration
STOVEPIPE DATA
Data organized around business processes, business organizations, or transaction systems is not integrated. A payroll system and a personnel system, for example, each collect, store, and use employee data. When each system independently manages its own employee data, redundancy conflicts are certain to occur. When each system uses its own means of identifying employees, the situation is aggravated by the inability to navigate between systems and to reconcile conflicts and discrepancies. These circumstances are common throughout the legacy applications of most organizations. More recently many organizations have developed stovepipe data marts, where each data mart is designed to meet the needs of a specific process or work group. When independent data definitions and transformation logic are defined for each data mart, no integration occurs. Non-integrated data marts may use more up-to-date technology than legacy systems, but they do nothing to resolve redundancy and conflict in the data.
CO-LOCATED DATA
Putting all of the data into a single database does not by itself achieve integration. The collective databases that are sometimes built – whether we call them data warehouse, operational data store, reporting database, or another name – are not integrated simply because they are a single database. The same issues of confusion and conflict occur when these databases contain islands of disconnected data, unresolved redundancy, and conflicting values for a single business fact.
Data Integration Context Business Context – The Need for Data Integration
DRIVERS OF INTEGRATION
Many different data and technology environments create a need for data integration. Although distinctly different in goals and purpose, the issues, the need, and the integration process are similar for each of:
• Non-integrated legacy systems where multiple systems independently collect and manage redundant and overlapping data.
• ERP islands with different and non-integrated ERP systems for various business functions.
• Data warehousing which brings together data from many disparate sources.
• Business intelligence which depends on a foundation of integrated data to deliver meaningful information.
• Mergers and acquisitions where dissimilar data resources of two enterprises must be combined.
• Cross-organizational metrics to provide consistent business measures that involve multiple business processes, data sources, and computer systems.
INTEGRATION PROJECTS
The drivers itemized above typically result in two distinct kinds of data integration projects:
• Recurring integration projects are needed when data needs to be integrated on a continuous basis. These projects are typical for drivers such as cross-organizational metrics, business intelligence, and data warehousing. Note that the term “recurring integration” does not suggest that the project persists indefinitely, but that the integration process can be executed continuously.
• One-time integration projects are needed when the data integration process needs to be executed only once. These kinds of projects are typical of data conversion to initially load ERP systems, historical data collection for initial data warehouse loads, and combining of data following mergers or acquisitions.
Although the nature of the projects differs, the integration issues and activities are similar for both types of projects.
Selecting Data Sources Evaluating Sources – Data with Integration Value
USABLE DATA SOURCES
Each prospective data source needs to be evaluated in terms of usability to help determine its real value as a source for data integration. A subjective assessment of usability criteria using a five point scale (1=poor, 5=excellent) is sufficient for the purpose. Usability criteria for evaluation include:

Criteria | Assessment Questions
Availability | How available and accessible is the data? Are there technical obstacles to access? Or ownership and access authority issues?
Understandability | How easily understood is the data? Is it well documented? Does someone in the organization have depth of knowledge? Who works regularly with this data?
Stability | How frequently do data structures change? What is the history of change for the data? What is the expected life span of the potential data source?
Accuracy | How reliable is the data? Do the business people who work with the data trust it?
Timeliness | When and how often is the data updated? How current is the data? How much history is available? How available is it for extraction?
Completeness | Does the scope of data correspond to the scope of the data warehouse? Is any data missing?
Granularity | Is the source the lowest available grain (most detailed level) for this data?

MANAGEABLE DATA SOURCES
The degree to which a data source is easily managed is also important when selecting data sources. It is particularly important for those data sources that will be used routinely for ongoing integration activities such as data warehousing. Consider the following manageability criteria:
Criteria | Assessment Questions
Origin of Data | Is this data source the first point-of-capture for the data? Is it a reliable source for all instances of the data?
Ownership of Data | Who owns the data and the system that collects it? Is it considered to be the system-of-record for the facts that it collects?
System Management | Is the data collection system managed internally or externally? By a service bureau? Internal IT department? End-user department?
Usage of Data | Who uses this data? For what purpose? Does the usage naturally lead to feedback and verification of data quality?
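The usability assessment described above lends itself to a simple scorecard. The following Python sketch is one hypothetical way to record the five-point ratings and rank candidate sources. The source names, the ratings, and the choice of an unweighted average are illustrative assumptions, not part of the course material.

```python
from statistics import mean

# Usability criteria from the table above; each is rated 1 (poor) to 5 (excellent).
USABILITY_CRITERIA = (
    "availability", "understandability", "stability", "accuracy",
    "timeliness", "completeness", "granularity",
)


def usability_score(ratings):
    """Return the unweighted mean rating; reject missing or out-of-range ratings."""
    missing = [c for c in USABILITY_CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for criterion in USABILITY_CRITERIA:
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"{criterion} rating must be between 1 and 5")
    return mean(ratings[c] for c in USABILITY_CRITERIA)


def rank_sources(candidates):
    """Sort candidate source names from highest to lowest usability score."""
    return sorted(
        candidates,
        key=lambda name: usability_score(candidates[name]),
        reverse=True,
    )


# Hypothetical ratings for two candidate sources (assumed values).
candidates = {
    "hr_system": dict(zip(USABILITY_CRITERIA, (4, 4, 5, 4, 3, 4, 5))),
    "regional_spreadsheet": dict(zip(USABILITY_CRITERIA, (2, 3, 2, 3, 4, 2, 4))),
}
ranking = rank_sources(candidates)
```

An unweighted mean treats every criterion as equally important. A project that cares most about, say, accuracy could apply weights instead. The manageability criteria could be recorded in the same way.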
TDWI Data Integration Basics Data Integration Systems
Module 3 Data Integration Systems
Topic
• Getting Data
• Transforming Data
• Storing Data
Getting Data Source-to-Target Data Element Mapping
SAMPLE MATRIX
The matrix on the facing page illustrates an example of mapping source data to target data at the data element level. Data element mapping is not necessarily complex. It is just detailed and sometimes tedious. This level of mapping is necessary to understand requirements for migration of data from non-integrated to integrated data stores. This detailed level of mapping provides information that is essential before transformation design can begin. In this example we can see that:
• Some data elements have one-to-one associations and identical names (city, state, and zip_code for example). Do they share common formats and allowable values?
• Some data elements have one-to-one associations and similar but different names (sex / gender, date_of_birth / birthdate). Do they share common formats and allowable values?
• Some data elements have one-to-many associations (employee_name to first_name, last_name, and middle_initial). Clearly some kind of data transformation will be needed here.
• Some target data elements (plan_type, participation_end_date, participation_begin_date, plan_code) have no apparent data source. Will the data be manually populated? Is there another source? Is it acceptable not to collect this data?
• Some source data elements (phone numbers and emergency contact data from PlayNation) have no corresponding target. Will the data be lost? Should the target be modified?
• Some collections of data elements (spouse and children benefits coverage, for example) are organized in significantly different ways. Complex data transformations may be needed here.
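The observations above can be checked mechanically once the mapping is written down as data. The Python sketch below is a minimal illustration, not the course's matrix. The element names city, state, zip_code, sex, gender, date_of_birth, birthdate, employee_name, first_name, last_name, middle_initial, plan_type, and plan_code come from the text. The names home_phone and emergency_contact, the assignment of sex and gender to source and target, and the "Last, First M" name format are assumptions made for the example.

```python
from collections import Counter

# Target element -> source element that populates it (None = no apparent source).
# Which side uses "sex" versus "gender" is an assumption of this sketch.
ELEMENT_MAP = {
    "city": "city",
    "state": "state",
    "zip_code": "zip_code",
    "gender": "sex",
    "birthdate": "date_of_birth",
    "first_name": "employee_name",
    "last_name": "employee_name",
    "middle_initial": "employee_name",
    "plan_type": None,
    "plan_code": None,
}

# Hypothetical source layout; home_phone and emergency_contact are assumed names.
SOURCE_ELEMENTS = {
    "city", "state", "zip_code", "sex", "date_of_birth",
    "employee_name", "home_phone", "emergency_contact",
}


def unmapped_targets(element_map):
    """Target elements with no apparent data source."""
    return sorted(t for t, s in element_map.items() if s is None)


def unmapped_sources(element_map, source_elements):
    """Source elements with no corresponding target."""
    used = {s for s in element_map.values() if s is not None}
    return sorted(source_elements - used)


def one_to_many_sources(element_map):
    """Source elements that feed more than one target element."""
    counts = Counter(s for s in element_map.values() if s is not None)
    return sorted(s for s, n in counts.items() if n > 1)


def split_employee_name(employee_name):
    """Split an assumed 'Last, First M' name into three target elements."""
    last, _, rest = employee_name.partition(",")
    parts = rest.split()
    return {
        "first_name": parts[0] if parts else "",
        "last_name": last.strip(),
        "middle_initial": parts[1][0] if len(parts) > 1 else "",
    }
```

The three report functions correspond to the questions a mapping review has to answer: which targets need another source, which source data would be lost, and where a transformation is unavoidable.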
[Figure: data capture options, organized by scope of data (all data or changed data) and direction of movement (push to target or pull from source)]

 | PUSH TO TARGET | PULL FROM SOURCE
ALL DATA | replicate source files / tables | extract source files / tables
CHANGED DATA | replicate source changes or transactions | extract source changes or transactions

Annotations shown on the figure:
• Works well for one time data conversion such as combining data from two systems, initial load of warehousing data, and start-up data for ERP implementation.
• Works well for ongoing data integration with small amounts of data.
• OK for ongoing data integration (i.e., data warehousing) when data volume is small, and timeliness of data is not important.
• Works well for ongoing data integration when real-time data is desired.
Data capture design seeks to get all of the data needed as efficiently as is practical, and to minimize impact on the source systems from which data is obtained. Some of the questions that help to design and develop an optimal data capture process are:
• What constraints does the source system impose? Source systems with limited batch processing time, or those that require 24x7 availability demand special consideration and careful design.
• Will data be captured from the source only one time, or will data capture be ongoing? One-time data capture processes typically consider simplicity, reliability, and speed of development to be more important than processing efficiency. An extract of all data from a source is often the most effective means of acquiring data.
• What volume of data is expected with each instance of data capture? Very large data volumes need special attention to efficiency of acquisition. Capturing only data changes is ideal when changes can reliably be detected. A source system capable of pushing changes may offer an ideal solution.
• Are all occurrences (rows/records) or only a subset needed? If only a subset is needed, then consider the percent of the total body of data that is needed. Small percentage indicates selection as part of the extract process. Large percentage suggests selection after extract.
• Will capture of data changes meet the need for ongoing data capture, or is a full extract needed each time? Can changes be reliably detected in the source system? When changes can’t be detected with confidence, then comparing generations of full extracts may be required. Changes may still be lost, however, depending on the frequency of extract and the volatility of the data.
• Can the source system push data to the integration system, or must the data be pulled by the integration system? For particularly sensitive source systems, push is the best option whenever possible. A push approach allows the source system to control impact of data acquisition.
• What technology is used to store the source data? What technologies are available for data capture? Exploit the available technology to achieve rapidly developed and easy to maintain data acquisition processes. Consider available ETL tools, DBMS replication features, database transaction logs, etc.
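One of the questions above mentions comparing generations of full extracts when changes cannot be detected reliably in the source. The Python sketch below shows that comparison in its simplest form. It assumes each extract fits in memory as a dictionary keyed by a stable identifier. The record layout and sample values are hypothetical.

```python
def detect_changes(previous, current):
    """Compare two full extracts, each a dict keyed by record identifier.

    Returns the keys that were inserted, updated, or deleted between
    the previous generation and the current one.
    """
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Two hypothetical generations of an employee extract.
previous_extract = {
    101: {"last_name": "Lee", "department": "Sales"},
    102: {"last_name": "Ortiz", "department": "Finance"},
    103: {"last_name": "Novak", "department": "Stores"},
}
current_extract = {
    101: {"last_name": "Lee", "department": "Marketing"},  # changed department
    103: {"last_name": "Novak", "department": "Stores"},   # unchanged
    104: {"last_name": "Singh", "department": "Stores"},   # new record
}
changes = detect_changes(previous_extract, current_extract)
```

A comparison like this only sees the state at each extract. If a record changes twice between extracts, the intermediate value is never captured, which is the loss the text warns about.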
Data correctness defects exist whenever data is found to be in violation of correctness rules. Data cleansing is a process of taking action to remove defects of data quality. The four common kinds of actions include:
• Detection – Knowing when a defect exists.
• Repair – Fixing a defect in data that has already been delivered.
• Correction – Fixing a data quality defect before the data is delivered.
• Prevention – Fixing a process deficiency that allows defects to occur.
Eleven types of data correctness rules, when intersected with four kinds of data cleansing activities (detect, repair, correct, prevent), yield forty-four distinct actions that may be taken to improve data correctness.
DETECTING DATA QUALITY DEFECTS
Validation, verification, and inspection are the common techniques used to detect data quality defects. Validation tests data against expressed data quality rules. Verification tests against other reliable sources (e.g., asking a customer to verify their address). Inspection conducts a thorough examination of data to discover properties that might not be found using validation and verification techniques. Where validation and verification assume known questions (e.g., business rules and alternative sources), inspection is a process of data-driven discovery where the questions aren’t necessarily known in advance.
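Validation, as described above, is the easiest of the three techniques to automate because the rules are stated in advance. The Python sketch below shows detection only. It does not repair or correct anything. The three rules, the allowable gender codes, and the five-digit zip_code format are assumptions chosen for illustration and are not rules from the course.

```python
import re
from datetime import date

# Assumed allowable values for the gender element.
ALLOWED_GENDER_CODES = {"F", "M", "U"}


def validate_record(record, as_of):
    """Return a list of rule violations found in one record.

    An empty list means no defect was detected by these rules; it does
    not prove the record is correct.
    """
    defects = []
    zip_code = str(record.get("zip_code") or "")
    if not re.fullmatch(r"\d{5}", zip_code):
        defects.append("zip_code must be five digits")
    if record.get("gender") not in ALLOWED_GENDER_CODES:
        defects.append("gender must be one of the allowed codes")
    birthdate = record.get("birthdate")
    if birthdate is None or birthdate > as_of:
        defects.append("birthdate must be present and not in the future")
    return defects


# Hypothetical records checked as of a fixed date so results are repeatable.
as_of = date(2024, 1, 1)
clean = {"zip_code": "98101", "gender": "F", "birthdate": date(1985, 6, 30)}
dirty = {"zip_code": "9810", "gender": "X", "birthdate": date(2031, 1, 1)}
clean_defects = validate_record(clean, as_of)
dirty_defects = validate_record(dirty, as_of)
```

Verification would need a second, trusted source to compare against, and inspection would profile the data without fixed rules, so neither is covered by this sketch.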
Data integrity defects exist whenever data is found to be in violation of integrity rules. Data cleansing is a process of taking action to remove defects of data quality. The four common kinds of actions are identical to those discussed for data correctness defects: detect, repair, correct, and prevent. Seven types of data integrity rules, when intersected with four kinds of data cleansing activities, yield twenty-eight distinct actions that may be taken to improve data integrity. When combined with the forty-four actions for data correctness, a total of seventy-two data cleansing actions are possible.
Continuous Quality Improvement Planning and Execution
PLANNED DATA QUALITY
Developing a plan for data cleansing includes the activities necessary to improve data quality, monitor achievement of quality goals, and evolve the data cleansing strategy. Data quality planning consumes time, effort, and resources – it is not free. Like most things, when done well, data quality strategy takes more effort to plan than to execute. The cost and effort of planning is supported by this simple truth: Good data quality is always the result of good planning. Only poor quality happens without planning. A comprehensive data quality plan includes:
Defined Scope addressing questions such as which data is within the scope of effort and which rule types are to be applied. While you might be inclined to say “all data and all rules,” practical constraints of time and resources may demand that the scope of effort be reduced.
Goals and Measures that express quantifiable objectives of the data cleansing plan. Goals typically quantify a defect rate – e.g., 99.5% accuracy or zero reference defects. Measures are needed to assess the current state and to evaluate progress toward meeting the plan’s goals.
Actions describe what steps will be taken to improve quality and achieve the planned goals. This course has identified seventy-two common actions for data cleansing. No plan is likely to include all of them. Is the plan to detect errors and audit data quality? To correct or repair defects? To prevent defects at the source?
Roles, Resources and Responsibilities are assigned to detect, correct, and prevent data quality defects, as well as to continuously measure and monitor.
Scheduling attaches a timeframe to the goals of the plan. Consider the relative priorities of data quality issues and dependencies among activities to develop a realistic timeline.
Continuity shifts data quality improvement from a project to an ongoing data management practice. Ideally, a data-cleansing plan seeks continuous improvement of data quality. Continuous quality improvement is achieved through regular planning, incremental improvements, and routine communication and feedback.
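The Goals and Measures element of the plan calls for a quantified goal and a way to measure against it. As a small worked example, the Python sketch below computes an accuracy rate from a count of records checked and defects found, then compares it with the 99.5% goal mentioned in the text. The function names and the sample counts are assumptions.

```python
def accuracy_rate(records_checked, defects_found):
    """Percentage of checked records in which no defect was found."""
    if records_checked <= 0:
        raise ValueError("records_checked must be positive")
    if not 0 <= defects_found <= records_checked:
        raise ValueError("defects_found must be between 0 and records_checked")
    return 100.0 * (records_checked - defects_found) / records_checked


def meets_goal(records_checked, defects_found, goal_percent=99.5):
    """True when the measured accuracy reaches the stated goal."""
    return accuracy_rate(records_checked, defects_found) >= goal_percent


# Hypothetical audit: 2,000 records checked, 10 defects found.
measured = accuracy_rate(2000, 10)
on_target = meets_goal(2000, 10)
```

Recording the same measure at regular intervals gives the trend needed for the continuous improvement described under Continuity.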
TDWI Data Integration Basics Data Integration Roles
Developing and operating data integration systems are processes that demand both business and technical knowledge. Understanding how data is used, what business rules apply, where and how it is collected, and the degree to which it is trusted are examples of needs where business knowledge is paramount. Storage methods, data structures, and database capabilities are examples of needs where technical skills are critical.
ROLES AND RESPONSIBILITIES FRAMEWORK
The five stages of the data integration lifecycle – understand the data, get the data, change the data, store the data, and use the data – provide the foundation to define a roles and responsibilities structure for data integration. When intersected with typical information systems lifecycle phases – planning, analysis, design, construction, implementation, and operation (or execution) – they yield a roles and responsibilities matrix as shown on the facing page. Note that the cells in the matrix do not represent roles or activities, but categories of work within which activities, roles, and responsibilities need to be identified.
Understanding the Data Planning and Analysis Roles
• Conflicting business definitions and terminology
• Different ways of identifying data
• Data overlap and inconsistency throughout business transaction systems
• Hidden meaning, missing data and much more …

• Deciding which data to use
• Mapping transaction data sources to integrated data targets
• Detecting and capturing data changes
• Timing and source data readiness and much more …

• Business rules for data transformation
• Auditing and improving data quality
• Connecting data from multiple and disparate transaction systems
• Delivering summary data without loss of detail and much more …

• Moving data securely over computer networks
• Fast and reliable transport for large amounts of data
• Freshness of data and timing of data loads
• Availability and “up time” vs. time required to load and much more …

• Access and navigation of the data
• Understanding contents of integrated data stores
• Quality, trust, and confidence
• Feedback and continuous improvement and much more …
Design and construction activities of data transformation build the processes to actually change the data. These activities include:
• Identify rule dependencies to develop a modular design that executes interdependent rules in the correct sequence. Rule dependency exists when execution of a transformation rule is based upon the result of another rule.
• Design and build transformation modules that package a collection of interdependent rules as a single, executable computer procedure.
• Identify time dependencies to develop a process design that executes transformation modules in the correct sequence. Time dependency exists when one transformation rule must execute before another can be executed.
• Design and assemble transformation processes as a set of modules to be executed together in a specific sequence.
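Sequencing rules by their dependencies, as the activities above describe, is a topological ordering problem. The Python sketch below uses the standard library's graphlib module (Python 3.9 or later) to derive one valid execution order. The rule names and their dependencies are hypothetical and serve only to illustrate the idea.

```python
from graphlib import CycleError, TopologicalSorter

# Transformation rule -> set of rules whose results it depends on.
# All rule names here are hypothetical.
RULE_DEPENDENCIES = {
    "standardize_name": set(),
    "split_name": {"standardize_name"},
    "derive_gender_code": set(),
    "match_employee": {"split_name", "derive_gender_code"},
}


def execution_order(dependencies):
    """Return one order in which every rule runs after the rules it depends on.

    Raises graphlib.CycleError when the dependencies are circular, which
    means no valid sequence exists.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(RULE_DEPENDENCIES)
```

Several valid orders may exist, so the result should be treated as one acceptable sequence and not the only one. A circular dependency is reported as an error and usually points to a rule that needs to be split.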
ROLES AND RESPONSIBILITIES
Applying the roles and responsibilities model produces a result such as that shown below. Responsibility designations may differ for your organization and activities may need to be tailored to your specific project.
Activity | Business | IT
Identify Rule Dependencies | Consult | Decide
Design and Build Transformation Modules | Inform | Decide
Identify Time Dependencies | Consult | Decide
Design and Assemble Transformation Processes | Inform | Decide
Value of integrated data is realized when the data is used to achieve positive business outcomes – executing the entire data-to-value chain. Usage activities include:
• Test operational features and functions to ensure that they work correctly and meet business needs. Formalize successful testing by documenting system acceptance.
• Test decision-support and analytic capabilities to ensure that they work correctly and meet business needs. Formalize successful testing by documenting system acceptance.
• Employ operational system capabilities to execute and record business transactions, to carry out day-to-day work, and to obtain data and information needed for operational activities.
• Employ decision-support and analytic capabilities to inform decision-making processes, analyze business outcomes, forecast business trends, and enlighten planning processes.
• Manage data quality by providing continuous feedback about the quality of the data, and by correcting business process issues that lead to data quality problems.
ROLES AND RESPONSIBILITIES
Applying the roles and responsibilities model produces a result such as that shown below. Responsibility designations may differ for your organization and activities may need to be tailored to your specific project.
Activity | Business | IT
Test Operational Features and Functions | Decide | Consult
Test Decision-Support and Analytic Capabilities | Decide | Consult
Employ Operational System Capabilities | Decide | Consult
Employ Decision-Support and Analytic Capabilities | Decide | Consult
Manage Data Quality | Decide by Consensus (Business and IT)
In Conclusion Best Practices for Data Integration Success
PROCESS AND TEAMWORK
Four key elements make up successful data integration projects regardless of the reason for data integration:
• Data integration is managed as a process with six distinct stages – planning, analysis, design, construction, implementation, and execution.
• Each stage of the process has activities to focus on every aspect of data integration – understanding the data, getting the data, changing the data, storing the data, and using the data.
• Every activity has designated roles and responsibilities.
• Business and IT work together as a team to achieve successful data integration.
MAKING TEAMWORK WORK
To achieve real teamwork every stakeholder in the data integration project, whether representing business or IT, must be able to fill multiple roles – sometimes with decision authority and sometimes in a consulting and advisory role. With clearly designated roles and responsibilities for each activity, teamwork is achieved when:
• Business has significant decision-making responsibility.
• IT has significant decision-making responsibility.
• Business has a consulting and advisory role in IT decisions.
• IT has a consulting and advisory role in business decisions.
• Critical decisions are made by consensus of business and IT.
A MODEL FOR INTEGRATION TEAMWORK
The following two pages summarize the set of activities discussed throughout this module and suggest typical designation of business and IT roles for each activity. Note that decision-making roles are divided between business and IT, and that each supports and advises the other in a consulting capacity as needed. This model is not presented as the “right way” for all integration projects. It may readily be adapted to your data integration project by adding activities unique to the project, removing activities not needed for the project, and adjusting responsibilities to fit the organization and culture in which the project will be performed. It is less important which roles and responsibilities are decided than that they are decided at the start of the project.
TDWI Data Integration Basics Basis of Course Examples
E-Max is a consumer electronics retailer with sales outlets that include brick-and-mortar stores, an internet outlet, and catalog sales. E-Max acquires PlayNation, a small chain of electronic gaming stores clustered locally in a few regions throughout the US and Canada.
E-Max has a mature IT department that supports many operational systems and is in the early stages of building a data warehouse. PlayNation has an ad-hoc systems environment typical of small companies. Much of the data management is done locally by each regional office. Critical corporate systems for finance and payroll are operated by an external service bureau. Most internal data is stored in spreadsheets complemented by limited use of a Microsoft Access® database.
The most pressing data integration needs are related to workforce and payroll data. Compliance considerations, common paymaster requirements, and the move to an international workforce (with PlayNation’s Canada stores) drive E-Max to focus first on these areas.
After satisfying the urgent need to integrate workforce and payroll data, attention will turn to other operational systems and data warehousing.
HRMS Data
• employee
• appointment
• job postings
• applicants
• position
• salary and wage
• benefits programs
• benefits enrollment
• personnel actions
• salary history
• employee performance history
• benefits participation history

Payroll Data
• employee (common with HRMS)
• appointment (common with HRMS)
• position (common with HRMS)
• funding distribution
• dollar balances
• employee deductions
• employer contributions
• payment history and audit trail
• direct deposit enrollment
• direct deposit transmittal
• time and commission transactions
• deduction history