TN PM Tutorial
10/14/2014 1:00:00 PM
"Testing the Data Warehouse—
Big Data, Big Problems"
Presented by:
Geoff Horne
NZ/OZ/USTester Magazine
Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com
Geoff Horne
NZTester Magazine
Geoff Horne has an extensive background in test program/project directorship and management, architecture, and general consulting. In New Zealand Geoff established and ran ISQA as a testing consultancy which enjoys a local and international clientele in Australia, the United States, and the United Kingdom. He has held senior test management roles across a number of diverse industry sectors, and is editor and publisher of the recently launched NZTester magazine. Geoff has authored a variety of white papers on software testing and is a regular speaker at the STAR conferences. Married with four children, he enjoys writing and recording contemporary Christian music.
Testing the Data Warehouse
Geoff Horne, NZTester Magazine
[email protected]
October 2014
NZTester
www.nztester.co.nz
About Me:
• 37 years in IT in various roles including development, sales, consulting, IT
management and testing.
• The last 19 years have been exclusively in test/QA management & consulting.
• Extensive background in programme/project test management, advisory services,
governance, architecture and general consulting.
• Established & ran ISQA as a testing consultancy and practice 2000-2007 (it now
runs as a vehicle for my contracting services).
• Founder & publisher of NZTester, OZTester and USTester Magazines for which I also
undertake writing, editing & analysis duties. As this is my first foray into publishing
& journalism, I'm on a steep learning curve but thoroughly enjoying myself.
• Recently took on my first assignment as a software testing industry analyst with a
large American IT technology company; speaking at conferences and delivering
white papers and webinars.
Agenda:
• Introduction to Data Warehousing as a Solution to a Problem
• Why Test?
• Data Warehouse Testing References
• What to Test?
• Where to Test?
• Test Order
• Typical Data Warehouse Issues
• Transformation Rules
• Source to Data Warehouse – Unit Testing
• Source to Data Warehouse – Integration Testing
• Continually Changing Source Systems
• Planning for Data Warehouse Testing
• Planning Testing for Common Data Warehouse Issues
• Useful Skills for Data Warehouse Testing
• Considerations for Selecting Data Warehouse Testers
• Common QA Tasks for the Data Warehouse Test Team
• Analyse Source Data before and after Extraction to Staging
• Verifying Corrected and Cleansed Source Data in Staging
• Verifying Matched and Consolidated Data
• Verifying Transformed/Enhanced/Calculated Data to Target Tables
• End-to-End Testing
• Acceptance Testing
• Performance Testing
• Regression Testing
• Opportunities for Automation
• Questions
Which came first, the bug or the test?
Examples:
• Walmart handles 1m transactions per hour imported into
databases containing 2.5 petabytes of data
• Google processes 25 petabytes of data per day (= ~25,600
terabytes)
• AT&T transfers 30 petabytes per day
• 90 trillion emails are sent per year
• World of Warcraft uses 1.3 petabytes of storage
• Facebook stores 2.5+ petabytes of user data including 50 billion
photos and processes 50+ terabytes per day
Examples:
• Wayback Machine stores 3 petabytes of data and processes 100
terabytes per day
• eBay stores 6.5 petabytes of data and processes 100 terabytes per
month
• CERN’s Large Hadron Collider generates 15 petabytes per year
• NASA Center for Climate Simulation stores 32 petabytes of climate
observations
• Amazon.com handles millions of back-end operations every day
and operates the three largest Linux databases in the world
Source: Wikipedia, TheBigDataGroup.com
Characteristics – the 3 + 1 Vs:
• Volume: more data than ever before, most of the world’s data is
un-, semi- or multi-structured
• Variety: more sources than ever before – social, web logs, machine
logs, photos, documents, geotags, video….
• Velocity: some data only has value for a short space of time –
relevance engines, financial fraud sensors, early warning sensors….
• Vitality: agility is required in analytics, able to adapt quickly to
changing business needs
Enterprise Involvement:
• Awareness is high; however, 75% are still wondering what it’s all about
• Usual answer – we don’t know what the business case is!
Challenges:
• How can we understand and use Big Data when it comes in an
unstructured format eg text or video?
• How can we capture the most important data as it happens and
deliver that to the right people in real-time?
• How can we store the data?
• How can we analyse and understand it given its size and our
computational capacity?
• How will we cater for the increasing data deluge?
Opportunities:
• McKinsey calls Big Data “the next frontier for innovation,
competition and productivity”.
• We can answer questions with Big Data that were beyond our reach
in the past.
• We can extract insight and knowledge, identify trends and use the
data to improve productivity, gain competitive advantage and
create substantial value.
• The challenges with Big Data are limited compared to the potential
benefits, which are limited only by our creativity and ability to
make connections among the trillions of bytes of data we have
access to.
So, how is all that data to be divvied up?
So, how?
Data Warehousing – a Definition:
A data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a
database used for reporting and data analysis. It is a central repository of
data which is created by integrating data from one or more disparate sources.
Data warehouses store current as well as historical data and are used for
creating trending reports for senior management reporting such as annual
and quarterly comparisons.
Source: Wikipedia.org, 2013
Data Warehousing:
• Pre-1990s: innovations by ACNielsen, Sperry & Teradata
• 1990 – Ralph Kimball & Red Brick Systems
• Businesses becoming increasingly dependent on timely intelligence
• Fast growing requirement for faster, more stable, reliable, flexible & easily
accessible intelligence repositories
• Big Data revolution will create exponential pressure to deliver quality solutions
• Will current toolsets be able to cope in terms of speed & reliability?
• New innovations, products, technologies will undoubtedly emerge and….
Data Warehousing:
If you take over the world, you’re gonna need lawyers!
Data Warehousing:
If you develop & deliver faster, more stable, reliable, flexible & easily accessible
intelligence repositories, you’re gonna need testers!
Why Test?
• Source data is often huge in volume and obtained from varied types of data
repositories eg. application databases, spreadsheets, flat files, data feeds etc
• Source data quality cannot be assumed and should be profiled and cleaned
• Source data may be inconsistent and may contain redundancy
• Source data records may be rejected by ETL procedures and logs will contain
error messages that will need addressing
• Source field values may be missing where they should be present.
• Source data history, business rules and audits of source data may not be
available.
• Enterprise-wide data knowledge and business rules may not be available to
verify data.
Why Test(2)?
• There may be multi-phased ETL procedures and a high level of data variety may
exist.
• Data sources (eg. mainframe, spreadsheets, databases, flat files) will be updated
over time
• Transaction-level traceability is difficult to attain during ETL
• The data warehouse will be a strategic enterprise resource and heavily relied
upon.
Data Warehouse Testing References
• Data Warehouse test plans and templates (sample and real)
• Sample Data Warehouse test scripts for all phases
• ETL testing checklists from test planning to post run troubleshooting.
• Business Intelligence report testing – plans and approaches
• Stored procedure testing
• Data Warehouse testing presentations (case studies, proposals)
• Vendor services; Data Warehouse testing brochures
• Data Warehouse QA guidance from IBM, Oracle, Informatica, Microsoft,
etc.
• Data Warehouse test automation and regression testing guidance
Data Warehouse Testing References(2)
• Data Warehouse tester job descriptions
• Data integration, migration best practices
• Data profiling; from beginner to advanced methods
• SQL queries for testers
• Data Warehouse tester training agendas
• Data Warehouse Testing – PhD, Master’s degree thesis
What to Test?
• Data Completeness – all expected data is correctly loaded via ETL procedures
• Data Transformation – all data is transformed correctly according to business
rules and design specifications
• Data Quality – the ETL application correctly rejects, remedies, ignores,
substitutes and reports on invalid data
• Performance and Scalability - data loads and queries perform within expected
time frames and the technical architecture is scalable.
• Integration Testing - the ETL process accommodates all required upstream and
downstream processes.
• User Acceptance Testing - the end result meets or exceeds business
stakeholder and user expectations.
• Regression Testing - existing functionality remains intact each time a new
release of code is completed.
Where to Test?
[Figure: test points labelled Primary, Secondary, Tertiary]
Test Order?
[Figure: test order across the points labelled Primary, Secondary, Tertiary]
Typical Data Warehouse Issues:
• Inadequate ETL and stored procedure design documentation to aid in
test planning.
• Field values are null when specified as Not Null.
• Field constraints and SQL not coded correctly for the ETL tool.
• Excessive ETL errors discovered after entry to formal QA - lack of unit
testing.
• Source data does not meet table mapping specifications (eg. dirty data).
• Source-to-target mappings: (1) often not reviewed before
implementation, (2) are in error or (3) not consistently maintained
throughout the development life cycle.
• Data models are not adequately maintained during the development life
cycle.
Typical Data Warehouse Issues(2):
• Duplicate field values are found in either source or target data when
defined in mapping specifications to be distinct.
• ETL/SQL transformation errors leading to missing rows and invalid field
values.
• Constraint violations exist in source (perhaps could be found through
data profiling).
• Target data is incorrectly stored or in non-standard formats.
• Primary or foreign key values are incorrect for important relationship
linkages.
Transformation rules:
• Specify source table elements from all data sources including metadata
• Specify Data Warehouse destination table elements:
• Dimensions – reference data, keys etc.
• Facts – data assets
• Specify how the source table elements map onto the destination table
elements
• Form the basis of unit test cases
Transformation rules:
Source: Source_Database_1
  SD1_Table_1: SD1_T1_Attr_1, SD1_T1_Attr_2, SD1_T1_Attr_3, SD1_T1_Attr_4
  SD1_Table_2: SD1_T2_Attr_1, SD1_T2_Attr_2, SD1_T2_Attr_3, SD1_T2_Attr_4
Destination: Dest_Database_Data Warehouse
  Destination table     Destination attribute   Transformation rule
  Data Warehouse_Dim    DD1_T1_Attr_1           = SD1_T1_Attr_1
  Data Warehouse_Dim    DD1_T1_Attr_2           = SD1_T1_Attr_2
  Data Warehouse_Dim    DD1_T1_Attr_3           = SD1_T1_Attr_3 + SD1_T1_Attr_4
  Data Warehouse_Fact   DD1_T2_Attr_1           = (SD1_T2_Attr_1 * SD1_T2_Attr_3)/52
  Data Warehouse_Fact   DD1_T2_Attr_2           = SD1_T2_Attr_3 + " " + SD1_T2_Attr_4
  Data Warehouse_Fact   DD1_T2_Attr_3           = DD1_T1_Attr_3/SD1_T2_Attr_4
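As a rough illustration of how transformation rules can form the basis of unit test cases, the sketch below encodes the six rules from this slide as Python functions and compares one sample row against hand-computed expected values. Pairing each rule with a destination attribute by slide order is an assumption, as are the sample values, the `RULES` dictionary and the `apply_rules` helper.

```python
# Hypothetical encoding of the slide's rules; attribute names follow the slide,
# the rule-to-attribute pairing is assumed from the order in which they appear.
RULES = {
    "DD1_T1_Attr_1": lambda s, d: s["SD1_T1_Attr_1"],
    "DD1_T1_Attr_2": lambda s, d: s["SD1_T1_Attr_2"],
    "DD1_T1_Attr_3": lambda s, d: s["SD1_T1_Attr_3"] + s["SD1_T1_Attr_4"],
    "DD1_T2_Attr_1": lambda s, d: (s["SD1_T2_Attr_1"] * s["SD1_T2_Attr_3"]) / 52,
    # The slide concatenates with a space, so values are rendered as text here.
    "DD1_T2_Attr_2": lambda s, d: f'{s["SD1_T2_Attr_3"]} {s["SD1_T2_Attr_4"]}',
    # This rule reads a destination value that was derived earlier.
    "DD1_T2_Attr_3": lambda s, d: d["DD1_T1_Attr_3"] / s["SD1_T2_Attr_4"],
}


def apply_rules(source_row):
    """Build one destination row by applying each rule in declaration order."""
    dest = {}
    for name, rule in RULES.items():
        dest[name] = rule(source_row, dest)
    return dest


# Made-up sample source row (positive case).
sample = {
    "SD1_T1_Attr_1": "C001", "SD1_T1_Attr_2": "Smith",
    "SD1_T1_Attr_3": 100, "SD1_T1_Attr_4": 20,
    "SD1_T2_Attr_1": 26, "SD1_T2_Attr_3": 4, "SD1_T2_Attr_4": 8,
}

# Expected results worked out by hand from the rules.
expected = {
    "DD1_T1_Attr_1": "C001",
    "DD1_T1_Attr_2": "Smith",
    "DD1_T1_Attr_3": 120,    # 100 + 20
    "DD1_T2_Attr_1": 2.0,    # (26 * 4) / 52
    "DD1_T2_Attr_2": "4 8",
    "DD1_T2_Attr_3": 15.0,   # 120 / 8
}

actual = apply_rules(sample)
mismatches = {k: (expected[k], actual[k]) for k in expected if expected[k] != actual[k]}
print("mismatches:", mismatches)
```

A negative case would reuse the same helper with invalid input, for example a zero in SD1_T2_Attr_4, and assert that the row is rejected rather than loaded.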
From Source to Data Warehouse – Unit Testing:
• Know your transformation rules!
• Test cases should cover each transformation rule and include positive and
negative situations
• Row counts: Source = Destination + Rejected
• Correctly access all required data including metadata
• Cross reference Data Warehouse Dimensions to source tables
• All computations are correct especially those based on business rules
• Database queries, expected vs actual results
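A minimal sketch of the row-count check, reading it as "every source row is either loaded into the destination or written to a reject table". It uses an in-memory SQLite database; the table names (`src_orders`, `dw_orders`, `etl_rejects`) and the `reconcile_counts` helper are hypothetical.

```python
import sqlite3


def reconcile_counts(conn, source_table, dest_table, reject_table):
    """Return (source, destination, rejected, balanced) row counts."""
    def count(table):
        # Table names are supplied by the test itself, never by end users.
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    src, dst, rej = count(source_table), count(dest_table), count(reject_table)
    return src, dst, rej, src == dst + rej


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders  (id INTEGER, amount REAL);
    CREATE TABLE dw_orders   (id INTEGER, amount REAL);
    CREATE TABLE etl_rejects (id INTEGER, reason TEXT);
    INSERT INTO src_orders  VALUES (1, 10.0), (2, NULL), (3, 7.5);
    INSERT INTO dw_orders   VALUES (1, 10.0), (3, 7.5);
    INSERT INTO etl_rejects VALUES (2, 'amount is null');
""")

result = reconcile_counts(conn, "src_orders", "dw_orders", "etl_rejects")
print(result)
```

Matching counts do not prove the right rows were loaded, so this check is normally paired with the expected-versus-actual query comparisons.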
From Source to Data Warehouse – Unit Testing:
• Rejects are correctly handled and conform to business rules
• Slow-changing data eg. address, marital status
• Correctness of surrogate keys eg. time zones, currencies in Fact tables
• Opportunities for automation
• Dual drive:
• Source table driven – data ends up in the right place
• Destination table driven – contains the right result
• Risk-based testing
From Source to Data Warehouse – Integration Testing:
Once all extract, transformation and load unit tests have been successfully executed, the ETL processes need to be executed between stages to test:
• Job sequences and dependencies
• Errors in one job that impact subsequent jobs
• Error log generation
• Restarting the ETL process in case of failure:
• Does it have to be started over?
• Can it start from where it failed?
• Restores required?
• Auto/manual?
• Impact of failure on subsequent jobs
• Processing of rejected records
• Reprocessing of already processed records
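The restart questions above can be explored with a small test harness. The sketch below is one possible design, not a description of any particular ETL tool: jobs run in sequence, the run stops at the first failure so later jobs are not affected, and a rerun skips jobs already recorded as complete. The `run_pipeline` function and the three dummy jobs are hypothetical.

```python
def run_pipeline(jobs, completed, log):
    """Run (name, callable) jobs in order, skipping names already in `completed`.

    Stops at the first failure so dependent jobs never run on bad input.
    Returns True once every job has completed.
    """
    for name, job in jobs:
        if name in completed:
            log.append(f"SKIP {name}")
            continue
        try:
            job()
        except Exception as exc:
            log.append(f"FAIL {name}: {exc}")
            return False
        completed.add(name)
        log.append(f"OK {name}")
    return True


# Dummy jobs: the transform step fails on its first attempt only.
attempts = {"transform": 0}


def extract():
    pass


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("staging table locked")


def load():
    pass


jobs = [("extract", extract), ("transform", transform), ("load", load)]
completed, log = set(), []

first_run = run_pipeline(jobs, completed, log)    # fails at transform
second_run = run_pipeline(jobs, completed, log)   # resumes from transform
print(first_run, second_run)
print(log)
```

The log doubles as a test oracle: an integration test can assert that the failed job is recorded, that no later job ran after the failure, and that completed jobs are not reprocessed on restart.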
Data Warehouse Testing – Continually Changing Source Systems
• Source data quality = garbage in/garbage out
• Inherent nature of Data Warehouse is continually updating data and source
systems so testing must allow for both
• New Source data/schema/application = retesting/regression testing
• Data Warehouse systems are always high maintenance
• Will always find new issues
• Opportunities for automation
• Package test suites modularly for ease of repeatability
Planning for Data Warehouse Testing
• Source data quality = garbage in/garbage out
• Business requirements document
• Data models for source and target schemas
• Source-to-target mappings
• ETL design documents
• Configuration management system
• Project schedule
• Data quality verification process
• Incident and error handling system
Planning for Data Warehouse Testing(2)
• QA staff resources estimates and training needs
• Testing environment budget and plan
• Test tools
• Test objectives
• QA roles and responsibilities
• Test deliverables
• Test tasks
• Entrance criteria that should be met before formal testing commences
• Exit criteria that should be met before formal testing is completed
Planning Tests for Common Data Warehouse Issues
• Inadequate ETL and stored procedure design documentation to aid in test
planning.
• Field values are null when specified as Not Null.
• Field constraints and SQL not coded correctly for the ETL tool.
• Excessive ETL errors discovered after entry to formal QA.
• Source data does not meet table mapping specifications (eg. dirty data).
• Source-to-target mappings: (1) often not reviewed before implementation,
(2) are in error or (3) not consistently maintained throughout the
development life cycle.
• Data models are not adequately maintained during the development life
cycle.
Planning Tests for Common Data Warehouse Issues(2)
• Duplicate field values are found in either source or target data when defined in
mapping specifications to be distinct.
• ETL/SQL transformation errors leading to missing rows and invalid field values.
• Constraint violations exist in source (perhaps could be found through data
profiling).
• Target data is incorrectly stored in nonstandard formats.
• Primary or foreign key values are incorrect for important relationship linkages.
Some data mapping and data movement best practice goals:
• Introduce common, consistent data movement analysis, design, and coding
patterns,
• Develop reusable, enterprise-wide analysis, design, and construction
components through data movement modelling processes using data
movement tools, to ensure an acceptable level of data quality per business
specifications,
• Introduce best practices and consistency in coding and naming standards.
• Reduce costs to develop and maintain analysis, design and source code
deliverables
• Integrate controls into the data movement process to ensure data quality and
integrity.
• An ETL conceptual data movement model should be created as part of the
information management strategy. This model is part of the business model and
shows what data flows into, within, and out of the organization.
Those involved in test planning should consider the following verifications as
primary among those planned for various phases of the data warehouse loading
project.
• Verify data mappings, source to target
• Verify that all tables and specified fields were loaded from source to staging
• Verify that primary and foreign keys were properly generated using sequence
generator or similar
• Verify that not-null fields were populated
• Verify no data truncation in each field
• Verify data types and formats are as specified in design phase
• Verify no unexpected duplicate records in target tables.
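Several of these verifications reduce to queries that should return no rows. The sketch below runs three such checks (not-null fields populated, no duplicate keys, no truncation) against an in-memory SQLite database seeded with deliberately faulty data. The `stg_customer` and `dw_customer` tables, their columns and the `CHECKS` dictionary are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (cust_id INTEGER, name TEXT, email TEXT);
    CREATE TABLE dw_customer  (cust_id INTEGER, name TEXT, email TEXT);
    INSERT INTO stg_customer VALUES
        (1, 'Alexandra Featherstone', 'alex@example.com'),
        (2, 'Bo Li',                  'bo@example.com'),
        (3, 'Chris Ng',               'chris@example.com');
    -- Target seeded with one truncated name, one null email, one duplicate row.
    INSERT INTO dw_customer VALUES
        (1, 'Alexandra Feathe', 'alex@example.com'),
        (2, 'Bo Li',            NULL),
        (3, 'Chris Ng',         'chris@example.com'),
        (3, 'Chris Ng',         'chris@example.com');
""")

# Each query returns the keys of offending rows; an empty result means a pass.
CHECKS = {
    "null_in_not_null_field":
        "SELECT cust_id FROM dw_customer WHERE email IS NULL",
    "duplicate_key":
        "SELECT cust_id FROM dw_customer GROUP BY cust_id HAVING COUNT(*) > 1",
    "truncated_name":
        """SELECT DISTINCT t.cust_id
           FROM dw_customer t JOIN stg_customer s ON s.cust_id = t.cust_id
           WHERE LENGTH(t.name) < LENGTH(s.name)""",
}

findings = {name: [row[0] for row in conn.execute(sql)] for name, sql in CHECKS.items()}
print(findings)
```

Keeping each check as a named query makes it straightforward to add further verifications and to rerun the whole set after every load.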
Those involved in test planning should consider the following verifications as
primary among those planned for various phases of the data warehouse loading
project(2)
• Verify transformations based on data table low level design (LLDs—usually
text documents describing design direction and specifications)
• Verify that numeric fields are populated with correct precision
• Verify that each ETL session completed with only planned exceptions
• Verify all cleansing, transformation, error and exception handling
• Verify stored procedure calculations and data mappings
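For the numeric precision check, one possible approach is to read values back as text and inspect the number of decimal places. The sketch below assumes values arrive as strings (as they would from a flat-file export) and that the design specifies a fixed scale; the `precision_violations` helper is hypothetical.

```python
from decimal import Decimal


def precision_violations(values, scale):
    """Return the values whose number of decimal places differs from `scale`.

    Values are expected as strings so that trailing zeros are preserved.
    """
    bad = []
    for value in values:
        exponent = Decimal(value).as_tuple().exponent
        if exponent != -scale:
            bad.append(value)
    return bad


# Made-up extract of a numeric field specified with two decimal places.
loaded_amounts = ["10.50", "3.1", "7.00"]
violations = precision_violations(loaded_amounts, scale=2)
print(violations)
```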
Useful Skills for Testing:
• Good understanding of the fundamental concepts of data warehousing and
its place in an information management environment.
• Understanding the role of the testing process as part of data warehouse
development.
• Development of data warehouse test strategies, test plans, and test cases -
what they are and how to develop them, specifically for data warehouse
and decision-support systems.
• Creating effective test cases and scenarios based on technical and
business/user requirements.
• Able to participate in reviews of the data models, data mapping
documents, ETL design, and ETL coding; provide feedback to designers and
developers.
Useful Skills for Testing(2):
• Able to participate in the change management process and documenting
relevant changes to decision support requirements.
• A good understanding of data modelling and source-to-target data
mappings
• Skills and experience with SQL, stored procedures, database management
and ETL tools
• Data profiling experience
• Microsoft Excel etc. expertise for data analysis
• Understanding how data from the data warehouse is used by the business
and the business processes it is related to
Considerations for Selecting Data Warehouse Testers
Members of the QA staff who will plan and execute data warehouse testing
should have many of the following skills and experiences.
• Over five years of experience in testing and development in the fields of data
warehousing and client-server technologies, including over five years of
extensive experience in data warehousing with Informatica, SSIS or other ETL
tools.
• Strong experience in Informatica or SQL Server, stored procedure and SQL
testing.
• Expertise in unit and integration testing of the associated ETL or stored
procedure code.
Considerations for Selecting Data Warehouse Testers(2)
• Experience in creating data verification unit and integration test plans and
test cases based on technical specifications.
• Demonstrated ability to write complex multi-table SQL queries.
• Excellent skills with OLAP, ETL, and business intelligence.
• Experience with dimensional data modelling using Erwin: star join
schema/snowflake modelling, fact and dimension tables, physical and logical
data modelling.
• Experience in OLAP reporting tools like Business Objects, SSRS, OBIEE
or Cognos.
• Expertise in data migration, data profiling, data cleansing.
Considerations for Selecting Data Warehouse Testers(3)
• Hands on experience with source-to-target mapping in enterprise data
warehouse environment. Responsible for QA tasks in all phases of the system
development life cycle (SDLC), from requirements definition through
implementation, on large-scale, mission critical processes; excellent
understanding of business requirements development, data analysis,
relational database design, systems development methodologies,
business/technical liaising, workflow and quality assurance.
• Experienced in business analysis, source system data analysis, architectural
reviews, data validation, data testing, resolution of data discrepancies and
ETL architecture. Good knowledge of QA processes.
Considerations for Selecting Data Warehouse Testers(4)
• Familiarity with performance tuning of target databases and source systems.
• Has worked extensively on both UNIX (AIX/HP/Sun Solaris) and Windows
(Windows SQL Server) platforms.
• Good knowledge of UNIX Shell Scripting and understanding of PERL scripting.
• Experience in Oracle 10g/9i/8i, PL/SQL, SQL, TOAD, Stored Procedures,
Functions and Triggers.
Common QA Tasks for the Data Warehouse Test Team
During the data warehouse testing life cycle, many of the following tasks may
typically be executed by the QA team. It is important to plan for those tasks
below that are key to the project’s success.
• Complete test data acquisition and baseline all test data.
• Create test environments.
• Document test cases.
• Create and validate test scripts.
• Conduct unit testing and confirm that each component is functioning
correctly.
• Conduct testing to confirm that each group of components meets
specification.
Common QA Tasks for the Data Warehouse Test Team(2)
• Conduct quality assurance testing to confirm that the solution meets
requirements.
• Perform load testing, or performance testing, to confirm that the system is
operating correctly and can handle the required data volumes and that data
can be loaded in the available load window.
• Specify and conduct reconciliation tests to manually confirm the validity of
data.
Common QA Tasks for the Data Warehouse Team(3)
• Conduct testing to ensure that the new software does not cause problems
with existing software.
• Conduct user acceptance testing to ensure that business intelligence reports
work as intended.
• Carefully manage scope to ensure that perceived defects are actually
requirement defects and not something that would be “nice to have, but we
forgot to ask.”
• Conduct a release test and production readiness test.
• Ensure that the on-going defect management and reporting is effective.
• Manage testing to ensure that each follows testing procedures and software
testing best practices.
Common QA Tasks for the Data Warehouse Team(4)
• Establish standard business terminology and value standards for each subject
area.
• Develop a business data dictionary that is owned and maintained by a series
of business-side data stewards. These individuals should ensure that all
terminology is kept current and that any associated rules are documented.
• Document the data in your core systems and how it relates to the standard
business terminology. This will include data transformation and conversion
rules.
Common QA Tasks for the Data Warehouse Team(5)
• Establish a set of data acceptance criteria and correction methods for your
standard business terminology. This should be identified by the business-side
data stewards and implemented against each of your core systems (where
practical).
• Implement a data profiling program as a production process. You should
consider regularly measuring the data quality (and value accuracy) of the
data contained within each of your core operational systems.
Analyze Source Data before and after Extraction to Staging
Process Description:
• Extract representative samples of data from each source or staging table.
• Parse the data for the purpose of profiling.
• Verify that not-null fields are populated as expected.
• Structure discovery - Does the data match the corresponding metadata? Do
field attributes of the data match expected patterns? Does the data adhere
to appropriate uniqueness and null value rules?
• Data discovery - Are the data values complete, accurate and unambiguous?
• Relationship discovery - Does the data adhere to specified required key
relationships across columns and tables? Are there inferred relationships
across columns, tables or databases? Is there redundant data?
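The profiling questions above can be sketched in a few lines of standard-library Python. This is an illustration only, not a substitute for a profiling tool: `profile` covers structure discovery for one column (nulls, uniqueness, expected pattern) and `orphans` covers one relationship check. Both helpers, the sample rows and the patterns are hypothetical.

```python
import re
from collections import Counter


def profile(rows, column, pattern=None):
    """Summarise nulls, distinct values, duplicates and pattern failures for a column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    result = {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }
    if pattern is not None:
        regex = re.compile(pattern)
        result["pattern_failures"] = [v for v in present if not regex.fullmatch(str(v))]
    return result


def orphans(child_rows, fk, parent_rows, pk):
    """Return foreign key values in the child rows with no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row[fk] for row in child_rows if row[fk] not in parent_keys]


# Made-up sample extracted from two source tables.
customers = [
    {"cust_id": "C001", "postcode": "0610"},
    {"cust_id": "C002", "postcode": None},
    {"cust_id": "C002", "postcode": "61O0"},   # letter O instead of zero
]
orders = [
    {"order_id": 1, "cust_id": "C001"},
    {"order_id": 2, "cust_id": "C009"},        # no such customer
]

id_profile = profile(customers, "cust_id", r"C\d{3}")
postcode_profile = profile(customers, "postcode", r"\d{4}")
orphan_keys = orphans(orders, "cust_id", customers, "cust_id")
print(id_profile, postcode_profile, orphan_keys, sep="\n")
```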
Analyze Source Data before and after Extraction to Staging(2)
• Verify that all required data from the source was extracted.
• Verify that the extraction process did not extract more or less data from the
source than it should have.
• Verify or write defects for exceptions and errors discovered during the ETL
process.
• Verify that the extraction process did not extract duplicate data from the source
(usually this happens in repeatable processes where at point zero we need to
extract all data from the source file, but during subsequent intervals we only
need to capture the modified and new rows).
• Validate that no data truncation occurred during staging.
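The incremental extraction checks can be sketched as a comparison of keys: which rows should have been captured since the last run, and which rows actually reached staging. The sketch assumes each source row carries a last-modified date in ISO format (so string comparison orders correctly); the `compare_extract` helper and the field names are hypothetical.

```python
from collections import Counter


def compare_extract(source_rows, staged_rows, key, modified_field, since):
    """Compare an incremental extract against the source rows changed after `since`."""
    expected = {row[key] for row in source_rows if row[modified_field] > since}
    staged_counts = Counter(row[key] for row in staged_rows)
    staged = set(staged_counts)
    return {
        "missing": sorted(expected - staged),       # changed but not extracted
        "unexpected": sorted(staged - expected),    # extracted but not changed
        "duplicates": sorted(k for k, n in staged_counts.items() if n > 1),
    }


# Made-up source rows and the result of one incremental extraction run.
source = [
    {"id": 1, "modified": "2014-10-01"},
    {"id": 2, "modified": "2014-10-13"},
    {"id": 3, "modified": "2014-10-14"},
    {"id": 4, "modified": "2014-10-14"},
]
staged = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}]

report = compare_extract(source, staged, "id", "modified", since="2014-10-10")
print(report)
```

An empty list under every heading would indicate that the run extracted exactly the new and modified rows, once each.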
Verify Corrected, Cleansed Source Data in Staging
This step works to improve the quality of existing data in source files and to
correct defects in data that meets source specs but must be corrected before load.
Inputs:
• Files or tables (staging) that require cleansing; data definition and business
rule documents; data map of source files and fields; business rules; and data
anomalies discovered in earlier steps of this process.
• Fixes for data defects that would result in data that does not meet
specifications for the application Data Warehouse.
• Data that meets source specifications but must be corrected before load.
Verify Corrected, Cleansed Source Data in Staging(2)
• Outputs - Defect reports, cleansed data, rejected or uncorrectable data.
• Techniques and Tools - Data reengineering, transformation, and cleansing
tools, MS Access, Excel filtering.
• Process Description - In this step, data with missing values, known errors,
and suspect data is corrected. Automated tools may be identified to best
locate and clean/correct large volumes of data.
Verify Corrected, Cleansed Source Data in Staging(3)
• Document the type of data cleansing approach taken for each data type in
the repository.
• Determine how “uncorrectable” or suspect data is processed, rejected or
maintained for corrective action. SMEs and stakeholders should be involved
in the decision.
• Review ETL defect reports to assess rejected data excluded from source files
or information group targeted for the warehouse.
• Determine if data not meeting quality rules was accepted.
• Document, in defect reports, the records and important fields that cannot be
easily corrected.
Verify Corrected, Cleansed Source Data in Staging(4)
• Document records that were corrected and how they were corrected.
• Certification Method - Validation of data cleansing processes can be a
tricky proposition, but is certainly do-able.
• All data cleansing requirements should be clearly identified.
• The QA team should learn all data cleansing tools available and their
methods.
• QA should create various conditions as specified in the requirements for
the data cleansing tool to support and validate its results.
• QA will run a volume of real data through each tool to validate accuracy
as well as performance.
Verifying Matched and Consolidated Data
• There are often ETL processes where data has been consolidated from
various files into a single occurrence of records. The cleansed and
consolidated data can be assessed to verify matched and consolidated data.
• Much of the ETL “heavy lifting” occurs in the transform step, where combined
data, data with quality issues, updated data and surrogate keys are processed
and aggregates are built.
• Inputs - Analysis of all files or databases for each entity type.
Verifying Matched and Consolidated Data(2)
Outputs:
• Report of matched, consolidated, related data that is suspect or in error.
• List of duplicate data records or fields.
• List of duplicate data suspects.
Techniques and Tools - Data matching techniques or tools; data cleansing
software with matching and merging capabilities.
Verifying Matched and Consolidated Data(3)
Process Description:
• Establish match criteria for data. Select attributes to become the basis for
possible duplicate occurrences (e.g., names, account numbers).
• Determine the impact of incorrectly consolidated records. If consolidating
two different occurrences (such as different customers) into a single customer
record has a negative impact, submit defect reports. The fix should be tighter
controls to help avoid such consolidations in the future.
• Determine the matching techniques to be used: exact character match in two
corresponding fields, wild card match, key words, close match, etc.
• Compare match criteria for specific record with all other records within a
given file to look for intra-file duplicate records.
• Compare match criteria for a specific record with all records in another file to
seek inter-file duplicate records.
• Evaluate potential matched occurrences to assure they are, in fact, duplicate.
• Verify that consolidated data into single occurrences is correct.
• Examine and re-relate data related to old records being consolidated to new
occurrence-of-reference record. Validate that no related data was
overlooked.
Verifying Matched and Consolidated Data(4)
• At this stage, base data is being prepared for loading into the application
operational tables and the data mart. This includes converting and formatting
cleansed, consolidated data into the new data architecture, and possibly
enhancing internal operational data with external data licensed from service
providers.
• The objective is to successfully map the cleansed, corrected and consolidated
data into the Data Warehouse environment.
Verify Transformed/Enhanced/Calculated Data to Target Tables
• Inputs - Cleansed, consolidated data; external data from service providers;
business rules governing the source data; business rules governing the target
Data Warehouse data; transformation rules governing the transformation
process; Data Warehouse or target data architecture; data map of source
data to standardized data.
• Output - Transformed, calculated, enhanced data; updated data map of
source data to standardized data; data map of source data to target data
architecture.
Verify Transformed/Enhanced/Calculated Data to Target Tables(2)
Techniques and Tools - Data transformation software; external or online or
public databases.
Process Description:
• Verify that the data warehouse construction team is using the data map of
source data to the Data Warehouse standardized data, and verify the mapping.
• Verify that the data transformation rules and routines are correct.
• Verify the data transformations to the Data Warehouse and ensure that the
processes were performed according to specifications.
Verify Transformed/Enhanced/Calculated Data to Target Tables(3)
• Verify that data loaded in the operational tables and data mart meets the
definition of the data architecture including data types, formats, accuracy,
etc.
• Develop scenarios to be covered in Load Integration Testing.
• Count Validation - Record count verification: run Data Warehouse
backend/reporting queries against source and target as an initial check.
• Dimensional Analysis - Verify that data integrity exists between the various
source tables and parent/child relationships.
• Statistical Analysis - Validation for various calculations.
Verify Transformed/Enhanced/Calculated Data to Target Tables(4)
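As a rough sketch of the count, dimensional and statistical checks listed above, the following uses an in-memory SQLite database. The table and column names (`src_orders`, `dw_dim_customer`, `dw_fact_orders`) are hypothetical and stand in for whatever the real source and target schemas are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source and target tables, for illustration only.
    CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE dw_dim_customer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE dw_fact_orders (order_id INTEGER, customer_key INTEGER, amount REAL);

    INSERT INTO src_orders VALUES (1, 10, 100.0), (2, 11, 50.5), (3, 10, 25.0);
    INSERT INTO dw_dim_customer VALUES (1, 10), (2, 11);
    INSERT INTO dw_fact_orders VALUES (1, 1, 100.0), (2, 2, 50.5), (3, 1, 25.0);
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Count validation: record counts in source and target should agree.
source_count = scalar("SELECT COUNT(*) FROM src_orders")
target_count = scalar("SELECT COUNT(*) FROM dw_fact_orders")

# Dimensional analysis: fact rows whose parent dimension row is missing.
orphan_count = scalar("""
    SELECT COUNT(*)
    FROM dw_fact_orders f
    LEFT JOIN dw_dim_customer d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
""")

# Statistical analysis: a simple total reconciled between source and target.
source_total = scalar("SELECT ROUND(SUM(amount), 2) FROM src_orders")
target_total = scalar("SELECT ROUND(SUM(amount), 2) FROM dw_fact_orders")

print("Counts match:", source_count == target_count)
print("Orphan fact rows:", orphan_count)
print("Totals match:", source_total == target_total)
```

Matching counts and totals are only an initial check; they do not prove that individual fields were transformed correctly.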
• Data Quality Validation - Check for missing data, negatives and consistency.
Field-by-field data verification will be done to check the consistency of
source and target data.
• Granularity - Validate at the lowest granular level possible (lowest in the
hierarchy, e.g., Country-City-Sector—start with test cases).
• Dynamic Transformation Rules and Tables - Such methods need to be
checked continuously to ensure the correct transformation routines are
executed. Verify that dynamic mapping tables and dynamic mapping rules
provide an easy, documented, and automated way for transforming values
from one or more sources into a standard value presented in the Data
Warehouse.
Verify Transformed/Enhanced/Calculated Data to Target Tables(5)
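A minimal sketch of the field-by-field verification and the missing/negative checks described above might look like this. The row layout (`id`, `city`, `sales`) and both helper functions are assumptions made for illustration; they are not taken from any specific tool.

```python
def compare_fields(source_rows, target_rows, key, fields):
    """Return (key, field, source_value, target_value) for every mismatch."""
    target_by_key = {row[key]: row for row in target_rows}
    mismatches = []
    for src in source_rows:
        tgt = target_by_key.get(src[key])
        if tgt is None:
            mismatches.append((src[key], "<row missing in target>", None, None))
            continue
        for field in fields:
            if src[field] != tgt[field]:
                mismatches.append((src[key], field, src[field], tgt[field]))
    return mismatches


def quality_findings(rows, key, required, non_negative):
    """Flag missing values in required fields and negatives in numeric fields."""
    findings = []
    for row in rows:
        for field in required:
            if row.get(field) in (None, ""):
                findings.append((row[key], field, "missing"))
        for field in non_negative:
            value = row.get(field)
            if value is not None and value < 0:
                findings.append((row[key], field, "negative"))
    return findings


# Hypothetical rows at the lowest level of the hierarchy (city level).
source = [
    {"id": 1, "city": "Auckland", "sales": 120.0},
    {"id": 2, "city": "Wellington", "sales": 80.0},
    {"id": 3, "city": "Sydney", "sales": 45.0},
]
target = [
    {"id": 1, "city": "Auckland", "sales": 120.0},
    {"id": 2, "city": None, "sales": -80.0},
]

mismatches = compare_fields(source, target, "id", ["city", "sales"])
findings = quality_findings(target, "id", ["city"], ["sales"])
print(mismatches)
print(findings)
```

Here the comparison assumes the fields are loaded unchanged. Where a transformation rule applies, the expected target value would first have to be derived from the source value.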
• Verification Method - The QA team will identify the detailed
requirements as they relate to transformation and validate the dynamic
transformation rules and tables against Data Warehouse records.
Utilizing SQL and related tools, the team will identify unique values in
source data files that are subject to transformation. The QA team
will then identify the results of the transformation process and validate
that such transformations have accurately taken place.
Verify Transformed/Enhanced/Calculated Data to Target Tables(6)
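The verification method above can be sketched with two SQL queries, here run through Python's `sqlite3` module. The tables (`src_customer`, `map_gender`, `dw_customer`) and the gender-code example are hypothetical; a real dynamic mapping table would have its own structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source, mapping and Data Warehouse tables.
    CREATE TABLE src_customer (customer_id INTEGER, gender_code TEXT);
    CREATE TABLE map_gender (source_value TEXT PRIMARY KEY, standard_value TEXT);
    CREATE TABLE dw_customer (customer_id INTEGER, gender TEXT);

    INSERT INTO src_customer VALUES (1, 'M'), (2, 'F'), (3, 'male'), (4, 'X');
    INSERT INTO map_gender VALUES ('M', 'Male'), ('F', 'Female'), ('male', 'Male');
    INSERT INTO dw_customer VALUES (1, 'Male'), (2, 'Female'), (3, 'male'), (4, 'X');
""")

# Step 1: unique source values subject to transformation that have no mapping rule.
unmapped = [row[0] for row in conn.execute("""
    SELECT DISTINCT s.gender_code
    FROM src_customer s
    LEFT JOIN map_gender m ON m.source_value = s.gender_code
    WHERE m.source_value IS NULL
    ORDER BY s.gender_code
""")]

# Step 2: Data Warehouse records whose value differs from what the mapping prescribes.
mistransformed = conn.execute("""
    SELECT s.customer_id, s.gender_code, m.standard_value, d.gender
    FROM src_customer s
    JOIN map_gender m ON m.source_value = s.gender_code
    JOIN dw_customer d ON d.customer_id = s.customer_id
    WHERE d.gender <> m.standard_value
    ORDER BY s.customer_id
""").fetchall()

print("Source values with no mapping rule:", unmapped)
print("Records not transformed as mapped:", mistransformed)
```

Both result sets are candidates for defect reports: the first points at gaps in the mapping table, the second at transformation routines that did not apply the rule.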
End-to-End Testing:
Once all extract, transformation and load unit tests have been successfully executed, the ETL process needs to be executed end-to-end, covering:
• Job sequences and dependencies
• Errors in one job that impact subsequent jobs
• Error log generation
• Restarting the ETL process in case of failure:
  • Does it have to be started over?
  • Can it start from where it failed?
  • Restores required?
  • Auto/manual?
• Impact of failure on subsequent jobs
• Processing of rejected records
• Reprocessing of already processed records
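To make the restart questions concrete, here is a small sketch of a job sequence that records completed jobs and resumes from the point of failure. The `run_pipeline` helper and the three job names are assumptions for illustration; real ETL schedulers handle checkpoints in their own way.

```python
def run_pipeline(jobs, completed, log):
    """Run jobs in order, skipping any already recorded in `completed`.

    Stops at the first failure so dependent jobs never run on bad input.
    Returns True when the whole sequence has finished.
    """
    for name, job in jobs:
        if name in completed:
            log.append(f"SKIP {name}")
            continue
        try:
            job()
        except Exception as exc:
            log.append(f"FAIL {name}: {exc}")
            return False
        completed.add(name)
        log.append(f"OK {name}")
    return True


calls = []
state = {"transform_should_fail": True}


def extract():
    calls.append("extract")


def transform():
    # Simulated failure on the first run only.
    if state["transform_should_fail"]:
        raise RuntimeError("bad record")
    calls.append("transform")


def load():
    calls.append("load")


jobs = [("extract", extract), ("transform", transform), ("load", load)]
completed, log = set(), []

first_run = run_pipeline(jobs, completed, log)   # fails at transform
state["transform_should_fail"] = False           # simulate the fix
second_run = run_pipeline(jobs, completed, log)  # resumes after extract

print(first_run, second_run)
print(log)
```

A test built on this idea can assert that the load job never ran after the failed transform, that the error was logged, and that the extract job was not reprocessed on restart.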
Acceptance Testing:
When all previous testing has been completed, Acceptance Testing to compare Source systems with Business Intelligence output can start:
• Best performed by end users
• Equivalent of UAT in traditional testing
• Run without regard for Data Warehouse systems processes
• Concerned with:
  • Is the data from the Source systems correctly placed and formatted on the
    Business Intelligence output?
  • Is it correct, i.e., does it match the Source system as per the transformation rules?
  • Are totals produced by the Business Intelligence tool correctly placed and
    formatted on the output?
  • Are they correct, e.g., do they add up correctly?
  • Is graphically displayed data correct?
Performance Testing:
The ability of the Data Warehouse ETL and Business Intelligence reporting to execute within acceptable timeframes is key:
• ETL process should execute within its allotted time window
• The process should be able to handle the expected data volumes
• Any constraints around system resources should be identified and tested
• Should not interfere with other system processes or users
• Any degradation in system performance should be within an acceptable band
• Ad hoc Business Intelligence use should not adversely impact system performance
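The first two points can be turned into a simple automated check, sketched below. The `run_within_window` helper, the stand-in `sample_etl` job and the 30-second window are all assumptions; a real test would run the actual ETL job against production-like volumes.

```python
import time


def run_within_window(job, window_seconds):
    """Run a job and report its result, elapsed time, and whether it met the window."""
    start = time.perf_counter()
    result = job()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= window_seconds


def sample_etl(rows=50_000):
    # Hypothetical stand-in for an ETL job: "process" each row and count it.
    return sum(1 for i in range(rows) if (i * 2) % 2 == 0)


rows_loaded, elapsed, within_window = run_within_window(sample_etl, window_seconds=30.0)
print(f"Loaded {rows_loaded} rows in {elapsed:.3f}s; within window: {within_window}")
```

Repeating the measurement at increasing row counts gives a rough view of how elapsed time scales towards the expected data volumes.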
Regression Testing:
Repeating End-to-End Testing in full or in part:
• Job sequences and dependencies
• Errors in one job that impact subsequent jobs
• Error log generation
• Restarting the ETL process in case of failure:
  • Does it have to be started over?
  • Can it start from where it failed?
  • Restores required?
  • Auto/manual?
• Impact of failure on subsequent jobs
• Processing of rejected records
• Reprocessing of already processed records
Excellent Reference:
Questions?