June 6, 2022 TCS Public Data Management

Data Management
5 November 2012
TCS Public

Contents
Data Warehouse Concepts
Data Modeling
Dimensional Modeling
Implementation and Maintenance
Data Management
Data Quality Analysis
Metadata Management
Data Governance
Master Data Management
Data Storage, Movement and Access

Data Warehouse Concepts


Data Warehouse Concepts: Agenda
A. What is a Data Warehouse (DW)?
B. What are the components of a DW?
C. What are the various architectures/formats of a DW?
D. Examples of Data Warehousing tools in use

Need for Data Warehousing: Business View
Customer centricity
  Single view of each customer and his/her activities
  Integrated information from heterogeneous sources
Adaptability to rapidly changing business needs
  Multiple ways to view business performance
  Low cycle time, faster analytics
Increased global competition
  Crunch more and more data, faster and faster
Mergers and acquisitions
  With each acquisition comes another set of disparate IT systems affecting consistency and performance

Need for Data Warehousing: Systemic View
Performance optimization
  OLTP systems get overloaded with large analytical queries
  Data models for OLTP and OLAP are very different
Reduce reliance on IT to produce reports
  Report making on OLTP systems is very technical
  OLTP systems are not built to hold history data
Data security
  To prevent unauthorized access to sensitive data

Data Warehouse Defined

A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data enabling management decision making.

Subject Orientation
[Figure: process-oriented transactional storage (a single Entry record holding Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address) versus subject-oriented data warehouse storage (Sales, Customers, Products)]

Data Volatility
[Figure: volatile transactional storage (Insert, Change, Delete, Access; record-by-record data manipulation) versus non-volatile data warehouse storage (Load, Access; mass load / access of data)]

Time Variance
[Figure: transactional storage holds current data; data warehouse storage holds historical data. Example chart: Sales (in lakhs) by region (East, West, North) for January, February and March, 1st Qtr of Year 97]

Data Warehouse Characteristics
Stores large volumes of data used frequently by DSS
Is maintained separately from operational databases
Is relatively static, with infrequent updates
Contains data integrated from several, possibly heterogeneous, operational databases
Supports queries processing large data volumes

Three Views of Data Warehousing
Strategic or business view
  Define key business drivers of the data warehouse
  How can a business-driven approach achieve high ROI?
Architectural or technology view
  Alternative data warehousing architectures
  How can the right architecture achieve a high ROI?
Methodology or implementation view
  Development and implementation methodology
  How can the right methodology achieve a rapid ROI?


Data Warehouse Components
[Figure: data warehouse components. Sources: legacy system, FS1, FS2, ... FSn. Processing: extraction, transmission over the network, staging area (cleansing, transformation, aggregation, summarization), data mart population. Stores: ODS, DW, data marts DM1, DM2, ... DMn. Consumers: OLAP analysis, knowledge discovery. Plus a metadata layer.]

Data Warehouse Build Lifecycle
Data extraction
Data cleansing and transformation
Data load and refresh
Build derived data and views
Service queries
Administer the warehouse


Data Warehouse Architectures
Virtual Data Warehouse
Enterprise Data Warehouse
Data Marts
Distributed Data Marts
Multi-tiered Warehouse

Virtual Data Warehouse
[Figure: users access operational systems data (legacy, client/server, OLTP application, external) through a reporting tool]

Enterprise Data Warehouse
[Figure: operational systems data (legacy, client/server, OLTP, external) passes through data preparation (select, extract, transform, integrate, maintain) into the data warehouse, alongside a metadata repository; users reach it through an API and a reporting tool]

Data Marts
[Figure: operational systems data (legacy, client/server, OLTP, external) passes through data preparation (select, extract, transform, integrate, maintain) into a single data mart, alongside a metadata repository; users reach it through an API and a reporting tool]

Distributed Data Marts
[Figure: operational systems data (legacy, client/server, OLTP, external) passes through data preparation (select, extract, transform, integrate, maintain) into several separate data marts; users reach them through an API and a reporting tool]

Multi-tiered Data Warehouse: Option 1
[Figure: enterprise-wide operational systems data (legacy, client/server, OLTP, external) is selected, extracted, transformed, integrated and maintained; the diagram shows a data warehouse, several data marts and a metadata repository; users reach them through an API and a reporting tool]

Multi-tiered Data Warehouse: Option 2
[Figure: operational systems data (legacy, client/server, OLTP, external) passes through data preparation (select, extract, transform, integrate, maintain); the diagram shows a data warehouse, several data marts and a metadata repository; users reach them through an API and a reporting tool]

Relative Data Sizes in a Data Warehouse
[Figure: levels of data held in a data warehouse: highly summarized data, lightly summarized data, current detail data, older detail data, plus metadata]

Data Warehouse: Example
Highly summarized data: monthly sales by region for 1991-94; monthly sales by product for 1991-94
Lightly summarized data: weekly sales by region for 1991-94; weekly sales by product/sub-product for 1991-94
Current detail data: sales detail for 1991-94
Older detail data: sales detail for 1985-90
Metadata

Building a Data Warehouse: Steps
Identify key business drivers, sponsorship, risks, ROI
Survey information needs, identify desired functionality and define functional requirements for the initial subject area
Architect the long-term data warehousing architecture
Evaluate and finalize DW tools and technology
Conduct a proof of concept
Design the target database schema
Build data mapping, extract, transformation, cleansing and aggregation/summarization rules
Build the initial data mart, using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
Maintain and administer the data warehouse


Representative DW Tools
Tool Category: Products
ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos Powerplay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Data Warehouse: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBricks
Data Mining & Analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools

Data Warehousing: Insights
An enabling technology that facilitates improved business decision-making
A process, not a product
A technique for assembling and managing a wide variety of data from heterogeneous systems for decision support

Data Modeling


Modeling: ER Model
Definition
  Logical and graphical representation of the information needs
Process
  Classifying: entities
  Characterizing: attributes
  Inter-relating: relationships

Modeling: Logical Model
Definition
  Representation of a business problem without regard to implementation, technology and organizational structure
Features
  Represents business requirements completely, correctly and consistently
  Removes redundancy
  Does not presuppose data granularity
  Not implemented

Modeling: Example of ER Model
[Figure: example ER model]

Modeling: Physical Model
Definition
  Specification of what is implemented
Features
  Optimized
  Efficient
  Buildable
  Robust

Dimensional Modeling
A form of analytical design (or physical model) in which data is pre-classified as a fact or a dimension
Improves performance by matching the data structure to the queries
Example query: give this period's total sales volume and revenue by product, business unit and package

Dimensional Modeling


Dimensional Modeling: Agenda
A. What is Dimensional Modeling?
B. What are the various types of Dimension Tables?
C. What are the various types of Fact Tables?
D. How do I model a star schema?

A Quick Recap of Data Warehousing
A Data Warehouse is a SUBJECT-ORIENTED, INTEGRATED, TIME-VARIANT, NON-VOLATILE collection of data enabling management decision making.
[Figure: sources (RDBMS, ERP, CRM, mainframe DBs, PC DBs); extraction over the network into the staging area (cleansing, transformation, validation, massaging); ODS, DW and data marts (aggregation, summarization, data mart population, dimension loading, fact loading); client browsers (reports, cubes, analysis, data mining, dashboards, MIS reports, company quarterly reports, etc.)]

Dimensional Modeling: In Perspective
Dimensional modeling is an effective, efficient and successful technique for designing Enterprise Data Warehouse and Distributed Data Mart database schemas for maximum query performance.
[Figure: the recap data flow (RDBMS, ERP, CRM, mainframe DBs and PC DBs; extraction; network; staging area; ODS; DW; data marts; client browsers) with one region labelled DIMENSIONAL MODELING AREA]


Dimensional Model: Strengths
Predictable, standard framework
  Query tools and user interfaces can be created to provide a consistent way of reporting data
  Because most filter conditions are on dimensional attributes, performance can be boosted through bit-vector indexes on dimension table columns
  Metadata functionality in the query tools can use the known cardinality of dimension values to offer facilities such as value-selection drop-downs and value-selection windows
Resilience to unexpected user querying patterns
  All dimensions are equivalent entry points into the fact table (the number of joins to the fact table is the same: 1)
  Symmetrical query strategies and SQL
  Logical design can be done independently of query patterns

Dimensional Model: Strengths
Graceful extensibility
  Allows adding new unanticipated facts as long as they are consistent with the fact table grain
  Allows adding new dimension tables as long as a single value of that dimension is defined for each existing fact record
  Allows adding new unanticipated dimensional attributes
Standard approaches for common modeling situations
  Slowly Changing Dimensions (SCDs)
  Heterogeneous products (e.g. savings account, current account)
  Pay-in-advance databases
  Event-handling databases (fact-less facts)

What is a Fact?
Facts = Measures
Examples: Sales Volume, Revenue

Facts and Fact Tables
Fact = Measure (e.g. Revenue, Cost, No. of Accounts)
Sales fact table: Date Key (int), Store Key (int), Product Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float)
Billing fact table: Date Key (int), Customer Key (int), Service Line Key (int), Rate Plan Key (int), Number of Total Minutes, Number of Calls (int), Service Charge (float), Taxes (float)
The term FACT represents a single business measure, e.g. Sales, Qty Sold.
Each fact has a GRAIN, which is the set of perspectives or attributes that define/qualify the fact completely. E.g. the grain of Sales could be each PRODUCT, at each STORE, on each DAY.
A FACT TABLE is the primary table in a dimensional model, where the business measures or FACTS are stored.
Each row in a FACT TABLE records the business measures (FACTS) for one combination of dimension values.
All FACTS in a FACT TABLE must be at the SAME GRAIN.

Fact Tables: Some Features
Fact tables express MANY-TO-MANY RELATIONSHIPS between dimensions in dimensional models
  One product can be sold in many stores, while a single store typically sells many different products at the same time
  The same customer can visit many stores, and a single store typically has many customers
The fact table is typically the MOST NORMALIZED TABLE in a dimensional model
  No repeating groups (1NF), no redundant columns (2NF)
  No columns that are not dependent on keys; all facts are dependent on keys (3NF)
  No independent multiple relationships (4NF)
Fact tables can contain HUGE DATA VOLUMES, running into millions of rows
Facts can be identified by answering the question: WHAT ARE WE MEASURING?
The grain of the fact table can be identified by answering the question: HOW DO YOU DESCRIBE A SINGLE ROW IN THE FACT TABLE?
All facts within the same fact table must be at the SAME GRAIN

Fact Tables: Some Features
The grain of the fact table is the LIST OF DIMENSIONS that uniquely define each row of the fact table
  Each row of the sales fact table is uniquely identified by a combination of store, product and time: which product, which store, when sold
  Additional foreign keys may exist in the fact table that point to other dimension tables, e.g. payment type, but do not contribute to the grain
Every foreign key in a fact table is usually a DIMENSION TABLE PRIMARY KEY
Every foreign key in a fact table is usually an INTEGER KEY
Fact tables are TYPICALLY used in GROUP BY SQL queries
Every column in a fact table is either a foreign key to a dimension table primary key or a fact
Every non-key column in the fact table is typically used in the SELECT clause of a SQL query
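
The features listed above can be illustrated with a very small star schema. The following is only a sketch using Python's built-in sqlite3 module; the table names, column names and sample rows are assumptions invented for the illustration, loosely modeled on the Sales, Store and Product tables in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT);
-- Fact table: integer foreign keys to the dimensions plus numeric facts.
-- The grain (date, store, product) is enforced here as the primary key.
CREATE TABLE sales_fact (
    date_key    INTEGER NOT NULL,
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    sales       REAL,
    qty_sold    INTEGER,
    PRIMARY KEY (date_key, store_key, product_key)
);
INSERT INTO store_dim VALUES (1, 'Store A', 'East'), (2, 'Store B', 'West');
INSERT INTO product_dim VALUES (10, 'Soap', 'BrandX'), (11, 'Tea', 'BrandY');
INSERT INTO sales_fact VALUES
    (20121101, 1, 10, 100.0, 5),
    (20121101, 1, 11,  60.0, 3),
    (20121101, 2, 10,  40.0, 2),
    (20121102, 2, 11,  80.0, 4);
""")

# Dimension attributes appear in GROUP BY; facts appear as aggregates in SELECT.
rows = conn.execute("""
    SELECT s.region, SUM(f.sales) AS total_sales, SUM(f.qty_sold) AS total_qty
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(rows)
```

Declaring the grain as the primary key is one possible design choice; it makes the database reject a second row for the same product, store and day.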


What is a Dimension?
A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decisions.
Subject = Dimension
Examples: Product, Business Unit, Package

Dimensions and Dimension Tables
Dimension = Perspective (e.g. Customer, Geography, Time)
Store dimension table: Store Key (int), Store name (char), Street Address (char), City (char), State (char), Region (char), Country (char)
Product dimension table: Product Key (int), Product id (char/int), Product name (char), Product Group, Brand, Department
The term DIMENSION represents a single category or perspective by which associated FACTS are interpreted and understood. E.g. Store is a perspective by which sales are understood. It is the answer to the question: where did the sales occur?
A DIMENSION TABLE is a table which holds a list of attributes or qualities of the dimension most often used in queries and reports. E.g. the Store dimension can have attributes such as the street and block number, the city, the region and the country where it is located, in addition to its name.
Every row in the DIMENSION TABLE represents a unique instance of that DIMENSION and has a unique identifier called the DIMENSION KEY.

Dimension Tables: Some Features
Dimensions can be identified by answering the question: HOW DO BUSINESS PEOPLE DESCRIBE THE DATA THAT RESULTS FROM A BUSINESS PROCESS?
Dimension tables are ENTRY POINTS into the fact table. Typically, the number of rows selected and processed from the fact table depends on the conditions (WHERE clauses) the user applies on the selected dimensional attributes
Dimension tables are typically DE-NORMALIZED in order to reduce the number of joins in resulting queries
Dimension table attributes are generally STATIC, DESCRIPTIVE fields describing aspects of the dimension
Dimension tables are typically designed to hold INFREQUENT CHANGES to attribute values over time using SCD concepts
Dimension tables are TYPICALLY used in GROUP BY SQL queries
Dimension tables serve to simplify SQL GROUP BY queries
Every column in the dimension table is TYPICALLY either the primary key or a dimensional attribute
Every non-key column in the dimension table is typically used in the GROUP BY clause of a SQL query

Some More Jargon: Hierarchy, Level, Member, Attribute, Grain
[Figure: a hierarchy of the Geography dimension. World level: World. Continent level: America, Europe, Asia. Country level: USA, Canada, Argentina. State level: FL, GA, VA, CA, WA. City level: Miami, Tampa, Orlando, Naples. Each city is a dimension member / business entity with attributes such as Population and Tourist Places.]

Some More Jargon
A dimension can have one or more hierarchies
A hierarchy can have one or more levels or grains
Each level can have one or more members
Each member can have one or more attributes
[Figure: another hierarchy of the Geography dimension. Global level: Global. Economy level: Developed, Developing, Third World. Financial class level: Upper, Middle, Lower. Regional level: Metro, Suburb, Town, Village. City level: Chennai, Delhi, Mumbai, Kolkata. Each city is a dimension member / business entity with attributes such as Population and Tourist Places.]

The Star Schema: Linking Facts and Dimensions
Date/Time dimension: Date Key (int), Date (date dd/mm/yy), Day of Week (char), Day of Month (int), Month (int), Quarter (int), Year (int)
Store dimension: Store Key (int), Store name (char), Street Address (char), City (char), State (char), Region (char), Country (char)
Product dimension: Product Key (int), Product id (char/int), Product name (char), Product Group, Brand, Department
Sales fact table: Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Payment Type Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float)
[Figure: the Store Sales fact table at the centre, linked to the Time, Store, Product, Customer and Payment Type dimensions]
The Star Join Schema or STAR SCHEMA is a single FACT TABLE joined to a set of DIMENSION TABLES
Simple, symmetric, extensible and optimized!
The GRAIN of the star schema is the grain of its central fact table!

Star Schema
A particular form of a dimensional model
Central fact table containing measures
Surrounded by one perimeter of descriptors: dimensions

Star Schema: Fact Table
This table is the core of the star schema structure and contains the facts or measures available through the data warehouse.
These facts answer the questions of What, How Much, or How Many.
Some examples: Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.

Star Schema: Dimension Tables
These tables describe the facts or measures.
These tables contain the attributes and may also be hierarchical.
These dimensions answer the questions of Who, What, When, or Where.
Some examples:
  Day, Week, Month, Quarter, Year
  Sales Person, Sales Manager, VP of Sales
  Product, Product Category, Product Line
  Cost Center, Unit, Segment, Business, Company

Star Schema: Example
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures), ...
Employee_Dim: EmployeeKey, EmployeeID, ...
Time_Dim: TimeKey, TheDate, ...
Shipper_Dim: ShipperKey, ShipperID, ...
Product_Dim: ProductKey, ProductID, ...
Customer_Dim: CustomerKey, CustomerID, ...

Snowflake Schema
Complex dimensions are re-normalized
Different levels or hierarchies of a dimension are kept separate
A given dimension has relationships to other levels of the same dimension

Star and Snowflake Schemas Are De-normalized
Such a schema violates 3NF in dimensions by collapsing higher-level dimensions into the lowest level, as in Brand and Category.
It violates 2NF in facts by collapsing common fact data from the Order Header into the transaction, such as Order Date.
It often violates Boyce-Codd Normal Form (BCNF) by recording redundant relationships, such as the relationships from both Customer and Customer Demographics to Booked Order.
However, it supports changing dimensions by preserving 1NF in Customer and Customer Demographics.

A Shot at Dimensional Modeling: STEP 1
Identify subjects (dimensions)
Identify hierarchies of a dimension
Identify attributes of levels in hierarchies
Define grain
[Figure: Customer dimension hierarchies built from Country, State, City, Industry Segment, Industry Type, Customer and Fin. Class]

A Shot at Dimensional Modeling: STEP 2
Use KPIs to identify the facts
Group the facts in a logical set
Financial Transactions: Trans. Amount, No. of Bonds, No. of Transactions, Service Cost, ...
Non-Financial Transactions: No. of Cheques Cleared, No. of Visits to a Branch, No. of DEMAT Transactions, ...

A Shot at Dimensional Modeling: STEP 3
Link the group of facts to the dimensions that participate in the facts
[Figure: the Financial Transactions fact group linked to the Customer, Product, Time, Organization and Channel dimensions]

A Shot at Dimensional Modeling: STEP 4
Define granularity for each group of facts
[Figure: the Financial Transactions fact group with the grain of each linked dimension: Customer (Customer), Product (Scheme), Time (Day-Hour), Organization (Branch), Channel (Channel)]


Types of Dimensions
Primary Dimension
  Contributes to the fact grain
  A set of these uniquely defines the associated fact
  E.g. the SALES fact is typically completely defined by store, product and time
Secondary Dimension
  Does NOT contribute to the fact grain
  Non-primary dimensions such as payment type, customer and manufacturer are still important for analysis of the SALES fact
  Useful for rich analytic slicing and dicing, e.g. top 10 customers
Degenerate Dimension
  A dimension without any attributes, but useful for analysis
  Generally included in the associated fact table before the facts
  E.g. invoice number, by itself, in a shipping fact

Types of Dimensions
Conformed Dimension
  A dimension used across the enterprise
  Requires a standardized structure and definition
  Needs to be designed upfront, before designing individual schemas
  Plugs into multiple stars as either a primary or a secondary dimension
  E.g. Customer, Product, Store, Time, Employee
    Customer could be captured at the store card swiping machine (sales fact), be part of a marketing promotion strategy (campaign fact) and also be serviced by the call center for warranty replacements (warranty fact)
    Employee may be a sales rep claiming credit for sales (sales fact), a finance manager authorizing vendor payments (vendor payment fact) or a call center person taking customer calls (service call fact)

Types of Dimensions
Slowly Changing Dimension
  Dimensional attributes change over time
  Need to capture these changing realities as history
  Requires special design techniques to keep it single valued for each fact row while still retaining history
  E.g. customer city, marital status, sales rep department
  Type 1: overwrite previous values
  Type 2: create an additional time-stamped dimension record
    Type 2 automatically partitions history
  Type 3: create an additional attribute column to retain any one previous value, e.g. first value, previous value
  Requires dimension key to be generalized
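
The three SCD types above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL routine: the dimension is held as a list of dictionaries, and the column names (customer_key, customer_id, valid_from, valid_to) and the sample customers are assumptions made up for the example.

```python
from datetime import date

def scd_type1(row, attr, new_value):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    row[attr] = new_value
    return row

def scd_type2(rows, customer_id, attr, new_value, change_date):
    """Type 2: expire the current row and append a new time-stamped row
    that carries a new surrogate key."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    current["valid_to"] = change_date
    new_row = dict(current,
                   customer_key=max(r["customer_key"] for r in rows) + 1,
                   valid_from=change_date,
                   valid_to=None)
    new_row[attr] = new_value
    rows.append(new_row)
    return rows

def scd_type3(row, attr, new_value):
    """Type 3: keep exactly one previous value in an extra column."""
    row["previous_" + attr] = row[attr]
    row[attr] = new_value
    return row

dim = [{"customer_key": 1, "customer_id": "C100", "city": "Chennai",
        "valid_from": date(2010, 1, 1), "valid_to": None}]
scd_type2(dim, "C100", "city", "Mumbai", date(2012, 11, 5))

t1 = scd_type1({"customer_id": "C200", "city": "Delhi"}, "city", "Kolkata")
t3 = scd_type3({"customer_id": "C300", "city": "Delhi"}, "city", "Kolkata")
```

After the Type 2 change the dimension holds two rows for customer C100, each with its own surrogate key, so fact rows loaded before and after the change point at different versions of the customer.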


Types of Dimensions
Rapidly Changing Small Dimension
  Same as SCD except that the frequency of change is higher
  Need to track changes to attributes
  E.g. employee attributes such as appraisal rating
  E.g. telecom product: rate plans keep changing
Large Dimension
  Size increases with decreasing level of granularity
  Typical of public utility companies, government agencies
  Human records kept by supermarkets, e.g. Shoppers Stop
  Do NOT create SCDs to address slow changes/history
  See Monster Dimension for SCD strategy
  Choose indexing strategies to reduce query run times
  Choose the RDBMS wisely, e.g. SQL Server vs. Oracle vs. Teradata

Types of Dimensions
Rapidly Changing Monster Dimension
  Similar to a large dimension, but typical of a large insurance customer dimension
  Customers and claims are rapidly created and changed
  Need to track history for credit and legal reasons
  Remove the continuously changing attributes to another dimension table, e.g. demographics
  Reduce the cardinality of these attributes by banding them, e.g. income_band, credit_band, etc.
  Then create all possible combinations of these attributes
  Then assign a dimension key to each unique set of these combinations; this is the demographics table
  For each combination that represents the customer's status in a particular period, plug the demographics key into the fact as an additional key
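
The banding and combination steps above can be sketched as follows. This is an illustration under stated assumptions: the band boundaries, the band labels and the helper names (band, demographics_key) are hypothetical and not taken from the slides.

```python
from itertools import product

# Hypothetical half-open bands [low, high) with a label each.
INCOME_BANDS = [(0, 20000, "low"), (20000, 60000, "medium"), (60000, float("inf"), "high")]
CREDIT_BANDS = [(0, 500, "poor"), (500, 700, "fair"), (700, float("inf"), "good")]

def band(value, bands):
    """Return the label of the band that contains value."""
    for low, high, label in bands:
        if low <= value < high:
            return label
    raise ValueError(f"{value} is outside all bands")

# Demographics table: one row per possible combination of bands,
# each assigned its own integer dimension key.
demographics = {
    combo: key
    for key, combo in enumerate(
        product([b[2] for b in INCOME_BANDS], [b[2] for b in CREDIT_BANDS]),
        start=1,
    )
}

def demographics_key(income, credit_score):
    """Key to plug into the fact row for the customer's status in a period."""
    return demographics[(band(income, INCOME_BANDS), band(credit_score, CREDIT_BANDS))]

key_now = demographics_key(45000, 720)     # medium income, good credit
key_before = demographics_key(45000, 650)  # same customer, earlier credit band
```

Because the demographics table is built once from all band combinations, a change in a customer's income or credit score only changes which key is stored in new fact rows; the large customer dimension itself is not touched.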


Types of Dimensions
Junk Dimension
  A convenient grouping of random flags and attributes to get them out of the fact table
  Retain only useful fields
    Remove fields that make no sense at all
    Remove fields that are inconsistently filled
    Remove fields that are of operational interest only
  Design similar to demographics: maximum unique combinations, assign integer key, plug into fact
  Create a new combination (insert a new dimension record) at ETL run-time
  E.g. Yes/No flags in old retail transaction data
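
Creating a new junk-dimension combination at ETL run-time amounts to a look-up-or-insert on the flag combination. A minimal sketch, assuming the junk dimension is held in an in-memory dictionary; the flag names and the junk_key helper are hypothetical.

```python
junk_dim = {}  # maps a combination of flag values to an integer surrogate key

def junk_key(flags):
    """Return the key for this flag combination, inserting a new
    dimension record if the combination has not been seen before."""
    combo = tuple(sorted(flags.items()))
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1
    return junk_dim[combo]

k1 = junk_key({"gift_wrapped": "Y", "loyalty_card": "N"})
k2 = junk_key({"gift_wrapped": "N", "loyalty_card": "N"})
k3 = junk_key({"loyalty_card": "N", "gift_wrapped": "Y"})  # same combination as k1
```

Only combinations that actually occur in the source data get a row, which keeps the junk dimension smaller than a table of all theoretical combinations.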


Types of Dimensions
Role-playing Dimension
  The same dimension appears several times in the same fact table
  Typically, the Date/Time dimension plays many roles
  E.g. Order Fulfillment is a typical retail fact table having the following dimensions: Order Date, Packaging Date, Shipping Date, Delivery Date
  Create one fact table key for each role
  Create one SQL view of the dimension for each role
  Use the view names to run SQL queries
  In Business Objects, this scenario is designed using Aliases and Contexts
  E.g. (2) Employee dimension: Salesrep and Manager in the Sales Compensation fact; Appraiser and Appraisee in the Employee Appraisal fact
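
The "one key per role, one view per role" approach can be sketched with Python's built-in sqlite3 module. The table, view and column names and the sample rows below are assumptions made for the illustration; only two of the four date roles are shown to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, month INTEGER);
-- One fact-table key per role that the date dimension plays.
CREATE TABLE order_fulfillment_fact (
    order_id          INTEGER,
    order_date_key    INTEGER REFERENCES date_dim(date_key),
    shipping_date_key INTEGER REFERENCES date_dim(date_key),
    qty               INTEGER
);
-- One view per role over the same physical dimension table.
CREATE VIEW order_date    AS SELECT date_key, full_date, month FROM date_dim;
CREATE VIEW shipping_date AS SELECT date_key, full_date, month FROM date_dim;
INSERT INTO date_dim VALUES (1, '2012-10-30', 10), (2, '2012-11-02', 11);
INSERT INTO order_fulfillment_fact VALUES (500, 1, 2, 3), (501, 2, 2, 1);
""")

# Queries use the view names, so each role reads as a separate dimension.
rows = conn.execute("""
    SELECT f.order_id, o.full_date AS ordered_on, s.full_date AS shipped_on
    FROM order_fulfillment_fact f
    JOIN order_date o    ON o.date_key = f.order_date_key
    JOIN shipping_date s ON s.date_key = f.shipping_date_key
    ORDER BY f.order_id
""").fetchall()
print(rows)
```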


Dimensional Hierarchy
[Figure: Geography dimension hierarchy. World level: World. Continent level: America, Europe, Asia. Country level: USA, Canada, Argentina. State level: FL, GA, VA, CA, WA. City level: Miami, Tampa, Orlando, Naples. Each city is a dimension member / business entity with attributes such as Population and Tourist Places.]


Types of Facts
Value-based classification
  Numeric facts
  Count / occurrence based (e.g. employees assigned to a project)
  Non-numeric facts (e.g. comments in fact tables)
Summary-based classification
  Additive (along all dimensions)
  Semi-additive (mostly along the Time dimension)
  Non-additive (cannot be added along any dimension)
In the example discussed earlier in these slides, Sales and Number of Total Minutes are value-based, additive facts, as they can be added across all dimensions.
Price and Quantity Sold, on the other hand, are value-based but semi-additive facts, as they can be added only across some dimensions. E.g. they cannot be added across the product dimension, because a total price does not make sense; an average price across products makes more sense.
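
The difference between summing Sales and averaging Price can be shown with a few rows. This is a sketch only; the sample rows and the helper names (total_sales_by, average_price) are assumptions made up for the illustration.

```python
from collections import defaultdict

sales_rows = [
    # (date, store, product, sales, price)
    ("2012-11-01", "S1", "Soap", 100.0, 20.0),
    ("2012-11-01", "S1", "Tea",   60.0, 30.0),
    ("2012-11-01", "S2", "Soap",  40.0, 10.0),
]

def total_sales_by(rows, column):
    """Sales is additive: it can be summed along any dimension column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)

def average_price(rows):
    """Price is averaged rather than summed across products and stores."""
    return sum(row[4] for row in rows) / len(rows)

by_store = total_sales_by(sales_rows, 1)
by_product = total_sales_by(sales_rows, 2)
avg_price = average_price(sales_rows)
```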

Types of Fact Tables
Fact tables are classified based upon:
  the type of grain they address or the level of detail they contain, AND
  the way the measurements are taken with respect to time
Thus we have:
  Transaction Fact Table
  Snapshot Fact Table
  Summary Fact Table
[Figure 1]
The context of a transaction is modeled as a set of generally independent dimensions. Figure 1 shows seven such dimensions. The measured transaction amount is in a fact table that refers to all the dimensions by foreign keys pointing outward to their respective dimension tables. The clean removal of all the context detail from the transaction record is an important normalization step and is why fact tables are highly normalized.

Types of Fact Tables
Kimball mentions 3 fundamental types of fact table grain:
Transaction: a transaction is a set of data fields that record a basic business event, e.g. a point-of-sale in a supermarket, attendance in a classroom, an insurance claim, etc. The measurements group nicely together into a single fact table with the same grain.
Periodic Snapshot: a snapshot is a measurement of status at a specific point in time. E.g. in Figure 2, earned premium is the fraction of the total policy premium that the insurance company can book as revenue during the particular reporting period. The periodic-snapshot-grained fact table represents a predefined time span.
[Figure 2]

Types of Fact Tables
Accumulating Snapshot: the accumulating-snapshot-grained fact table represents an indeterminate time span, covering the entire history, starting when the collision coverage was created for the car in our example and ending with the present moment.
[Figure 2]
In dramatic contrast to the other fact-table types, we frequently revisit accumulating-snapshot fact records to update the facts. Remember that in this table there is generally only one fact record for the collision coverage on a particular customer's car. As history unfolds, we must revisit the same record several times to revise the accumulating status.
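
Revisiting a single accumulating-snapshot record can be sketched as an in-place update keyed by customer and car. The field names, the sample amounts and the record_claim helper are assumptions for the illustration; they are not taken from the figure.

```python
from datetime import date

# One fact record per collision coverage; milestone fields start out empty.
coverage_fact = {
    ("CUST1", "CAR9"): {
        "coverage_created": date(2012, 1, 10),
        "first_claim_date": None,
        "claims_count": 0,
        "claims_paid": 0.0,
    }
}

def record_claim(facts, key, claim_date, amount):
    """Revisit the existing record and revise its accumulated status in place,
    instead of inserting a new fact row."""
    row = facts[key]
    if row["first_claim_date"] is None:
        row["first_claim_date"] = claim_date
    row["claims_count"] += 1
    row["claims_paid"] += amount
    return row

record_claim(coverage_fact, ("CUST1", "CAR9"), date(2012, 6, 1), 1200.0)
record_claim(coverage_fact, ("CUST1", "CAR9"), date(2012, 9, 15), 300.0)
```

In a transaction-grained table the same two claims would be two separate rows; here the row count stays at one and only the accumulated values change.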


Visualizing a Dimensional Model
The most popular way of visualizing a dimensional model is to draw a cube. We can represent a three-dimensional model using a cube. Usually a dimensional model consists of more than three dimensions and is referred to as a hypercube. However, a hypercube is difficult to visualize, so a cube is the more commonly used term. In Figure 1, the measurement is the volume of production, which is determined by the combination of three dimensions: location, product, and time.
[Figure 1]
The location dimension and the product dimension each have two levels of hierarchy. For example, the location dimension has the region level and the plant level. In each dimension, there are members, such as the East region and the West region of the location dimension. Although not shown in the figure, the time dimension has its members too, such as 1996 and 1997. Each sub-cube has its own number, which represents the volume of production as a measurement. For example, in a specific time period (not expressed in the figure), the Armonk plant in the East region has produced 11,000 cell phones of model number 1001.

Data Warehouse Bus Matrix
All dimensional models together form the logical design of the data warehouse.
To decide which dimensional models to build, we start with a top-down planning approach called the Data Warehouse Bus Architecture Matrix.
This matrix forces us to list all the data marts we could possibly build and to name all the dimensions that are present in those data marts (at a high level).
A dimensional model is made up of one or more star schemas. Some of these star schemas may be snowflaked for better organization and storage.
[Figure: bus matrix. Rows (subject areas): Accounts, Sales, Quotes, General Ledger, Shipment, Parts/Finance. Columns (dimensions): Organization, Equipment, Employee, Customer, Accounts, Calendar, Outage, Vendor.]

Dimensional Modeling Approach
CDM, LDM, PDM
Each star schema has a single fact table at its centre, surrounded by multiple dimension tables. Once we do this, we can then start the design of each individual fact table/star schema using a 4-step process.
STEP 1: Identify Subject Area / Business Process
  Start the model by choosing a single business process or business sub-process to model, so that you have only one fact table, e.g. the SALES business process.
STEP 2: Define Fact Table Grain
  Choose the GRAIN of the central fact table, e.g.
  i) Each sales transaction is a fact record: the grain is sales by product by store by transaction
  ii) Each daily product sales total in each store is a fact record: the grain is sales by product by store by day
STEP 3a: Identify Dimensions
  Choose the DIMENSIONS as follows:
  Primary dimensions from the fact grain, e.g. product, store, day (or date)
  Additional dimensions based on user interviews and report analysis
  Ensure each dimension is at the lowest level of detail possible while still being single valued.

STEP 3b: Identify grain of dimension table
Ensure that each dimension table grain is NOT lower than the central fact table grain, e.g. the store dimension should have one row for each store. Each store may have departments, but a store dimension row should represent only the store and not the department.

STEP 3c: Identify all dimensional attributes
For each dimension choose only SINGLE VALUED attributes, e.g. if Region is an attribute of the store dimension, then it should have one and only one value for each store.

STEP 3d: Identify dimension hierarchies and the attributes of each level in the hierarchies

Figure: Customer Dimension Hierarchies (Industry, Geography). Levels shown: Country, State, City, Industry Segment, Industry Type, Fin. Class, Customer.

Figure: Sales fact table with columns Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float).

STEP 4a: Choose Facts
Choose each fact for the fact table, making sure that the fact is relevant and has the same grain as the fact table, e.g. for the SALES fact table, typical facts would be price, quantity sold and sales amount, as these are all dimensioned by product, by store, by day.
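To make the two grains from STEP 2 concrete, the sketch below rolls transaction-grain rows up to the daily grain (sales by product by store by day). The row layout, identifiers and amounts are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Transaction-grain rows (hypothetical sample data):
# (transaction_id, day, store, product, qty_sold, sales_amount)
transactions = [
    ("T1", "2012-11-05", "S1", "P1", 2, 20.0),
    ("T2", "2012-11-05", "S1", "P1", 1, 10.0),
    ("T3", "2012-11-05", "S1", "P2", 5, 25.0),
    ("T4", "2012-11-06", "S1", "P1", 3, 30.0),
]


def to_daily_grain(rows):
    """Aggregate transaction-grain facts to one row per product, store and day."""
    totals = defaultdict(lambda: [0, 0.0])
    for _txn, day, store, product, qty, amount in rows:
        cell = totals[(product, store, day)]
        cell[0] += qty      # additive fact: quantity sold
        cell[1] += amount   # additive fact: sales amount
    return {key: tuple(value) for key, value in totals.items()}


daily = to_daily_grain(transactions)
```

Four transaction rows collapse into three daily rows. The coarser grain is smaller and faster to query, but the individual transactions can no longer be recovered from it, which is why the grain decision comes before choosing dimensions and facts.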

STEP 4b: Connect the fact table to the dimension tables by means of surrogate keys

Figure: Financial Transactions fact table at the centre, connected to the dimensions Customer (Customer), Product (Scheme), Time (Day-Hour), Organization (Branch) and Channel (Channel).

Important Notes:
1. Each dimension table will have a MEANINGLESS SINGLE PART INTEGER PRIMARY KEY called a surrogate key. This key also occurs as part of the central fact table's primary key.
2. All parts of the fact table primary key should ideally be foreign keys to the corresponding dimension tables.
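The notes above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the deck's own schema: the table and column names (dim_store, dim_product, fact_sales and so on) are hypothetical, and only two dimensions are modelled to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE dim_store (
    store_key   INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
    store_code  TEXT NOT NULL,         -- natural key from the source system
    region      TEXT NOT NULL          -- single-valued attribute
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key
    product_code TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key    INTEGER NOT NULL,
    store_key   INTEGER NOT NULL REFERENCES dim_store(store_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    qty_sold    INTEGER NOT NULL,
    sales       REAL NOT NULL,
    PRIMARY KEY (date_key, store_key, product_key)
);
""")
conn.execute("INSERT INTO dim_store VALUES (1, 'ST-001', 'East')")
conn.execute("INSERT INTO dim_product VALUES (1, 'P-1001')")
conn.execute("INSERT INTO fact_sales VALUES (20121105, 1, 1, 3, 30.0)")


def insert_orphan_fact():
    """Try to load a fact row whose store key has no matching dimension row."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (20121105, 99, 1, 1, 10.0)")
        return True
    except sqlite3.IntegrityError:
        return False


# A typical star-join query: constrain/group by a dimension attribute.
sales_by_region = conn.execute(
    "SELECT s.region, SUM(f.sales) FROM fact_sales f "
    "JOIN dim_store s ON s.store_key = f.store_key GROUP BY s.region"
).fetchall()
```

The fact table's primary key is built from the surrogate keys, and the foreign key constraints reject a fact row that points at a non-existent store.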

Implementation and Maintenance
DWH Design, Deployment and Maintenance

Implementation & Maintenance Agenda

A. Physical Design Steps
B. Physical Design Considerations
C. Physical Storage
D. Indexing
E. Performance Enhancement Techniques
F. Deployment Activities
G. Security
H. Backup and Recovery
I. Monitoring Data Warehouse
J. User Training and Support
K. Managing Data Warehouse

DWH Physical Design Process


Physical Design Process
Develop Standards
- Process Standards
- Naming Standards
  - Database Objects
  - Word Separators
  - Names in Logical and Physical Model
  - Physical File naming standards
  - Naming of files & tables in Staging area

Physical Design Process (continued)
Create Aggregates Plan
- Identify granularity level
Determine the Data Partitioning Scheme
- Selecting fact and dimensions
- Horizontal or Vertical
- Number of partitions
- Criteria for partitions

Physical Design Process (continued)
Establish Clustering Options
- Placing and managing related units of data together
Prepare Indexing Strategy
- Identify the columns
- Identify the sequence
Assign Storage Structures
Complete Physical Model
- Review all the above activities
- Create physical model


Physical Design Considerations
- Improve Performance
- Ensure Scalability
- Manage Storage
- Provide Ease of Administration
- Design Flexibility
- Assign Storage Structures


Physical Storage
Types of Storage Structure
- Files
- Facts
- Dimensions
- Indexed Data Structures

Storage Considerations
Set correct units of database space allocation
- Data Block
Set proper block usage parameters
- Free and Used Space
Manage data migration
- Row Chaining
- Row Migrating
Manage block utilization
- Should have less free space

Storage Considerations (continued)
Resolve dynamic extension
- Inserting a new record
- Updating the existing record
Employ file striping techniques
- Splitting files into multiple physical parts
- Enables concurrent I/O


Indexing the Data Warehouse
- What are Indexes?
- Why are Indexes required?
- When should I create an Index?
- What are the different types of Indexes?

Types of Indexes
- B-tree: The default and the most common.
- B-tree cluster: Defined specifically for a cluster.
- Hash cluster: Defined specifically for a hash cluster.
- Bitmap: Compact; works best for columns with a small set of values.
- Bitmap join: An index on one table that involves columns of one or more other tables through a join.
- Function-based: Contains the pre-computed value of a function/expression.

B-tree Index


B-tree Index - Advantages
- All leaf blocks of the tree are at the same depth, so retrieval of any record from anywhere in the index takes approximately the same amount of time.
- B-tree indexes automatically stay balanced.
- All blocks of the B-tree are three-quarters full on average.
- B-trees provide excellent retrieval performance for a wide range of queries, including exact match and range searches.
- Inserts, updates, and deletes are efficient, maintaining key order for fast retrieval.
- B-tree performance is good for both small and large tables and does not degrade as the size of a table grows.

Bitmapped Index


Bitmapped Index - Advantages
- Reduced response time for large classes of ad hoc queries
- A substantial reduction of space use compared to other indexing techniques
- Dramatic performance gains even on very low-end hardware
- Very efficient parallel DML and loads
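A toy illustration of why bitmap indexes suit low-cardinality columns and ad hoc queries: each distinct value gets one bitmap, and combining predicates becomes a bitwise AND. This sketch uses Python integers as bitmaps; the column values are hypothetical, and a real DBMS would add compression and locking that are not shown here.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality columns of the same five-row table (hypothetical data).
region = ["East", "West", "East", "East", "West"]
status = ["Open", "Open", "Closed", "Open", "Closed"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'East' AND status = 'Open' becomes a single bitwise AND.
east_and_open = rows_of(region_idx["East"] & status_idx["Open"])
```

With only a handful of distinct values per column, the index is a few bit strings regardless of which combination of predicates a user asks for.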

Clustered Index


Performance Enhancement Techniques
- Data Partitioning
  - Decomposing tables into smaller and more manageable pieces called partitions
  - Range, list, hash & composite partitioning
- Data Clustering
- Parallel Processing
- Summary levels
- Referential Integrity Checks
- Initialization Parameters
- Data Arrays
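The four partitioning schemes named above differ only in how a row is routed to a partition. The sketch below shows that routing logic in Python; the partition names, the state lists and the choice of CRC32 as the hash are all assumptions made for illustration, not any particular DBMS's behaviour.

```python
import zlib

# Hypothetical value lists for list partitioning.
REGION_LISTS = {"east": {"NY", "NJ"}, "west": {"CA", "WA"}}


def range_partition(sale_date):
    """Range: route by year, taken from an ISO date string (YYYY-MM-DD)."""
    return "sales_" + sale_date[:4]


def list_partition(state):
    """List: route by membership in an explicit list of values."""
    for name, states in REGION_LISTS.items():
        if state in states:
            return "sales_" + name
    return "sales_other"


def hash_partition(customer_id, partitions=4):
    """Hash: spread keys across partitions using a stable hash (CRC32)."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % partitions


def composite_partition(sale_date, customer_id):
    """Composite: range first, then hash within the range partition."""
    return (range_partition(sale_date), hash_partition(customer_id))
```

Range partitioning by date is the usual choice for fact tables because old partitions can be archived or dropped as a unit, while hash partitioning helps when no natural range exists and even distribution matters more.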


Major Deployment Activities
- Complete User Acceptance
- Perform Initial Loads
- Get User Desktops Ready
- Complete Initial User Training

Deployment Approaches
Top-Down Approach
- Deploy the overall enterprise DWH, followed by the dependent data marts, one by one.
Bottom-up Approach
- Gather departmental requirements and deploy the independent data marts, one by one.
Practical/general Approach
- Deploy the subject data marts one by one, flexibly, with fully conformed dimensions, following the waterfall model.
Note: It is always advisable to deploy in stages.


Security
Prepare a Security Policy
- Should cover scope of information, physical security, network and connections, DB access privileges, & access matrix.
Manage user privileges
Password considerations
Security tools


Backup and Recovery
- Why is backup required?
- What is data warehouse administration?
- What are the roles of the Data Warehouse Administrator (DWA)?

DWA - Roles
- Building the data warehouse
- Ongoing monitoring and maintenance for the data warehouse
- Coordinating usage of the data warehouse
- Management feedback as to successes and failures
- Competition for resources for making the data warehouse a reality
- Selection of hardware and software platforms

Backup Strategy
- Should the data be actually discarded or should the data be removed to lower-cost, bulk storage?
- What criteria should be applied to data to determine whether it is a candidate for removal?
- Should the data be condensed (profile records, rolling summarization, etc.)? If so, what condensation technique should be used?
- How should the data be indexed once it is removed (if there is ever to be any attempt to retrieve the data)?
- Where and how should metadata be stored once the data is removed?

Backup Strategy (continued)
- Should metadata be allowed to be stored in the data warehouse for data that is not actively stored in the warehouse?
- What release information should be stored for the base technology (i.e., the DBMS) so that the data as stored will not become stale and unreadable?
- How reliable (physically) is the media that the data will be stored on?
- What seek time to first record is there for the data upon retrieval?


Monitoring the Data Warehouse
- Collection of statistics
- Using statistics for growth planning
- Using statistics for fine-tuning
- Publishing trends for users


Support
- Help Desk support
- Hotline support
- Technical support
- User representative
Note: We should always follow a multi-tier support structure.

User Training
User Training Contents
- Should provide enough data content.
- Should cover all the applications involved.
- Should cover the features and usage of the tools used.
Identifying the users to be trained
Delivering the training program


Managing the Data Warehouse
- Platform upgrades
- Managing data growth
- Storage management
- ETL management
- Data model revisions
- Information delivery enhancements
- Ongoing fine tuning

Data Management


Enterprise Data Management Framework
Figure: the framework comprises Data Governance, Master Data Management, Data Architecture, Data Quality Management, Metadata Management, and Data Storage, Movement and Access.

Enterprise Data Management Framework explained

Data Governance
- Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
- Data governance encompasses the people, processes and procedures required to create a consistent, enterprise view of a company's data.

Data Storage, Movement and Access
- Data movement involves translating/moving data from one format/storage device to another.
- Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled.

Metadata Management
- Metadata is data about data. It describes how, when and by whom a particular set of data was collected, and how the data is formatted.
- Metadata management is becoming very important because, as systems become more interdependent, it is vital to know the impact of altering data.

Data Architecture

Data Quality Management
- Data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data.
- Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating it, standardizing it, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.

Data Quality Analysis


Data Quality Analysis Agenda
A. Why Data Quality Management?
B. Elements for Data Quality
C. Classification of Data Quality Issues
D. Dimensions of Data Quality
E. TCS DQM Approach
F. DQM Architecture Options
G. BIDS(TM) DQM Methodology
H. Tools and Technology

Why DQM?
- Why is this NULL?
- Where can I get just one view for all the data?
- Is empid same as emp_id?
- So many duplicate products on this list
- I am still not able to see latest data
- Returns on Investment are below expectations
- Holland??? Is this customer in Europe or USA?


Elements for Data Quality
Data quality can be hampered by errors in the following elements:
- Definitions
- Domains
- Completeness
- Validity
- Data Flows
- Structural Integrity
- Business Rules
- Transformations

Definition
This indicates how entities are referenced throughout the enterprise. Definition problems can be further categorized as below:
- Synonyms: The fields EMP_ID, EMPID, and EM01 may or may not all actually refer to the same type of data.
- Homonyms: Fields that are spelled the same but really aren't the same (id or ID).
- Relationships: Just because a field is named FK_INVOICE doesn't mean that it is really a foreign key to the invoice file.

Domains
Domains describe the range and types of values that can be present in a data set. Some examples of domain errors are:
- Unexpected values, e.g. Home State = one of {Kan, Mic, Min}
- Cardinality: a Yes/No field can have only two credible values
- Uniqueness: for a field, 98% of data is NULL
- Constant
- Outliers
- Length of field
- Precision
- Scale
- Internationalization: date formats, postal codes, time zones, etc.
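A few of the domain checks above can be expressed as simple functions over a list of records. This is a sketch under stated assumptions: the field name home_state, the sample rows and the function names are hypothetical; only the allowed set {Kan, Mic, Min} comes from the slide.

```python
def domain_errors(rows, field, allowed=None, max_length=None):
    """Return (row number, problem) pairs for values outside the field's domain."""
    errors = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value is None:
            errors.append((i, "null"))
        elif allowed is not None and value not in allowed:
            errors.append((i, "unexpected value"))
        elif max_length is not None and len(value) > max_length:
            errors.append((i, "too long"))
    return errors


def null_ratio(rows, field):
    """Share of rows in which the field is NULL (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


# Hypothetical sample rows.
rows = [
    {"home_state": "Kan"},
    {"home_state": "Mic"},
    {"home_state": "Texas"},
    {"home_state": None},
]
found = domain_errors(rows, "home_state", allowed={"Kan", "Mic", "Min"}, max_length=3)
```

A high null ratio on a field (the slide mentions 98%) is a signal worth reporting on its own, since a mostly empty column rarely supports the analysis it was created for.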

Completeness
This indicates whether or not all of the data is actually present. Completeness of a dataset can be gauged by its:
- Integrity: Is actual data mapping to our definition of data?
- Accuracy: Name and address matching, demographics check
- Reliability: Zip code should match city and state
- Redundancy: Data duplication
- Consistency: Is the same invoice number referenced with different amounts?

Validity
Validity indicates whether or not the data is valid. Validity checks used to spot data problems are:
- Acceptability: a product part number should consist of a 7-character alphanumeric string with two characters and 5 digits
- Anomalies
- Timeliness
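The acceptability rule for part numbers maps naturally onto a regular expression. The slide does not say where the two characters sit or whether case matters, so this sketch assumes two uppercase letters followed by five digits; adjust the pattern if the real rule differs.

```python
import re

# Assumption: two uppercase letters, then five digits (7 characters in total).
PART_NUMBER = re.compile(r"[A-Z]{2}[0-9]{5}")


def is_acceptable(part_number):
    """True when the whole string matches the assumed part number format."""
    return PART_NUMBER.fullmatch(part_number) is not None
```

Using fullmatch rather than match ensures that trailing characters (for example "AB12345X") are rejected as well.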

Data Flows
These checks relate to the aggregate results of moving data from source to target. Many data quality problems can be traced back to incorrect data loads, missed loads or system failures that go unnoticed. Data flow checks to ensure data quality are:
- Record counts: reconciliation of source and target record counts
- Checksums
- Timestamps
- Process time
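Record-count and checksum reconciliation can be sketched as follows. The approach shown, summing a SHA-256-derived number per row so that row order does not matter, is one possible design and an assumption of this example; the function names and sample rows are hypothetical.

```python
import hashlib


def load_fingerprint(rows):
    """Return (record count, order-independent checksum) for a batch of rows."""
    total = 0
    for row in rows:
        line = "|".join(str(value) for value in row).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(line).digest()[:8], "big")
        total = (total + row_hash) % 2**64  # addition makes row order irrelevant
    return len(rows), total


def reconcile(source_rows, target_rows):
    """Compare source and target loads by record count and checksum."""
    src_count, src_sum = load_fingerprint(source_rows)
    tgt_count, tgt_sum = load_fingerprint(target_rows)
    return {
        "counts_match": src_count == tgt_count,
        "checksums_match": src_sum == tgt_sum,
    }


source = [(1, "a", 10.0), (2, "b", 20.0)]
same_data_reordered = [(2, "b", 20.0), (1, "a", 10.0)]
missed_row = [(1, "a", 10.0)]
changed_value = [(1, "a", 10.0), (2, "b", 25.0)]
```

A count check alone would miss the changed_value case, where the number of rows is right but a value was altered in flight; the checksum catches it.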

Structural Integrity
These checks ensure that when the data is taken as a whole, you are getting correct results. Structural integrity checks include:
- Cardinality: checks between tables
- Primary keys: are these unique?
- Referential integrity: e.g. a product present on an invoice but missing from the product catalog
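The primary key and referential integrity checks can be sketched over in-memory records. The table contents and column names below are hypothetical; the invoice/catalog scenario follows the slide's example.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Primary key check: return key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphan_keys(child_rows, fk, parent_rows, pk):
    """Referential integrity check: child keys with no matching parent row."""
    parents = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parents)


# Hypothetical data: P2 is duplicated in the catalog, P9 is missing from it.
catalog = [{"product_id": "P1"}, {"product_id": "P2"}, {"product_id": "P2"}]
invoices = [
    {"invoice": 1, "product_id": "P1"},
    {"invoice": 2, "product_id": "P9"},
]
```

In a database the same checks are typically a GROUP BY ... HAVING COUNT(*) > 1 query and an anti-join; the logic is identical.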

Business Rules
Business rule checks measure the degree of compliance between actual data and expected data. These checks consist of:
- Constraints: does the data comply with a known set of validations?
- Computational rules: is the formula for deriving an amount correct?
- Comparisons
- Functional dependencies
- Conditions

Transformations
Transformation checks examine the impact of data transformations as data moves from system to system. The quality of data can be affected by incorrect transformation logic. The only way to identify such problems is to compare the source data set with the target data set and verify the transformations for:
- Computations
- Merging
- Filtering
- Relationships


Data Quality Issues - Classification
Figure: Data Quality Issues divided into Physical Issues, Logical Issues and Unmanaged Data Issues. Further labels: Data Profile; Business Rules (for Cleansing); Data Parsing (using Rules, Text Mining etc.).


Dimensions of Data Quality
The FIVE Dimensions of Data Quality
- Sufficiency: Sufficiency for the purpose of Business Intelligence
- Consistency: Consistency of definition of data across the Data Warehouse
- Accuracy: Accuracy as defined by business rules
- Redundancy: No redundancy across the warehouse
- Latency: No major change of data between the instance of data capture and when it is processed
Data Quality is measured across these dimensions.


TCS DQM Approach
Analysis of source data quality
- User driven
- Data driven
Characterization of quality data
- This implies identification of the necessary and sufficient criteria that define quality data
- Domain validations and business rules validations are covered here
Feasibility Analysis
- Mapping of data elements to rules
- Mapping of relationships to rules
- Assessment of grain of data
Design and Implementation
- BIDS DQM Methodology


DQM Architecture
- DQM at source
- DQM as part of ETL processes
- DQM in the target


DQM Methodology
- Modular approach to building solutions
- Clear and well-defined guidelines, checklists and standards
- Supports the Onsite-Offshore delivery model
- Flexibility to adapt with other methodologies
- E-T-V-X criteria reinforced by best practices and TCS quality initiatives


Tools and Technology
Common software products for name and address cleansing
- Trillium
- First Logic
- TCS DataClean
Common ETL tools
- Informatica
- Ab Initio

Tools and Technology (continued)
TCS has expertise in industry-standard tools and products ranging from RDBMS, ETL, CGI & Web products, used in conjunction with in-house developed tools. The TCS Knowledge Base also includes a number of specialised tools that are proficient in data cleansing, validation and trending.
The common software products in use:
- Trillium Software: Used for cleansing name and address data. The software can identify and match households, business contacts and other relationships to eliminate duplicates in large databases using fuzzy matching techniques.
- Ab Initio: Provides a suite of software packages used for ETL in data warehouses. Its features include parallel data transformation, validation and filtering, real-time data capture, integration with relational DBMS systems, and data profiling capability.
- Unitech and Actuate: Comprise a set of reporting tools used for trend analysis, point-to-point reconciliation and detecting data inconsistency.

Case Study: British Telecom Retail : SWIFT


Client Profile and Business Drivers
Client Profile
- BT Retail is a significant player in the communications market in the UK.
- BT Retail has three main customer groups: consumer, business, and major business or corporate.
- The products and services cover the entire range from traditional telephony service to mobile technology, internet access and web-based services.
Business drivers
To unify existing marketing systems to provide a centralized customer repository that caters to:
- Developing, targeting and presenting propositions
- Managing customer relationships
- Undertaking rapid tactical marketing
- Improving campaign effectiveness
- Reducing marketing operational costs

Business Objectives
- Reduction in marketing operational costs
- Reduction of marketing cycle times
- 360 degree view of customers
- Delivery of consistent messages across all customer channels
- Increased customer focus, better understanding and segmentation of customers
- Event-driven campaigns targeted at a focused customer group
- Improvement in campaign effectiveness
- Maintaining a large data volume: one of the largest data warehouses in Europe, with 3.36 TB of data and a growth projection of 1% per week

Challenges
- Data quality issues in the vital data attributes in BT's operational data store
- Cleansing a backlog of erroneous information stored in the database
- Decommissioning and migration of data from legacy systems
- Decommissioning of 30 TB of RDBMS
- Maintaining data integrity

Proposed Profiling and DQM Solution
As part of the solution, the team deployed a Business Rules Repository (BRR) to store all business rules scattered across the enterprise. This enabled:
- Sharing of information between business and IT stakeholders in an effective and efficient manner
- Storage of basic information about each business rule, with a history of changes applied to it over time
Types of business rules:
- Format check: numeric, character or date with a specific pattern
- Cross-attribute value check within a dataset (compare multiple attributes in a dataset)
- List of values: for a small list of valid values
- Lookup: for a large list of values, such as a list of country codes
- Uniqueness or duplication check
- Data integrity check
- Cross-attribute value check across datasets
Data Profiler reports were created to report:
- Structure and statistics for each data element
- Data value, range, distribution, pattern and format of each attribute
- Relationships of various attributes within and across datasets: join key, primary key, potential foreign key, data dependencies, etc.
A web-based application, Quest, was delivered as a value-add for Data Quality Management.
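To illustrate what a profiler report and a list-of-values rule might compute, here is a small sketch. It is not the Data Profiler or BRR used in the case study; the function names, the "9"/"A" pattern notation and the sample postcodes are assumptions made for this example.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to a format pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))


def profile(values):
    """Basic statistics for one attribute: counts, nulls, distinct values, patterns."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "patterns": Counter(pattern_of(v) for v in present),
    }


def list_of_values_check(values, allowed):
    """List-of-values rule: return the values that are not in the valid list."""
    return [v for v in values if v is not None and v not in allowed]


# Hypothetical attribute values, including one NULL and one odd format.
postcodes = ["SW1A 1AA", "EC1A 1BB", None, "12345", "EC1A 1BB"]
report = profile(postcodes)
```

The pattern distribution makes format outliers visible at a glance: here one value follows the pattern 99999 while the rest share AA9A 9AA, which is the kind of finding a format-check rule would then be written for.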

Integration of Data Profiler and BRR
Figure: the BRR for source and target systems holds Source System Rules, Common Rules and Target System Rules. An adaptor embeds the business rules into the Data Profiler. The Data Profiler profiles and analyzes the source system test data and, after schema transformation, the target system test data; the two results are then compared.

Software and Hardware
Software
- Database Layer: Oracle
- Application Layer: Ab Initio, Trillium, Unitech and Actuate
Hardware
- IBM Sequent NUMA-Q server: 16 quad machine and 2GB RAM

Application Architecture
Figure: source data (Legacy System, Sales System, ERP, CRM, Billing System, 3rd Party Data) is profiled at full volume. The profiled information goes to data analysts and business owners and feeds the design document and DQ reports. Project phases shown: Requirement Analysis, Solution Design, High/Low Level Design, Various Test Phases, Deployment.
- Reduces data assumptions
- Use Profiler output for different design phases
- Use Profiler output to build the BRR
- Embed the rules defined in the BRR into the Data Profiler
- Data Audit and Data Quality Monitors (DQM)
- Profile source test data and target test data, then compare and analyse the Profiler output to validate the transformation process
- Profile live data on the live target system for Data Audit

Benefits to Client
- A huge backlog of data quality issues was resolved, leading to savings worth millions of pounds for BT
- A generic name and address data cleansing methodology was designed that can be reused as a prototype for similar requirements, saving time
- Profiling of live data on an ongoing basis to check compliance over time
- The BRR was used for developing Data Quality Monitors (DQM)
- Development of a uniform data dictionary for all disparate source systems
- Reduced risks and accurate planning

Client Speak...
"We have made fantastic progress in managing to roll out some really big, complex deliveries... all thanks to your commitment, and your ability to work as a team in order to resolve issues quickly whilst under a lot of pressure. Well done everyone." - Simon Riley

Metadata Management


Metadata
Data about Data
For every data element:
- definition
- characteristics
- relationships to other data elements
Metadata categories
- Business Metadata
- Technical Metadata
Metadata currency
- Static (Slowly changing) Metadata
- Dynamic Metadata
Metadata types
- Control Metadata
- Process Metadata

Metadata Products

Candidate Products       Vendor
SuperGlue                Informatica
System Architect         Popkin
MetaStage                Ascential
Platinum Repository      Platinum Technologies
Advantage Repository     Computer Associates
Rochade                  Allen Systems Group
Microsoft Repository     Platinum/Microsoft Partnership
MetaCentre               Data Advantage Group
MDM                      I2

Data Governance


Data Governance
What is Data Governance?
Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
Why go in for it?
- Increase consistency & confidence in decision making
- Decrease the risk of regulatory fines
- Improve data security

Master Data Management


What is Master Data Management
- Master Data Management (MDM) is a discipline in Information Technology (IT) that focuses on the management of reference or master data that is shared by several disparate IT systems and groups.
- MDM enables consistent computing between diverse system architectures and business functions.
- MDM integrates dimensional and master data across BI, data warehouse, financial & operational systems, providing for accurate, consistent and compliant enterprise reporting.
- MDM supplies metadata for aggregating and integrating transactional data.
Figure labels: Practices, Processes, Technologies; Master Data; Metadata

Typical Requirements for MDM
- Role Definition Support: Support for definition of roles, with access rights enforced depending on the responsibilities assigned to each role.
- ETL: ETL capabilities for extracting master data/reference data files or tables from multiple sources and loading the data into the master data repository.
- Data Cleansing: Data cleansing capabilities for de-duplication and matching of master data records.
- Collaborative platform: A collaborative platform for coordinating decisions on master data reconciliation and rationalization. The platform should be supported by standards, if available, or by industry knowledge of a master data domain. An example is a standard product hierarchy for a particular industry.
- Data synchronization and replication support: For applying changes established in a central server to each consuming application. Incremental change support is important for performance reasons.
- Version control and change monitoring: Version control at the central policy hub combined with change monitoring across all of the participating systems. This is needed in order to track changes to master data over time.

Processes required for Master Data Management
Master data is managed via a policy hub, as shown in the figure.
- The policy hub for master data management collects master data from participating analytical and transactional systems.
- Collaborative applications run on the central policy hub to coordinate decisions among team members on master data policies.
- The standard master data is published to each participating system (transactional and analytical) so that they are synchronized with the hub.

Processes required for Master Data Management (continued)
Steps in the process for managing and maintaining master data:
- Assign business responsibility for each master data domain, such as products, customers, suppliers, organizational structure.
- Extract master data for a domain from separate operational and reporting systems to a central server.
- Apply data quality standards, such as de-duplication and matching of master data records, to get a clean set of master data for the domain.
- Reconcile and rationalize the master data records. This process entails setting policies pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list.
- Synchronize participating operational and reporting systems with the centrally managed, canonical master data.
- Monitor changes or updates to master data in each participating system. Then repeat the preceding steps for ongoing maintenance of master data.
Over time, with the centralization of master data management responsibilities, the origination of master data changes moves from the participating systems to the master data management hub or server.
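The extract, de-duplicate and synchronize steps above can be sketched in a few lines. Everything here is hypothetical: the system names, the identifiers, the crude name-normalization used as a matching key, and the policy of keeping the first spelling seen as the canonical name. Real MDM matching is usually fuzzier and policy-driven.

```python
def normalize(name):
    """Crude matching key: lowercase, trimmed, inner whitespace collapsed."""
    return " ".join(name.lower().split())


def build_master(records):
    """De-duplicate records from several systems into one master list.

    records: iterable of (system, local_id, name) tuples. The first spelling
    seen for a matching key is kept as the canonical name (assumed policy).
    """
    master = {}  # matching key -> master record
    for system, local_id, name in records:
        key = normalize(name)
        entry = master.setdefault(
            key, {"master_id": len(master) + 1, "name": name, "sources": []}
        )
        entry["sources"].append((system, local_id))
    return list(master.values())


def publish(master):
    """Cross-reference used to synchronize each participating system with the hub."""
    return {src: entry["master_id"] for entry in master for src in entry["sources"]}


# Master data extracted from three hypothetical participating systems.
records = [
    ("CRM", "c-17", "Acme Ltd"),
    ("Billing", "b-03", "acme  ltd"),
    ("ERP", "e-88", "Globex"),
]
master = build_master(records)
xref = publish(master)
```

The cross-reference is what lets each participating system keep its local identifiers while reporting against a single master record.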

Data Storage, Movement and Access


Data Security and Access
- Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled. It thus helps to ensure privacy and to protect personal data.
- It is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption.
- Protecting confidential information is a business requirement and, in many cases, also a legal requirement; some would say that it is also the right thing to do.

Question & Answer Session...
