Data Design Process and Considerations

Page 1: Data Design Process and Considerations

Information Excellence informationexcellence.wordpress.com

Harvesting Information Excellence

Information Excellence 2013 Oct Knowledge Share Session

Balaji Venkataraman, Data Architect, Dell

Data Design Process, Considerations and Practices

Hosted by

Page 2: Data Design Process and Considerations

Data Design Process

Considerations and Practices Balaji Venkataraman, Data Architect, DELL

Acknowledgement

Attila Finta – Chief Architect, Dell EBI

- My Mentor [Content sourced from his work]

DVR Subrahmanyam – Intel

- Support, Guidance and Encouragement

Page 3: Data Design Process and Considerations

Balaji Venkataraman


Balaji Venkataraman, Data Architect, Dell

An Information Technology Data Warehouse / Business Intelligence professional with 16+ years of industry experience, which includes being part of the architecture and design teams of data warehouses such as Dell's.

He has previously worked for Delphi-TVS, PSI Data Systems and iGate Corporation.

He has played several individual contributor roles, such as Support, Developer, ETL Designer, Data Designer and Information Architect.

Currently a member of the Analytics and Business Intelligence Innovation Team at Dell, Bangalore.

http://in.linkedin.com/in/balajivenkataraman03

Page 4: Data Design Process and Considerations

Agenda

• Data Design and Challenges

• Data Design Process and Deliverables

• Considerations for Standards and Best Practices

• Data Profile, Quality, Metadata, ILM and Columnar


Page 5: Data Design Process and Considerations

EDW:
• PB Scale
• 1000s of Modeled Entities
• 1000s of User Maintained Data
• 100s of Schemas
• 1000s of Users

Page 6: Data Design Process and Considerations

Data Design?

• Data Architecture is defining, organizing, cataloging – logically and physically – the information of the enterprise that is electronically represented, stored and exchanged in terms of its creation, meaning and utilization.

• Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques:

– To manage data as a resource

– For the integration of information systems

– For designing databases/data warehouses (aka data repositories)

• Data models provide a structure for data used within information systems by providing specific definition and format.


Page 7: Data Design Process and Considerations

Considerations / Challenges in Data Modeling

• Presentation of models to non-technical audiences

• Selling the value of data modeling to the business

• More focus on business needs, with less focus on implementation and the final product or “the perfect data model”

• More emphasis on conceptual modeling, which also may help data modelers to be more sought out in the emerging non-relational world

• Adaptation of relational approaches to accommodate Big Data and other emerging technologies

• Better engagement with NoSQL and other non-traditional databases

• More education/training of current practitioners in the newest trends and technologies

• Growth of new data architects, data modelers, and other data experts within colleges and universities

• Changing from a control-oriented mindset where the model is the only focus, to a service-oriented mindset that focuses on communication and marketing


Concerns:

• Visibility – Data Designers need to provide more visibility

• Speed – Agile Adoption

• Quality - Good rather than Perfect

• Perspective – Incremental is better than absolutes

• Collaboration – Work closely with other stakeholders – ETLAs / DBAs

• Skills – Expand Skillsets

Page 8: Data Design Process and Considerations

Types of Data Models

• Conceptual – First step in organizing the data requirements
– Consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes

• Logical – Describes the structure of some domain of information in a normalized fashion. This consists of meaningful, descriptive, non-circular entity and attribute definitions
– FK relationships to EDW entities already implemented

• Physical – Describes the physical means used to store data. This is concerned with partitions, CPUs, storage, and the like


Page 9: Data Design Process and Considerations

Data Designer Core KSEs

• Business systems analysis

• Industry DW architectures

• Data profiling

• Data modeling – 3NF and dimensional

• Database basics

• Data Modeling Tools – DB features

– Forward engineering

• Source to target mapping specification

Aptitudes & Interests of a Great Data Architect / Analyst / Designer

• Curiosity: "Why?", "What if?", "How?"

• Ability to move from the conceptual and abstract to specific and back again

• Ability to visualize the big picture and the myriad details and how the latter affects the former

• Ability to clearly communicate: (1) convey, explain, illustrate, (2) hear, listen, elicit, understand – verbally and in writing

• Ability to “speak the language” of the business and the technical – and translate between them

The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. ~ George Bernard Shaw


C3: Communication, Coordination, Collaboration

Page 10: Data Design Process and Considerations

Data Design Process

• Analysis

• Logical modeling

• Physical design


Deliverables:

• Source data profile *

• Data model: logical/physical

• DDL: generated from model

• Source to target (S2T) mapping

Page 11: Data Design Process and Considerations


Why Do You Need a Logical Data Model?
– Provides discipline and structure
– Facilitates communication
– Common understanding of business concepts
– Cross-functional and unbiased

What Goes into a Data Model? It graphically represents the data requirements and data organization of the business:

– Things about which it is important to track information (entities)

– Facts about those things (attributes)

– Associations between those things (relationships)

Subject-oriented, designed in Third Normal Form – one fact in one place, in the right place

Page 12: Data Design Process and Considerations


Data Design Process – Reviews / Check Points

Page 13: Data Design Process and Considerations

Source Data Profiling

• Examine the nature, scope, content, meaning, structure of data from the source system proposed for inclusion in the DW

• Determine quality of the data:

– Completeness

– Consistency

– Integrity

• Determine its fit into the DW

– Does it belong? Is it meaningful, useful, necessary?

– How does it integrate with, complement, supplement what already exists in the DW?


Page 14: Data Design Process and Considerations

Source Data Profiling – What To Look For?

• Does the content match the name and expected information?

• Candidates to omit
– Columns 100% null, or containing only 1 value

• Candidates to transform/conform/default
– Code values that are similar or semantically the same as existing code sets in the DW but are inconsistent, e.g. country codes

• Candidates to:
– Normalize
– Create FKs
– Compress values
– Reduce column size
– Default value
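A minimal SQL sketch of such profiling checks (the table SRC_STG.CUST and column CTRY_CD are hypothetical):

-- Completeness and cardinality of a candidate column
SELECT COUNT(*)                  AS row_cnt,
       COUNT(CTRY_CD)            AS non_null_cnt,  -- 0 => 100% null, candidate to omit
       COUNT(DISTINCT CTRY_CD)   AS distinct_cnt,  -- 1 => single value, candidate to omit/default
       MAX(CHAR_LENGTH(CTRY_CD)) AS max_len        -- guides right-sizing of the target column
FROM SRC_STG.CUST;

-- Value frequencies: spot similar/inconsistent codes to transform or conform
SELECT CTRY_CD, COUNT(*) AS freq
FROM SRC_STG.CUST
GROUP BY CTRY_CD
ORDER BY freq DESC;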


Page 15: Data Design Process and Considerations

Source to Target Specification (S2T)

• The data designer:
– Has examined the source data structure, content, and meaning
– Understands the business requirements of the DW
– Designs the DW target structures for the data

• Therefore the data designer is in the best position to specify the movement and transformation of data from the source structure to the target structure in the DW … "connecting the dots"
– Column specific
– Transformation and validation rules in pseudo-code
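For illustration, one hypothetical S2T row with its rule in pseudo-code:

Source: CUST_STG.CTRY_CD Varchar(60)  →  Target: BASE.CUST.CTRY_CD Char(2)
Rule: TRIM and UPPER the source value; if the result is not a valid ISO country code in the reference table, load the default 'XX' and flag the record for data quality review.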


Page 16: Data Design Process and Considerations

Data Warehouse Data Layer

Source – SOR / User Maintained Data

STG – Copy of Raw Data from Source within DW

- Transient in Nature

- Minimize Impact on Source

Base – DW Integration Layer

- Subject Oriented

- Integrated

- Single Version of the Truth

- Business Validated Source for all data in DW

• Package – Data Marts, Custom Views

– Built for BI Performance – SLAs

– Designed to Reduce System Load


Page 17: Data Design Process and Considerations

Data Design Principles

• Model-driven design
– Generating DB executables from the CASE tool
– The CASE tool is where the data design is created, maintained, and documented
– Source code (DDL) must be generated from the CASE tool
– Data model and DDL versions are controlled artifacts

• Build with performance in mind
– Careful PK selection
– Value compression
– Minimize row size as much as possible
› "Vertically partition" where it makes sense, i.e. split one large table into multiple tables having the same PI and partitioning

• Extend existing logical data models
– Don't create new models from scratch: extend/modify existing models
– Start with the logical data model and make changes there, not in the physical model – allow the tool to derive the physical names


Page 18: Data Design Process and Considerations

Industry Logical Data Models

• Teradata's iLDM – covers Manufacturing, Finance, Banking, Retail, etc.

• Oracle

• ARDM – Applied Resource Data Management – Health Care

• Universal Data Model – Len Silverston

• IBM’s Advanced Data Model


Page 19: Data Design Process and Considerations

Model Management


• Integrated model library environment, enabling cross-model analysis and reporting

• Semi-automated version management, enabling a single data model to have multiple versions stored together, with the exact deltas recorded and maintained by the tool, rather than model snapshots that exist as separate files stored in manually maintained folder structures

– reduces the number of copies of models and prevents duplication of metadata

• Underlying relational repository storing all data model metadata together, enabling not only repository-wide, cross-model reporting but also metadata extraction for integration with (for example) source-to-target mapping metadata and BI layer metadata

• Integrated model platform can also enable (in a future phase)

– automated monitoring, scorecarding, and other quality checks on a broader scale

– streamlining and decentralizing some model administration functions while maintaining necessary coordination and quality controls

• Collateral benefit: reducing storage and I/O demand on Pub Shares by transferring data models to a different platform

Page 20: Data Design Process and Considerations

Basic Data Modeling Standards (1)

• Standard audit columns (a.k.a. “plumbing columns”) on all tables

› E.g. DW_SRC_SITE_ID, DW_LD_GRP_VAL, DW_INS_UPD_DTS

• Data naming of entities and attributes
– Meaningful and specific business names, sufficiently qualified
› E.g. "Customer Last Name", not simply "Last Name"
› Entity names should be unique across the enterprise and the DW, e.g. "Sales Organization", not simply "Organization"
– Auto-abbreviation of table and column names by the data modeling tool
› Based on the logical, unabbreviated name
› Using a standard abbreviations list
– Attributes must end with a class word, e.g. ID, CD, DT, AMT, etc.
› Payment ID, Payment CD, or Payment AMT, not simply Payment


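Pulling these standards together, a hedged DDL sketch (the table and the non-audit column types are assumptions; Teradata-style syntax assumed, since DDL is generated from the modeling tool):

CREATE TABLE BASE.SLS_ORD
( SO_NBR         VARCHAR(20)  NOT NULL,   -- standard domain; class word NBR
  BU_ID          INTEGER      NOT NULL,   -- class word ID
  ORD_TYP_CD     CHAR(2)      NOT NULL,   -- class word CD
  ORD_DT         DATE         NOT NULL,   -- class word DT
  ORD_USD_AMT    DECIMAL(18,3),           -- class word AMT; currency qualified in the name
  DW_SRC_SITE_ID INTEGER      NOT NULL,   -- standard audit ("plumbing") columns
  DW_LD_GRP_VAL  VARCHAR(30)  NOT NULL,
  DW_INS_UPD_DTS TIMESTAMP(0) NOT NULL
)
UNIQUE PRIMARY INDEX (SO_NBR);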

Page 21: Data Design Process and Considerations

Basic Data Modeling Standards (2)

• Use standard domains (pre-defined in the modeling tool) for common attributes
› E.g. SO_NBR varchar(20), BU_ID integer, ITM_NBR varchar(30)

• Business definitions for all entities and attributes
– Non-circular, e.g. not "Order Type Code is a code representing types of orders"
– Make a good faith effort. Involve the BSA, business and source SMEs; give them an XLS to edit
› Include sample/common values based on data profiling
› If no one can tell you what it means and how it's used, then push back: "Then it is useless for reporting or analysis, and so should be omitted from the DW"
› If the PM and BSA still insist it must come into the DW, then document its definition: (1) USE THIS DEFINITION: "No definition available from source data stewards or end-users. Do not use this attribute. Its meaning and quality are unknown." And (2) PROVIDE SAMPLE VALUES in the definition, preferably values that have been used recently and frequently.


Page 22: Data Design Process and Considerations

Data Modeling Best Practices

• Always begin with a normalized model
– Normalization requires understanding the data
– A normalized data model is inherently optimized for:
› No data redundancy
› High data integrity
› Wide variation of access paths
› Efficient data maintenance

• Normalized means every non-PK attribute in the entity is wholly dependent on the entire PK

• Define the unique natural/business key of each table
– If the PK is simply a Surrogate Key (SK) adopted from the source, wherever possible identify the Natural Key, i.e. what makes the record unique in business terms (which drives the source system to generate a new SK value)
– If the NK is different from the PK it can be reflected in the model as an Alternate Key (unique index)
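A minimal sketch of that PK/AK pattern (hypothetical entity; Teradata-style syntax assumed):

CREATE TABLE BASE.CUST
( CUST_SK  BIGINT      NOT NULL,   -- surrogate key adopted as the PK
  CUST_NBR VARCHAR(20) NOT NULL    -- natural/business key
)
UNIQUE PRIMARY INDEX (CUST_SK)     -- the PK
UNIQUE INDEX (CUST_NBR);           -- the NK reflected as an Alternate Key (unique index)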

Page 23: Data Design Process and Considerations

Data Modeling Best Practices

• Name new attributes consistently with the rest of the DW

• Include the parent/reference entity in the model and create FK relationships to it
– Ensures data type compatibility
– Inherits name and definitions

• Relationships should be explicitly defined in the data model for two reasons:
– Referential integrity
– Depicting join keys

• When including a reference entity from another data model (e.g. Sales Order in the Manufacturing model), color it gray and do not generate the table


Page 24: Data Design Process and Considerations

Data Modeling Best Practices

• Definitions
– Natural Key (NK) = the column(s) used as the immutable unique identifier meaningful to business users, i.e. the "business key". The business key is usually the PK used within and provided by the source system of record (SOR).
– Surrogate Key (SK) = the immutable, non-intelligible, unique numeric identifier (system-generated during ETL) corresponding to a globally unique NK

• Surrogate Keys
– Useful to generate in the DW where:
› Type 2 SCD is needed to support reproducible "as was" point-in-time reporting
› Keys from disparate sources with colliding key values must be integrated
– Come with the overhead of lookups in the load and extra joins in retrieval


Page 25: Data Design Process and Considerations

Data Modeling Best Practices

• "Natural keys" or business keys
– Facilitate natural joins of disparate data across the DW not consciously designed for integration
– Should be used only where the DW team has assurance from Enterprise Architecture and the business segment IT that the business key is global and largely immune to source system changes or additions

• When DW-generated SKs can be avoided:
– No requirement for Type 2 SCD or key integration
– Business keys fulfill the basic requirements of a good PK: unchanging, unique
– Avoiding the overhead of SK lookup
– Eliminating the need for table joins simply to obtain the business key for a FK


Page 26: Data Design Process and Considerations

Data Modeling Best Practices

• Data that requires versioned history should reside in separate tables from data that does not, even if the identity and granularity of the data is the same
– Example: Foobar Header (non-CDC, record updated in place) and Foobar Status History (CDC, changes are inserted as new records)
– For the CDC table a DATE attribute (such as status date or update date) can be added to the Natural Key
– Note how the status and status date were also included in the non-CDC table, where they will always be updated with the latest values, making a join to the Status History table unnecessary when the only thing required is the current status
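A sketch of that split using the Foobar example (column names and types are assumptions; Teradata-style syntax assumed):

CREATE TABLE BASE.FOOBAR_HDR              -- non-CDC: one row per Foobar, updated in place
( FOOBAR_NBR  VARCHAR(20) NOT NULL,       -- natural key
  CURR_STS_CD CHAR(2),                    -- current status, always updated with the latest value
  CURR_STS_DT DATE
)
UNIQUE PRIMARY INDEX (FOOBAR_NBR);

CREATE TABLE BASE.FOOBAR_STS_HIST         -- CDC: status changes inserted as new rows
( FOOBAR_NBR  VARCHAR(20) NOT NULL,
  STS_DT      DATE        NOT NULL,       -- the DATE attribute added to the natural key
  STS_CD      CHAR(2)     NOT NULL
)
UNIQUE PRIMARY INDEX (FOOBAR_NBR, STS_DT);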


Page 27: Data Design Process and Considerations

Data Modeling Best Practices

• If a parent/reference entity doesn't exist but it would help portray join paths and enable data type propagation, then create one but do not generate the table


Page 28: Data Design Process and Considerations

Data Modeling Best Practices

• Unintelligible ID and Code attributes
– Need a reference entity in the data model to decode that ID/Code value
– In some cases it is acceptable to simply document the code descriptions in the attribute definition, if the set of valid values is small and stable
– If the value cannot be decoded at all:
› The field is of no value in reporting or analysis
› We should exclude it from the DW in transactional data

• Every entity should have relationships
– Every entity in a data model should have an explicit relationship to at least one other
– If not, then how can the data be used? Is it really an island of data that has nothing to do with anything else in the company?



Page 29: Data Design Process and Considerations

Data Modeling Best Practices

• Get input from anyone you can
– Ad hoc, informal reviews with other data designers, SMEs, etc.
– Anyone can understand a data model and can contribute a valuable insight

• Aim to maintain readability
– Relationships not routed under entities
– Entities not piled on top of one another
– Minimize relationship lines on top of one another
– Create custom submodels (ERwin subject areas) to portray particular sets of entities


Page 30: Data Design Process and Considerations

Design with Performance in Mind

• Build performance-enhancing features into the physical data model from the start
– Avoid the common mistake of "Build First, Tune Later"
– Any design that meets the functional and business requirements but lacks in performance will always need to be revisited, creating rework, delays, and user dissatisfaction

• Considerations
– Load: consider "vertical partitioning", i.e. splitting a table into 2 or more with the same PK, if …
› There are two separate sources for an entity, for different attributes
› Some data for the entity is volatile (e.g. status) while most of the data remains static, or some attributes should have CDC history but others do not
– Retrieval: ditto if …
› Will data from two particular tables nearly always be retrieved together?
› Will 80% of the queries on a 100-column table access only 20% of the columns? And vice versa?



Page 31: Data Design Process and Considerations

Indexing

• Index liberally from the beginning, then monitor usage and pare down
– When in doubt, add additional indexes, then monitor how much each is actually used by the optimizer
– If an index is rarely used in normal production access then it is not worth the operational overhead of maintaining it, so drop the index
– If it is a multi-column index and not all columns are used, drop and re-create the index with just the columns needed

• In general, add extra indexes to:
– Foreign Keys
– Other join columns
– Frequently used filter columns, e.g. certain date fields
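For example (hypothetical index; Teradata-style secondary index syntax assumed):

-- Add a secondary index on a frequently used filter column up front …
CREATE INDEX ORD_DT_IX (ORD_DT) ON BASE.SLS_ORD;

-- … then monitor how often the optimizer actually uses it (DBMS-specific,
-- e.g. via query logging) and drop it if it is rarely used:
DROP INDEX ORD_DT_IX ON BASE.SLS_ORD;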


Page 32: Data Design Process and Considerations

Right-sizing columns

• Make columns the right size: large enough to accommodate all anticipated values – and no larger

• Our principal guide is the size in the source system:
– Field size and max length of actual values
– Be aware that the source system may have columns that are much larger than necessary
› This often occurs when a company implements a purchased application that has been designed to accommodate the needs of many other purchasers of the application
› Example: An EDW receives Customer Hub/Master data from a COTS application called Initiate. There, Country Code is defined as Varchar(60), but in the EDW this field only ever contains ISO 2-character country codes. So we don't need to make that column in the Base table Varchar(60) when all it ever contains is Char(2).


Page 33: Data Design Process and Considerations

Right-sizing columns (cont’d)

• Corollary: "Know thy data", i.e.
– In the previous example we discovered that the actual EDW data requires only 3% of the size. (Not only that, it can be a CHAR with value compression.)
– Other examples:
› A Varchar(30) code column. When we look at the data we see that it always contains one of three values, none of which is longer than 9 bytes. … And in EDW we can make it char(20) compress ('X','Y','Z')
› A Varchar(20) column that contains two values, e.g. FOREIGN, DOMESTIC. Can this be converted to a Foreign Flag char(1)?
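Sketches of those two conversions (hypothetical table and columns; Teradata-style COMPRESS assumed):

CREATE TABLE BASE.ITM_EX
( ITM_NBR    VARCHAR(30) NOT NULL,
  ITM_TYP_CD CHAR(20) COMPRESS ('X','Y','Z'),  -- the Varchar(30) code column with three short values
  FRGN_FLG   CHAR(1)                           -- FOREIGN/DOMESTIC converted to a one-byte 'Y'/'N' flag
)
UNIQUE PRIMARY INDEX (ITM_NBR);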


Page 34: Data Design Process and Considerations

Right-sizing columns (cont’d)

• Why bother with this?
– Shorter rows => more rows per block => more efficient I/O

– Varchars don’t always save space, and have performance impacts

• What about the risks? Longer data may be added to the source system later!
– Talk with source system and business SMEs

– Show them your data profiling results

– “Split the difference”

› A Varchar(600) column in the source but the max actual length of current values = 60. We can double it and round up, to something like 150

– Key: perform “due diligence”


Page 35: Data Design Process and Considerations

Data Type / Domain Practices: Amounts

• In general use the AMOUNT standard domain, DECIMAL(18,3)
– This should suffice for most uses. If the business requires greater scale, it is acceptable to specify (18,4) or (18,5)
› Avoid making it larger than Decimal(18) (unless storing the global GDP or the US federal debt)
– In many DBMSs, Decimal(18) requires 8 bytes, while Decimal(19) and larger takes 16 bytes

• Ensure the currency of the amount is clearly indicated
– If the column will contain amounts of varying currencies then a separate column is needed to specify the currency code for the amount
– If the amount is always in USD or EUR, then include the currency code in the column name, e.g. TXN_USD_AMT
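Column sketches for the two cases (hypothetical table and names):

CREATE TABLE BASE.TXN_EX
( TXN_ID       INTEGER NOT NULL,
  TXN_AMT      DECIMAL(18,3),   -- varying currencies …
  TXN_CRNCY_CD CHAR(3),         -- … so a separate currency code column is required
  TXN_USD_AMT  DECIMAL(18,3)    -- always USD: currency carried in the column name
)
UNIQUE PRIMARY INDEX (TXN_ID);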


Page 36: Data Design Process and Considerations

Data Type / Domain Practices: Dates and Timestamps

When the source provides a Timestamp column (a single field containing both Date and Time) …

• Verify that the column truly contains a real, non-zero Time value. If it contains only a date value then make the target a Date-only column, not a Timestamp.
– Date columns are more efficient than Timestamp columns. The DATE data type requires only 4 bytes, but TIMESTAMP requires 10.

• If …
– the Timestamp column contains a real, non-zero Time value, and
– the source does not provide a parallel Date-only column, and
– it is likely that BI consumption will be concerned mainly with the Date only, regardless of the Time component …

• Then create two target columns: a TIMESTAMP with the full date and time, and a DATE column with the date only.
– This provides more options for queries and joins and reduces the need to use date functions in queries.
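A sketch of the two-column approach (hypothetical table and names; the DATE column is derived from the TIMESTAMP during load):

CREATE TABLE BASE.ORD_EX
( ORD_NBR VARCHAR(20)  NOT NULL,
  ORD_TS  TIMESTAMP(0) NOT NULL,   -- full date and time from the source
  ORD_DT  DATE         NOT NULL    -- date only, e.g. CAST(ORD_TS AS DATE), for date-grain queries and joins
)
UNIQUE PRIMARY INDEX (ORD_NBR);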



Page 37: Data Design Process and Considerations

Data Type / Domain Practices: Quantities

• If a quantitative numeric column contains data of varying units of measure then a separate column is required identifying the unit of measure for the quantity

• If the quantity is always in a single unit of measure then include the unit of measure in the column name, e.g. HARD_DRIVE_CAPACITY_GB_QTY


Page 38: Data Design Process and Considerations

Uniqueness

• Where possible, have a unique index – PK or AK – defined on the table

• Traditionally we have refrained from enforcing uniqueness in the DW using DB constraints, but …
– The cost-based query optimizer loves unique indexes because they enable it to create more efficient query plans

Page 39: Data Design Process and Considerations

Partitioning

• Generally refers to "horizontal partitioning"
– Easy way to remember: think of a table as a spreadsheet that you are going to cleave down the middle … will you split it vertically or horizontally?

• The main idea: the system and the user access the data in the table more efficiently most of the time if we create logical sections
– If you have a table with a lot of rows and you know that the data is quite frequently accessed filtering on one or more particular columns, then consider range-partitioning on them, particularly dates
– Partition elimination is one of the most powerful query performance boosters, occurring when the partition key is used in the WHERE clause of the SELECT statement
– Note that partitioning incurs some overhead on load, and also in queries when the partition key is not part of the WHERE clause
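A hedged sketch of date range-partitioning (hypothetical table; Teradata-style RANGE_N syntax assumed):

CREATE TABLE BASE.SLS_ORD_LN
( SO_NBR VARCHAR(20) NOT NULL,
  LN_NBR INTEGER     NOT NULL,
  ORD_DT DATE        NOT NULL
)
PRIMARY INDEX (SO_NBR)
PARTITION BY RANGE_N (ORD_DT BETWEEN DATE '2010-01-01' AND DATE '2015-12-31'
                      EACH INTERVAL '1' MONTH);

-- Queries that filter on ORD_DT in the WHERE clause benefit from partition elimination.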


Page 40: Data Design Process and Considerations

Columnar Suitability

– Does the system have spare CPU resources?
– Is using Insert/Select to load tables possible within the environment?
– Columnar basics review: primarily for read-optimized environments/applications
– If so, consider columnar tables

Page 41: Data Design Process and Considerations

Not Null

• Wherever possible define non-quantitative columns as NOT NULL, especially FKs
– This may seem counter-intuitive because of the overhead of constraint enforcement by the DBMS
– But the query optimizer built into the DBMS creates more efficient query plans based on this knowledge

• Define quantitative columns – amounts, quantities, counts, etc. – as NOT NULL only if the source data is also not null

• Where necessary designate default values
– Specified by the business, or …
– 'N/A' or 0 for a column where a value is not expected in 100% of source records, or …
– Unknown or -1 for a column where a valid value is expected but was not received
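Column sketches for the default-value rules above (hypothetical table and names):

CREATE TABLE BASE.SHIP_EX
( SHIP_NBR   VARCHAR(20) NOT NULL,
  ORD_STS_CD CHAR(3)     NOT NULL DEFAULT 'N/A',  -- value not expected in 100% of source records
  CUST_ID    INTEGER     NOT NULL DEFAULT -1,     -- valid value expected but not received
  SHIP_QTY   DECIMAL(18,3)                        -- quantitative: NOT NULL only if the source is not null
)
UNIQUE PRIMARY INDEX (SHIP_NBR);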


Page 42: Data Design Process and Considerations

Metadata Search


Page 43: Data Design Process and Considerations

ILM

• ILM = Information Lifecycle Management …
– All information undergoes a "life" sequence: (1) it is created, (2) it is used/useful and meaningful for a definable period of time, and (3) at some point it is no longer useful and should be destroyed

• Archive data to cheaper storage; purge old data

• Benefits:
– Mitigates the capital investment required to support growth of the data warehouse
– Reduced table sizes boost performance on the EDW

• Two aspects actively managed:
– Policy
– Implementation


Page 44: Data Design Process and Considerations

Questions


Page 45: Data Design Process and Considerations

Information Excellence informationexcellence.wordpress.com

Community Focused

Volunteer Driven

Knowledge Share

Accelerated Learning

Collective Excellence

Distilled Knowledge

Shared, Non Conflicting Goals

Validation / Brainstorm platform

Mentor, Guide, Coach

Satisfied, Empowered Professional

Richer Industry and Academia

About Information Excellence Group

Progress Information Excellence

Towards an Enriched Profession, Business and Society

Page 46: Data Design Process and Considerations

Information Excellence informationexcellence.wordpress.com

Host Us

Two Hour Monthly Session

Half Day deep dive Session

Full Day Summit Session

Speaker Support

Recommend Speakers

Suggest Topics

You Can Help this Community Grow

Progress Information Excellence

Towards an Enriched Profession, Business and Society

All Our Sessions are Free for participants; All Support and Sponsorship in Non Cash mode

Something to Feel Genuinely Happy and Proud About

Thank you for Hosting us Today

Page 47: Data Design Process and Considerations

Information Excellence informationexcellence.wordpress.com

About Information Excellence Group

Reach us at:

blog: http://informationexcellence.wordpress.com/

linked in: http://www.linkedin.com/groups/Information-Excellence-3893869

facebook: http://www.facebook.com/pages/Information-excellence-group/171892096247159

presentations: http://www.slideshare.net/informationexcellence

twitter: #infoexcel

email: [redacted]@gmail.com

Have you enriched yourself by contributing to the community Knowledge Share?