dwh intrw qutns

7/22/2019 dwh intrw qutns


What is a Data warehouse?

A: A data warehouse is a repository of integrated information, available for queries and analysis.

Data and information are extracted from heterogeneous sources and stored in a database so that queries and reports can be run easily and efficiently.

A data warehouse is a logical collection of information gathered from many different operational databases and used to create business intelligence that supports business analysis and decision-making. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP).

What are the characteristics of a Data Warehouse?

Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files.

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. That is, if two different source systems store conflicting data about entities, or attributes of an entity, the differences need to be resolved during the process of transforming the source data and loading it into the data warehouse.

Non-Volatile: Stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

Time-Variant: Contains a history of the subject, as well as current information. Historical information is an important component of a data warehouse.

Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users.

Process-Oriented: It is important to view data warehousing as a process for delivery of information. The maintenance of a data warehouse is ongoing and iterative in nature.

What are the advantages of a Data Warehouse?

Enhances end-user access to a wide variety of data.

Business decision makers can obtain various kinds of trend reports.

Increased data consistency.

Potentially lower computing costs and increased productivity.

Providing a place to combine related data from separate sources.

Creation of a computing infrastructure that can support changes in computer systems and business structures.

Empowering end-users to perform any level of ad-hoc queries or reports without impacting the performance of the operational systems.

Data Warehouse Approaches

Posted by Santosh Kumar Gidadmani on January 9, 2011

There are two major approaches to data warehouse design.

1. Bottom-up approach

This approach is recommended by Kimball.

In the bottom-up approach data marts are first created to provide reporting and analytical

capabilities for specific business processes.

Data marts contain, primarily, dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. A single data mart often models a specific business area such as Sales or Production.

These data marts can eventually be integrated to create a comprehensive data warehouse.

The integration of the data marts in the data warehouse is centered on the conformed

dimensions.

The actual integration of two or more data marts is then done by a process known as "drill across". A drill-across works by grouping (summarizing) the data along the keys of the (shared) conformed dimensions of each fact participating in the drill-across, followed by a join on the keys of these grouped (summarized) facts.
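As a rough illustration of the drill-across described above, the following Python sketch summarizes two fact tables along the key of a shared conformed dimension and then joins the summaries on that key. The tables, keys and values (sales_facts, inventory_facts, product_key) are invented for the example and are not from the original post.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_key, measure). product_key is the key of
# a conformed Product dimension shared by both data marts.
sales_facts = [(1, 100.0), (1, 50.0), (2, 75.0)]
inventory_facts = [(1, 20), (2, 5), (2, 7), (3, 9)]


def summarize(fact_rows):
    """Group fact rows by the conformed dimension key and sum the measure."""
    totals = defaultdict(float)
    for key, value in fact_rows:
        totals[key] += value
    return dict(totals)


def drill_across(left_rows, right_rows):
    """Summarize each fact table first, then join the summaries on the key."""
    left = summarize(left_rows)
    right = summarize(right_rows)
    # Inner join: keep only keys present in both summarized facts.
    return {key: (left[key], right[key]) for key in left if key in right}


result = drill_across(sales_facts, inventory_facts)
```

Summarizing before joining matters: joining the raw fact rows first would multiply rows that share a key and inflate the totals.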

Some consider it an advantage of the Kimball method that the data warehouse ends up being "segmented" into a number of logically self-contained and consistent data marts, rather than a big and often complex centralized model.

Business value can be returned as soon as the first data mart is built.

2. Top-down approach

This approach is recommended by Bill Inmon.

Inmon is one of the leading proponents of the top-down approach to data warehouse design,

in which the data warehouse is designed using a normalized enterprise data model.

In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

The top-down design methodology generates highly consistent dimensional views of data

across data marts since all data marts are loaded from the centralized repository.

Generating new dimensional data marts against the data stored in the data warehouse is a

relatively simple task.

The main disadvantage of the top-down methodology is that it represents a very large project, with a very broad scope and a high cost in money and time.

In addition, the top-down methodology can be inflexible and unresponsive to changing

departmental needs during the implementation phases.

Normalization

This is a technique used in data modeling that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called the First Normal Form, Second Normal Form, and Third Normal Form, to achieve a normalized data model.

A normalized data model may result in many tables/entities having multiple levels of relationships, for example table1 related to table2, table2 further related to table3, table3 related to table4, and so on.

First Normal Form (1NF): The attributes of the entity must be atomic and must depend on the key.

Second Normal Form (2NF): Every non-key attribute must depend on the whole key, not just on part of it.

Third Normal Form (3NF): Every non-key attribute must depend on nothing but the key.



In theory there are further rules, called the Boyce-Codd Normal Form, Fourth Normal Form and Fifth Normal Form. In practice we don't use the rules beyond 3NF.

OLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
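To make the "slice equals WHERE clause" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and its rows are assumptions made up for the illustration. Real ROLAP tools generate far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", 2010, "Widget", 100.0),
        ("East", 2011, "Widget", 120.0),
        ("West", 2010, "Widget", 80.0),
        ("West", 2010, "Gadget", 40.0),
    ],
)


def slice_total(filters):
    """Sum the amount for the slice selected by `filters` (column -> value)."""
    allowed = {"region", "year", "product"}  # guard against arbitrary column names
    clauses, params = [], []
    for column, value in filters.items():
        if column not in allowed:
            raise ValueError(f"unknown dimension: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    # Each slice or dice action contributes one condition to the WHERE clause.
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    row = conn.execute("SELECT SUM(amount) FROM sales" + where, params).fetchone()
    return row[0]


slice_2010 = slice_total({"year": 2010})                         # slice on one dimension
dice_west_2010 = slice_total({"year": 2010, "region": "West"})   # dice on two dimensions
```

With no filters the function returns the grand total, which corresponds to looking at the whole cube.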

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Q: What is data mining?

A: Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area.

Q: What are Data Marts?

A: Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

There are two types of data marts:

1. Independent data marts are sourced from data captured from OLTP systems, external providers, or data generated locally within a particular department or geographic area.

2. Dependent data marts are sourced directly from enterprise data warehouses.

Q: What is OLTP?

A: Online Transaction Processing.

Q: What is OLAP?

A: Online Analytical Processing.

Q: What are the differences between OLTP and OLAP?

A: The main differences between OLTP and OLAP are:

1. User and System Orientation

OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.

OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

2. Data Contents

OLTP: manages current data, very detail-oriented.

OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

3. Database Design

OLTP: adopts an entity-relationship (ER) model and an application-oriented database design.

OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.

4. View

OLTP: focuses on the current data within an enterprise or department.

OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

Q: What is real time data-warehousing?

A:Real-time data warehousing captures business activity data as it occurs. As soon as the

business activity is complete and there is data about it, the completed activity data flows into

the data warehouse and becomes available instantly. In other words, real-time data

warehousing is a framework for deriving information from data as the data becomes

available.

Q: What are the steps to build a data warehouse?

A:

Gathering business requirements.

Identifying Sources

Identifying Facts

Defining Dimensions

Define Attributes

Redefine Dimensions & Attributes

Organize Attribute Hierarchy & Define Relationship

Assign Unique Identifiers

Additional conventions: Cardinality / Adding ratios

Q: What is a CUBE in data warehousing concept?

A: Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members and the body of the cube contains data values.

Q: What is a linked cube?

A: A linked cube is a cube in which a subset of the data can be analysed in greater detail.

The linking ensures that the data in the cubes remain consistent.

Q: What is the main difference between Inmon and Kimball philosophies of data

warehousing?

A: The two differ in their approach to building the data warehouse.

Kimball views the data warehouse as a collection of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is formed by the conformed dimensions shared among the data marts. Hence a unified view of the enterprise can be obtained from the dimensional modeling done at a local departmental level.

Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from, for example, the online store. Other subject areas can be added to the data warehouse as the need arises.

i.e.,

Kimball: first the data marts, then combined into the data warehouse.

Inmon: first the data warehouse, later the data marts.

Q: What is Hierarchy in data warehouse terms?

A: Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies

to enable you to drill down into your data to view different levels of granularity. This is one

of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures, for example a divisional, multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a

value at the next higher level is its parent, and values at the next lower level are its children.

These familial relationships enable analysts to access data quickly.
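The month to quarter to year aggregation mentioned above can be sketched in a few lines of Python. The sales figures and the function names (quarter_of, roll_up_to_quarter, roll_up_to_year) are assumptions chosen for this illustration only.

```python
from collections import defaultdict

# Hypothetical monthly sales figures keyed by (year, month).
monthly_sales = {
    (2010, 1): 10, (2010, 2): 20, (2010, 3): 30,
    (2010, 4): 40, (2011, 1): 5,
}


def quarter_of(month):
    """Parent of a month in the time hierarchy: months 1-3 map to Q1, and so on."""
    return (month - 1) // 3 + 1


def roll_up_to_quarter(monthly):
    """Aggregate month-level values into their parent quarters."""
    totals = defaultdict(int)
    for (year, month), value in monthly.items():
        totals[(year, quarter_of(month))] += value
    return dict(totals)


def roll_up_to_year(quarterly):
    """Aggregate quarter-level values into their parent years."""
    totals = defaultdict(int)
    for (year, _quarter), value in quarterly.items():
        totals[year] += value
    return dict(totals)


quarterly_sales = roll_up_to_quarter(monthly_sales)
yearly_sales = roll_up_to_year(quarterly_sales)
```

Drilling down is the reverse navigation: starting from a year and looking up the quarters, then the months, that are its children.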

Q: What are the differences between an RDBMS schema and a data warehouse schema?

A:

RDBMS Schema

* Used for OLTP systems

* Highly normalized

* Difficult to understand and navigate

* Difficult to extract data and solve complex problems

DWH Schema

* Used for OLAP systems

* De-normalized

* Easy to understand and navigate

* Relatively easier to extract data and solve complex problems

Q:What is meant by metadata in the context of a Data warehouse?

A: Metadata is data about data. A business analyst or data modeler usually captures information about the data in a data dictionary, a.k.a. metadata: the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived, and its life cycle). Metadata is also presented at the data mart level: subsets, facts and dimensions, ODS, etc. For a DW user, metadata provides vital information for analysis / DSS.



Q. How is an OLAP system different from an OLTP system?

A. The main differences between OLTP and OLAP are:

1. OLTP System: Customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.

   OLAP System: Market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

2. OLTP System: Manages current data, very detail-oriented.

   OLAP System: Manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

3. OLTP System: Adopts an entity-relationship (ER) model and an application-oriented database design.

   OLAP System: Adopts a star, snowflake or fact constellation model and a subject-oriented database design.

4. OLTP System: Focuses on the current data within an enterprise or department.

   OLAP System: Spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

5. OLTP System: Large volumes of simple transactional queries.

   OLAP System: Small number of diverse queries.

6. OLTP System: Data changes are continuous and the data is very volatile.

   OLAP System: Data is updated periodically.

7. OLTP System: Data processing time is low.

   OLAP System: Data processing time is high.

8. OLTP System: Highly normalized data.

   OLAP System: Fewer tables, but more columns per table, thus reducing the degree of normalization.

DIFFERENCE BETWEEN STAR AND SNOWFLAKE SCHEMA

Answers:



Data Modeling Tools

Data modeling is the heart and soul of any development project. Be it a simple application development or a data warehouse project, data modeling is a core step in the development life cycle. There are a variety of tools available; this article gives a brief overview of a couple of these tools.

CA ErWin Data Modeler

fabFORCE.net DBDesigner

What is BUS Schema?

A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

In a BUS schema we would eventually have conformed dimensions and facts defined to be shared across all enterprise data marts. This way all data marts can use the conformed dimensions and facts without having them locally. This is the first step towards building an enterprise data warehouse from Kimball's perspective. For example, we may have different data marts for Sales, Inventory and Marketing, and we need common entities like Customer, Product, etc. to be seen across these data marts; hence it would be ideal to have these as conformed objects. The challenge here is that sometimes each line of business may have different definitions for these conformed objects, and hence the conformed objects have to be designed with some extra care.

What is the data type of the surrogate key?

There is no dedicated data type for a surrogate key. In practice it is an integer, numeric or number column.

Requirement of a surrogate key: UNIQUE

Recommended data type of a surrogate key: NUMERIC

What is a surrogate key? Where do we use it? Explain with an example.

A surrogate key is a substitute for the natural primary key.

It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica (Infa) sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.

It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users). Not only can these change, but indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system; as far as the client is concerned, you may display only the AIRPORT_NAME.

2. Adapted from response by Vincent on Thursday, March 13, 2003

Another benefit you can get from surrogate keys (SID) is tracking the SCD (Slowly Changing Dimension).

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June, Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should belong to Business Unit 'BU1'.

If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.

If you use surrogate keys, you could create on the 2nd of June a new record for Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before the 2nd of June) with the SID of Employee 'E1' + 'BU1'. All new data (after the 2nd of June) would take the SID of Employee 'E1' + 'BU2'.

You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the Employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit: 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from the natural key enlargement process is that you might not have all parts of your new key within your fact table, so you might not be able to do the join on the new enlarged key, and so you need another id.

A surrogate key is a system-generated sequential number which acts as a primary key.
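The employee example can be replayed in a small Python sketch. It is only one possible way to model the idea: the valid_from and valid_to columns, the in-memory lists and the turnover amounts are assumptions added for the illustration, and a counter stands in for a database sequence.

```python
from datetime import date
from itertools import count

next_sid = count(1)          # stand-in for a database sequence
employee_dim = []            # rows: dicts with sid, code, unit, valid_from, valid_to
fact_turnover = []           # rows: (employee_sid, amount)


def add_version(code, unit, valid_from):
    """Close the current dimension row for `code` (if any) and open a new one."""
    for row in employee_dim:
        if row["code"] == code and row["valid_to"] is None:
            row["valid_to"] = valid_from
    row = {"sid": next(next_sid), "code": code, "unit": unit,
           "valid_from": valid_from, "valid_to": None}
    employee_dim.append(row)
    return row["sid"]


def sid_on(code, day):
    """Look up the surrogate key that was valid for `code` on `day`."""
    for row in employee_dim:
        if (row["code"] == code and row["valid_from"] <= day
                and (row["valid_to"] is None or day < row["valid_to"])):
            return row["sid"]
    raise LookupError(code)


def record_turnover(code, day, amount):
    """Store the fact against the surrogate key, not the natural key."""
    fact_turnover.append((sid_on(code, day), amount))


add_version("E1", "BU1", date(2002, 1, 1))
record_turnover("E1", date(2002, 3, 15), 1000)
add_version("E1", "BU2", date(2002, 6, 2))      # E1 moves to BU2: new surrogate key
record_turnover("E1", date(2002, 7, 1), 400)

# Turnover per business unit, joining facts to the dimension on the surrogate key.
unit_by_sid = {row["sid"]: row["unit"] for row in employee_dim}
turnover_by_unit = {}
for sid, amount in fact_turnover:
    unit = unit_by_sid[sid]
    turnover_by_unit[unit] = turnover_by_unit.get(unit, 0) + amount
```

Because each fact row carries the surrogate key that was current when it was recorded, the old turnover stays with 'BU1' and the new turnover goes to 'BU2'.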

What is a lookup table?

A lookup table is one that is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it updates the table by allowing only new records or updated records, based on the lookup condition.

What is Dimensional Modelling? Why is it important ?

Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.
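A minimal sketch of the fact/dimension split, using Python's built-in sqlite3 module. All table names, column names and rows (dim_product, dim_date, fact_sales) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold the context of the measurements.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds the measurement plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20100101, 2010, 1), (20100201, 2010, 2);
INSERT INTO fact_sales VALUES (1, 20100101, 100.0), (1, 20100201, 50.0), (2, 20100201, 25.0);
""")

# A typical dimensional query: aggregate the fact, grouped by a dimension attribute.
sales_by_category = dict(conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
"""))
```

The same fact table can be grouped by any attribute of any dimension it references, for example by year or month through dim_date.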

Why is Data Modeling Important?

Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that all the data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.

The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term.

Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

What is data mining?

Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area.

What is ETL?

ETL stands for extraction, transformation and loading.

ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.

Extraction: Take data from an external source and move it to the warehouse pre-processor database.

Transformation: The transform data task allows point-to-point generating, modifying and transforming of data.

Loading: The load data task adds records to a database table in a warehouse.
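The three steps can be sketched end to end in plain Python. This is a toy pipeline under stated assumptions: the CSV text, the dim_customer table and the cleaning rules are invented for the illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a source database.
SOURCE_CSV = "customer_id,name,country\n1, alice ,us\n2,Bob,DE\n"


def extract(text):
    """Extraction: read raw rows from the external source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: clean and standardize values before loading."""
    return [
        (int(r["customer_id"]), r["name"].strip().title(), r["country"].strip().upper())
        for r in rows
    ]


def load(conn, rows):
    """Loading: add the transformed records to a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
```

Keeping the three steps as separate functions mirrors how ETL tools separate source definitions, mappings and targets.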

What does level of Granularity of a fact table signify?

Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:

Determine which dimensions will be included.

Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually go back to the requirements.

What is SCD1 , SCD2 , SCD3?

SCD stands for Slowly Changing Dimension.

SCD1: only the updated values are maintained.

Ex: when a customer's address is modified, we update the existing record with the new address.

SCD2: historical and current information are both maintained, by using

A) Effective dates

B) Versions

C) Flags

or a combination of these.

SCD3: historical and current information are maintained by adding new columns to the target table.
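The difference between overwriting (SCD1) and adding a column (SCD3) can be shown with a small Python sketch. The customer row, the previous_address column name and the addresses are assumptions made up for the example.

```python
import copy

# Hypothetical dimension row before the change.
original = {"customer_id": 42, "address": "12 Old Road"}


def apply_scd1(row, new_address):
    """Type 1: overwrite the value, so no history is kept."""
    updated = copy.deepcopy(row)
    updated["address"] = new_address
    return updated


def apply_scd3(row, new_address):
    """Type 3: keep the prior value in an extra column next to the current one."""
    updated = copy.deepcopy(row)
    updated["previous_address"] = row["address"]
    updated["address"] = new_address
    return updated


scd1_row = apply_scd1(original, "7 New Street")
scd3_row = apply_scd3(original, "7 New Street")
```

Type 3 as sketched here remembers only one previous value; a second change would overwrite previous_address.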

What is Fact table?

The fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.

What are conformed dimensions?

Answer 1: Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. Ex: a Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.

Answer 2: Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas containing fact and dimension tables). Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4, where F are the facts and D the dimensions. Here D1 and D2 are the conformed dimensions.

What are the Different methods of loading Dimension tables?

Conventional load: Before loading the data, all the table constraints will be checked against the data.

Direct load (faster loading): All the constraints will be disabled and the data will be loaded directly. Later the data will be checked against the table constraints, and the bad data won't be indexed.

What is conformed fact?

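A conformed fact is a measure that is defined and calculated in the same way in every fact table in which it appears, so that it can be compared and combined across data marts. In the same way, conformed dimensions are dimensions which can be used across multiple data marts in combination with multiple fact tables.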

What are Data Marts?

Data marts are designed to help managers make strategic decisions about their business. Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

There are two types of data marts:

1. Independent data marts are sourced from data captured from OLTP systems, external providers, or data generated locally within a particular department or geographic area.

2. Dependent data marts are sourced directly from enterprise data warehouses.

What is a level of Granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. The level of granularity then means how much detail you are willing to store for each transactional fact: for instance, product sales for each individual transaction, or sales aggregated up to the minute with only the aggregated data stored.

What is the purpose of Factless Fact Table?

Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don't really have facts or any measures, but are commonly used for tracking information about an event.

E.g. to find the number of leaves taken by an employee in a month.

A tracking process or the collection of a status can be performed by using factless fact tables. The fact table does not have numeric values that can be aggregated, hence the name. Only the key values that reference the dimensions, from which the status is collected, are available in factless fact tables.
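The leave example can be sketched as follows. The rows contain only keys (an employee and a date), and the only thing to compute is a count of rows. The data and the function name leaves_per_month are assumptions for the illustration.

```python
from collections import Counter

# Hypothetical factless fact rows: only dimension keys, no numeric measure.
# Each row records the event "employee was on leave on this date".
leave_facts = [
    ("E1", "2010-03-01"),
    ("E1", "2010-03-02"),
    ("E2", "2010-03-15"),
    ("E1", "2010-04-10"),
]


def leaves_per_month(rows):
    """Count events per (employee, month); the row count is the only 'measure'."""
    return Counter((employee, day[:7]) for employee, day in rows)


leave_counts = leaves_per_month(leave_facts)
```

Here day[:7] takes the year and month part of an ISO date string, which plays the role of the month level of a date dimension.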

What is a junk dimension?

In scenarios where certain data may not be appropriate to store in the schema, this data (or these attributes) can be stored in a junk dimension. The data in a junk dimension usually consists of Boolean or flag values.

E.g. whether the performance of an employee was up to the mark, or comments on performance.

A single dimension is formed by lumping together a number of small dimensions. This dimension is called a junk dimension. A junk dimension has unrelated attributes. Grouping miscellaneous flags and text attributes by moving them out of a dimension into a distinct sub-dimension is what produces a junk dimension.

What are fundamental stages of Data Warehousing?

Offline Operational Databases: This is the initial stage of data warehousing. In this stage the databases of an operational system are simply copied to an off-line server.

Offline Data Warehouse: In this stage the data warehouse is updated on a regular time cycle from the operational system, and the data is persisted in a reporting-oriented data structure.

Real-Time Data Warehouse: In this stage the data warehouse is updated on a transaction or event basis, every time an operational system performs a transaction.

Integrated Data Warehouse: In this stage the data warehouse generates activity or transactions that are passed back into the operational systems. These generated transactions are used in the daily activity of the organization.

What is Virtual Data Warehousing?

Virtual data warehousing provides an aggregate view of the complete data inventory. The metadata is used to form a logical enterprise data model, which is part of the database-of-record infrastructure contained in virtual data warehousing. The infrastructure consists of published legacy database systems with their metadata extracted. The standards JEE, JMS and EJBs are used in the infrastructure for transactional unit requests, and extract-transform-load tools are used for loading real-time bulk data.

What is active data warehousing?

Transactional data is captured and stored in the active data warehouse. This repository can be used to find trends and patterns that can inform future decision making.

An active data warehouse aims to capture data continuously and deliver real-time data. It provides a single integrated view of a customer across multiple business lines. It is associated with Business Intelligence systems.

Difference between ER Modeling and Dimensional Modeling.

Dimensional modelling is very flexible from the user's perspective. A dimensional data model is mapped directly to schemas (such as star schemas), whereas an ER model is not mapped to such schemas and is not used to convert normalized data into denormalized form.

The ER model is used for OLTP databases that use any of the 1st, 2nd or 3rd normal forms, whereas the dimensional data model is used for data warehousing and is typically denormalized.

The ER model contains normalized data, whereas the dimensional model contains denormalized data.

Describe the various methods of loading Dimension tables.

The following are the methods of loading dimension tables:

Conventional Load:

In this method all the table constraints will be checked against the data, before loading the data.

Direct Load or Faster Load:

As the name suggests, the data will be loaded directly without checking the constraints. The data checking

against the table constraints will be performed later and indexing will not be done on bad data.

Describe the foreign key columns in fact table and dimension table.

The primary keys of entity tables are the foreign keys of dimension tables.

The primary keys of dimension tables are the foreign keys of fact tables.

What is Data Cardinality?

Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation.

There are 3 basic types of cardinality:

High data cardinality:

Values in the column are very uncommon or unique.

e.g.: email ids and the user names

Normal data cardinality:

Values of a data column are somewhat uncommon but never unique.

e.g.: A data column containing LAST_NAME (there may be several entries of the same last name)

Low data cardinality:

Values in the column are very common, with only a few distinct values.

e.g.: flag statuses: 0/1
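One rough way to put numbers on these three levels is the ratio of distinct values to total values in a column. The sketch below assumes cut-off values of 0.9 and 0.1, which are arbitrary choices for illustration and not a standard.

```python
def cardinality_ratio(values):
    """Return distinct values divided by total values (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def classify_cardinality(values, high=0.9, low=0.1):
    """Label a column as high, normal or low cardinality.

    The thresholds are illustrative assumptions only.
    """
    ratio = cardinality_ratio(values)
    if ratio >= high:
        return "high"
    if ratio <= low:
        return "low"
    return "normal"


emails = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
last_names = ["Smith", "Jones", "Smith", "Brown", "Lee", "Jones"]
flags = [0, 1] * 20

labels = {
    "emails": classify_cardinality(emails),
    "last_names": classify_cardinality(last_names),
    "flags": classify_cardinality(flags),
}
```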



Determining data cardinality is a substantial aspect of data modeling. It is used to determine the relationships between entities.

Types of cardinalities:

The Link Cardinality - 0:0 relationships

The Sub-type Cardinality - 1:0 relationships

The Physical Segment Cardinality - 1:1 relationship

The Possession Cardinality - 0: M relation

The Child Cardinality - 1: M mandatory relationship

The Characteristic Cardinality - 0: M relationship

The Paradox Cardinality - 1: M relationship.

What are Critical Success Factors?

Key areas of activity in which favorable results are necessary for a company to reach its goal.

There are four basic types of CSFs which are:

Industry CSFs

Strategy CSFs

Environmental CSFs

Temporal CSFs

A few CSFs are:

Money

Your future

Customer satisfaction

Quality

Product or service development

Intellectual capital

Strategic relationships

Employee attraction and retention

Sustainability

The advantages of identifying CSFs are:

they are simple to understand;

they help focus attention on major concerns;

they are easy to communicate to coworkers;

they are easy to monitor;

and they can be used in concert with strategic planning methodologies.

SCD Type 1, Slowly Changing Dimension

Use, Example, Advantage, Disadvantage

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key   Name       State
1001           Williams   New York
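A minimal sketch of the Type 1 overwrite, assuming the dimension is held as an in-memory dictionary keyed by Customer Key and that Williams moves to Los Angeles (the function name apply_type1 is hypothetical):

```python
# Dimension rows keyed by Customer Key (illustrative in-memory stand-in for a table).
customers = {1001: {"name": "Williams", "state": "New York"}}


def apply_type1(dimension, customer_key, **changes):
    """Overwrite the changed attributes in place; the old values are not kept."""
    dimension[customer_key].update(changes)


apply_type1(customers, 1001, state="Los Angeles")
```

After the call the row holds only the new state, so the New York value can no longer be recovered from the dimension.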



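SCD Type 2, Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information, so both the original and the new record are present. After Williams moves from New York to Los Angeles, we have the following table:

Customer Key   Name       State
1001           Williams   New York
1005           Williams   Los Angeles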

Advantages: This allows us to accurately keep all historical information.

Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. This necessarily complicates the ETL process.

Usage: About 50% of the time.
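The Type 2 behaviour in the table above, where the Los Angeles record is added under a new key (1005) next to the original New York record (1001), can be sketched as follows. The list-of-dictionaries layout and the function name apply_type2 are assumptions for illustration; real implementations often add effective-date or current-flag columns as well.

```python
# Each dictionary stands in for one row of the customer dimension.
rows = [{"customer_key": 1001, "name": "Williams", "state": "New York"}]


def apply_type2(rows, new_key, name, state):
    """Append a new version row under a new surrogate key; existing rows are untouched."""
    if any(row["customer_key"] == new_key for row in rows):
        raise ValueError("surrogate key already in use")
    rows.append({"customer_key": new_key, "name": name, "state": state})


apply_type2(rows, 1005, "Williams", "Los Angeles")

# Full history for the customer is still available.
history = [row["state"] for row in rows if row["name"] == "Williams"]
```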

When to use Type 2

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

SCD Type 3, Slowly Changing Dimension

Use, Example, Advantage, Disadvantage

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:

Customer Key   Name       State
1001           Williams   New York

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

Customer Key Name Original State Current State Effective Date



After Williams moved from New York to Los Angeles, the original information gets updated, and

we have the following table (assuming the effective date of change is February 20, 2010):

Customer Key   Name       Original State   Current State   Effective Date
1001           Williams   New York         Los Angeles     20-FEB-2010
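The Type 3 update shown in this table can be sketched as below. The dictionary layout and the function name apply_type3 are illustrative assumptions, and the date used for the second move is hypothetical.

```python
from datetime import date

# One dimension row with the Type 3 columns.
customer = {
    "customer_key": 1001,
    "name": "Williams",
    "original_state": "New York",
    "current_state": "New York",
    "effective_date": None,
}


def apply_type3(row, new_state, effective_date):
    """Update only the current-value column and its effective date."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date


apply_type3(customer, "Los Angeles", date(2010, 2, 20))

# A second move (hypothetical date) overwrites the current value,
# so the intermediate Los Angeles value is no longer stored anywhere.
after_second_move = dict(customer)
apply_type3(after_second_move, "Texas", date(2011, 1, 1))
```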

Advantages: This does not increase the size of the table, since new information is updated in place. This allows us to keep some part of history.

Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Williams later moves to Texas, the Los Angeles information will be lost.

Usage: Type 3 is rarely used in actual practice.

When to use Type 3

Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Explain degenerated dimension in detail.

Degenerate dimension: A column in the key section of the fact table that has no associated dimension table but is used for reporting and analysis is called a degenerate dimension or line item dimension. For example, we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no and date in the key section, and price, quantity and amount in the measure section. In this fact table, bill_no from the key section is a single value with no associated dimension table. Instead of creating a separate dimension table for that single value, we can include it in the fact table to improve performance. So here the column bill_no is a degenerate dimension or line item dimension.

A degenerated dimension is arrived at through a gradual modeling approach following Dimensional Modeling standards. Let's take the example of a Star Schema representing Sales Invoices. The FACT would have the "Invoiced Amount" as its primary measure. Now when we look at the source of the Invoice, it is the body of the Paper Invoice that gives us the following particulars about each Invoice:

Invoice Date

Customer ID



Products within the Invoice

Reference to Order Number(s)

Invoice Number

Invoice Line Numbers (which are multiple lines in single Invoice)

Invoice Line Amount

Invoice Total Amount

When we model the above following Dimensional Modeling standards, we get

following distinct Dimensions:

Calendar Dimension - representing the Invoice Date

Customer Dimension - representing Customer ID

Product Dimension - representing Products within the Invoice

Order Dimension - representing Orders

Invoice Dimension - representing Invoice Number & Invoice Line Numbers

The question arises: what attributes would be left to be part of the INVOICE DIMENSION, if we decide to have one at all? The only candidate attributes are Invoice Number and Invoice Line Numbers. But this is at the granularity of the FACT, which stores references to all the above Dimensions as well as the measures, i.e. Invoice Line Amount and Invoice Total Amount (derived by aggregation).

It is in this situation that we may decide to degenerate the attributes Invoice Number & Invoice Line Number into the Fact and avoid having a distinct entity to represent Invoice Number / Line Numbers as a Dimension. What we achieve by this:

1. avoiding a huge join as both Fact and this Dimension would have the same

granularity,

2. still able to query with Invoice Number as the entry point

So, when such a scenario appears, we make the left out attributes (i.e.

Invoice Number & Invoice Line Number in our case) part of the Fact and part of

the Primary Key in the Fact. This is why and how we model Degenerated dimension.



A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the Fact table. Do you want to have two tables with a billion rows or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table.

A degenerate dimension is when the dimension attribute is stored as part of fact

table, and not in a separate dimension table. These are essentially dimension keys

for which there are no other attributes. In a data warehouse, these are often used

as the result of a drill through query to analyze the source of an aggregated

number in a report. You can use these values to trace back to transactions in the

OLTP system.
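The invoice example can be sketched with Python's built-in sqlite3 module: the invoice number and line number live in the fact table as part of its primary key, with no invoice dimension table, and the invoice number still works as a query entry point. Table names, column names and sample values here are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- invoice_number and invoice_line_number are degenerate dimensions:
-- they sit in the fact table and have no dimension table of their own.
CREATE TABLE fact_invoice_line (
    invoice_number      TEXT    NOT NULL,
    invoice_line_number INTEGER NOT NULL,
    customer_id         INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    invoice_line_amount REAL    NOT NULL,
    PRIMARY KEY (invoice_number, invoice_line_number)
);
INSERT INTO dim_customer VALUES (1, 'Williams');
INSERT INTO fact_invoice_line VALUES
    ('INV-1', 1, 1, 40.0),
    ('INV-1', 2, 1, 60.0),
    ('INV-2', 1, 1, 15.0);
""")

# Invoice Total Amount is derived by aggregating the lines of one invoice.
invoice_total = conn.execute(
    "SELECT SUM(invoice_line_amount) FROM fact_invoice_line WHERE invoice_number = ?",
    ("INV-1",),
).fetchone()[0]
```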

What is a junk dimension? Give an example.

When developing a dimensional model, we often encounter miscellaneous flags and indicators. These

flags do not logically belong to the core dimension tables.

A junk dimension is a grouping of low cardinality flags and indicators. It helps avoid a cluttered data warehouse design, provides an easy way to access the dimensions from a single point of entry, and improves the performance of SQL queries.

Example: Assume that there are two dimension tables (gender and marital status). The data of these two tables are shown below:

Code:

Table: Gender
Id   Gender_status
----------------
1    Male
2    Female

Table: Marital Status
Id   Marital_Status
----------------
1    Single
2    Married

Here both dimensions hold low cardinality flags. Keeping them separate means maintaining two tables and can decrease the performance of SQL queries.

We can combine these two dimensions into a single table by cross joining them, and maintain a single dimension table. The result of the cross join is shown below:

Code:

id   gender   mrg_status
--------------------
1    Male     Single
2    Male     Married
3    Female   Single
4    Female   Married

This new dimension table is called a junk dimension. It improves manageability and the performance of SQL queries.
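The cross join that builds this junk dimension can be sketched with itertools.product. The lookup dictionary is an added convenience, assumed here to show how a load process might find the junk dimension id for a given combination of flags.

```python
from itertools import product

genders = ["Male", "Female"]
marital_statuses = ["Single", "Married"]

# Cross join: every gender paired with every marital status, numbered from 1.
junk_dimension = [
    {"id": row_id, "gender": gender, "mrg_status": status}
    for row_id, (gender, status) in enumerate(product(genders, marital_statuses), start=1)
]

# Lookup from a flag combination to the junk dimension id stored in the fact table.
lookup = {(row["gender"], row["mrg_status"]): row["id"] for row in junk_dimension}
```

The number of rows is the product of the number of values of each flag, here 2 x 2 = 4.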



In data warehouse design, we frequently run into a situation where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep that information in the fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues.

Junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk dimension table is the combination of all possible values of the individual indicator fields.

Let's look at an example. Assume that we have the following fact table:

In this example, the last 3 fields are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:

Note that now the number of dimensions in the fact table went from 7 to 5.

The content of the junk dimension table would look like the following:



In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.

By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data warehousing environment that offers better performance as well as being easier to manage.

What are conformed dimensions?

A conformed dimension can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts. Date is a

common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have

the same meaning when joined to any fact table. A conformed product dimension with product

name, description, SKU, and other common attributes could exist in multiple data marts, each

containing data for one store in a chain.

Examples of obvious conformed dimensions include Customer, Location, Organization, Time, and

Product.

Conformed dimensions are ordinary dimensions that are linked to many fact tables, whereas other dimensions are meant to be linked only to their own fact table.

The following figure gives the exact difference between these two...

Here F1, F2,... are Fact tables

D1, D2... are Dimension tables, whereas D3, D4 and D5 are Conformed Dimensions.
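A conformed dimension shared by two fact tables can be sketched with Python's built-in sqlite3 module. The date dimension below is joined to both a sales fact table and a returns fact table with the same meaning in each. All table names, column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    month    TEXT NOT NULL
);
CREATE TABLE fact_sales   (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
CREATE TABLE fact_returns (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
INSERT INTO dim_date VALUES (20110412, '2011-04'), (20110501, '2011-05');
INSERT INTO fact_sales VALUES (20110412, 100.0), (20110501, 80.0);
INSERT INTO fact_returns VALUES (20110412, 10.0);
""")

ALLOWED_FACT_TABLES = {"fact_sales", "fact_returns"}


def monthly_total(fact_table):
    """Aggregate any of the fact tables by month through the shared date dimension."""
    # Table names cannot be bound as SQL parameters, so restrict them to a known set.
    if fact_table not in ALLOWED_FACT_TABLES:
        raise ValueError(f"unknown fact table: {fact_table}")
    query = (
        f"SELECT d.month, SUM(f.amount) FROM {fact_table} AS f "
        "JOIN dim_date AS d ON d.date_key = f.date_key "
        "GROUP BY d.month ORDER BY d.month"
    )
    return conn.execute(query).fetchall()


sales_by_month = monthly_total("fact_sales")
returns_by_month = monthly_total("fact_returns")
```

Because both fact tables use the same dim_date rows, the monthly figures from the two queries can be compared side by side.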

Rapidly Changing Dimensions

Posted by Santosh Kumar Gidadmani on April 12, 2011

Fast changing dimensions are dimensions in which one or more attributes change frequently and in many rows. A fast changing dimension can grow very large if we use the Type-2 approach to track numerous changes. These dimensions are sometimes called rapidly changing dimensions.

Examples of fast changing dimension attributes are:

Age

Income

Test score

Rating

Credit history score

Customer account status

Weight

How to solve fast changing dimensions

After identifying the fast changing dimension attributes, you have to create a mini-dimension table with these attributes, joined directly to the fact table rather than snowflaked.

A mini-dimension is a dimension that usually contains fast changing attributes of a larger

dimension table. This is to improve accessibility to data in the fact table.

Before creating a mini-dimension table, you have to convert each of the identified attributes into band ranges. The concept behind this method is to take a limited set of discrete values as shown below.

Rows in mini-dimensions will be fewer than rows in large dimension tables because the band range value method restricts the number of rows in the mini-dimension.

After identifying the fast changing attributes of the primary customer dimension, and

determining the band ranges for these attributes, a new mini-dimension is formed called

Cust_Mini_Dim as shown below.
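The band range method can be sketched as follows. The band edges and labels for age and income are hypothetical, as are the function names; only the name Cust_Mini_Dim comes from the text above.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical band definitions: edges are the lower bounds of every band after the first.
AGE_EDGES = [20, 40, 60]
AGE_BANDS = ["0-19", "20-39", "40-59", "60+"]
INCOME_EDGES = [30000, 70000]
INCOME_BANDS = ["low", "medium", "high"]


def to_band(value, edges, labels):
    """Map a raw value to the label of the band it falls into."""
    return labels[bisect_right(edges, value)]


# Cust_Mini_Dim: one row per combination of bands, keyed by a surrogate key.
cust_mini_dim = {
    bands: key
    for key, bands in enumerate(product(AGE_BANDS, INCOME_BANDS), start=1)
}


def mini_dim_key(age, income):
    """Return the mini-dimension key to store on the fact row."""
    age_band = to_band(age, AGE_EDGES, AGE_BANDS)
    income_band = to_band(income, INCOME_EDGES, INCOME_BANDS)
    return cust_mini_dim[(age_band, income_band)]


key_a = mini_dim_key(35, 50000)
key_b = mini_dim_key(36, 52000)
```

Small changes in age or income that stay within the same bands map to the same mini-dimension row, which is why the mini-dimension stays small however often the raw values change.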