7/22/2019 dwh intrw qutns
1/25
What is a Data warehouse?
A: A data warehouse is a repository of integrated information, available for queries and
analysis. Data and information are extracted from heterogeneous sources and stored in a
database so that queries and reports can be run easily and efficiently.
A data warehouse is a logical collection of information gathered from many different
operational databases, used to create business intelligence that supports business analysis
activities and decision-making tasks. It is primarily a record of an enterprise's past
transactional and operational information, stored in a database designed to favour efficient
data analysis and reporting (especially OLAP).
What are the characteristics of Data Warehouse?
Subject-Oriented: Information is presented according to specific subjects or areas of
interest, not simply as computer files.
Integrated: Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. That is, if two different source systems
store conflicting data about entities, or attributes of an entity, the differences need to be
resolved during the process of transforming the source data and loading it into the data
warehouse.
Non-Volatile: Stable information that doesn't change each time an operational process is
executed. Information is consistent regardless of when the warehouse is accessed.
Time-Variant: Containing a history of the subject, as well as current information. Historical
information is an important component of a data warehouse.
Accessible: The primary purpose of a data warehouse is to provide readily accessible
information to end-users.
Process-Oriented: It is important to view data warehousing as a process for delivery of
information. The maintenance of a data warehouse is ongoing and iterative in nature.
What are the advantages of a Data warehouse?
Enhances end-user access to a wide variety of data.
Business decision makers can obtain various kinds of trend reports.
Increased data consistency.
Potentially lower computing costs and increased productivity.
Providing a place to combine related data from separate sources.
Creation of a computing infrastructure that can support changes in computer systems and business structures.
Empowering end-users to perform any level of ad-hoc queries or reports without impacting
the performance of the operational systems.
Data Warehouse Approaches
Posted by Santosh Kumar Gidadmani on January 9, 2011
There are two major approaches to data warehouse design.
1. Bottom-up approach
This approach is recommended by Kimball.
In the bottom-up approach data marts are first created to provide reporting and analytical
capabilities for specific business processes.
Data marts contain, primarily, dimensions and facts. Facts can contain atomic data
and, if necessary, summarized data. A single data mart often models a specific business
area such as Sales or Production.
These data marts can eventually be integrated to create a comprehensive data warehouse.
The integration of the data marts in the data warehouse is centered on the conformed
dimensions.
The actual integration of two or more data marts is then done by a process known as "drill
across". A drill across works by grouping (summarizing) the data along the keys of the
(shared) conformed dimensions of each fact participating in the drill across, followed by a
join on the keys of these grouped (summarized) facts.
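The drill-across steps above can be sketched in plain Python. The table names and rows below are hypothetical; a real implementation would issue SQL against each fact table, but the group-then-join logic is the same.

```python
from collections import defaultdict

# Two hypothetical fact tables sharing the conformed dimensions
# (product, month). Each row: (product, month, measure).
sales_fact = [("P1", "2023-01", 100), ("P1", "2023-01", 50), ("P2", "2023-02", 70)]
inventory_fact = [("P1", "2023-01", 30), ("P2", "2023-02", 10), ("P2", "2023-02", 5)]

def summarize(fact_rows):
    """Group (summarize) a fact along the conformed dimension keys."""
    totals = defaultdict(int)
    for product, month, value in fact_rows:
        totals[(product, month)] += value
    return totals

def drill_across(fact_a, fact_b):
    """Join the two summarized facts on the shared conformed keys."""
    a, b = summarize(fact_a), summarize(fact_b)
    return {key: (a[key], b[key]) for key in a.keys() & b.keys()}

combined = drill_across(sales_fact, inventory_fact)
# combined[("P1", "2023-01")] == (150, 30)
```

The key point is that the join happens only after each fact has been summarized to the shared conformed-dimension grain.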
Some consider it an advantage of the Kimball method that the data warehouse ends up
being "segmented" into a number of logically self-contained and consistent data marts, rather
than one big and often complex centralized model.
Business value can be returned as quickly as the first data mart is built.
2. Top-down approach
This approach is recommended by Bill Inmon.
Inmon is one of the leading proponents of the top-down approach to data warehouse design,
in which the data warehouse is designed using a normalized enterprise data model.
In the Inmon vision the data warehouse is at the center of the "Corporate Information
Factory" (CIF), which provides a logical framework for delivering business intelligence (BI)
and business management capabilities.
The top-down design methodology generates highly consistent dimensional views of data
across data marts since all data marts are loaded from the centralized repository.
Generating new dimensional data marts against the data stored in the data warehouse is a
relatively simple task.
The main disadvantage of the top-down methodology is that it represents a very large
project with a very broad scope, high cost, and a long timeline.
In addition, the top-down methodology can be inflexible and unresponsive to changing
departmental needs during the implementation phases.
Normalization
This is a technique used in data modeling that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called the First Normal Form, Second Normal Form, and Third Normal Form, to achieve a normalized data model.
A normalized data model may result in many tables/entities having multiple levels of relationships; for example, table1 related to table2, table2 further related to table3, table3 related to table4, and so on.
First Normal Form (1NF): the attributes of the entity must be atomic and must depend on the key.
Second Normal Form (2NF): every non-key attribute must depend on the whole key, not just part of it.
Third Normal Form (3NF): every non-key attribute must depend on nothing but the key (no transitive dependencies).
7/22/2019 dwh intrw qutns
4/25
Theoretically we have further rules called the Boyce-Codd Normal Form, the Fourth Normal Form and the Fifth Normal Form. In practice we don't use the rules beyond 3NF.
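As a minimal illustration of the 3NF rule above, the sketch below splits a hypothetical denormalized order list (where a customer's city repeats on every order, a transitive dependency) into separate customer and order tables. All table and column names are made up.

```python
# Hypothetical denormalized order rows: the customer's city depends on
# customer_id, not on the order key -> a 3NF violation (transitive dependency).
orders_denormalized = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Pune", "amount": 250},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Pune", "amount": 100},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Delhi", "amount": 75},
]

# 3NF split: customer attributes move to their own table, keyed by customer_id,
# and the orders table keeps only the foreign key.
customers = {}
orders = []
for row in orders_denormalized:
    customers[row["customer_id"]] = {"city": row["customer_city"]}
    orders.append({"order_id": row["order_id"],
                   "customer_id": row["customer_id"],
                   "amount": row["amount"]})
# The city is now stored once per customer instead of once per order.
```

After the split, updating a customer's city is a single-row change instead of an update to every order.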
OLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, they also return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in that case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
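The "slicing is a WHERE clause" point can be illustrated with a small Python sketch. The table and column names are hypothetical, and the naive string interpolation is for illustration only; a real ROLAP engine would generate parameterized SQL.

```python
def build_rolap_query(table, measures, slices):
    """Translate slice/dice selections into a SQL statement (illustrative only;
    never interpolate untrusted values into SQL in real code)."""
    select = ", ".join(measures)
    # Each slice selection becomes one WHERE predicate.
    where = " AND ".join(f"{col} = '{val}'" for col, val in sorted(slices.items()))
    return f"SELECT {select} FROM {table} WHERE {where}"

# Slicing on year and dicing on region simply adds WHERE predicates.
sql = build_rolap_query("sales_fact", ["SUM(amount)"], {"year": "2023", "region": "West"})
# sql == "SELECT SUM(amount) FROM sales_fact WHERE region = 'West' AND year = '2023'"
```

Adding another dimension selection adds another predicate, which is exactly why ROLAP performance tracks the performance of the underlying relational database.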
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
Q: What is data mining?
A: Data mining is a process of extracting hidden trends from a data warehouse. For example,
an insurance data warehouse can be used to mine data for the highest-risk people to insure
in a certain geographical area.
Q: What are Data Marts?
A: Data Marts are subsets of the corporate-wide data that are of value to a specific group of
users.
There are two types of Data Marts:
1. Independent data marts: sourced from data captured from OLTP systems, external providers,
or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.
Q: What is OLTP?
A: OnLine Transaction Processing.
Q: What is OLAP?
A: OnLine Analytical Processing.
Q: What are the differences between OLTP and OLAP?
A: The main differences between OLTP and OLAP are:
1. User and System Orientation
OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT
professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives,
analysts).
2. Data Contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization and
aggregation, and stores information at different levels of granularity to support the
decision-making process.
3. Database Design
OLTP: adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database
design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an
organization; integrates information from many organizational locations and data stores.
Q: What is real time data-warehousing?
A:Real-time data warehousing captures business activity data as it occurs. As soon as the
business activity is complete and there is data about it, the completed activity data flows into
the data warehouse and becomes available instantly. In other words, real-time data
warehousing is a framework for deriving information from data as the data becomes
available.
Q: What are the steps to build the data warehouse?
A:
Gathering business requirements.
Identifying Sources
Identifying Facts
Defining Dimensions
Define Attributes
Redefine Dimensions & Attributes
Organize Attribute Hierarchy & Define Relationship
Assign Unique Identifiers
Additional conventions: Cardinality / Adding ratios
Q: What is a CUBE in data warehousing concept?
A: Cubes are logical representations of multidimensional data. The edge of the cube contains
dimension members and the body of the cube contains data values.
Q: What is a linked cube?
A: A linked cube is a cube in which a subset of the data can be analysed in greater detail.
The linking ensures that the data in the cubes remains consistent.
Q: What is the main difference between Inmon and Kimball philosophies of data
warehousing?
A: The two differ in their concept of building the data warehouse.
Kimball views data warehousing as a constituency of data marts. Data marts are focused on
delivering business objectives for departments in the organization, and the data warehouse is
a conformed dimension of the data marts. Hence a unified view of the enterprise can be
obtained from dimensional modeling at a local, departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject-area basis. Hence the
development of the data warehouse can start with data from the online store. Other subject
areas can be added to the data warehouse as their needs arise.
i.e.,
Kimball: first data marts, then combined into a data warehouse. Inmon: first the data
warehouse, later the data marts.
Q: What is Hierarchy in data warehouse terms?
A: Hierarchies are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a
hierarchy might aggregate data from the month level to the quarter level to the year level. A
hierarchy can also be used to define a navigational drill path and to establish a family
structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data
values at lower levels aggregate into the data values at higher levels. A dimension can be
composed of more than one hierarchy. For example, in the product dimension, there might be
two hierarchies: one for product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use hierarchies
to enable you to drill down into your data to view different levels of granularity. This is one
of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures. For
example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level value, a
value at the next higher level is its parent, and values at the next lower level are its children.
These familial relationships enable analysts to access data quickly.
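The month-to-quarter-to-year roll-up described above can be sketched as a pair of aggregations; the mappings and figures below are hypothetical.

```python
# A hypothetical time hierarchy: month -> quarter -> year.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
quarter_to_year = {"Q1": 2023, "Q2": 2023}

monthly_sales = {"Jan": 10, "Feb": 20, "Mar": 30, "Apr": 40}

def roll_up(values, mapping):
    """Aggregate child-level values into their parent level of the hierarchy."""
    parent_totals = {}
    for child, value in values.items():
        parent = mapping[child]
        parent_totals[parent] = parent_totals.get(parent, 0) + value
    return parent_totals

quarterly = roll_up(monthly_sales, month_to_quarter)   # {"Q1": 60, "Q2": 40}
yearly = roll_up(quarterly, quarter_to_year)           # {2023: 100}
```

Drilling down is simply reading the child-level values back; the family structure (parent/child mapping) is what makes both directions possible.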
Q: What are the differences between an RDBMS schema and a data warehouse schema?
A:
RDBMS Schema
* Used for OLTP systems
* Highly normalized
* Difficult to understand and navigate
* Difficult to extract from and solve complex problems with
DWH Schema
* Used for OLAP systems
* De-normalized
* Easy to understand and navigate
* Relatively easier for extracting the data and solving complex problems
Q: What is meant by metadata in the context of a Data warehouse?
A: Metadata is data about data. A business analyst or data modeler usually captures
information about data in a data dictionary (a.k.a. metadata): the source (where and how the
data originated), the nature of the data (char, varchar, nullable, existence, valid values,
etc.) and the behavior of the data (how it is modified or derived, and its life cycle).
Metadata is also present at the data mart level: subsets, facts and dimensions, ODS, etc.
For a DW user, metadata provides vital information for analysis/DSS.
Q: How is an OLAP system different from an OLTP system?
A: The main differences between OLTP and OLAP systems are:
1. OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT professionals. OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. OLTP: manages current data, very detail-oriented. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.
3. OLTP: adopts an entity-relationship (ER) model and an application-oriented database design. OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. OLTP: focuses on the current data within an enterprise or department. OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.
5. OLTP: large volumes of simple transactional queries. OLAP: a small number of diverse queries.
6. OLTP: data changes are continuous and the data is very volatile. OLAP: systems have periodic updates to the data.
7. OLTP: data processing time is low. OLAP: data processing time is high.
8. OLTP: highly normalized data. OLAP: systems contain fewer tables, but more columns per table, thus reducing the degree of normalization.
DIFFERENCE BETWEEN STAR AND SNOWFLAKE SCHEMA
Answers:
In a star schema, each dimension is stored in a single de-normalized table joined directly to the fact table, which keeps queries simple and fast. In a snowflake schema, the dimension tables are further normalized into multiple related tables, which reduces redundancy but requires more joins at query time.
Data Modeling Tools
Data modeling is the heart and soul of any development project. Be it a simple application development effort or a
data warehouse project, data modeling is a core step in the development life cycle. There is a variety of tools
available; this article gives a brief overview of a couple of these tools.
CA ErWin Data Modeler
fabFORCE.net DBDesigner
What is BUS Schema?
A BUS schema is composed of a master suite of conformed dimensions and standardized
definitions of facts.
In a BUS schema we would eventually have conformed dimensions and facts defined to
be shared across all enterprise data marts. This way all data marts can use the
conformed dimensions and facts without having them locally. This is the first step
towards building an enterprise data warehouse from Kimball's perspective. For example,
we may have different data marts for Sales, Inventory and Marketing, and we need
common entities like Customer, Product, etc. to be seen across these data marts;
hence it would be ideal to have these as conformed objects. The challenge here is that
sometimes each line of business may have different definitions for these conformed
objects, so conformed objects have to be designed with some extra care.
What is the data type of the surrogate key?
There is no mandated data type for a surrogate key; the only requirement of a
surrogate key is that it be UNIQUE for each row. The recommended data type for a
surrogate key is numeric (integer).
What is surrogate key ? where we use it explain with example
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used as the primary
key of the table. The only requirement for a surrogate primary key is that it is unique
for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity
key) for the dimension tables' primary keys. They can use an Informatica sequence
generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in the Customer
table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as
the primary keys (according to the business users), but not only can these change,
indexing on a numerical value is probably better, so you could consider creating a
surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far
as the client is concerned you may display only the AIRPORT_NAME.
2. Adapted from response by Vincent on Thursday, March 13, 2003
Another benefit you can get from surrogate keys (SIDs) is tracking the SCD, the
Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what
would be in your Employee dimension). This employee has turnover allocated to him
on Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from
Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the
new Business Unit 'BU2', but the old turnover should belong to Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data warehouse,
everything would be allocated to Business Unit 'BU2', even what actually belongs to
'BU1.'
If you use surrogate keys, you could create, on the 2nd of June, a new record for the
Employee 'E1' in your Employee dimension with a new surrogate key.
This way, in your fact table, you have your old data (before the 2nd of June) with the SID
of the Employee 'E1' + 'BU1'. All new data (after the 2nd of June) would take the SID of
the employee 'E1' + 'BU2'.
You could consider a Slowly Changing Dimension as an enlargement of your natural key:
the natural key of the Employee was Employee Code 'E1', but for you it becomes
Employee Code + Business Unit: 'E1' + 'BU1' or 'E1' + 'BU2'. But the difference from the
natural-key enlargement process is that you might not have all parts of your new
key within your fact table, so you might not be able to do the join on the new enlarged
key; so you need another id.
A surrogate key is a system generated sequential number which acts as a primary key.
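The E1/BU1/BU2 scenario above can be sketched in Python: each change produces a new dimension row with a fresh surrogate key (SID), and fact rows reference the SID rather than the natural key. All names and numbers below are hypothetical.

```python
# Employee dimension rows: (sid, employee_code, business_unit).
employee_dim = []
_next_sid = [1]  # simple stand-in for a database sequence / identity column

def add_employee_version(code, business_unit):
    """Each change creates a NEW dimension row with a fresh surrogate key."""
    sid = _next_sid[0]
    _next_sid[0] += 1
    employee_dim.append((sid, code, business_unit))
    return sid

sid_bu1 = add_employee_version("E1", "BU1")   # before the 2nd of June
sid_bu2 = add_employee_version("E1", "BU2")   # after the move to BU2

# Fact rows reference the SID, so old turnover stays tied to BU1
# and new turnover to BU2, even though the natural key 'E1' is unchanged.
turnover_fact = [(sid_bu1, 500), (sid_bu2, 300)]
```

Because the fact table joins on the SID, history is preserved without enlarging the natural key.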
What is a lookup table?
A lookup table is one which is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse), based upon the primary key of the target, it updates the table by allowing only new records or updated records, based on the lookup condition.
What is Dimensional Modelling? Why is it important ?
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.
Why is Data Modeling Important?
Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.
The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.
The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term.
Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.
What is ETL?
ETL stands for extraction, transformation and loading.
ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
Extraction: take data from an external source and move it to the warehouse pre-processor database.
Transformation: the transform data task allows point-to-point generating, modifying and transforming of data.
Loading: the load data task adds records to a database table in the warehouse.
What does level of Granularity of a fact table signify?
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.
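A rough sketch of the grain decision: the same hypothetical transactions can be stored at daily or monthly grain, and the coarser grain yields fewer, less detailed fact rows.

```python
from collections import defaultdict

# Hypothetical raw transactions: (product, date, amount).
transactions = [
    ("P1", "2023-01-05", 10), ("P1", "2023-01-05", 15), ("P1", "2023-01-06", 20),
]

def aggregate_to_grain(rows, grain):
    """The chosen grain determines the lowest level of detail kept in the fact table."""
    totals = defaultdict(int)
    for product, date, amount in rows:
        # Daily grain keys on the full date; monthly grain keys on YYYY-MM only.
        key = (product, date) if grain == "daily" else (product, date[:7])
        totals[key] += amount
    return dict(totals)

daily = aggregate_to_grain(transactions, "daily")     # two fact rows
monthly = aggregate_to_grain(transactions, "monthly") # one coarser fact row
```

Once the monthly grain is chosen, the per-day detail is gone from the fact table, which is why the grain decision has to come first.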
What is SCD1 , SCD2 , SCD3?
SCD stands for Slowly Changing Dimensions.
SCD1: only the updated value is maintained.
Ex: if a customer's address is modified, we update the existing record with the new address.
SCD2: maintaining historical information and current information by using
A) Effective Date
B) Versions
C) Flags
or a combination of these.
SCD3: by adding new columns to the target table we maintain historical information and current information.
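The three SCD treatments can be sketched side by side; the record layout below is hypothetical and each function shows only the essential move of its type.

```python
def scd1_overwrite(row, new_address):
    """SCD1: overwrite in place; no history is kept."""
    row["address"] = new_address
    return row

def scd2_new_row(rows, key, new_address, effective_date):
    """SCD2: flag the current row as expired and add a new versioned row."""
    current = [r for r in rows if r["key"] == key and r["current"]][0]
    current["current"] = False
    rows.append({"key": key, "address": new_address,
                 "effective_date": effective_date, "current": True})
    return rows

def scd3_new_column(row, new_address):
    """SCD3: keep limited history in an extra column on the same row."""
    row["previous_address"] = row["address"]
    row["address"] = new_address
    return row
```

SCD1 keeps no history, SCD2 keeps full history (one row per version), and SCD3 keeps only the previous value.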
What is Fact table?
The fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.
What are conformed dimensions?
Answer 1:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. Ex: a Date dimension is connected to all facts, like Sales facts, Inventory facts, etc.
Answer 2:
Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas containing fact and dimension tables). Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and dimensions; here D1 and D2 are the conformed dimensions.
What are the Different methods of loading Dimension tables?
Conventional Load: before loading the data, all the table constraints will be checked against the data.
Direct Load (Faster Loading): all the constraints will be disabled and the data will be loaded directly. Later the data will be checked against the table constraints, and the bad data won't be indexed.
What is conformed fact?
A conformed fact is a fact (measure) whose definition, calculation and units are identical wherever it appears, so that it can be used across multiple data marts, in combination with multiple fact tables, and compared or combined meaningfully.
What are Data Marts?
Data Marts are designed to help manager make strategic decisions about their business.Data Marts are subset of the corporate-wide data that is of value to a specific group of users.
There are two types of Data Marts:
1. Independent data marts: sourced from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.
What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to record the sales data for each transaction. Now, level of granularity would mean what detail you are willing to record for each transactional fact: for example, whether to store product sales for each minute, or to aggregate the data up to the minute and store that.
What is the purpose of Factless Fact Table?
Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they
don't really have facts or any other information, but are commonly used for tracking some information about an event.
E.g. to find the number of leaves taken by an employee in a month.
A tracking process or status collection can be performed by using factless fact tables. The fact table does not
have numeric values that are aggregated, hence the name. Only key values, referencing the dimensions from which
the status is collected, are available in factless fact tables.
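The leave-tracking example can be sketched directly: the fact rows below hold only hypothetical dimension keys, and the "measure" is simply a row count.

```python
# A factless fact table for leave tracking: rows hold only dimension keys.
leave_fact = [
    {"employee_key": 101, "date_key": 20230104},
    {"employee_key": 101, "date_key": 20230117},
    {"employee_key": 102, "date_key": 20230117},
]

def leaves_in_month(fact_rows, employee_key, year_month):
    """Counting matching rows answers 'how many leaves this month';
    no numeric fact column is needed."""
    return sum(1 for r in fact_rows
               if r["employee_key"] == employee_key
               and str(r["date_key"]).startswith(year_month))

count = leaves_in_month(leave_fact, 101, "202301")  # -> 2
```

The event's occurrence is the fact; COUNT(*) over the keys is the only aggregation required.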
What is a junk dimension?
In scenarios where certain data may not be appropriate to store in the schema, this data (or these attributes) can be
stored in a junk dimension. The data in a junk dimension is usually Boolean or flag values.
E.g. whether the performance of the employee was up to the mark; comments on performance.
A single dimension formed by lumping together a number of small dimensions is called a junk dimension. A junk
dimension has unrelated attributes. The process of grouping random flags and text attributes and moving them into a
distinct sub-dimension is what characterizes a junk dimension.
What are fundamental stages of Data Warehousing?
Offline Operational Databases: This is the initial stage of data warehousing. In this stage the database of an
operational system is copied to an off-line server.
Offline Data Warehouse: In this stage the data warehouse is updated on a regular time cycle from the operational
system, and the data is persisted in a reporting-oriented data structure.
Real-time Data Warehouse: In this stage the data warehouse is updated on a transaction or event basis, every
time an operational system performs a transaction.
Integrated Data Warehouse: In this stage the data warehouse generates activity or transactions that are passed
back into the operational system, for use in the daily activity of the organization.
What is Virtual Data Warehousing?
A virtual data warehouse provides an aggregate view of the complete data inventory. It contains metadata used to
form a logical enterprise data model, which is part of a database-of-record infrastructure. The infrastructure
consists of published legacy database systems with their metadata extracted. The JEE, JMS and EJB standards are
used in the infrastructure for transactional unit requests, and extract-transform-load tools are used for loading
real-time bulk data.
What is active data warehousing?
Transactional data is captured and stored in an active data warehouse. This repository can be used to find
trends and patterns for use in future decision making.
An active data warehouse aims to capture data continuously and deliver real-time data. It provides a single
integrated view of a customer across multiple business lines, and is associated with Business Intelligence systems.
Difference between ER Modeling and Dimensional Modeling.
Dimensional modelling is very flexible from the user's perspective. A dimensional data model is mapped for creating
schemas, whereas the ER model is not mapped for creating schemas and is not used for converting normalized data
into a denormalized form.
The ER model is used for OLTP databases, which use the 1st, 2nd or 3rd normal forms, whereas the dimensional data
model is used for data warehousing and is denormalized (star/snowflake schemas).
The ER model contains normalized data, whereas the dimensional model contains denormalized data.
Describe the various methods of loading Dimension tables.
The following are the methods of loading dimension tables:
Conventional Load:
In this method all the table constraints will be checked against the data before loading the data.
Direct Load (or Faster Load):
As the name suggests, the data will be loaded directly without checking the constraints. The data checking
against the table constraints will be performed later, and indexing will not be done on bad data.
Describe the foreign key columns in fact table and dimension table.
The primary keys of the entity tables become foreign keys in the dimension tables.
The primary keys of the dimension tables become foreign keys in the fact table.
What is Data Cardinality?
Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation.
There are 3 basic types of cardinality:
High data cardinality:
The values of the column are mostly or entirely unique.
e.g.: email IDs and user names
Normal data cardinality:
The values of the column are somewhat uncommon but not unique.
e.g.: a LAST_NAME column (there may be several entries with the same last name)
Low data cardinality:
The values of the column are drawn from a very small set.
e.g.: flag statuses: 0/1
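One common way to quantify the three levels is the ratio of distinct values to total rows; the sample columns below are invented:

```python
# Cardinality estimated as distinct values / total rows:
# close to 1.0 is high, close to 0.0 is low.
def cardinality(values):
    return len(set(values)) / len(values)

emails     = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]  # high: all unique
last_names = ["Smith", "Smith", "Jones", "Brown"]          # normal: repeats
flags      = [0, 1, 0, 0]                                  # low: two values

print(cardinality(emails))      # 1.0
print(cardinality(last_names))  # 0.75
print(cardinality(flags))       # 0.5
```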
Determining data cardinality is a substantial aspect of data modeling; it is used to determine the relationships between entities.
Types of cardinalities:
The Link Cardinality - 0:0 relationship
The Sub-type Cardinality - 1:0 relationship
The Physical Segment Cardinality - 1:1 relationship
The Possession Cardinality - 0:M relationship
The Child Cardinality - 1:M mandatory relationship
The Characteristic Cardinality - 0:M relationship
The Paradox Cardinality - 1:M relationship
What are Critical Success Factors?
Key areas of activity in which favorable results are necessary for a company to reach its goal.
There are four basic types of CSFs which are:
Industry CSFs
Strategy CSFs
Environmental CSFs
Temporal CSFs
A few CSFs are:
Money
Your future
Customer satisfaction
Quality
Product or service development
Intellectual capital
Strategic relationships
Employee attraction and retention
Sustainability
The advantages of identifying CSFs are:
they are simple to understand;
they help focus attention on major concerns;
they are easy to communicate to coworkers;
they are easy to monitor;
and they can be used in concert with strategic planning methodologies.
SCD Type 1, Slowly Changing Dimension
Use, Example, Advantage, Disadvantage
In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Williams | New York
After Williams moved from New York to Los Angeles, the record is simply updated in place:
Customer Key | Name | State
1001 | Williams | Los Angeles
Advantages: This is the easiest way to handle slowly changing dimensions, since there is no need to keep track of the old information.
Disadvantages: All history is lost.
When to use Type 1: Type 1 slowly changing dimensions should be used when it is not necessary for the data warehouse to keep track of historical changes.

SCD Type 2, Slowly Changing Dimension
Use, Example, Advantage, Disadvantage
In a Type 2 Slowly Changing Dimension, a new record with a new surrogate key is added to the table to represent the new information; both the original and the new record are present.
After Williams moved from New York to Los Angeles, the table looks like this:
Customer Key | Name | State
1001 | Williams | New York
1005 | Williams | Los Angeles
Advantages: This allows us to accurately keep all historical information.
Disadvantages: This causes the size of the table to grow quickly. In cases where the number of rows is very high to start with, storage and performance can become a concern, and the ETL process is necessarily more complicated.
Usage: About 50% of the time.
When to use Type 2: Type 2 slowly changing dimensions should be used when it is necessary for the data warehouse to track historical changes.

SCD Type 3, Slowly Changing Dimension
Use, Example, Advantage, Disadvantage
In a Type 3 Slowly Changing Dimension, there are two columns for the attribute of interest: one holding the original value and one holding the current value. There is also a column that indicates when the current value became active.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Williams | New York
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key Name Original State Current State Effective Date
After Williams moved from New York to Los Angeles, the original information gets updated, and
we have the following table (assuming the effective date of change is February 20, 2010):
Customer Key | Name | Original State | Current State | Effective Date
1001 | Williams | New York | Los Angeles | 20-FEB-2010
Advantages: This does not increase the size of the table, since the new information is updated in place. It allows us to keep some part of the history.
Disadvantages: Type 3 cannot keep all history when an attribute changes more than once. For example, if Williams later moves to Texas, the Los Angeles information will be lost.
Usage: Type 3 is rarely used in actual practice.
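The overwrite, add-row, and add-column behaviours of the three SCD types can be sketched in Python; the row layout and function names are illustrative, following the Williams example:

```python
# Minimal in-memory sketch of the three SCD strategies; the "dimension
# table" is just a list of row dicts keyed on a surrogate key.

def scd_type1(dim, key, new_state):
    """Type 1: overwrite in place -- no history is kept."""
    for row in dim:
        if row["key"] == key:
            row["state"] = new_state

def scd_type2(dim, key, new_key, new_state):
    """Type 2: add a new row with a new surrogate key -- full history."""
    old = next(r for r in dim if r["key"] == key)
    dim.append({"key": new_key, "name": old["name"], "state": new_state})

def scd_type3(dim, key, new_state, effective_date):
    """Type 3: keep original and current value in separate columns."""
    for row in dim:
        if row["key"] == key:
            row["original_state"] = row.pop("state")
            row["current_state"] = new_state
            row["effective_date"] = effective_date

dim = [{"key": 1001, "name": "Williams", "state": "New York"}]
scd_type2(dim, 1001, 1005, "Los Angeles")
print(len(dim))          # 2 -- both the old and the new state survive
```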
When to use Type 3: Type 3 slowly changing dimensions should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Explain degenerated dimension in detail.
Degenerate dimension: a column in the key section of the fact table that has no associated dimension table but is still used for reporting and analysis; such a column is called a degenerate dimension or line-item dimension.
For example, suppose a fact table has customer_id, product_id, branch_id, employee_id, bill_no and date in its key section, and price, quantity and amount in its measure section. In this fact table, bill_no in the key section has no associated dimension table. Instead of creating a separate dimension table for that single value, we include it in the fact table to improve performance. The column bill_no is therefore a degenerate dimension or line-item dimension.
Degenerated Dimension is achieved through a gradual modeling approach
following Dimensional Modeling standards. Let's take example of a Star Schema
representing Sales Invoices. The FACT would have the "Invoiced Amount" as its
primary measure. Now, when we look at the source of the Invoice, it is the body
of the paper Invoice that gives us the following particulars about each Invoice:
Invoice Date
Customer ID
Products within the Invoice
Reference to Order Number(s)
Invoice Number
Invoice Line Numbers (which are multiple lines in single Invoice)
Invoice Line Amount
Invoice Total Amount
When we model the above following Dimensional Modeling standards, we get
following distinct Dimensions:
Calendar Dimension - representing the Invoice Date
Customer Dimension - representing Customer ID
Product Dimension - representing Products within the Invoice
Order Dimension - representing Orders
Invoice Dimension - representing Invoice Number & Invoice Line Numbers
Question comes - what attributes would be left to be part of the INVOICE
DIMENSION, if at all we decide to have one! Only candidate attributes are
Invoice Number and Invoice Line Numbers. But, this is at the granularity of the
FACT, which stores references to all above said Dimensions as well as the
measures i.e. Invoice Line Amount, Invoice Total Amount (Derived by
aggregation).
It is in this situation that we may decide to degenerate the attributes Invoice
Number & Invoice Line Number into the Fact and avoid having a distinct entity to
represent Invoice Number / Line Numbers as a Dimension. What we achieve by this:
1. avoiding a huge join, as both the Fact and this Dimension would have the same
granularity;
2. still being able to query with Invoice Number as the entry point.
So, when such a scenario appears, we make the left-out attributes (i.e.
Invoice Number & Invoice Line Number in our case) part of the Fact and part of
the primary key in the Fact. This is why and how we model a Degenerated dimension.
A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example,
if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1
relationship with the fact table. Do you want two tables with a billion rows each, or one table with
a billion rows? This would therefore be a degenerate dimension, with Order Number and Order Line
Number stored directly in the fact table.
A degenerate dimension is when the dimension attribute is stored as part of fact
table, and not in a separate dimension table. These are essentially dimension keys
for which there are no other attributes. In a data warehouse, these are often used
as the result of a drill through query to analyze the source of an aggregated
number in a report. You can use these values to trace back to transactions in the
OLTP system.
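A minimal sqlite3 sketch of the idea: the bill_no column (names here are illustrative) lives directly in the fact table with no dimension table of its own, yet still serves as an entry point for drill-through queries:

```python
import sqlite3

# Degenerate dimension sketch: bill_no has no lookup table; it is
# stored in the fact table itself.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_sales (
    customer_id INTEGER,
    product_id  INTEGER,
    bill_no     TEXT,     -- degenerate dimension: no dimension table
    amount      REAL)""")
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 10, "INV-001", 40.0), (1, 11, "INV-001", 60.0)])

# bill_no still works as a query entry point, e.g. invoice totals.
total = con.execute("""SELECT SUM(amount) FROM fact_sales
                       WHERE bill_no = 'INV-001'""").fetchone()[0]
print(total)  # 100.0
```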
What is a junk dimension? Give an example.
When developing a dimensional model, we often encounter miscellaneous flags and indicators. These
flags do not logically belong to the core dimension tables.
A junk dimension is a grouping of low-cardinality flags and indicators. It helps avoid a cluttered data warehouse design, provides a single point of entry to these small dimensions, and improves the performance of SQL queries.
Example: assume that there are two dimension tables, Gender and Marital Status. The data of these two tables is shown below:
Code:
Table: Gender
Id | Gender_Status
1 | Male
2 | Female

Table: Marital Status
Id | Marital_Status
1 | Single
2 | Married
Here both dimensions hold low-cardinality flags. Keeping them separate means maintaining two tables and decreases the performance of SQL queries.
We can combine these two dimensions into a single table by cross joining and can maintain a single
dimension table. The result of cross join is shown below:
Code:
Id | Gender | Mrg_Status
1 | Male | Single
2 | Male | Married
3 | Female | Single
4 | Female | Married
This new dimension table is called a junk dimension. It improves manageability and SQL query performance.
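The cross join described above can be sketched with Python's itertools.product; the surrogate ids are assigned in enumeration order:

```python
from itertools import product

# Build the junk dimension as the cross join of the two flag domains,
# assigning a surrogate id to each combination.
genders  = ["Male", "Female"]
statuses = ["Single", "Married"]

junk_dim = [{"id": i, "gender": g, "mrg_status": s}
            for i, (g, s) in enumerate(product(genders, statuses), start=1)]

for row in junk_dim:
    print(row)
# 4 rows: every gender/status combination, each with its own surrogate id
```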
In data warehouse design, we frequently run into situations where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep that information in the fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues.
A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content of the junk dimension table is the combination of all possible values of the individual indicator fields.
Let's look at an example, assuming a fact table whose last three fields (TXN_CODE, COUPON_IND and PREPAY_IND) are all indicator fields. In the existing format, each of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, so the number of dimensions in the fact table goes from 7 to 5. (The fact-table and junk-dimension figures from the original article are not reproduced here.)
In this case, there are 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.
By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This results in a data warehousing environment that offers better performance and is easier to manage.
What are conformed dimensions?
A conformed dimension can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts. Date is a common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have the same meaning when joined to any fact table. A conformed product dimension with product name, description, SKU, and other common attributes could exist in multiple data marts, each containing data for one store in a chain.
Examples of obvious conformed dimensions include Customer, Location, Organization, Time, and Product.
Conformed dimensions are ordinary dimensions, except that they are linked to many fact tables, whereas a regular dimension is linked only to its own fact table.
The following figure gives the exact difference between the two (figure not reproduced here):
Here F1, F2, ... are fact tables and D1, D2, ... are dimension tables, whereas D3, D4 and D5 are conformed dimensions.
Rapidly Changing Dimensions
Posted by Santosh Kumar Gidadmani on April 12, 2011
Fast changing dimensions are dimensions in which one or more attributes change frequently and in many rows. A fast changing dimension can grow very large if we use the Type 2 approach to track the numerous changes. These dimensions are sometimes called rapidly changing dimensions.
Examples of fast changing dimensions are
Age
Income
Test score
Rating
Credit history score
Customer account status
Weight
How to solve fast changing dimensions
After identifying the fast changing dimension attributes, you create a mini-dimension table with these attributes, joined directly to the fact table rather than snowflaked off the main dimension.
A mini-dimension is a dimension that usually contains fast changing attributes of a larger
dimension table. This is to improve accessibility to data in the fact table.
Before creating the mini-dimension table, you convert the identified attributes individually into band ranges. The concept behind this method is to reduce each attribute to a limited set of discrete values (the band-range figure from the original post is not reproduced here).
Rows in mini-dimensions will be fewer than rows in large dimension tables because it
restricts the rows in mini-dimensions by using the band range value method.
After identifying the fast changing attributes of the primary customer dimension and determining the band ranges for these attributes, a new mini-dimension called Cust_Mini_Dim is formed (figure not reproduced here).
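A minimal sketch of the band-range idea in Python; the band edges and customer rows are invented for illustration:

```python
# Band a fast-changing attribute (age) into discrete ranges before it
# goes into the mini-dimension.
def age_band(age):
    if age < 21:
        return "Under 21"
    elif age < 40:
        return "21-39"
    elif age < 60:
        return "40-59"
    return "60+"

# The distinct band values form the rows of the mini-dimension, so it
# stays small no matter how often the underlying ages change.
customers = [{"id": 1, "age": 25}, {"id": 2, "age": 25}, {"id": 3, "age": 63}]
bands = sorted({age_band(c["age"]) for c in customers})
print(bands)  # ['21-39', '60+']
```

Each fact row then carries a foreign key to the matching band row instead of the raw value, which is what keeps the mini-dimension from growing with every change.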