dwh intrw qutns

Feb 08, 2018

Chava Teja
  • 7/22/2019 dwh intrw qutns

    1/25

    What is a Data warehouse?

    A: A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources and stored in a database so that queries and reports can be run more easily and efficiently.

    A data warehouse is a logical collection of information gathered from many different operational databases and used to create business intelligence that supports business analysis and decision-making. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP).

    What are the characteristics of a Data Warehouse?

    Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files.

    Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. That is, if two different source systems store conflicting data about entities, or attributes of an entity, the differences need to be resolved while the source data is transformed and loaded into the data warehouse.

    Non-Volatile: The information is stable and doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

    Time-Variant: The warehouse contains a history of the subject as well as current information. Historical information is an important component of a data warehouse.

    Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users.

    Process-Oriented: It is important to view data warehousing as a process for delivering information. The maintenance of a data warehouse is ongoing and iterative in nature.

    What are the advantages of a Data warehouse?

    Enhances end-user access to a wide variety of data.

    Lets business decision makers obtain various kinds of trend reports.

    Increases data consistency.

    Potentially lowers computing costs and increases productivity.

    Provides a place to combine related data from separate sources.

    Creates a computing infrastructure that can support changes in computer systems and business structures.

    Empowers end-users to perform any level of ad-hoc queries or reports without impacting the performance of the operational systems.

    Data Warehouse Approaches

    Posted by Santosh Kumar Gidadmani on January 9, 2011

    There are two major approaches to data warehouse design.

    1. Bottom-up approach

    This approach is recommended by Kimball.

    In the bottom-up approach, data marts are created first to provide reporting and analytical capabilities for specific business processes.

    Data marts contain primarily dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. A single data mart often models a specific business area such as Sales or Production.

    These data marts can eventually be integrated to create a comprehensive data warehouse. The integration of the data marts in the data warehouse is centered on the conformed dimensions.

    The actual integration of two or more data marts is then done by a process known as "drill across". A drill-across works by grouping (summarizing) the data along the keys of the (shared) conformed dimensions of each fact participating in the drill-across, followed by a join on the keys of these grouped (summarized) facts.
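
    The drill-across just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two fact lists, the `product_key` values and the helper names `summarize` and `drill_across` are hypothetical and not part of the original post. Each fact is first summarized along the conformed key, and the summaries are then joined on that key.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_key, measure). product_key stands in for
# the key of a conformed Product dimension shared by both data marts.
sales_fact = [("P1", 100.0), ("P2", 40.0), ("P1", 60.0)]
inventory_fact = [("P1", 7), ("P2", 3), ("P3", 5)]


def summarize(rows):
    """Group fact rows by the conformed key and sum the measure."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def drill_across(left, right):
    """Join two summarized facts on the keys they share (an inner join)."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


sales_by_product = summarize(sales_fact)
stock_by_product = summarize(inventory_fact)
report = drill_across(sales_by_product, stock_by_product)
```

    The sketch uses an inner join, so a product present in only one data mart (here `P3`) drops out of the report; a real drill-across may prefer an outer join.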

    Some consider it an advantage of the Kimball method that the data warehouse ends up being "segmented" into a number of logically self-contained and consistent data marts, rather than a big and often complex centralized model.

    Business value can be returned as soon as the first data mart is built.

    2. Top-down approach

    This approach is recommended by Bill Inmon.

    Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model.

    In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

    The top-down design methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository.

    Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task.

    The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope, which demands considerable cost and time.

    In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.

    Normalization

    This is a data modeling technique that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called First Normal Form, Second Normal Form and Third Normal Form, to achieve a normalized data model.

    A normalized data model may result in many tables/entities with multiple levels of relationships, for example table1 related to table2, table2 further related to table3, table3 related to table4, and so on.

    First Normal Form: The attributes of the entity must be atomic and must depend on the key.

    Second Normal Form: Every non-key attribute must depend on the whole key, not just on part of it.

    Third Normal Form (3NF): Every non-key attribute must depend on nothing but the key.

    In theory there are further rules, called Boyce-Codd Normal Form, Fourth Normal Form and Fifth Normal Form. In practice we don't use the rules beyond 3NF.

    OLAP

    In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

    MOLAP

    This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

    Advantages:

    Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

    Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

    Disadvantages:

    Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

    Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.

    ROLAP

    This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
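
    To make the "slice equals WHERE clause" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns and the `slice_total` helper are assumptions made up for illustration; a real ROLAP tool generates far more elaborate SQL.

```python
import sqlite3

# Hypothetical relational sales table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", 2010, 100.0),
        ("East", 2011, 150.0),
        ("West", 2010, 80.0),
        ("West", 2011, 120.0),
    ],
)


def slice_total(region=None, year=None):
    """Sum sales, adding one WHERE condition per dimension the user fixes."""
    conditions, params = [], []
    if region is not None:
        conditions.append("region = ?")
        params.append(region)
    if year is not None:
        conditions.append("year = ?")
        params.append(year)
    where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
    query = f"SELECT SUM(amount) FROM sales{where}"
    return conn.execute(query, params).fetchone()[0]
```

    Calling `slice_total()` aggregates everything, `slice_total(region="East")` slices on one dimension, and `slice_total(region="West", year=2011)` dices on two.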

    Advantages:

    Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on the amount of data.

    Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

    Disadvantages:

    Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

    Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability for users to define their own functions.

    HOLAP

    HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

    Q: What is data mining?

    A: Data mining is a process of extracting hidden trends from within a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area.

    Q: What are Data Marts?

    A: Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

    There are two types of data marts:

    1. Independent data marts are sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.

    2. Dependent data marts are sourced directly from enterprise data warehouses.

    Q: What is OLTP?

    A: Online Transaction Processing.

    Q: What is OLAP?

    A: Online Analytical Processing.

    Q: What are the differences between OLTP and OLAP?

    A: The main differences between OLTP and OLAP are:

    1. User and System Orientation

    OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.

    OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

    2. Data Contents

    OLTP: manages current data, very detail-oriented.

    OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

    3. Database Design

    OLTP: adopts an entity-relationship (ER) model and an application-oriented database design.

    OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.

    4. View

    OLTP: focuses on the current data within an enterprise or department.

    OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

    Q: What is real-time data warehousing?

    A: Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

    Q: What are the steps to build the data warehouse?

    A:

    Gather business requirements
    Identify sources
    Identify facts
    Define dimensions
    Define attributes
    Redefine dimensions and attributes
    Organize the attribute hierarchy and define relationships
    Assign unique identifiers
    Additional conventions: cardinality / adding ratios

    Q: What is a cube in data warehousing?

    A: Cubes are logical representations of multidimensional data. The edges of the cube contain the dimension members, and the body of the cube contains the data values.
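
    One way to picture edges and body is a dictionary keyed by one member from each dimension. The sketch below is an assumption-laden toy, not how any particular OLAP engine stores cubes: the dimension names, the cell values and the helpers `edge` and `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube body: each cell is keyed by (product, region, quarter).
DIMENSIONS = ("product", "region", "quarter")
cells = {
    ("Laptop", "East", "Q1"): 10,
    ("Laptop", "West", "Q1"): 4,
    ("Phone", "East", "Q1"): 6,
    ("Phone", "East", "Q2"): 9,
}


def edge(dimension):
    """Return the dimension members found along one edge of the cube."""
    axis = DIMENSIONS.index(dimension)
    return sorted({key[axis] for key in cells})


def roll_up(dimension):
    """Sum the body of the cube onto a single edge."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[key[axis]] += value
    return dict(totals)
```

    Here `edge("region")` lists the members on the region edge, while `roll_up("product")` collapses the other two dimensions and keeps one total per product.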

    Q: What is a linked cube?

    A: A linked cube is a cube in which a subset of the data can be analysed in greater detail. The linking ensures that the data in the cubes remains consistent.

    Q: What is the main difference between the Inmon and Kimball philosophies of data warehousing?

    A: They differ in how they approach building the data warehouse.

    Kimball views the data warehouse as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is the set of data marts tied together by conformed dimensions. Hence a unified view of the enterprise can be obtained from dimensional modeling at a local departmental level.

    Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise.

    In short:

    Kimball: first the data marts, which are then combined into the data warehouse.

    Inmon: first the data warehouse, and later the data marts.

    Q: What is a hierarchy in data warehouse terms?

    A: Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

    Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers.

    Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

    When designing hierarchies, you must consider the relationships in business structures, for example a divisional multilevel sales organization.

    Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
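
    The month, quarter and year levels of a time hierarchy can be sketched as successive roll-ups. The monthly figures and the function names below are hypothetical and serve only to illustrate how lower-level values aggregate into their parents.

```python
from collections import defaultdict

# Hypothetical monthly sales for one year, keyed by (year, month).
monthly_sales = {(2010, month): 10 * month for month in range(1, 13)}


def quarter_of(month):
    """Map a month number (1-12) to its parent quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up_to_quarter(monthly):
    """Aggregate month-level values into their parent quarters."""
    quarterly = defaultdict(int)
    for (year, month), value in monthly.items():
        quarterly[(year, quarter_of(month))] += value
    return dict(quarterly)


def roll_up_to_year(quarterly):
    """Aggregate quarter-level values into their parent years."""
    yearly = defaultdict(int)
    for (year, _quarter), value in quarterly.items():
        yearly[year] += value
    return dict(yearly)


quarterly_sales = roll_up_to_quarter(monthly_sales)
yearly_sales = roll_up_to_year(quarterly_sales)
```

    Drilling down is the reverse navigation: starting from a year, look up its child quarters, then the months under each quarter.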

    Q: What are the differences between an RDBMS schema and a data warehouse schema?

    A:

    RDBMS Schema
    * Used for OLTP systems
    * Highly normalized
    * Difficult to understand and navigate
    * Difficult to extract data from and to solve complex problems with

    DWH Schema
    * Used for OLAP systems
    * De-normalized
    * Easy to understand and navigate
    * Relatively easy to extract data from and to solve complex problems with

    Q: What is meant by metadata in the context of a data warehouse?

    A: Metadata is data about data. A business analyst or data modeler usually captures information about the data in a data dictionary, a.k.a. metadata: the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived, and its life cycle). Metadata is also presented at the data mart level and for subsets, facts and dimensions, the ODS, etc. For a DW user, metadata provides vital information for analysis / DSS.

    Q. How is an OLAP system different from an OLTP system?

    A. The main differences between OLTP and OLAP are:

    1. OLTP system: Customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.
       OLAP system: Market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

    2. OLTP system: Manages current data, very detail-oriented.
       OLAP system: Manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

    3. OLTP system: Adopts an entity-relationship (ER) model and an application-oriented database design.
       OLAP system: Adopts a star, snowflake or fact constellation model and a subject-oriented database design.

    4. OLTP system: Focuses on the current data within an enterprise or department.
       OLAP system: Spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

    5. OLTP system: Large volumes of simple transactional queries.
       OLAP system: Small number of diverse queries.

    6. OLTP system: Data changes are continuous and the data is very volatile.
       OLAP system: The data receives periodic updates.

    7. OLTP system: Data processing time is low.
       OLAP system: Data processing time is high.

    8. OLTP system: Highly normalized data.
       OLAP system: Fewer tables, but more columns per table, which reduces the degree of normalization.

    DIFFERENCE BETWEEN STAR AND SNOWFLAKE SCHEMA

    Answers:

    Data Modeling Tools

    Data modeling is the heart and soul of any development project. Be it simple application development or a data warehouse project, data modeling is a core step in the development life cycle. There are a variety of tools available; this article gives a brief overview of a couple of these tools.

    CA ERwin Data Modeler

    fabFORCE.net DBDesigner

    What is BUS Schema?

    A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

    In a BUS schema we would eventually have conformed dimensions and facts defined to be shared across all enterprise data marts. This way all data marts can use the conformed dimensions and facts without having them locally. This is the first step towards building an enterprise data warehouse from Kimball's perspective. For example, we may have different data marts for Sales, Inventory and Marketing, and we need common entities like Customer, Product, etc. to be seen across these data marts; hence it would be ideal to have these as conformed objects. The challenge here is that sometimes each line of business may have a different definition for these conformed objects, so conformed objects have to be chosen and designed with some extra care.

    What is the data type of the surrogate key?

    The data type of the surrogate key is either integer, numeric or number.

    There is no dedicated data type for a surrogate key. The only requirement of a surrogate key is that it is UNIQUE. The recommended data type of a surrogate key is NUMERIC.

    What is a surrogate key? Where do we use it? Explain with an example.

    A surrogate key is a substitute for the natural primary key.

    It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.

    Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the primary keys of the dimension tables. They can use the Informatica Sequence Generator, an Oracle sequence, or SQL Server identity values for the surrogate key.

    It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult.

    Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users). Not only can these change, but indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.

    2. Adapted from response by Vincent on Thursday, March 13, 2003

    Another benefit you can get from surrogate keys (SID) is tracking the SCD (Slowly Changing Dimension).

    Let me give you a simple, classical example:

    On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has a turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June, Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should still belong to Business Unit 'BU1'.

    If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.

    If you use surrogate keys, you could create on the 2nd of June a new record for Employee 'E1' in your Employee Dimension with a new surrogate key.

    This way, in your fact table, you have your old data (before the 2nd of June) with the SID of Employee 'E1' + 'BU1'. All new data (after the 2nd of June) would take the SID of Employee 'E1' + 'BU2'.

    You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit, i.e. 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from a plain natural key enlargement is that you might not have every part of your new key within your fact table, so you might not be able to do the join on the new enlarged key, and so you need another id.

    A surrogate key is a system-generated sequential number which acts as a primary key.
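
    The employee example can be sketched in Python as follows. This is only one way to model it: the `EmployeeRow` class, the `valid_from` column, the turnover amounts and all function names are assumptions introduced for illustration. A new dimension row with a new surrogate key is created when the business unit changes, and each fact row stores the surrogate key that was valid on its date.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count


@dataclass
class EmployeeRow:
    sid: int             # surrogate key, system generated
    employee_code: str   # natural business key
    business_unit: str
    valid_from: date     # hypothetical effective-date column


_next_sid = count(1)   # stands in for a database sequence
employee_dim = []      # Employee dimension rows
turnover_fact = []     # fact rows stored as (sid, amount)


def assign(employee_code, business_unit, valid_from):
    """Add a new dimension row (and surrogate key) for an assignment."""
    row = EmployeeRow(next(_next_sid), employee_code, business_unit, valid_from)
    employee_dim.append(row)
    return row


def current_sid(employee_code, on):
    """Return the surrogate key of the dimension row valid on a given date."""
    rows = [
        r for r in employee_dim
        if r.employee_code == employee_code and r.valid_from <= on
    ]
    return max(rows, key=lambda r: r.valid_from).sid


def record_turnover(employee_code, on, amount):
    turnover_fact.append((current_sid(employee_code, on), amount))


def turnover_by_unit():
    """Join facts to the dimension on the surrogate key and total per unit."""
    unit_of = {r.sid: r.business_unit for r in employee_dim}
    totals = {}
    for sid, amount in turnover_fact:
        totals[unit_of[sid]] = totals.get(unit_of[sid], 0) + amount
    return totals


assign("E1", "BU1", date(2002, 1, 1))
record_turnover("E1", date(2002, 3, 15), 500)
assign("E1", "BU2", date(2002, 6, 2))
record_turnover("E1", date(2002, 7, 1), 300)
```

    Because the join runs on the surrogate key rather than on 'E1', the earlier turnover stays with 'BU1' and the later turnover goes to 'BU2'.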

    What is a lookup table?

    A lookup table is one that is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse), based on the primary key of the target, it updates the table by allowing through only new or updated records that satisfy the lookup condition.

    What is Dimensional Modelling? Why is it important?

    Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

    Why is Data Modeling Important?

    Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

    The goal of the data model is to make sure that all the data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.

    The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term.

    Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the users' requirements.

    What is ETL?

    ETL stands for extraction, transformation and loading.

    ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.

    Extraction: Take data from an external source and move it to the warehouse pre-processor database.

    Transformation: The transform data task allows point-to-point generating, modifying and transforming of data.

    Loading: The load data task adds records to a database table in a warehouse.

    What does the level of granularity of a fact table signify?

    Granularity

    The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:

    Determine which dimensions will be included.
    Determine where along the hierarchy of each dimension the information will be kept.

    The determining factors usually go back to the requirements.

    What is SCD1, SCD2, SCD3?

    SCD stands for slowly changing dimensions.

    SCD1: only the updated values are maintained.

    Ex: when a customer's address is modified, we update the existing record with the new address.

    SCD2: historical and current information are both maintained by using

    A) Effective date
    B) Versions
    C) Flags

    or a combination of these.

    SCD3: historical and current information are both maintained by adding new columns to the target table.
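
    A minimal sketch of the two simplest cases, SCD1 and SCD3, using plain dictionaries as dimension rows. The column names (`address`, `previous_address`) and the sample customer are hypothetical; SCD2, which adds a new row per change, is not covered by this sketch.

```python
def apply_scd1(row, new_address):
    """Type 1: overwrite the value in place, so no history is kept."""
    updated = dict(row)
    updated["address"] = new_address
    return updated


def apply_scd3(row, new_address):
    """Type 3: keep the prior value in an extra column on the same row."""
    updated = dict(row)
    updated["previous_address"] = row["address"]
    updated["address"] = new_address
    return updated


# Hypothetical customer dimension row.
customer = {"customer_id": 1, "address": "12 Oak St"}
scd1_row = apply_scd1(customer, "9 Elm Ave")
scd3_row = apply_scd3(customer, "9 Elm Ave")
```

    With one extra column, SCD3 remembers only the single most recent prior value; a second change would overwrite `previous_address`.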

    What is a Fact table?

    A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys to the dimension tables.
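
    As a rough illustration, the following sketch builds a tiny sales fact table with foreign keys to two dimension tables in an in-memory SQLite database and computes a monthly sales number. All table names, column names and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_month   (month_key   INTEGER PRIMARY KEY, month_name   TEXT);

    -- The fact table holds the measure plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        month_key    INTEGER REFERENCES dim_month(month_key),
        sales_amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_month   VALUES (1, 'Jan'), (2, 'Feb');
    INSERT INTO fact_sales  VALUES (1, 1, 900.0), (2, 1, 300.0), (1, 2, 450.0);
""")

# Monthly sales: join the fact to the month dimension and sum the measure.
monthly_sales = conn.execute("""
    SELECT m.month_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_month AS m ON m.month_key = f.month_key
    GROUP BY m.month_key, m.month_name
    ORDER BY m.month_key
""").fetchall()
```

    The dimension tables supply the descriptive context (`month_name`, `product_name`), while the fact table carries only keys and the numeric measure.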

    What are conformed dimensions?

    Answer 1: Conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. Ex: a Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.

    Answer 2: Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas that contain fact and dimension tables). Suppose Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4, where the Fs are facts and the Ds are dimensions. Here D1 and D2 are the conformed dimensions.
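
    The Cube-1 / Cube-2 example amounts to a set intersection, which can be sketched as below. The dictionary layout and the `conformed_dimensions` helper are assumptions for illustration. Note that sharing a name is not enough in practice: a dimension is conformed only if it means the same thing in every cube.

```python
# The two cubes from the example above, as hypothetical Python structures.
cubes = {
    "Cube-1": {"fact": "F1", "dimensions": {"D1", "D2", "D3"}},
    "Cube-2": {"fact": "F2", "dimensions": {"D1", "D2", "D4"}},
}


def conformed_dimensions(cubes):
    """Return the dimensions shared by every cube."""
    dimension_sets = [cube["dimensions"] for cube in cubes.values()]
    return set.intersection(*dimension_sets)
```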

    What are the different methods of loading Dimension tables?

    Conventional load: Before loading the data, all the table constraints are checked against the data.

    Direct load (faster loading): All the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data is not indexed.

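    What is a conformed fact?

    A conformed fact is a fact (measure) whose definition and units are the same in every data mart where it appears, so it can be compared across fact tables. Conformed facts are used together with conformed dimensions, which are the dimensions that can be used across multiple data marts in combination with multiple fact tables.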

    What are Data Marts?

    Data marts are designed to help managers make strategic decisions about their business. Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

    There are two types of data marts:

    1. Independent data marts are sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.

    2. Dependent data marts are sourced directly from enterprise data warehouses.

    What is the level of granularity of a fact table?

    Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. The level of granularity then means how much detail you are willing to keep for each transactional fact: product sales recorded for each individual transaction, or aggregated up to the minute and stored at that level.
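
    The choice between transaction grain and minute grain can be illustrated with a short sketch. The sample transactions and the `aggregate_to_minute` helper are hypothetical; the point is only that the coarser grain stores fewer rows and loses the per-transaction detail.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction-grain rows: (timestamp, product, amount).
transactions = [
    (datetime(2011, 1, 9, 10, 15, 5), "Laptop", 900.0),
    (datetime(2011, 1, 9, 10, 15, 40), "Laptop", 450.0),
    (datetime(2011, 1, 9, 10, 16, 2), "Laptop", 100.0),
]


def aggregate_to_minute(rows):
    """Coarsen the grain: one row per (minute, product) instead of per sale."""
    totals = defaultdict(float)
    for timestamp, product, amount in rows:
        minute = timestamp.replace(second=0, microsecond=0)
        totals[(minute, product)] += amount
    return dict(totals)


minute_grain = aggregate_to_minute(transactions)
```

    Three transaction-grain rows become two minute-grain rows here, and once loaded at minute grain the individual sales can no longer be recovered.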

    What is the purpose of a Factless Fact Table?

    Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don't really hold facts or measures, but are commonly used for tracking information about an event.

    E.g. to find the number of leaves taken by an employee in a month.

    A tracking process or the collection of a status can be performed by using factless fact tables. The fact table does not have numeric values that can be aggregated, hence the name. It holds only the key values that reference the dimensions, from which the status is collected.
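
    The leave example can be sketched as follows. The rows hold only keys (an employee key and a date key, both hypothetical), and the answer comes from counting rows rather than summing a measure.

```python
from collections import Counter

# Hypothetical factless fact rows: (employee_key, date_key), no measures.
leave_fact = [
    ("E1", "2011-01-03"),
    ("E1", "2011-01-04"),
    ("E2", "2011-01-10"),
    ("E1", "2011-02-01"),
]


def leaves_per_month(rows):
    """Count rows per (employee, month); the row count is the only 'fact'."""
    return Counter((employee, day[:7]) for employee, day in rows)


leave_counts = leaves_per_month(leave_fact)
```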

    What is a junk dimension?

    In scenarios where certain data may not be appropriate to store in the schema, this data (or these attributes) can be stored in a junk dimension. The data in a junk dimension is usually Boolean or flag values.

    E.g. whether an employee's performance was up to the mark, or comments on performance.

    A junk dimension is a single dimension formed by lumping together a number of small dimensions, so its attributes are unrelated to one another. Grouping random flags and text attributes by moving them into a distinct sub-dimension is what creates a junk dimension.

    What are the fundamental stages of Data Warehousing?

    Offline Operational Databases: This is the initial stage of data warehousing. In this stage the database of an operational system is simply copied to an off-line server.

    Offline Data Warehouse: In this stage the data warehouse is updated on a regular time cycle from the operational systems, and the data is persisted in a reporting-oriented data structure.

    Real-Time Data Warehouse: In this stage the data warehouse is updated on a transaction or event basis, every time an operational system performs a transaction.

    Integrated Data Warehouse: In this stage the data warehouse generates activity or transactions that are passed back into the operational systems. These generated transactions are used in the daily activity of the organization.

    What is Virtual Data Warehousing?

    Virtual data warehousing provides an aggregate view of the complete data inventory. It contains a logical enterprise data model, formed from metadata, which is part of the database-of-record infrastructure. The infrastructure consists of published legacy database systems with their metadata extracted. The JEE, JMS and EJB standards are used in the infrastructure for transactional unit requests, and extract-transform-load tools are used for loading real-time bulk data.

    What is active data warehousing?

    Transactional data is captured and stored in the active data warehouse. This repository can be used to find trends and patterns that inform future decision making.

    An active data warehouse aims to capture data continuously and deliver real-time data. It provides a single integrated view of a customer across multiple business lines, and it is associated with Business Intelligence systems.

    Difference between ER Modeling and Dimensional Modeling.

    Dimensional modelling is very flexible from the user's perspective. A dimensional data model is mapped for creating schemas, whereas an ER model is not mapped for creating schemas and is not used to convert normalized data into denormalized form.

    An ER model is used for OLTP databases, which use 1st, 2nd or 3rd normal form, whereas a dimensional data model is used for data warehousing and is typically denormalized (star schema dimension tables are usually not kept in 3rd normal form).

    In short, an ER model contains normalized data, whereas a dimensional model contains denormalized data.

    Describe the various methods of loading Dimension tables.

    The following are the methods of loading dimension tables:

    Conventional Load:

    In this method all the table constraints will be checked against the data, before loading the data.

    Direct Load or Faster Load:

    As the name suggests, the data will be loaded directly without checking the constraints. The data checking

    against the table constraints will be performed later and indexing will not be done on bad data.

    Describe the foreign key columns in fact table and dimension table.

    The primary keys of entity tables are the foreign keys of dimension tables.

    The primary keys of dimension tables are the foreign keys of fact tables.

    What is Data Cardinality?

    Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation.

    There are 3 basic types of cardinality:

    High data cardinality:

    Values of a data column are very uncommon; most or all of them are unique.

    e.g.: email ids and the user names

    Normal data cardinality:

    Values of a data column are somewhat uncommon but never unique.

    e.g.: A data column containing LAST_NAME (there may be several entries of the same last name)

    Low data cardinality:

    Values of a data column are very common; only a few distinct values repeat.

    e.g.: flag statuses: 0/1

    Determining data cardinality is a substantial aspect of data modeling. It is used to determine the relationships.
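The three levels of data cardinality described above can be sketched as a ratio of distinct values to rows. This is only an illustration: the function names and the 0.9 / 0.1 thresholds are assumptions made for the example, not standard values.

```python
def cardinality_ratio(values):
    """Return distinct-value count divided by row count (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def classify_cardinality(values, high=0.9, low=0.1):
    """Label a column as high, normal, or low cardinality.

    The thresholds are illustrative assumptions, not fixed rules.
    """
    ratio = cardinality_ratio(values)
    if ratio >= high:
        return "high"
    if ratio <= low:
        return "low"
    return "normal"


# Sample columns (made-up data) matching the three examples in the text.
emails = [f"user{i}@example.com" for i in range(100)]   # every value unique
last_names = [f"Name{i % 40}" for i in range(100)]      # some repeats
flags = [0, 1] * 50                                     # only two values

email_level = classify_cardinality(emails)
last_name_level = classify_cardinality(last_names)
flag_level = classify_cardinality(flags)
```

In practice the thresholds would be tuned to the data; the point is only that cardinality is a property of how often values repeat within a column.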

    Types of cardinalities:

    The Link Cardinality - 0:0 relationships

    The Sub-type Cardinality - 1:0 relationships

    The Physical Segment Cardinality - 1:1 relationship

    The Possession Cardinality - 0: M relation

    The Child Cardinality - 1: M mandatory relationship

    The Characteristic Cardinality - 0: M relationship

    The Paradox Cardinality - 1: M relationship.

    What are Critical Success Factors?

    Key areas of activity in which favorable results are necessary for a company to reach its goal.

    There are four basic types of CSFs which are:

    Industry CSFs

    Strategy CSFs

    Environmental CSFs

    Temporal CSFs

    A few CSFs are:

    Money

    Your future

    Customer satisfaction

    Quality

    Product or service development

    Intellectual capital

    Strategic relationships

    Employee attraction and retention

    Sustainability

    The advantages of identifying CSFs are:

    they are simple to understand;

    they help focus attention on major concerns;

    they are easy to communicate to coworkers;

    they are easy to monitor;

    and they can be used in concert with strategic planning methodologies.

    SCD Type 1, Slowly Changing Dimension

    Use, Example, Advantage, Disadvantage

    In Type 1 Slowly Changing Dimension, the new information simply overwrites the original

    information. In other words, no history is kept.

    In our example, recall we originally have the following table:

    Customer Key   Name       State
    1001           Williams   New York
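A minimal sketch of a Type 1 change, assuming the dimension is held as a dictionary keyed by Customer Key. The function name is hypothetical, and the move to Los Angeles is borrowed from the Type 2 and Type 3 examples later in this document.

```python
# Dimension table held in memory: customer key -> attributes.
customers = {1001: {"name": "Williams", "state": "New York"}}


def apply_type1_change(dimension, customer_key, attribute, new_value):
    """Overwrite the attribute in place; the previous value is not kept anywhere."""
    dimension[customer_key][attribute] = new_value


apply_type1_change(customers, 1001, "state", "Los Angeles")
```

After the call the row count is unchanged and "New York" can no longer be recovered from the table, which is the "no history is kept" behaviour described above.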

    Customer Key   Name       State
    1001           Williams   New York
    1005           Williams   Los Angeles
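The two-row table above can be sketched as an append rather than an overwrite. The class and function names are hypothetical. Many designs also add effective-date or current-flag columns; they are left out here because the table above does not show them.

```python
from dataclasses import dataclass


@dataclass
class CustomerRow:
    customer_key: int
    name: str
    state: str


def apply_type2_change(rows, name, new_state, new_key):
    """Append a new row under a new key; existing rows are left untouched."""
    rows.append(CustomerRow(new_key, name, new_state))
    return rows


rows = [CustomerRow(1001, "Williams", "New York")]
apply_type2_change(rows, "Williams", "Los Angeles", new_key=1005)
```

Both rows now describe the same customer, so the full history is preserved at the cost of one extra row per change.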

    Advantages: This allows us to accurately keep all historical information.

    Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. This necessarily complicates the ETL process.

    Usage: About 50% of the time.

    When to use Type 2

    Type 2 slowly changing dimension should be used when it is necessary for the data warehouse

    to track historical changes.

    SCD Type 3, Slowly Changing Dimension

    Use, Example, Advantage, Disadvantage

    In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular

    attribute of interest, one indicating the original value, and one indicating the current value.

    There will also be a column that indicates when the current value becomes active.

    In our example, recall we originally have the following table:

    Customer Key   Name       State
    1001           Williams   New York

    To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

    Customer Key Name Original State Current State Effective Date

    After Williams moved from New York to Los Angeles, the original information gets updated, and

    we have the following table (assuming the effective date of change is February 20, 2010):

    Customer Key   Name       Original State   Current State   Effective Date
    1001           Williams   New York         Los Angeles     20-FEB-2010
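The Type 3 update above can be sketched as follows. The function name is hypothetical, and the starting values of current_state and effective_date before the move are assumptions, since the document does not show them.

```python
from datetime import date


def apply_type3_change(row, new_state, effective_date):
    """Overwrite the current value; the original value column is left in place."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date
    return row


row = {
    "customer_key": 1001,
    "name": "Williams",
    "original_state": "New York",
    "current_state": "New York",   # assumed starting value
    "effective_date": None,        # assumed: no change recorded yet
}
apply_type3_change(row, "Los Angeles", date(2010, 2, 20))
```

Calling the function a second time would overwrite current_state again, so any intermediate value is lost while original_state still holds the first one.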

    Advantages: This does not increase the size of the table, since new information is updated in place. This allows us to keep some part of history.

    Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Williams later moves to Texas, the Los Angeles information will be lost.

    Usage: Type 3 is rarely used in actual practice.

    When to use Type 3

    Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

    Explain degenerated dimension in detail.

    Degenerate dimension: A column in the key section of the fact table that has no associated dimension table but is used for reporting and analysis is called a degenerate dimension or line item dimension. For example, suppose we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no and date in the key section, and price, quantity and amount in the measure section. In this fact table, bill_no is a single value with no associated dimension table. Instead of creating a separate dimension table for that single value, we can include it in the fact table to improve performance. So here the column bill_no is a degenerate dimension or line item dimension.

    A degenerate dimension is arrived at through a gradual modeling approach following Dimensional Modeling standards. Let's take the example of a Star Schema representing Sales Invoices. The FACT would have the "Invoiced Amount" as its primary measure. When we look at the source of the invoice, it is the body of the paper invoice that gives us the following particulars about each invoice:

    Invoice Date

    Customer ID

    Products within the Invoice

    Reference to Order Number(s)

    Invoice Number

    Invoice Line Numbers (which are multiple lines in single Invoice)

    Invoice Line Amount

    Invoice Total Amount

    When we model the above following Dimensional Modeling standards, we get the following distinct Dimensions:

    Calendar Dimension - representing the Invoice Date

    Customer Dimension - representing Customer ID

    Product Dimension - representing Products within the Invoice

    Order Dimension - representing Orders

    Invoice Dimension - representing Invoice Number & Invoice Line Numbers

    The question is: what attributes would be left to be part of the INVOICE DIMENSION, if we decide to have one at all? The only candidate attributes are Invoice Number and Invoice Line Numbers. But this is at the granularity of the FACT, which stores references to all of the above dimensions as well as the measures, i.e. Invoice Line Amount and Invoice Total Amount (derived by aggregation).

    It is in this situation that we may decide to degenerate the attributes Invoice Number & Invoice Line Number into the Fact and avoid having a distinct entity to represent Invoice Number / Line Numbers as a Dimension. What we achieve by this:

    1. avoiding a huge join as both Fact and this Dimension would have the same

    granularity,

    2. still able to query with Invoice Number as the entry point

    So, when such a scenario appears, we make the left out attributes (i.e.

    Invoice Number & Invoice Line Number in our case) part of the Fact and part of

    the Primary Key in the Fact. This is why and how we model Degenerated dimension.
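A small sketch of the invoice example using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for the illustration; only the idea of keeping Invoice Number and Invoice Line Number in the fact table as part of its primary key comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_dim (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT
    );

    -- invoice_number and invoice_line_number are degenerate dimensions:
    -- they live in the fact table and have no dimension table of their own.
    CREATE TABLE invoice_fact (
        invoice_number      TEXT    NOT NULL,
        invoice_line_number INTEGER NOT NULL,
        customer_key        INTEGER NOT NULL REFERENCES customer_dim (customer_key),
        invoice_line_amount REAL    NOT NULL,
        PRIMARY KEY (invoice_number, invoice_line_number)
    );

    INSERT INTO customer_dim VALUES (1, 'Williams');
    INSERT INTO invoice_fact VALUES
        ('INV-100', 1, 1, 40.0),
        ('INV-100', 2, 1, 60.0),
        ('INV-101', 1, 1, 25.0);
    """
)

# The invoice number is still usable as a query entry point, with no extra join.
invoice_total = conn.execute(
    "SELECT SUM(invoice_line_amount) FROM invoice_fact WHERE invoice_number = ?",
    ("INV-100",),
).fetchone()[0]

line_count = conn.execute(
    "SELECT COUNT(*) FROM invoice_fact WHERE invoice_number = ?",
    ("INV-100",),
).fetchone()[0]
```

Here the invoice total is derived by aggregating line amounts, which matches the "derived by aggregation" note above.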

    A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the Fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the Fact table.

    A degenerate dimension is when the dimension attribute is stored as part of fact

    table, and not in a separate dimension table. These are essentially dimension keys

    for which there are no other attributes. In a data warehouse, these are often used

    as the result of a drill through query to analyze the source of an aggregated

    number in a report. You can use these values to trace back to transactions in the

    OLTP system.

    1. What is a junk dimension? Give an example.

    When developing a dimensional model, we often encounter miscellaneous flags and indicators. These

    flags do not logically belong to the core dimension tables.

    A junk dimension is a grouping of low-cardinality flags and indicators. It helps avoid a cluttered data warehouse design, provides an easy way to access the dimensions from a single point of entry, and improves the performance of SQL queries.

    Example: Assume that there are two dimension tables (gender and marital status). The data of these two tables are shown below:

    Code:

    Table: Gender
    Id  Gender_status
    -----------------
    1   Male
    2   Female

    Table: Marital Status
    Id  Marital_Status
    ------------------
    1   Single
    2   Married

    Here both dimensions hold low-cardinality flags. Keeping them separate means maintaining two tables and can decrease the performance of SQL queries.

    We can combine these two dimensions into a single table by cross joining and can maintain a single

    dimension table. The result of cross join is shown below:

    Code:

    id  gender  mrg_status
    ----------------------
    1   Male    Single
    2   Male    Married
    3   Female  Single
    4   Female  Married

    This new dimension table is called a junk dimension. It improves manageability and the performance of SQL queries.
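The cross join above can be sketched with itertools.product, which yields every pairing of the two value lists. The variable names are assumptions made for the example; the values and column names follow the tables above.

```python
from itertools import product

genders = ["Male", "Female"]
marital_statuses = ["Single", "Married"]

# Cross join: one junk-dimension row per (gender, marital status) combination,
# numbered from 1 in the order the combinations are generated.
junk_dimension = [
    {"id": row_id, "gender": gender, "mrg_status": status}
    for row_id, (gender, status) in enumerate(
        product(genders, marital_statuses), start=1
    )
]
```

The fact table then carries a single junk-dimension id instead of one foreign key per flag.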

    In data warehouse design, we frequently run into a situation where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep that information in the fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues.

    A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk dimension table is the combination of all possible values of the individual indicator fields.

    Let's look at an example. Assuming that we have the following fact table:

    In this example, the last 3 fields are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension,

    resulting in the following fact table:

    Note that now the number of dimensions in the fact table went from 7 to 5.

    The content of the junk dimension table would look like the following:

    In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.

    By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data warehousing environment that offers better performance and is easier to manage.
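The 3 x 2 x 2 = 12 arithmetic can be checked with a short sketch that also shows how a fact row would look up its junk-dimension key during loading. The actual field values are not given in the text, so the codes 1 to 3 and the "Y"/"N" indicators below are assumptions, as are the function and variable names.

```python
from itertools import product

TXN_CODES = [1, 2, 3]        # assumed values; the text only says there are 3
COUPON_INDS = ["Y", "N"]     # assumed values; the text only says there are 2
PREPAY_INDS = ["Y", "N"]     # assumed values; the text only says there are 2

# One surrogate key per combination of the three indicator fields.
junk_key_by_combo = {
    combo: key
    for key, combo in enumerate(
        product(TXN_CODES, COUPON_INDS, PREPAY_INDS), start=1
    )
}


def junk_key(txn_code, coupon_ind, prepay_ind):
    """Return the junk-dimension key that replaces the three fact-table fields."""
    return junk_key_by_combo[(txn_code, coupon_ind, prepay_ind)]


row_count = len(junk_key_by_combo)
```

Each fact row stores one key from this lookup instead of three separate indicator columns, which is where the reduction by 2 fields comes from.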

    What are conformed dimensions?

    A conformed dimension can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts. Date is a

    common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have

    the same meaning when joined to any fact table. A conformed product dimension with product

    name, description, SKU, and other common attributes could exist in multiple data marts, each

    containing data for one store in a chain.

    Examples of obvious conformed dimensions include Customer, Location, Organization, Time, and

    Product.
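A minimal sketch of a conformed Date dimension shared by two fact tables, using Python's built-in sqlite3 module. All table names, column names and sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One date dimension, referenced by both fact tables.
    CREATE TABLE date_dim (
        date_key INTEGER PRIMARY KEY,
        month    TEXT,
        year     INTEGER
    );
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim (date_key),
        sales_amount REAL
    );
    CREATE TABLE returns_fact (
        date_key      INTEGER REFERENCES date_dim (date_key),
        return_amount REAL
    );

    INSERT INTO date_dim VALUES (20100220, 'February', 2010), (20100301, 'March', 2010);
    INSERT INTO sales_fact VALUES (20100220, 100.0), (20100301, 50.0), (20100301, 25.0);
    INSERT INTO returns_fact VALUES (20100220, 10.0), (20100301, 5.0);
    """
)


def total_by_month(fact_table, measure):
    """Sum a measure per month by joining a fact table to the shared date_dim.

    The table and column names are fixed by the calls below, not user input.
    """
    query = f"""
        SELECT d.month, SUM(f.{measure})
        FROM {fact_table} AS f
        JOIN date_dim AS d ON d.date_key = f.date_key
        GROUP BY d.month
    """
    return dict(conn.execute(query).fetchall())


sales_by_month = total_by_month("sales_fact", "sales_amount")
returns_by_month = total_by_month("returns_fact", "return_amount")
```

Because both results are keyed by the same month attribute from the same dimension, they can be compared side by side without any translation step.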

    Conformed dimensions are ordinary dimensions that are linked to many fact tables, whereas a regular dimension is linked only to its own fact table.

    The following figure gives the exact difference between these two...

    Here F1, F2,... are Fact tables

    D1, D2, ... are Dimension tables, whereas D3, D4 and D5 are Conformed Dimensions.

    Rapidly Changing Dimensions

    Posted by Santosh Kumar Gidadmani on April 12, 2011

    Fast changing dimensions are dimensions in which one or more attributes change frequently and in many rows. A fast changing dimension can grow very large if we use the Type-2 approach to track numerous changes. These dimensions are sometimes called rapidly changing dimensions.

    Examples of fast changing dimensions are

    Age

    Income

    Test score

    Rating

    Credit history score

    Customer account status

    Weight

    How to solve fast changing dimensions

    After identifying the fast changing attributes of a dimension, you have to create a mini-dimension table with these attributes, joined directly to the fact table rather than snowflaked.

    A mini-dimension is a dimension that usually contains fast changing attributes of a larger

    dimension table. This is to improve accessibility to data in the fact table.

    Before creating a mini-dimension table, you have to convert each of the identified attributes into band ranges. The concept behind this method is to take a limited set of discrete values as shown below.

    Rows in mini-dimensions will be fewer than rows in large dimension tables because the band range method restricts the rows in the mini-dimension.
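The band range method can be sketched as follows. The band edges, labels and function names are all assumptions chosen for the illustration; the original figures with the actual bands are not reproduced in this transcript.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical band edges: a value falls in the band whose lower edge it has reached.
AGE_EDGES = [18, 30, 45, 60]
AGE_LABELS = ["under 18", "18-29", "30-44", "45-59", "60+"]

INCOME_EDGES = [25_000, 50_000, 100_000]
INCOME_LABELS = ["under 25k", "25k-49,999", "50k-99,999", "100k+"]


def to_band(value, edges, labels):
    """Map a continuous value to its band label."""
    return labels[bisect_right(edges, value)]


# The mini-dimension holds one row per combination of bands, not per customer.
cust_mini_dim = {
    combo: key
    for key, combo in enumerate(product(AGE_LABELS, INCOME_LABELS), start=1)
}


def mini_dim_key(age, income):
    """Return the mini-dimension key for a customer's current age and income."""
    bands = (
        to_band(age, AGE_EDGES, AGE_LABELS),
        to_band(income, INCOME_EDGES, INCOME_LABELS),
    )
    return cust_mini_dim[bands]
```

With these bands the mini-dimension has 5 x 4 = 20 rows regardless of how many customers exist, and a customer whose age goes from 31 to 32 keeps pointing at the same row.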

    After identifying the fast changing attributes of the primary customer dimension, and

    determining the band ranges for these attributes, a new mini-dimension is formed called

    Cust_Mini_Dim as shown below.

    http://santoshbidw.wordpress.com/2011/04/12/fast-changing-dimensions/fcd2/

    http://santoshbidw.wordpress.com/2011/04/12/fast-changing-dimensions/fcd1/
    http://santoshbidw.wordpress.com/2011/04/12/fast-changing-dimensions/fcd3/