
Ch4_data Access and Delivery

May 30, 2018

  • 8/14/2019 Ch4_data Access and Delivery

    1/24

    1. Starter
    2. OLTP
        2.1 A Data Warehouse Supports OLTP
    3. OLAP in Data Warehouse
        3.1. OLAP Example
        3.2. Comparison of OLAP and OLTP
        3.3. Storing Active OLAP data
        3.4. Multi-dimension OLAP (MOLAP)
        3.5. Relational OLAP (ROLAP)
    4. Design the Dimensional Model
    5. Dimensional Model Schemas
        5.1 Star schema
        5.2 Aggregation
        5.3 Snowflake Schemas
        5.4 Star or Snowflake
    6. Dimension Tables
    7. Hierarchies
    8. Surrogate Keys
    9. Date and Time Dimensions
        9.1 Time Granularity
        9.2 Date and Time Dimension Attributes
    10. Fact Tables
        10.1 Multiple Fact Tables
        10.2 Additive and Non-additive Measures
        10.3 Calculated Measures
        10.4 Fact Table Keys

    INFORMATION ACCESS AND DELIVERY

    DATA WAREHOUSING

    OLAP is the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data

    SUSHIL KULKARNI

    1. Starter

    This chapter introduces OLTP and OLAP and discusses the design issues to be considered when developing a data warehouse.

    2. OLTP

    Data warehouses support business decisions by collecting, consolidating, and organizing data for reporting and analysis with tools such as online analytical processing (OLAP) and data mining. Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing (OLTP) system database.

    The following table gives the differences between a data warehouse database and an OLTP database.

    Data warehouse database | OLTP database
    ----------------------- | -------------
    Designed for analysis of business measures by categories and attributes | Designed for real-time business operations
    Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table
    Loaded with consistent, valid data; requires no real-time validation | Optimized for validation of incoming data during transactions; uses validation data tables
    Supports few concurrent users relative to OLTP | Supports thousands of concurrent users

    2.1 A Data Warehouse Supports OLTP

    A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database.

    Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database. If data is simply archived for preservation, it is not available or organized for use by analysts and decision-makers. If data is allowed to accumulate in the OLTP database so it can be used for analysis, the OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.

    A data warehouse offloads the historical data from the OLTP system, allowing the OLTP system to operate at peak transaction efficiency. High-volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP system, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.
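    The offloading step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names oltp_orders and dw_orders, their columns, the sample rows and the cutoff date are all hypothetical, and a single in-memory SQLite database stands in for what would normally be two separate systems.

```python
import sqlite3

def offload_history(conn, cutoff_date):
    """Copy orders older than cutoff_date to the warehouse table,
    then delete them from the OLTP table. Returns the number of rows moved."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO dw_orders (order_id, order_date, amount) "
            "SELECT order_id, order_date, amount "
            "FROM oltp_orders WHERE order_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute(
            "DELETE FROM oltp_orders WHERE order_date < ?", (cutoff_date,)
        )
    return cur.rowcount

# Hypothetical tables and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oltp_orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    CREATE TABLE dw_orders   (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    INSERT INTO oltp_orders VALUES
        (1, '2017-03-01', 100.0),
        (2, '2017-11-15', 250.0),
        (3, '2018-02-10',  75.0);
""")

moved = offload_history(conn, "2018-01-01")
print(moved, "rows moved to the warehouse")
```

    After the call, the OLTP table holds only current rows, while the warehouse table accumulates the history that analytical queries need.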

    3. OLAP in Data Warehouse

    A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers, which are optimized for handling specific data management problems. Until recently, organizations have tried to target relational database management systems (RDBMSs) for the complete spectrum of database applications. It is, however, apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of on-line analytical processing (OLAP). OLAP was a term coined by E. F. Codd (1993) and was defined by him as:

    The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data

    Codd has developed rules or requirements for an OLAP system:

    o multidimensional conceptual view
    o transparency
    o accessibility
    o consistent reporting performance
    o client/server architecture
    o generic dimensionality
    o dynamic sparse matrix handling
    o multi-user support
    o unrestricted cross-dimensional operations
    o intuitive data manipulation
    o flexible reporting
    o unlimited dimensions and aggregation levels

    An alternative definition of OLAP has been supplied by Nigel Pendse who, unlike Codd, does not mix technology prescriptions with application requirements. Pendse defines OLAP as:

    Fast Analysis of Shared Multidimensional Information

    Which means:

    Fast in that users should get a response within seconds, so that they do not lose their chain of thought;

    Analysis in that the system can provide analysis functions in an intuitive manner, and that the functions should supply business logic and statistical analysis relevant to the user's application;

    Shared from the point of view of supporting multiple users concurrently;

    Multidimensional as a main requirement, so that the system supplies a multidimensional conceptual view of the data, including support for multiple hierarchies;

    Information is the data and the derived information required by the user application.

    One question is what multidimensional data is and when it becomes OLAP. It is essentially a way to build associations between dissimilar pieces of information using predefined business rules about the information you are using. The following three components of OLAP have been identified:

    o A multidimensional database must be able to express complex business calculations very easily. The data must be referenced and mathematically defined. In a relational system there is no relation between line items, which makes it very difficult to express business mathematics.

    o Mining hierarchies are required so that the user can roam around the data.

    o Instant response is the need to give the user the information as quickly as possible.

    Dimensional databases are not a solution for every problem, as they are not suited to storing all types of data, such as listings of all customer addresses and purchase orders.

    Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers: the user is free to explore the data and receive the type of report they want without being restricted to a set format.

    OLAP can alternatively be defined as follows:

    OLAP applications and tools are those that are designed to ask ad hoc, complex queries of large multidimensional collections of data. It is for this reason that OLAP is often mentioned in the context of data warehouses.

    3.1. OLAP Example

    An example of an OLAP database is one comprised of sales data which has been aggregated by region, product type, and sales channel.

    A typical OLAP query might access a multi-gigabyte, multi-year sales database in order to find all product sales in each region for each product type. After reviewing the results, an analyst might further refine the query to find the sales volume for each sales channel within region/product classifications. As a last step, the analyst might want to perform year-to-year or quarter-to-quarter comparisons for each sales channel. This whole process must be carried out on-line with rapid response time so that the analysis process is undisturbed.
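    The three analysis steps can be sketched as three queries of increasing detail. The sketch below is an assumption-laden illustration: the sales table, its columns and its six rows are hypothetical, and an in-memory SQLite database stands in for the multi-gigabyte sales database.

```python
import sqlite3

# Hypothetical sales table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INTEGER, region TEXT, product_type TEXT,
                        channel TEXT, amount REAL);
    INSERT INTO sales VALUES
        (2017, 'North', 'Hardware', 'Retail', 100),
        (2017, 'North', 'Hardware', 'Online',  50),
        (2017, 'South', 'Software', 'Retail',  80),
        (2018, 'North', 'Hardware', 'Retail', 120),
        (2018, 'North', 'Hardware', 'Online',  90),
        (2018, 'South', 'Software', 'Retail',  60);
""")

# Step 1: sales in each region for each product type.
by_region_type = conn.execute(
    "SELECT region, product_type, SUM(amount) FROM sales "
    "GROUP BY region, product_type ORDER BY region, product_type"
).fetchall()

# Step 2: refine to each sales channel within region/product classifications.
by_region_type_channel = conn.execute(
    "SELECT region, product_type, channel, SUM(amount) FROM sales "
    "GROUP BY region, product_type, channel "
    "ORDER BY region, product_type, channel"
).fetchall()

# Step 3: year-to-year comparison for each sales channel.
channel_by_year = conn.execute(
    "SELECT channel, "
    "       SUM(CASE WHEN year = 2017 THEN amount ELSE 0 END), "
    "       SUM(CASE WHEN year = 2018 THEN amount ELSE 0 END) "
    "FROM sales GROUP BY channel ORDER BY channel"
).fetchall()

for row in channel_by_year:
    print(row)
```

    Each step reuses the same data at a finer or differently sliced level, which is why rapid response matters: the analyst decides on the next query only after seeing the previous result.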

    OLAP queries can be characterized as on-line transactions which:

    o Access very large amounts of data, e.g. several years of sales data.

    o Analyse the relationships between many types of business elements, e.g. sales, products, regions, channels.

    o Involve aggregated data, e.g. sales volumes, budgeted Rs. and Rs. spent.

    o Compare aggregated data over hierarchical time periods, e.g. monthly, quarterly, yearly.

    o Present data in different perspectives, e.g. sales by region vs. sales by channels by product within each region.

    o Involve complex calculations between data elements, e.g. expected profit as calculated as a function of sales revenue for each type of sales channel in a particular region.

    o Are able to respond quickly to user requests so that users can pursue an analytical thought process without being stymied by the system.

    3.2. Comparison of OLAP and OLTP

    OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple.

    A typical customer order entry OLTP transaction might retrieve all data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number, which is used to relate the rows from the different tables. The relationships between the records are simple and only a few records are actually retrieved or updated by a single transaction.
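    An order entry transaction of this kind might look roughly as follows. This is a sketch, not a real order system: the three table layouts, the sample customer and the function place_order are hypothetical, chosen only to mirror the customer, customer order and detail line tables mentioned above.

```python
import sqlite3

# Hypothetical tables; each carries the customer identification number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer       (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY,
                                 customer_id INTEGER REFERENCES customer(customer_id));
    CREATE TABLE detail_line    (order_id INTEGER REFERENCES customer_order(order_id),
                                 customer_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO customer VALUES (42, 'A. Customer');
""")

def place_order(conn, customer_id, lines):
    """Look up one customer, then insert one order and its detail lines
    as a single short transaction. Returns the new order id."""
    with conn:  # commits on success, rolls back if an exception is raised
        customer = conn.execute(
            "SELECT customer_id, name FROM customer WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        if customer is None:
            raise ValueError("unknown customer")
        cur = conn.execute(
            "INSERT INTO customer_order (customer_id) VALUES (?)", (customer_id,)
        )
        order_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO detail_line VALUES (?, ?, ?, ?)",
            [(order_id, customer_id, product, qty) for product, qty in lines],
        )
    return order_id

order_id = place_order(conn, 42, [("keyboard", 1), ("mouse", 2)])
print("created order", order_id)
```

    Note how few rows are touched: one customer row is read and three rows are written, which is the access pattern an OLTP server is optimized for.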

    The difference between OLAP and OLTP has been summarized as follows: OLTP servers handle production data accessed through simple queries, while OLAP servers handle management-critical data accessed through an iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require specially optimized servers for the two types of processing.

    3.3. Storing Active OLAP data

    To 'store' in this context means holding the data in a persistent form (for at least the duration of a session, and often shared between users), not simply for the time required to process a single query.

    Relational database: This is an obvious choice, particularly if the data is sourced from an RDBMS. In most cases, the data would be stored in a denormalised structure such as a star schema, or one of its variants, such as snowflake; a normalized database would not be appropriate for performance and other reasons. Often, summary data will be held in aggregate tables.

    Multidimensional database: In this case, the active data is stored in a multidimensional database on a server. It may include data extracted and summarized from legacy systems or relational databases and from end-users. It is usually possible (and sometimes compulsory) for data to be pre-computed and the results stored in some form of array structure.

    Client-based files: In this case, relatively small extracts of data are held on client machines. They may be distributed in advance, or created on demand (possibly via the Web).
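    To make the idea of pre-computed results held in an array structure concrete, here is a small Python sketch. It is not how any particular multidimensional database stores its data; the dimension members, the detail rows and the convention of keeping totals in an extra last slot on each axis are all assumptions made for illustration.

```python
from itertools import product as cartesian

regions = ["North", "South"]
product_types = ["Hardware", "Software"]
quarters = ["Q1", "Q2"]

# Hypothetical detail rows: (region, product type, quarter, amount).
rows = [
    ("North", "Hardware", "Q1", 10),
    ("North", "Software", "Q2", 20),
    ("South", "Hardware", "Q1", 30),
    ("South", "Hardware", "Q2", 40),
]

# One extra slot at the end of each axis holds the pre-computed total.
shape = (len(regions) + 1, len(product_types) + 1, len(quarters) + 1)
cube = [[[0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]

for region, ptype, quarter, amount in rows:
    r = regions.index(region)
    p = product_types.index(ptype)
    q = quarters.index(quarter)
    # Add the amount to the detail cell and to every aggregate cell covering it.
    for i, j, k in cartesian((r, -1), (p, -1), (q, -1)):
        cube[i][j][k] += amount

def lookup(region=None, ptype=None, quarter=None):
    """Read one cell; None on an axis means 'all' (the pre-computed total slot)."""
    i = regions.index(region) if region is not None else -1
    j = product_types.index(ptype) if ptype is not None else -1
    k = quarters.index(quarter) if quarter is not None else -1
    return cube[i][j][k]

print(lookup(region="South"))  # total for South over all products and quarters
```

    Because every total is computed at load time, any query at any level of aggregation is a single array lookup. The price is the extra storage and the need to recompute when the data changes.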

    3.4. Multi-dimension OLAP (MOLAP)

    MOLAP tools use specialized data structures and a multi-dimensional DBMS to organize, navigate and analyze data. To enhance query performance, the data is typically aggregated and stored according to predicted usage. The development issues associated with MOLAP are:

    o The underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.

    o Navigation and analysis of data is limited because the data is designed according to previously determined requirements. Data may need to be physically reorganised to optimally support new requirements.

    o MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support.

    3.5. Relational OLAP (ROLAP)

    ROLAP supports RDBMS products through the use of a metadata layer, thus avoiding the requirement to create a static multi-dimensional data structure. This facilitates the creation of multiple multidimensional views of the two-dimensional relation.

    To improve performance, some ROLAP products have enhanced SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalised database designs such as the star schema. The development issues associated with ROLAP technology are:

    o Development of middleware to facilitate the development of multi-dimensional applications; that is, software that converts the two-dimensional relation into a multi-dimensional structure.

    o Development of an option to create persistent, multidimensional structures with facilities to assist in the administration of these structures.
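    The role of such middleware can be suggested with a toy example. The sketch below is not taken from any ROLAP product: the relation, the column list that plays the part of the metadata layer, and the function cube_view are hypothetical. It shows only the principle that different multidimensional views are derived on demand from the same two-dimensional relation.

```python
from collections import defaultdict

# Hypothetical metadata: the column names of the two-dimensional relation.
COLUMNS = ("region", "product", "channel", "amount")

# Hypothetical rows of the relation.
relation = [
    ("North", "Hardware", "Retail", 100),
    ("North", "Hardware", "Online", 50),
    ("South", "Software", "Retail", 80),
]

def cube_view(relation, columns, dimensions, measure):
    """Build a multidimensional view: one cell per combination of the
    requested dimension values, holding the summed measure."""
    dim_idx = [columns.index(d) for d in dimensions]
    measure_idx = columns.index(measure)
    cells = defaultdict(float)
    for row in relation:
        key = tuple(row[i] for i in dim_idx)
        cells[key] += row[measure_idx]
    return dict(cells)

# Two different views of the same relation; nothing is stored in advance.
by_region_product = cube_view(relation, COLUMNS, ("region", "product"), "amount")
by_channel = cube_view(relation, COLUMNS, ("channel",), "amount")
print(by_region_product)
print(by_channel)
```

    In contrast to MOLAP, no static structure exists here: each view is computed when requested, which keeps the design flexible but moves the cost to query time.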

    4. Design the Dimensional Model

    In the previous chapters we learned how to plan and design a data warehouse, and we discussed the different data warehouse architectures.

    User requirements and data realities drive the design of the dimensional model, which must address business needs, grain of detail, and what dimensions and facts to include.

    The dimensional model must suit the requirements of the users and support ease of use for direct access. The model must also be designed so that it is easy to maintain and can adapt to future changes. The model design must result in a relational database that supports OLAP cubes to provide instantaneous query results for analysts.

    An OLTP system requires a normalized structure to minimize redundancy, provide validation of input data, and support a high volume of fast transactions. A transaction usually involves a single business event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider web of hundreds or even thousands of related tables.

    In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and relate to business needs, supports simplified business queries, and provides superior query performance by minimizing table joins.

    For example, contrast the very simplified OLTP data model in the first diagram below with the data warehouse dimensional model in the second diagram. Which one better supports the ease of developing reports and simple, efficient summarization queries?

    For example, the fact table

    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

    can have the following dimension tables:

    Market (Market_Id, City, State, Region)
    Product (Product_Id, Name, Category, Price)
    Time (Time_Id, Week, Month, Quarter)

    The fact and dimension relations can be displayed in an E-R diagram, which suggests a star and is called a star schema. For the above example, we have the following star schema:

    Consider the following star schema, which has the dimension tables

    Store(store_id, name, address)
    Product(prod_id, name, category)
    Date(time_id, day, month, year)

    and the fact table

    Sales(time_id, store_id, prod_id, units_sold)
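    This schema can be tried out directly. The sketch below creates the Store, Product, Date and Sales tables in an in-memory SQLite database; the column types and all sample rows are assumptions added for illustration, and the table name Date is quoted only as a precaution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables
    CREATE TABLE Store   (store_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE Product (prod_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE "Date"  (time_id  INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);

    -- Fact table: one foreign key per dimension plus the measure
    CREATE TABLE Sales (
        time_id    INTEGER REFERENCES "Date"(time_id),
        store_id   INTEGER REFERENCES Store(store_id),
        prod_id    INTEGER REFERENCES Product(prod_id),
        units_sold INTEGER
    );

    -- Hypothetical sample rows
    INSERT INTO Store   VALUES (1, 'S1', 'Address 1'), (2, 'S2', 'Address 2');
    INSERT INTO Product VALUES (1, 'P1', 'Hardware'), (2, 'P2', 'Software');
    INSERT INTO "Date"  VALUES (1, 1, 1, 2018), (2, 1, 2, 2018);
    INSERT INTO Sales   VALUES (1, 1, 1, 5), (1, 2, 1, 3), (2, 1, 2, 4), (2, 2, 2, 6);
""")

# A typical star join: the fact table joins directly to each dimension it needs.
units_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(s.units_sold)
    FROM Sales s
    JOIN Product p ON p.prod_id = s.prod_id
    JOIN "Date"  d ON d.time_id = s.time_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(units_by_category_month)
```

    Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape and keeps summarization queries short.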

    The facts in the cube are the actual values of the Sales relation; the table can be called Quantity, with attributes store_id, time_id, prod_id, and figure. The cube is drawn in the following figure:

    Let us consider another example, a schema given by the following:

    This schema can be viewed with the instances as follows:

    For each dimension, the set of associated attributes can be structured as a hierarchy.

    For example, we have the following hierarchy structure for store and customer:

    Let us see how to add the values in the cube. Consider again the sales of products, which can be represented in one dimension (as a fact relation) or in two dimensions, as clients and products.

    See the following tables:

    In three dimensions we have to take the following fact table, with the corresponding cube with values:

    We will learn about fact tables, dimension tables, and hierarchies in detail later, from section 6 onwards.

    5.2. Aggregation

    Many OLAP queries involve aggregation of the data in the fact table. Aggregate operators can be sum, count, max, min, median, avg, etc.

    For example, to find the total amount for each day, we might use the following SQL statement:

    SELECT Date, sum(Amt)

    FROM SALE

    GROUP BY Date

    The output is as follows:

    Let us consider another query: to find the total amount of each product for each client, we might use the following SQL statement:

    SELECT client, product, sum(amt)

    FROM SALE

    GROUP BY client, product

    The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data, as follows:
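    Both statements can be run as they stand once a SALE table exists. In the sketch below the column layout SALE(client, product, date, amt) is inferred from the two queries, while the column types and the five sample rows are hypothetical and are not the values of the original figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SALE (client TEXT, product TEXT, date INTEGER, amt INTEGER);
    -- Hypothetical rows, for illustration only
    INSERT INTO SALE VALUES
        ('c1', 'p1', 1, 12),
        ('c3', 'p1', 1, 50),
        ('c2', 'p2', 1, 11),
        ('c1', 'p1', 2, 44),
        ('c1', 'p2', 2,  4);
""")

# Total amount for each day: aggregates away client and product.
total_by_date = conn.execute(
    "SELECT date, SUM(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Total amount of each product for each client: aggregates away the date.
total_by_client_product = conn.execute(
    "SELECT client, product, SUM(amt) FROM SALE "
    "GROUP BY client, product ORDER BY client, product"
).fetchall()

print(total_by_date)
print(total_by_client_product)
```

    The ORDER BY clauses are added only to make the output order predictable; they do not change the aggregation. Each dimension left out of the GROUP BY list is the one being summed over.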

    In the multidimensional data model, summarizing information (that is, aggregate values) is usually stored together with the measure values. This is shown in the following figure:

    Computing the sum aggregate operation becomes very easy. Let us write the above values in a cube and then find the sums row-wise and column-wise, as shown in the following figure:

    These values can be inserted into a cube as shown below:

    One can find the average sales by region within store, or by month within date, using the dimension hierarchy. Thus one can find how many products are sold in each region. For instance, suppose customer c1 belongs to region 1 and customers c2 and c3 belong to region 2; then we get the following:
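    The roll-up from customers to regions can be sketched in Python. The assignment of c1 to region 1 and of c2 and c3 to region 2 is taken from the text; the sales figures per customer and product are hypothetical and do not reproduce the original figure.

```python
from collections import defaultdict

# One level of the customer hierarchy, as described in the text.
region_of = {"c1": "region1", "c2": "region2", "c3": "region2"}

# Hypothetical cells of the (customer, product) view.
sales = {
    ("c1", "p1"): 56,
    ("c1", "p2"): 4,
    ("c2", "p2"): 11,
    ("c3", "p1"): 50,
}

def roll_up(cells, parent_of):
    """Replace each customer by its parent in the hierarchy and add up
    the cells that fall together."""
    totals = defaultdict(int)
    for (customer, product), amount in cells.items():
        totals[(parent_of[customer], product)] += amount
    return dict(totals)

by_region = roll_up(sales, region_of)
print(by_region)
```

    The same function works for any level of any hierarchy (store to city, day to month) as long as a child-to-parent mapping is available.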

    One can also find the aggregate sales by city using the dimension hierarchy, as in the following:

    Thus, finally, we can incorporate all the entries in the cube, as shown in the following figure:

    [Figure: cube including the aggregate entries, with city labels NOIDA and PUNE]

    The following example shows the star schema for sales that we discussed at the start of the chapter (see Figure B).

    5.3 Snowflake Schema

    A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact table but must join through other dimension tables. For example, a dimension that describes products may be separated into three tables (snowflaked), as illustrated in the following diagram:
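    As an illustration, a product dimension snowflaked into three tables might be declared and queried as below. The table names product, product_category and product_line, their columns, and the sample rows are hypothetical; the original diagram may use a different split.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The product dimension split into three normalized tables
    CREATE TABLE product_line (
        line_id   INTEGER PRIMARY KEY,
        line_name TEXT
    );
    CREATE TABLE product_category (
        category_id   INTEGER PRIMARY KEY,
        category_name TEXT,
        line_id       INTEGER REFERENCES product_line(line_id)
    );
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES product_category(category_id)
    );
    -- Only the product table joins directly to the fact table
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product(product_id),
        sales_amt  REAL
    );

    -- Hypothetical sample rows
    INSERT INTO product_line     VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO product_category VALUES (10, 'Laptops', 1), (11, 'Printers', 1), (20, 'Games', 2);
    INSERT INTO product          VALUES (100, 'L1', 10), (101, 'P1', 11), (200, 'G1', 20);
    INSERT INTO sales_fact       VALUES (100, 1000), (101, 200), (200, 50), (100, 500);
""")

# Reaching the product line takes three joins instead of one.
sales_by_line = conn.execute("""
    SELECT l.line_name, SUM(f.sales_amt)
    FROM sales_fact f
    JOIN product          p ON p.product_id  = f.product_id
    JOIN product_category c ON c.category_id = p.category_id
    JOIN product_line     l ON l.line_id     = c.line_id
    GROUP BY l.line_name
    ORDER BY l.line_name
""").fetchall()
print(sales_by_line)
```

    The longer join path is the cost of snowflaking; the benefit is that each category and each product line is stored exactly once.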

    A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram (see Figure A).

    FIGURE A

    FIGURE B

    5.4 Star or Snowflake

    1. Both star and snowflake schemas are dimensional models; the difference is in their physical implementations. Snowflake schemas support ease of dimension maintenance because they are more normalized. Star schemas are easier for direct user access and often support simpler and more efficient queries.

    2. The decision to model a dimension as a star or snowflake depends on the nature of the dimension itself, such as how frequently it changes and which of its elements change, and often involves evaluating tradeoffs between ease of use and ease of maintenance. It is often easiest to maintain a complex dimension by snowflaking the dimension. By pulling hierarchical levels into separate tables, referential integrity between the levels of the hierarchy is guaranteed.

    6. Dimension Tables

    Dimension tables encapsulate the attributes associated with facts and separate these attributes into logically distinct groupings, such as time, geographical area, products, customers, and so forth.

    A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or contributes data to data marts. For example, a product dimension may be used with a sales fact table and an inventory fact table in the data warehouse, and also in one or more departmental data marts.

    A dimension such as customer, time, or product that is used in multiple schemas is called a conforming dimension if all copies of the dimension are the same. Summarization data and reports will not correspond if different schemas use different versions of a dimension table. Using conforming dimensions is critical to successful data warehouse design.

    User input and evaluation of existing business reports help in defining the dimensions to include in the data warehouse. For instance, a user who wants to see data by region and by product has just identified two dimensions, region and product. Business reports that group sales by salesperson or sales by customer identify two more dimensions, salesforce and customer. Almost every data warehouse includes a time dimension.

    In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension tables are seldom keyed to date.

    The tuples in a dimension table establish one-to-many relationships with the fact table. For example, there may be a number of sales to a single customer, or a number of sales of a single product.

    The dimension table contains attributes associated with the dimension entry. These attributes are user-oriented textual details, such as product name or customer name and address. Attributes serve as report labels and query constraints.

    Attributes that are coded in an OLTP database should be decoded into descriptions. For example, product category may exist as a simple integer in the OLTP database, but the dimension table should contain the actual text for the category. The code may also be carried in the dimension table if needed for maintenance.
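    Decoding can be sketched as a small transformation applied while loading the dimension. Everything named here is hypothetical: the code table CATEGORY_CODES, the OLTP rows and the function to_dimension_row are illustrations, not part of any particular loading tool.

```python
# Hypothetical code table from the OLTP system.
CATEGORY_CODES = {1: "Hardware", 2: "Software", 3: "Peripherals"}

# Hypothetical OLTP product rows, with the category stored as an integer.
oltp_products = [
    {"product_id": "HW-001", "name": "Laptop", "category_code": 1},
    {"product_id": "SW-007", "name": "Spreadsheet", "category_code": 2},
]

def to_dimension_row(oltp_row):
    """Decode the category code into its text and keep the code alongside
    it for maintenance."""
    code = oltp_row["category_code"]
    return {
        "product_id": oltp_row["product_id"],
        "product_name": oltp_row["name"],
        "category": CATEGORY_CODES.get(code, "Unknown"),  # text used as a report label
        "category_code": code,                            # original code, kept for maintenance
    }

product_dimension = [to_dimension_row(row) for row in oltp_products]
for row in product_dimension:
    print(row)
```

    Reports and query constraints then use the readable category text, while the retained code makes it possible to trace a dimension row back to the source system.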

    7. Hierarchies

    The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business need to group and summarize data into usable information. For example, a time dimension often contains the hierarchy elements: (all time), Year, Quarter, Month, Day, or (all time), Year, Quarter, Week, Day.

    A dimension may contain multiple hierarchies; a time dimension often contains both calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a hierarchy that imposes a structure on sales points, customers, or other geographically distributed dimensions. An example geography hierarchy for sales points is: (all), Country, Region, State, City, Store.

    Note that each hierarchy example has an (all) entry such as (all time), (all stores), (all customers), and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a dimension and permits summarization of fact data to a single number for a dimension. For example, if the first level of a product hierarchy includes product line categories for hardware, software, peripherals, and services, the question "What was the total amount for sales of all products last year?" is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals, and services last year?" In this question, product becomes an artificial category, as it is used to combine the different product line categories.

    [Figure: time hierarchies (Time, Year, Quarter, Month, Day or Time, Year, Quarter, Week, Day) and the geography hierarchy (Country, Region, State, City, Store) applied to geographically distributed dimensions such as Customers and Sales Point]

    The concept of an (all) node at the top of each hierarchy helps reflect the way users want to phrase their questions. OLAP tools depend on hierarchies to categorize data; Analysis Services will by default create an (all) entry for a hierarchy used in a cube if none is specified.

    A hierarchy may be balanced, unbalanced, or composed of parent-child relationships such as an organizational structure.
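    The effect of the (all) node can be shown with a very small Python sketch. The product lines are those named in the text; the sales figures, the dictionary layout and the function total are hypothetical.

```python
# Hypothetical sales per first-level product line category.
line_sales = {"hardware": 500, "software": 300, "peripherals": 120, "services": 80}

# The artificial top-level entry groups all first-level categories.
hierarchy = {"(all products)": list(line_sales)}

def total(node):
    """Sum the fact data under a node of the hierarchy."""
    children = hierarchy.get(node)
    if children is None:  # a leaf: a single product line
        return line_sales[node]
    return sum(total(child) for child in children)

print(total("(all products)"))
```

    Asking for the total of the (all products) node gives the same number as adding hardware, software, peripherals and services by hand, which is exactly the equivalence described above.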

    8. Surrogate Keys

    A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record is carried in the dimension table but is not used as the primary key. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from surrogate keys used for other dimension tables.
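    A surrogate key of this kind can be sketched with SQLite's automatically increasing integer keys. The table customer_dim, its columns, the source key CUST-0042 and the idea of storing a second row when the customer's city changes are assumptions made for illustration; how dimension changes are actually handled is a separate design decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_customer_id TEXT NOT NULL,                      -- original key, carried but not primary
        name               TEXT,
        city               TEXT
    )
""")

def add_customer_version(conn, source_id, name, city):
    """Insert one dimension row and return the surrogate key the warehouse assigned."""
    with conn:
        cur = conn.execute(
            "INSERT INTO customer_dim (source_customer_id, name, city) VALUES (?, ?, ?)",
            (source_id, name, city),
        )
    return cur.lastrowid

# The same source customer appears twice because an attribute changed.
key_before_move = add_customer_version(conn, "CUST-0042", "A. Customer", "Pune")
key_after_move = add_customer_version(conn, "CUST-0042", "A. Customer", "Noida")
print(key_before_move, key_after_move)
```

    Because the primary key does not depend on the source key, both versions of the customer can coexist, and facts recorded before the change keep pointing at the row that was valid at the time.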

    9. Date and Time Dimensions

    Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a specified time period for analysis. Although the date and time of a business fact is usually recorded in the source data, special date and time dimensions provide more effective and efficient mechanisms for time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet the needs of the data warehouse users and are created within the data warehouse.

    A date dimension often contains two hierarchies, one for calendar year and another for fiscal year.
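    A date dimension with both hierarchies can be generated rather than loaded. The sketch below makes several assumptions that are not from the text: the fiscal year is taken to start on 1 April and to be named after the calendar year in which it starts, the key is an integer of the form yyyymmdd, and the attribute names are hypothetical.

```python
from datetime import date, timedelta

FISCAL_START_MONTH = 4  # assumption: the fiscal year starts on 1 April

def date_dimension_row(d):
    """Build one date dimension row carrying calendar and fiscal attributes."""
    fiscal_year = d.year if d.month >= FISCAL_START_MONTH else d.year - 1
    fiscal_month = (d.month - FISCAL_START_MONTH) % 12 + 1  # 1 = first fiscal month
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # assumed yyyymmdd key
        "date": d.isoformat(),
        "calendar_year": d.year,
        "calendar_quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "fiscal_year": fiscal_year,
        "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
    }

def build_date_dimension(start, end):
    """One row per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [date_dimension_row(start + timedelta(days=i)) for i in range(days)]

date_dim = build_date_dimension(date(2018, 1, 1), date(2018, 12, 31))
print(len(date_dim), "rows")
print(date_dimension_row(date(2018, 2, 15)))
```

    With the attributes stored on every row, a query can group by calendar quarter or by fiscal quarter without doing any date arithmetic of its own.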

    [Figure: product hierarchy, with Sales divided into the product lines Hardware, Software, Peripherals, and Services]

    10.4 Fact Table Keys

    A fact table contains a foreign key column for the primary keys of each dimension. The combination of these foreign keys defines the primary key for the fact table. Physical design considerations, such as fact table partitioning, load performance, and query performance, may indicate a different structure for the fact table primary key than the composite key that is in the logical model.

    The logical model for a fact table resolves many-to-many relationships between dimensions because dimension tables join through the fact table.
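    The composite key of the logical model can be sketched in SQLite as follows. The table and column names, the sample rows and the helper try_insert are hypothetical; foreign key checking is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY);

    CREATE TABLE sales_fact (
        date_key    INTEGER NOT NULL REFERENCES date_dim(date_key),
        store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
        product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
        units_sold  INTEGER,
        -- the combination of the foreign keys is the primary key
        PRIMARY KEY (date_key, store_key, product_key)
    );

    -- Hypothetical sample rows
    INSERT INTO date_dim    VALUES (20180101), (20180102);
    INSERT INTO store_dim   VALUES (1);
    INSERT INTO product_dim VALUES (1), (2);
    INSERT INTO sales_fact  VALUES (20180101, 1, 1, 5), (20180101, 1, 2, 3);
""")

def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert((20180101, 1, 1, 9))     # same key combination as an existing row
new_accepted = try_insert((20180102, 1, 1, 4))           # new combination of existing dimension keys
unknown_dim_accepted = try_insert((20180101, 1, 99, 1))  # product 99 is not in product_dim
print(duplicate_accepted, new_accepted, unknown_dim_accepted)
```

    Each fact row links one date, one store and one product, so the many-to-many relationship between, say, stores and products is expressed entirely by rows of the fact table.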
