8/14/2019 Ch4_data Access and Delivery
1. Starter
2. OLTP
2.1 A Data Warehouse Supports OLTP
3. OLAP in Data Warehouse
3.1. OLAP Example
3.2. Comparison of OLAP and OLTP
3.3. Storing Active OLAP data
3.4. Multi-dimension OLAP (MOLAP)
3.5. Relational OLAP (ROLAP)
4. Design the Dimensional Model
5. Dimensional Model Schemas
5.1 Star schema
5.2 Aggregation
5.3 Snowflake Schemas
5.4 Star or Snowflake
6. Dimension Tables
7. Hierarchies
8. Surrogate Keys
9. Date and Time Dimensions
9.1 Time Granularity
9.2 Date and Time Dimension Attributes
10. Fact Tables
10.1 Multiple Fact Tables
10.2 Additive and Non-additive Measures
10.3 Calculated Measures
10.4 Fact Table Keys
INFORMATION ACCESS AND DELIVERY
DATA WAREHOUSING
OLAP is the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data
SUSHIL KULKARNI
1. Starter
This chapter introduces OLTP and OLAP and discusses the design issues to consider when developing a data warehouse.
2. OLTP
Data warehouses support business decisions by collecting, consolidating, and organizing data for reporting and analysis with tools such as online analytical processing (OLAP) and data mining. Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an OnLine Transaction Processing system (OLTP) database.
The following table gives the differences between a data warehouse database and an OLTP database.
Data warehouse database | OLTP database
Designed for analysis of business measures by categories and attributes | Designed for real-time business operations
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table
Loaded with consistent, valid data; requires no real-time validation | Optimized for validation of incoming data during transactions; uses validation data tables
Supports few concurrent users relative to OLTP | Supports thousands of concurrent users
2.1 A Data Warehouse Supports OLTP
A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database.
Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database. If data is simply archived for preservation, it is not available or organized for use by analysts and decision-makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.
A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak transaction efficiency. High volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that those analytical queries are simpler and more efficient.
3. OLAP in Data Warehouse
A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers, which are optimized for handling specific data management problems. Until recently, organizations have tried to target relational database management systems (RDBMSs) for the complete spectrum of database applications. It is, however, apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of on-line analytical processing (OLAP). OLAP was a term coined by E F Codd (1993) and was defined by him as:
The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data
Codd has developed rules or requirements for an OLAP system:
o multidimensional conceptual view
o transparency
o accessibility
o consistent reporting performance
o client/server architecture
o generic dimensionality
o dynamic sparse matrix handling
o multi-user support
o unrestricted cross dimensional operations
o intuitive data manipulation
o flexible reporting
o unlimited dimensions and aggregation levels
An alternative definition of OLAP has been supplied by Nigel Pendse, who, unlike Codd, does not mix technology prescriptions with application requirements. Pendse defines OLAP as,
Fast Analysis of Shared Multidimensional Information
Which means:
Fast in that users should get a response in seconds and so do not lose their chain of thought;
Analysis in that the system can provide analysis functions in an intuitive manner and that the functions should supply business logic and statistical analysis relevant to the user's application;
Shared from the point of view of supporting multiple users concurrently;
Multidimensional as a main requirement so that the system supplies a multidimensional conceptual view of the data including support for multiple hierarchies;
Information is the data and the derived information required by the user application.
One question is what multidimensional data is and when it becomes OLAP. It is essentially a way to build associations between dissimilar pieces of information using predefined business rules about the information you are using. The following three components are identified for OLAP:
o A multidimensional database must be able to express complex business calculations very easily. The data must be referenced and mathematically defined. In a relational system there is no relation between line items, which makes it very difficult to express business mathematics.
o Mining hierarchies are required in order to roam around the data.
o Instant response is the need to give the user the information as quickly as possible.
Dimensional databases are not a universal solution, as they are not suited to storing all types of data, such as lists of all customer addresses and purchase orders.
Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers: the user is free to explore the data and receive the type of report they want without being restricted to a set format.
OLAP can alternatively be defined as follows:
OLAP applications and tools are those that are designed to ask ad hoc, complex queries of large multidimensional collections of data. It is for this reason that OLAP is often mentioned in the context of Data Warehouses.
3.1. OLAP Example
An example of an OLAP database is one comprised of sales data which has been aggregated by region, product type, and sales channel.
A typical OLAP query might access a multi-gigabyte/multi-year sales database in order to find all product sales in each region for each product type. After reviewing the results, an analyst might further refine the query to find sales volume for each sales channel within region/product classifications. As a last step the analyst might want to perform year-to-year or quarter-to-quarter comparisons for each sales channel. This whole process must be carried out on-line with rapid response time so that the analysis process is undisturbed.
OLAP queries can be characterized as on-line transactions which:
o Access very large amounts of data, e.g. several years of sales data.
o Analyse the relationships between many types of business elements, e.g. sales, products, regions, channels.
o Involve aggregated data, e.g. sales volumes, budgeted Rs. and Rs. spent.
o Compare aggregated data over hierarchical time periods, e.g. monthly, quarterly, yearly.
o Present data in different perspectives, e.g. sales by region vs. sales by channels by product within each region.
o Involve complex calculations between data elements, e.g. expected profit as calculated as a function of sales revenue for each type of sales channel in a particular region.
o Are able to respond quickly to user requests so that users can pursue an analytical thought process without being stymied by the system.
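The iterative refinement described in the example above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `sales`, its columns and every row are invented for the sketch and do not come from the chapter.

```python
import sqlite3

# Hypothetical sales rows: (year, region, product_type, channel, amount).
ROWS = [
    (2018, "North", "Hardware", "Retail", 100.0),
    (2018, "North", "Hardware", "Online", 40.0),
    (2018, "South", "Software", "Retail", 70.0),
    (2019, "North", "Hardware", "Retail", 120.0),
    (2019, "South", "Software", "Online", 90.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, product_type TEXT, "
    "channel TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", ROWS)

# Step 1: total sales in each region for each product type.
by_region_type = conn.execute(
    "SELECT region, product_type, SUM(amount) FROM sales "
    "GROUP BY region, product_type ORDER BY region, product_type"
).fetchall()

# Step 2: refine by sales channel within region/product type.
by_channel = conn.execute(
    "SELECT region, product_type, channel, SUM(amount) FROM sales "
    "GROUP BY region, product_type, channel "
    "ORDER BY region, product_type, channel"
).fetchall()

# Step 3: year-to-year comparison for each sales channel.
by_year_channel = conn.execute(
    "SELECT channel, year, SUM(amount) FROM sales "
    "GROUP BY channel, year ORDER BY channel, year"
).fetchall()

print(by_region_type)
print(by_channel)
print(by_year_channel)
```

Each step is a new aggregation over the same detail rows, which is why such queries touch large amounts of data and benefit from pre-computed summaries.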
3.2. Comparison of OLAP and OLTP
OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple.
A typical customer order entry OLTP transaction might retrieve all data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number, which is used to relate the rows from the different tables. The relationships between the records are simple and only a few records are actually retrieved or updated by a single transaction.
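For contrast with the analytical queries, the order entry transaction just described can be sketched as follows, again with sqlite3. The table layouts, the helper `place_order` and the sample customer are assumptions made for this sketch; only the three table roles (customer, customer order, detail line) and the shared customer identification number come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_line (order_id INTEGER, customer_id INTEGER,
                         product TEXT, quantity INTEGER);
INSERT INTO customer VALUES (1, 'Asha');  -- invented sample customer
""")


def place_order(conn, customer_id, lines):
    """Retrieve one customer, then insert one order and its detail lines."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT name FROM customer WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        if row is None:
            raise ValueError("unknown customer")
        cur = conn.execute(
            "INSERT INTO customer_order (customer_id) VALUES (?)", (customer_id,)
        )
        order_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO order_line VALUES (?, ?, ?, ?)",
            [(order_id, customer_id, product, qty) for product, qty in lines],
        )
    return order_id


order_id = place_order(conn, 1, [("keyboard", 2), ("mouse", 1)])
print(order_id)
```

The transaction reads one row and writes three, which is the "few records per transaction" pattern that OLTP databases are optimized for.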
The difference between OLAP and OLTP has been summarized as follows: OLTP servers handle production data accessed through simple queries, while OLAP servers handle management-critical data accessed through an iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require specially optimized servers for the two types of processing.
3.3. Storing Active OLAP data
'Store' in this context means holding the data in a persistent form (for at least the duration of a session, and often shared between users), not simply for the time required to process a single query. There are three options:
Relational database: This is an obvious choice, particularly if the data is sourced from an RDBMS. In most cases, the data would be stored in a denormalised structure such as a star schema, or one of its variants, such as snowflake; a normalized database would not be appropriate for performance and other reasons. Often, summary data will be held in aggregate tables.
Multidimensional database: In this case, the active data is stored in a multidimensional database on a server. It may include data extracted and summarized from legacy systems or relational databases and from end-users. It is usually possible (and sometimes compulsory) for data to be pre-computed and the results stored in some form of array structure.
Client-based files: In this case, relatively small extracts of data are held on client machines. They may be distributed in advance, or created on demand (possibly via the Web).
3.4. Multi-dimension OLAP (MOLAP)
MOLAP tools use specialized data structures and multi-dimensional DBMS to organize, navigate and analyze data. To enhance query performance the data is typically aggregated and stored according to predicted usage. The development issues associated with MOLAP are:
o The underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.
o Navigation and analysis of data is limited because the data is designed according to previously determined requirements. Data may need to be physically reorganised to optimally support new requirements.
o MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support.
3.5. Relational OLAP (ROLAP)
ROLAP supports RDBMS products through the use of a metadata layer, thus avoiding the requirement to create a static multi-dimensional data structure. This facilitates the creation of multiple multidimensional views of the two-dimensional relation.
To improve performance, some ROLAP products have enhanced SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalised database designs such as the star schema. The development issues associated with ROLAP technology are:
o Development of middleware to facilitate the development of multi-dimensional applications; that is, software that converts the two-dimensional relation into a multi-dimensional structure.
o Development of an option to create persistent, multidimensional structures with facilities to assist in the administration of these structures.
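The middleware idea in the first point, converting a two-dimensional relation into a multi-dimensional structure, can be illustrated in a few lines of Python. This is a toy sketch and not how any ROLAP product is implemented; the functions `build_cube` and `view` and the sample rows are hypothetical.

```python
from collections import defaultdict

# A two-dimensional relation: one row per sale, columns are
# (region, product, channel, amount). All values are invented.
relation = [
    ("North", "Hardware", "Retail", 100),
    ("North", "Software", "Online", 40),
    ("South", "Hardware", "Retail", 70),
    ("North", "Hardware", "Retail", 30),
]


def build_cube(rows):
    """Fold flat rows into cells keyed by their dimension coordinates."""
    cube = defaultdict(int)
    for *coords, measure in rows:
        cube[tuple(coords)] += measure
    return dict(cube)


def view(cube, keep):
    """Project the cube onto the dimension positions in `keep`, summing the rest."""
    out = defaultdict(int)
    for coords, measure in cube.items():
        out[tuple(coords[i] for i in keep)] += measure
    return dict(out)


cube = build_cube(relation)
by_region = view(cube, keep=(0,))
by_region_product = view(cube, keep=(0, 1))
print(by_region)
print(by_region_product)
```

Several different views are derived from the same relation on demand, which mirrors the "multiple multidimensional views of the two-dimensional relation" mentioned above.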
4. Design the Dimensional Model
In the previous chapters we learned how to plan and design a data warehouse and discussed the different data warehouse architectures.
User requirements and data realities drive the design of the dimensional model, which must address business needs, the grain of detail, and what dimensions and facts to include.
The dimensional model must suit the requirements of the users and support ease of use for direct access. The model must also be designed so that it is easy to maintain and can adapt to future changes. The model design must result in a relational database that supports OLAP cubes to provide instantaneous query results for analysts.
An OLTP system requires a normalized structure to minimize redundancy, provide validation of input data, and support a high volume of fast transactions. A transaction usually involves a single business event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider web of hundreds or even thousands of related tables.
In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and relate to business needs, supports simplified business queries, and provides superior query performance by minimizing table joins.
For example, contrast the very simplified OLTP data model in the first diagram below with the data warehouse dimensional model in the second diagram. Which one better supports the ease of developing reports and simple, efficient summarization queries?
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
can have dimension tables as
Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
The fact and dimension relations can be displayed in an E-R diagram, which suggests a star and is called a star schema. For the above example, we have the following star schema:
Consider the following star schema having dimensional tables:
Store(store_id, name, address)
Product(prod_id, name, category)
Date(time_id, day, month, year)
Sales(time_id, store_id, prod_id, units_sold)
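As a rough sketch, the Store, Product, Date and Sales relations above could be created and queried as follows using sqlite3. The column types and all sample rows are assumptions added for illustration; only the table and column names follow the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store   (store_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE Product (prod_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE "Date"  (time_id  INTEGER PRIMARY KEY, day INTEGER,
                      month INTEGER, year INTEGER);
-- Fact table: one foreign key per dimension plus the measure.
CREATE TABLE Sales (
    time_id    INTEGER REFERENCES "Date"(time_id),
    store_id   INTEGER REFERENCES Store(store_id),
    prod_id    INTEGER REFERENCES Product(prod_id),
    units_sold INTEGER
);
-- Sample rows are invented for illustration.
INSERT INTO Store   VALUES (1, 'Central', 'MG Road'), (2, 'East', 'Station Road');
INSERT INTO Product VALUES (10, 'Pen', 'Stationery'), (11, 'Lamp', 'Electrical');
INSERT INTO "Date"  VALUES (100, 1, 1, 2019), (101, 2, 1, 2019);
INSERT INTO Sales   VALUES (100, 1, 10, 5), (100, 2, 10, 3),
                           (101, 1, 11, 2), (101, 1, 10, 4);
""")

# Each dimension joins directly to the fact table: the "points" of the star.
units_by_store_category = conn.execute("""
    SELECT s.name, p.category, SUM(f.units_sold)
    FROM Sales f
    JOIN Store s   ON s.store_id = f.store_id
    JOIN Product p ON p.prod_id  = f.prod_id
    GROUP BY s.name, p.category
    ORDER BY s.name, p.category
""").fetchall()
print(units_by_store_category)
```

Every dimension is reached with a single join from the fact table, which is what keeps star schema queries simple.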
Facts in the cube are the actual values of the Sales relation; the table can be called Quantity, with attributes store_id, time_id, prod_id, and figure. The cube can be drawn as in the following figure:
Let us consider another example of a schema, given by
This schema can be viewed with the instances as follows:
For each dimension, the set of associated attributes can be structured as a hierarchy.
For example, we have the following hierarchy structure for store and customer
Let us see how to add the values in the cube. Consider again sales of products, which can be represented in one dimension (as a fact relation) or in two dimensions as clients and products.
See the following tables:
In three dimensions we have to take the following fact table with the corresponding cube with values
We will learn about fact tables, dimension tables and hierarchies in detail from section 6 onwards.
5.2. Aggregation
Many OLAP queries involve aggregation of the data in the fact table. Aggregate operators can be sum, count, max, min, median, avg, etc.
For example, to find the total amount for each day, we might use the following SQL statement:
SELECT Date, sum(Amt)
FROM SALE
GROUP BY Date
The output is as follows:
Let us consider another query. To find the total amount of each product for each client, we might use the following SQL statement:
SELECT client, product, sum(amt)
FROM SALE
GROUP BY client, product
The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data as follows
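The two GROUP BY statements in this section can be tried out with the small sqlite3 script below. The SALE rows are invented sample data, and the ORDER BY clauses are added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SALE (product TEXT, client TEXT, Date INTEGER, amt INTEGER)"
)
# Invented sample facts: (product, client, date, amount).
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Total amount for each day.
total_by_date = conn.execute(
    "SELECT Date, sum(amt) FROM SALE GROUP BY Date ORDER BY Date"
).fetchall()

# Total amount of each product for each client (summed over all dates).
total_by_client_product = conn.execute(
    "SELECT client, product, sum(amt) FROM SALE "
    "GROUP BY client, product ORDER BY client, product"
).fetchall()

print(total_by_date)
print(total_by_client_product)
```

The second query drops the date column from the grouping, so each (client, product) cell holds the sum over the whole time dimension.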
In a multidimensional data model, summarizing information (that is, aggregate values) is usually stored together with the measure values. This is shown in the following figure:
Computing the sum aggregate operation becomes very easy. Let us write the above values in a cube and then find the sums row-wise and column-wise, as shown in the following figure:
These values can be inserted into a cube as shown below:
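One simple way to picture storing row-wise and column-wise sums alongside the cell values is sketched below. The `"*"` marker for "all values of this dimension", the function `add_margins` and the sample amounts are assumptions made for this sketch.

```python
# Hypothetical two-dimensional view: amounts keyed by (product, client).
cells = {
    ("p1", "c1"): 56, ("p1", "c2"): 4, ("p1", "c3"): 50,
    ("p2", "c1"): 11, ("p2", "c2"): 8,
}


def add_margins(cells):
    """Return a copy with '*' entries holding row, column and grand totals."""
    out = dict(cells)
    for (product, client), amt in cells.items():
        # Every cell contributes to its row total, column total and grand total.
        for key in ((product, "*"), ("*", client), ("*", "*")):
            out[key] = out.get(key, 0) + amt
    return out


with_totals = add_margins(cells)
print(with_totals[("p1", "*")], with_totals[("*", "c1")], with_totals[("*", "*")])
```

Once the totals are stored, a query such as "total for product p1" becomes a single lookup instead of a scan over the detail cells.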
One can find the average sales by region within store, or by month within date, using the dimension hierarchy. Thus one can find how many products are sold in each region. For instance, suppose customer c1 belongs to region 1 and c2 and c3 belong to region 2; then we get the following:
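The roll-up from customers to regions can be sketched with a small customer dimension table. The assignment of c1 to region 1 and of c2 and c3 to region 2 follows the text; the table names, column names and sale amounts are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (client TEXT PRIMARY KEY, region TEXT);
CREATE TABLE sale (product TEXT, client TEXT, amt INTEGER);
INSERT INTO customer VALUES ('c1', 'region1'), ('c2', 'region2'), ('c3', 'region2');
-- Invented amounts per (product, client).
INSERT INTO sale VALUES ('p1', 'c1', 56), ('p1', 'c2', 4), ('p1', 'c3', 50),
                        ('p2', 'c1', 11), ('p2', 'c2', 8);
""")

# Roll up from the client level to the region level of the hierarchy.
by_region = conn.execute("""
    SELECT c.region, s.product, SUM(s.amt)
    FROM sale s
    JOIN customer c ON c.client = s.client
    GROUP BY c.region, s.product
    ORDER BY c.region, s.product
""").fetchall()
print(by_region)
```

Replacing SUM with AVG in the same query gives the average instead of the total for each region.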
One can also find the aggregate sales by city using the dimension hierarchy, as follows:
Thus, finally, we can incorporate all the entries in the cube as shown in the following figure
[Figure: cube with all aggregate entries; city labels NOIDA and PUNE]
The following example shows the star schema for sales that we discussed at the start of the chapter (see Figure B).
5.3 Snowflake Schema
A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact table but must join through other dimension tables. For example, a dimension that describes products may be separated into three tables (snowflaked) as illustrated in the following diagram
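A product dimension snowflaked into three tables might look like the sketch below. The three levels chosen here (product, subcategory, category), the fact table `sales_fact` and all rows are hypothetical; the chapter's own diagram may use different tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category    (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE subcategory (subcategory_id INTEGER PRIMARY KEY, subcategory_name TEXT,
                          category_id INTEGER REFERENCES category(category_id));
CREATE TABLE product     (product_id INTEGER PRIMARY KEY, product_name TEXT,
                          subcategory_id INTEGER REFERENCES subcategory(subcategory_id));
CREATE TABLE sales_fact  (product_id INTEGER REFERENCES product(product_id),
                          sales_amt REAL);
-- Invented sample rows.
INSERT INTO category    VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO subcategory VALUES (10, 'Printers', 1), (11, 'Monitors', 1), (20, 'Games', 2);
INSERT INTO product     VALUES (100, 'Printer A', 10), (101, 'Monitor B', 11),
                               (200, 'Chess', 20);
INSERT INTO sales_fact  VALUES (100, 300.0), (101, 200.0), (200, 50.0), (100, 100.0);
""")

# Only product joins to the fact table; category is reached through
# subcategory, which is what makes the dimension "snowflaked".
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amt)
    FROM sales_fact f
    JOIN product p     ON p.product_id = f.product_id
    JOIN subcategory s ON s.subcategory_id = p.subcategory_id
    JOIN category c    ON c.category_id = s.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```

Compared with a star design, the same report needs two extra joins, but each category name is stored exactly once.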
A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagrams (see Figure A).
FIGURE A
FIGURE B
5.4 Star or Snowflake
1. Both star and snowflake schemas are dimensional models; the difference is in their physical implementations. Snowflake schemas support ease of dimension maintenance because they are more normalized. Star schemas are easier for direct user access and often support simpler and more efficient queries.
2. The decision to model a dimension as a star or snowflake depends on the nature of the dimension itself, such as how frequently it changes and which of its elements change, and often involves evaluating tradeoffs between ease of use and ease of maintenance. It is often easiest to maintain a complex dimension by snowflaking the dimension. By pulling hierarchical levels into separate tables, referential integrity between the levels of the hierarchy is guaranteed.
6. Dimension Tables
Dimension tables encapsulate the attributes associated with facts and separate these attributes into logically distinct groupings, such as time, geographical area, products, customers, and so forth.
A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or contributes data to data marts. For example, a product dimension may be used with a sales fact table and an inventory fact table in the data warehouse, and also in one or more departmental data marts.
A dimension such as customer, time, or product that is used in multiple schemas is called a conforming dimension if all copies of the dimension are the same. Summarization data and reports will not correspond if different schemas use different versions of a dimension table. Using conforming dimensions is critical to successful data warehouse design.
User input and evaluation of existing business reports help in defining the dimensions to include in the data warehouse. For instance, a user who wants to see data by region and by product has just identified two dimensions, region and product. Business reports that group sales by salesperson or sales by customer identify two more dimensions, salesforce and customer. Almost every data warehouse includes a time dimension.
In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension tables are seldom keyed to date.
The tuples in a dimension table establish one-to-many relationships with the fact table. For example, there may be a number of sales to a single customer, or a number of sales of a single product.
The dimension table contains attributes associated with the dimension entry. These attributes are user-oriented textual details, such as product name or customer name and address. Attributes serve as report labels and query constraints.
Attributes that are coded in an OLTP database should be decoded into descriptions. For example, product category may exist as a simple integer in the OLTP database, but the dimension table should contain the actual text for the category. The code may also be carried in the dimension table if needed for maintenance.
7. Hierarchies
The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business need to group and summarize data into usable information. For example, a time dimension often contains the hierarchy elements: (all time), Year, Quarter, Month, Day, or (all time), Year, Quarter, Week, Day.
A dimension may contain multiple hierarchies; a time dimension often contains both calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a hierarchy that imposes a structure on sales points, customers, or other geographically distributed dimensions. An example geography hierarchy for sales points is: (all), Country, Region, State, City, Store.
Note that each hierarchy example has an (all) entry such as (all time), (all stores), (all customers), and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a dimension and permits summarization of fact data to a single number for a dimension. For example, if the first level of a product hierarchy includes product line categories for hardware, software, peripherals, and services, the question "What was the total amount for sales of all products last year?" is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals, and services last year?" In this question, "all products" is the artificial category, as it is used to combine the different product line categories.
[Figure: time hierarchies: Time, Year, Quarter, Month, Day; or Time, Year, Quarter, Week, Day]
[Figure: geography hierarchy for geographically distributed dimensions (sales points, customers): Country, Region, State, City, Store]
The concept of an (all) node at the top of each hierarchy helps reflect the way users want to phrase their questions. OLAP tools depend on hierarchies to categorize data; Analysis Services will, by default, create an (all) entry for a hierarchy used in a cube if none is specified.
A hierarchy may be balanced, unbalanced, or composed of parent-child relationships such as an organizational structure.
8. Surrogate Keys
A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record is carried in the dimension table but is not used as the primary key. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from surrogate keys used for other dimension tables.
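A minimal sketch of a dimension table with an automatically increasing surrogate key follows, using sqlite3. The table `customer_dim`, its columns, the helper `add_version` and the sample customer are all hypothetical. Keeping two rows for the same source key is just one possible way of handling a changed dimension record, shown here only to illustrate why the surrogate key must be independent of the source key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer_dim (
    customer_key       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    source_customer_id TEXT,   -- original key carried from the source system
    name               TEXT,
    city               TEXT
)
""")


def add_version(conn, source_id, name, city):
    """Insert a dimension row and return the surrogate key assigned to it."""
    cur = conn.execute(
        "INSERT INTO customer_dim (source_customer_id, name, city) "
        "VALUES (?, ?, ?)",
        (source_id, name, city),
    )
    return cur.lastrowid


# The same source customer appears twice after moving to another city.
k1 = add_version(conn, "CUST-007", "Asha", "Pune")
k2 = add_version(conn, "CUST-007", "Asha", "Noida")
print(k1, k2)
```

Fact rows would reference `customer_key`, so older facts can keep pointing at the earlier version of the customer while newer facts point at the later one.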
9. Date and Time Dimensions
Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a specified time period for analysis. Although the date and time of a business fact is usually recorded in the source data, special date and time dimensions provide more effective and efficient mechanisms for time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet the needs of the data warehouse users and are created within the data warehouse.
A date dimension often contains two hierarchies, one for calendar year and another for fiscal year.
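The sketch below generates date dimension rows that carry both a calendar and a fiscal hierarchy. Several things in it are assumptions rather than anything stated in the chapter: the fiscal year is taken to start on 1 April and to be labelled by its starting calendar year, the key is a yyyymmdd integer, and the column names are invented.

```python
from datetime import date, timedelta

FISCAL_YEAR_START_MONTH = 4  # assumption: fiscal year begins on 1 April


def date_row(d):
    """Build one date dimension row with calendar and fiscal attributes."""
    # Month number counted from the start of the fiscal year (1..12).
    fiscal_month = (d.month - FISCAL_YEAR_START_MONTH) % 12 + 1
    # Assumption: the fiscal year is labelled by the year in which it starts.
    fiscal_year = d.year if d.month >= FISCAL_YEAR_START_MONTH else d.year - 1
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # e.g. 20190331
        "date": d.isoformat(),
        "calendar_year": d.year,
        "calendar_quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "fiscal_year": fiscal_year,
        "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
    }


def build_date_dimension(start, end):
    """Return one row per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [date_row(start + timedelta(days=i)) for i in range(days)]


rows = build_date_dimension(date(2019, 3, 30), date(2019, 4, 2))
for row in rows:
    print(row)
```

Because both sets of attributes sit on every row, the same fact can be summarized along either hierarchy without recomputing anything from the raw time stamp.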
[Figure: product hierarchy: Sales; Product lines: Hardware, Software, Peripherals, Services]
10.4 Fact Table Keys
A fact table contains a foreign key column for the primary keys of each dimension. The combination of these foreign keys defines the primary key for the fact table. Physical design considerations, such as fact table partitioning, load performance, and query performance, may indicate a different structure for the fact table primary key than the composite key that is in the logical model.
The logical model for a fact table resolves many-to-many relationships between dimensions because dimension tables join through the fact table.
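The composite key described above can be sketched with the Sales, Market, Product and Time relations used earlier in the chapter. The dimension tables are cut down to one attribute each, and the column types, the helper `try_insert` and the sample rows are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;  -- SQLite enforces foreign keys only when enabled
CREATE TABLE Market  (Market_Id  INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE Product (Product_Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE "Time"  (Time_Id    INTEGER PRIMARY KEY, Month INTEGER);
CREATE TABLE Sales (
    Market_Id  INTEGER NOT NULL REFERENCES Market(Market_Id),
    Product_Id INTEGER NOT NULL REFERENCES Product(Product_Id),
    Time_Id    INTEGER NOT NULL REFERENCES "Time"(Time_Id),
    Sales_Amt  REAL,
    -- The combination of the dimension foreign keys is the primary key.
    PRIMARY KEY (Market_Id, Product_Id, Time_Id)
);
-- Invented sample dimension rows.
INSERT INTO Market  VALUES (1, 'Pune');
INSERT INTO Product VALUES (1, 'Pen');
INSERT INTO "Time"  VALUES (1, 8);
""")


def try_insert(row):
    """Insert one fact row; return False if a key constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO Sales VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert((1, 1, 1, 250.0))        # valid combination of dimension keys
duplicate = try_insert((1, 1, 1, 99.0))  # same composite key again
orphan = try_insert((9, 1, 1, 10.0))     # Market_Id 9 is not in Market
print(ok, duplicate, orphan)
```

The second insert is rejected by the composite primary key and the third by the foreign key to Market, so each fact row is tied to exactly one existing entry in every dimension.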