Data Warehouse

What is a data warehouse?

The standard answer to this question is that it is a database designed to support the management decision-making process. A more accurate answer might be that it is where business managers can find the information they need to run the business correctly. Notice that the first answer is product-oriented, while the second emphasises functionality. The guru of data warehousing, Bill Inmon, characterises a data warehouse as being:

linked to, but different from, the many production databases that run in an organisation;

subject-oriented, rather than application-oriented, to provide a consistent view of the business;

integrated, because data is consolidated from different application systems;

time variant, because information has a time dimension, whereas operational data is valid only at a particular moment;

non-volatile, since data is added to the data warehouse, rather than replaced.

Where did the concept come from?

While many people say that the concept of an ‘information’ warehouse came from IBM, a number of companies were building data warehouses during the 1980s but giving them different names. In South Africa, pioneering organisations like Eskom and the former United Building Society set up very large databases for their management and executive information systems well before the warehouse concept was established. IBM can be credited with realising the true potential of a database for management-oriented rather than operational needs, and of course for promoting the concept world-wide.

Why do I need a data warehouse?

Any of the following reasons can apply.

1. Organisations have spent many years improving ways to put data into their operational systems; now they are beginning to appreciate the value of getting that data out again. However, in many cases the databases that support those operational systems are not suitable for quick and easy access to information.

2. Application databases are designed to handle data for specific, quite narrow purposes, not integrated organisational views.

3. The IS department often does not want to allow users on the operational system for fear of degrading the system with resource-intensive queries, or breaching data security and integrity.

4. The data management differences between databases for application systems and for decision-making cannot be reconciled on one system.

5. The software tools that users need should be on platforms that make it easy to access and use data - the operational platform may not be suitable.

6. The growth in recent years of decision support and data mining applications necessitates the move to a different database architecture.

7. ‘Get those xxxx users off our back’!

How is data warehousing developing?

If you were building a data warehouse five years ago, chances are it would be on your company’s mainframe computer. But with increased use of the data warehouse came a heavier load on mainframe computer resources and degradation of overall performance. Nowadays, a common solution is to position the warehouse on a mid-range platform that can scale up easily and at a suitably low cost - conditions not yet available on mainframes. The lower implementation costs of a client-server warehouse make it easier to motivate the development of a warehouse. And data warehouses are not mission-critical systems, so the common client-server criticism about lack of system management tools is less significant.

What are the implications of data warehousing?

Because a data warehouse crosses organisational boundaries, there are serious non-technical issues that you must bear in mind at an early stage; not to do so can be a career-limiting move. Consider how organisations start data warehousing projects, from two different approaches.

One approach begins with corporate data modelling, and builds the warehouse by taking a broad view of business requirements; you can call this a top-down methodology. The other approach is to focus on a specific information application and use that as the core around which the warehouse will grow; this is the bottom-up methodology. Both approaches are valid but have their own problems. The top-down method requires long hard work to motivate and justify, political savvy, and approval by someone to pay up-front before seeing the final product. And that’s before you even start on development. The bottom-up approach avoids many of those problems, but runs the risk that the information requirement you start with becomes quickly out-dated, or is not accepted by business users.

Before you embark on a data warehouse project, you need to look at your current applications and databases. If the warehouse is being developed because no one apart from programmers can access corporate data in its current state, then have a clear understanding about the data standards you will apply, and how the data warehouse will be designed. Too many organisations discover that they have multiple standards or definitions for their data. For data warehouses built using relational databases, it is a common mistake to design the warehouse database around the same normalised form as the application system. While the ‘normal form’ concept is fine for application efficiency, it creates havoc for users of a data warehouse.

Next, examine the processes that will be implemented to extract and convert data from your existing application systems to the warehouse. Among other things, these processes must ensure cleanliness and integrity - for example, identifying invalid data - as well as extracting the correct data every time - getting daily data for the wrong day is not an uncommon problem. While these housekeeping functions seem obvious, they can become a major headache as the warehouse grows, unless they are properly thought through at the outset.
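The two checks mentioned above (rejecting invalid data and confirming that the extract covers the intended day) can be sketched in Python. This is a minimal illustration under stated assumptions: the record layout (account, amount, txn_date) and the function name validate_extract are invented for the example and do not come from any particular product.

```python
from datetime import date

def validate_extract(rows, expected_day):
    """Split extracted rows into accepted and rejected lists.

    A row is rejected when its amount is missing or not numeric, or when
    its transaction date is not the day the extract was meant to cover.
    """
    accepted, rejected = [], []
    for row in rows:
        amount_ok = isinstance(row.get("amount"), (int, float))
        day_ok = row.get("txn_date") == expected_day
        if amount_ok and day_ok:
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected

# Hypothetical extract that should contain only data for 2 March 2024.
rows = [
    {"account": "A1", "amount": 120.0, "txn_date": date(2024, 3, 2)},
    {"account": "A2", "amount": None, "txn_date": date(2024, 3, 2)},  # invalid amount
    {"account": "A3", "amount": 75.5, "txn_date": date(2024, 3, 1)},  # wrong day
]
accepted, rejected = validate_extract(rows, date(2024, 3, 2))
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to report data quality problems back to the owners of the source system.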

As the data warehouse grows, two other warehouse admin. functions will become more important. One function is packaging the data for commonly requested queries so that the queries run quicker. Doing this is easier said than done. Not many database systems analyse the contents of queries. More significantly, common queries at one time become rare at another time due to business changes. This is an area that requires vigilance. The other admin. function is what to do as the data ages. As the data gets older, the importance of detail decreases. So to conserve disk space, this data should be summarised to higher and higher levels of summary, and eventually archived. Deciding how and what to summarise requires knowledge of past query patterns, and some insight into future business information requirements.
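The idea of summarising ageing data can be sketched as follows. This is only an illustration: the cutoff date, the choice of monthly totals as the summary level, and the function name summarise_old_rows are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

def summarise_old_rows(daily_rows, cutoff):
    """Roll daily rows older than `cutoff` up to monthly totals.

    Rows on or after the cutoff are returned unchanged as recent detail.
    """
    monthly = defaultdict(float)
    recent = []
    for day, amount in daily_rows:
        if day < cutoff:
            monthly[(day.year, day.month)] += amount
        else:
            recent.append((day, amount))
    return dict(monthly), recent

# Made-up daily sales figures.
daily_sales = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 50.0),
    (date(2023, 2, 3), 30.0),
    (date(2024, 6, 1), 80.0),
]
monthly, recent = summarise_old_rows(daily_sales, cutoff=date(2024, 1, 1))
print(monthly)
print(recent)
```

The same step could later be repeated at a coarser level (for example, monthly to yearly) before the oldest data is archived.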

To make the contents of a data warehouse easily understood, users have begun to demand decent directories of corporate data, otherwise known as data dictionaries. This is a bad news and good news story. The bad news is that many organisations have spent a lot of time and money trying to implement corporate dictionaries for a data warehouse with little obvious success. While products do exist, the average user has great difficulty understanding them since they are still aimed mainly at programmers. The good news is that some of the end-user query tools that are beginning to emerge have quite adequate data dictionary functions. Watch this space for more developments.

When starting a data warehouse, the best advice is to start with a small and focused need, and work on getting it right first time. This makes it easier to project benefits, and to get data issues sorted out on a small system before you tackle a large one.

What benefits do data warehouses provide?

As with many strategic systems, determining bottom-line benefits is not straightforward. The major reason for a data warehouse is to improve access to corporate information. If this is accomplished successfully, users should find increased productivity because they do not waste time looking or waiting for information. With better access to this information, management can make better decisions. By separating the operational and warehouse data, you allow both databases to be optimised for their own specific purposes, and reduce possible disruptions to the operational systems. A truly useful warehouse will also create spin-off by suggesting further applications.

Goals of the data warehouse

The data warehouse must make an organization’s information easily accessible :-

The contents of the data warehouse must be easily understandable. The data must be intuitive and obvious to the business user, not merely the developer. The contents of the data warehouse must be labeled meaningfully. Business users want to separate and combine the data in the data warehouse in endless combinations, a process commonly referred to as slicing and dicing. The tools that access the data warehouse must be simple and easy to use. They must also return query results to the users with minimal wait times.

The data warehouse must present the organization’s information consistently :-

The data in the data warehouse must be credible. Data must be carefully assembled from a variety of sources around the organization, cleansed, quality assured, and released only when it is fit for user consumption. Information from one business process should match information from another business process. If two performance measures have the same name, then they must mean the same thing. Conversely, if two measures don’t mean the same thing, then they should be labeled differently. Consistent information means high quality information. It means that all the data is accounted for and complete. Consistency also implies that common definitions for the contents of the data warehouse are available for the users.

The data warehouse must be adaptive and resilient to change :-

We simply can’t avoid change. User needs, business conditions, data and technology are all subject to the shifting sands of time. The data warehouse must be designed to handle this inevitable change. Changes to the data warehouse must be graceful, meaning they don’t invalidate existing data or applications. The existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse. If descriptive data in the warehouse is modified, we must account for the changes appropriately.

The data warehouse must be a secure bastion that protects our information :-

An organization’s informational crown jewels are stored in the data warehouse. At a minimum, the warehouse likely contains information about what we are selling to whom at what price—potentially harmful details in the hands of the wrong people. The data warehouse must effectively control access to the organization’s confidential information.

The data warehouse must serve as the foundation for improved decision making :-

The data warehouse must have the right data in it to support decision making. There is only one true output from the data warehouse: the decisions that are made after the data warehouse has presented its evidence. These decisions deliver the business impact and the value attributable to the data warehouse. The original label that predates the data warehouse is still the best description of what we are designing: a decision support system.

The business community must accept the data warehouse if it is to be deemed successful:-

It doesn’t matter that we have built an elegant solution using best-of-breed products and platforms. If the business community has not embraced the data warehouse and continued to use it actively six months after training, then we have failed the acceptance test. Unlike an operational system rewrite, where business users have no choice but to use the new system, data warehouse usage is sometimes optional. Business user acceptance has more to do with simplicity than anything else.

Components of a data warehouse

There are four distinct components of the data warehouse:-

Operational Source Systems

Data Staging Area

Data Presentation Area

Data Access Tools

[Figure: the four components and the flow between them. Operational Source Systems feed the Data Staging Area through Extract; the Data Staging Area feeds the Data Presentation Area through Load; the Data Access Tools reach the Data Presentation Area through Access.

Data Staging Area: services (clean, combine and standardize; conform dimensions); no user query services; data store (flat files and relational tables); processing (sorting and sequential processing).

Data Presentation Area: Data Mart #1 (dimensional, atomic and summary data, based on a single business process); Data Mart #2 (similarly designed); DW bus (conformed facts and dimensions).

Data Access Tools: ad hoc query tools; report writers; analytic applications; modeling (forecasting, scoring, data mining).]

Operational Source System:-

These are the operational systems of record that capture the transactions of the business. The source systems should be thought of as outside the data warehouse because presumably we have little or no control over the content and format of the data in these operational legacy systems. The main priorities of the source systems are processing performance and availability. Queries against the source systems are narrow, one-record-at-a-time queries that are part of the normal transaction flow and are severely restricted in their demands on the operational system. We make the strong assumption that source systems are not queried in the broad and unexpected ways that data warehouses typically are queried. The source system maintains little historical data, and if you have a good data warehouse, the source systems can be relieved of much of the responsibility for representing the past. Each source system is a natural application where little investment has been made in sharing common data such as product, customer, geography or calendar.

The difference between the data warehouse and the operational system is as follows:-

Updating of data

Data warehouse DBMS: Does not allow updating of data inside it.

Operational system: Allows updating of data inside it.

Indexes

Data warehouse DBMS: Allows many indexes in the DBMS used by the data warehouse, as you do not update data in it.

Operational system: Allows a finite number of indexes in the DBMS used by the transaction processing. This restriction exists because you can update and insert data in the DBMS.

Free space

Data warehouse DBMS: Does not provide free space in the DBMS used by the data warehouse. Free space is the space reserved at the block level of memory for further expansion of the DBMS. For a DBMS used by the data warehouse, you do not need free space, as data is not updated.

Operational system: Provides free space in the DBMS used by the transaction processing. For a DBMS used by the transaction processing, 50% of space is the free space.

Data Staging Area:-

The data staging area of the data warehouse is both a storage area and a set of processes commonly referred to as extract-transformation-load (ETL). The data staging area is everything between the operational source systems and the data presentation area. It is somewhat analogous to the kitchen of a restaurant, where raw food products are transformed into a fine meal. In the data warehouse, raw operational data is transformed into a warehouse deliverable fit for user query and consumption. Similar to the restaurant’s kitchen, the backroom data staging area is accessible only to skilled professionals. The data warehouse kitchen staff is busy preparing meals and cannot simultaneously be responding to customer inquiries. Customers aren’t invited to eat in the kitchen. It isn’t safe for the customers to wander into the kitchen. We wouldn’t want our data warehouse customers to be injured by the dangerous equipment, hot surfaces and sharp knives they may encounter in the kitchen, so we prohibit them from accessing the staging area. Besides, things happen in the kitchen that customers just shouldn’t be privy to.

The key architectural requirement for the data staging area is that it is off-limits to business users and does not provide query and presentation services.

Extraction is the first step in the process of getting data into the data warehouse environment. Extraction means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation.

Once data is extracted to the staging area, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, and assigning warehouse keys. These transformations are all precursors to loading the data into the data warehouse presentation area.
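A few of these transformations can be sketched in Python. The example below is illustrative only: the field names, the rule of filling a missing country with "UNKNOWN", and the helper names clean_customer and assign_warehouse_keys are assumptions, not part of any specific ETL tool.

```python
def clean_customer(raw):
    """Standardise one raw customer record from a source system."""
    # Collapse repeated whitespace and normalise the casing of the name.
    name = " ".join(raw.get("name", "").split()).title()
    # Deal with a missing element by substituting an explicit marker.
    country = raw.get("country") or "UNKNOWN"
    return {"source_id": raw["id"], "name": name, "country": country.upper()}

def assign_warehouse_keys(records, key_map):
    """Attach a surrogate warehouse key to each record.

    `key_map` remembers keys already issued, so the same source id
    always receives the same warehouse key.
    """
    for rec in records:
        src = rec["source_id"]
        if src not in key_map:
            key_map[src] = len(key_map) + 1
        rec["customer_wk"] = key_map[src]
    return records

# Made-up rows, with the same customer arriving from two sources.
raw_rows = [
    {"id": "C-17", "name": "  alice   SMITH ", "country": "za"},
    {"id": "C-42", "name": "bob jones", "country": None},
    {"id": "C-17", "name": "Alice Smith", "country": "ZA"},
]
key_map = {}
cleaned = assign_warehouse_keys([clean_customer(r) for r in raw_rows], key_map)
for rec in cleaned:
    print(rec)
```

After cleansing, the first and third rows carry identical values and the same warehouse key, which is what allows data from multiple sources to be combined.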

The data staging area is dominated by the simple activities of sorting and sequential processing. In many cases, the data staging area is not based on relational technology but instead may consist of a system of flat files. After you validate your data for conformance with the defined one-to-one and many-to-one business rules, it may be pointless to take the final step of building a full-blown third-normal-form physical database.

However, there are cases where the data arrives at the doorstep of the data staging area in third-normal-form relational format. In these situations, the managers of the data staging area simply may be more comfortable performing the cleansing and transformation tasks using a set of normalized structures. A normalized database for the data staging area is acceptable. However, we continue to have some reservations about this approach. The creation of both normalized structures for staging and dimensional structures for presentation means that the data is extracted, transformed and loaded twice: once into the normalized database, and then again when we load the dimensional model. Obviously, this two-step process requires more time and resources for the development effort, more time for the periodic loading or updating of data, and more capacity to store the multiple copies of the data. At the bottom line, this typically translates into the need for larger development, ongoing support, and hardware platform budgets. Unfortunately, some data warehouse project teams have failed miserably because they focused all their energy and resources on constructing the normalized structures rather than allocating time to the development of a presentation area that supports improved business decision making. While we believe that enterprise-wide data consistency is a fundamental goal of the data warehouse environment, there are equally effective and less costly approaches than physically creating a normalized set of tables in your staging area, if these structures don’t already exist.

It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.

Regardless of whether we’re working with a series of flat files or a normalized data structure in the staging area, the final step of the ETL process is the loading of data. Loading in the data warehouse environment usually takes the form of presenting the quality-assured dimensional tables to the bulk loading facilities of each data mart. The target data mart must then index the newly arrived data for query performance. When each data mart has been freshly loaded, indexed, supplied with appropriate aggregates, and further quality assured, the user community is notified that the new data has been published. Publishing includes communicating the nature of any changes that have occurred in the underlying dimensions and new assumptions that have been introduced into the measured or calculated facts.
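The load-then-index sequence can be sketched with Python's built-in sqlite3 module. This is a small stand-in, not a real bulk loader: SQLite has no dedicated bulk loading facility, so executemany inside one transaction plays that role here, and the table name sales_fact and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (date_wk INTEGER, product_wk INTEGER, amount REAL)"
)

# Quality-assured rows handed over by the staging area (made-up values).
staged_rows = [(20240301, 1, 120.0), (20240301, 2, 75.5), (20240302, 1, 60.0)]

with conn:  # one transaction for the whole batch
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", staged_rows)
    # Index the newly loaded data for query performance.
    conn.execute("CREATE INDEX idx_sales_date ON sales_fact (date_wk)")

loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(loaded, "rows loaded and indexed")
```

Building the index after the rows are in place, rather than before, is a common choice for batch loads because the index is then built once instead of being maintained row by row.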

Data Presentation

The data presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical applications. Since the backroom staging area is off-limits, the presentation area is the data warehouse as far as the business community is concerned. It is all the business community sees and touches via data access tools.

You can create a data warehouse system for an organization using either of two approaches. In the first approach, you create and implement a central data warehouse first, with data marts created later. In the second approach, the data marts are implemented in such a way that the data warehouse works properly when their information is joined in the warehouse system. In both approaches, the design needs centralization for proper use and consistency of the organization’s data warehouse information. Data marts that are designed with central specifications can produce consistent reports even though the data is saved in different places.

FACT AND DIMENSION TABLES:-

FACT Tables:-

A multidimensional model contains two types of table: fact tables and dimension tables. The fact tables are the tables that store the business data, such as profit, loss, cost, sales and money transactions. The data in a fact table is known as facts. The facts represent the transactions and have attributes, which represent the data that describes the facts. The set of values assigned to the attributes of one fact is known as a tuple. For example, the attributes associated with the fact account transactions are:

Customer Name

Account Number

Type of Transaction

Amount of Transaction

The diagram below shows a schema with a fact table and dimensional tables:-

[Figure: one fact table linked to three dimension tables.]

A fact constellation is formed when two fact tables share common dimension tables.

The diagram below shows a fact constellation:

[Figure: fact tables Fact 1 and Fact 2, both linked to one dimension table.]

In the above fact constellation, Fact 1 and Fact 2 share a common dimension table.

Determining Fact Table:-

[Figure: the fact table PRODUCTAREADURATION (Product, Area, Duration) and its dimension tables. Product: Name of Product, Product Number, Description. Area: Area 1, Area 2, Area 3. Duration: Year, Beginning Date, Date of Completion.]

You need to identify various facts to determine the dimension table in a database. The process of determining a fact table in the database involves:

Identifying the transactions of interest

Identifying the dimensions for each fact

Verifying that a fact is not a dimension table

Verifying that a dimension table is not a fact

Identifying the transactions and dimensions for the fact:-

You need to identify the transactions of interest, which are essential events in the business. The transactions of interest are also known as elemental transactions. Examples are the records of the phone calls made by a customer, and the records of the account transactions made by a customer of a bank. The information about the transactions should consist of appropriate details. Having identified the elemental transactions, you need to identify the dimensions related to each fact in the fact table.

Fact Table Designing Process:-

The fact table can be of any size when sufficient budgets, hardware, and database are available. The database designer needs to maintain enough funds, when storing the details of the data. Factors that need to be considered when designing the fact table are:-

Determining the historical period for which the fact table is to be made. The historical table enables you to store the data related to the required historical period. The time for which the data is stored is also known as the data retention period.

Determining whether the collected data consists of the required details.

Minimizing the size of the columns in the fact tables.

Including the time factor in the fact table.

Subdividing a fact table that consists of larger amounts of data into smaller fact tables. The data in the smaller fact tables is easier to manage.

Determining the Retention Period:-

The data retention period is usually long, in the range of 5 to 10 years. The amount of data you store in the data warehouse affects query performance, which refers to the speed of data retrieval in response to various queries. You need to determine the level of detail of the data to be stored with query performance in mind.

Removing Inappropriate Columns from the Fact Table:-

The fact table needs to include only those entries that can respond to various queries for retrieving data. The data that can be removed from the fact table includes:

Replicated Data:- Refers to duplicate copies of data stored in the data warehouse.

Derived Data:- Refers to the data values that are derived using the already stored values.

Aggregated Data:- Consists of the aggregations, such as the sum of the data values, the total number of rows in the table and the average of the data values.

You need to remove the columns that do not provide any useful information from the fact table. For example, in a data warehouse that stores data regarding telephone calls, the useful columns of data are:

Source phone number

Destination phone number

Date of the call

Time of the call

Tariff

Duration of the call

Every data warehouse and data mart has one or more fact tables. Fact tables contain data that represent the business measurements of an organization. You can include financial events like cash flow transactions and expenditure details in a fact table. Fact tables contain numerous rows, sometimes thousands when they include one or more years’ financial details.

You can recognize fact tables by the numeric data in the table, which can be summarized to provide information about the organization’s history. Fact tables also contain many foreign keys, which are the primary keys of the related dimension tables that contain the attributes of the fact records. A fact table does not hold descriptive information; it holds only numeric fields and the index fields that relate to the dimension tables.

Here is a sample of the fact table orderdetail_fact used in creating the warehouse of Northwind.

COLUMN          DESCRIPTION
Ordered_wk      Surrogate key for orders
Order_id        Foreign key of orders
Amount          Amount of the order
Discount        Discount on the order
Order_date      Date of the order
Ship_date       Date of shipping
Custid_wk       Surrogate key for customer
Empid_wk        Surrogate key for employee
Shipid_wk       Surrogate key for ship
Shipperid_wk    Surrogate key for shipper
Regionid_wk     Surrogate key for region
Dw_auditid      Foreign key for audit table

The above entries represent the order detail on a specific date by the specific customer which is processed by a particular employee. It also gives the shipper details about who shipped the order.
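As an illustration, the orderdetail_fact layout described above could be declared and queried as follows, using Python's built-in sqlite3 module. The column names are taken from the table description; the column types and the two sample rows are assumptions made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orderdetail_fact (
        Ordered_wk   INTEGER,  -- surrogate key for orders
        Order_id     INTEGER,  -- foreign key of orders
        Amount       REAL,     -- amount of the order
        Discount     REAL,     -- discount on the order
        Order_date   TEXT,     -- date of the order
        Ship_date    TEXT,     -- date of shipping
        Custid_wk    INTEGER,  -- surrogate key for customer
        Empid_wk     INTEGER,  -- surrogate key for employee
        Shipid_wk    INTEGER,  -- surrogate key for ship
        Shipperid_wk INTEGER,  -- surrogate key for shipper
        Regionid_wk  INTEGER,  -- surrogate key for region
        Dw_auditid   INTEGER   -- foreign key for audit table
    )
""")

# Two made-up orders handled by the same employee (Empid_wk = 3).
sample_rows = [
    (1, 1001, 100.0, 0.0, "2024-03-01", "2024-03-04", 5, 3, 1, 2, 4, 1),
    (2, 1002, 250.0, 10.0, "2024-03-02", "2024-03-05", 7, 3, 1, 2, 4, 1),
]
conn.executemany(
    "INSERT INTO orderdetail_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Total order amount per employee.
totals = conn.execute(
    "SELECT Empid_wk, SUM(Amount) FROM orderdetail_fact GROUP BY Empid_wk"
).fetchall()
print(totals)
```

Apart from the two dates, every column is either a numeric measure or a key, so descriptive text about the customer or employee would be looked up in the related dimension tables.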

The most useful measure of the fact table is the additive numbers. They allow you to include the summary information by adding various quantities of the measure. The sales of a specific item for a group of stores in a particular time-period can be included in the additive numbers. Non additive measures such as inventory, quantity left can also be part of a fact table, but you need to use different summarization techniques.
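The difference between additive and non-additive measures can be shown with a small Python example. The figures are made up, and taking the latest stock level per store is just one possible summarization technique for an inventory measure.

```python
# (store, day, units_sold, units_in_stock) -- made-up values
rows = [
    ("S1", 1, 10, 90),
    ("S1", 2, 15, 75),
    ("S2", 1, 5, 40),
    ("S2", 2, 8, 32),
]

# Additive measure: units sold can be summed across stores and days.
total_sold = sum(r[2] for r in rows)

# Stock on hand: summing across days would count the same goods twice,
# so take the latest day's level per store and then add across stores.
latest = {}
for store, day, _, stock in rows:
    if store not in latest or day > latest[store][0]:
        latest[store] = (day, stock)
closing_stock = sum(stock for _, stock in latest.values())

print("units sold:", total_sold)
print("closing stock:", closing_stock)
```

Summing the stock column over all four rows would give 237, a figure that corresponds to nothing in the business, which is why such measures need their own summarization rules.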

Aggregation of Fact Tables:-

The process of deriving summarized data from detail records is known as aggregation. You can reduce the size of the table by aggregating the data. When data summarization takes place in the fact table, the analyst has no detailed information available. If you need detailed information, you need to identify and locate the detailed rows that were summarized in the source system that provided the data. You should maintain the fact tables at the finest granularity possible.

Mixing of aggregated and detailed data in the fact table causes problems when you use the data warehouse. For example, a purchase order often contains several line items and may contain a discount, tax, or cartage cost that applies to the order total instead of individual items. The quantities and item identification are recorded at the item level. Summarization queries become more complex in this situation, and tools such as Analysis Services often require the creation of special filters to deal with them.

You can use two different approaches for aggregation. In the first approach, you can allocate the order-level values to line items based on value, quantity, or shipping weight. In the second approach, you can create two fact tables, one that contains data at the line item level, the other containing the order-level information. You can relate the two tables by carrying the identification key in the detail fact table. You can use the order table as the dimension table for the detail table by considering the order-level values as attributes of the order-level table in the dimension hierarchy.
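The first approach, allocating an order-level value to line items, comes down to a proportional split. The sketch below allocates a cartage cost in proportion to line value; the function name allocate and the figures are assumptions for illustration, and quantity or shipping weight could be used as the basis in the same way.

```python
def allocate(order_level_amount, line_basis):
    """Spread an order-level amount over line items.

    Each line receives a share proportional to its entry in `line_basis`
    (line value here; quantity or shipping weight would work the same way).
    """
    total = sum(line_basis)
    return [order_level_amount * b / total for b in line_basis]

line_values = [60.0, 30.0, 10.0]   # made-up line item values
cartage = 20.0                     # made-up order-level cartage cost
allocated_cartage = allocate(cartage, line_values)
print(allocated_cartage)
```

In practice the shares may need rounding to currency units, in which case any rounding difference has to be assigned to one of the lines so that the allocated amounts still add up to the order-level total.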

We typically refer to the presentation area as a series of integrated data marts. A data mart is a wedge of the overall presentation area pie. In its most simplistic form, a data mart presents the data from a single business process. These business processes cross the boundaries of organizational functions.

We have several strong opinions about the presentation area. First of all we insist that the data be presented, stored and accessed in dimensional schemas.

Data in the queryable presentation area of the data warehouse must be dimensional, must be atomic, and must adhere to data warehouse bus architecture.

If the presentation area is based on the relational database, then these dimensionally modeled tables are referred to as star schema.

Star Schema:- The star schema consists of a fact table in the center and all the dimension tables attached to the center fact table. This arrangement of data resembles a star and is named as star schema.

The star schema is used in the data warehouse and the data marts. Each data mart holds a subset of the data stored in the data warehouse.
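A very small star schema can be sketched with Python's built-in sqlite3 module. The table names (sales_fact, product_dim, area_dim), the columns and the rows are all assumptions made up for this example; the point is only the shape, with one fact table in the centre joined directly to each dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_wk INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE area_dim    (area_wk INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales_fact (
        product_wk INTEGER REFERENCES product_dim (product_wk),
        area_wk    INTEGER REFERENCES area_dim (area_wk),
        amount     REAL
    );
    INSERT INTO product_dim VALUES
        (1, 'Tea', 'Beverages'), (2, 'Coffee', 'Beverages'), (3, 'Rusks', 'Snacks');
    INSERT INTO area_dim VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales_fact VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0), (1, 2, 7.0);
""")

# Slice the facts by one dimension attribute: total sales per category.
by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_wk = f.product_wk
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(by_category)
```

Grouping by region instead only requires swapping the joined dimension table, which is the kind of slicing and dicing the dimensional layout is meant to make easy.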

Dynamic Dimensions:-

Dynamic Dimensions are the dimensions that vary with time. The queries for retrieving data also change with the changes in dimensions. The queries are designed to determine whether the new business policies are successful or not. The execution of the queries compares the results of new business policies with the old business policies.

Snowflake Schema:- In a snowflake schema, the dimension tables are normalized. Normalization means that the large dimension tables are partitioned into multiple tables to remove redundant data, such as duplicate values. For example, you can partition the product dimension into two dimension tables, Product_category and Product_manufacturer.

The snowflake schema consists of a fact table and the dimension tables, to which some more dimension tables are connected.

[Figure: snowflake schema with the fact table Sales and the dimension tables Area, Time, Product and Customer.]

The above figure shows snowflake schema in which sales is the fact table. The four dimension tables are PRODUCT, AREA, TIME, CUSTOMER. The dimension table product is partitioned into product_category and product_manufacturer.

The snowflake schema also has some disadvantages. The large number of tables makes the data harder to manage, as it becomes difficult to retrieve data from multiple tables. The metadata also becomes complex in the case of a snowflake schema, as it needs to store data about multiple dimension tables.
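The effect of normalizing a dimension can be seen in a small sketch using Python's built-in sqlite3 module. The tables follow the product example in the text (product, product_category, product_manufacturer, with sales as the fact table); the column names and rows are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_category     (category_wk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE product_manufacturer (manufacturer_wk INTEGER PRIMARY KEY, manufacturer TEXT);
    CREATE TABLE product (
        product_wk      INTEGER PRIMARY KEY,
        name            TEXT,
        category_wk     INTEGER REFERENCES product_category (category_wk),
        manufacturer_wk INTEGER REFERENCES product_manufacturer (manufacturer_wk)
    );
    CREATE TABLE sales (
        product_wk INTEGER REFERENCES product (product_wk),
        amount     REAL
    );
    INSERT INTO product_category VALUES (1, 'Beverages');
    INSERT INTO product_manufacturer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO product VALUES (1, 'Tea', 1, 1), (2, 'Coffee', 1, 2);
    INSERT INTO sales VALUES (1, 10.0), (2, 20.0), (1, 5.0);
""")

# Reaching the manufacturer name now takes two joins instead of one.
by_manufacturer = conn.execute("""
    SELECT m.manufacturer, SUM(s.amount)
    FROM sales AS s
    JOIN product AS p ON p.product_wk = s.product_wk
    JOIN product_manufacturer AS m ON m.manufacturer_wk = p.manufacturer_wk
    GROUP BY m.manufacturer
    ORDER BY m.manufacturer
""").fetchall()
print(by_manufacturer)
```

The manufacturer name is stored once rather than on every product row, at the price of an additional join in each query that needs it.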

Starflake Schema:-

The starflake schema is the combination of the star schema and the snowflake schema. The starflake schema consists of a fact table, star dimensions and snowflake dimensions.

Following is the structure of starflake schema:-

[Figure: starflake schema. Fact table: Sales. Dimension tables: Time, Area, Product, Customer, Product_category, Product_manufacturer.]

Refreshing the Data Warehouse:-

You need to regularly refresh the data in a data warehouse, as data sources are updated at regular intervals. You use data sources such as an operational data store (ODS) and legacy systems to load the data into the data warehouse.

One way of loading the data into the data warehouse is to trap data in the legacy system as it is being updated. The two ways to trap the updated data in a legacy system are:-

Data Replication :- Uses triggers to trap the updated data in the legacy system. A trigger is a set of SQL statements that automatically executes an action when data in a legacy system changes. The action is used to store the data updated in the legacy system.

Change Data Capture (CDC) :- Uses a log tape to trap the updated data in the legacy system. A log tape stores all the changes that occurred throughout the day in a legacy system. You can use the log tape to trap the updated data in the legacy system.
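The trigger-based way of trapping updates can be sketched with Python's built-in sqlite3 module, which supports SQL triggers. SQLite stands in for the legacy system here, and the table names account and account_changes and the trigger name are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_no TEXT PRIMARY KEY, balance REAL);
    CREATE TABLE account_changes (account_no TEXT, old_balance REAL, new_balance REAL);

    -- Record every balance update so that a later warehouse load can pick it up.
    CREATE TRIGGER trap_account_update
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO account_changes VALUES (OLD.account_no, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES ('A1', 100.0), ('A2', 50.0);
    UPDATE account SET balance = 80.0 WHERE account_no = 'A1';
""")

changes = conn.execute("SELECT * FROM account_changes").fetchall()
print(changes)
```

Only the updated account appears in the change table, so the refresh process can load the trapped rows instead of re-reading the whole source table. The log-based alternative reaches the same goal by reading the system's change log rather than adding triggers to the source tables.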

