Data Warehousing

INDEX

1. Getting Started with Learning About Data Warehousing

2. A Definition of Data Warehousing

3. A Definition of Decision Support

4. The Case for Data Warehousing

5. The Case Against Data Warehousing

6. Actions for Data Warehouse Success

7. Data Warehousing Gotchas

8. Performing Data Warehousing Software Evaluations

9. An (Informal) Taxonomy of Data Warehouse Data Errors

10. Data Warehousing Political Issues

11. Different Aspects of Data Warehouse Architecture

12. What to Learn About in Order to Speed Up Data Warehouse Querying

13. What to Learn About in Order to Speed Up Data Warehouse Loading

14. How to Save Money on Your Data Warehousing Efforts

15. Using Data Warehousing in Strategic Decision Making

16. Maintenance Issues for Data Warehousing Systems

17. What Decision Support Tools are Used For

18. Is Web Data Analysis (i.e., Web Data Mining) Different?

Getting Started with Learning About Data Warehousing

If you are new to this field and the way you like to get into a new field is by getting an overview, I suggest that you:

Read the books "Building the Data Warehouse" by W. H. Inmon, "The Data Warehouse Toolkit" by Ralph Kimball, "Data Warehouse from Architecture to Implementation" by Barry Devlin, and "Data Warehousing in the Real World" by Sam Anahory and Dennis Murray

With due respect to all the other fine books on data warehousing and decision support, I believe these four books, when read in combination, provide a great introduction to and overview of the strategic and tactical issues system developers face (even though the books are several years old - despite what you read in the trade media, data warehousing does not change that much). Especially valuable are Inmon's overview and description of the iterative nature of data warehouse development, Kimball's description of data modeling principles and query/report tools, Devlin's descriptions of data extraction, cleaning, and loading issues and metadata, and Anahory/Murray's description of what can be done so a system can run efficiently and their description of the main tasks in a data warehouse project.

If you are a really ambitious reader, consider a few other titles. "The Data Warehouse Lifecycle Toolkit" by Ralph Kimball, et al., is a 700+ page, clearly written description of a methodology for constructing data warehouses. If you use Oracle, "Oracle8i Data Warehousing" by Gary Dodge and Tim Gorman provides practical technical advice that even a non-DBA can understand and appreciate. Finally, "Data Warehouse Design Solutions" by Christopher Adamson and Michael Venerable provides insight on model design for specific business problems.

(By the way, the above material contains the only recommendations of commercial products on this site. There is no commercial connection between this site and the authors or publishers of the books just cited.)

Visit a couple of organizations that have had warehousing systems in production for over a year

You will get an excellent education if you can ask an organization that 'has done it' what the biggest issues were that it faced in developing systems and what the biggest issues are that it faces in maintaining systems. Also, ask what the organization felt it did right and what it felt it could have done differently. I believe that if you do this you will learn a great deal about aspects of data warehousing that do not get discussed much in the literature - specifically the politics of data warehousing projects, the maintenance burdens data warehousing imposes, and how to deal with data warehousing software/hardware vendors and consultants.

Read up on some fundamental technical topics

You may find you will be greatly helped by reading up on SQL queries (especially multi-table and summary queries and subqueries), database indexing, join processing, and how query optimization works. Also helpful would be some knowledge about how logical structures can be created and how database partitioning can be used in conjunction with logical structures. There are many fine books on SQL. The latter knowledge (logical structures and partitioning) will most likely be found in books aimed at DBAs for specific commercial databases.
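To make the SQL topics above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their contents are invented for illustration; a real warehouse would use its own schema and database product.

```python
import sqlite3

# Hypothetical tables and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0),
                              (12, 2, 300.0), (13, 3, 25.0);
    -- An index on the join column gives the optimizer an alternative
    -- to scanning the whole orders table.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

# Multi-table summary query: join the two tables, then total by region.
sales_by_region = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Subquery: orders larger than the average order amount.
big_orders = conn.execute("""
    SELECT order_id
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY order_id
""").fetchall()

print(sales_by_region)
print(big_orders)
```

Prefixing a query with SQLite's EXPLAIN QUERY PLAN is one way to start seeing how an optimizer chooses between an index and a table scan; the output format differs between database products and versions.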

Build something!

Computer texts love to cite a (supposedly) Confucian quote "What I hear I forget. What I see I remember. What I do I understand." Well, this quote is apt in the case of learning about data warehousing. After you build something, no matter how modest, you will gain a more profound appreciation of the topic.

A Definition of Data Warehousing

My favored definition of a data warehouse is a slightly modified version of Ralph Kimball's definition on page 310 of The Data Warehouse Toolkit:

A data warehouse is a copy of transaction data specifically structured for querying and reporting.

Ralph states that a data warehouse is "a copy of transaction data specifically structured for query and analysis" (The Data Warehouse Toolkit, http://www.wiley.com/compbooks/catalog/15337-0.htm). Two quibbles I have with Ralph's definition are: 1) Sometimes non-transaction data are stored in a data warehouse - though probably 95-99% of the data usually are transaction data. 2) I say "querying and reporting" rather than "query and analysis" because the main output from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis.

For some more information about why the transaction data are copied, you may want to see my essay The Case for Data Warehousing. What I especially like about Ralph's definition is what he does not say.

The form of the stored data has nothing to do with whether something is a data warehouse.

A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity.

Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

Of course if you want to define every user as a decision maker and all activities as decision making processes, then my assertion is false. But in my experience, the overwhelming majority of uses of data warehouses are for quite mundane, non-decision making purposes rather than as grist for making decisions with wide ranging effects (so-called "strategic" decisions). In fact, I would assert that most data warehouses are used for post-decision monitoring of the effects of decisions (or as some people might say, for "operational" issues). By the way, this is not saying that using data warehousing in the decision making process is not a wonderful, potentially high return effort. But my caution is that though the trade press, vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision making, this is an area that in reality we do not have a clear understanding of. (See the writing of Peter Keen for more on this perspective.)

A Definition of Decision Support

The term decision support, if my knowledge of the history of this area is correct, goes back to the 1970s when it was coined by some academics associated with the Massachusetts Institute of Technology. Since then, many academic definitions have been offered. My purpose in this essay is to provide a definition that may lend clarity to practitioners.

A decision support system or tool is one specifically designed to allow business end users to perform computer generated analyses of data on their own.

I believe the essence of decision support is, in the language of the 1960s, to allow end users to do their own thing. I note that this definition is still fuzzy because what constitutes analyses and "on their own" are debatable points.

We cannot say that decision support systems or tools necessarily support the making of decisions.

What's in a name? As far as I know, cognitive researchers do not agree on how decisions are made. Therefore, saying that these tools support making decisions is not a provable statement. Nor is it, in my opinion, an insightful way of defining these tools.

These tools do not analyze by themselves - rather they help a person analyze

http://www.computerworld.com/cwi/story/0,1199,NAV47_STO11306,00.html

http://www.dwinfocenter.org/casefor.html

In other words, the tools facilitate analyses rather than perform analyses. If you want to learn more about how the tools facilitate analyses, see my essay on What Decision Support Tools are Used For.

Data warehousing and decision support systems and tools do not necessarily go hand in hand.

Many data warehouses are not used as decision support systems. And decision support systems or tools do not necessarily require the use of a data warehouse as a source for data. I assert that, by far, the most used decision support tools are spreadsheets not connected in any automated way with a data warehouse.

Business intelligence seems to have become the vendors' preferred synonym for decision support

My guess is that this is because decision support has an academic connotation and, as just mentioned, decision support systems do not necessarily support decisions. On the other hand, business intelligence systems do not necessarily make a business more intelligent. By the way, the consultant-coined term business intelligence goes back to the late 1980s, fell out of use, and then was revived by the DW/DSS world in the late 1990s. Confusingly, business intelligence is also used as a synonym for competitive intelligence (and is probably a more apt term for that area). By the way, "analytics" seems to be an up and coming name for this area - despite the mid-1990s consultant-coined term "analytical applications" never taking hold.

The Case for Data Warehousing

The following is a list of the basic reasons why organizations implement data warehousing. This list was put together because too much of the data warehousing literature confuses "next order" benefits with these basic reasons. For example, spend a little time reading data warehouse trade material and you will read about using a data warehouse to "convert data into business intelligence", "make management decision making based on facts not intuition", "get closer to the customers", and the seemingly ubiquitously used phrase "gain competitive advantage". In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives. The basic reasons organizations implement data warehouses are:

To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems

Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Reports and queries can require a much greater range of limited server/disk resources than transaction processing. When they run on the servers/disks used by transaction processing systems, they can lower the probability that transactions complete in an acceptable amount of time. Or, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage servers/disks so there is a high enough probability that acceptable response time can be achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.

To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing

http://www.dwinfocenter.org/whatfor.html

There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) and may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and reporting processing but may slow down transaction processing (e.g., bit-mapped indexing) and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). Do note that whether and by how much a modeling technique or server technology is a help or hindrance to querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.
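As a rough illustration of the star schema mentioned above, the following sketch builds one fact table surrounded by two dimension tables in an in-memory SQLite database. All table names, column names, and rows are hypothetical, and the helper revenue_by_category is invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes that users filter and group by.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);

    -- Fact table: one row per sale, holding dimension keys plus numeric measures.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units       INTEGER,
        revenue     REAL
    );

    INSERT INTO dim_date    VALUES (1, 1997, 1), (2, 1997, 2);
    INSERT INTO dim_product VALUES (1, 'Widgets'), (2, 'Gadgets');
    INSERT INTO fact_sales  VALUES (1, 1, 5, 50.0), (1, 2, 1, 30.0),
                                   (2, 1, 2, 20.0), (2, 2, 4, 120.0);
""")


def revenue_by_category(connection, year):
    """Total revenue per product category for one year.

    The query has the typical star-join shape: the fact table is joined
    to each dimension it needs, filtered on one dimension attribute and
    grouped on another.
    """
    rows = connection.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_date    AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        WHERE d.year = ?
        GROUP BY p.category
        ORDER BY p.category
    """, (year,)).fetchall()
    return dict(rows)


print(revenue_by_category(conn, 1997))
```

The same layout would be awkward for transaction processing: every sale touches a wide, heavily indexed fact table, and the descriptive attributes are deliberately repeated in the dimensions rather than normalized.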

To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel

Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, less technically knowledgeable personnel often "hit a complexity wall" and need IS help. IS, however, may also be able to more quickly write and maintain queries and reports written against data warehouse data. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.

To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems

Please read my essay on An informal taxonomy of data warehouse data errors for an explanation of the type of "errors" that need cleaning up. The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.

To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only

For a long time firms that need reports with data from multiple systems have been writing data extracts and then running sort/merge logic to combine the extracted data and then running reports against the sort/merged data. In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sort/merged frequently, if data purged from transaction processing systems needs to be reported upon, and most importantly, if the data need to be "cleaned", data warehousing may be appropriate.

To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time

http://www.dwinfocenter.org/errors.html

Older data are often purged from transaction processing systems so the expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse, where there presumably is less of a need to control expected response time or the expected response time is at a much higher level.

As for "as was" reporting, sometimes it is difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to do this because you only have a record of current employee grade level. To be able to handle this type of reporting problem, firms may implement data warehouses that handle what is called the "slowly changing dimension" issue.
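One common way to handle the "slowly changing dimension" issue is to store a new row each time an attribute changes, with the dates during which that row applied (often called a "Type 2" approach in Kimball's terminology). The sketch below applies that idea to the grade Level 3 example; the employee_history table, its columns, its rows, and the function salaries_at_grade are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_history (
        employee_id INTEGER,
        grade_level INTEGER,
        salary      REAL,
        valid_from  TEXT,  -- first day this version applies (ISO date)
        valid_to    TEXT   -- first day it no longer applies; 9999-12-31 = current
    );
    INSERT INTO employee_history VALUES
        (1, 2, 40000, '1996-01-01', '1997-03-01'),
        (1, 3, 45000, '1997-03-01', '9999-12-31'),
        (2, 3, 47000, '1996-06-01', '1997-07-01'),
        (2, 4, 52000, '1997-07-01', '9999-12-31');
""")


def salaries_at_grade(connection, grade_level, as_of):
    """Return (employee_id, salary) pairs as they stood on the given date.

    ISO-formatted dates compare correctly as plain strings, so the
    half-open interval test valid_from <= as_of < valid_to works in SQL.
    """
    return connection.execute("""
        SELECT employee_id, salary
        FROM employee_history
        WHERE grade_level = ?
          AND valid_from <= ?
          AND ? < valid_to
        ORDER BY employee_id
    """, (grade_level, as_of, as_of)).fetchall()


# Grade Level 3 salaries as of the first day of each month in 1997.
monthly = {}
for month in range(1, 13):
    as_of = f"1997-{month:02d}-01"
    monthly[as_of] = salaries_at_grade(conn, 3, as_of)
    print(as_of, monthly[as_of])
```

A table that kept only the current grade level could not answer this question, because employee 2 would appear only at grade 4 and employee 1 only at grade 3.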

To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be interesting to firms that want to allow reporting and querying only over the Internet.

Some firms implement data warehousing for all the reasons cited. Some firms implement data warehousing for only one of the reasons cited.

By the way, I am not saying that a data warehouse has no "business" objectives. (I grit my teeth when I say that because I am not one to assume that an IT objective is not a business objective. We IT people are businesspeople too.) I do believe that the achievement of a "business" objective for a data warehouse necessarily comes about because of the achievement of one or many of the above objectives.

If you examine the list you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations of transaction processing systems are not, however, inherent. That is, the limitations will not be in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are.

Finally, to repeat the point I made initially, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how to change business practices to best use the data warehouse and then to change their business practices. And that can be harder than implementing a data warehouse.

The Case Against Data Warehousing

The literature is full of testimonials for data warehousing. There is almost nothing about the arguments against data warehousing. In this paper I attempt to slightly fill that void by shedding light on business and cultural factors that greatly lessen the value of data warehousing for certain organizations. By the way, when I refer to data warehousing, I refer to both centralized data warehousing systems and data marts. Some of the reasons data warehousing efforts may not be appropriate for certain organizations are:

Data warehousing systems, for the most part, store historical data that have been generated in internal transaction processing systems. This is a small part of the universe of data available to manage a business. Sometimes this part has limited value.

That is, sometimes the business end user community does not have a strong interest in old transaction processing system data beyond what is available in basic reports generated in transaction processing systems. This lack of interest often stems from the fact that the markets in which a business competes are in great flux or that the internal structure of the organization is in perpetual transition. If these conditions exist, there may not be a solid historical base to compare current performance with. Also, sometimes there is a lack of interest in looking at this data in any in-depth way because a business is so simple that a data warehouse is overkill.

Data warehousing systems can complicate business processes significantly.

Though the interest in business process reengineering seems to have waned, some of the appreciation of how complicated processes can slowly strangle a business has remained. Data warehousing, if unchecked, can foster the "institutionalization" of easily created reports whose reason for being is quickly forgotten while people still toil to process these reports. If your organization does not know how to throw out processes (pardon my calling producing, distributing, and reading a report a "process"), data warehousing can quickly add clutter to the business environment.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, data warehousing may not be for your business.

Whew! You can say that again. Anyway, you may find that the more of these conditions are met, the less value data warehousing may add to your firm. And once you get away from the big "Fortune 500, centralized IS" type shops most of the data warehousing vendors slant their marketing to, these conditions describe the reporting needs of many firms.

Data warehousing can have a learning curve that may be too long for impatient firms.

Despite the speed of the data warehousing development effort, it takes time for an organization to figure out how it can change its business practices to get a substantial return on its data warehousing investment. I speculate that rigorous analysis of the return on most of the major data warehousing implementers' investments would find a much longer average payback period than you would surmise from reading the trade press.

Data warehousing can become an exercise in data for the sake of the data.

Organizations find that there are unlimited opportunities to add data to their data warehouse. Data warehouses, like most other complex systems, take on a life of their own. Unfortunately, adding data without questioning the business value of the data can lessen the business value of the data warehouse and quickly increase the cost of maintaining the data warehouse.

In certain organizations ad hoc end user query/reporting tools do not "take".

This is of concern to organizations that believe they can get their return on investment by having users write many of their own queries and reports. In some firms there are profound cultural barriers in the business organization to the acceptance of a tool that allows a person to ask questions on his own. Trying to promote the use of such a tool in these organizations is setting yourself up for failure. Or, sometimes these tools do not take because a business is so complicated that only relatively simple reports with little business value can be written by end users.

Many "strategic applications" of data warehousing have a short life span and require the developers to put together a technically inelegant system quickly. Some developers are reluctant to work this way.

Again, the importance of the culture should not be underestimated. This time, though, the issue is in the IS organization. If your sell of the data warehousing project is the ability to do this strategic work (which is probably now being done by your users with large and complex spreadsheets) as opposed to the usual development of canned and semi-canned reports and queries, ask yourself if the IS culture can accept this mode of working. For many organizations this approach to systems work is much harder to accept than most people realize.

There is a limited number of people available who have worked with the full data warehousing system project "life cycle".

I refer to availability of both employees and consultants. Systems of some depth require a considerable amount of time to develop fully. In other words, it takes a long time to gain experience with the usual problems that develop at different phases of a data warehousing effort. You should be wary of a consultant who says he has experience implementing scores of data warehouses in a couple of years. Usually this experience will be with a well-defined part of a data warehousing project that was amenable to outsourcing or with minor projects.

Data warehousing systems can require a great deal of "maintenance" which many organizations cannot or will not support.

Despite the best efforts to architect a system so "maintenance" demands are minimized (in quotation marks because it often seems there is never the closure to the initial data warehousing effort that the term "maintenance" implies), many systems by their very nature require a great deal of care and feeding once they are in "production". It is important to note that the more successful a warehouse is with the users, the more maintenance it may require. Organizations that cannot or will not staff to meet these maintenance demands should think twice before they jump into the data warehousing business. By the way, it's very easy for the users to quickly go sour on a system they were enthusiastic about at roll-out time if the system personnel do not support the maturing of the system.

Sometimes the cost to capture data, clean it up, and deliver it in a format and time frame that is useful for the end users is too much of a cost to bear.

The percentage of time that must be devoted to extracting, cleaning, and loading data has been well discussed in the literature. It should be pointed out that there are some potential "show-stoppers" in these efforts. Loading data from previous years can require the knowledge of transaction processing system developers who have long since moved on. Cleaning data so they are in a form that is acceptable to users from different functional areas may require arbitration skills the typical data warehousing developer may not possess. Finally, data may have to be loaded into a data warehousing system in a processing window that just isn't big enough. Sometimes compromises are acceptable get-arounds. Often, though, compromises end up substantially compromising the value of the information in the data warehouse.

You may have gotten the impression from reading the trade press that data warehousing is only for large organizations because it requires huge staffs and huge budgets. Well, most of the trade press is dominated by vendors/consultants/publications trying to market to large organizations with huge staffs and huge budgets. Though I have no way to prove this, in terms of numbers, I think most data warehousing efforts are done by small staffs with modest budgets. In fact, smaller organizations are probably much more "into" data warehousing than larger organizations. It is only recently that practical technology for huge organizations that lust for multi-terabyte databases has become available. The technology for more modestly sized data warehouses, on the other hand, has been available for many years.

Finally, you may have seen articles that state that data warehousing failure rates are between 10% and 90%. Though how these failure rates are determined is suspect, there is no denying that data warehousing is risky. Now the fact that these efforts are risky does not bolster the case against data warehousing. Data warehousing has not repealed the positive relationship between risk and expected return in capital projects. However, if your organization does not know how to manage risky projects, then data warehousing may not be for you.

Actions for Data Warehouse Success

The following are some suggestions for the warehouse builder. These are points I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing.

From day one establish that warehousing is a joint user/builder project

Warehouse projects will fail if the builders get specs from the users, go off for 6 months, and then come back with the 'finished' project. Warehouses are iterative! (I think the word iterative means there are lots of mistakes in the projects.) Builders and users working with each other will not reduce the number of iterations, but it will reduce the size of them. By the way, see Peter Block's Flawless Consulting for a great discussion of how to bring about 'joint' projects.

Establish that maintaining data quality will be an ONGOING joint user/builder responsibility

Organizations undertaking warehousing efforts almost continually discover data problems. Best to establish right up front that this project is going to entail some additional ongoing responsibility.

Train the users one step at a time

Typically users are trained once. In several days they learn both the basics and intermediate and sometimes advanced aspects of using a tool. Slow down! Consider providing training initially in the minimum needed for the user to get something useful from the tool. Then let the user use the tool for a while (meaning several days, weeks, or months). Having basic training and some hands on experience, the user will have a much better context with which to grasp the next level. Also, once the basics and the next level are learned, keep training the users! After a year using the tool, schedule advanced training.

Train the users about the data stored in the data warehouse

Users often need more training about the stored data than about the tools used to access the data. Do not assume the data are self-explanatory or that any metadata you may provide will answer any questions. Note that users are often used to seeing data in canned reports and seeing data in its "raw" form can be confusing.

Consider doing a high level corporate data model / data warehouse architecture "exercise" in three weeks

Actually, the key point regarding time is to "time-box" the exercise into a relatively short time. After about three weeks, the marginal benefits from additional time devoted to these types of exercises rapidly decrease. The corporate model is going to identify, at a high level, subjects and relationships and, most importantly, the chunks of information that it makes sense to deliver in different projects. The architecture part of the exercise is to determine the dimensions, definitions of derived data, attribute names, and information sources that you will attempt to use consistently in your data warehousing efforts. The exercise also consists of coming to an agreement as to how to keep the corporate model up-to-date and how to make sure future data warehousing efforts pay attention to the architectural principles.

Implement a user accessible automated directory to information stored in the warehouse

The majority of successful warehousing efforts I have seen included providing some means for the warehouse user to locate stored information. Most of the time this involved building a separate database with directory information. And most of the time, a pretty simple database sufficed for initial use.
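To show how simple such a directory database can be, here is a minimal sketch using SQLite. The directory table, its columns, the sample entries, and the find_information helper are all hypothetical; they are one possible layout, not a description of any particular product.

```python
import sqlite3

directory = sqlite3.connect(":memory:")
directory.executescript("""
    -- One row per warehouse column, with a plain-language description
    -- and the feeder system the data came from.
    CREATE TABLE directory (
        table_name    TEXT,
        column_name   TEXT,
        description   TEXT,
        source_system TEXT
    );
    INSERT INTO directory VALUES
        ('fact_sales',   'revenue', 'Invoiced amount in US dollars',         'order entry'),
        ('fact_sales',   'units',   'Number of units shipped',               'order entry'),
        ('dim_customer', 'region',  'Sales region assigned to the customer', 'customer master');
""")


def find_information(connection, keyword):
    """Return (table, column) pairs whose name or description mentions keyword."""
    pattern = f"%{keyword}%"
    return connection.execute("""
        SELECT table_name, column_name
        FROM directory
        WHERE description LIKE ? OR column_name LIKE ?
        ORDER BY table_name, column_name
    """, (pattern, pattern)).fetchall()


print(find_information(directory, "region"))
print(find_information(directory, "units"))
```

A keyword search like this is usually enough for a first release; richer metadata can be added once users show what they actually look for.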

Once you know what raw data you want to feed into the data warehouse, request that data

If you have done some reading on data warehouse development you probably have read that figuring out the process of extracting, transforming, and loading (ETL) usually takes the majority of the time in initial data warehouse development. In project management lingo, figuring out ETL is usually on the critical path. If you know what raw data you need, request it as soon as you know it. You are probably going to have to ask one of the programmers of the legacy feeder systems to initially get this data for you. For reasons of politics, overwork, and just plain lack of knowledge of how data are physically stored in a system, the feeder system programmer often can take a while to get you that data.

Determine a plan to test the integrity of the data in the warehouse

Do not underestimate the importance of user faith in the integrity of the warehouse data. Huge warehouse efforts quickly go sour if after system roll-out users find multiple mistakes. A good investment of time in the initial stages of a warehouse project is for the builder and user to jointly determine what checks will be made on the warehouse data during development and what checks need to be made on an ongoing basis. The checks include tying warehouse data controls back to controls in feeder systems, checking the correctness of aggregation logic, and testing whether classification codes were assigned correctly.
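The three kinds of checks just listed can be sketched in a few small functions. This is only an illustration under simple assumptions: records are plain Python dictionaries, the field names (amount, region) are invented, and amounts are compared to within half a cent.

```python
from math import isclose


def check_controls(feeder_rows, warehouse_rows, amount_field="amount"):
    """Tie warehouse control totals (row count, summed amount) back to the feeder."""
    problems = []
    if len(feeder_rows) != len(warehouse_rows):
        problems.append(
            f"row count: feeder={len(feeder_rows)}, warehouse={len(warehouse_rows)}")
    feeder_total = sum(r[amount_field] for r in feeder_rows)
    warehouse_total = sum(r[amount_field] for r in warehouse_rows)
    if not isclose(feeder_total, warehouse_total, abs_tol=0.005):
        problems.append(
            f"amount total: feeder={feeder_total}, warehouse={warehouse_total}")
    return problems


def check_aggregates(detail_rows, summary, key_field, amount_field="amount"):
    """Recompute summary totals from detail rows; return keys that disagree."""
    recomputed = {}
    for row in detail_rows:
        key = row[key_field]
        recomputed[key] = recomputed.get(key, 0.0) + row[amount_field]
    return sorted(
        key for key in set(recomputed) | set(summary)
        if not isclose(recomputed.get(key, 0.0), summary.get(key, 0.0), abs_tol=0.005)
    )


def check_codes(rows, code_field, valid_codes):
    """Return rows whose classification code is not in the agreed list."""
    return [row for row in rows if row[code_field] not in valid_codes]


# Hypothetical feeder extract and a warehouse load that lost one row.
feeder = [
    {"region": "East", "amount": 100.0},
    {"region": "West", "amount": 250.0},
    {"region": "East", "amount": 50.0},
]
warehouse = feeder[:2]

print(check_controls(feeder, warehouse))
print(check_aggregates(feeder, {"East": 150.0, "West": 200.0}, "region"))
print(check_codes(feeder, "region", {"East"}))
```

Checks like these are cheap to run after every load, which fits the point that some of them need to be made on an ongoing basis and not only during development.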

From the start get warehouse users in the habit of 'testing' complex queries

Many people will assume that the query result is correct. At the very least, get the user in the habit of eyeballing the query or report to check if several records that should be included are, in fact, included and that several records that should not be included are, in fact, not included.
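
The eyeballing habit can also be written down as a small check. The sketch below is only an illustration in Python. It assumes the user can name a few record identifiers that should appear in the result and a few that should not. The function and argument names are hypothetical.

```python
def spot_check(result_ids, must_include, must_exclude):
    """Compare a query result against records the user knows should or should not appear."""
    result = set(result_ids)
    return {
        "missing": sorted(set(must_include) - result),      # expected but absent
        "unexpected": sorted(set(must_exclude) & result),   # present but should not be
    }

# Hypothetical order numbers returned by a query.
print(spot_check([101, 102, 105], must_include=[101, 103], must_exclude=[105, 200]))
```

An empty "missing" list and an empty "unexpected" list do not prove the query is right, but a non-empty one proves it is wrong.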

Coordinate system roll-out with network administration personnel

Use of data warehousing systems can bring about some strange spikes in network activity. If you keep network administration people informed of the roll-out schedule, chances are they will monitor network activity for you and be ready to make adjustments to the network as necessary.

Have a good grasp of desktop databases and spreadsheets

Even if you are dealing with a 100 TB database, there are so many little tasks to be done in a data warehousing project where knowledge of these tools will be helpful. Skillful use of these tools during development can be a huge productivity enhancer.

Be prepared to support beginning users immediately and at any time

We developers often greatly underestimate users' hesitation to begin using the data warehouse. This hesitation could be because of user fear of technology or user fear that they will not get IS support. So, the first point is to be available to help when the user wants to try to use the data warehouse the first time. Users also may want to use the data warehouse for the first time during the weekend or at 6:00 in the morning or 8:00 at night. The distractions are less at those times. If you want to make that beginning user a committed customer of your data warehouse, you had better be available to support the user when he starts out, whatever the day or the hour.

Maintain the audit trail to the feeder systems

That is, make it as easy as possible to tie the data in the data warehouse to the feeder systems. Your users have to trust the numbers in the data warehouse. You owe this to the users in order to maintain their trust.

Market and sell your data warehousing systems

For the most part, use of data warehousing systems is optional. This means you have to identify the potential users of the systems, help them understand what the benefits of the system are, and then make them want to keep coming back to use the system.

Data Warehousing Gotchas

Here are some points for the warehouse builder I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing. Forewarned is forearmed!

You are going to spend much time extracting, cleaning, and loading data

The usual figure quoted is that 80% of the time building a data warehouse will be spent on this type of work. (No one has ever explained how this percentage was obtained though.) Suffice it to say, though, the amount of time on these tasks is often grossly underestimated. Note that this point is about extracting and cleaning and loading. Though by now many people are aware that cleaning the data is complex, extracting data and loading data are equally, if not more, complex.

Despite best efforts at project management, data warehousing project scope will increase

To paraphrase data warehousing author W. H. Inmon, traditional projects start with requirements and end with data. Data warehousing projects start with data and end with requirements. Once warehouse users see what they can do with 2000's technology, they will want much more. (Which is fine!) One piece of advice for the warehouse builder is never to ask the warehouse user what information he wants. Rather, ask what information he wants next.

You are going to find problems with systems feeding the data warehouse

Problems that have gone undetected for years will pop up. You are going to have to make a decision on whether to fix the problem in what you thought was the 'read-only' data warehouse or fix the transaction processing system.

You will find the need to store data not being captured by any existing system

A very common problem is to find the need to store data that are not kept in any transaction processing system. For example, when building sales reporting data warehouses, there is often a need to include information on off-invoice adjustments not recorded in an order entry system. In this case the data warehouse developer faces the possibility of modifying the transaction processing system or building a system dedicated to capturing the missing information.

You will need to validate data not being validated by transaction processing systems

Typically once data are in the warehouse many inconsistencies are found with fields containing 'descriptive' information. For example, many times no controls are put on customer names. Therefore, you could have 'DEC', 'Digital', and 'Digital Equipment' in your database. This is going to cause problems for a warehouse user who expects to perform an ad hoc query selecting on customer name. The warehouse developer, again, may have to modify the transaction processing systems or develop (or buy) some data scrubbing technology.
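
To make the customer name problem concrete, here is a minimal Python sketch of the simplest kind of home-grown scrubbing: a hand-maintained alias table. The table contents and the function name are assumptions for illustration. Real scrubbing tools do far more than this.

```python
# Hypothetical alias table, maintained by hand as new variants are discovered.
CUSTOMER_ALIASES = {
    "dec": "Digital Equipment",
    "digital": "Digital Equipment",
    "digital equipment": "Digital Equipment",
}

def canonical_customer(raw_name, aliases=CUSTOMER_ALIASES):
    """Collapse whitespace, ignore case, and map known variants to one name."""
    cleaned = " ".join(raw_name.split())
    return aliases.get(cleaned.lower(), cleaned)

for name in ["DEC", "  digital ", "Digital Equipment", "Acme  Corp"]:
    print(repr(name), "->", canonical_customer(name))
```

Names that are not in the table pass through with only their whitespace tidied, so unknown variants remain visible for someone to review.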

Some transaction processing systems feeding the warehousing system will not contain detail

This problem is often encountered in customer or product oriented warehousing systems. Often it is found that a system which contains information that the designer would like to feed into the warehousing system does not contain information down to the product or customer level. By the way, this is what some people label a 'granularity' problem.

You will underbudget for the resources skilled in the feeder system platforms

In addition to understanding the feeder system data, you may find it advantageous to build some of the "cleaning" logic on the feeder system platform if that platform is a mainframe. Often cleaning involves a great deal of sort/merging - tasks at which mainframe utilities often excel. Also, you may find that you want to build aggregates on the mainframe because aggregation also involves substantial sorting.

Many warehouse end users will be trained and never or seldom apply their training

I once read a study that claimed that only one quarter of the people who get training in a query tool actually become heavy users of the tool.

After end users receive query and report tools, requests for IS written reports may increase

This phenomenon was seen with many of the information centers of the 1980s. It comes about because the query and report tools allow the users to gain a much better appreciation of what technology could do. However, for many reasons the users are unable to use the new tools themselves to realize the potential. By the way, if this happens do some honest research on why. Granted there are many reports that are so complex that IS expertise is going to be required no matter what tool the end user has. However, many times this phenomenon points to training needs.

Your warehouse users will develop conflicting business rules

Many warehouse tools allow users to perform calculations. The tools will allow users to perform the same calculation differently. For instance, suppose you are summarizing beverage sales by flavor category. Also suppose that the flavor category includes cherry and cola. If you have a cherry cola brand there is a chance that two users will classify the brand in different categories. You will find that there are means to incorporate some of the business rules in your warehouse. However, the number of possible business rules is so large that you will not be able to incorporate all rules.

Your warehouse users may not know how to use data

After many years of using whatever reports have been thrown in their faces, the users may not know what data to use their newfangled decision support tools to retrieve. To use a phrase from pop sociology, the users have been "culturally conditioned" to use what they are given and to never ask for more.

Large scale data warehousing can become an exercise in data homogenizing

Data have quirks! Sometimes when we developers combine detailed data for different subjects, in our efforts to make everything 'fit' we can take the life out of the data. For instance, if your company sells dog food and auto tires, you want to be careful if you are building a sales data warehouse for both lines of business. You have to make a judgment call as to whether these businesses fit the same logical and/or physical model.

'Overhead' can eat up great amounts of disk space

A popular way to design decision support relational databases is with star or snowflake schemas. Persons taking this approach usually also build aggregate fact tables. If there are many dimensions to the data, be aware that the combination of the aggregate tables and indexes to the fact tables and aggregate fact tables can eat up many times more space than the raw data. If you are using multidimensional databases, be aware that certain products pre-calculate and store summarized data. As with star/snowflake schemas, storage of this calculated data can eat up far more storage than the raw data.

The time it takes to load the warehouse will expand to the amount of the time in the available window... and then some

You'll do yourself well by understanding the different ways to approach updating the warehouse. Before you decide that you can do complete refreshes, be aware that "There's all day Sunday to load the database!" have been famous last words of more than a handful of warehouse developers.

You are going to have a tough problem with security - especially if you make your data warehouse Web-accessible

You are going to face a paradox - the more accessible you make your data warehouse (and by accessible, I don't just mean making it Web accessible - I mean architecting it in a way that people want to use it), the greater security risk you are exposing yourself to. Frankly, restricting people to "need to know" does not cut it in the organization of the 2000s. But, on the other hand, exposing information to theft from anyplace in the globe is not too great for job security either.

The data warehouse data you do not reconcile with the feeder systems will cause the problems

For certain data warehouse data you are going to think that there is no logical way that data in the feeder systems can be reconciled with what is in the warehouse. Then, when a user looks at a report and tells you "I think there is a problem", it will be with the unreconciled data. Unfortunately, you will then discover there is a way, albeit roundabout, to reconcile the data.

You are building a HIGH maintenance system

Reorganizations, product introductions, new pricing schemes, new customers, changes in production systems, etc. are going to affect the warehouse. If the warehouse is going to stay 'current' (and being current will be a big selling point of the warehouse), changes to the warehouse have to be made fast.

You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer

If you provide a system that is fast and technically elegant but adds little value or has suspect data, you will probably lose your customer from day one and will have a tough time getting him back. For the most part, use of data warehousing systems is optional. The customer has to want to use the system.

Performing Data Warehousing Software Evaluations

Here are some ideas that may make the process of evaluating data warehousing software more effective. This is not a comprehensive list of tasks to follow in a technology evaluation. Rather, these are points that seem to be rarely discussed or followed in this wave of interest in data warehousing. An excellent paper to read along with this essay is Nigel Pendse's How not to buy an OLAP product - which has advice that, for the most part, is applicable to buying any sort of data warehousing/decision support technology.

Do the evaluation yourself

That is, do not rely solely (or even in large part) on the ideas of someone outside your organization. There is no "metaphysically" best technology out there. All technologies have to be evaluated in the context of your organization's needs, expectations, limitations, and resources - which you know better than any outsider. Also, you can never be sure of the outsider's biases. Outsiders' main worth really comes from their knowledge of criteria you can use in the evaluation - though you have to decide the weight of each criterion.

Always first ask whether technology already in-house can do the job

Successful data warehousing/decision support systems can often be built without the specialized tools you see listed in this site. Taking on additional technology in your organization always imposes some burdens that should always be recognized before you hand over your organization's money.

Get references

Talking to reference sites is one of the most effective means of getting practical information. You would be surprised how important operational issues surface while doing evaluations. Some hints on reference gathering practices that have worked for me are:

Ask the software vendor for a complete list of referenceable sites - Try to have options as to which organizations you will call.

If this is a major decision for your company, call 5-6 sites - You need a minimum number of sites to help you detect patterns.

Make a telephone appointment to talk with the reference - The reference will appreciate this.

Plan on 20 minutes with the reference - Again the reference will appreciate this.

Ask open-ended questions - You will find some interesting information with skillful questions.

Send your questions to the reference in advance - Some of the references will be more comfortable if they know what you'll be asking.

Send a thank you note to your references asking if it would be okay to make a quick follow-up call if necessary - This will lay the groundwork if you have to call about another issue.

If you are going to see multiple vendor demos, build a test case that each vendor will follow

http://www.olapreport.com/How_not_to_buy.htm

This will allow you to compare apples to apples and peaches to peaches. Leave some open time at the end of the demo so the vendors can show features that were not covered well in the test case. One more point. Because departing from the standard vendor dog and pony show takes time on the part of the vendor, many will be unwilling to do this unless you are talking about a major purchase.

Be skeptical of data warehousing pundits' endorsements or reviews of technology

Often these pundits get compensated handsomely for these objective-appearing endorsements or reviews.

Read stock analyst reports on publicly held vendors and the industry outlook

Though these reports are intended mainly to get people to buy stocks, many times these reports can be an excellent source of background information on a vendor. Many libraries will have a large collection of these reports stored on CD.

Check how well the software handles maintenance

Most of the time spent with a software tool will be with maintenance. See how well the tool handles changes. For instance, most tools work with something like a data dictionary. See what the consequences are of changing the name of a field in the data dictionary. See how the dictionary helps you locate and change queries, reports, forms, macros, etc. that may be affected by the name change.

Understand the tradeoffs the software makes

Usually there is not a free lunch! Designers of tools trade off speed, capacity, computer resource consumption, ease of development, ease of use, and ease of maintenance. For example, several report and query tools can be made quite accessible to end users if you are willing to maintain extensive data dictionaries. Several OLAP tools attain quick retrieval times by requiring the storage of huge amounts of pre-calculated numbers. To prevent some nasty surprises once the tool has been purchased, make sure the persons making the buying decision understand these tradeoffs.

Go to the vendor road shows to talk with other attendees

Sometimes I think that the audience at the vendor road shows is the best source of information. If you make a point of talking with several other attendees, chances are you will come across a person who is at the same stage in evaluating warehousing tools. You will find that you and that person can exchange information that is mutually beneficial.

Check the financial stability of the vendor

If you work for an organization with an accounts receivable department, the people in that department can help you with this. A simple check could save you some major potential grief.

Have a representative team perform the evaluation

Often technology acquisitions fail or go awry because a group within an organization felt it did not get its views heard during the evaluation. One of the first steps in a technology evaluation is to identify all 'interested parties' in the acquisition. Make sure these parties are asked how they want to be represented in the evaluation. If parties that are in conflict with each other will actively participate and you do not have the skills and/or patience to be a mediator, seek the services of an outside facilitator. Facilitation skills can be especially helpful if you have sessions dedicated to setting criteria, making your short list, and making the final decision.

If you're evaluating an end user tool, let an end user lead the evaluation effort

It seems odd but some organizations buy end user tools with little input from the end users of these tools.

An (Informal) Taxonomy of Data Warehouse Data Errors

You may have seen publications that tell you that you may have to spend the majority of your data warehouse development time building the means for both the initial and recurring extraction, transforming, and loading of data. What I have not seen, though, is much in-depth discussion of what exactly are those errors in the dirty data that you will spend your time cleaning up. Forewarned is forearmed. If you know the possibility that certain errors exist, you will be more prone to spot them and to plan your project to attack the errors in a manageable way. Perhaps the material in this paper can help you formulate a checklist of errors you will be checking for. What follows is a list of common errors. Also, if you are a relational database expert, bear with my imprecise use of some terminology. Finally, note that when I refer to a data warehouse, I refer to the database that is directly fed with data from the source systems - not the data marts (or whatever you want to call them) that are fed with cleansed data.

The categories of "errors"

I place "errors" into four categories. Quotations are around the word errors because some errors are not, in the metaphysical sense, erroneous. So, with some awkwardness, let me suggest that errors involve data that are either:

Incomplete

Incorrect

Incomprehensible

Inconsistent.

Incomplete errors

These consist of:

Missing records

This means a record that should be in a source system is not there. Usually this is caused by a programmer who diddled with a file and did not clean up completely. (I read a white paper about how users have to "fess up" about bad data. Actually, usually system personnel cause MANY more headaches than users.) Note you may not spot this type of error unless you have another system or old reports to tie to.

Missing fields

These are fields that should be there but are not. There is often a mistaken belief that a source system requires entry of a field.

Records or fields that, by design, are not being recorded

That is, by intelligent or careless design, data you want to store in the data warehouse are not being recorded anywhere. I further divide this situation into three categories. First, there may be dimension table attributes you will want to record but which are not in any system feeding the data warehouse. For example, the marketing user may have a personal classification scheme for products indicating the degree to which items are being promoted. Second, if you are feeding the same type of data in from multiple systems you may find that one of the source systems does not record a field your user wants to store in the data warehouse. Third, there may be "transactions" you need to store in the data warehouse that are not recorded in an explicit manner. For example, updating the source system may not necessarily cause the recording of a transaction. Or, sometimes adjustments to source system data are made downstream from the source system. Off-invoice adjustments made in general ledger systems are a big offender. In this case you may find that the grain of the information to be stored in the warehouse may be lost in the downstream system.

Incorrect errors

You can say that again! That is, the data really are incorrect.

Wrong (but sometimes right) codes

This usually occurs when an old transaction processing system is assigning a code that the transaction processing system users do not care about. Now if the code is not valid, you are going to catch it. The "gotcha" comes when the code is wrong but it is still a valid code. For example, you may have to extract data from an ancient repair parts ordering system that was programmed in 1968 to assign a product code of 100 to all transactions. Now, however, product code 100 stands for something other than repair parts.

Wrong calculations, aggregations

This situation refers to when you decide to or have to load data that have already been calculated or aggregated outside the data warehouse environment. You will have to make a judgment call on whether to check the data. You may find it necessary to bring data into the warehouse environment solely to allow you to check the calculation.

Duplicate records

There usually are two situations to be dealt with. First, there are duplicate records within one system whose data are feeding the warehouse. Second, there is information that is duplicated in multiple systems that feed in the same type of information. For example, maybe you are feeding in data from an order entry system for products and an order entry system for services. Unbeknownst to you, your branch in West Wauwatosa is booking services in both the product and service order entry systems. (The possibility of a situation like this may sound crazy until you encounter the quirks in real world systems.) In both cases, note that you may miss the duplicates if you feed already aggregated data into the warehouse.
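
Here is a minimal Python sketch of a duplicate check that works across feeds, assuming you can agree on a set of fields that together should identify one business event. The field names and the sample orders are hypothetical.

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group detail records by the chosen business key and keep groups with more than one row."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

# Hypothetical detail rows from two order entry systems.
orders = [
    {"system": "product", "branch": "West Wauwatosa", "customer": "C1", "date": "1999-03-01", "amount": 500},
    {"system": "service", "branch": "West Wauwatosa", "customer": "C1", "date": "1999-03-01", "amount": 500},
    {"system": "service", "branch": "Downtown", "customer": "C2", "date": "1999-03-02", "amount": 75},
]
suspects = find_duplicates(orders, ["branch", "customer", "date", "amount"])
print(suspects)
```

Note that the check runs on detail rows. Once the feeds have been aggregated, the two West Wauwatosa bookings would simply look like a larger total.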

Wrong information entered into source system

Sometimes a source system contains data that were simply incorrectly entered into the system. For instance, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable "system of record".
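
The keypunch example is nasty because both dates are valid, so ordinary validation will not catch the swap. As a small illustration, the Python sketch below flags dates whose day and month could be transposed and still form a valid date. The function name is my own. The check only tells you which dates are at risk, not which ones are wrong.

```python
from datetime import date

def is_transposable(d):
    """True if swapping day and month also yields a valid, different date."""
    return d.day <= 12 and d.day != d.month

print(is_transposable(date(1996, 6, 9)))   # 6/9/96 could have been meant as 9/6/96
print(is_transposable(date(1996, 6, 25)))  # no month 25, so a swap would be caught
```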

Incorrect pairing of codes

This is best described by an example. Sometimes there are supposed to be rules that state that if a part number suffix is XXX, then the category code should be either A, B, or C. In more technical terms, there is a non-arithmetic relationship between attributes whose rules have been broken.
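
A rule like this can be checked mechanically once it is written down. The Python sketch below is a minimal illustration using the example rule above. The record layout (part_suffix, category) and the rule table are assumptions for the example.

```python
# Hypothetical rule table: part number suffix -> category codes allowed with it.
ALLOWED_CATEGORIES_BY_SUFFIX = {"XXX": {"A", "B", "C"}}

def pairing_violations(records, rules=ALLOWED_CATEGORIES_BY_SUFFIX):
    """Return the records whose category code breaks the rule for their suffix."""
    violations = []
    for record in records:
        allowed = rules.get(record["part_suffix"])
        if allowed is not None and record["category"] not in allowed:
            violations.append(record)
    return violations

parts = [
    {"part": "1001-XXX", "part_suffix": "XXX", "category": "B"},
    {"part": "1002-XXX", "part_suffix": "XXX", "category": "D"},
    {"part": "1003-YYY", "part_suffix": "YYY", "category": "D"},
]
print(pairing_violations(parts))
```

Suffixes with no rule in the table are passed over rather than flagged. That is a design choice you may want to reverse.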

Incomprehensibility errors

These are the types of conditions that make source data difficult to read.

Multiple fields within one field

This is the situation where a source system has one field which contains information that the data warehouse will carry in multiple fields. By far the most common occurrence of this problem is when a whole name, e.g., "Joe E. Brown", is kept in one field in the source system and it is necessary to parse this into three fields in the warehouse.
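
A deliberately naive Python sketch of the name parsing follows. It assumes the first token is the first name, the last token is the last name, and anything in between is the middle name. Real names break these assumptions constantly (suffixes, multi-word surnames, titles), so treat it as a starting point only.

```python
def split_name(full_name):
    """Split a whole-name field into (first, middle, last) using whitespace only."""
    parts = full_name.split()
    if not parts:
        return ("", "", "")
    if len(parts) == 1:
        return (parts[0], "", "")
    return (parts[0], " ".join(parts[1:-1]), parts[-1])

print(split_name("Joe E. Brown"))
```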

Weird formatting to conserve disk space

This occurs when the programmer of the source system resorted to some out of the ordinary scheme to save disk space. In addition to singular fields being formatted strangely, the programmer may also have instituted a record layout that varies.

Unknown codes

Many times you can figure out what 99% of the codes mean. However, you usually find that there will be a handful of records with unknown codes and usually these records contain huge or minuscule dollar amounts and are several years old.

Spreadsheets and word processing files

Often in order to perform the initial load of a data warehouse it is necessary to extract critical data being held in spreadsheet files and/or "merge list" files. However, often anything goes in these files. They may contain a semblance of a structure with data that are half validated.

Many-to-many relationships and hierarchical files that allow multiple parents

Watch out for this architecture in source systems. It is easy to incorrectly transfer data organized in such a manner.

Inconsistency errors

The category of inconsistency errors encompasses the widest range of problems. Obviously similar data from different systems can easily be inconsistent. However, data within one system can be inconsistent across locations, reporting units, and time.

Inconsistent use of different codes

Much of the data warehousing literature gives the example of one system that uses "M" and "F" and another system that uses "1" or "2" to distinguish gender. May I suggest that you will wish this were the toughest data cleaning problem you face.

Inconsistent meaning of a code

This is usually an issue when the definition of an organizational entity changes over time. For example, say in 1995 you have customers A, B, C, and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C. In 1998, customer A sells off part of what was A and C to customer D. When you build your warehouse in 1999, based on the type of business analysis you perform, you may face the dilemma of how to identify the sales to customers A, B, C, and D in previous years.

Overlapping codes

This is a situation where one source system records, say, all its sales to Customer A with three customer numbers and another source system records its sales to customer A with two different customer numbers. Now, the obvious solution is to use one customer number here. The problem is that there is usually some good business reason why there are five customer numbers.

Different codes with the same meaning

For example, some records may indicate a color of violet and some may indicate a color of purple. The data warehouse users may want to see these as one color. More annoyingly, sometimes spaces and other extraneous information have been inconsistently embedded in codes.
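
A minimal Python sketch of this kind of code clean-up follows. It assumes that whitespace and letter case are never meaningful inside a code and that synonyms are kept in a small hand-built table. Both are assumptions you would need to confirm with the users.

```python
# Hypothetical synonym table agreed with the warehouse users.
COLOR_SYNONYMS = {"VIOLET": "PURPLE"}

def normalize_code(raw_code, synonyms=COLOR_SYNONYMS):
    """Remove embedded whitespace, upper-case the code, then apply the synonym table."""
    cleaned = "".join(raw_code.split()).upper()
    return synonyms.get(cleaned, cleaned)

for code in ["violet", " Purple ", "VIO LET", "red"]:
    print(repr(code), "->", normalize_code(code))
```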

Inconsistent names and addresses

Strictly speaking this is a case of different codes with the same meaning. My unscientific impression of this type of problem is that decent knowledge of string searching will allow you to relatively easily make name and address information 80% consistent. Going for 90% consistency requires a huge jump in the level of effort. Going for 95% consistency requires another incremental huge jump in effort. As for 100% consistency in a database of substantial size, you may want to decide if sending a person to Mars is easier.
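
As a taste of the "easy" first stretch, here is a minimal Python sketch that scores how alike two names are using the standard library's difflib.SequenceMatcher. The normalization steps and the 0.85 threshold are arbitrary assumptions for illustration. The cutoff you settle on will depend on your own data.

```python
from difflib import SequenceMatcher

def _normalize(name):
    """Lower-case, drop periods and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def name_similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 after light normalization."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()

def likely_same(a, b, threshold=0.85):
    """Flag a pair as a probable match for a person to review."""
    return name_similarity(a, b) >= threshold

print(likely_same("Digital Equipment Corp.", "digital equipment corp"))
print(likely_same("Digital Equipment", "Acme Tire"))
```

A score like this only proposes candidate matches. Someone still has to decide whether two similar-looking names are really the same customer.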

Inconsistent business rules

This, for the most part, is a fancy way of saying that calculated numbers are calculated differently. Normally, you will probably avoid loading calculated numbers into the warehouse but there sometimes is the situation where this must be done. As noted before, you may have to feed data into the warehouse solely to check calculations. This can also mean that a non-arithmetic relationship between two fields (e.g., if a part number suffix is XXX, then the category code should be either A, B, or C) is not consistently followed.

Inconsistent aggregating

Strictly speaking this is a case of inconsistent business rules. In a nutshell, this refers to when you need to compare multiple sets of aggregated data and the data are aggregated differently in the source systems. I believe the most common instance of this type of problem is where data are aggregated by customer.

Inconsistent grain of the most atomic information

Certain times you need to compare multiple sets of information that are not available at the same grain. For example, customer and product profitability systems compare sales and expenses by product and customer. Often sales are recorded by product and customer but expenses are recorded by account and profit center. The problem occurs when there is not necessarily a relation between the customer or product grain of the sales data and the account - profit center grain of the expense data.

Inconsistent timing

Strictly speaking this is a case of inconsistent grain of the most atomic information. This problem especially comes into play when you buy data. For example, if you work for a pickle company you might want to analyze purchased scanner data for grocery store sales of gherkins. Perhaps you purchase weekly numbers. When someone comes up with the idea to produce a monthly report that incorporates monthly expense data from internal systems, you'll find that you are, well, in a pickle.
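
One common workaround is to split each weekly number across calendar months in proportion to the days that fall in each month. The Python sketch below shows the arithmetic. It assumes sales are spread evenly across the seven days of the week, which is rarely true for grocery sales, so treat the result as an approximation. The function name is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

def allocate_week_to_months(week_start, amount):
    """Spread one week's amount over (year, month) buckets, one seventh per day."""
    by_month = defaultdict(float)
    for offset in range(7):
        day = week_start + timedelta(days=offset)
        by_month[(day.year, day.month)] += amount / 7
    return dict(by_month)

# A week that starts on March 29 has three days in March and four in April.
print(allocate_week_to_months(date(1999, 3, 29), 700.0))
```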

Inconsistent use of an attribute

For example, an order entry system may have a field labeled shipping instructions. You may find that this field contains the name of the customer purchasing agent, the e-mail address of the customer, etc. A more difficult situation is when different business policies are used to populate a field. For example, perhaps you have a fact table with ledger account numbers. You may find that entity A uses account '1000' for administrative expenses while entity B uses '1500' for administrative expenses. (This problem gets more interesting if entity A uses '1500' and entity B uses '1000' for something other than administrative expenses.)

Inconsistent date cut-offs

Strictly speaking this is a case of inconsistent use of an attribute. This is when you are merging data from two systems that follow different policies as to dating transactions. As you can imagine, the issue comes up most with dating sales and sales returns.

Inconsistent use of nulls, spaces, empty values, etc.

Now this is not the hardest problem to correct in a warehouse. It is easy, though, to forget about this until it is discovered at the worst possible time.
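
A minimal Python sketch of one way to make "nothing" look the same everywhere before loading. The list of markers treated as missing is an assumption. You will need to find out which markers your own source systems actually use.

```python
# Hypothetical markers that source systems use to mean "no value".
MISSING_MARKERS = {"", "NULL", "N/A", "NONE"}

def normalize_missing(value, markers=MISSING_MARKERS):
    """Map blanks and known missing-value markers to None; trim other strings."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        return None if stripped.upper() in markers else stripped
    return value

for raw in [None, "", "   ", "null", " N/A ", " Smith ", 0]:
    print(repr(raw), "->", repr(normalize_missing(raw)))
```

Note that a numeric zero is left alone. Whether zero means "none" or "unknown" is a business question, not a formatting one.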

Lack of referential integrity

It is surprising how many source systems have been built without this basic check.
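
If the source system will not do the check for you, you can do it yourself before loading. Here is a minimal Python sketch that lists foreign key values in fact rows that have no matching dimension row. The row layout and names are hypothetical.

```python
def orphaned_keys(fact_rows, dimension_keys, foreign_key):
    """Return the sorted foreign key values in fact_rows that are absent from the dimension."""
    valid = set(dimension_keys)
    return sorted({row[foreign_key] for row in fact_rows if row[foreign_key] not in valid})

# Hypothetical customer dimension keys and sales rows.
customers = ["C1", "C2", "C3"]
sales = [
    {"invoice": 1, "customer_id": "C1"},
    {"invoice": 2, "customer_id": "C9"},
    {"invoice": 3, "customer_id": "C9"},
    {"invoice": 4, "customer_id": "C4"},
]
print(orphaned_keys(sales, customers, "customer_id"))
```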

Out of synch fact data

Certain summary information may be derived independently from data in different fact tables. For example, a total sales number may be derived from adding up either transactions in a ledger debit/credit fact table or transactions in a sales invoice fact table. Obviously there may be differences because one table is updated later than another table. Often, however, the differences are symptoms of deeper problems.

Some ending thoughts

I hope this paper adds to the understanding of what takes up the majority of time in a data warehouse. Let me offer the following ending thoughts:

Be prepared for a lot of tedious work.

Probably the most important "tools" for solving these problems are a sharp eye and endurance for checking an abundance of detail information.

You may spend much more time checking for errors than cleaning up errors.

Most of these errors do not jump out at you.

The errors of inconsistency are the most difficult to handle.

At least that is my experience.

The complexity of a data warehouse increases geometrically with the number of sources of data fed into it.

Having to reconcile inconsistent systems is the reason. For example, if it takes 100 hours to reconcile data from two source systems, you can expect that it will take on the order of 400, not 200, hours to reconcile data from four source systems.
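
The two figures in that example fit a rule of thumb in which effort grows with the square of the number of sources. That reading is my assumption, not a measured law. The Python sketch below just makes the arithmetic explicit.

```python
def estimated_reconciliation_hours(n_sources, base_hours=100.0, base_sources=2):
    """Scale a known effort figure by the square of the growth in source count (assumed rule of thumb)."""
    return base_hours * (n_sources / base_sources) ** 2

print(estimated_reconciliation_hours(2))  # the starting point: 100 hours for two sources
print(estimated_reconciliation_hours(4))  # doubling the sources quadruples the estimate
```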

The complexity of a data warehouse increases geometrically with the span of time of data to be fed into it.

My previous comment applies. Note, however, that reconciling inconsistencies over time may be even harder because the people who know what happened in previous years may not be around to answer your questions.

You will be faced with an economic and political question as to how erroneous the data in your system will be.

Completely fixing some of these problems can be quite expensive. More vexingly, often what constitutes "correct" data is debatable. What you do, more often than not, boils down to a question of money and politics.

Data Warehousing Political Issues

This paper is a list of political issues that frequently come up in data warehousing projects. People often get blindsided by politics. My hope is that this paper might give readers some advance warning of these issues. Though what is done about these issues varies by organization, I believe the best advice to data warehouse implementers is to do your best to spot these issues early and then pick your battles wisely.

I recommend that you read Marc Demarest's "The Politics of Data Warehousing" in conjunction with this paper. In his June 1997 paper, Marc comments on how little extended discussion of politics there is in the data warehousing literature. As of the writing of this paper, to the best of my knowledge, that situation still has not changed. This is unfortunate because ambitious data warehousing projects are rife with political issues.

My working definition of a data warehousing "political issue" is a situation where the equally valid and reasonable goals and interests of two or more parties collide with each other. That is, these are situations where there is great potential for conflict. Though these issues can appear minor and even petty, they can account for a good portion of the mental wear and tear experienced by data warehouse developers.

In this paper, I have classified the political issues into those that are within the IS organization (IS to IS), those that are between IS and the users (IS to Users), and those that are between users (User to User). Finally, in this paper I try to list only the political issues that are peculiar to data warehousing. Data warehousing also experiences all the usual political problems (e.g., resources and deadlines) that occur in complex technology projects. Just check into the literature about IS project management and you will find a wealth of material on these issues.

IS to IS issues

Internecine conflicts in IS projects can be the most difficult to deal with. Data warehousing projects probably are typical in this respect.

Where does the data warehousing development group report to

The issue is whether the data warehousing development group should be a free-standing development organization or whether it should be part of a group that traditionally has concentrated its efforts on transaction processing development. Often transaction processing development organizations have been driven by their work order backlogs and the need to react to whatever crisis is at hand. Some people believe, however, that data warehousing flourishes best when done with an entrepreneurial orientation rather than with a reactive orientation. On the other hand, many organizations quickly come to depend on data warehousing systems for day-to-day work. These data warehousing systems need to be as "industrial safe" as some of the transaction processing systems. Placing the data warehousing effort in a separate development group can lessen knowledge transfer and appreciation of how to make data warehouses industrial safe.

Who should administer the data warehousing databases - the DBA group or the data warehousing development group

The need to make data warehouse database structure changes can be relatively frequent. Proliferating data marts, uncertainty about usage patterns, and the "I'll know what I want when I see it" nature of data warehouse development can necessitate table and index changes. Data warehouse developers, concerned about losing the favor and interest of data warehouse users, want changes made quickly and get quite frustrated being put on the DBA backlog. On the other hand, DBAs often have knowledge about how to make database processing industrial safe. Cutting the DBA organization out of the data warehousing support loop can deprive the data warehousing effort of some valuable wisdom.

Marc Demarest, "The Politics of Data Warehousing": http://www.noumenal.com/marc/dwpoly.html

How to gain the cooperation of feeder system developers who appear to have much more to lose than to gain in the data warehouse development effort

Data warehousing efforts often bring to light problems in feeder transaction processing systems that may have been "hidden" for years. The developers of these systems, whose knowledge is often crucial to the data warehousing effort, may be reluctant to help if they feel that the data warehousing effort is going to be an audit of their work.

Should feeder system problems be corrected in the data warehouse or in the feeder system

Actually, the question often becomes whether: 1) The feeder system should be fixed or 2) The feeder system should be left alone and the data in the warehouse should be fixed or 3) Data should be fixed in the data warehouse with the fixes fed back to the feeder system. And to further complicate matters, usually there are multiple problems with different groups suggesting different combinations of actions.

Against what data should reports be written

Often an organization quickly discovers that quite a few reports can be written either against data in the data warehouse or against data in the transaction processing systems. This can be quite perplexing to organizations where there is no agreement as to what the data warehouse is for.

How big is the data warehousing batch processing window

Often there is a need for a time period where transaction processing systems are kept stable so changes made to the systems can be captured and fed into the data warehouse. When changes cannot be easily identified, a typical course of action is to compare a previous copy of the transaction system database with the current database. After the changes are identified, a copy of the current database is made for comparison in the next processing cycle. In some firms, the need to "freeze" transaction processing system databases can cause inconveniences to other processing. How much time should be allotted to the window in which transaction processing system databases are frozen can be a source of contention.
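The compare-two-copies approach can be sketched as follows. This is a minimal illustration in Python that assumes each copy fits in memory as a dictionary keyed by primary key. Real snapshots are usually compared with sorted files or database set operations instead.

```python
def diff_snapshots(previous, current):
    """Compare two copies of a table, each given as {primary_key: row}.

    Returns three dictionaries: rows inserted, rows updated, and rows deleted
    since the previous copy was taken.
    """
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted
```

After the differences are fed to the warehouse, the current copy becomes the previous copy for the next cycle. The source tables have to hold still while the current copy is taken, which is why the length of the freeze becomes a point of contention.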

Who has ongoing responsibility for data quality monitoring

Data quality is not a one-time concern to many firms that implement data warehouses. In a firm with complex feeder systems, it is not uncommon for previously undiscovered data quality problems to occur after the big push to clean data for the initial load of the data warehouse is done. Firms find it necessary to install procedures to regularly audit data quality. And in most firms it is unclear who should have responsibility for executing these procedures.

How are requests to make feeder transaction processing system changes approved and how is knowledge about the changes communicated

Small changes in feeder transaction processing systems can have major impacts on the feed to a data warehouse. Conflicts arise when transaction processing system developers, under pressure from their users to make changes, now have to work with data warehouse developers to assess the impact on downstream systems. Even more vexing situations come when a change is made in the feeder transaction processing system and is not communicated to the data warehouse developers.

IS to User issues

User issues can be especially thorny with data warehouses because, unlike with transaction processing systems, use of data warehousing systems is often optional. Unless data warehouses are tailored to their preferences, users may quickly decide not to use the data warehouse.

Why should users give up control of user managed databases

Many user departments have, on their own, developed databases that meet some of their key reporting needs. Often these systems were built by user organizations on their own because the IS organization was unwilling or unable to help the users or the users were skeptical about the level of support they would receive if they were to work with IS. When a data warehouse that will subsume the functions of these user managed databases is proposed, these users are highly likely to be skeptical about whether the IS organization can do as good a job supporting the user reporting needs as the users did on their own.

How to gain the cooperation of a user whose spreadsheet is being automated

Often part of the goal of a data warehouse is to automate the production of a spreadsheet or series of spreadsheets that have been manually created by a user. Sometimes the user's corporate identity is tied to the spreadsheets and he or she feels (rightfully) threatened by the prospect of automation. This user's cooperation will be needed in the data warehouse development. Though dealing with this sensitive personnel issue probably should be the responsibility of user management, often the IS organization has the burden of figuring out how to gain cooperation.

Should design be for the needs of the masses or for the needs of the most demanding user

In many data warehousing projects it is not uncommon for the IS organization to find one to a handful of users whose "needs" go way beyond those of most of the data warehouse users. Usually, the need is for a far greater level of detail and/or for far more history and/or for a series of reports with a high degree of both technical and business complexity. It can be quite expensive and time consuming to satisfy the needs of these far more demanding users. On the other hand, these users can have a peculiar need that is especially beneficial to the business and/or can be people whose support is vital to the success of the project.

What requirements should be frozen; When should requirements be frozen (and unfrozen)

Data warehousing development is iterative. This does not mean that requirements never get frozen. Rather, there can be many start-stop cycles in data warehousing requirements definition. Also, some requirements may be frozen while some are always loose. Managing requirements definition in a data warehouse effort can require a deft political touch.

How many data marts should there be

Users want their own data marts for a variety of reasons. Some of the reasons are: 1) The desire to put their data on different hardware platforms so their reporting needs are less impacted by other people's processing 2) The desire to modify data at their own discretion (though this may strike terror in a data warehousing purist) 3) The desire not to have to work with other groups on resolving data definition issues. Some of these reasons do make good business sense. Unfortunately, it can get quite expensive to support a proliferating number of data marts.

In how timely a manner are data corrected

Sometimes users are used to being able to make a correction to data and then immediately run reports against corrected data. Perhaps the users have been running reports against a transaction system database which could immediately be adjusted. Perhaps the users had their own database or spreadsheets which they could adjust at will and then generate reports. Problems come if data warehouse developers design systems so corrections are now incorporated into the data warehouse only during a batch feed at the end of the day or at the end of the week or at the end of the month.

Who should have responsibility for maintaining data warehouse data not fed by transaction processing systems

Often as part of a data warehouse it is necessary to manually maintain dimension tables and conversion tables that contain data not in any transaction processing system. Also, sometimes budget, forecast, or quota data must be manually maintained. This maintenance can be quite involved. Determining whether users and/or IS should bear the maintenance burden can be a major issue.

Who is in charge of ongoing audit of data quality

As mentioned before, data errors pop up after the data warehouse is implemented. For example, problems occur because sometimes data is not fed from the transaction processing systems or fed multiple times. Many times it is necessary to make someone explicitly responsible for regularly auditing data. However, it often is not clear who this person should be.
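Whoever ends up with the responsibility, the missed-feed and double-feed cases lend themselves to an automated check. A minimal sketch in Python, assuming each feed is tagged with a batch identifier and that a list of expected batches is available (both are assumptions):

```python
from collections import Counter


def audit_feeds(expected_batches, loaded_batches):
    """Return (missing, duplicated): batches never loaded and batches loaded more than once."""
    counts = Counter(loaded_batches)
    missing = [batch for batch in expected_batches if counts[batch] == 0]
    duplicated = sorted(batch for batch, n in counts.items() if n > 1)
    return missing, duplicated
```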

How to pass responsibility for running and maintaining a report from the users to IS

Users write reports that the business comes to depend on for day-to-day functioning. Here is what often happens: 1) The reports become too technically difficult for the users to change and/or 2) The report "code" becomes lost or corrupted and/or 3) The user leaves the organization (usually without documenting the report). In these cases, IS usually gets called in. This need to obtain IS involvement can create great consternation in an IS organization that thought that building a data warehouse was going to get it out of the report writing business.

User to User issues

These are issues that involve potential conflicts among the users of a data warehouse. This does not mean that IS is not involved. Rather, IS can be right in the middle between users.

Who has access to what data

As can be imagined, one business group may not want another business group to see its data and one location may not want another location to see its data. Also common is for division personnel not to want corporate personnel to see detail division data. Perhaps more complicated to deal with are concerns of one user group that another user group may misinterpret data. Often one functional area thinks another won't understand certain data, e.g., Sales says Finance won't understand "its" numbers and Finance says Sales won't understand "its" numbers. Often people whose formal job is to analyze information worry that people whose formal job is not to analyze information will misinterpret data, e.g., financial and market analysts question whether line accountants and sales people can understand certain data.

What dimensions, attributes, calculations should be defined similarly

You may have seen some data warehousing literature that talks about how the data warehouse should create a "common view" (or some similar term) of all the data. To put this in what I believe are more concrete terms, I believe that this is referring to making sure that dimensions conform, that attributes are used consistently, and that calculations are always calculated the same way. Though this is a nice ideal, I believe that most firms do not have the patience to do this. Rather, through a great deal of give and take, firms implementing data warehouses decide on a subset of dimensions, attributes, and calculations that are worth the effort of defining consistently.

How to define a customer; How is profitability calculated

Most firms end up wanting to determine similar definitions of customers and profitability. It is my opinion that these definition tasks probably cause more political issues than any other definition tasks. Note that a common use of a data warehouse is to report profitability for internal purposes in a way more meaningful than profitability as calculated per generally accepted accounting principles. It is very common to want to report profitability by customer and/or by product. If so, the firm may have issues as to what a customer is. A customer may be a legal entity, it may be a location, or it may be the people performing a function for a legal entity or a location, etc. To determine profitability, it may be necessary to include expense allocations, the determination of which can be politically contentious. Finally, another common major issue regarding profitability is when a sale should be recognized.

Who has final say over the correctness of data

If multiple user organizations are going to be accessing the same data, there will be ongoing disagreements about the "correctness" of data added to the data warehouse. These debates about correctness will not be about which items are in error. Rather, these will be debates regarding interpretation of data. Note that an unexpected consequence of data warehousing is that while before users might be able to reconcile their differences by making adjustments to summarized numbers, data warehousing may force them to agree on how the detail should be interpreted.

Conclusion

If you go through these issues I believe you will see three common threads regarding why data warehousing projects engender political issues: 1) Data warehousing imposes new obligations whose responsibilities are unclear 2) Data warehousing requires changes in processes that an organization is comfortable with 3) Data warehousing requires agreement on some, but not all, definitions of data.

Different Aspects of Data Warehouse Architecture

This page is a list of the different aspects of data warehouse architecture. Architecture is a pretty nebulous term. I think of architecture as a system design decision that is usually not easily changed. The decision is not easily changed because of the amount of work, money, and politics involved in doing so. This is a list of aspects of architecture that the data warehouse decision makers will have to deal with themselves. There are many other architecture issues that affect the data warehouse, e.g., network topology, but these decisions have to be made with all of an organization's systems in mind (and with people other than the data warehouse team being the main decision makers).

This list will not attempt to provide detailed explanations of the different types of architecture. Rather, I am presenting this list because the data warehousing literature usually muddles the subject of architecture by lumping different types of decisions together or by forgetting certain types of decisions. Also, the literature makes these decisions seem much more black and white than they are. For example, in the area of what I call reporting and staging data store architecture, much of the literature discusses only the "enterprise" data warehouse, the dependent data mart, and the independent data mart options. In reality, there are many more variations being used that cannot easily be given a snappy label.

Data consistency architecture

Doug Hackney's excellent but confusingly titled article on what he calls incremental data mart enterprise architecture is the most succinct statement of what this means. This is the choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. (Though the article does not say it explicitly, it is also the equally important choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses not to put into common usage.) This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision. In my opinion, the decisions involved in determining this architecture should drive all other architectural decisions. Unfortunately, the determination of this architecture often seems to be backed into rather than consciously made.

Reporting data store and staging data store architecture

The main reasons we store data in a data warehousing system are so they can be: 1) reported against, 2) cleaned up, and (sometimes) 3) transported to another data store where they can be reported against and/or cleaned up. Determining where we hold data to report against is what I call the reporting data store architecture. All other decisions are what I call staging data store architecture. As mentioned before, there are infinite variations of this architecture. Many writings on this aspect of architecture take on a religious overtone. That is, rather than discussing what will make most sense for the organization implementing the data warehouse, the discussion is often one of architectural purity and beauty or of the writer's conception of rightness and wrongness.

Data modeling architecture

This is the choice of whether you wish to use denormalized, normalized, object-oriented, proprietary multidimensional, etc. data models. As you may guess, it makes perfect sense for an organization to use a variety of models.

Tool architecture

This is your choice of the tools you are going to use for reporting and for what I call infrastructure.

Processing tiers architecture

This is your choice of what physical platforms will do what pieces of the concurrent processing that takes place when using a data warehouse. This can range from an architecture as simple as host-based reporting to one as complicated as the diagram on page 32 of Ralph Kimball's "The Data Webhouse Toolkit".

Security architecture

If you need to restrict access down to the row or field level, you will probably have to use some means other than the usual security mechanisms at your organization to accomplish this. Note that while security may not be technically difficult to implement, it can cause political consternation.

Doug Hackney's article on incremental data mart enterprise architecture: http://www.egltd.com/production/columns/5-97-1_enterprise_architecture.htm

As a final comment, let me assert that in the long run, decisions on data consistency architecture will probably have much more influence on the return on investment in the data warehouse than any other architectural decisions. To get the most return from a data warehouse (or any other system), business practices have to change in conjunction with or as a result of the system implementation. Conscious determination of data consistency architecture is almost always a prerequisite to using a data warehouse to effect business practice change.

What to Learn About in Order to Speed Up Data Warehouse Querying

This paper is a laundry list of items data warehouse implementers may wish to learn more about in order to speed up their data warehouse queries or to make the data warehouse "environment" more responsive to the bulk of the data warehouse query users. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up querying. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up queries. This list includes topics that are relevant to many of the relational database and data access tool technologies. Some topics that apply, to the best of my knowledge, to one or two vendors' technologies are not listed.

SQL SELECT statements

This is bedrock knowledge. It is quite worthwhile to get a book on SQL (there are quite a few good ones) and review (or learn) this topic. Though you may think that your query tool's SQL generation capabilities lessen the need for this knowledge, you will eventually find the SQL knowledge quite helpful.

How your database joins tables, unions tables, uses indexes, and chooses access paths

This is some more bedrock knowledge. Unfortunately, this information may not be that accessible. If the information exists, it may be poorly written, written for an academic audience, and/or scattered among many manuals. Nevertheless, it is worth making a determined effort to understand these topics. The vendor/consultant community would do itself well if it tried much harder to communicate this information in coherent and comprehensible terms.

What statistics your database provides on query execution

Sometimes those of us building stores of information for users to analyze forget about our own information needs. You need this information to identify which queries are especially resource consumptive. You probably will be concerned with a clump of queries that are far more consumptive than average. Sometimes the resolution of consumption issues is a simple rewrite of the query. Sometimes resolution is more technically involved and requires doing many things listed in this paper. And sometimes the solution is to do nothing - you just have to accept that your data warehouse has to support these demanding queries.

Aggregate tables

This is probably the most used method of speeding up queries. There are many discussions of this in the literature. The books "The Data Warehouse Lifecycle Toolkit", "The Data Warehouse Toolkit", and "Data Warehousing in the Real World" have especially good non-technology specific discussions of this topic.

Aggregate navigators/query redirectors

This is the technology that automatically directs a query to aggregated data if such data are available and appropriate for the query.
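The core decision an aggregate navigator makes can be sketched in a few lines. The following Python sketch is only an illustration of the idea. The table names, the column sets, and the "smallest table first" ordering are all hypothetical, and real products also have to rewrite the SQL itself.

```python
# Hypothetical catalog of aggregate tables, listed from most to least summarized.
AGGREGATES = [
    ("sales_by_month", {"month", "sales_amount"}),
    ("sales_by_month_product", {"month", "product", "sales_amount"}),
]
BASE_TABLE = "sales_fact"  # the detail-level fact table (assumed name)


def choose_table(requested_columns):
    """Pick the most summarized table that carries every requested column."""
    needed = set(requested_columns)
    for table, columns in AGGREGATES:
        if needed <= columns:
            return table
    return BASE_TABLE  # no aggregate is appropriate, so fall back to the detail
```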

Partitioning

This is probably the second most common method of speeding up queries. Note that partitioning comes in many ways, shapes, and forms. At the very least, it is dividing one table into several tables usually based on the time the table data represent. Note that both tables and indexes may be partitioned.
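Time-based partitioning speeds up a query because only the partitions covering the requested period need to be read. A minimal sketch in Python, assuming one partition per calendar month and a made-up naming scheme of table_YYYY_MM:

```python
from datetime import date


def partitions_for_range(table, start, end):
    """List the monthly partitions a query over [start, end] would have to read."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{table}_{year}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query for mid-November through early January touches three partitions no matter how many years of history the full table holds.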

B-tree indexing

Adding numerous indexes is another common method for speeding up queries. Note that persons with a transaction processing mindset may have a hard time accepting as much use of these indexes as is usually helpful in a data warehouse.

Dimensional modeling

With certain database technologies, this modeling can reduce the amount of sort/merging that goes on when joining tables. And, some query tools may generate more efficient SQL if data are modeled dimensionally. Also, if you use surrogate keys in conjunction with dimension modeling, joins may be more efficient.

Parallelizing query execution

Developments in database technology have made doing this much easier. Note, however, the number of users running queries and the amount of data to be returned in a query can sometimes limit this technique's effectiveness.

Archiving/purging data

Sometimes the cost of having to scan through older data exceeds the benefit of having it available in the unlikely possibility someone wants to examine it.

Reducing the width of large tables that get scanned

There are also many ways to do this. Before getting fancy with this it is worth taking the time to understand what actually takes up space in your database tables.

Completely denormalizing aggregate tables

If these tables can be heavily indexed and can be maintained by complete refreshing, the requirements of join processing can be eliminated.

Loading tables completely in memory

Presuming the memory is available to do this and you have researched other topics in this paper, this may be an interesting strategy.

Bit mapped indexing

This technique can work well when a field takes on a low number of distinct values (i.e., low cardinality) and tends to be in WHERE clauses often.
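The idea behind a bit mapped index can be shown with a toy version. In this Python sketch, each distinct value gets one integer whose bit n is set when row n holds that value. It is a simplified illustration, not how any particular database stores its indexes.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}, where bit n is set if row n holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def rows_matching(bitmap):
    """Turn a bitmap back into the list of row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
```

A WHERE clause that combines two low cardinality fields then becomes a bitwise AND of two bitmaps, e.g., rows_matching(region_index["East"] & status_index["open"]), with no table rows read until the matching row numbers are known.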

Striping files

This means spreading a file over several physical disks. Look into the topic of RAID for more details.

Locating different files used concurrently on different disks

This is basic stuff but it can be helpful.

Defragmentation of table and index files

This is more basic stuff.

Solid State Disk

Supposedly prices have come down in the last few years.

Disk controllers

Too few can be a query bottleneck.

What your query tool attempts to do via SQL and what it does internally

The book "The Data Warehouse Toolkit" has a good discussion of where query tools may fall short. The reason you need to learn about this is to prevent using the query tool where it is inefficient or to know when you might build some "get arounds".

Query scheduling capabilities

This does not necessarily speed up a given query. However, scheduling resource consumptive queries for off-hours times may free up resources for other queries during prime time.

Query queuing

As with scheduling, this does not speed a given query up. However, this facility gives you a means so priority queries (such as a query needed to gain information for the monthly close of the financial books) can execute faster.
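A queue that releases queries by priority, and by order of arrival within the same priority, can be sketched with a heap. This is a minimal Python illustration. The class name and the convention that a lower number means a higher priority are assumptions.

```python
import heapq
import itertools


class QueryQueue:
    """Release waiting queries by priority (lower number first), then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def submit(self, name, priority):
        # The arrival counter breaks ties so equal-priority queries stay first in, first out.
        heapq.heappush(self._heap, (priority, next(self._arrival), name))

    def next_query(self):
        return heapq.heappop(self._heap)[2]
```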

Query accelerators

These help you generate more efficient SQL. Note that they are probably more helpful to those who report off of highly normalized databases.

Query governors

These stop queries usually after a specified number of rows have been returned and/or a specified time has elapsed.
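The two limits a governor typically enforces can be sketched as a wrapper around the stream of result rows. This Python sketch is an illustration only. The function name, the return convention, and the injectable clock are assumptions, and a real governor would also cancel the work inside the database.

```python
import time


def run_governed(rows, max_rows=None, max_seconds=None, clock=time.monotonic):
    """Collect rows until the source is exhausted or a limit is hit.

    Returns (results, reason), where reason is None, "row limit", or "time limit".
    """
    started = clock()
    results = []
    for row in rows:
        if max_rows is not None and len(results) >= max_rows:
            return results, "row limit"
        if max_seconds is not None and clock() - started > max_seconds:
            return results, "time limit"
        results.append(row)
    return results, None
```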

Query nannies

This is my term for technologies that warn (scold?) the user if he or she submits an inefficient query. Some of these provide hints about how to make the query more efficient and some (I have heard) actually try to fix up the queries.

"Productionizing" regularly used, highly resource consumptive queries

Certain queries probably should be written by someone with a great deal of knowledge of how to make queries efficient.

Storing the image of the report

If a report based on a query is used by many people and on-line retrieval of the report is needed, the image of the report may be stored. The query then need be run only once and perhaps at a less busy time. There are tools that allow intelligent retrieval of stored report data.

Query tool caching of results

Some tools store the results of some queries. If the same query is run again, the tool may check to see if the results are stored. Or, if a subset of a previously retrieved result set is desired, the tool will read the previously retrieved query result set rather than the data warehouse.
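The simplest form of this caching, reusing the stored result when exactly the same query text is run again, can be sketched as follows. The class and method names in this Python sketch are hypothetical, and run_query stands in for whatever actually sends SQL to the warehouse.

```python
class QueryCache:
    """Remember the result of each distinct query text and reuse it on repeats."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that really executes the SQL
        self._results = {}

    def fetch(self, sql):
        if sql not in self._results:
            self._results[sql] = self._run_query(sql)
        return self._results[sql]

    def clear(self):
        """Discard stored results, e.g., after the warehouse has been reloaded."""
        self._results.clear()
```

Note that a cache like this returns stale data unless it is cleared after each load. Answering a subset request from a previously retrieved result set takes more logic than is shown here.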

Query tool preview of a subset of records

When a query is being developed, some tools make it easy to retrieve a small subset of records that meet the query criteria. This makes it quicker to test the query and cuts down the number of potentially expensive test queries.

Making two copies of the data warehouse - one for "operational" users and one for "analytical" users

It actually is hard to draw a line between what is operational use and what is analytical use of a data warehouse. However, in a typical data warehouse most of the users (usually with more "operational" needs) are running IS written, parameterized queries. A relatively small number of users (usually with more "analytical" needs) are running potentially highly resource consumptive ad hoc queries. Though it is not necessarily pretty, sometimes the best way to handle this mixed use of the data warehouse is to create a separate copy of the data warehouse for each user group.

Multi-tiered architectures/Application partitioning

Some query tools allow you to run different components (i.e., "tiers" or "partitions") of the tool on different hardware servers.

Network bottlenecks

Though you do not have to become an expert at network topologies, if some of your users will run queries that generate large result sets (and do not assume that only lengthy reports bring back large result sets to the query tool), it pays to trace the flow of data from the server to the user's workstation in order to see if there are any mismatched network components. For example, Fast Ethernet may be in your new facility but your user may have a 10Mbps network interface card. Or, your user may have a card that was advertised to perform at 100Mbps which in actuality performs at 30Mbps. Also, find out how your network people load balance. They are more used to dealing with predictable transaction processing than extremely variable data warehousing demands. And if necessary, find out the costs of dropping more cable so you can put your users that run large result set producing queries on dedicated network segments. If you have invested millions in the data warehouse, the cost of an electrician and wire may be worth it.
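A back-of-the-envelope calculation shows why the slowest component matters. This Python sketch assumes an idealized link with no protocol overhead or contention, and treats a megabyte as 8 megabits, so real transfers will be slower.

```python
def transfer_seconds(result_megabytes, link_mbps):
    """Idealized time to move a result set across a link of the given speed."""
    return result_megabytes * 8 / link_mbps
```

Under these assumptions a 50 megabyte result set takes 4 seconds at 100Mbps, about 13 seconds at 30Mbps, and 40 seconds at 10Mbps.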

Database technology designed specifically for data warehousing and third party indexing technology designed to speed up queries

Look at my Database page and Query and Load Accelerators page for more information.

The cost of installing more/faster CPU, memory, disk

Sometimes buying metal is (by far) the least expensive way to speed up your queries.

Some final thoughts about speeding up queries:

You best expect that many of your queries are going to run a "long" time. You will prevent some problems if you spend some time teaching your users about what, in general, will take a long time.

In line with what I just said, you can spend plenty of time tuning queries. Though many IS people like to spend their time tuning queries, this tuning time can take IS away from other data warehouse problems whose solution is more meaningful to the business.

In reality the area of speeding up queries involves plenty of guesswork, doing things by intuition, trial and error, and making uncomfortable trade-offs.

What to Learn About in Order to Speed Up Data Warehouse Loading

This paper is another laundry list of items data warehouse implementers may wish to learn more about in order to speed up the process of extracting, transforming and loading data (henceforth simply referred to as loading) or to make these processes less prone to errors. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up loading. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up loading. This list does not include points relevant to a specific vendor's technology. Your DBA should know some ways of speeding up the load that apply only to the technology of your DBMS vendor.

How often the users really need updated data

Oftentimes data warehouse developers unquestioningly give in to the most extreme demands for freshness of data or they automatically assume data need to be updated far more often than makes business sense. Though you read sometimes ridiculous articles in the trade press and from industry analysts (who have coined the awful term "information latency") about how the business world wants to know everything immediately, the reality is quite different. If your data warehouse is not there to support day-to-day monitoring and analysis, question why it should be updated daily. If your data warehouse is not there for week-to-week monitoring and analysis, question why it should be updated weekly. By the way, though, if you do decide to update weekly or monthly, try to design your loading process so you are not tied to loading at a specific interval. There may be certain "crunch" times when you have to load more frequently.

How to drop and re-establish indices and how to set index fill factors

If you update a large portion of the database (I've heard estimates from 10 - 25% up), you may want to learn about dropping indices before a database load and then re-establishing them after the load. If you do not drop indices, you want to make sure you set the index fill factors so your server's disk drives do not waste time looking for space in which to write index updates.
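
Here is a minimal sketch of the drop, load, re-create pattern, using Python's built-in sqlite3 module only because it needs no setup. The table and index names are made up for the example. SQLite has no fill factor setting, so that part of the advice is not shown; your own DBMS will have its own syntax for both steps.

```python
import sqlite3


def bulk_load(conn, rows):
    """Drop the index, load all rows in one transaction, then rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS idx_sales_customer")
    with conn:  # commits on success, rolls back if the load fails
        conn.executemany(
            "INSERT INTO sales (customer_id, amount) VALUES (?, ?)", rows
        )
    # Rebuilding once after the load avoids maintaining the index row by row.
    conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
    bulk_load(conn, [(i % 100, float(i)) for i in range(10_000)])
    count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    indexes = [row[1] for row in conn.execute("PRAGMA index_list('sales')")]
    return count, indexes
```

Whether this beats loading with the index in place depends on the share of the table being loaded, so time both ways on your own data.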

What facilities does the database have for bulk loading data and which of those facilities does it make sense to use

Many databases have ways of speeding up loading at the expense of data integrity checking. Note that certain bulk loaders do more than load - they will reformat data and sometimes aggregate data.

What input file formatting will speed up bulk loading

Oftentimes operations done on the input data on the feeder system platform (e.g., sorting, eliminating packed and signed fields) can speed up loading.

How to parallelize table load and index maintenance or re-creation

Dropping indices and bulk loading in parallel can drastically improve loading time. By the way, learn the differences between pipeline, component, and data parallelism. Given the circumstances, these different types of parallelism can have widely varying amounts of effect.

How to load databases via a stream

Certain ETL tools will allow you to extract, transform, and load in one process. That is, it is not necessary to create intermediate files. You do, though, have to be careful about data source, platform, size, scalability restrictions and limitations on how sophisticated your transformations can efficiently be.
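
The idea of loading via a stream can be sketched without any ETL tool at all. In the Python below, each stage is a generator that hands records to the next, so no intermediate file is written. The column names, the sample transformation, and the use of SQLite as the target are all assumptions for illustration.

```python
import csv
import io
import sqlite3


def extract(source):
    """Yield one record (a dict) per line of a CSV source."""
    yield from csv.DictReader(source)


def transform(records):
    """Clean each record as it passes through; nothing is buffered."""
    for rec in records:
        yield (rec["customer"].strip().upper(), float(rec["amount"]))


def load(conn, rows):
    """Insert rows as they arrive from the upstream generators."""
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


def run(source_text):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    load(conn, transform(extract(io.StringIO(source_text))))
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()
```

The limitation mentioned above shows up quickly: a transformation that needs to see all the data at once (a sort, for instance) does not fit this record-at-a-time shape.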

How indices are used by your database optimizer

You need to learn this so you can figure out whether your indices are actually going to get used. In more recent versions of DBMS software, you may be able to get away with fewer indices than in older versions.

What integrity checks should be done in the loading process

After you perform the initial load of data warehouse tables, you may want to start a "discussion" of how all the errors you found should be trapped in the feeder systems (preferably at data entry time).

Where does it make sense to transform the data

There may be faster places to do it than in your data warehouse database system. You may want to work with flat files and a dedicated sort/merge utility either on the data warehouse platform or, if the source data are on another platform, you may want to do it on that platform. The problem with doing this on the source system platform, though, is that you then will need people skilled in that platform and you may be invading someone else's fiefdom.

Where processes can be done in memory

If you have got the available memory, learn how to use it. Sorts especially can be speeded up by doing them in memory.

What domain integrity checks should be in the data warehouse database

Depending on how you resolve the above two issues, you have to investigate the sensibility of incorporating referential integrity or any other type of domain integrity checking in your database.

Where does it make sense to aggregate the data

Sometimes if you do the aggregating outside the data warehouse database environment you can create multiple aggregate output files in one "pass" of the input data. You will probably have to learn how to use memory very carefully if you do this (and have a lot of memory on the server on which you are doing the aggregating).
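
A small sketch of the one "pass" idea follows. It builds two aggregates while reading the input once and holds both in memory, which is where the memory caution above comes from. The record layout (region, product, amount) is an assumption for the example.

```python
from collections import defaultdict


def aggregate_one_pass(records):
    """Build two aggregates while reading the input only once."""
    by_region = defaultdict(float)
    by_region_product = defaultdict(float)
    for region, product, amount in records:
        by_region[region] += amount
        by_region_product[(region, product)] += amount
    return dict(by_region), dict(by_region_product)
```

Each additional aggregate adds another in-memory table, so the number of distinct keys, not the number of input records, is what drives memory use.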

What statistics are available on aggregate table usage

As you might have read ad nauseam, building a data warehouse is an iterative undertaking. You will probably create aggregates that seldom get used. You need these statistics for making the case for deleting the aggregates (though be forewarned this can get you into a quirky political aspect of data warehouse management).

At what level it makes sense to aggregate data and what non-additive measures are sensible to include in your aggregate tables

Say you have region, territory, customer, product, and salesperson dimensions. You may find that you get the most benefit by creating a region, territory, customer, product, and salesperson aggregate and that, say, an additional region, territory, customer, product aggregate adds little to the performance of your queries. A complicating factor, though, is use of non-additive measures in your aggregates because they will force you to re-aggregate. Suffice it to say that you should think twice before adding these measures to your aggregates.
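
To see why non-additive measures force re-aggregation, take an average as an example (my choice of example; it is only one kind of non-additive measure). The sketch below uses made-up territory figures. Averaging the stored averages gives the wrong answer, while rolling up stored sums and counts gives the right one.

```python
def average(values):
    return sum(values) / len(values)


def rollup_average(parts):
    """Combine (sum, count) pairs from lower-level aggregates into one average."""
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total / count


def demo():
    territory_a = [10.0, 20.0, 30.0]  # average 20.0
    territory_b = [100.0]             # average 100.0
    # Naive: treat the stored averages as if they could simply be averaged again.
    naive = average([average(territory_a), average(territory_b)])
    # Correct: keep the additive pieces (sum and count) and derive the average.
    correct = rollup_average(
        [(sum(territory_a), len(territory_a)), (sum(territory_b), len(territory_b))]
    )
    return naive, correct
```

Here the naive region average comes out as 60.0 while the true one is 40.0, which is why it is usually safer to store the additive components and compute the ratio at query time.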

What are non-FTP ways of transferring data

FTP-ing can be slow. There are a number of high speed transfer technologies to investigate. Also, don't forget about tape. Even if you have to send a tape overnight for early delivery, tape is sometimes the fastest way to transfer data. Also, don't forget about using compression technology in conjunction with transferring.

Whether you should incrementally update or rebuild a table

Sometimes you have the option to either incrementally update a table or rebuild a table. You may find that after a certain level of update activity it is faster to rebuild than to update. A rule of thumb sometimes stated is that if 20% of the records will be updated, it is faster to rebuild. This is a rough rule and the actual threshold will vary. Nevertheless, if you have options, it may be worth experimenting with them.
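
The rule of thumb above amounts to a one-line decision. The function name is made up, and the default threshold is just the rough 20% figure quoted above, not a measured value, so substitute whatever your own experiments show.

```python
def choose_refresh(changed_rows, total_rows, rebuild_threshold=0.20):
    """Pick a refresh strategy from the fraction of rows that will change."""
    if total_rows == 0:
        return "rebuild"  # nothing to update incrementally
    fraction = changed_rows / total_rows
    return "rebuild" if fraction >= rebuild_threshold else "incremental"
```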

What are alternate methods for changed data capture

Presuming you must incrementally update your data warehouse database and you are not extracting from date stamped transaction records in the feeder system, you may find you have a technically daunting task in capturing changed information. Be aware that you may have options in how you do this and the options will differ in speed.
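
One of the options is to compare today's extract with what you saw last time. The sketch below does this by keeping a digest of each row keyed by its identifier. It is only one approach among several (log reading and triggers are others), and the row layout and function names are assumptions for the example.

```python
import hashlib


def snapshot_digests(rows, key_index=0):
    """Map each row's key to a digest of the whole row."""
    digests = {}
    for row in rows:
        # Join fields with a separator unlikely to occur in the data itself.
        text = "\x1f".join(str(field) for field in row)
        digests[row[key_index]] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digests


def detect_changes(previous, current):
    """Compare two digest maps and return (inserted, updated, deleted) keys."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return inserted, updated, deleted
```

Only the digest map needs to be kept between loads, not the full previous extract, but you still have to read the entire source each time, which is where the speed differences between the options come from.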

How to modify feeder systems so changes to records are written to flat files

Though this usually is not worth it, if this is done it can eliminate the time needed to go through sometimes time consuming, convoluted processing to determine what feeder system data has changed.

How to use report scraping software

If a report that has the data you need to extract is available, sometimes it makes sense to put the report image in a file and use software specially designed to extract data from report image files. You do run a risk if the report format changes. But this technique often makes sense for extracting data from systems whose code hasn't been touched in the last ten years.
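
To give a feel for what report scraping involves, here is a small hand-rolled sketch rather than a commercial tool. The report layout, the sample data, and the pattern are all invented for the example. The pattern keeps detail lines and skips titles, column headings, rules, and totals.

```python
import re

# A detail line: 5-digit customer number, name, then an amount with two decimals.
DETAIL_LINE = re.compile(r"^(\d{5})\s+(.+?)\s+([\d,]+\.\d{2})\s*$")

SAMPLE_REPORT = """\
ACME CORP            SALES REPORT           PAGE 1
CUST   NAME                     AMOUNT
-----  --------------------  ---------
10001  Smith Hardware         1,250.00
10002  Jones & Sons             310.75
                     TOTAL    1,560.75
"""


def parse_report(text):
    """Return (customer_number, name, amount) for every detail line."""
    rows = []
    for line in text.splitlines():
        match = DETAIL_LINE.match(line)
        if match:
            number, name, amount = match.groups()
            rows.append((number, name, float(amount.replace(",", ""))))
    return rows
```

The risk mentioned above is visible here: if someone widens the customer number to six digits, the pattern silently stops matching, so it is worth checking the extracted total against the report's own TOTAL line.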

How to perform disk mirroring and hot backups

Disk mirroring and hot backups will not speed up loading the data warehouse database (in fact, if a disk is mirrored while being bulk loaded, loading time can greatly increase) but they can give you some greatly desired flexibility and breathing room. With mirrored disks, you can "break" the mirror, update the copy, and restore the mirror with the updated copy. This means that you can still have your data warehouse available while loading it. (Though be careful that you understand how mirroring can be handled by both hardware and software). Similarly, hot backups allow you to have your data warehouse database available when backing it up. By the way, a cycle of partial backups followed by a full backup is also worth looking into.

How to schedule loading processes

Loading a data warehouse usually requires quite a few processes. Obviously, you want to understand where there are and are not dependencies so you can "multi-task" these processes as much as possible. Where there are dependencies, you want to do risk analyses so you can find out whether it is worth the effort to build in restart capabilities in the intermediate processes. And you want to make sure you have the human and automated support for scheduling the way you want to.
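
Working out what can be "multi-tasked" is a dependency-ordering problem. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to group jobs into waves, where every job in a wave can run at the same time. The job names are hypothetical.

```python
from graphlib import TopologicalSorter


def load_waves(dependencies):
    """Group jobs into waves; jobs in the same wave have no unmet dependencies.

    `dependencies` maps each job to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


JOBS = {
    "load_dim_customer": {"extract_customer"},
    "load_dim_product": {"extract_product"},
    "load_fact_sales": {"load_dim_customer", "load_dim_product", "extract_sales"},
    "build_aggregates": {"load_fact_sales"},
}
```

For these jobs the three extracts form the first wave, the two dimension loads the second, and the fact load and aggregate build follow one at a time. A real scheduler would start each job as soon as its own predecessors finish rather than waiting for a whole wave.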

How to set a restartable checkpoint

Again, checkpoints will not by themselves speed up the loading process. However, if you have a tight window for loading the data warehouse and that loading takes considerable time, availability of a checkpoint can be a lifesaver when the load crashes (which it does at the worst times).
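
A bare-bones sketch of a restartable checkpoint follows. It assumes the load is split into batches and records, in a small file, which batch to start with next time. The file format and the function names are made up for the example; most DBMS loaders and ETL tools have their own checkpoint facilities.

```python
import json
import os


def load_with_checkpoint(batches, apply_batch, checkpoint_path):
    """Apply batches in order, resuming after the last checkpointed batch.

    Returns the index of the batch the run started from.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]
    for i in range(start, len(batches)):
        apply_batch(batches[i])
        # Write to a temporary file and rename so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_batch": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return start
```

Note that a crash between applying a batch and writing the checkpoint means that batch is applied again on restart, so either make the batch load safe to repeat or record the checkpoint in the same database transaction as the batch. Remember also to remove the checkpoint file before the next scheduled load.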

How certain forms of RAID technology can both speed and slow loading

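RAID technology can both help and harm loading speed. For example, striping (RAID 0) spreads writes over several disks and can speed up a load, while parity-based levels such as RAID 5 have to compute and write parity on every write, which can slow one down.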

Partial updating of multidimensional (MOLAP) databases

Many of these tools allow you to only recalculate some of the calculated numbers stored in the "cube". Most of these tools that have the capability will warn you that you do so at the risk of possibly getting data out of synch.

How to distribute data on multiple physical disks

If you can afford multiple disks, you may want to make sure input data, data warehouse tables, indexes, and logs (if you do not disable logging) are on different physical disks. In fact, you may want to learn about striping to spread a file over multiple disks and partitioning to divide a logical file into many physical files spread over different disks.

How to defragment table and index files

This is basic knowledge it will probably do you well to know.

How to make a copy of your transaction system database

If you really want to use your data warehouse only for production reporting, you may be better off just copying the transaction database periodically as is. Architectural purists hate this solution but sometimes it just makes sense to handle your reporting needs this way.

How to use multiple disk controllers

You will want high-speed interconnects to these controllers.

What is the cost of installing more/faster CPU, memory, disk

Sometimes buying metal is (by far) the least expensive way to speed up loading.

Some final comments:

In the long run long loading times usually will cause bigger problems than long query times. It is not completely uncommon that data warehouse development teams find themselves with systems they have promised to update daily but then they find the update time stretches to 12, 14, 16, and maybe even 20 hours. You can throw more and more technology at this but ultimately your best tactics are the ability to understand what really is most important to the business and good user expectation management. And, unless it is done by design, do not let your data warehouse be the main source for operational-oriented query and report functionality that, in the big picture, ought to be in the feeder transaction processing systems.

How to Save Money on Your Data Warehousing Efforts

This essay is not a list of tactics to be used in deploying the technology of your choice. Rather this is a list of pointers that may prompt a data warehouse developer to think twice before making those project management, political, and technical design decisions whose cumulative effect is to force far more resources to be committed to a data warehousing effort than what was expected. First, though, note how much more discretion there usually is in the design and implementation of data warehousing systems as opposed to transaction processing systems. In a transaction processing system, the data to be stored in the system, the users of the system, the service level provided to the users, the technology to be used, and, in many cases, the functionality of the system are usually subject to relatively little discretion. In a data warehousing effort, there is generally far greater discretion over these factors. However, for lack of time, political pressure, or unquestioning acceptance of mainstream industry thinking, data warehousing developers often fail to understand the range of choices they have. That being said, I hope these pointers will give you a little pause....

Have a reason besides expediency for building a report or query in the data warehouse as opposed to the feeder transaction processing system

You probably won't be far into your data warehousing efforts when you see a report or query that could be done in the data warehousing system or in the feeder transaction processing system. And since you're the data warehouse developer you'll probably decide that the report or query is easier to do in the data warehouse. Welcome to the slippery slope! You're going to find more reports and queries that could go "both ways". Before you know it, you can end up with a data warehousing system that is in effect your "production" report and query generation system and which requires the same service level as the feeder transaction processing system. You may even end up doing transaction processing in your data warehouse (some data warehousing analysts politely call this "a feedback mechanism") to send corrected data back to the transaction processing system. Now, using a data warehouse for unbundling the querying and reporting functionality from a transaction processing system may be a good investment if you do it by design. If this unbundling is done insidiously, you can quickly back yourself into supporting, at great cost, two production systems that provide duplicate functionality.

Set expectations about response time before the users use the data warehouse

These "obvious" points never get mentioned enough: 1) Data warehousing performance can fluctuate far more than transaction processing system performance (e.g., for some reason every user will want to do a five year trend analysis at the same time). 2) Not everyone starts using the data warehouse at the same rate. As more users start using the system, average performance tends to drop. 3) If your data warehouse is being used for ad hoc end user work, you most likely won't be able to "tune" your data warehouse system for everything your users are going to throw at it. You best discuss performance issues with your users at the very start of your data warehouse investigations. Else they may expect response time to be the same as moving a cell in an Excel worksheet. If you do not discuss expected performance issues with your users, you are setting yourself up for costly (and possibly perpetual) rework of your design when the data warehouse performance does not meet the initial expectations of the users.

Do the work to determine the economics of different service levels

Get an appreciation of how much increments to the data warehouse service level cost. This type of analysis is an "art" but an art that your database/hardware vendor/consultant (with your questioning every assumption they make) should be able to help you with. By the way, the important knowledge is how making adjustments with a given set of technologies will change cost and expected performance. Be skeptical about comparing this type of analysis between different sets of technologies.

Do the analysis of whether platforms your organization has been using for a long time are appropriate for your data warehousing efforts

Mainframe, proprietary midrange, and file server network operating systems are legitimate platforms for data warehousing. Before data warehousing was called data warehousing, these platforms were being used quite successfully for data warehousing systems. In fact, though you will not read about it in the trade media, these platforms still are being used successfully for data warehousing. The platforms are not always appropriate but if you have a substantial investment in these platforms and the "keepers" of those platforms are not overly resistant, it is worthwhile to do the analysis.

Do the analysis of whether your users should directly report/query against data stored in the transaction processing systems

In the 1970s, the mainstream industry wisdom was that data should be extracted and reported against. In the 1980s the mainstream wisdom did a "180" and said that "data shall not be duplicated" and that you should go against the real stuff. In the 1990s, the mainstream wisdom did another "180". Reporting against transaction processing system data is not always appropriate, but unless you automatically want to accept mainstream wisdom which never seems to consider the varieties of situations people face, you may find doing the analysis worthwhile. (And then in the 2000s you will be considered in the avant garde and you will be a source for mainstream wisdom.)

Bargain with the database and hardware vendors

Chances are you are going to buy your database and your hardware from some well known, historically profitable vendors. If you do your homework, you will find written material (not specifically about data warehousing though) and consultants available to advise you how to deal with specific vendors.

If you will have large numbers of users who only run canned reports, consider the alternatives to providing these users with "full blown" client based report and query, OLAP tools

In the typical data warehouse, the majority of users will strictly be running canned reports. (Estimates that 75% - 98% of data warehouse users are strictly report users have appeared in the trade press.) A great deal of money can be spent licensing and supporting functionality that the users will rarely use. Alternatives to providing canned report users with full blown tools vary based on the technology you are using and the politics of the situation. But the alternatives are usually there if you look.

Implement query efficiency enhancing design techniques that do not require special hardware or software

Specifically learn about using aggregate tables and partitioning. These techniques can be used with any type of database or file access methods. Though these techniques can be overused, they generally are the simplest, most effective, and least expensive ways to speed up retrieval of information.
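
As a concrete, if tiny, illustration of an aggregate table, the sketch below summarizes a detail table by month and product using plain SQL, with nothing beyond what most databases offer. The table and column names are assumptions, and SQLite is used through Python's sqlite3 module only because it needs no setup.

```python
import sqlite3


def build_sales_by_month(conn):
    """(Re)build a small aggregate table from the detailed sales table."""
    conn.execute("DROP TABLE IF EXISTS sales_by_month")
    conn.execute(
        """
        CREATE TABLE sales_by_month AS
        SELECT substr(sale_date, 1, 7) AS sale_month,
               product,
               SUM(amount) AS total_amount,
               COUNT(*)    AS sale_count
        FROM sales
        GROUP BY substr(sale_date, 1, 7), product
        """
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("2024-01-05", "widget", 10.0),
            ("2024-01-20", "widget", 5.0),
            ("2024-01-21", "gadget", 7.5),
            ("2024-02-01", "widget", 2.0),
        ],
    )
    build_sales_by_month(conn)
    return conn.execute(
        "SELECT sale_month, product, total_amount, sale_count "
        "FROM sales_by_month ORDER BY sale_month, product"
    ).fetchall()
```

Queries that only need monthly totals can then read the few rows in sales_by_month instead of scanning every detail row.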

Itemize possible data cleaning tasks and, with the data warehouse users, examine if each of the major tasks is worth the effort

You will probably come up with a long list of data problems many of which are not worth the effort to clean up. Note that "worth" is a judgment that the data warehouse developers and the users have to agree upon.

Think twice before building the means to perform complex calculations that few business users understand

It is not that uncommon for one business user to decide that he or she needs the data warehouse to store or report a set of numbers that are extremely difficult to determine and, more importantly, that most business users have a hard time understanding. In this case, the data warehouse developer has to diplomatically discuss whether it is worth calculating a set of numbers that perhaps only one business user will understand. Sometimes it is, most times it is not.

If the main reason you are considering a data warehouse is to get around the difficulties caused by a dysfunctional transaction processing system, do the work of costing how much it will cost to fix the transaction processing system before you make the data warehouse decision

It may not be surprising that the primary motivation for the construction of many data warehouses is to get around the difficulties caused by a problematic transaction processing system. Immediately deciding upon a data warehouse as a "fix" can be an expensive mistake. If you don't do the work of costing how much it will cost to fix the transaction processing systems, you may never understand what is really causing the problems. And then you're setting yourself up for a situation where the same problems recur in the data warehouse and you end up supporting both a dysfunctional transaction processing system and a dysfunctional data warehouse.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, you may not NEED a data warehouse

Sometimes a good report generator will do just fine.

Question whether you really will benefit from certain categories of tools

For some data warehouse implementations, certain types of tools just do not make good business sense. For example, if you have no need for the slice-and-dice or modeling capabilities of OLAP tools, a report and query tool may meet your reporting needs more than adequately. If you have to perform fairly complex data transformations and/or you have relatively few data sources and targets, you may be better off coding by hand than using a so called "data mart" tool. The database you use for transaction processing may do just fine based on the number of users, amount of data, and time you have to load the database. Before buying data mining tools do your best to assess whether they will yield "actionable" insights worth the effort in making the data mining tool work.

Accept that data warehousing is going to be technically messy

If someone were ever to write "The Zen Of Data Warehousing" (perish the thought - please), one of the concepts would probably be that at some point, the more technically elegant you try to make these systems, the messier (and more costly and less beneficial) they end up being. There are no rules for determining where this point is. Use your judgment and intuition to make the determination.

Using Data Warehousing in Strategic Decision Making

Though you can read many definitions of data warehouses that say that these systems are designed for "strategic decision makers" (or some other similar term) there is little written about actually using data warehouses in strategic decision making processes. In this essay, I would like to offer some insight into using data warehouses in such decision making exercises. First, let me define strategic decision making. There probably are thousands of published definitions. For working purposes let me say that a strategic decision is one that involves spending a lot of money and/or firing/re-assigning/hiring a lot of people and/or that is going to cause a lot of pain/joy until the next strategic decision is made. (Of course "a lot of" is a relative term.) I assert that most of the uses of data warehouses are not for strategic decision making. Probably the most important reason for this is that strategic decision making usually is not done that often. Rather, I believe that most data warehouses are used primarily for post decision monitoring of the effects of decisions. Nevertheless, some data warehouses do get used in strategic decision making and are used very profitably. What follows are some personal observations on how you may actually use a data warehouse in a strategic decision making exercise.

Creating "special" databases, modeling (not in the IS sense of the word), and formal reporting are the most time consuming tasks when using data warehouses in strategic decision making.

Later I will go into more detail regarding these topics.

Systems for strategic decision making tend to be relatively short-lived.

The amount of time spent using these systems sometimes can be measured in days counted on one hand. Those couple of days using the system, though, can bring more payoff than some canned reporting system used for years.

Usually the work must be done quickly and is requested with little advance notice.

This work usually has to be done in anything from a long afternoon to several weeks. This is "figure it out as you go along work" where IS often must take the part of the business analyst. There is usually no time for formal interviewing and extended data modeling exercises. The "requirements" are usually gleaned from "business" meetings which IS may have a little struggle to get into or are related secondhand from attendees of these meetings. These requirements are usually ambiguous. IS usually has to put on its business hat and figure out what is really needed by the business.

You will probably have to aggregate data differently, use different calculations for derived numbers, and combine data that never have before been combined.

The work you are doing allows the business to see a point of view that is not the common view of the business. (In other words, a part of many effective strategic decision making exercises is to see the business in a different perspective.) You are doing this work because when you built the data warehouse, you built it according to what then was the common view of the business.

You may need to create special databases.

Often you need to run repeated queries against a subset of the data warehouse. The subset may be one created by an extract query with quite complex constraints. Or, as I just mentioned, you may need to repeatedly access new aggregates and calculations or you may have to repeatedly concurrently access data that are not in the production data warehouse or that are in the production database but are not easily combined. For the sake of simplicity and efficiency, your best course is to create a special database. You may be thinking you created a data warehouse so you would not have to build special "extracts" but, perhaps to no surprise, often there just is no way of avoiding these extracts. (For more on somewhat similar ideas about these special databases, see Thomas Davenport's description of a "data deli" and Ralph Kimball's discussion of "behavioral studies".)

You may have to "feed" data into user maintained spreadsheet models.

Much of the use of data warehousing for strategic decision making ultimately involves "feeding" user maintained spreadsheets. These "feeds" are either links to data stored in a data warehouse or the actual loading of data into spreadsheets. The spreadsheets are used because the user needs to change complex calculations - maybe as part of a scenario analysis but usually because there is continual doubt about how certain calculations should be made - and the user is most knowledgeable about doing these changes in the spreadsheet environment. (To put this in a little more technical terms, many of these calculations are inter-record, cross dimensional calculations). Many OLAP tools allow a great deal of flexibility in making calculations but these capabilities tend to be too difficult for the user who is in a hurry in the strategic decision making exercise. Note also that oftentimes it is necessary to, in turn, feed spreadsheet data into the special databases you have created.

Sometimes data cleanliness is much less of a concern in strategic decision making.

Sometimes the fact that the analysis is being done with highly summarized data and/or the need for speed lessens the need for extremely clean data. I do suggest, however, that whatever the data expectations are, you keep an audit trail that lets you trace how data were derived from feeder systems.

You may have to create some highly formatted reports.

The information from the data warehouse has to be communicated to people who do not have and/or want direct access to the data warehouse. In a strategic decision making exercise, despite the rush, your users may want to communicate the information in printed reports that look just "so". These reports are usually being created to persuade someone. Many of your users will want a polished look to the reports in order to convey credibility. Also, graphs are usually created for these exercises. By the way, there is usually some give and take as to whether these reports and graphs should be created manually (i.e., with a word processor, presentation tool, spreadsheet) or generated directly from the database.

Now some advice:

Probably the most important determinant of the benefit you will get from technology is your ability to figure out the most insightful questions that the technology enables you to ask.

Do not assume that your users have full appreciation of the power of the technology. Unless you have some users with good gut instincts about technology, IS has to take the part of the business analyst to spur the imagination of the users.

Try to get in "the loop" early.

Users will tend to either grossly underestimate or overestimate the power of the data warehouses in these strategic decision making exercises. This means that either IS can miss an opportunity or be faced with an impossible task that must be done quickly. Note that there are usually politics in getting in the loop early. However, having previously built up a relationship of trust with a "decision maker" helps greatly.

http://www.dbmsmag.com/9702d05.html

http://www.cio.com/archive/020198_think.html

When you are initially designing the warehouse, do not try to design for every contingency that could occur in a strategic decision making exercise.

You are not going to be able to foresee everything that will be needed in these exercises. Do not put everything you can possibly think of in the data warehouse. Do, though, try to keep atomic data in some electronically retrievable format. Do your best to conform the main dimensions of data used in your business. (That means customer, product, financial account, and internal "entity", i.e., people and department, identification.) Do address the slowly changing dimension issue. And do not make yourself completely dependent on outside resources whose availability you cannot control. These exercises come up unexpectedly.

Do not let the knowledge of the systems stay in the minds of the outside technical consultants

This trite and obvious piece of advice needs to be repeated. The technical consultants are gone and not available when these opportunities come up. If the key knowledge of your systems is in the heads of consultants, you may be up the creek when these exercises come up.

Learn spreadsheets and how your data warehouse can interact with them.

We in the data warehouse world often forget that the spreadsheet is by far the most used decision support tool. People supporting data warehouses that really will be used for decision support should be encouraged to learn the scripting language of the spreadsheet (which for most people is Visual Basic for Applications) so they have flexibility in coming up with solutions in these strategic decision making exercises.

Don't "production-ize" your work.

The technical work done in these exercises is usually not "industrial strength" and it is probably not worth the effort to make it so. You may learn, though, that you need to modify your production data warehouse database. Also, do keep your work around so you can cannibalize code for the next strategic decision making exercise.

Do not claim that data warehousing alone will necessarily improve strategic decision making

It needs to be oft-repeated that if a person is a mediocre decision maker, technology alone will not make that person a better decision maker - especially in the realm of strategic decision making where, despite our 100 TB databases, much more remains unknown than known.

Don't miss these opportunities.

It is hard to calculate the expected ROI of a data warehouse project. Most businesses have to go on faith that the effort somehow will be worth it. Well, success (or, sometimes, just participation) in a strategic decision making exercise, despite the messiness of the work, can strongly bolster the belief that the data warehouse was worth the effort. If you do not justify a data warehouse before building it, it is smart, perhaps imperative, to justify the data warehouse after the fact. And the best way you are going to do this is "anecdotally" with successful war stories like a strategic decision making exercise.

Maintenance Issues for Data Warehousing Systems

Another important aspect of data warehousing and decision support systems (hereafter referred to as DW/DSS systems and I know that is redundant) where I see little public discussion is maintenance of these systems. Here I present some of the issues that you may face when your systems are "in production", as if these systems ever achieve the stability implied by that term. How you will deal with the issues will depend on your environment. This list is presented because, just as mentioned in my gotchas page, forewarned is forearmed!

You will be challenged to learn about business and feeder system changes that will affect the DW/DSS systems

You as the system developer would like to know of developments that will affect the DW/DSS systems in time to allow adequate time to assess what is impacted, make changes, test changes, etc. Of course this is no new concern to anyone doing systems maintenance. If you are responsible for a system being fed from, say, 10 sources, you may have much more exposure than you have with the typical transaction processing system. And though intelligent use of the data extraction, cleaning, and loading tools and the information catalogs can greatly ease the burden here, many changes will require a fair amount of effort. By the way, keeping informed and assessing the impact of technically driven changes to the feeder systems may be more difficult than keeping track of the business driven changes. If your IS organization has change control meetings, it is a major mistake for a DW/DSS developer not to attend those meetings regularly.

You will have to figure out if, when, and how to purge data

There comes a point when it does not make business sense to hold certain data in the warehousing system. This usually comes sooner than you expect. Either you are at some type of capacity limit or, more likely, you are restructuring data and it is not worth the effort to restructure certain data. When you are at this point you may realize that the DW/DSS system has become a breeding ground for corporate information pack rats ("Why just last week ______ asked for an analysis going back to 1956!"). Before you get into a discussion about purging data, one piece of advice is to learn about less expensive, alternative means of storage.

You will have to determine which queries and reports should be IS written and which should be user written

Probably when you got started in this area you had an idea about who would be doing what. And if you are like most DW/DSS developers, after you have been in production a while you have seen how reality has differed from your expectations. A very common IS expectation is that the end users will take over the overwhelming majority of query and report writing duties. And an all too common reality is that IS ends up taking over almost all the query and report writing, or IS writes some semi-canned queries and the potential of the system for answering ad hoc questions never gets fully realized. You may have a challenge on two fronts. You may have to push the end users into "deep water". You may also have to convince your IS staff that the report and query building tools are not "toys".

You will be motivated to store data in the data warehouse "for data's sake"

You and/or the users of the system will see "holes" in the data you store in the data warehouse. Mainly for the sake of completeness, you will be tempted to add this data. Unfortunately, when you have yielded to this temptation several times, you will find you have exploded the size and complexity of your data warehouse without proper consideration of whether the incremental size and complexity had business worth.

You will find endless opportunities to tune DW/DSS system databases

I once saw a quote from the director of IS of a well-known retailing business who said that the biggest data warehousing lesson he learned is "there aren't many data warehousing experts out there". If you are allowing a fair degree of end user developed access to systems and your systems are large and complex, you will discover that there are myriad ways to drag the systems down to a crawl. It is unlikely that an "expert" can foresee all the problems. And many of the problems are so crazy that the only way you are going to solve them is on a trial-and-error basis. By the way, you may have sold the DW concept as a way that "killer queries" will not drag down your "production" systems. Now that you've put in a data warehousing system, you will find out that the users are just as dependent on the data warehousing systems for recurring needs as they are on the so-called production systems, and killer queries hurt wherever they occur.

You will have to balance the need for building aggregate structures for processing efficiency with the desire not to build a maintenance nightmare

Many DW/DSS systems involve building structures to contain aggregated information. These "structures" can be many things - separate tables in relational systems, dimensions in the OLAP world, etc. Anyway, after a while you will see countless ways to add or refine these aggregate structures, usually in the name of reducing end user retrieval time. The issue you face is balancing your desire to speed things up with the need to be careful about how much of a maintenance burden you want to take on. There are two aspects of this burden. First, you have to consider developer time. Second, you have to consider the amount of time it takes to update your systems on a recurring basis.

You will be uncertain whether to create certain reports/queries in the data warehousing system or in the "feeder" transaction processing system

You are best advised to have some guidelines as to what goes where. If not, you may eventually find that you have almost a clone of your transaction processing system in your data warehousing system.

You will be pressured to implement a means to interactively correct data in the data warehouse (and perhaps send back corrections to the transaction processing system)

And you thought your data warehouse was read-only! I am not saying this is necessarily bad. Though, as in the last point, you have to be careful you are not setting yourself up to build a clone of a dysfunctional transaction processing system.

You will be uncertain which tools are most appropriate for a certain task

DW/DSS systems present IS with yet another set of tools with overlapping uses. You will find that it is not clear what is the best tool for many applications. For instance, if you have invested in relational and multidimensional database technology, you will find that for many applications, at a technical level, it is a toss-up as to which database technology will do the job better. Many organizations also have a heavy duty tool and a more lightweight tool that have similar ends. You will come across many situations where it is not clear whether to go heavy duty or lightweight.

You will have to figure out how to test the effect of structure changes on end user written queries and reports

After a while you are going to make some database structure changes that may affect the reports and queries that your end users have written. In order that the need to re-test their work does not come as too bad a surprise to your end users, may I suggest that you get them into good housekeeping habits early on. This means, for example, not keeping their work in 10 different directories and storing descriptions of their work.
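One inexpensive way to soften the surprise is to scan the saved query text for the columns you plan to change. The following is only a minimal sketch: it assumes the end users' queries are available as SQL text keyed by a name, and the query names, SQL, and column names are all hypothetical. It does a crude word match rather than parsing SQL, so it can produce false positives.

```python
import re

def queries_affected(saved_queries, changed_columns):
    """Map each saved query name to the changed columns its SQL text mentions."""
    affected = {}
    for name, sql in saved_queries.items():
        # Whole-word, case-insensitive text match; this is not a SQL parser.
        hits = sorted(
            col for col in changed_columns
            if re.search(r"\b" + re.escape(col) + r"\b", sql, re.IGNORECASE)
        )
        if hits:
            affected[name] = hits
    return affected

# Hypothetical saved queries, as they might be collected from one shared directory.
saved_queries = {
    "monthly_sales": "SELECT region, SUM(sale_amt) FROM sales GROUP BY region",
    "late_payers": "SELECT cust_id FROM invoices WHERE days_late > 30",
}

print(queries_affected(saved_queries, ["sale_amt", "cust_name"]))
```

A scan like this only works if the queries are kept in a known place with descriptions, which is the housekeeping habit suggested above.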

You will have to determine how problems with feeder system update processing affect DW/DSS system update processing

Again, if you have 10 systems feeding your data warehouse, you are going to have to develop an appreciation of what to do when there is a processing problem with one or several of those feeder systems. At the simplest level, this means determining if and when you will process updates to the data warehousing system. At a more difficult level, this means determining if and how to process partial updates to the warehousing system. The dependencies in DW/DSS update processing can get quite complex. Do take the time to understand these dependencies especially if you do not have the most well-behaved feeder systems.
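To make the partial update question concrete, here is a small sketch of one possible rule: load a warehouse target only if none of the feeder systems it depends on has failed. The target names, feeder names, and the rule itself are assumptions for illustration; your own dependencies may call for something subtler.

```python
def loadable_targets(dependencies, failed_feeders):
    """Split warehouse targets into those safe to load and those held back."""
    failed = set(failed_feeders)
    ready, held = [], {}
    for target, feeders in dependencies.items():
        blocking = sorted(failed & set(feeders))
        if blocking:
            held[target] = blocking   # record which feeders block this target
        else:
            ready.append(target)
    return ready, held

# Hypothetical map of warehouse targets to the feeder systems they need.
dependencies = {
    "sales_fact": ["order_entry", "billing"],
    "inventory_fact": ["warehouse_mgmt"],
    "customer_dim": ["order_entry", "crm"],
}

ready, held = loadable_targets(dependencies, ["billing"])
print("load now:", ready)
print("hold back:", held)
```

Even this simple rule forces you to write the dependencies down, which is most of the benefit.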

You will find that maintaining a data warehouse architecture may be much harder than establishing the architecture

By architecture, I refer to consistent use of dimensions, definitions of derived data, attribute names, and data sources for specific information. Unless there is someone with responsibility to keep his eye on subsequent data warehouse development, it is easy to quickly lose the benefits of the hard work it usually takes to establish the architecture. By the way, the person keeping his eye on this development must: 1) Have some judgment - your expectations of what should remain consistent will change over time 2) Be able to work in a persuasive, not coercive manner - data warehouse developers especially resent "architecture police".

You will find that the business changes the meanings of attributes over time and that these changes can be overlooked

For example, say that you work for a fruit distribution company. Perhaps it has a policy of using category code "100" for sales of apples and oranges. If the company suddenly starts using code "150" for oranges, though your dimension table change capture mechanism may handle the change (I hope you know about slowly changing dimensions), there now is a question of how, well, apples to apples and oranges to oranges comparisons should be made for historical purposes. Often there is no "right" way to handle these issues that come up in comparing historical data. You do, though, have to do your best so you know there is an issue.
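One possible way to keep the years comparable is to roll both codes up to a common comparison group. The sketch below assumes that approach; the years, amounts, and group name are made up. Note what it cannot do: it cannot separate oranges from apples in the years when both were recorded under code "100".

```python
from collections import defaultdict

# Hypothetical roll-up: both the old and the new code map to one comparison group.
COMPARISON_GROUP = {"100": "apples_and_oranges", "150": "apples_and_oranges"}

def sales_by_group(rows):
    """Total sales per (year, comparison group); unknown codes are kept visible."""
    totals = defaultdict(float)
    for year, code, amount in rows:
        group = COMPARISON_GROUP.get(code, "unmapped:" + code)
        totals[(year, group)] += amount
    return dict(totals)

# Made-up sales rows: (year, category code, amount).
rows = [
    (2000, "100", 500.0),   # apples and oranges together
    (2001, "100", 300.0),   # apples only, after the change
    (2001, "150", 250.0),   # oranges under the new code
]

for key, total in sorted(sales_by_group(rows).items()):
    print(key, total)
```

Labeling unmapped codes rather than dropping them is one way to make sure a new code does not go unnoticed.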

You will have to rework how you have implemented security

Most firms, if their data warehousing systems are used for ad hoc reporting, will find their security schemes are either too loose or too tight. You will find that assigning security is a balancing act. You want to minimize security breaches but on the other hand you do not want to minimize the chance of a user discovering some useful business insight as a result of his examining something that someone else might have thought was beyond the scope of his everyday concerns.

You will have to keep reconciling feeder systems with the DW/DSS systems

After things have been going smoothly for a while, sometimes there is a tendency to be slack in whatever process you have implemented to reconcile systems. Also, if you have end users reconcile information, you may find that it is an ongoing discussion as to how to handle responsibility for regular reconciliation.
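A reconciliation process does not have to be elaborate to be run regularly. As a sketch, assuming you can obtain control totals per period from both the feeder system and the warehouse (the period keys, amounts, and tolerance below are hypothetical), a comparison might look like this:

```python
def reconcile(feeder_totals, warehouse_totals, tolerance=0.01):
    """Return periods whose totals differ beyond tolerance or are missing on one side."""
    issues = {}
    for key in sorted(set(feeder_totals) | set(warehouse_totals)):
        f = feeder_totals.get(key)
        w = warehouse_totals.get(key)
        if f is None or w is None or abs(f - w) > tolerance:
            issues[key] = (f, w)   # (feeder value, warehouse value)
    return issues

# Made-up control totals by period.
feeder = {"2001-01": 1000.00, "2001-02": 1250.50, "2001-03": 980.00}
warehouse = {"2001-01": 1000.00, "2001-02": 1245.50}

for period, (f, w) in reconcile(feeder, warehouse).items():
    print(period, "feeder:", f, "warehouse:", w)
```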

You will have to perform euthanasia on some DW/DSS systems

DW/DSS systems tend to be changed frequently. They experience entropy much more quickly than, say, general ledger systems. If your firm is used to keeping and patching a system for as long as you keep a refrigerator (and these days there are firms like that dipping their feet in DW/DSS for the first time), you may be in for a surprise.

You will find it is far more expensive (and complex) to maintain a data warehouse than to build one

Hope you got that point by now!

What Decision Support Tools are Used For

In the section on the "dirty little secrets of data warehousing" in her fascinating book "e-Data", Jill Dyché notes many IT departments don't really know how the business is using its data warehouse. It is not necessarily bad, though, if IT does not know all the specific uses. Sometimes the sign of a great warehouse is that the users "run with it" on their own. Nevertheless, it is possible to get a general idea just what the decision support (a.k.a., business intelligence) tools used to access a data warehouse are being used for. In this essay, I will attempt to make a general statement about use of these tools. Perhaps data warehouse support people can do a better job if they have a better feel for what the tools are really being used for. The main uses of decision support tools are:

To check that "everything" is okay

Surprise! Nothing will be done with many, perhaps most, of the queries and reports created with decision support tools. They are run to confirm a person's usually not crisply defined notion but intuitively felt notion of "okayness". If I were able to write the essay on "The Zen of Data Warehousing" (which I will not), I would say a primary function of decision support tools is to support non-action.

To confirm the "obvious"

Most end users the reports and queries are ultimately being produced for have a pretty good gut feel for what is going on in their area of concern. Decision support tools do not tell these people anything amazing that the people don't already suspect. But the information produced with the tools gives them confidence their gut feel is okay.

To figure out how something "works"

Most people are not looking for some grand Unified Theory of how firm XYZ works. Rather, they want to understand some small aspect of an operation like Customer A always pays on time, Customer B usually pays late and still takes the early payment discount, etc.

To convey information in a more digestible manner

These tools are often used to convey what a person or persons already know. These knowing people use the tools simply to present information to other people in a way that it is more easily read.

To compare information about customers, products, cost/profit centers, financial accounts

Sometimes this is side by side comparisons of a series of measures. Sometimes this is identification of the most, the least, the earliest, the latest, etc.

To compare the same type of information in different time periods

This is simply the usual daily, weekly, monthly, quarterly, yearly comparisons.

To check performance versus formal and informal goals or constraints

That is, measures of what actually occurred are compared with budgets, forecasts, quotas, or some other types of goals.

To identify the out of the ordinary

Usually the ultimate consumer of the tool's output has somewhat vague criteria of what is out of the ordinary. The decision support tools do double duty in that they help refine the criteria of what is out of the ordinary and identify what fits the refined criteria.

To grab a little piece of information out of a large volume of information

These tools make picking that virtual needle out of that virtual haystack a lot simpler.

To get around an Information Technology department that does not have the time or the resources to write reports

Often end users use these tools out of impatience with the IT department. Or, the IT department gives the users these tools to relieve the pressure on itself. The end users in these cases often write reports that could hardly be called analyses.

To provide a report "of record"

For all kinds of reasons it is often necessary for people to agree that "these are the numbers". Note they do not have to agree on all the data - just some data whose credibility must be accepted for actions to be taken. Decision support tools often are used to produce this "official" information.

To confirm and sometimes to discover trends and relationships

With all respect to the people working hard on data mining, I think that most good businesspeople have an intuitive feeling of the most important trends and relationships between factors that are affecting their business. The decision support tools perform the function of confirming their intuition. Yes, the tools also can help discover trends and relationships but it is difficult (though potentially profitable) to sift out the meaningless and spurious trends.

To help advocate a position

These tools are not just for "objective" presentation of the facts. Often they are cleverly used to help bolster the case for doing (or not doing) something.

To provide data for a what if analysis or a forecast

That is, the tools are used to feed data into a spreadsheet where the actual what-if analysis or forecast will be done. The tools can do some of the what-if-ing and forecasting themselves but most business users are more comfortable doing this work in spreadsheets.

To repeat points I have made in other essays, despite their name most of these tools are not used as the sole input into making a non-trivial decision. Nor do they directly supply what I would consider to be business intelligence. Decisions are made and business intelligence is garnered only with the combination of the output of the decision support tools, human judgment and intuition, and the ability to put the information spit out by tools into a context of information that is much wider than any data warehouse, transaction processing system, or knowledge repository can handle.

Is Web Data Analysis (i.e., Web Mining) Different?

The topic of analyzing web data (also referred to as clickstream data) is one of the more discussed topics in the niche of data warehousing/decision support. Though there has been some intelligent writing on the topic, most of what is written seems to be the same unquestioning praise of supposedly revolutionary changes that analyzing this data is going to bring about. This essay is not meant to be a how-to primer but rather to raise some questions in the mind of the reader. In this essay I would like to challenge some of the usual industry hyperbole.

Web data are the record of what actions a user takes with his mouse and keyboard while visiting a site

That is all it is. It is not that mysterious. In fact, if data could be characterized as mundane, web data would have to rank among the most mundane.

Web data are just another source of data - with its own quirks and with limitations that come with all other sources of data

If you have worked with a variety of other data sources, you probably know much of what you need to know about working with web data. Yes, web data have quirks but what data (especially data as detailed as raw web data) do not have quirks.

The primary beneficiaries of web data analysis are web designers

Not many bet-your-company (and bet-your-career) decisions are going to be made with the results of web data analysis. Mostly it will be used for making many little decisions about how to modify the design of a web site. On the other hand, if your company is betting its continuance on smart use of its web site (and, except for the dot-coms, not many companies fall into that category), the cumulative effect of these little decisions may be company and career endangering.

The businesspeople will want and benefit most from highly aggregated web data that are usually combined with non-web data

Most web data has far more detail than the usual marketing or financial person wants to see. And these people think in terms of relative performance of "channels", most of which, for non dot-com companies, are not web based.

The person who is going to get the most insight from web data is the person who understands designing web sites so they are used profitably and who understands the power of data analysis

These people are hard to find! Sorry about the stereotypes but, at least in my limited exposure to good web designers and people who may not be hands-on designers but do have a good feel for the power of a web site, they are very different people from the financial and marketing analysts that data warehousing/decision support developers are used to working with. Most students of effective web design do not strike me as people who want to sit down with a query/report tool or OLAP tool and refine some analysis for three hours.

Often web data analysis yields conclusions that would be immediately obvious to a good web designer

Web data analysis can serve as a very expensive substitute for a good web designer. On the other hand, though, sometimes web data analysis can be an inexpensive substitute for a very expensive web designer.

The value of detailed web data declines pretty fast over time

Though many data warehousing implementers won't admit it, most data loses value over time. (If you want to be a little more academic, the expected value of the data declines over time.) Because web sites change so much, the value of the web data declines quickly. Imagine doing a traditional cost center spending analysis. Now imagine what would happen if the cost centers and their reporting hierarchy would change everyday. This is kind of what it is like to analyze some web data.

In the same vein, the value of old detailed web data is dubious

I have read the publications predicting petabyte sized warehouses of months and even years of web data. What I have not read, though, is what people will do with older web data. Probably any web site that generates that much detailed data changes so often that, except at a very aggregated level, it is hard and perhaps meaningless to compare older data with newer data.

You can deliver "real-time" access to web data but your users will not be able to analyze it in real time

I read the pundits who say now you have got to go out and build usually expensive means to let users analyze web data generated up to the last millisecond. I don't know who the pundits work with but most people I have encountered who analyze data are not polymaths who can, on a recurring hourly basis, disgorge meaningful analyses.

Web data is far "dirtier" than the usual data warehouse data

Web data often present problems with identifying web site users, identifying what was viewed, identifying the sequence of user activity on a web site, and identifying when the user started and stopped looking at a web site. Data may have gaps or data may be suspect. Many of these problems are not solvable given the design goals of a web site.

Web data relies on some pretty fuzzy categorization

All you may know about the web site user is (what you think are) the sequence of his clicks. To make this data sensible, you may have to categorize users by their clicking sequences. Also, you may have to categorize the pages on the web site. These categorizations can get pretty fuzzy. By that, I mean there may be many, many ways to categorize with no compelling reason to use one categorization method over another. Also, though it is not exactly categorization, you also have to define a "session" - when a user started and stopped accessing a web site. The definition of a session can be arbitrary.
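To see how arbitrary a session definition can be, consider a common convention (an assumption here, not something any particular tool mandates): a session ends after some period of inactivity. In this sketch the click times are made-up seconds for a single user, and the two timeouts are arbitrary choices.

```python
def sessionize(click_times, timeout_seconds=1800):
    """Group one user's click timestamps (in seconds) into sessions split at inactivity gaps."""
    sessions = []
    for t in sorted(click_times):
        if sessions and t - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(t)    # gap is short enough: same session
        else:
            sessions.append([t])      # first click, or gap too long: new session
    return sessions

# Made-up click times for one user.
clicks = [0, 60, 1000, 2500, 2600, 9000]

print(len(sessionize(clicks, 1800)), "sessions with a 30-minute timeout")
print(len(sessionize(clicks, 600)), "sessions with a 10-minute timeout")
```

The same clicks yield a different number of sessions depending only on the timeout chosen, and there is usually no compelling reason to prefer one timeout over another.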

If session data are culled from multiple servers, you probably have a unique problem

If the servers' clocks are not exactly (!!) in sync, you are going to have a hard time tracing user activity.
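The following sketch shows why even a small clock difference matters. It merges log events from two hypothetical servers and orders them by time. The server names, pages, timestamps, and the one-second offset are all invented, and in practice you often do not know the offset at all.

```python
def merged_order(events, clock_offsets):
    """Order pages by logged time minus each server's assumed clock offset (seconds)."""
    corrected = [
        (ts - clock_offsets.get(server, 0.0), page)
        for server, ts, page in events
    ]
    return [page for _, page in sorted(corrected)]

# Made-up events: (server, logged timestamp in seconds, page requested).
events = [
    ("web1", 100.0, "/cart"),
    ("web2", 99.5, "/checkout"),
    ("web1", 101.0, "/confirm"),
]

print(merged_order(events, {}))               # trusting the logged times as-is
print(merged_order(events, {"web2": -1.0}))   # assuming web2's clock runs 1 second slow
```

Taken at face value, the logs say the user checked out before visiting the cart. Only the assumed correction restores a plausible path.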

If your site generates pages dynamically, you may have to write your own system to track the dynamic content

This information also has to be correlated with the log file analysis. If a page consists of multiple dynamically generated areas, then you have a more complicated problem.

Web data issues make it harder to do the manual judgment tasks needed to use data mining tools to separate useful information from gibberish

By now there is awareness that a great deal of judgment that can only be provided by a human being is needed for most data mining work. As you can imagine, all the problems with web data make it harder to do these judgment tasks that no software can do.

Often cursory analysis of web data produces most of the value that can be gained from analyzing the data

Or, in more academic terms, the marginal value of additional analysis may drop pretty rapidly. The data may be so dirty and so fuzzy that analyzing it further may not be worth it.

Web data by itself do not give you much information about the web site user

Unless the web site user has bought something from the site, you know very little about the site user. (I read that most registration information, if given, is false.) And even if a site user has bought something, you need to combine the web data with internal and external (like Equifax, etc.) non-web data to learn something about the web site user.

Web data do not give you that much information about why a person does not become a customer

When you read that web data is supposed to help you find why a person did not become a customer, you find you do this by analyzing the clicks of a customer who left the site without buying. Also, the last page a person clicked on is supposed to be important to analyze. In actuality, you get a little information that is usually not great. Remember, usually the only thing you know about the non-customer is his clicking pattern. Analysis of clicking patterns, as mentioned before, can be quite moot.

Some marketing writers have questioned the effectiveness of the extremely targeted marketing some firms attempt via web data analysis

Though I make no claim to be a marketing expert, some of the supposed experts whose publications I have read have questioned the effectiveness of finely segmenting markets (which at its most extreme is segmenting markets to one person). They say that at some point in segmenting a market it is actually possible to get negative marginal returns. I interpret their writings to mean that marketers have to be humble about their understanding of consumer behavior. Though it seems counterintuitive, much more can be effectively acted upon by observation of group behavior rather than by observation of individual behavior.

This essay is not meant to dissuade anyone from analyzing web data. Web data analysis can be extremely profitable. But like all other applications of data warehousing/decision support, web data analysis has to be done intelligently. That is, we have to know who our real users are, honestly acknowledge the data problems we cannot solve or can only partially solve, and make our decisions on how much we want to analyze with an eye to expected marginal benefits versus expected marginal costs.
