Making Sense of BUSINESS INTELLIGENCE
A Look at People, Processes, and Paradigms
Ralph L. Martino
4th Edition
Business Intelligence works because of technology, but is a
success because of people!

Table of Contents
Introduction
Background
Modeling the Enterprise
Operational Information Process Model
Proactive Event-Response Process Model
Analytical Information Process Model
Understanding your Information
Understanding your User Community
Mapping and Assessing Analytical Information Processes
Information Value-Chain
A look at Information Strategy
A look at Architectural Issues and Components
Information Manufacturing and Metadata
"Real Time" or Active Data Warehouse ETL
Data Quality Concepts and Processes
Information Planning and Project Portfolio Management
Organizational Culture and Change
Tactical Recommendations
Strategic Recommendations
The Relentless March of Technology
Conclusions
Introduction
Business Intelligence is many things to many people, and depending on whom you ask you
can get very different perspectives. The database engineer will talk with pride about the
number of terabytes of data or the number of queries serviced per month. The ETL staff will
talk about the efficiency of loading many gigabytes of data and regularly bettering load time
Service Level Agreements. The system administrators will speak proudly of system uptime
percentage and number of concurrent users supported by the system. Business Intelligence
managers see modeled databases organized around subject areas accessible through one or
more access tools, supporting query needs for a large number of active users.
In reality, these are necessary, but not sufficient, for Business Intelligence success. Those of
us who have worked in Business Intelligence long enough realize that its mission is not to implement
technology, but to drive the business. Technical implementation problems can take an
otherwise good BI strategy and turn it into a failure, but technical implementation perfection
cannot take a poorly conceived strategy and turn it into a success. In this book, we will start
by providing the business context and framework for understanding the role of BI and what it
will take to make it a business success. We will then look at how the BI strategy will drive the
assembly of architectural components into a BI infrastructure, and how the processes that
populate and manage this environment need to be structured.
As with all other investments that an organization makes, the expected results of investments
in data warehousing and business intelligence technology should be a direct contribution to the
bottom line, with a rate of return that meets or exceeds that of any other investment the
organization could make with that same funding. Unfortunately, with information, the
connection between investment and dollar results is often lost. Benefits are often regarded as
intangible or “fuzzy”.
The framework that I will present will assist those who are making strategy, architecture, and
investment decisions related to business intelligence to be able to make this connection. To do
this requires certain insights and certain tools. The tools we will be using here are models,
simplified views of organizations and processes that allow you to identify all of the ‘moving
parts’ and how they fit together. We will keep this at a very high level. We will be focusing
on the overall ecosystem of the forest rather than on the characteristics of the individual trees.
Understanding the big picture and the interrelationships between the major components is
critical to being able to avoid partial solutions that ignore key steps in the process. A bridge
that goes 90% of the way across a river gives you minimal value, since the process it needs to
support is enabling transport completely across a river. The key is to understand the context
when you are defining your business intelligence and data warehousing environment, and
design for that context.
Note that in many cases, not even your business partners really understand the context of
business intelligence. You can ask five different people, and they will give you five different
views of why the data warehouse exists, how it is used, and how it should be structured. This
is not reflective of the fact that information usage is random. It is reflective of the fact that
these individuals have different roles in systematized information-based processes, with
dramatically different needs. To understand business intelligence, we must delve into the roles
that these individuals play in the overall process, and look at how each interacts with
information in his own way. Hence, the information environment cannot be designed to be a
‘one-size-fits-all’ information source, but rather must conform to the diverse needs of different
individuals and facilitate their roles and activities.
Background
As Euclid did when he constructed his framework for geometry, we will build on some
fundamental premises. Let’s start with the very basics:
Data warehousing and business intelligence technologies are enablers. Putting data into
normalized tables that are accessible using some tool does not in and of itself create any value.
It must be accessed and utilized by the business community in order to create value. They
must not only use it, but use it effectively and successfully. This is similar to any other tool.
Phones only produce value when a person is able to successfully contact and communicate
with another intended person. Televisions only produce value when the content that is
delivered provides entertainment or useful information. Automobiles only produce value
when individuals and/or items are transported from one location to an intended second
location. The value is in the results, and not in the entity itself.
The deliverables or activities of an individual taken in isolation would create no more value
than a solitary brick outside of the context of a brick wall. In the context of an overarching
process, a series of data extraction/manipulation activities, analyses, and decisions together
have purpose and meaning, and can ultimately impact how the business operates and generate
incremental long-term value for the enterprise. Processes are unique to individual businesses,
and their efficiency and effectiveness are important determinants of the overall organizational
success. The complete set of business processes defines an organization, and is reflective of
its underlying character and culture.
Data Warehousing and Business Intelligence technology by itself does not produce business value. Business
information users produce value, with the technology as a tool and enabler that facilitates this.
People do not produce value in isolation - overarching information processes are the vehicles through which
their activities and deliverables find meaning and context and ultimately create value.
How an organization operates is based upon a spider web of dependencies, some of which are
almost chicken/egg types of recursive causality. Business intelligence is just one of these
interdependent pieces. As a result, even business intelligence processes must be viewed in
context of the broader organization, and can only be changed and enhanced to the extent that
the connection points that join this with other processes can be changed.
Information culture is a major determining factor as to the manner in which processes evolve.
A conservative culture is more prone to stepwise improvements, applying technology and
automation to try to do things in a similar fashion but somewhat faster and better. A dynamic
culture is more prone to adopt new paradigms and reengineer processes to truly take advantage
of the latest technologies. Process-focused cultures, where methodologies such
as Six Sigma are promoted and engrained into the mindsets of the employees, are more likely
to understand and appreciate the bigger picture of information processes and be more inclined
to use that paradigm for analyzing and improving their BI deployments.
Other factors related to cultural paradigms include governance and decision-making
paradigms, which will direct how people will need to work together and interact with
information. Even cultural issues such as how employees are evaluated and rewarded will
impact how much risk an employee is willing to take.
Operational paradigms of the organization relate to how it manages itself internally, plus how
it interfaces with suppliers, partners, and customers. What types of channels are used? What
processes are automated versus manual? What processes are real-time versus batch? While
these issues may not impact an organization’s intentions or interests relative to BI deployment,
they will impact the connection between BI and decision deployment points, and will impact
the breadth and effectiveness of potential decisioning applications.
As with any other systematized series of interactions, information processes have a tendency
to reach a stable equilibrium over time. This is not necessarily an optimal state, but a state in
which the forces pushing towards change are insufficient to overcome the process’s inherent
inertia. Forces of change may come from two different sources – organizational needs for
change to improve effectiveness or achieve new organizational goals, and a change in
underlying tools and infrastructure which enables new process enhancements and efficiencies.
Processes are designed and/or evolve in the context of organizational/operational paradigms, standards, and
culture, and adapt to whatever underlying infrastructure of tools and technology is in place.
The ability to control and manage process change is critical for an enterprise to be able to thrive in
a constantly changing competitive environment, and is a key determinant of success in the
development and deployment of Business Intelligence initiatives.
In this book we will look together at the big picture of business intelligence and how it fits in
the context of the overall enterprise. We will do this by focusing on the underpinnings of
business intelligence: people, processes, and paradigms.
Modeling the Enterprise
In our quest to define the context for Business Intelligence, we need to start at the top. The
first thing we will do is come up with an extremely simplified model of the enterprise. If you
reduce the enterprise to its essence, you wind up with three flows: funds, product, and
information. These flows are embodied in the three levels of this diagram:

[Diagram: three stacked layers. Marketing and distribution activities deliver products,
services, and experiences to customers; development and production activities sit alongside
them; financial control processes carry the flow of funds; and information processes, the flow
of data, form the foundation beneath all of them.]
Flow of funds relates to the collection and disbursement of cash. Cash flows out to purchase
resources, infrastructure, and raw materials to support production and distribution, and flows in
as customers pay for products and services. Processes in this category include payables and
receivables, payroll, and activities to acquire and secure funding for operations.
Development and production activities physically acquire and transport raw materials, and
assemble them into the finished product that is distributed to customers. For a financial
institution, it would consist of acquiring the funding for credit products and supporting the
infrastructure that executes fulfillment, transaction processing, and repayment.
Marketing and distribution activities consist of all activities needed to get product into the
hands of customers. It includes the identification of potential customers, the packaging of the
value proposition, the dissemination of information, and the delivery of the product to the
customer. For credit cards, it includes everything from product definition and pricing, to direct
mail to acquire customers, to assessing credit applications, to delivering the physical plastic.
In addition, it includes any post-sales activities needed to support the product and maintain
customer relationships.
Shown on the bottom, since these are the foundation for all other processes, are information
processes. These processes represent the capture, manipulation, transport, and usage of all data
throughout the enterprise, whether through computers or on paper. This data supports all other
activities, directing workflow and decisions, and enables all types of inward and outward
communication, including mandatory financial and regulatory reporting.
Of course, in the enterprise there are not three distinct parallel threads – information, financial,
and production/marketing processes are generally tightly integrated into complete business
processes. For example, in the business process of completing a sale, there is a flow of funds
component, flow of product component, and flow of information component, all working
together to achieve a business objective.
Our focus here will be on information processes. We will look at how they interact with other
processes, how they generate value, and how they are structured. We will start at the highest
level, where information processes are subdivided into two broad categories, operational
information processes and analytical information processes.
Operational Information Process Model
In its essential form, the operational information process can be modeled as follows:

[Diagram: Operational Information Processes. Operational data (entities/status and events) is
interpreted through business rules to drive three classes of processes: foundational processes
(product development/pricing, capacity/infrastructure planning, marketing strategy planning),
reactive processes (dynamic pricing/discounts, customer service support, collections
decisioning), and proactive processes (sales/customer management, production/inventory
management).]
Let’s break this out section by section. An organization is essentially a collection of
operational business processes that control the flow of information and product. What I
consider to be the intellectual capital of the enterprise, the distinguishing factor that separates it
from its competitors and drives its long-term success, is the set of business rules under which it
operates. Business rules drive its response to data, and dynamically control its workflows
based on data contents and changes. Business rules may be physically embodied in computer
code, procedures manuals, business rules repositories, or a person’s head. Wherever they are
located, they are applied either automatically or manually to interpret data and drive action.
The life-blood of any operational process is, of course, the data. I have broken this out into
two distinct data categories. The first describes all entities that the enterprise interacts with.
This could be their products, suppliers, customers, employees, or contracts. Included in this
description is the status of the entity relative to its relationship with the enterprise.
When the status of an entity changes, or an interaction occurs between the entity and the
enterprise, this constitutes an event. Events are significant because the enterprise must respond
to each event that occurs. Note that a response does not necessarily imply action – a response
could be intentional inaction. However, each time an event occurs, the enterprise must capture
it, interpret it, and determine what to do in a timeframe that is meaningful for that event.
There are certain organizational processes that must execute on a regular basis, being driven by
timing and need. These are the foundational processes. Included in these are the development
of strategy, the planning of new products and services, and the planning of capacities and
infrastructure. These processes keep the organization running.
In the operational information process model, there are two distinct scenarios for responding to
events. The first consists of what I refer to as reactive processes. A reactive process is when
the event itself calls for a response. It can be as simple as a purchase transaction, where money
is exchanged for a product or service. A more complex example from the financial services
industry could be when a credit card customer calls customer service and requests that his
interest rate be lowered to a certain level. The enterprise must have a process for making the
appropriate decision: whether to maintain the customer interest rate, lower it to what the
customer requests, or reduce it to some intermediate level.
Whatever decision is made, it will have long-term profitability implications. By reducing the
interest rate, total revenue for that customer is reduced, thereby lowering the profitability and
net present value of that customer relationship. However, by not lowering the rate, the
enterprise is risking the total loss of that customer to competitors. By leveraging profitability
and behavioral information in conjunction with optimized business rules, a decision will be
made that hopefully maximizes the expected outcome of the event-response transaction.
The second type of event-response process is what I call a proactive process. The distinction
between proactive and reactive processes is the nature of the triggering event. In a proactive
process, the event being responded to does not necessarily have any profound significance in
and of itself. However, through modeling and analysis it has been statistically identified as an
event precursor, which heralds the probable occurrence of a future event. Identifying that
event precursor gives the enterprise the opportunity to either take advantage of a positive future
event or to mitigate the impact of a negative future event.
For example, a credit card behavioral model has identified a change in a customer’s behavior
that indicates a significant probability of a future delinquency and charge-off. With this
knowledge, the company can take pre-emptive action to reduce its exposure to loss. It could
contact the customer to discuss the behaviors, apply an automatic credit line decrease, or
put the customer into a higher interest rate category. The action selected
would hopefully result in the least negative future outcome.
Note that without business rules that identify these events as event precursors, no response is
possible. In addition, other factors are involved in determining the effectiveness of the event-
response transaction. The first is latency time. A customer about to charge-off his account
may be inclined to run up the balance, knowing it will not be paid back anyway. Therefore,
the faster the response, the better the outcome will be for the company. Enterprise agility and
the ability to rapidly identify and respond to events are critical success factors.
Another factor that plays a huge role in the effectiveness of an event-response is data quality.
The business rules set data thresholds for event categorization and response calculation. The
nature or magnitude of the data quality problem may be sufficient to:
• Cause the precursor event to go undetected and hence unresponded to
• Change the magnitude of the event-response to one that is sub-optimal
• Cause an action to occur which is different from what is called for
This will result in a reduction in profitability and long-term value for the organization. Small
variations may not be sufficient to change the outcome. We will later discuss process
sensitivity to data quality variations and how to assess and mitigate this.
Operational information processes are implemented primarily using application software that
collects, processes, and stores information. This may be supplemented by business rules
repositories that facilitate the storage and maintenance of business rules. In certain event-
response processes, BI tools may also be utilized. This would be in the context of collecting
and presenting information from low-latency data stores, either by looking at a full data set or
isolating exceptions. This information is presented to analysts, who assimilate the information
from these reports and apply some sort of business rules, whether documented or intuitive.
This supports tactical management functions such as short term optimization of cash flows,
staffing, and production.
Our biggest focus will be on business rules. The validity of the business rules has a direct
impact on the appropriateness and business value of event-responses. In many cases, business
rules interact and not only need to be optimized as stand-alone entities, but also within the
context of all of the other business rules. This leads to the fundamental assertion:
Given a primary organizational metric and a defined set of environmental constraints, there is
a single set of business rules that maximizes organizational performance relative to that
metric. This optimal set changes as the environment changes.

In other words, you could theoretically ‘solve’ this as a constrained maximization problem.
You pick a single critical organizational metric, such as shareholder value. You identify all
constraints related to resource costing, customer behavior, competitive environment, funding
sources, etc. What this states is that there is a single combination of business rules that will
achieve the maximum value for that metric. There are several corollaries to this:
• Because of cross-impacts of business rules, you cannot achieve the optimal set by
optimizing each rule in isolation. Optimizing one rule may sub-optimize another.
• As you increase the number of business rules assessed simultaneously, the
complexity increases geometrically, becoming unwieldy very rapidly.
The unfortunate conclusion is that achieving exact optimality is a virtual impossibility,
although wisely applied analytics can get you close. Part of the art of analytics is
understanding which rules have sufficient cross-impacts that it makes sense to evaluate them
together, and which can be approximated as being independent to simplify the math. These
trade-offs are what make the human, judgmental aspects of analytics so important.
Of course, even if you were to somehow identify the optimum combination of business rules,
your work would not be done. Because the environmental constraints are continuously
changing, the business rules that optimize organizational performance will also need to change
accordingly. Optimization is a continuous process, not an event.
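To make the cross-impact corollary concrete, here is a minimal sketch in Python; the profit surface, the candidate rule values, and every number in it are invented purely for illustration:

```python
# Toy illustration of cross-impacting business rules: an interest-rate rule and a
# credit-line rule whose interaction term prevents tuning them in isolation.
from itertools import product

RATES = [0.10, 0.14, 0.18, 0.22]      # candidate interest-rate rules
LINE_MULTS = [1.0, 1.5, 2.0, 2.5]     # candidate credit-line multipliers

def profit(rate, line_mult):
    """Invented profit surface: each rule helps on its own, but charge-off
    losses explode when high rates are combined with large credit lines."""
    revenue = 1000 * rate + 300 * line_mult
    interaction_loss = 3000 * (rate * line_mult) ** 2
    return revenue - interaction_loss

# Optimizing each rule in isolation, holding the other at a default value...
best_rate = max(RATES, key=lambda r: profit(r, 1.0))          # -> 0.18
best_mult = max(LINE_MULTS, key=lambda m: profit(0.10, m))    # -> 2.5
print("independent:", (best_rate, best_mult), profit(best_rate, best_mult))  # 322.5

# ...lands on a worse combination than evaluating the rules jointly.
joint = max(product(RATES, LINE_MULTS), key=lambda rm: profit(*rm))
print("joint:      ", joint, profit(*joint))                  # (0.10, 2.5) 662.5
```

The independent choices are each locally sensible, yet their combination is dominated by the jointly selected pair, which is exactly why rules with strong cross-impacts must be evaluated together.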
Proactive Event-Response Process Model
Because most reactive information processes are handled by production applications and are
therefore not as interesting from a BI perspective, I would like to spend a little additional time
discussing proactive event-response processes. These are often referred to by industry thought
leaders as real-time information processes. Unfortunately, the perception most people have of
real-time misses the real point. Most people think of real-time in technological terms,
assuming it is synonymous with immediate access to events and data changes as they occur.
They associate it with a series of architectural components:
• Messaging allows events to be captured from applications/processes as they occur.
• Immediate access to live data allows quick decisions to be made and actions to be taken.
• Continuously operating engines for event capture, analysis, and response ensure quick turnaround.
However, the true essence of real-time, from a purely business perspective, is very different:

Real-time refers to the ability to respond to an event, change, or need in a timeframe that
optimizes business value.
Using this definition, real time often involves but no longer necessitates instantaneous
responses, nor is the focus around a specific technology set. Real-time now can be looked at in
purely business terms. Since we are now talking about optimizing business value, the
underlying issue becomes the maximization of net profitability, which is driven by its cost and
revenue components:
• The costs of integrating the information sources needed to make an optimal response
to the event, which are dependent on the underlying infrastructure, application
software, and architecture.
• The revenue produced through the implementation of that response, which is
dependent on the nature and distribution frequency of different possible event-
response outcomes.
Both costs and revenues are fairly complex. To facilitate analyzing and optimizing proactive
information processes, I have come up with some simple models. First, let’s break the
proactive event-response process out into a series of basic steps:
[Timeline: the event-response window opens when the precursor (trigger) event takes place and
closes at the predicted future event; its duration is the probabilistic lag between the two.
Within the window, the steps are: trigger event is detected and recorded; event is determined
to be significant; context is assembled for analysis; future event is predicted and required
action is determined; action is initiated; results of action are manifested.]
As you can see from this timeline, the event-response window begins with the occurrence of a
trigger event, which has been identified as a precursor to a future, predicted event. The event-
response window closes at the time that the future event is predicted to occur, since at that
point, you can no longer take any action that can impact the event or its results. Let us look
individually at these process components.
Trigger event is detected and recorded:
After a trigger event occurs, data describing this event must be generated and stored
somewhere. Event detection is when the knowledge that this trigger event has occurred
is available outside of the context of the operational application that captured it and
becomes commonly available knowledge. This may happen because an event record is
placed on a common data backbone or bus, or it may happen because an output record
from that application is ultimately written out to an operational data store for common
usage. In some cases, significant processing must be done to actually detect an event.
Because of limitations in the source systems for the data needed, it is possible that deltas
will have to be computed (differences in data value between two points in time) to
actually detect an event.
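As a minimal illustration of the delta computation just described, the following sketch compares two hypothetical daily balance snapshots and flags large single-day increases as candidate trigger events; the data layout, field names, and threshold are all assumptions:

```python
# Hypothetical daily snapshots: account_id -> end-of-day balance.
yesterday = {"A1": 4_200.00, "A2": 15_000.00, "A3": 800.00}
today     = {"A1": 54_700.00, "A2": 15_250.00, "A3": 790.00, "A4": 2_000.00}

TRIGGER_THRESHOLD = 25_000.00  # single-day balance increase worth examining

def detect_triggers(prev, curr, threshold):
    """Compute day-over-day deltas and return candidate trigger events."""
    events = []
    for account, balance in curr.items():
        delta = balance - prev.get(account, 0.0)  # new accounts: full balance is the delta
        if delta >= threshold:
            events.append({"account": account, "delta": delta})
    return events

print(detect_triggers(yesterday, today, TRIGGER_THRESHOLD))
# -> [{'account': 'A1', 'delta': 50500.0}]
```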
Event is determined to be significant:
Out of all the events for which a record is captured, only a small number of them will
actually have significance in terms of foretelling future events. A mechanism must be
in place to do preliminary filtering of these events, so that just the small subset of events
with the highest probability of having meaningful consequences is kept. Note that at
this stage, without any contextual information, it is difficult to ascertain significance of
an event with any accuracy, but at least a massive cutdown of the volume of events to
be further examined can occur.
Context is assembled for analysis:
While an individual event or piece of data by itself is not necessarily a reliable predictor
of an event, it does indicate a possibility that a certain scenario will exist that is a
precursor to that event. The scenario consists of the fact that that event occurred, plus a
complementary series of prior events and conditions that in total comprise a precursor
scenario. Once that single individual piece of the picture, the trigger event, is detected,
the data elements that comprise the remaining pieces must be pulled together for
complete evaluation within the context of a statistical model.
Future event is predicted and required action is determined:
After all data is assembled, it is run through the predictive model, generating probability
scores for one or more events. Depending on where these scores fall relative to
prescribed predictive thresholds, they will either be reflective of a non-predictive
scenario that does not require further action, or else will predict a future event and
prescribe an appropriate action to influence the future outcome.
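A minimal sketch of this scoring step follows; the scoring function, the input flags, and both thresholds are invented stand-ins for a real predictive model and its prescribed cutoffs:

```python
# Invented stand-in for a trained model: returns a probability that the
# predicted event (say, attrition or charge-off) occurs within the window.
def score(context):
    base = 0.05
    base += 0.40 if context["balance_runup"] else 0.0
    base += 0.30 if context["missed_payment"] else 0.0
    return min(base, 1.0)

ACTION_THRESHOLD = 0.60   # above this, prescribe an action
REVIEW_THRESHOLD = 0.35   # between the two, route for manual review

def decide(context):
    p = score(context)
    if p >= ACTION_THRESHOLD:
        return ("reduce_credit_line", p)   # prescribed pre-emptive action
    if p >= REVIEW_THRESHOLD:
        return ("manual_review", p)
    return ("no_action", p)                # non-predictive scenario

print(decide({"balance_runup": True, "missed_payment": True}))   # ('reduce_credit_line', 0.75)
print(decide({"balance_runup": False, "missed_payment": False})) # ('no_action', 0.05)
```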
Action is initiated:
All actions must be initiated in the operational world, via an appropriate system and/or
channel. Actions may include pricing updates, inventory orders, customer contacts, or
production adjustments. Actions may either be implemented:
• Manually - a list is generated and a human must act upon that list in order for any
action to take place.
• Automatically - data is transmitted to appropriate systems via automated interfaces,
with control reports for human validation. A person must intervene for an action not
to take place.
Results of action are manifested:
After an action is initiated, there will be some lag up until the time that it is actually
manifested. Actions are manifested when there is an interaction between the enterprise
and the person or entity being acted upon. For example, if the predicted event is the
customer’s need for a specific product and the action is to make an offer to a customer
to try to cross-sell a new product, the action manifestation is when the person receives
the offer in the mail and handles that piece of mail. If the predicted event is running out
of inventory and the action is to place an order for additional inventory, the action
manifestation is when the additional inventory is actually delivered.
As with any other process, the designer of the process has numerous decision points. Each
individual step in the process has a specific duration. This duration may be adjusted based on
process design, what types of software and infrastructure are involved, how much computing
resource is available, what type of staffing is assigned, etc. By understanding that the critical
time to consider is the full process cycle from trigger event occurrence to action manifestation,
and not just event detection, it is then apparent that trade-offs can be made as you allocate your
investment across the various response-process components. A certain level of investment
may cut event detection time by a few hours, but the same investment may accelerate action
initiation by a day or action manifestation by 2 days.
Note that while your event-response process is probably fairly predictable and should complete
in a specified amount of time with a fairly small variance, there is probably a much wider
variance in the size of the event-response window:

[Figure: a frequency distribution of event occurrence over time relative to the precursor
event, with the action manifestation point ahead of the mean event occurrence. When predicting
an event, a certain percentage of the time it will not actually happen; the remaining time, it
will occur according to a certain probability distribution. The action manifestation must occur
prior to the predicted event, with sufficient lead time to allow for a change in behavior.]

There should be just the right amount of time between action manifestation and mean event
occurrence. The lead time must be sufficient to provide adequate response time for a behavioral
change to occur. However, if action manifestation occurs too soon, you may wind up sending a
message before it actually has relevance for the recipient, thus reducing its impact, or you risk
spending money unnecessarily on compressing the process. To summarize the relationship
between action manifestation and predicted event occurrence:
• You gain benefit when your action manifests itself with enough lead time relative to
the predicted event to have the intended impact.
- For customer management processes, it also requires an appropriate and
receptive customer in order for value to be generated. Actions that do not
produce a response generate no value.
- Revenue reductions occur when costly offers are accepted by inappropriate
customers, thereby costing money without generating a return. If a credit
card company reduces the interest rate on an unprofitable customer to avert
probable attrition, this constitutes an action on an inappropriate customer.
They not only lose by keeping an unprofitable customer, they compound their
losses by further reducing interest revenue.
• Net gains must provide an appropriate return relative to development, infrastructure,
and operational expenses.
Environmental and process-related factors that will determine how effective your event-
response processes are and how much value they generate include:
• Operational effectiveness will determine how efficiently you can detect and respond
to events.
– Rapid and accurate execution of your operational processes
– High quality data being input into the process
– Efficient data interfaces and transfer mechanisms
• Quality of Analytics will determine the effectiveness of the business rules used to
drive your operational processes and how optimal your responses are.
– Accuracy of prediction: maximizing the probability that the condition you are
predicting actually occurs, thereby reducing “false positives” where you take
expensive action that is not needed.
– Accuracy of timing: narrowing the variance of the timing of the predicted
event, so that the action occurs with sufficient lead time to allow behavior
change to take place, but not so far in advance as to be irrelevant and
ineffective.
Because of the tradeoffs that need to be made, there is more involved in the model
development process than just producing a single deliverable. A wide range of predictive
models could be developed for the same usage, with varying input data and data latency, and
whose outputs have different statistical characteristics (accuracy of prediction and accuracy of
timing). Implementation and operational costs will vary for these. Optimization requires an
iterative development process, which generates and analyzes potential alternatives:
• Utilize statistical modeling to analyze a series of potential data input scenarios,
comparing the predictive precision of each scenario.
• Derive cost curve by looking at development/operational expense associated with
each scenario.
• Depending on the predictive accuracy of the models and on the timing relationship
between the original precursor event and the predicted event, the success of the
action will vary. Utilize this information for varying times to derive benefit curve.
You will find some general characteristics in your curves. In general, the further you try to
reduce latency and decrease response lag, the higher the cost. More data elements from more
diverse sources can also drive increased costs. Some sources are more expensive than others,
and this needs to be considered. At the same time, benefits will vary according to the
statistical predictive effectiveness of different model scenarios. Benefit also decreases based
on response lag, approaching zero as you near the predicted mean event time. The goal is to
identify the point where net return is maximized.
Graphically, it looks something like this:

[Graph: dollar cost and benefit curves plotted against decreasing data breadth, currency, and
action lead time, with the point where net return is maximized marked where the gap between
benefit and cost is widest.]
Essentially, what this says is that the most robust and elaborate solution may not be the one
that is most cost effective, and that the most important thing is to match the solution to the
dynamics of the decision process being supported.
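To make the trade-off concrete, here is a minimal sketch that evaluates invented cost and benefit curves over a handful of candidate response latencies and picks the one with the highest net return; the curve shapes follow the description above, but all constants are assumptions:

```python
import math

MEAN_EVENT_LAG_HRS = 120          # predicted event roughly 5 days after the trigger
candidate_latencies = [2, 6, 12, 24, 48, 72]   # hours from trigger to action manifestation

def cost(latency_hrs):
    """Invented cost curve: compressing the response process gets expensive fast."""
    return 500_000 / latency_hrs

def benefit(latency_hrs):
    """Invented benefit curve: value decays as manifestation nears the event time."""
    remaining = max(MEAN_EVENT_LAG_HRS - latency_hrs, 0)
    return 90_000 * (1 - math.exp(-remaining / 48))

scenarios = [(lat, benefit(lat) - cost(lat)) for lat in candidate_latencies]
best = max(scenarios, key=lambda s: s[1])
for lat, net in scenarios:
    print(f"latency {lat:3d}h  net return {net:12,.0f}")
print("best latency:", best[0], "hours")   # -> 48, not the fastest option
```

Note that the winning scenario is an interior point: the fastest response loses money to integration cost, and the slowest loses benefit to decay, which is the whole argument for matching the solution to the decision process.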
Some interesting examples of proactive event response processes come from the financial
services industry. One such example is trying to capture funds from customer windfalls. If a
customer receives a windfall payment, it will generally be deposited in his checking account.
It will sit there for a brief period of time, after which the customer will remove it to either
spend or invest. If the financial institution can detect that initial deposit, it is possible that they
could cross-sell a brokerage account, mutual fund, or other type of investment to this person.
The process will start out by looking for deposits over a specific threshold. This can be done
either by sifting through deposit records, or possibly by looking for a daily delta which shows a
large increase in balance. Once these are identified, context has to be collected for analysis.
This context could include the remainder of the banking relationship, the normal variance in
account balances, and some demographic information. Predictive modeling has indicated that
if the customer has a low normal variance (high normal variance means that he often makes
large deposits as part of his normal transaction patterns), does not already have an investment
account, has an income of between 30k and 70k, and has low to moderate non-mortgage debt,
he is probably a good prospect. A referral would then be sent to a sales representative from the
investment company, who would then contact him to try to secure his business.
Since modeling indicated that the money would probably be there five days before it is moved,
a response process that gets a referral out the next day and results in a customer contact by the
second day would probably have a high enough lead time. Therefore, identifying these large
deposits and identifying prospects for referrals in an overnight batch process is sufficiently
quick turnaround for this process.
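A minimal sketch of that overnight referral filter might look like the following; every field name, threshold, and the variance test itself are illustrative assumptions rather than the institution's actual rules:

```python
LARGE_DEPOSIT = 25_000  # illustrative windfall threshold

def is_referral_prospect(cust):
    """Apply the illustrative windfall rules from the example above."""
    return (
        cust["daily_delta"] >= LARGE_DEPOSIT          # large deposit detected overnight
        and cust["balance_stddev"] < 5_000            # low normal variance in balances
        and not cust["has_investment_account"]        # not already an investment customer
        and 30_000 <= cust["income"] <= 70_000        # target income band
        and cust["non_mortgage_debt"] < 15_000        # low to moderate debt
    )

customers = [
    {"id": "C1", "daily_delta": 60_000, "balance_stddev": 1_200,
     "has_investment_account": False, "income": 55_000, "non_mortgage_debt": 8_000},
    {"id": "C2", "daily_delta": 40_000, "balance_stddev": 22_000,   # routinely makes
     "has_investment_account": False, "income": 45_000, "non_mortgage_debt": 5_000},  # large deposits
]

referrals = [c["id"] for c in customers if is_referral_prospect(c)]
print(referrals)  # ['C1'] -- C2 fails the normal-variance test
```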
Another example shows that real-time processes do not necessarily need anything close to real-
time data. This relates to cross-selling to customers who make a service call. After handling
the service call, satisfied customers are given an offer to buy a new product. The way it works
is that on a regular basis, large batch jobs compute the probable products needed (if any) for
the customer base. These are kept in a database. When the customer calls, the database is
checked to see if there is any recommended product to be sold to that customer. If so, the
customer is verified to make sure there are no new derogatories on his record (missed
payments), and he has not already purchased that product. If neither of those are true, the
customer receives the offer. Results are subsequently tracked for the purpose of fine-tuning
the process.
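Here is a minimal sketch of that lookup-and-verify flow, with hypothetical dictionaries standing in for the batch-scored recommendation database and the operational checks:

```python
# Batch job output, recomputed regularly: customer -> recommended product, if any.
recommended = {"C1": "gold_card", "C2": "balance_protection"}

# Operational facts checked at call time.
new_derogatories = {"C2"}      # missed payments recorded since the batch ran
already_purchased = {("C2", "balance_protection")}

def offer_for(customer_id):
    """Return the offer to present after a satisfied service call, or None."""
    product = recommended.get(customer_id)
    if product is None:
        return None                                   # no recommendation on file
    if customer_id in new_derogatories:               # verify: no new missed payments
        return None
    if (customer_id, product) in already_purchased:   # verify: not already bought
        return None
    return product

print(offer_for("C1"))  # gold_card -- clean record, offer is presented
print(offer_for("C2"))  # None -- new derogatory since the batch ran
print(offer_for("C3"))  # None -- no recommendation on file
```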
There are, however, processes that are totally appropriate for real-time analysis, or at minimum
a combination of real-time with batch analysis. If somebody is on your web site, the key is to
get that person directed in real time to the offer he is most likely to want. This may be an
upsell, a cross-sell, or just a sale to somebody who comes in to “browse”. Real time analysis
would be very expensive, requiring that the person’s purchase, offer, and web session history
be available in an “in memory” database to analyze. A more cost effective model might be to
do batch analysis on a regular basis to determine a person’s current “state”, which is a
categorization that is computed based on all his prior history. The combination of this current
state with the recent events (what sequence of pages got the person to where he is, what is in
his cart, how was he directed to the site, etc) would then need to be analyzed to determine what
should be offered. This substantially reduces the data manipulation that must be done while
somebody is waiting for the next page to be served up.
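A minimal sketch of this hybrid approach follows; the precomputed states, session signals, and decision rules are all invented for illustration:

```python
# Precomputed nightly: each visitor's behavioral "state" from full history.
batch_state = {"V1": "price_sensitive_browser", "V2": "loyal_high_spender"}

def next_offer(visitor_id, session):
    """Combine the batch-computed state with lightweight live session events."""
    state = batch_state.get(visitor_id, "unknown")
    if state == "loyal_high_spender" and session["cart_value"] > 100:
        return "premium_upsell"
    if state == "price_sensitive_browser" and session["pages_viewed"] >= 5:
        return "discount_coupon"          # nudge a hesitant browser
    if session["referrer"] == "search_ad":
        return "landing_promotion"
    return "default_banner"

# Only the small, recent session payload is analyzed while the page waits.
print(next_offer("V2", {"cart_value": 240, "pages_viewed": 3, "referrer": "direct"}))
# -> 'premium_upsell'
```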
There is no shortage of architectural approaches to any given problem – the key will be
balancing operational effectiveness, business effectiveness, implementation cost, and
implementation time.
Analytical Information Process Model
The question then is, how do you identify this optimal set of business rules? Historically, this
has been done through intuition and anecdotal evidence. Today, staying ahead of competitors
requires that you leverage quantitative data analysis utilizing business intelligence technologies
to achieve the next level. This analysis and technology is incorporated into Analytical
Information Processes, which I define as follows:

Analytical Information Processes are iterative, closed-loop, collaborative workflows that
leverage knowledge to produce new and updated business rules. These processes consist of a
prescribed series of interrelated data manipulation and interpretation activities performed by
different participants in a logical sequence.
These processes are focused around understanding patterns, trends, and meaning imbedded
within the data. Even more importantly, they are oriented towards utilizing this understanding
as the basis for action, which in this context is the generation of new and enhanced business
rules. Viewed from the perspective of Operational Information Processes, they would look
like this at a high level:
[Diagram: Analytical Information Processes, focused around the optimization of business rules,
draw on Analytical Information Repositories that are fed from operational data (entities/status
and events). The business rules they produce feed back into the Operational Information
Processes: foundational (product development, capacity/infrastructure planning, marketing
strategy planning), reactive (dynamic pricing/discounts, customer service support, collections
decisioning), and proactive (sales/customer management, production/inventory management).]
As you can see from the prior diagram, the inputs for the Analytical Information Processes are
data stored in Analytical Information Repositories. These are distinct from the operational
databases in two ways:
• They provide sufficient history to be able to pick a point in time in your data, and
have enough history going backward from there to be able to discern meaningful
patterns and enough data going forward from there to allow for outcomes to be
determined (see the sketch after this list).
• They are optimized for the retrieval and integration of data into Information End
Products, which I define as the facts and context needed to make decisions, initiate
actions, or determine the next step in the process workflow.
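To illustrate the first property, here is a minimal sketch, with invented monthly data, of picking a point in time, building features from the history behind it, and labeling the outcome from the data ahead of it:

```python
# Invented monthly balance history per account, oldest to newest.
history = {
    "A1": [900, 950, 980, 400, 150, 0],      # balances drain away after month 3
    "A2": [500, 520, 560, 580, 600, 640],
}

OBS_POINT = 3   # index of the chosen point in time (start of month 4)

def build_row(balances):
    """Features come from the history behind the observation point;
    the outcome label comes from the months ahead of it."""
    lookback = balances[:OBS_POINT]        # pattern-discernment window
    lookahead = balances[OBS_POINT:]       # outcome-determination window
    features = {
        "latest_balance": lookback[-1],
        "trend": lookback[-1] - lookback[0],
    }
    label = int(lookahead[-1] == 0)        # did the account empty out?
    return features, label

for acct, bal in history.items():
    print(acct, build_row(bal))
# A1 ({'latest_balance': 980, 'trend': 80}, 1)
# A2 ({'latest_balance': 560, 'trend': 60}, 0)
```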
The following diagram illustrates the true role of a BI team. Its goal is to create a system of
user-tool-data interactions that enable the creation, usage, and communication of information
end products to support effective execution of analytical information processes:
[Diagram: Users interact with each other to implement analytical information processes, the
activities that together will optimize business rules and generate improved profitability or
competitive advantage. Each user interacts with his data through a tool suite (query tools,
reporting tools, OLAP tools, analytical tools), which accesses data structures within a
multi-tiered information environment (extreme volume, low latency; high volume with quick
access for analytics; low volume with immediate access for real-time decisions).]
Part of the problem with assessing and truly understanding analytical information processes is
that these processes can be very complex, and often are ad-hoc and poorly documented.
Without a framework for simplifying, systematizing, and organizing these processes into
understandable components, they can be completely overwhelming. Faced with this
complexity, many project managers responsible for data warehouse requirements gathering
will generally just ignore the details of the business processes themselves, and focus on the
simplistic question ‘what data elements do you want?’ If you design a warehouse with the
focus on merely delivering pieces of data, and neglect to ascertain how it will be used, then
your result may be a system that is difficult, time consuming, or even impossible to use by its
intended users for its intended purpose.
Understanding the nature of information processes is therefore critical for success. If we look
closely at the type of processes that are performed that fall within the decision support realm,
we can actually notice some significant commonalities across processes. My assertion is that
virtually all analytical information processes can be decomposed into a common sequence of
sub-processes. These sub-processes have a specific set of inputs, outputs, and data
analysis/manipulation activities associated with them. This also implies that specific sub-
processes can be mapped to roles, which are performed by specific segments of the
information user community, and which require specific repository types and tools. The
Analytical Information Process Model decomposes projects into a sequence of five standard
components, or sub-processes:
1. Problem/Opportunity Identification
2. Drill-down to determine root causes
3. Identify/select behaviors & strategy for change
4. Implement strategy to induce changes
5. Measure behavioral changes/assess results
A detailed description of each sub-process is as follows:
Sub-process 1 – Problem/Opportunity Identification
In this process component, the goal is to achieve a high-level view of the organization.
The metrics here tend to be directional, allowing overall organizational health and
performance to be assessed. In many cases, leading indicators are used to predict
future performance. The organization is viewed across actionable dimensions that will
enable executives to identify and pinpoint potential problems or opportunities. The
executives will generally look for problem cells (intersections of dimensions) where
performance anomalies have occurred or where they can see possible opportunities.
These may be exceptions or statistical outliers, or could even be reasonable results that
are just intuitively unexpected or inconsistent with other factors.
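As a minimal illustration of flagging problem cells, the following sketch applies a simple two-standard-deviation outlier test to invented metric values by (segment, region) cell; real exception logic would be richer:

```python
import statistics

# Invented performance metric (e.g., a profitability index) by dimension cell.
cells = {
    ("gold", "east"): 104, ("gold", "west"): 98, ("gold", "south"): 101,
    ("basic", "east"): 97, ("basic", "west"): 55,   # <- the anomaly
    ("basic", "south"): 99,
}

values = list(cells.values())
mean, stdev = statistics.mean(values), statistics.stdev(values)

# Flag cells more than two standard deviations from the mean as candidates
# for drill-down -- a crude stand-in for the exception logic described above.
outliers = {cell: v for cell, v in cells.items() if abs(v - mean) > 2 * stdev}
print(outliers)  # {('basic', 'west'): 55}
```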
Sub-process 2 - Drill Down to Determine Root Causes
Here, analysts access more detailed information to determine the ‘why’s. This is done
by drilling into actionable components of the high level metrics at a granular level, and
examining the set of individuals comprising the populations identified in the cells
targeted for action. The end-product of this step is to discover one or more root causes
of the problems identified or opportunities for improvement, and to assess which of
these issues to address. For example, if we identify a profitability problem with holders
of a specific product, the drivers of profitability would be things like retention rates,
balances, channel usage, transaction volumes, fees/waivers, etc. By pulling together a
view of all the business drivers that contribute to a state of business, we can produce a
list of candidate business drivers that we could potentially manipulate to achieve our
desired results. Once we have collected the information on candidate business drivers,
the decision needs to be made of which to actually target. There are a number of
factors that need to be considered, including sensitivity (amount of change in the
business driver needed to effect a certain change in your performance metric), cost, and
risk factors. The output from this sub-process will be a target set of business drivers to
manipulate, a target population that they need to be manipulated for, and some high-
level hypotheses on how to do it.
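A minimal sketch of weighing candidate drivers on the factors just named might look like this; the scoring formula and every number are invented for illustration:

```python
# Invented candidate drivers for a product-profitability problem.
# sensitivity: metric improvement per unit of driver change
# cost: expense to achieve one unit of driver change
# risk: probability the manipulation backfires (used as a haircut)
candidates = [
    {"driver": "retention_rate", "sensitivity": 9.0, "cost": 4.0, "risk": 0.20},
    {"driver": "fee_waivers",    "sensitivity": 3.0, "cost": 1.0, "risk": 0.05},
    {"driver": "channel_mix",    "sensitivity": 6.0, "cost": 5.0, "risk": 0.30},
]

def priority(c):
    """Crude expected-return-per-cost score for ranking drivers."""
    return c["sensitivity"] * (1 - c["risk"]) / c["cost"]

for c in sorted(candidates, key=priority, reverse=True):
    print(f'{c["driver"]:15s} score={priority(c):.2f}')
# fee_waivers     score=2.85
# retention_rate  score=1.80
# channel_mix     score=0.84
```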
Sub-process 3 - Identify/Select Behaviors & Strategy for Change
This sub-process probes into the next level, which is to understand the actual set of
interacting behaviors that affect business drivers, and determine how to manipulate
those behaviors. For those who are familiar with theories of behavior, this is an
application of the ABC theory: antecedent => behavior => consequence. What this
means is that in order to initiate a behavior, it is first necessary to create antecedents,
or enablers of the behavior. This could include any type of communication or offers
relative to the desired behavior. To motivate the behavior, one must devise
consequences for performing and possibly for not performing the behavior. This could
include incentive pricing/punitive pricing, special rewards, upgrades, etc. Assessing
this requires complex models which predict behavioral responses, additionally taking
into account how certain actions performed on our customer base can have a series of
cascading impacts, affecting both the desired behavior and also potentially producing
other side effects. From an information perspective, this is by far the most complex and
least predictable task, and often requires deep drill into the data warehouse, sifting
through huge volumes of detailed behavioral information.
Sub-process 4 - Implement
Ability to implement is perhaps the most critical but least considered part of the entire
process. This criticality is due to the fact that the value of an action decreases as the
duration of time from the triggering event increases. This duration has two
components. The first is the analytical delay, which is the time it takes to determine
what action to take. The second is the execution delay, the time for the workflow that
deploys the required antecedents and/or consequences into the operational
environment.
Implementation is often a complex activity, requiring not only information from the
decision support environment (data marts, data warehouse, and operational data
stores), but also processes to transport this back into the operational environment.
Because time to market is a critical consideration in being able to gain business value
from these strategies, special mechanisms may need to be developed in advance to
facilitate a rapid deployment mode for these strategies. Generally, this is a very
collaborative effort, but is focused around the ability of the information management
and application systems programming staffs to be able to execute. There could be a
wide variation in this time, depending on what needs to be done. Changes to the
programming of a production system could take months. Generation of a direct mail
campaign could take weeks. Updating a table or rules repository entry may take hours
or minutes.
Sub-process 5 - Assess Direct Results of Actions
There are two key assessments that need to be made after a tactic is implemented. The
first is whether or not it actually produced the anticipated behaviors. Generally,
behaviors are tracked in the actual impacted population plus a similarly profiled but
unimpacted control group to determine the magnitude of the behavioral change that
occurred. In addition to understanding what happened (or did not happen), it is also
critical to understand why. There could have been problems with execution, data
quality, or the strategy itself that caused the results to differ from the expectations. The
output from this step is essentially the capture of organizational learnings, which
hopefully will be analyzed to allow the organization to do a better job in the future of
developing and implementing strategies.
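As a minimal illustration of the control-group comparison, the following sketch computes the behavioral change attributable to the tactic as the treated group's change minus the control group's change; the retention numbers are invented:

```python
# Invented retention rates before and after the tactic, for the impacted
# population and a similarly profiled but unimpacted control group.
treated = {"before": 0.880, "after": 0.915}
control = {"before": 0.882, "after": 0.890}

treated_change = treated["after"] - treated["before"]   # ~0.035
control_change = control["after"] - control["before"]   # ~0.008 (background drift)

# Difference-in-differences: the change attributable to the tactic itself.
lift = treated_change - control_change
print(f"attributable retention lift: {lift:.3f}")        # 0.027
```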
Because most business processes are cyclical, you end where you began, assessing the current
state of business to determine where you are relative to your goals.
To illustrate how this maps to specific activities, I have chosen the marketing function for a
financial institution. I have taken an illustrative set of measures and activities that occur and
shown how they map into the five sub-processes:

1. Business Performance Management: financial ratios; profitability; retention/attrition
rates; risk profile.
2. Drill-down to root causes/business drivers: product utilization and profitability; channel
utilization and profitability; customer relationship measurements; attriter/retained customer
profiling; profitable customer profiling; transaction breakdowns; fees paid vs. waived.
3. Assess/select behaviors to manipulate: statistical analysis; data mining; predictive model
development; what-if analysis; intuitive assessment of information.
4. Implement strategy to alter behaviors: direct mail; telemarketing; feedback to customer
contact points; changes in pricing, credit lines, and service levels; changes to customer
scores, segments, or tags.
5. Measure strategy effectiveness: measure new and closed account activity; measure changes
in balances; measure changes in transaction behavior; measure changes in attrition and
retention rates; measure collections rates.
Let’s look at an example of how operational and analytical information processes are
interrelated. My team was involved with a project that actually had both operational and
analytical components. Our task was to access the data warehouse for the retail portion of the
bank (demand deposits, CDs, car loans, mortgages, etc.), and pull together information on the
customer’s overall relationship. This consisted of overall profitability, breadth of relationship,
and derogatories (late payments). This information was to be ported over to the credit card
data warehouse platform, after which it would be funneled into two different directions. The
data would be shipped to the operational system used to support credit card customer service,
where it would be displayed on a screen that supports various operational information
processes. In addition, it would go into the credit card data warehouse, where it would be
accumulated over time in the analytical environment.
By moving the data into the data warehouse, it could be integrated with organizational metrics
and dimensions and used in the execution of analytical information processes. These processes
would be used to devise new or enhanced business rules, so that operational processes such as
credit line management, interest rate management, customer retention, and even cross-sells,
could leverage the additional information. These business rules could either be incorporated
directly into the customer service application (via scripting), or else could be incorporated into
procedures manuals and training. As you collect more historical data, your analytical
information processes will yield continuously improved business rules. This is because of two
factors: models would work better with a longer time series of information, and you
additionally have the benefit of the feedback loop as you assess the results of the application of
prior business rules and apply those learnings.
Understanding your Information
Not all information is created equal. Different types of information have different roles in
analytical information processes. Different roles mean that it flows through the environment
differently, is stored differently, and is used differently. At a high level, I use this taxonomy to
describe the different classes of information to capture and manage:
Performance Metrics: A small set of high-level measures, generally utilized by senior managers
to evaluate and diagnose organizational performance; in addition to current performance
indicators, leading indicators may be included to determine performance trend. Examples
include profitability, customer retention, risk-adjusted margins, ROA, ROI.

Organizational Dimensions: Standardized, actionable views of an organization which allow
managers to pinpoint the subsets of the customers or products where there might be performance
issues. Examples include business units, segments, profitability tiers, collection buckets,
geography.

Actionable Measures that Drive Performance: These represent the actual root causes of
performance, at a sufficiently low level that actions can be devised which can directly affect
them. Analysts can then make the determination of which of these measures should be targeted
for improvement in their strategies to impact the high level organizational performance across
the appropriate dimensions. Examples include interest income, transaction volume, transaction
fees, late fees, new account sales, balances, balance increases.

Behavioral Descriptors/Measures: Customer behaviors related to purchases, transactions, or
requests for service link back to the measures that drive performance. Strategy development
consists of deciding which of these behaviors to modify, and how to do it. As behaviors are
modified, assessment must be made of both intended and unintended consequences. Examples
include product usage, account purchases, channel access, transaction behavior, payments made.

Facts: Reflect the current or prior state of an entity or its activities/changes. This would
include purchases, balances, etc.

Context: The frame of reference used to evaluate facts for meaning and relevance. Includes
forecasts, trends, industry averages, etc.
Facts and context are descriptors that permeate all other information categories. Whether
describing metrics, business drivers, or behaviors, you would present facts about business
entities framed in context. Facts and context apply equally well whether looking at a single
cell of the organization or looking at the enterprise at the macro level.
An extremely important concept here is that of the information end-product. An information
end-product is the direct input into a decision or trigger for an action. An information end-
product may consist of metrics, business drivers, or behaviors. It will contain all needed facts
and relevant context. It will be presented in such a fashion as to be understandable and
facilitate analysis and interpretation.
It is sometimes not clear what actually constitutes an information end-product. If an analyst
gets three reports, pulls some data from each of the three reports, and retypes it into a
spreadsheet so that he can make sense of it and make a decision, the spreadsheet is the end-product. In a
less intuitive but equally valid example, if the analyst took those same three reports and
analyzed the data in his head, his mental integration process and resulting logical view of the
data would be the information end-product. More complex information end-products could
include a customer stratification by a behavioral score, monthly comparisons of actual metrics
with forecasts, and customer retention by product and segment. Note also that a physical
information end-product with respect to one activity/decision point may additionally be an
input in the process of preparing a downstream information end-product.
Like any other consumer product, information has value because it fulfills a need of the user
within the context of a process. It is up to the developers of this information to ensure that it
can effectively produce that value. There are several determinants of information effectiveness
that have significant impacts on the ability of an organization to utilize the information and
ultimately produce real business value. The five broad categories impacting value of
information are:
Accuracy
Accuracy generally refers to the degree to which information reflects reality. This is the
most fundamental and readily understood property of information. Either it is correct,
or it is not. For example, if you have a decimal point shifted on an account balance, or
an invalid product code, then you do not have accurate information.
Completeness
Completeness implies that all members of the specified population are included. Causes
of incomplete data include applications that are not sourced, processing or transmission
errors that cause records to be dropped, and data omitted from specific data records by the
source system.
Usability
Usability is a much less tangible, but much more prevalent problem. It pertains to the
appropriateness of the information for its intended purposes. There are many problems
with information that could negatively impact its usability. Poor information could be
introduced right at the collection point. For example, freeform address lines may make
it very difficult to use the address information for specific applications. Certain fields
may be used for multiple purposes, causing conflicts. There could be formatting
inconsistencies or coding inconsistencies introduced by the application systems. This is
especially common when similar information is directed into the warehouse from
multiple source systems. Usability problems could also arise from product codes not
defined to an appropriate level of granularity, or defined inconsistently across systems.
Usability is even impacted when data mismatches cause inconsistencies in your ability
to join tables.
Timeliness
Timeliness is the ability to make information available to its users as rapidly as
possible. This enables a business to respond as rapidly as possible to business events.
For example, knowing how transaction volumes and retention rates responded to
punitive pricing changes for excessive usage will allow you to change or abort if it is
not having the predicted effect. In addition, knowing as early as possible that a
customer has opened new accounts and is now highly profitable will enable that
customer to be upgraded to a higher service level ASAP. Timeliness of information is
achieved by effectively managing critical paths in the information delivery process.
Cost-effectiveness
As with any other expense of operation, it is critical that the cost of collecting,
processing, delivering, and using information be kept to a minimum. This is critical for
maintaining a high level of return on investment. This means being efficient, both in
ETL process operations and in process development and enhancement.
These must be appropriately managed and balanced as the BI manager devises an information
delivery architecture and strategy.
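To make the first two determinants concrete, here is a minimal sketch, in Python, of how accuracy and completeness checks might be applied to records during a load. The field names, reference values, and thresholds are hypothetical, not taken from any specific implementation.

```python
# Hypothetical reference set for the accuracy check.
VALID_PRODUCT_CODES = {"GOLD", "PLATINUM", "REWARDS"}

def check_record(rec):
    """Accuracy: flag values that cannot reflect reality."""
    issues = []
    if rec["product_code"] not in VALID_PRODUCT_CODES:
        issues.append("invalid product code")
    if not 0 <= rec["balance"] <= 1_000_000:  # shifted decimal points land here
        issues.append("balance outside plausible range")
    return issues

def completeness_ratio(records, expected_count):
    """Completeness: fraction of the expected population actually received."""
    return len(records) / expected_count

records = [
    {"account": "A1", "product_code": "GOLD", "balance": 1520.00},
    {"account": "A2", "product_code": "XX",   "balance": 89_125_000.00},
]
for rec in records:
    print(rec["account"], check_record(rec))
print("completeness:", completeness_ratio(records, expected_count=2))
```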
Understanding your User Community
Prior to even contemplating databases, tools, and training, it is critical that an understanding be
developed of the actual people who are expected to be able to utilize and generate value from a
decision support environment. Just as companies segment their customer bases to identify
homogeneous groups for which they can devise a unique and optimal servicing strategy, so too
can your business intelligence user community be segmented and serviced accordingly. Like
customers, the information users have a specific set of needs and a desire to interact with
information suppliers in a certain way.
What I have done is to come up with a simple segmentation scheme that identifies four levels
of user, broken out by role and level of technical sophistication:
Level 1: Senior Managers and Strategists. Looking for a high-level view of the organization.
They generally require solutions (OLAP, dashboards, etc.) which entail minimal data
manipulation skills, often viewing prepared data/analysis or directly accessing information
through simple, pre-defined access paths.

Level 2: Business Analysts. Analysts who are more focused on the business than technology.
They can handle data that is denormalized, summarized, and consolidated into a small number
of tables accessed with a simple tool. They are generally involved with drilling down to find
the business drivers of performance problems. They often prepare data for strategists.

Level 3: Information Specialists. These are actual programmers who can use more
sophisticated tools and complex, generalized data structures to assemble dispersed data into
usable information. They may be involved with assessing the underlying behaviors that
impact business drivers, in strategy implementation, and in measurement of behaviors. They
may assist in the preparation of data for business analysts, strategists, or statisticians.

Level 4: Predictive Modelers and Statisticians. These are highly specialized analysts who can
use advanced tools to do data mining and predictive modeling. They need access to a wide
range of behavioral descriptors across extended timeframes. Their primary role is to identify
behaviors to change to achieve business objectives, and to select appropriate
antecedents/consequences to initiate the changes.
Let me be clear: this is a sample segmentation scheme. This specific breakout is not
as important as the fact that you must not only know and understand your users, but you must
be aware of the critical differentiation points that will direct how these users would like to and
would be able to interact with information. It is also important to remember that this must
apply to your end-state processes and not just your current state. This means that roles may be
invented that do not currently exist, and those roles must be accounted for in any segmentation
scheme.
Note that while user segmentation is very important from a planning and design perspective,
real users may not fall neatly into these well defined boxes. In reality, there is a continuum of
roles and skill levels, so be prepared to deal with a lot of gray. Many people will naturally map
into multiple segments because of the manner in which their role has evolved within the
process over time. Many of the information users that I have dealt with have the analytical
skills of a business analyst and the technical skills of an information analyst. They would feel
indignant if a person were to try to slot them in one box or the other. The key point to be made
here is that role-based segmentation will be the driving force behind the design of information
structures and BI interfaces. The important thing is that you design these for the role and not
the current person performing that role. A person performing a business analyst role should
utilize information structures and tools appropriate for that role, even if that person’s skill level
is well beyond that. This will provide much more process flexibility as people move to
different roles.
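As a hedged illustration of designing for the role rather than the person, the following sketch shows how a role-based segmentation might be captured as configuration that drives which data structures and tools each role is provisioned with. The segment names echo the sample scheme above; the structure and tool lists are assumptions for illustration.

```python
# A hypothetical role-to-provisioning map: design is driven by the role,
# not by the skill level of the person currently performing it.
ROLE_PROFILES = {
    "senior_manager": {
        "structures": ["multi-dimensional cube", "scorecard mart"],
        "tools": ["dashboard", "OLAP viewer"],
    },
    "business_analyst": {
        "structures": ["star schema mart", "summary tables"],
        "tools": ["general query tool", "OLAP tool"],
    },
    "information_specialist": {
        "structures": ["normalized warehouse"],
        "tools": ["procedural language", "general query tool"],
    },
    "statistician": {
        "structures": ["behavioral history tables"],
        "tools": ["data mining/statistical suite"],
    },
}

def provision(role):
    """Look up the structures and tools appropriate to a role."""
    return ROLE_PROFILES[role]

print(provision("business_analyst"))
```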
One of the biggest mistakes in developing a data warehouse is to provide a huge and complex
entanglement of information, and expect that by training hundreds of people, usage of this
monstrosity will permeate corporate culture and processes. When only a tiny minority
of those who were trained actually access the databases and use the information (and those
people were the ones who already had expertise in using information prior to development of
the warehouse), the developers then assume that this is a user information adoption problem.
Their solution: apply more resources to marketing the data warehouse and to training users.
Unfortunately, training will only get people so far. Some people do not have the natural
aptitudes and thought processes that are necessary to being successful knowledge workers. In
addition, many people have absolutely no desire to become skilled technicians with
information and tools. No amount of training and support will change this.
The key point to remember is, you are supplying a service to information users, who are your
customers. You must therefore start with the knowledge of who your customers are, what they
are capable of doing, and what they have to accomplish. You then apply this information by
delivering to them things that they need and can actually use. If you are supplying a product or
service that they either do not need, or do not have the current or potential skills and aptitudes
to actually use, there will not be adoption and the system you are building will fail. Business
Intelligence needs to adapt to the user community, and not vice-versa.
Mapping and Assessing Analytical Information Processes
In order to be able to evaluate and improve your analytical information processes, it is essential
that there be some way to capture and communicate these processes in a meaningful way. To
do this, I came up with the Analytical Information Process Matrix. With user segments
representing rows and sub-processes as columns, this graphical tool allows individual activities
within the process to be mapped to each cell. In the diagram below, you can see some
examples of the types of activities that might be mapped into each cell:
Although this representation is in two dimensions, it can easily be extrapolated to three
dimensions to accommodate multi-departmental processes, so that activities and participants
can be tied to their specific departments.
The matrix columns correspond to the five sub-processes of the closed loop:

1. WHAT are the performance issues? Deliver and assess performance metrics.
2. WHY is this situation occurring? Drill down and research to find root causes.
3. HOW can we improve performance? Analyze alternatives and devise an action plan.
4. IMPLEMENT the action plan! Interface with processes and channels.
5. LEARN and apply to future strategies. Measure changes to assess effectiveness.

The rows correspond to the user segments: managers/strategists, business analysts,
information specialists, and statistical analysts. A wide range of possible roles exists as you
design your closed-loop analytical processes. Example activities mapped into the cells
include: assessing performance and identifying opportunities; performance reporting;
collecting, analyzing, and presenting metrics; developing and researching hypotheses;
transactional and behavioral reporting; data integration and complex drill-down; what-if
analysis; mining data for opportunities; developing behavioral models and statistical profiles;
developing alternative strategies and selecting the optimal strategy; creating transport
mechanisms and interfaces; and collecting and assessing results.
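For those who prefer working code to diagrams, here is one hypothetical way to represent the matrix itself: a mapping from (segment, sub-process) pairs to the activities in that cell. The specific cell assignments below are illustrative examples, not a definitive mapping.

```python
# The Analytical Information Process Matrix as a data structure: rows are
# user segments, columns are the five sub-processes, and each cell holds
# the activities mapped to it. Cell contents here are illustrative only.
SUB_PROCESSES = ["WHAT", "WHY", "HOW", "IMPLEMENT", "LEARN"]

matrix = {
    ("Managers/Strategists", "WHAT"): ["Assess performance and identify opportunities"],
    ("Business Analysts", "WHY"): ["Develop and research hypotheses"],
    ("Information Specialists", "HOW"): ["Perform what-if analysis"],
    ("Statistical Analysts", "HOW"): ["Develop behavioral models",
                                      "Select optimal strategy"],
    ("Information Specialists", "IMPLEMENT"): ["Create transport mechanisms and interfaces"],
    ("Business Analysts", "LEARN"): ["Collect and assess results"],
}

def activities(segment, sub_process):
    """Return the activities mapped into one cell of the matrix."""
    return matrix.get((segment, sub_process), [])

print(activities("Statistical Analysts", "HOW"))
```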
To graphically map the process, the specific information activities are incorporated as boxes in
the diagram, along with arrows representing the outputs of a specific activity which are
connected as inputs to the subsequent activity. This is a very simple example:
This shows types of activities at a high level. Within each box could actually be numerous
individual information activities. The question then is: How much detail do you actually need
to do a meaningful process mapping? While more detail is definitely better, getting mired in
too much detail can lead to ‘analysis paralysis’. As long as you can identify the key
communication lines and data handoff points, you can derive meaningful benefit from a high-
level mapping. The key is to integrate this with the information taxonomy to identify the
information that corresponds with each box.
A sample analytical information process flow scenario might look like this:
managers/strategists assess performance using OLAP or a dashboard (WHAT are the
performance issues?); business analysts use query/reporting capabilities to drill down and
research root causes (WHY is this situation occurring?); information specialists execute
complex data integration while statistical analysts perform data mining, statistical analysis,
and scenario evaluation, leading to a decision on appropriate actions (HOW can we improve
performance?); the action plan is implemented through interfaces with processes and channels
(IMPLEMENT); and the results are analyzed through reporting and data manipulation to
measure changes and assess effectiveness (LEARN and apply to future strategies).
For example, in the ‘assess performance’ box, the key is to identify the meaningful, high-level
metrics that will be used for gauging organizational health and identifying opportunities. This
should be a relatively small number of metrics, since too many metrics can lead to confusion.
If the goal is to optimize performance, taking a certain action can move different metrics
different amounts, and possibly even in opposite directions. Simplicity and clarity are
achieved by having a primary metric for the organization, with supplemental metrics that align
with various aspects of organizational performance, and leading indicators that give an idea of
what performance might look like in the future. In addition, you need standardized dimensions
across which these metrics can be viewed, which can enable you to pinpoint where within the
organization, customer base, and product portfolio there are issues.
Once you know what needs to be delivered, you then need to understand the segment that will
be accessing the information to determine how best to deliver it. The managers and strategists
who are looking at organizational performance at a high level will need to do things like
identify exceptions, drill from broader to more granular dimensions, and be able to
communicate with the analysts who will help them research problems. A specific data
architecture and tool set will be needed to support their activities.
Business analysts need to be able to drill into the actionable performance drivers that constitute
the root causes for performance issues. For example, in a Credit Card environment, risk-
adjusted margin is a critical high level metric looked at by senior management. In our
implementation, we included roughly 40 different component metrics, which are sufficiently
granular to be actionable. The components include each individual type of fee (annual, balance
consolidation, cash advance, late, over-limit), statistics on waivers and reversals, information
on rates and balances subjected to those rates, cost of funds, insurance premiums, charge-
offs/delinquencies, and rebates/rewards. By drilling into these components, changes in risk-
adjusted margin can be investigated to determine if a meaningful pattern exists that would
explain why an increase or decrease has occurred within any cell or sub-population of the total
customer base. By analyzing these metrics, analysts can narrow down the root causes of
performance issues and come up with hypotheses for correcting them. Doing this requires
more complex access mechanisms and flexible data structures, while still maintaining fairly
straightforward data access paths.
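As a sketch of how such a decomposition might look in practice, the following example breaks a risk-adjusted margin figure into a handful of hypothetical components (a small subset of the roughly 40 mentioned above) and compares two periods component by component to see what moved the metric.

```python
# Hypothetical components of risk-adjusted margin for one cell of the
# customer base (a small subset of the ~40 components described above).
components = {
    "interest_income":  120_000.0,
    "late_fees":         14_000.0,
    "cash_advance_fees":  6_500.0,
    "cost_of_funds":    -48_000.0,
    "charge_offs":      -31_000.0,
    "rewards_expense":  -12_500.0,
}

# The high-level metric is the sum of its actionable components.
risk_adjusted_margin = sum(components.values())
print(f"risk-adjusted margin: {risk_adjusted_margin:,.0f}")

# Drill-down: compare against a prior period, component by component,
# to see which drivers moved the metric.
prior_period = dict(components, charge_offs=-22_000.0)
for name, value in components.items():
    delta = value - prior_period[name]
    if delta:
        print(f"{name} changed by {delta:+,.0f}")
```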
The next level down consists of measures of behavior, which is generally the realm of
statistical modelers. Because there are so many different types of behaviors, this is the most
difficult set of activities to predict and support. Behaviors include whether individual
customers pay off their whole credit card bill or just make partial payments, whether they call
customer service regularly or sporadically, whether they pay their credit card bill by the
internet or by mail, whether they use their credit cards for gas purchases or grocery purchases,
and countless other possibilities. The issue here is to determine which of these behaviors could
be changed in order to impact the business driver(s) identified as the root causes by the
analysts. The key is to determine antecedents and consequences such that the desired change
in behavior is maximized, without incurring negative changes in other behaviors that would
counteract the positive change. To do this requires access to detailed behavioral data over long
periods of time.
As an example, suppose senior management has identified a performance issue with the
gasoline rewards credit card, which had consistently met profitability forecasts in the past but
has suddenly shown a noticeable reduction in profitability. After drilling into the problem,
analysts identified the issue as decreased balances and usage combined with increased attrition
among the customers who had been the most profitable. Because this coincided with the
marketing of a new rewards product by a competitor that provided a 5% instead of a 3%
reward, the hypothesis was that some of our customers were being lured by this competitor’s
offer. Some customers were keeping their cards and using them less, while others were
actually closing their accounts.
Through statistical analysis, we then needed to figure out:
- What would we need to do to keep our active, profitable customers?
- Is there anything we could do for those customers already impacted?
- Given the revenue impacts of increasing rewards, what do we offer, and to what
subset of customers, that will maximize our risk-adjusted margin?
Once the course of action was determined, next came the implementation phase. For this,
there might be three hypothetical actions:
- Proactively interface with the direct marketing channel to offer an enhanced rewards
product to the most profitable customers impacted.
- Closely monitor transaction volumes of the next tier of customers, and use statement
stuffers to offer them temporary incentives if their volumes start decreasing.
- Identify any remaining customers who would be eligible for temporary incentives or
a possible product upgrade if they contact customer service with the intention of
closing their account.
Implementation of this strategy would therefore require that information be passed to three
separate channels via direct feeds of account lists and supporting data.
Once implementation is complete, the final sub-process is monitoring. The behavior of the
people who were impacted is tracked and compared with control groups (customers meeting
the same profile but not targeted for action) to determine the changes in behavior motivated by
the offer. Based on the tracking, appropriate learnings are captured that can assist in the
development of future response strategies.
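A minimal sketch of that comparison, with invented numbers: the mean change in a behavior for the targeted group is compared against a matched control group to estimate the lift attributable to the offer.

```python
# Invented per-customer balance changes since the offer went out.
treated_deltas = [210.0, -40.0, 385.0, 120.0]   # targeted customers
control_deltas = [-95.0, -120.0, 10.0, -60.0]   # same profile, not targeted

def mean(values):
    return sum(values) / len(values)

# The behavior change motivated by the offer is estimated as the
# difference between the treated and control group averages.
lift = mean(treated_deltas) - mean(control_deltas)
print(f"estimated balance lift attributable to the offer: {lift:+.2f}")
```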
As with any iterative process, with the next reporting cycle senior management will be able to
look at the key metrics across the dimensions of the organization, and be able to ascertain if
overall performance goals have been attained.
As you can see from the previous example, the process requires the involvement of various
individuals and execution of numerous information activities. In any given sub-process,
individuals may take on roles of producers or consumers of information end-products (or
sometimes both). Effective information transfer capabilities and consistent information must
be available across all process steps to ensure that collaboration among individuals and
interfaces between activities within and across sub-processes occur accurately and efficiently.
For example, when a manager specifies a cell, this corresponds to an intersection of
dimensions. These same dimensions must be understood by the analyst and must be
incorporated with the exact same meanings into whatever data structures are being used by the
analysts. When the analyst identifies business drivers that are potential problems, these need to
be communicated to the statistical analysts. In addition, if a business analyst identifies a set of
customers that looks interesting, it must be possible to transfer that set directly to the statistical
analysts so that they can work with the same data. After a strategy is developed and
implementation occurs, the same data that drives implementation must be available within the
tracking mechanisms to ensure that the correct customers are tracked and accurate learnings
occur.
As we map out these information flows, we will be looking for the following types of issues:
- Too many assembly stages can make the process of pulling together information
excessively labor intensive, extend information latency, and increase the likelihood of
errors being introduced. In many cases, excessive assembly stages are not due to intent,
but to the fact that processes evolve and information deliverables take on roles that
were not initially intended. A rational look at a process can easily identify these
vestigial activities.
- Inefficient information hand-off points between stages can occur when information is
communicated on paper, or using ‘imprecise’ terminology. For example, if a
manager communicates that there is a problem in the New York consumer loan
portfolio, it could refer to loan customers with addresses in New York, loans which
were originated in New York branches, or loans that are currently domiciled and being
serviced through New York branches. It is extremely important that precise
terminology is used to differentiate similar organizational dimensions. It is also critical
that electronic communication be used where possible, so that specific metrics and
dimensions can be unambiguously captured, and data sets can be easily passed between
different individuals for subsequent processing (see the sketch after this list).
- Multiple people preparing the same information to be delivered to different users can
cause potential data inconsistencies and waste effort. This wasted effort may not be
limited to the production of data: it may also include extensive research that must be
done if two sources provide different answers to what appears to be the same question.
- Information gaps may exist where outputs from one process activity, when passed to
the next, do not map readily into the data being utilized in the subsequent step. This can
occur if data passes between two groups using different information repositories which
may include different data elements or have different data definitions for the same
element.
- Delivery that is ineffective or inconsistent with usage can cause excessive work to be
required to produce an end-product. This can occur when the data structure is too
complicated for the ability of the end user, requiring complex joins and manipulation to
produce results, or when intricate calculations need to be implemented. It can also
occur when the tool is too complicated for the intended user. An even worse
consequence is that the difficulty in data preparation may make the process vulnerable
to logic errors and data quality problems, thereby impacting the effectiveness of the
supported information processes.
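As a sketch of the electronic hand-off suggested in the second item above, the selection could be captured as explicit dimension names and values rather than as the ambiguous phrase "the New York consumer loan portfolio". The dimension names below are hypothetical.

```python
# An unambiguous, machine-readable hand-off specification. Exactly one
# interpretation of "New York" is stated as an explicit dimension.
handoff = {
    "metric": "risk_adjusted_margin",
    "population": {
        "customer_address_state": "NY",
        # alternatives that would otherwise be conflated:
        # "origination_branch_state": "NY",
        # "servicing_branch_state": "NY",
    },
    "time_period": "2004-Q3",
}

def selection_criteria(spec):
    """Render the population spec as a WHERE-style clause for downstream use."""
    return " AND ".join(f"{col} = '{val}'" for col, val in spec["population"].items())

print(selection_criteria(handoff))
```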
In addition to promoting efficiency, understanding processes helps in one of the most daunting
tasks of the BI manager – identifying and quantifying business benefits. With the process
model, you can tie information to a set of analytical processes that optimize business rules.
You can tie the business rules to the operational processes that produce value for the
organization, and you can estimate the delta in the value of the operational processes.
Otherwise, you get Business Intelligence and Data Warehousing projects assessed for approval
based on the nebulous and non-committal “improved information” justification. The problem
with this is that you do not have any valid means of determining the relative benefits of
different BI projects (or even of comparing BI projects with non-BI projects). As a result,
projects get approved based on:
- Fictional numbers
- Who has the most political clout
- Who talks the loudest
This is definitely not the way to ensure that you maximize the business benefits of your data
warehousing and business intelligence resources. If you, as a BI manager, are going to be
evaluated based on the contribution of your team to the enterprise, it is essential that you
enforce appropriate discipline in linking projects to benefits, so as to ensure that the
highest-value projects are implemented.
Information Value-Chain
Many of you are familiar with the concept of the Corporate Information Factory described by
Bill Inmon, which is well known in data warehousing circles. It depicts how data flows
through a series of processes, repositories, and delivery mechanisms en route to end users.
Driving these processes is the need to deliver information in a form suitable for its ultimate
purpose, which I model as an information value-chain. Rather than looking at how data flows
and is stored, the value chain depicts how value is added to source data to generate information
end-products.
Note that while traditional ‘manufacturing’ models consist only of the IT value-add steps (or
what is here referred to as architected value-add), this model looks at those components as only
providing a ‘hand-off’ to the end users. The users themselves must then take the delivered
information and expend whatever effort is needed to prepare the information end-products and
deploy them within the process. I tend to call the user value-add the ‘value gap’ because it
represents the gulf that has to be bridged in order for the business information users to be able
to perform their roles in their analytical information processes.
The value chain proceeds from raw data through four stages of architected environment
value-add (the information infrastructure), across the information hand-off point, and then
through user value-add (human effort) to the finished end-product:

1. Raw data from operational systems.
2. Integrational value-add (core ETL): consistency, rational entity-relationship structure,
and accessibility.
3. Computational value-add (analytical/aggregational engines and processes): metrics,
scoring, and segmentation at atomic levels.
4. Structural value-add: aggregation, summarization, and dimensionality.
5. BI tool value-add (user interface and delivery infrastructure): simplified semantics,
automated/pre-built data interaction capabilities, and visualization.
6. Information hand-off point: the boundary between the architected environment and
human effort.
7. User value-add (the value-gap): interacting with tools and coding data extract,
manipulation, and presentation processes.
8. Information end-products deployed in analytical processes.

There is a distinct value chain for the information end-products associated with each unique
information activity across your analytical processes.
When looking at the data manipulations needed to bridge the value-gap, you will find that a
substantial value-gap is not necessarily a bad thing, just like a small value-gap is not
necessarily a good thing. It is all relative to the dynamics of the overall process, the user
segments involved, and the organizational culture and paradigms. The key to the value gap is
that it should not just ‘happen’; it needs to be planned and managed. The process of
planning and managing the value gap corresponds to your information strategy.
Once a handoff point is established that specifies how the information is supposed to look to
the end-users, you then need to determine how best to generate those information deliverables.
The way this is done is to work backwards to assess how best to partition value-add among the
four environmental categories. Note that there are numerous trade-offs that are associated with
different categories of environmental value-add. The process of assessing and managing these
trade-offs in order to define the structure of your environment corresponds to the development
of your information architecture.
An information plan is then needed to move from a strategy and architecture to
implementation. Input into the planning process consists of the complete set of data
deliverables and architectural constructs. The planning process will then partition them out
into a series of discrete projects that will ultimately achieve the desired end-state and in the
interim provide as much value early on as possible.
A look at Information Strategy
The key issues associated with devising and implementing an information strategy are related
to managing the value gap. This gap must be bridged by real people, whose ability to
manipulate data is constrained by their skills and aptitudes. They must access information
elements using a specific suite of tools. They will have certain individual information accesses
that they must perform repeatedly, and certain ones that are totally unpredictable that they may
do once and never again. The nature of this gap will determine how much support and training
are needed, how effectively and reliably business processes can be implemented, and even
whether specific activities can or cannot be practically executed using the existing staff. By
understanding the value gap, cost-benefit decisions can be made which will direct the amount
of value that will need to be built into the pre-handoff information processes, and what is better
left to the ultimate information users.
When developing an information strategy, the first thing that needs to be documented is the
target end-state process vision. The information strategy needs to consider three sets of issues
and strike a workable balance:
- Users and Activities. Issues include: training/learning curve; activity/process redesign
required; realignment of roles required; acquisition of skilled resources required.
- Tools. Issues include: in-use already vs. acquire; wide vs. narrow scope; power vs. ease
of use; best-of-breed vs. integrated suite.
- Information. Issues include: development time and cost; load and data availability
timing; flexibility vs. value added.
Based on these issues, it is apparent that:
- Strategy development is focused on mapping the information end-products associated
with analytical information processes back to a set of information deliverables
corresponding to a set of information structures and delivery technologies.
- Implicit in strategy development is the resolution of cost-benefit issues surrounding
technology and systems choices.
- Responsibility for strategy is shared by both business functions and IT, and has close
ties to architecture development.
Once you have the target processes documented and activities identified, you will see that
strategy development is essentially the recognition and resolution of process tradeoffs. The
types of trade-offs you will have to consider will include:
- Trade-off of environmental value-add for user value-add
- Trade-off of dynamic computations within tools versus static computations within
data structures
- Trade-off of breadth of tool capabilities with ease of usage and adoption
- Trade-off of segment focus available with multiple targeted tools versus reduced
costs associated with fewer tools
- Trade-off of development complexity with project cost and completion time
- Trade-off of ETL workload with data delivery time
The trade-off of environmental value-add with user value-add is critical to the success or
failure of a BI initiative. To start off, a complete user inventory would need to be undertaken
to segment users based on their current skill levels. This would then need to be mapped into
roles in the end-state process. This will allow you to assess:
- Current user capabilities and the degree of productivity that can be expected.
- What training, coaching, and experience are necessary to advance users from their
current skill level to where they need to be to fulfill their intended process roles.
- Critical skill gaps that cannot be filled by the existing user community.
By shifting the information hand-off point to the right, users will need less technical skill to
generate their information end-products. This would reduce the need for training and
enhancing skills through hiring. However, this potentially increases development complexity
and ETL workload, which would increase development cost and data delivery times.
Another huge issue which will impact the trade-off of environmental value-add versus user
value-add is the stability of the information end-products, which is a critical consideration for
organizations that already have a population of skilled information users. Each value-add
scenario has both drawbacks and benefits associated with it. The key is to balance
reliability and organizational leverage against cost and expediency.
Those who have been involved with companies with a strong information culture know that
information users can be extremely resourceful. Having previously operated on the user side
and made extensive use of fourth-generation languages, I can attest to the results that
can be achieved by applying brute force to basic data. Since users can dynamically alter their
direction at the whim of their management, this is by far the most expedient way to get
anything done. It is also the least expensive (on a departmental level), since user development
does not carry with it the rigors of production implementation. Unfortunately, this has some
negative implications:
- Each user department must have a set of programming experts, forming
decentralized islands of skill and knowledge.
- Much is done that is repetitive across and within these islands. This is both labor
intensive and promotes inconsistency.
- Service levels across the organization are widely variable, and internal bidding wars
may erupt for highly skilled knowledge workers.
- User-developed processes may not be adequately tested or be sufficiently reliable to
have multi-million dollar decisions based on them.
- Documentation may be scant, and detailed knowledge may be limited to one or a
small number of individuals, thereby promoting high levels of dependency and
incurring significant risks.
Therefore, while expedient at the departmental level, this carries with it high costs at the
overall organizational level.
Building information value into production processes is a much more rigorous undertaking,
which carries with it its own benefits and drawbacks. It requires that much thought and effort
be expended up front in understanding user processes and anticipating their ongoing needs
over time. Therefore, this is a very deliberate process as opposed to an expedient one. It
requires significant process design, programming, and modeling work to produce the correct
information and store it appropriately in repositories for user access. It also entails risk, since
if the analysis done is poor or if the business radically changes, the value added through the
production processes may be obsolete and not useful after a short period of time, thereby never
recouping sunk costs.
However, there are also extremely positive benefits of implementing value-added processes in
a production environment.
- It reduces the value-gap that users must bridge, allowing user departments to utilize
less technically skilled individuals. This results in less need for training and
maximizes the ability to leverage existing staffing.
- It increases consistency, by providing standard, high-level information building
blocks that can be incorporated directly into user information processes without
having to be rebuilt each time they are needed.
- It is reliable, repeatable, and controlled, thereby reducing integrity risk.
- It provides a metadata infrastructure which captures and communicates the
definition and derivation of each data element, thus minimizing definitional
ambiguity and simplifying research.
- It can dramatically reduce resource needs across the entire organization, both human
and computing, versus the repeated independent implementation of similar processes
across multiple departments.
Note that no reasonable solution will involve either all ‘user build’ or all ‘productionization’.
The key is understanding the trade-offs, and balancing the two. As you move more into
production, you increase your fixed costs. You will be doing additional development for these
information processes, and will operate those processes on an ongoing basis, expending
manpower and computing resources. The results will need to be stored, so there will be a
continuing cost for a potentially large amount of DASD (disk storage). When processes are built within the
user arena, costs are variable and occur only when and if the processes are actually executed.
However, these costs can quickly accumulate due to multiple areas doing the same or similar
processes, and they entail more risk of accuracy and reliability problems. This can actually
mean an even larger amount of DASD, since the same or similar data may actually be stored
repeatedly. The trade-off is that sufficient usage must be made of any production
summarizations and computations so that the total decrease in user costs and risks provides
adequate financial returns on the development and operational cost.
Depending on technologies used, data structures accessed, and complexity of algorithms,
performing the same set of calculations in a production environment versus an ad-hoc (user)
environment can take several times longer to implement, assuming implementers of equivalent
skill levels. This difference will be related to formalized requirements gathering, project
methodology compliance, metadata requirements, documentation standards, production library
management standards, and more rigorous design, testing, and data integrity management
standards.
Savings occur due to the greater ease of maintenance of production processes. Since
derivation relationships within metadata can enable you to do an impact analysis and identify
things that are impacted by upstream changes, it is much easier to keep production processes
updated as inputs change over time. Also, built-in data integrity checking can often detect
errors prior to delivery of data to user applications, thereby avoiding reruns and reducing the
probability of bad data going out. For ad-hoc processes, the author must somehow find out
about data changes in advance, or else data problems may propagate through these processes
and may not be captured until after information has already been delivered to decision makers.
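A minimal sketch of such an impact analysis, assuming the derivation relationships are available as a simple parent-to-child mapping: a breadth-first walk from the changed element finds everything downstream that must be reviewed. The element names are invented.

```python
from collections import deque

# Hypothetical derivation relationships: element -> elements derived from it.
derived_from = {
    "src.balance":        ["dw.account_balance"],
    "dw.account_balance": ["mart.avg_balance_12m", "mart.balance_tier"],
    "mart.balance_tier":  ["dash.profitability_by_tier"],
}

def impacted(changed_element):
    """Breadth-first walk of the derivation graph from a changed element."""
    seen, queue = set(), deque([changed_element])
    while queue:
        for child in derived_from.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

# Everything that must be reviewed if the source balance field changes:
print(impacted("src.balance"))
```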
In some cases, trade-offs made will impact tool and data communication expenses. If data is
delivered in a relatively complete form, it merely needs to be harvested. This generally means
that what the user will pull from the information environment is either highly summarized data
or a small targeted subset of the record population. In situations where the data is raw and
substantial value needs to be added by the users in order to make the information useful, large
amounts of data may need to be either downloaded to a PC or transmitted to a mid-range or a
server for further manipulation. This can dramatically impact data communications bandwidth
requirements and related costs.
For end-products that tend to change over time, consider providing to users a set of stable
components that they can dynamically assemble into their needed end-products. Changing
production processes requires significant lead time. If certain analytical metrics can potentially
change frequently, attempting to keep up with the changes could bog down your ETL
resources and not be sufficiently responsive. A lot depends on the types of changes that could
occur, since some types of change could be handled merely by making the process table or
rules driven. For changes that mandate recoding, efficiency in making changes is related to the
nature of the technology used for ETL. In many cases, the usage of automated tools can
dramatically reduce turnaround time and resources for system changes and enhancements.
Regardless, the need for flexibility must always be considered when determining what
deliverables need to be handed over to end users and in what format.
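As a sketch of the table- or rules-driven approach just mentioned, metric definitions can live in data rather than code, so that certain classes of change become a rule edit instead of a recoding effort. The metric and field names are hypothetical.

```python
# Metric definitions held as data: adding or removing a component field
# changes the metric without recoding the process that computes it.
metric_rules = {
    "total_fees": ["late_fee", "over_limit_fee", "annual_fee"],
    # adding "cash_advance_fee" above is a rule edit, not a code change
}

def compute_metrics(record, rules):
    """Compute each rule-defined metric as the sum of its component fields."""
    return {name: sum(record.get(field, 0.0) for field in fields)
            for name, fields in rules.items()}

record = {"late_fee": 35.0, "over_limit_fee": 29.0, "annual_fee": 0.0}
print(compute_metrics(record, metric_rules))
```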
In addition to the issue of whether to do calculations in production or leave them to the user, an
even more vexing issue is how to structure data for retrieval. Complexity of access paths is
often a more impenetrable barrier to information adoption than having to do
complex calculations. If tables are highly normalized, an extended series of table joins might
be necessary in order to pull together data needed for analysis. To simplify the user interface,
there are two alternatives. We can bury the joins into a tool interface to try to make them as
simple and transparent as possible, or else we can introduce denormalized tables. Tool-based
dynamic joins may be more flexible, but do not provide any performance benefit.
Denormalized tables provide a significant performance benefit, but at the cost of additional
DASD and the requirement of recoding ETL and restructuring databases if significant changes
occur in the business that require different views of data. Again, there will generally not be an
either/or solution, but rather a blending that takes into account which things are most stable and
which things require quickest access.
Critical decisions will need to be made with respect to tools. Tools with more power tend to be
harder to learn. In some cases, tools that are provided that are not consistent with the
corresponding user segment’s technical abilities can cause adoption resistance and ultimate
failure. There are trade-offs that need to be made with the number of tools. The more
individual tools in the suite, the more targeted they can be towards their intended segment.
However, this leads to increased support and training costs and reduced staff mobility and
interchangeability. In some cases, a single-vendor suite can be used that provides targeting of
capabilities while simplifying adoption by providing a consistent look and feel. This may
result in a compromise in functionality, since many of the best of breed individual solutions do
not have offerings that cover the complete spectrum of end-user requirements.
In a dynamic business environment, time is a critical factor. Here we need to look at time from
two perspectives. The first is the time it takes to implement a new capability. From the time a
need is recognized that requires an information solution, the clock is ticking. Every week and
month that we are waiting for that solution to be implemented, we are missing the opportunity
to generate value and gain competitive advantage. It is therefore critical to recognize the
implementation time as a critical consideration, and be willing to assess trade-offs that deliver
less, but do it faster. Likewise, when detecting events or evaluating results of decisions and
actions, information latency is a critical consideration. Hours and minutes may have
significant value. Again, it may be beneficial to make tradeoffs between latency and value-add
in order to expedite the delivery of information and improve responsiveness to events and
changes.
By the time you are finished with your strategy, you should have determined:
- What are the optimal information hand-off points to produce the needed
end-products?
- What information accesses are repetitive versus sporadic?
- What are the information clusters, or information that tends to be needed together?
- What are the different access paths (selection criteria/drill sequence) needed at the
various hand-off points?
- How do we support information and data flows between activities?
- How are people mapped to tools and data, and how will the needed skills be
acquired?
- What are the various entities, events, and relationships for which data must be
captured?
- What are the trade-offs associated with time versus capabilities?
With this information, we can now begin to work with the architects to establish a solution
architecture.
A look at Architectural Issues and Components
In the Information Architecture, you essentially define the interrelationships between data
sources, data management processes, and information repositories, and select appropriate
technologies and paradigms for implementation. Included would be a series of philosophical
directives that will determine what is stored, how processes are implemented, how metadata
(data about data) is managed, plus a series of technical directives related to platforms,
software, data communications, user and programmer tools, etc. Note that an architecture is a
means to an end. The end is the ability to implement your information strategy as quickly,
efficiently, and cost-effectively as possible. When developing an architecture, there are
numerous factors that must be considered:
- Scalability, or the ability to grow as the business grows.
- Throughput, or the ability to move and transform data quickly enough to satisfy
business timing needs.
- Complexity/reliability, which will ultimately impact data integrity and the amount of
effort that must be expended to operate the processes.
- Human productivity and implementation issues, which will impact the efficiency
with which your current staff (or target staff) will be able to develop, maintain, and
fix processes.
- Enterprise integration issues, or how the business intelligence technology suite maps
into the overall technology set employed by the enterprise.
Note that strategy and architecture must converge at a single point. Strategy works backwards
from users and activities to identify the appropriate information deliverables to support process
execution. Architecture works forward from available data and building blocks to construct a
framework for implementing a set of information deliverables. The implication here is that
neither can be done in a vacuum. The starting point is always the set of information processes
that must be supported. This will drive a first pass at an information strategy, and the needs
represented by the strategy will provide the basic requirements for a technical architecture.
Technical and business practicalities will force compromises in both the strategy and
architecture, but in the end a consistent and workable scenario must be the result.
The first thing I would like to do is basically inventory the various architectural components
that may be assembled to create a Business Intelligence and Data Warehousing architecture. I
have divided these components into three sets:
- Database Structures
- Environments/Platforms
- User access tools
The following charts identify these components:
Database Structures to support integrational, computational, and structural value-add

Operational Data Store
  Where applicable: Minimal-latency data provides quick feedback as to changes in specific
  behaviors being monitored.
  Limitations: Rapid turnaround limits possible value-add.
  Suitable process phases/segments: May be used by information specialists for
  sub-processes 4 and 5.

Multi-dimensional Database
  Where applicable: Provides fast and easy access to multidimensional information for
  interactive analysis and drill-down.
  Limitations: Becomes unwieldy with numerous or large dimensions.
  Suitable process phases/segments: May be used by managers or some business analysts
  for sub-process 1.

Star Schema Mart
  Where applicable: Allows flexible, dimensional access to detailed or summary data, and
  enables drill-through from a multidimensional database.
  Limitations: Complex to build, possible redundancy, limited ability to store individual
  event data.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Automatic/Manual Summary Tables
  Where applicable: Improves performance of frequently submitted queries.
  Limitations: Not suitable for a constantly changing query mix.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Process/Subject Focused Data Mart
  Where applicable: Segregates, restructures, and/or aggregates relevant data for
  performance and access simplicity.
  Limitations: Introduces difficulty if external information must be integrated, plus possible
  redundancy.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Normalized Analytical Data Warehouse
  Where applicable: Most flexible means of storing data, reflecting the natural structure of
  the data itself. Supports diverse retrieval patterns and entry points.
  Limitations: Introduces the largest value gap, requiring the largest user value-add and
  carrying the most information usage risk.
  Suitable process phases/segments: May be used by information specialists and statistical
  modelers for sub-processes 1-4.
Environments/Platforms to support BI Tool value-add

Portal
  Where applicable: Unified front-end that allows consolidated authentication and access
  control for all BI capabilities.
  Limitations: Not suitable for desktop applications.
  Suitable process phases/segments: All sub-processes and segments.

Dashboard/Scorecard
  Where applicable: Consolidates high-level performance metrics or behavioral measures
  into an easily interpreted visual format.
  Limitations: Information may still require interpretation from data specialists.
  Suitable process phases/segments: Suitable for use by managers in sub-process 1.

Analytical Workspace
  Where applicable: Used to integrate and consolidate data from multiple
  sources/environments to support complex analysis and modeling.
  Limitations: Excessive dependency on custom integration may result in duplication of
  effort and inconsistency.
  Suitable process phases/segments: Can be used by information specialists and modelers
  for sub-processes 3 and 4, and possibly others.

Report Management and Distribution Library
  Where applicable: Delivery of detailed reports to a wide range of users on a
  need-to-know basis.
  Limitations: Reports are not readily manipulated by users if a different view is needed.
  Suitable process phases/segments: For data distribution to managers and business
  analysts for sub-process 5, and possibly 1.

Analytical Application
  Where applicable: Custom-coded or purchased application which computes and presents
  appropriate metrics for a specific business.
  Limitations: May be difficult to incorporate metrics and process interfaces unique to your
  business.
  Suitable process phases/segments: May be used by managers and business analysts for
  sub-processes 1 and 2.

Analytical Query Environment
  Where applicable: Web- or desktop-based access to one or more BI tools for accessing
  data. Allows for storage/sharing of queries and results.
  Limitations: A single environment may not completely support a diverse tool suite.
  Suitable process phases/segments: May be used by business analysts and information
  analysts for sub-processes 1, 2, and 5.
Data Delivery and Manipulation Tools to support BI Tool value-add

Data Mining and Statistical Analysis Tool
  Where applicable: Used to cluster/segment the customer population and predict future
  behavior based on historical data.
  Limitations: Requires significant skill to execute and interpret analysis.
  Suitable process phases/segments: May be used by information specialists for
  sub-process 3.

Dashboard Design/Delivery
  Where applicable: Used to populate and format dashboard/scorecard delivery
  applications.
  Limitations: Often linked to specific query tools and not sufficiently flexible.
  Suitable process phases/segments: Used by information specialists to prepare data for
  sub-processes 1 and 5.

Procedural Programming Language
  Where applicable: Used in development of complex processes, including data integration,
  what-if analysis, and modeling.
  Limitations: Requires much skill, and introduces the possibility of redundant efforts,
  inconsistency, and errors.
  Suitable process phases/segments: May be used by information specialists for
  sub-processes 1-5.

OLAP Tool
  Where applicable: Allows data to be flexibly viewed across dimensions at multiple levels,
  with drill-down into more detail.
  Limitations: Requires an MDDB or star/snowflake schema.
  Suitable process phases/segments: May be used by managers or business analysts for
  sub-processes 1, 2, and 5.

Report Creation and Management Tool
  Where applicable: Creation of highly formatted report outputs, which can then be saved
  and distributed.
  Limitations: Formatting is static; reports must be recoded to look at alternate views of
  data.
  Suitable process phases/segments: Used by information specialists to prepare data for
  sub-processes 1 and 5.

General Query Tool
  Where applicable: Allows preparation and submission of SQL using a simplified interface
  with a semantic layer.
  Limitations: May be difficult to do complex manipulations, multi-step processes, etc.
  Suitable process phases/segments: May be used by business analysts and information
  specialists for sub-processes 1, 2, and 5.
Let’s discuss some of the issues to be faced when trying to assemble these pieces into a
cohesive information architecture. What I would like to do is evaluate this from the
perspective of the five sub-processes of your analytical information processes.
Let’s start from the beginning, which is the distribution of broad organizational metrics in
support of high-level performance management. The strategy will drive whether you will have
managers directly accessing the data through scorecards, dashboards, and/or OLAP, or whether
they will have analysts prepare decks in which the appropriate information is filtered,
massaged, interpreted, and delivered in a customized fashion. If a decision is made for
automated delivery of information, then all metrics must be calculated in advance, stored, and
be available dimensionally. Depending on data volumes, data interactivity required, and
performance constraints, this may be stored in a star schema or in cubes. A portal must be
selected to deliver this information (which will also be leveraged within the other sub-
processes), as well as mechanisms to deliver the scorecard information and to populate it.
Rather than attempting a custom solution, you can deploy one of numerous industry-specific
analytical applications to compute, store, and deliver metrics. These are pre-
packaged vendor (or custom) applications which prepare metrics and analytics and deliver
them using a series of existing templates and report formats. They will often have capabilities
suited both to managers for performance management, and some limited capabilities for
business analysts to do drill down to root causes. Analytical applications are generally a quick
way to catch up, but also may not be sufficiently tailored to your internal business processes to
maximize your competitive advantage.
For the second sub-process, drill-down to root causes, we will need to decompose the high
level metrics delivered to senior managers into a robust set of component metrics. These will
generally be stored in a star schema, with a wide range of meaningful dimensions. This would
include the same organizational dimensions looked at by senior managers, but also others used
to divide customers into more actionable sub-classes based on current and historical behavior
patterns. To supplement the calculated metrics, drill through from the star schema metrics
table into detail data should be enabled to create an expanded ‘virtual’ data mart, which allows
access to a wider range of data. Depending on the nature of the process, additional data marts
can supplement this, providing information highly specific to individual activities.
Analysts performing this sub-process will need flexible tools, which will allow OLAP access
to dimensional views, plus more generalized query capabilities. These generalized query
capabilities should be sufficiently broad that they can accommodate the star/snowflake
schemas, normalized data warehouse, and even generalized data sets created by information
analysts that integrate data warehouse and other external data sources.
One of the things you will notice is that I identified the data warehouse as being normalized.
There is a substantial school of thought that proposes that all data warehouses be modeled
dimensionally as star/snowflake schema structures with a fact table and standardized
dimensions. I am a firm believer that once you get down into the realm of analyzing and
predicting behaviors, the multiplicity of access patterns, and in many cases the use of facts
themselves rather than dimensions as entry points into the data, substantially reduce the
benefits of dimensional structures and actually make normalized structures more intuitive
to use.
A large part of your information architecture is the identification of unique data marts that can
be applied to specific user segments and sets of information activities. These marts will be
optimized for their intended usage, so that the specific supported activities can be executed
very simply and/or very quickly. There are a number of different approaches that can be used
optimize data marts for their intended usage:
Computation and storage of summary information, particularly across transactions
and events that pertain to a specific entity. This can be implemented in static
(incorporated into the original schema) or dynamic (materialized query tables
derived as needed by analyzing data access patterns) structures
Integration of time-series information, so that a series of 12 or 24 instances of the
same metric corresponding to different time periods can be co-located in a single
record to simplify and speed up the extraction of historical trend information (see
the sketch following this list).
Extracting just a subset of information relevant to the specific information activities,
thereby simplifying access and improving performance due to narrowing of data
scope.
Optimizing access paths for frequent information entry points through the
implementation of multi-dimensional models, either through a star/snowflake
schema or a multi-dimensional database.
Denormalization to reduce joins
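As an illustration of the time-series co-location technique from the list above, here is a minimal sketch using pandas; the account and balance data are invented for the example.

```python
import pandas as pd

# Fold monthly observations of one metric into a single wide record per
# account, so a trend query reads one row instead of twelve.
long = pd.DataFrame({
    "account_id": [101] * 3 + [102] * 3,
    "month":      ["2024-01", "2024-02", "2024-03"] * 2,
    "balance":    [500.0, 520.0, 480.0, 9000.0, 9100.0, 8800.0],
})

wide = long.pivot(index="account_id", columns="month", values="balance")
wide.columns = [f"balance_{m}" for m in wide.columns]  # balance_2024-01, ...
print(wide.reset_index())
```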
My recommendation is that multidimensional access be used for metrics-based data marts,
supporting managers/strategists and business analysts. This is the stage of the process where
the data entry points (what you are selecting on) and access patterns are most predictable.
This can be implemented via multi-dimensional databases for managers and strategists, who
need quick but fairly standardized access to pinpoint performance issues across the dimensions
of the organization. An underlying star schema could then provide the flexibility and drill
through capability needed for business analysts to be able to drill down to the next level of root
causes.
Decisions will need to be made as to the degree of denormalization to be introduced into the
overall system architecture. In denormalized tables, you are accepting redundancy in exchange
for efficiency. This can significantly increase DASD storage costs, and also introduce the
possibility of inconsistency within the database. In addition, pulling together data from entities
that have one-to-many or many-to-many relationships can still be complicated even after
denormalization, and users must take care when reporting on data elements that might
repeat (i.e., customer-related data elements on account-related tables). My recommendation for
handling the integration of data is as follows:
The data warehouse should generally be normalized and always be at the lowest level of
detail. This ensures ease of update, a single source for each data element, and
consistency across tables.
Denormalization should generally take place through the development of data marts and
OLAP solutions based directly on needs of individual user segments in support of their
specific information activities.
Virtual data marts can be extremely effective. By joining co-hosted or federated
normalized tables into the main fact table of a star schema data mart, you can take
advantage of dimensional access paths into the data while eliminating the need for
duplicate development to load both marts and the warehouse.
By using denormalization for highly specific applications, you can potentially focus on small
subsets of data elements or records, you can better understand usage patterns and build around
those patterns, and you can verify usability through having a specific group of users do the
acceptance testing.
For sub-process three, which is the identification of specific behaviors that can be changed to
drive changes to the root causes and subsequently to the high-level performance issues, there
may be a variety of both internal and external sources of data. The primary internal sources of
data will be the normalized data warehouse, and possibly a ‘behaviors mart’ that captures
standardized measures of common behaviors that support production models. To support the
development and testing of behavioral analytic processes, you will generally need an analytical
workspace. An analytical workspace is critical for environments with a significant population
of skilled information analysts. This is a shared environment where users can dynamically
integrate data from multiple sources, and be able to run sophisticated data mining and
clustering software to identify patterns in the data. This scenario works best when the
environment is used for dynamic data integration and analysis to leverage external data
sources, and for research and development of new statistical models. A large temptation is to
leverage this environment for the production execution of scoring processes and behavioral
models. While this offers some short-term expediency, the lack of production controls and the
elevated risk of data quality problems due to uncommunicated data changes tend to more
than outweigh any benefits.
Because of trends in hardware costs, the big trade-off here is whether to utilize high-power
workstations on each person’s desk, or create a shared Symmetric Multi-Processor
environment in which multiple users share the same large computing resource. Again, the
architects will need to look at the dynamics of how people collaborate and share data. A high-
degree of collaboration and numerous shared data sources would tend to point to a single SMP
environment, while more independent work with communication primarily of small sets of
end-products would point more towards a networked series of workstations. In general, a
single SMP environment can take individual, parallelizable jobs and execute them faster. In
some cases, this can also be done with networked workstations by setting them up as a grid,
but this is of benefit only for algorithms that are grid-suitable and requires technology that is
not as mature.
The tools required for this range from powerful procedural programming type languages that
support data integration, scoring, and advanced computations, to advanced statistical and data
mining software applications. These tools would need flexible access to data. This would
include both the data warehouse and external data sources. A ‘behaviors mart’ developed in
support of production models would also be of significant benefit in model development by
serving as a source of standard behavioral measures at an atomic level. For a financial
institution, this mart could include facts such as percentage of different transaction types
handled by different channels, number of months since last late payment, monthly variance of
deposit balances, and numerous others. Leveraging these measures will improve productivity
and consistency for modeling activities, and will facilitate the movement of models into a
production environment.
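A minimal sketch of how such standardized behavioral measures might be computed, assuming invented input tables and measure names (channel mix percentages and months since last late payment):

```python
import pandas as pd

# Populate a 'behaviors mart' with standardized measures at the customer
# level. Inputs, measure names, and the reporting month are illustrative.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "channel":     ["branch", "online", "online", "atm", "online"],
})
payments = pd.DataFrame({
    "customer_id":     [1, 2],
    "last_late_month": ["2023-09", "2024-01"],
})

# Percentage of transactions handled by each channel, one row per customer.
channel_mix = (txns.groupby("customer_id")["channel"]
                   .value_counts(normalize=True)
                   .unstack(fill_value=0.0)
                   .add_prefix("pct_"))

# Number of months since the last late payment, relative to a reporting month.
as_of = pd.Period("2024-06", freq="M")
payments["months_since_late"] = [
    (as_of - pd.Period(m, freq="M")).n for m in payments["last_late_month"]
]

behaviors_mart = channel_mix.join(payments.set_index("customer_id"))
print(behaviors_mart)
```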
Implementation requires that the results of the strategy development be packaged and
transmitted to the point of execution. If the result is a direct mailing, the identifying
information for the individuals being contacted and the specifics of the message/offer must be
transmitted to whoever will be doing the fulfillment. If the result is a pricing change, any
communication of the changes must be implemented through a communications channel, and
the pricing data must be updated in the appropriate system. The architect must understand how
communication and data transfers need to take place to make the execution of these business
rules changes as quick and accurate as possible.
A potentially critical piece of the puzzle is the operational data store, for storing near real-time
data on significant events and behaviors. While many now are more inclined to integrate this
information into the data warehouse itself as part of an ‘active data warehousing’ paradigm, the
key is not so much where it is but how it can be used. Depending on the degree to which this
information needs to be integrated with other data warehouse information, it may be sufficient
to have a separate ODS which can be dynamically linked to the data warehouse via a federated
middleware scenario. Integrating this data directly into the data warehouse itself provides for
tighter coupling of information and processes and allows for more robust and better performing
integration of current data with historical context.
The ODS can actually serve multiple duties. From an analytical information process
perspective, it will support the measurement of behaviors in the final sub-process, allowing
quick feedback as to the effectiveness of the strategy and actions and the ability to assess and
re-apply learnings. This requires that the ODS have access to tagged sets of customers or
accounts to allow the behaviors to be measured for those specific subsets. It also requires that
all ODS data be consistent with data warehouse data both in terms of completeness and data
definitions. For specific operational reporting requirements, small ‘oper-marts’ can be
extracted from this data to simplify specific types of regular reporting processes. From an
operational perspective, the ODS can be used to collect and filter events that can drive event-
triggered operational information processes.
Delivery of operational reporting for the fifth sub-process may be effectively implemented by a
report management and distribution infrastructure. Some reporting packages allow highly
formatted reports to be created, stored in a library, and then distributed either through a push or
a publish-subscribe scenario via the web. User IDs will limit what any individual user has
access to. Because of the inability of users to effectively interact with data using this scenario,
it is generally good for simple, repetitive processes like tracking of behaviors.
When considering tools, the important thing is to understand the implicit mapping of tools to
user segments and activities, which will drive the critical capabilities and usability parameters.
Always evaluate a tool on the subset of capabilities and characteristics that are important to the
segments and activities for which it is intended to be used, not necessarily for its broader
spectrum of capabilities. For example, you may have a list of 50 capabilities that may be
incorporated into a query tool. You look at who will be using it (based on user segment), and
what types of activities will need to be executed. This will drive the specific capabilities that
are actually relevant, and also allow you to weight them as to their relative importance. If it
has been decided to go with separate OLAP and query tools, then you do not need a query tool
with OLAP capabilities. The OLAP capabilities would be weighted zero for that evaluation,
and a separate OLAP tool evaluation would be undertaken. However, if interoperability of
OLAP and query tools is essential due to the manner in which those users will be working
together, then it may be necessary to actually select a single tool to do both, or to select a
single vendor suite that encompasses both to ensure seamless integration.
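A minimal sketch of this weighting idea, with invented capability names, weights, and ratings (the zero weight on OLAP reflects the separate-tool decision described above):

```python
# Capability-weighted tool scoring: weights come from the user segments and
# activities the tool must serve; irrelevant capabilities are weighted zero.
weights = {"ad_hoc_sql": 5, "star_schema_support": 4, "olap": 0}
tools = {
    "Tool A": {"ad_hoc_sql": 8, "star_schema_support": 6, "olap": 9},
    "Tool B": {"ad_hoc_sql": 7, "star_schema_support": 9, "olap": 2},
}

def weighted_score(ratings: dict) -> float:
    total = sum(weights.values())
    return sum(weights[c] * ratings.get(c, 0) for c in weights) / total

for name, ratings in sorted(tools.items(), key=lambda t: -weighted_score(t[1])):
    print(f"{name}: {weighted_score(ratings):.1f}")
```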
The key to tools evaluation is that you evaluate tool scenarios (i.e., plausible combinations of
tools that would cover the spectrum of your requirements), rather than just individual tools in
isolation. This will ensure that all needed capabilities are covered, and that tool
interoperability is appropriately considered. This way you could readily identify suite benefits:
Individual productivity
Data sharing and transfer capabilities
Match to processing requirements
It will also allow you to look at the overall cost associated with the whole suite, which would
include:
Combined infrastructure
Licensing
Training
Ongoing support
Here is a good generalized approach for mapping of business requirements to information
structures, which will then drive the tool selection process:
Business Requirement to Architecture mapping:
Identify high-level metrics that can be used to gauge the performance of the organization,
and dimensions that enable pinpointing of issues and targeting of accountability →
Multi-dimensional summary (cube)
For each metric, identify the key components (or drivers) at a high enough level to be
meaningful, but at a low enough level to drive strategic and tactical actions →
Denormalized, pre-computed, summarized information
For all entities, determine the key behavioral components that will impact their relationship
with the company and their resulting cost to serve, revenue stream, and profitability, and use
this to develop strategies → Normalized detail, integrated for key entities
The following diagram shows a possible information architecture that has components that
support each individual sub-process and user segment. It starts with managers and strategists,
who can access their data through scorecards, dashboards, and standard reports. These are
powered by cubes that enable extremely fast response times. Analysts can access the cubes, or
drill back even further to a star-schema metrics mart, which allows much more flexibility in
terms of the conformed dimensions used and number of metrics accessible. This also supports
drill-through back to the normalized data warehouse, or virtual data mart views. Finally, a
normalized data warehouse is leveraged for modeling and complex data manipulation.
[Figure: Sample Segment/Activity Focused Information Environment. An enterprise warehouse
of normalized detail data feeds an analytics engine that generates all metrics and summaries
once the needed warehouse data is loaded. A ROLAP metrics/scores table surrounded by
standard dimensions supports filtering, aggregation, and analytical processing; cubes are built
around customer metrics and dimensions from the ROLAP structure, plus external data such as
program targets, response rates, and industry statistics; applications deliver dashboards and
standard reports. The figure annotates three perspectives. ETL perspective: production ETL
processes, plus production data collection, reporting, and presentation processes. Data access
perspective: relational access using any table as a starting point yields flexibility; ROLAP
access reaches metrics via any standard dimension, with drill-through to the warehouse
providing a 'virtual extended data mart'; cube access, geared towards repetitive data needs,
provides high performance and integrates non-customer/external data; applications distribute
data to completely non-technical staff, including senior managers, via a portal. Process
perspective: highly skilled technicians do detailed behavioral reporting, correlation searches,
'what if' analysis, and predictive modeling; skilled analysts research performance anomalies
and drill into root causes down to the individual account level; analysts use a simple interface
to evaluate programs or segments and identify 'cells' (intersections of dimensions) with
performance issues; managers get a highly visual, intuitive view of summary-level data to spot
areas warranting further research.]
Information Manufacturing and Metadata
What most Business Intelligence managers do not realize is that they are not in the
programming business – they are in the manufacturing and distribution business! There is no
conceptual difference between providing users with information versus providing consumers
with a broad suite of tangible or intangible products. Essentially, the BI/DW group collects raw
materials (data) from its providers (source application systems). It goes through a
manufacturing process which integrates and synthesizes the data to produce information
deliverables. It accumulates this information in a bulk warehouse, and can then pass this on to
different types of retail outlets (data marts and OLAP), organized around convenience and ease
of access. Finally, customers either access information directly through self-service delivery
channels (tools), or have information provided to them through value-added resellers
(programming specialists), who prepare and deliver spreadsheets, reports, decks, etc.
In spite of all of our advances in tools, the basic concepts behind how we approach the whole
process of information systems are often archaic. Data warehousing organizations often
develop their information as a series of threads. A project will be defined that identifies the
data elements that must be produced as output, and how they will be delivered. A group of
programmers is assigned, who will build a beginning-to-end process which handles all of the
required inputs, collects all of the needed data, pulls it together into temporary files, and finally
produces outputs. This is done independently of the other projects that are going on. What
kinds of problems does this lead to?
There is potential replication of effort across project teams within the IW
organization.
Processes produced could be inefficient due to touching the same data multiple times
across projects.
Process structures are often left to the discretion of the implementing team, and may
be inconsistent.
This is not conducive to implementing broad and consistent data quality checking
and correction.
Because many people are touching the same data, it is much more difficult to make appropriate
corrections to programs in response to input data changes. Also, we are leaving ourselves open
to inconsistencies across data repositories and even tables.
Manufacturing can inspire us about how things can be done better. First of all, the manufacture
of an end-product is not done in a vacuum by an isolated team. Manufacturing is focused
around maximizing production efficiency. It means taking raw inputs and producing sub-
assemblies, which can then be inventoried and used in building higher level sub-assemblies. In
this scenario, the key is designing reusable, general purpose components. When producing a
car on an assembly line, components are incrementally added until finally you have your
finished product. If you are building a coupe, sedan, and convertible of the same model, you
share as many components and assembly processes as possible on the same assembly line,
diverging only where necessary to support fundamental differences in the end products.
Having three different and independently developed assembly lines for the three automobiles
would dramatically drive up costs due to the proliferation of additional parts that have to be
inventoried, the additional people needed to operate the assembly lines, and the reduced
flexibility.
This analogy can be extended to cover synchronous, period-based EDW data updates (i.e., data
is updated for the same period across sources at the same time). The core processes which
produce information can also be organized into an assembly line, which progressively builds
information by combining atomic data element instances into increasingly more complex
information sub-assemblies. Again, our objective would be to reuse as much as possible across
processes to improve productivity and control costs, diverging only where there are
fundamental differences in the output information. Intermediate results are not only
permanently saved, but integrated into our data stores and described in the metadata repository.
This allows them to be easily reused across existing processes and leveraged as new processes
(supporting models, data marts, reporting processes, etc.) are developed. This gives you
economies of scale, eliminating the need to replicate that portion of the design and
development effort across the other product lines. It also promotes consistency, since if you get
it right once, it will be correct in every end-product of which it is a part.
When utilizing an assembly line to produce an end product, the production process is divided
among a series of discrete assembly stations. At each one, activities are performed in the
proper sequence and the end-product is incrementally built. The actions being done together at
the same assembly station are selected because of some natural connection. After the activities
of one station are concluded, quality is verified and the product moves on to the next station.
In ETL design, I refer to the individual steps as layers. As data passes through each successive
layer, value is added that moves the data closer to an information deliverable. Through the
organization of processes into layers, you are trying to achieve:
Grouping of like processes together
Minimization of number of external touch points
Elimination or reduction of repeated handling of data
There is no single solution for how things should be organized; it depends on the nature of
the end-to-end processing required for any given company. The following, however, is a
good general-purpose layering scenario that most Business Intelligence teams can adopt and
modify to organize their ETL processes (a minimal pipeline skeleton follows the list):
1. Data Acquisition, or the extraction of data from the operational environment and
transport of data to the analytical environment.
2. Data Commonization, or standardization across diverse sources
3. Data Calculation, or creation of meaningful new elements
4. Data Integration, or creating/validating data linkages and populating tables
5. Information Assembly, or summarizing across tables to create complex metrics and
aggregations
6. Information Delivery, or populating summarized/aggregated structures for quick and
easy access by end-users
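As a rough illustration (not a production design), here is a minimal Python skeleton of the six layers, where each layer consumes only its predecessor's output; the record fields and logic are invented:

```python
# Each layer is a function that depends only on its input, mirroring the
# assembly-line principle. Bodies are placeholders for real ETL jobs.
def acquire(sources):    return [rec for src in sources for rec in src]                       # 1
def commonize(records):  return [{**r, "code": str(r["code"]).upper()} for r in records]      # 2
def calculate(records):  return [{**r, "balance": sum(r["components"])} for r in records]     # 3
def integrate(records):  return {(r["system"], r["acct"]): r for r in records}                # 4
def assemble(warehouse): return {"total_balance": sum(r["balance"] for r in warehouse.values())}  # 5
def deliver(metrics):    print("mart row:", metrics)                                          # 6

sources = [[{"system": "A", "acct": 1, "code": "dd", "components": [100.0, 25.0]}],
           [{"system": "B", "acct": 7, "code": "sv", "components": [40.0]}]]
deliver(assemble(integrate(calculate(commonize(acquire(sources))))))
```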
The six layers are meant to drive how you structure your ETL processes and how you organize
your support team. At a high level, the flow of data would look something like this:
[Figure: Analytical Information Environment, a high-level look at the synchronous, periodic
ETL process flow. Operational data sources feed (1) data acquisition into the landing area;
(2) commonization and (3) calculation populate the initial staging area; (4) integration loads
the data warehouse alongside prior-month reference data; (5) information assembly and
(6) information delivery populate data marts and cubes, which users reach through data access
tools and platforms. Data quality monitoring spans the layers and records quality statistics in
the metadata repository.]
In this diagram, the process layers are shown in olive and the repositories (including metadata)
in aqua. Also shown are data quality monitoring, which integrates with the other layers, and
information delivery via data access tools.
In the section on the Information Value Chain, we identified a series of four components that
describe how value is added to information in the process of preparing information
deliverables. All integrational, computational, and structural value-add is actually produced
through the ETL process. While often excluded from traditional views of ETL, I include data
access tool and platform support within the final ETL layer, Information Delivery. The
following table cross-references the activities of each ETL layer with the value-add
components of the information value chain:
Data Acquisition (integrational value-add): interface with production systems and acquire
extracts; land extracts on the ETL platform.
Data Commonization: map disparate systems into common definitions and formats and
compute commonized values (integrational value-add); build and load common data staging
tables (structural value-add).
Calculation: compute new relevant data elements locally, within tables (computational
value-add); compute primary and secondary keys as necessary (integrational value-add).
Integration: populate primary warehouse tables with normalized data (integrational and
structural value-add).
Information Assembly: compute critical business metrics and measures globally, across tables
and time periods (computational value-add); compute conformed facts and conformed
dimensions for multidimensional structures, plus any summarization across dimensions
(structural value-add).
Information Delivery: populate marts and cubes (structural value-add); update tool semantic
layers and prepare reports, templates, and dashboards (BI tool value-add).
Let us look in more detail at the specific activities that you will need to incorporate into the
various ETL layers:
Data Acquisition
This is the most fundamental layer of processing, but in many ways it is the most critical.
Those responsible for acquisition must obtain data from all internal and external sources and
populate it into the decision support environment. This is the starting point from which
everything else progresses. There are three key responsibilities for this layer:
Developing and operating the processes that collect and transport data from production
application sources into the decision support environment.
Maintaining the lines of communication with the production staff so that they are aware of all
changes, and then altering metadata and passing information on changes to subsequent
processing layers. Note that this is the only layer that performs this external communication.
Maintaining data transport checks to ensure that everything is extracted and received in the
landing area completely.
It is the vigilance of these individuals that will determine the ability of the process to respond
to changes in data and to eliminate potential data quality problems. The output from this layer
is the loading of images of the extract data into the analytical environment landing area.
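A minimal sketch of such a transport check, assuming the source sends a manifest of control totals alongside the extract (the manifest format is an assumption, not a standard):

```python
# Confirm that what landed matches the control totals sent with the extract.
def verify_landing(manifest: dict, landed_records: list) -> None:
    actual_count = len(landed_records)
    actual_sum = round(sum(r["amount"] for r in landed_records), 2)
    if actual_count != manifest["record_count"]:
        raise ValueError(f"count mismatch: {actual_count} vs {manifest['record_count']}")
    if actual_sum != manifest["amount_total"]:
        raise ValueError(f"control-total mismatch: {actual_sum} vs {manifest['amount_total']}")

manifest = {"record_count": 2, "amount_total": 125.50}
landed = [{"amount": 100.00}, {"amount": 25.50}]
verify_landing(manifest, landed)  # raises if the extract arrived incomplete
print("landing verified")
```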
Data Commonization
This step is extremely important for large companies that have multiple legacy application
systems covering the same function, such as a bank built on acquisitions and mergers that may
have several deposit systems, but it is also useful for ensuring data consistency across different
functional applications. Data commonization refers to identifying identically purposed data
elements that may have different formats or code values on different systems, and converting
them into a single, unified format and definition. Note that commonization should only occur
when multiple systems can be mapped into data elements that mean exactly the same thing. If
there are any definitional differences, they should be mapped into different data elements to
ensure that downstream coding recognizes the differences. For example, in banking, one
deposit system might store month-to-date debits, while another might store statement-cycle-
to-date debits. My recommendation would be to keep them in two different data elements
rather than trying to combine them into a single one at this level. This gives the downstream
processes more flexibility in terms of how these differences should be handled. The basic
responsibilities of this layer are:
Developing and maintaining the processes that provide consistently defined and formatted data
that facilitates downstream activities.
Making sure any upstream changes are reflected in the outputs from this layer.
Performing data integrity testing to ensure that no quality issues were introduced within this
layer.
Note that the outputs from this layer will be stored in a set of staging tables along with all
information pulled directly from the production systems via the landing area.
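A minimal sketch of commonization, with invented code mappings; note that the two differently defined debit counters are deliberately kept as separate elements:

```python
# Identically purposed status codes from two deposit systems map to one
# standard code set. Mappings and field names are illustrative.
STATUS_MAP = {
    "sysA": {"O": "OPEN", "C": "CLOSED"},
    "sysB": {"1": "OPEN", "2": "CLOSED", "3": "DORMANT"},
}

def commonize(record: dict) -> dict:
    out = {"account_id": record["account_id"],
           "status": STATUS_MAP[record["source"]][record["status"]]}
    # Do NOT merge these: they mean different things on different systems.
    if record["source"] == "sysA":
        out["mtd_debits"] = record["debits"]
    else:
        out["cycle_to_date_debits"] = record["debits"]
    return out

print(commonize({"source": "sysA", "account_id": 1, "status": "O", "debits": 4}))
print(commonize({"source": "sysB", "account_id": 2, "status": "3", "debits": 9}))
```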
Calculation
Once data is in a common format, basic calculations can be executed. In the most efficient
scenario, a single pass is made through each table, during which all calculations that do not
require information from other tables occur. For example, in banking, the outstanding balance
of a loan may actually be comprised of a dozen different component variables, which need to
be numerically combined to provide the overall balance. Note that some of these calculations
will be necessary in order to compute keys and support subsequent integration and assembly
steps. Basic responsibilities of this layer are:
Developing and maintaining the processes that perform a series of basic, common calculations
that will be used downstream.
Making sure any upstream changes (i.e., changes to source data files, including new codes and
changed formats) are reflected in the calculation algorithms, and that changes to the algorithms
are communicated downstream.
Performing data integrity testing to ensure that all calculations have executed error-free.
Outputs from this step will generally be stored in the staging tables along with the
commonized data.
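A minimal sketch of the single-pass idea, with invented component fields for a loan record:

```python
# All within-table calculations for a loan record happen in one traversal.
def calculate_row(row: dict) -> dict:
    # Outstanding balance combined from its component variables.
    row["outstanding_balance"] = (row["principal"] + row["accrued_interest"]
                                  + row["fees_due"] - row["unapplied_payments"])
    # A key computed here so the integration layer can link tables later.
    row["loan_key"] = f"{row['system']}-{row['loan_no']:08d}"
    return row

loans = [{"system": "LN1", "loan_no": 42, "principal": 9500.0,
          "accrued_interest": 31.25, "fees_due": 15.0, "unapplied_payments": 100.0}]
staged = [calculate_row(dict(r)) for r in loans]  # one pass through the table
print(staged[0]["loan_key"], staged[0]["outstanding_balance"])
```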
Integration
The integration layer is where we populate the keys for all tables, and ensure that all linkages
of data, whether across entities within the current period or across time periods, work
correctly. This ensures referential integrity across the entire data warehouse. Note that
strategies for developing keys can vary sharply, depending on the nature of the input data. In
some cases, a single data element can be used as a key field with no modification. In other
instances, multiple data elements may be concatenated to form a compound key, or may be
used to calculate a new, non-intelligent key. Basic responsibilities of this layer are as follows:
Developing and maintaining the integration and cross-reference processes that compute key
fields.
Making sure any upstream changes are reflected in key fields.
Performing bi-directional tests to ensure that all keys are defined consistently across tables that
need to be joined, and that the current time period can be joined back properly to prior time
periods.
Loading data into relational tables in the data warehouse. At this point, data is loaded into the
live data warehouse tables.
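A minimal sketch of the bi-directional referential integrity test described above, using invented compound keys:

```python
# Every child key must resolve to a parent, and (where business rules demand
# it) every parent should appear in the child table. Contents are illustrative.
customers = {("CIS", "C100"), ("CIS", "C200")}           # compound natural keys
accounts = [{"cust_key": ("CIS", "C100"), "acct": 1},
            {"cust_key": ("CIS", "C300"), "acct": 2}]     # C300 is an orphan

orphans = [a for a in accounts if a["cust_key"] not in customers]
childless = customers - {a["cust_key"] for a in accounts}
print("orphan accounts:", orphans)        # fails the forward join
print("childless customers:", childless)  # fails the reverse join
```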
Information Assembly
The information assembly layer consists of any type of calculation, summarization, or
aggregation that is implemented across entities or across time periods. Note that assembly is
by nature a very broad term. Outputs from this process are the metrics, segments, and scores
that users will actually incorporate into their business processes. These information
deliverables are populated into the data warehouse, and may subsequently make their way into
data marts, OLAP cubes, and/or reports. The information assembly layer performs the
following functions:
Builds and maintains production processes that summarize and aggregate data, leveraging all
raw inputs and information sub-assemblies available throughout the information environment
to build information deliverables.
Makes changes to processes based on any changes to inputs or changes in user
requirements/specifications.
Validates all computations and summaries to ensure consistency with inputs and history.
This area is the most visible to users, and especially to senior managers, since it is responsible
for the calculation of KPIs and other business-critical metrics. Once this step is completed,
information is ready to be organized and structured for delivery.
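A minimal sketch of an assembly-layer metric built across time periods (a 12-month average balance), with invented data:

```python
from collections import defaultdict

# One row per customer per period, as it would come from the warehouse.
monthly_balances = [
    {"customer": 1, "period": f"2024-{m:02d}", "balance": 100.0 + m} for m in range(1, 13)
]

sums, counts = defaultdict(float), defaultdict(int)
for row in monthly_balances:
    sums[row["customer"]] += row["balance"]
    counts[row["customer"]] += 1

avg_12m = {cust: sums[cust] / counts[cust] for cust in sums}
print(avg_12m)  # a deliverable metric, ready to be populated into the warehouse
```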
Information Delivery
Information delivery optimally structures information and stores it in a location where users
can retrieve and manipulate it using their tools of choice. This includes information that is
delivered via:
Data marts/OLAP cubes
In-memory databases
Executive information systems, dashboards, and scorecards
Web-delivered reports
The manner in which information is delivered will depend on the type of information, the type
of users, and the type of usage. Delivered information is built from standard metrics and
dimensions that are calculated in the information assembly stage and stored in the data
warehouse. In some cases, those metrics may be supplemented by customized metrics prepared
specifically for a report or data mart. My recommendation is that this be avoided unless there
is absolute certainty that this information will never need to be shared. The same information
often needs to be delivered in multiple places to satisfy different user needs, and not having it
prepared and stored in the data warehouse or mart structures may limit flexibility and result in
replication and inconsistency.
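A minimal sketch of the delivery layer populating a pre-aggregated mart table from standard warehouse metrics, with an invented schema:

```python
import sqlite3

# Dashboards read the mart table directly, with no further joins needed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warehouse_metrics (region TEXT, product TEXT, revenue REAL);
INSERT INTO warehouse_metrics VALUES
  ('East', 'loans', 10.0), ('East', 'deposits', 4.0), ('West', 'loans', 7.0);
CREATE TABLE mart_region_summary AS
  SELECT region, SUM(revenue) AS revenue
  FROM warehouse_metrics GROUP BY region;
""")
print(con.execute("SELECT * FROM mart_region_summary").fetchall())
```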
Implementation of this discrete-layer paradigm can significantly improve the internal
efficiency of an information management department:
Each layer is dependent only on the inputs it receives from its sources, not on how
the sources operate.
Roles are precise and well-defined.
The manager of each layer is empowered to search for efficiencies across a wide
range of homogeneous processes.
The quality can be verified before and after each layer to ensure that no errors are
introduced, and the manager of each layer is held accountable for its outputs.
An extremely critical part of the whole development and deployment process is metadata.
Metadata actually serves a dual role, supporting both information users and information
developers. For information developers, metadata is used to manage the entire inventory of
data inputs, transformations, and intermediate results. Information that must be captured
includes:
Business definitions, in sufficient detail to allow users to completely understand a
data element, its applications, and its implications.
Transformation rules that dictate exactly how a data element is computed, along
with bi-directional derivation linkages between any data element and its
components.
A complete set of code values and formats.
In the context of the information manufacturing process, it is critical that not only the
deliverables that appear in the final repositories be documented, but also all of the
intermediate data elements. This supports the reuse of these data elements in downstream
calculations, and supports the continuity of the 'impact chain', which identifies the
information deliverables that are impacted when source data elements change.
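A minimal sketch of metadata entries that record derivation linkages so the impact chain can be walked; the fields and element names are illustrative, and the traversal assumes the linkages are acyclic:

```python
from dataclasses import dataclass, field

@dataclass
class ElementMeta:
    name: str
    business_definition: str
    transformation_rule: str = ""
    components: list = field(default_factory=list)  # upstream inputs

repo = {
    "mtd_debits": ElementMeta("mtd_debits", "Month-to-date debit count"),
    "avg_debits_12m": ElementMeta("avg_debits_12m", "12-month average debits",
                                  "mean of last 12 mtd_debits", ["mtd_debits"]),
    "activity_score": ElementMeta("activity_score", "Customer activity score",
                                  "model over averages", ["avg_debits_12m"]),
}

def impacted_by(source: str) -> set:
    """All deliverables downstream of a changed source element."""
    hits = {name for name, m in repo.items() if source in m.components}
    for h in set(hits):
        hits |= impacted_by(h)
    return hits

print(impacted_by("mtd_debits"))  # {'avg_debits_12m', 'activity_score'}
```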
In many cases, ETL development tools have their own metadata management. This allows
source data to be queried and profiled, with the statistical information about this data
incorporated into the metadata repository along with information on its definition and origin.
This profiling can show how categoricals are split among their discrete values, or show the
statistical distribution of a numeric value.
All information environments must be organized around a metadata repository to maintain
critical knowledge about the data. With an appropriate metadata and ETL development tool,
we can implement a highly efficient information engineering process. We start with an
information blueprint, which involves designing delivered information elements, identifying
the series of components that need to be collected or assembled to build them, and designing
the processes that synthesize them. This blueprint is then embodied in the metadata. Thus,
metadata should not just be a mirror of the development process; it should be the driver of the
development process.
When producing metadata for a project, you start with the business definitions for the
information deliverables that will be presented to end-users. From there, the information
engineering process works backwards to identify the various information sub-assemblies that
should be intermediate components of these deliverables. We start with all of the sub-
assemblies that are pulled together in the information assembly layer. As these are identified,
some sub-assemblies can be acquired from other processes and used as-is, and some will need
to be developed. These may also require a set of linkages that must be developed by the
integration layer, and inputs that must be computed in the calculation layer. Each layer then
sets the requirements for what it must receive from the prior layer.
However, this is just a first iteration. Once you get down to the acquisition layer, the data
elements you need may not be there, or may not be exactly what you wanted. At this point,
different sourcing scenarios are proposed, and their effects are bubbled up through the process
to determine the impacts. A negotiation process then ensues to determine the actual inputs to
be used and the actual hand-off points and sub-assemblies that will be developed.
Implicit in the definitions of the various information sub-assemblies is an information
manufacturing process that builds the information. Processing considerations may therefore
cause the level at which information sub-assemblies are defined to be changed, or may
necessitate additional intermediate information components. In conjunction with the definition
of data elements, high level process flows must be developed which identify inputs,
programs/processes, and outputs.
Once this is completed, the next phase is to completely populate the metadata repository with
all definitions, transformation rules, and derivation linkages. To ensure that metadata is
consistent with what is delivered, the metadata should directly serve as the programming
specifications for the data elements that are being produced. This applies to changed data
elements as well. The metadata should be rewritten to incorporate the latest
definitions, calculations, and components, and this should form the basis for any programming
changes.
Once metadata is captured, technicians from each implementation layer go through the process
of defining their roles and outputs to create the required elements. They then develop any new
transformation programs/processes, or plan changes to existing ones as defined in the
information flow documentation. Note that ideally, the tools used for ETL implementation
should be integrated with the metadata, so the transformation rules captured in the metadata
can be readily converted into the code for implementation.
“Real Time” or Active Data Warehouse ETL
In essence, a “real time” or active data warehouse is a hybrid of an operational data store and a
data warehouse. It supports the organization's current view, showing the most recent
transactions or events, while also supporting a historical view of longer-term trends.
I put real time into quotes because whether it is truly real time or just close to real time is a
matter to be determined based on planned usage and associated cost-benefit. In my experience,
real time has proven to be too expensive and of insufficient value to justify. When loading a
data warehouse, the cost per record loaded is generally inversely proportional to the number
loaded together. Therefore, low-latency micro-batching can be a highly cost effective
alternative to real-time loading of individual transactions.
The benefits of this type of scenario are substantial for a data warehouse mature enough to
take advantage of them. Daily or even intra-day status of sales, inventories, or transactions can
be tracked, and events can be detected and used to trigger quick responses.
Information can be dynamically provided to people accessing web sites according to not only
their historical accesses, but also their current clickstream!
By nature, this type of update scenario (which I refer to as low-latency asynchronous updates),
carries with it some additional complexities that we do not need to worry about in the
synchronous update scenario. Because it is asynchronous, you cannot cross-validate data in
different tables because they could be in different states, depending on what has or has not
been updated. This means that as you update, you can only validate data locally, or on the
specific record(s) you are adding or changing.
To provide a degree of control and confidence in the data similar to what my original process
provides, we need to modify it into a two-step process. The first is a data acquisition step,
which takes new data and adds it into the data warehouse. The second is a computational step,
where metrics are computed and aggregations/summaries are made. In this second step, data
can be “checkpointed” so that cross-table synchronization and data validation can take place.
This cross-table synchronization could include referential integrity verification, as well as
consistency checking of data across related tables. Results can be noted in data quality reports,
and can optionally result in questioned data being removed from production tables and placed
into “suspense” tables to be researched before being returned to production. Once appropriate
data corrections are made, assembly of key metrics can proceed, and then data aggregations
and summarizations can occur in preparation for delivery. The following two diagrams show a
high-level look at the two stages of processing of low-latency asynchronous data feeds.
Stage 1 Processes
Low-latency data can come from application systems in one of two ways:
1. It can be pre-consolidated into micro-batches at frequent intervals, which can be processed
and loaded as batches. This is represented by the path starting with (1b).
2. It can be placed into queues where it can be subscribed to and read. Data from a queue can
then either be direct loaded (the process-and-load step incorporates all commonization,
calculation, local validity testing, and loading into the database) or else collected into a
micro-batch in the landing area and then batch processed and loaded (see the sketch below).
This is represented by the path starting with (1a).
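A minimal sketch of path (1a)'s collector, flushing a micro-batch when either a size or a latency threshold is reached; the thresholds are illustrative tuning parameters:

```python
import time
from queue import Queue, Empty

# Read transactions from a queue; flush a micro-batch on size or latency.
def collect_micro_batches(q: Queue, max_rows: int = 500, max_wait_s: float = 5.0):
    batch, deadline = [], time.monotonic() + max_wait_s
    while True:
        try:
            batch.append(q.get(timeout=max(0.0, deadline - time.monotonic())))
        except Empty:
            pass  # latency threshold reached with nothing new on the queue
        if len(batch) >= max_rows or time.monotonic() >= deadline:
            if batch:
                yield batch  # hand off to the batch process-and-load step
            batch, deadline = [], time.monotonic() + max_wait_s

q = Queue()
for i in range(3):
    q.put({"txn_id": i})
print(next(collect_micro_batches(q, max_rows=3)))
```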
[Figure: Analytical Information Environment, a high-level look at the low-latency
asynchronous ETL process flow, stage 1. Path (1b): micro-batches from operational systems
are acquired into the landing area, then flow through (2) commonization and (3) calculation in
the initial staging area before the (4a) load into the data warehouse. Path (1a): queued
transactions are either collected from the queue into a micro-batch, or processed and loaded
directly (steps 1/2/3/4a combined). Data quality monitoring records quality statistics in
metadata; users reach the warehouse through data access tools and platforms.]
Stage 2 Processes
This stage is where the asynchronously received data is checkpointed to verify internal
consistency:
1. Bi-directional joins are tested to ensure referential integrity.
2. For related data in related tables, data cross-checks can be implemented to verify
consistency.
Once data quality and consistency are confirmed, data assembly and delivery steps can be
performed as in the synchronous, periodic update scenario.
[Figure: Analytical Information Environment, a high-level look at the low-latency
asynchronous ETL process flow, stage 2. (4) Referential integrity and cross-table validation
run against the data warehouse and prior-checkpoint reference data, feeding data quality
monitoring and quality statistics in metadata; (5) information assembly and (6) information
delivery then populate data marts and cubes for users via data access tools and platforms.]
Data Quality Concepts and Processes
Data quality is something that many talk about, but few are highly effective at. Just as with the
manufacturing of consumer products, quality orientation must permeate the entire information
manufacturing process. This means that all involved, from beginning to end, must think in
terms of quality as a primary objective.
Unfortunately, quality is often the least understood and lowest priority aspect of information
delivery. Information is a tenuous concept, and quality of information is even more tenuous.
Many organizations do not have even basic quantitative operational quality metrics such as
defect counts or defect rates, let alone a true understanding of the cost of poor quality or the
impacts of defects on the business value of the information deliverables. Thus, it is much easier
for a data warehousing group to concentrate on measurable things like delivery dates, number
of data elements, or number of terabytes.
The big question on data quality is: How do you quantify its impacts? It is only through
quantification of the benefits that are at risk that you can determine what an appropriate level
of expenditure is to identify and correct data quality problems. Let’s start by looking at the four
different ways in which data quality problems can cost an organization real dollars:
They may prevent you from applying information in specific ways that would
generate profitability, either based on known data problems or on a general lack of
faith in the data based on prior problems and perceived poor quality.
The process of correcting the data quality problems may delay implementation or
execution of critical information processes, which may defer the benefit stream and
reduce its long-term value.
They may incur a significant repair cost on the back end by forcing users to install
temporary (or even permanent) workarounds into their processes.
They may impact the accuracy of the business rules derived from analytical
information processes, which reduce profitability by causing you to take incorrect
actions.
The worst case is having data problems that nobody knows about. These can cause your
analytical and/or operational information processes to work incorrectly, thereby reducing or
even totally negating their benefit. For example, flawed customer data and/or householding
algorithms can cause a customer’s relationship to be misrepresented in the data warehouse.
This can cause a top-tier customer to be treated like a bottom-tier customer, and can result in
increased attrition. Likewise, it can cause customers to get the wrong direct marketing
solicitations sent to them. It can cause pre-approved loan offerings to be made to customers
who have already defaulted on loans, or cause top prospects to be bypassed in a marketing
campaign.
When devising an integrity strategy, you must approach it from a business perspective. You
must weigh the benefits in terms of increased information reliability (or at least being able to
identify problems before they negatively impact your processes), versus the costs incurred in
implementing and operating the data quality checks. It is never a matter of whether or not to
build in data quality checking, since there will always be critical data elements that are worth
checking no matter what the cost. Rather, the decisions to be made will generally consist of
what to check and where in the information assembly line to check it.
The way to assess the downstream organizational costs of potential data integrity problems is
to look at each data element independently. There are two key factors that must be assessed:
the aggregate value dependent on that data element, and its associated risk factors. The
following figure illustrates the error propagation path.
To determine aggregate business value at risk relative to an information deliverable, we must
link the data element to all of the analytical and operational information processes that are
dependent on it. From there, it is necessary to estimate the impact of a data integrity problem
on the ability to produce value from that process. This takes into account four factors:
The total amount of profitability generated by the business process if all of the
information end products upon which it is based are correct.
The sensitivity (rate of decrease) in business value associated with moving from an
optimal to a sub-optimal decision
The sensitivity of decisions to changes in the information end products upon which
they are based.
The sensitivity of the information end products to changes in the inputs that are used
to calculate them.
Error Propagation Chain:
Error condition: invalid or unexpected inputs, or interactions between data and ETL processes
→ Delivered information: deviations in data elements populated to the warehouse and marts
→ Information end products: incorrect results from end-user data preparation processes
→ Decisions/actions: sub-optimal decisions or actions
→ Enterprise value added: reduced business value!
This applies to both analytical and operational information processes. In an analytical process,
data quality issues could cause invalid business rules to be derived, which will reduce the
benefits of operational information processes. In addition, poor quality of inputs into the
operational information processes themselves will also diminish their value, even if the
business rules are perfect.
Understanding how defects propagate through processes allows you to understand and estimate
value at risk. Data defects ultimately impact the outputs from user processes (information end
products), which then have a cascading effect on the decisions made and resulting business
value. In some cases, changes in inputs will have minimal impact on the business value of the
back-end decisions. These are low impact elements. High impact elements are those that will
yield a significant change in the resulting decisions and business value. This relationship can
be referred to as the value sensitivity associated with a data element, which is the rate at which
value is lost from a business process as that data element diverges from correctness. For
example, assume you have 100,000 customers, of varying profitability levels. 10,000 of the
highest profitability customers (over $1,000 per year) should qualify for high tier service,
which costs the company an additional $100 per customer per year. If data problems caused
you to erroneously target an additional 5,000 low profitability customers to receive the high-
tier service, $500,000 would be spent unnecessarily. If data problems caused the 10,000 high
profitability customers to be targeted for low tier service and attrition increased by 5%, we
could lose $500,000. $500,000 is the value at risk for those two defect scenarios. To fully
assess value at risk, you must estimate value at risk across the entire set of decisions being
made on the basis of the information end product and of the full spectrum of defect scenarios.
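A minimal sketch of totaling value at risk across the two defect scenarios above (the probabilities are invented for illustration; the text itself quantifies only the dollar exposures):

```python
# Value at risk per defect scenario, following the worked example above.
scenarios = [
    {"name": "5,000 low-profit customers wrongly upgraded",
     "cost": 5_000 * 100, "probability": 0.10},            # $500,000
    {"name": "10,000 high-profit customers downgraded, 5% extra attrition",
     "cost": int(10_000 * 0.05 * 1_000), "probability": 0.10},  # $500,000
]
for s in scenarios:
    print(f"{s['name']}: value at risk ${s['cost']:,}")
expected_loss = sum(s["cost"] * s["probability"] for s in scenarios)
print(f"probability-weighted exposure: ${expected_loss:,.0f}")
```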
Note that there may actually be multiple intermediate levels of information subcomponents
between the initial inputs from source systems and the information deliverables. Deviations in
subcomponents may either originate at that point due to faulty logic, or may be the result of
incorrect inputs to the computations being correctly processed. These deviations will
propagate as variances in delivered information, which ultimately impact the information end
products (via user processes) and the business processes they drive. The relative rate of change
of delivered information versus its components can be thought of as the computational
sensitivity. For example, profitability is computed as revenue minus expenses. For a slight
percentage change in revenue, there is actually a multiplicative effect on the percentage change
in profitability. This corresponds to a profitability calculation having a high degree of
computational sensitivity with respect to revenue. If we look at the average of 12 months of
balances, the sensitivity of the average to a deviation in the current month is only about 8%
(1/12). The product of the computational sensitivities of the sequence of information
subcomponents leading to an information deliverable, multiplied by the computational
sensitivity associated with the user process creating the end product, yields the aggregate
sensitivity of the information end product to that input element.
For any data element, the integrity risk factor is an estimation of the probability that the data
element, as delivered to end users, will diverge from a correct and usable form. Note the
inclusion of usability in this definition, since receiving correct information with unexpected
formatting or alignment differences may be just as bad as receiving incorrect information. For
a data element such as a balance that passes through virtually unchanged from the source
system to the data warehouse, the integrity risk of the deliverable is approximately equal to the
integrity risk of the source element. For a computed data element, it is the sum across all
inputs of the integrity risk associated with that input multiplied by the computational
sensitivity of the computed data element with respect to that input, plus the risk factor
associated with faulty computational logic.
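A minimal sketch of this propagation arithmetic, with invented sensitivities and risk factors:

```python
# Aggregate sensitivity is the product of computational sensitivities along
# the chain of sub-assemblies, times the user-process sensitivity.
chain_sensitivities = [1.0 / 12, 3.0]   # a 12-month average, then a 3x leverage step
user_process_sensitivity = 1.0
aggregate_sensitivity = user_process_sensitivity
for s in chain_sensitivities:
    aggregate_sensitivity *= s
print(f"aggregate sensitivity: {aggregate_sensitivity:.2f}")  # 0.25

# A computed element's integrity risk: sum of input risks weighted by the
# element's sensitivity to each input, plus the risk of faulty logic.
inputs = [{"risk": 0.02, "sensitivity": 3.0}, {"risk": 0.01, "sensitivity": 1.0}]
logic_risk = 0.005
element_risk = sum(i["risk"] * i["sensitivity"] for i in inputs) + logic_risk
print(f"computed element integrity risk: {element_risk:.3f}")  # 0.075
```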
Let’s first look at integrity risk that originates at the point of data capture. This is related to the
nature of the data validation that takes place upon input. One factor impacting validation is the
degree of granularity with which data is input. For example, granularity is increased when you
have separate data elements for city, state, and zip code, versus a free form address line. These
elements can then be individually verified for correctness. Risks here can be estimated in
either or both of two possible ways. Existing data anomalies can be identified within the data
stores by programmatically analyzing the contents of individual data elements. In addition, an
assessment can be made of the input controls to determine what types of errors are possible.
Generally, data elements used in actual operations (such as deposit balances or fee totals for a
bank) will be very reliable. Data elements captured for informational purposes only will
generally have much higher integrity risk, since there is often neither the focus nor the
capability to ensure correctness.
Many data quality issues are introduced after data acquisition. These generally consist of:
Errors in design assumptions that cause erroneous outputs for unexpected inputs.
Errors in coding that cause erroneous outputs for infrequently occurring/untested
input combinations.
Modifications in data element content from source systems for which ETL
programming changes were not made or were made incorrectly.
Incomplete transport of data due to hardware or software issues
Omission or inaccessibility of records due to processing errors or key duplication
problems
The point to remember here is that the more transport and transformation stages something
goes through, the higher the probability of a problem. Every time data is touched in any way,
there is the possibility of an error being introduced. This likelihood increases as complex
calculations are made such as customer profitability, which may pull in a wide range of data
elements from many sources.
Data quality assessment is almost as much art as it is science. While it is possible to recognize
data that is wrong, it is impossible to guarantee actual correctness of data. This requires that
you have access to the right answers to compare things to, which is generally not the case.
However, there are three things that you can assess to increase your comfort level with data
and verify that it is probably good:
Consistency with sources, to determine that no additional problems have been
introduced in any intervening transport or transformation function.
Plausibility, which pertains to whether any individual value encountered is a
possible one based on the business rules associated with that data element. For
example, negative interest rates are not plausible, nor are rates above legal limits.
Reasonableness, which pertains to whether or not the value is consistent with history
and with other values of related data elements. For example, a 10-fold increase in
balances is not reasonable. It could be an error resulting from a shifted decimal
place.
When evaluating data for these three qualities, the strategy you use will vary depending on the
type of data you are assessing. I place data into four broad categories:
Freeform Text
This is data that is input into large text fields. While documented business rules exist for
populating these fields, there is often little or no programmatic checking done to ensure
that they are being populated properly. This type of data may include name, address, or memo
fields. Generally, the only checks you can do on these are whether the field is populated
and whether it is correctly justified. In some cases, it may be
possible to check for the proper positioning of embedded substrings.
Dates
Dates should be checked for plausibility, to determine that they are of the correct
format (i.e., the month is 12 or less and the day does not exceed the maximum for that
month). In addition, they need to be checked for reasonableness. For example, a closed
date should be null if there is an open status and it should be populated with a valid
date if there is a closed status. In addition, logical sequencing must be assessed to
ensure that certain dates are prior to other dates as mandated by business rules. For
example, a closed date should be subsequent to the open date.
Categoricals
These are any data elements that place an entity into a category. This could be product
codes, status codes, etc. Note that categoricals must be included in a pre-defined set of
valid values, based on business rules. Therefore, plausibility checks will consist of
verifying that the value of that categorical is included in that list of acceptable values,
and verifying that it conforms to business rules relative to other related data elements.
An unknown value could mean either that the list needs to be updated or that bad data has
been introduced at the source. There are two types of reasonableness
checks that could be utilized. The first is to compare the categorical in each record
with the prior version, to determine that the change of state is valid according to
current business rules. The second is to look at the overall distribution of values across
the entire population to determine consistency with historical norms.
Measured Values
This pertains to any numeric quantity that communicates the activity or current state
associated with an entity. Using a banking example, this would include the current
balance associated with a deposit account, or the number of debits that were processed
last month. With measured values, the data from the source can be assumed to be
correct, since the operational world will monitor this for accuracy. The key concern is
to make sure that as data transport and transformation occur, errors are not
introduced. Plausibility checks can include decimal positioning and valid sign.
Reasonableness checking can include comparisons to other related data elements and
trending over time.
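As a rough illustration of how checks for these categories might be coded, consider the following sketch; the field names, status list, and 10-fold jump threshold are all illustrative assumptions.

```python
# Minimal sketch of plausibility/reasonableness checks by data category.
# Field names, valid values, and thresholds are illustrative assumptions.
from datetime import date

VALID_STATUS = {"OPEN", "CLOSED"}          # categorical: pre-defined valid set

def check_categorical(value):
    # Plausibility: value must come from the pre-defined list.
    return value in VALID_STATUS

def check_dates(open_date, closed_date, status):
    # Reasonableness: closed date null for open accounts, populated and
    # subsequent to the open date for closed accounts.
    if status == "OPEN":
        return closed_date is None
    return closed_date is not None and closed_date >= open_date

def check_balance(balance, prior_balance, max_ratio=10.0):
    # Plausibility: valid sign; reasonableness: no 10-fold jump versus
    # history (possibly a shifted decimal place).
    if balance < 0:
        return False
    if prior_balance and abs(balance) > max_ratio * abs(prior_balance):
        return False
    return True

print(check_categorical("CLOSED"))                                # True
print(check_dates(date(2020, 1, 5), date(2021, 3, 1), "CLOSED"))  # True
print(check_balance(125_000.0, 11_900.0))           # False: ~10x jump
```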
The most difficult type of data to verify is computed data. This is especially true when
evaluating data that was arrived at through a complex series of summarizations and
computations, such as customer profitability. The key here is to determine plausibility and
reasonability of individual information sub-assemblies, or the components that are used to
build the information finished products. This checking is continued until you finally check the
finished products.
Key fields for relational tables may be a combination of computed data and directly sourced
data. Note that keys have an additional layer of complexity, in that they have to be determined
independently for different entities, and yet still need to be consistent across entities. Individual
key elements can be checked for consistency, plausibility, and reasonableness, and then
matched across tables to ensure that the linkages function properly.
Trending in data validation is a very straightforward concept. Simply stated, it refers to the
modeling of historical data into a predictive function that is used to extrapolate the next data
point, and the measurement and interpretation of the degree of deviation of the actual value
from that predicted data point. Deviations above a threshold quantity or percentage qualify as
anomalies, and trigger an action. Although trending applies specifically to measured (or
computed) quantities or counts, trend information broken out by categoricals can be used to
verify the consistency of the categorical definitions across time.
Of course, while the concept is simple, implementation can be complex and many
considerations must be balanced. The first trending issue to be decided is whether to have
automated or manual trend evaluation. Manual (judgmental) trend evaluation is merely the
display of N consecutive historical values for a data element, which are then visually compared
to the current value. The person evaluating the data will then make a judgment call as to
whether the data falls within reasonable bounds or not. This method carries with it a high level
of subjectivity, labor intensiveness, and potential for human error. This may be a good
stop-gap, but it is definitely sub-optimal as a long-term approach.
Automatic trending is a mathematical approach which utilizes a simple predictive (or possibly
heuristic) algorithm for determining the probable next value in a time series, and a probable
range based on the historical volatility of the variable. The probable range is used to define one
or more thresholds or trigger points. Comparing the actual data value for that time period with
the trigger points will determine what level of data integrity problem exists and what potential
actions might be taken. The manner in which the historical data can be utilized to predict the
current value of data may be either simple or complex, rough or precise. All make assumptions
about the overall direction of data movement. Some of the possible prediction methods are:
Assume this month’s data will be approximately equal to last month’s data, so use
the prior value for comparison.
Assume this month’s data will be the average of the last N months
Assume linear growth and compute the slope of the line connecting the last data
month with the period N months prior, and then extrapolate forward
Assume linear growth and do a linear regression on the last N months, which can
then be extrapolated forward
Assume compounded growth, and compute the compound growth rate between the
last data month and the period N months prior, and apply this compound growth rate
to the current month
Identify annual cyclical patterns, superimposing this pattern on an annual growth
rate gleaned from analysis of the data, and use this to predict the next period
Prediction methods can be made increasingly complex, but often with rapidly diminishing
returns.
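Here is a minimal sketch of a few of the simpler methods from this list, applied to a short monthly series; the sample data and function names are illustrative assumptions.

```python
# Minimal sketch of some prediction methods listed above; the balance
# series is an illustrative assumption.

def predict_prior(history):
    # Use last month's value as this month's prediction.
    return history[-1]

def predict_average(history, n):
    # Average of the last N months.
    return sum(history[-n:]) / n

def predict_linear(history, n):
    # Slope between the last month and the month N periods prior,
    # extrapolated one period forward.
    slope = (history[-1] - history[-1 - n]) / n
    return history[-1] + slope

def predict_compound(history, n):
    # Compound growth rate between the last month and N months prior,
    # applied for one more period.
    rate = (history[-1] / history[-1 - n]) ** (1.0 / n)
    return history[-1] * rate

balances = [100.0, 103.0, 106.1, 109.3, 112.6, 116.0]
print("prior:", predict_prior(balances))
print("average:", round(predict_average(balances, 3), 2))
print("linear:", round(predict_linear(balances, 3), 2))
print("compound:", round(predict_compound(balances, 3), 2))
```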
Volatility is the propensity of a value to fluctuate around its statistical trend line. Volatile data
is subject to large swings on a continual basis, which makes it much harder to distinguish
normal fluctuations from data integrity problems. Volatility is the basis for defining
threshold/action-trigger points, and can be estimated by using variance calculations or by
averaging the absolute value of the displacement of each individual period value from the
modeled value for that period (trend line or curve). Understanding the natural variations in
values would allow you to establish ‘trigger points’ that warn you when variations exceed
historical norms.
Approaching it from the other direction, you could disregard actual historical fluctuation
measurement and determine trigger points based on perceived variability of the data element
and on potential impacts that fluctuations would have on delivered information and business
processes.
Trigger points could be defined to support an escalating series of actions based on potential
criticality of problems:
First trigger point (Yellow) would identify changes in values that are on the outer
fringes of normality or inner fringes of abnormality, and may or may not represent
potential problems.
Final trigger point (Red) would identify changes in values that are clearly beyond
normal and demand immediate attention.
By doing this, separate ‘red alert’ reports and ‘yellow alert’ reports or email alerts can be
generated, and specific manual review processes can then be built around these two lists.
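The following sketch shows one way volatility-derived trigger points could be implemented, using average absolute displacement from a naive prior-value predictor; the yellow and red multipliers and the sample series are illustrative assumptions.

```python
# Minimal sketch of volatility-based trigger points; multipliers and
# sample data are illustrative assumptions.

def volatility(history, predict):
    # Average absolute displacement of each value from its modeled value.
    devs = [abs(actual - predict(history[:i]))
            for i, actual in enumerate(history) if i > 0]
    return sum(devs) / len(devs)

def alert_level(actual, predicted, vol, yellow=2.0, red=4.0):
    # Classify the deviation of the actual value from the predicted one.
    deviation = abs(actual - predicted)
    if deviation > red * vol:
        return "RED"
    if deviation > yellow * vol:
        return "YELLOW"
    return "GREEN"

predict = lambda h: h[-1]                  # naive prior-value predictor
history = [100.0, 102.0, 101.0, 104.0, 103.0]
vol = volatility(history, predict)
print(alert_level(130.0, predict(history), vol))   # RED: far outside norms
print(alert_level(104.5, predict(history), vol))   # GREEN
```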
In addition to the type of trending and volatility measurement that is done, a critical design
decision is the level of granularity at which the trending is to be done. Trending can be done at:
The entity level
The application system level
The product, organizational unit, or customer type level
The account or customer level
Obviously, as you get to the lower levels of granularity for trending, the amount of work and
complexity go up. However, there is much to be gained by going to the lower levels of
granularity:
Errors that might cancel each other out at the high level are easily detectable at the
lower levels of granularity
The lower levels of granularity support customized trend analysis for different sub-
groups
Low levels of granularity can still be summarized for reporting at higher levels
The lower levels of granularity simplify research into problems
For example, if you are tracking trends at the application system level, you would not notice
that the balances for a specific product went dramatically up, with a corresponding decrease in
the balances for another product. For this, you would need to trend a summarization by product
type. Trending each account individually, identifying the ‘alert level’ (red, yellow, or green)
associated with each account, and then summarizing that at the file level (including the total
number of red, yellow, or green account categorizations) provides the most insight into what is
happening in the data. It is also possible to apply different trending mechanisms at the record
level. In banking, for example, business checking accounts might be trended and evaluated for
volatility differently from retail, and senior checking accounts in Florida (subject to large
seasonal fluctuations) might be treated differently from the rest of retail accounts. This level of
granularity also facilitates research, since a statistical profile can be made of the ‘red’ accounts
that show up in the report to research what might be causing the problem.
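A minimal sketch of account-level alerting summarized at the file level might look like this; the account data and percentage thresholds are illustrative assumptions, and a production version would use the volatility-based thresholds discussed earlier.

```python
# Minimal sketch of per-account trending summarized at the file level;
# account data and percentage thresholds are illustrative assumptions.
from collections import Counter

accounts = {
    "A1": ([500.0, 510.0, 505.0], 512.0),
    "A2": ([900.0, 905.0, 910.0], 9150.0),   # shifted decimal place?
    "A3": ([120.0, 118.0, 121.0], 119.0),
}

def account_alert(history, current, yellow=0.05, red=0.20):
    # Compare the current value to the prior value; simple percentage
    # deviations stand in for volatility-derived thresholds here.
    deviation = abs(current - history[-1]) / abs(history[-1])
    return "RED" if deviation > red else "YELLOW" if deviation > yellow else "GREEN"

levels = {acct: account_alert(h, cur) for acct, (h, cur) in accounts.items()}
print(levels)                      # per-account alert levels
print(Counter(levels.values()))   # file-level summary: counts by level
```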
Quality must be approached as an end-to-end strategy. It is only as strong as the weakest link
in the chain. The key is to understand where certain errors are most likely to occur, where they
can be most readily detected, where the data is most likely to be able to be fixed, and what
error detection mechanisms appropriately balance costs and risk. Note that the earlier in the
process a problem is detected, the easier it is to correct. Also, identifying problems later in the
process may entail significant rework if the problem has to be corrected in an earlier step, and
all downstream activities need to be redone. The following is a basic scenario for the types of
data quality checking that could occur in each information processing layer:
Data Acquisition
For Data Acquisition, there are three key objectives of data quality testing:
Identifying errors originating from the source systems, whether calculated or
user input.
Identifying valid data changes that will have downstream impacts as data
flows through the information assembly process.
Ensuring that all data is successfully assimilated into the decision support
environment.
This generally requires both plausibility and limited reasonability checking of all
critical data acquired. For all categoricals, every value must be compared with the
current listing of acceptable values for that data element. Exception reports should
identify any new categoricals introduced into the data (as well as any that disappeared),
for the purpose of determining if that is actually a valid change or represents an error
condition, and to determine what, if anything, needs to be done about it. This requires
coordination with all downstream ETL process managers.
To confirm that all records were transmitted successfully between the operational and
decision support environments, record counts must be generated both upon extraction
from the operational databases and upon loading into the decision support landing areas,
and compared to ensure completeness of transfer. In addition, control totals should be
compared to ensure that critical data elements are passed accurately.
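A minimal sketch of this count and control-total reconciliation follows; the figures and tolerance are illustrative assumptions.

```python
# Minimal sketch of extract-versus-load reconciliation; counts, totals,
# and tolerance are illustrative assumptions.

def reconcile(extract_count, load_count, extract_total, load_total,
              tolerance=0.005):
    issues = []
    if extract_count != load_count:
        issues.append(f"record count mismatch: {extract_count} extracted, "
                      f"{load_count} loaded")
    if abs(extract_total - load_total) > tolerance:
        issues.append(f"control total mismatch: {extract_total} vs {load_total}")
    return issues

# Counts and balance control totals captured at extraction and after load.
print(reconcile(1_000_000, 999_998, 5_432_100.25, 5_431_800.25))
print(reconcile(1_000_000, 1_000_000, 5_432_100.25, 5_432_100.25))  # []
```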
Data Commonization
There are two different data quality objectives associated with commonization:
Making sure the correct inputs are mapped into the correct commonized data
elements.
Making sure there are no records dropped or omitted.
The strategy used for checking will depend on the type of transformation that occurs.
For example, if you are mapping one set of categoricals into a different set, then a
plausibility check might be to verify that the combination of input and output is in a set
of valid combinations. A reasonability check might be to look at the distribution of the
transformed categorical in this data month, and compare it with the prior month.
For totals where only formatting was changed, plausibility can be verified by looking at
sign and number of decimal places, while reasonability can be verified by comparing
the pre- and post-transformation totals.
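Here is a minimal sketch of both commonization checks; the mapping table, product codes, and distributions are illustrative assumptions.

```python
# Minimal sketch of commonization checks; mapping table and codes are
# illustrative assumptions.
from collections import Counter

# Plausibility: each (source code, commonized code) pair must be valid.
VALID_MAPPINGS = {("CHK", "DDA"), ("SAV", "SAV"), ("MMA", "SAV")}

def check_mapping(source_code, common_code):
    return (source_code, common_code) in VALID_MAPPINGS

def distribution_shift(this_month, last_month):
    # Reasonability: compare this month's distribution of the commonized
    # categorical with the prior month's.
    cur, prior = Counter(this_month), Counter(last_month)
    return {code: cur[code] / len(this_month) - prior[code] / len(last_month)
            for code in set(cur) | set(prior)}

print(check_mapping("CHK", "SAV"))                       # False: invalid pair
print(distribution_shift(["DDA"] * 70 + ["SAV"] * 30,
                         ["DDA"] * 50 + ["SAV"] * 50))   # DDA up 20 points
```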
Calculations
In this phase, the primary objective is to determine that calculations are being done
properly. Calculations are somewhat difficult to verify, since there is nothing that you
can directly compare them to. Plausibility can be determined by checking formatting,
sign, and decimal points on numeric outputs, and checking output categoricals against
a list of possible values. Reasonability checking can be done by checking to make sure
that there is consistency between related data elements, which generally entails
verifying that the combination of data element contents that are being input to create
the calculation are covered by a business rule. This might involve determining if a set
of related categoricals represents a valid combination, if input values fall into correct
ranges, or if certain fields may or may not be populated based on the contents of other
elements. Unexpected combinations should be flagged as an error and those records
placed in an exception file, rather than being mapped into a default category. Any
mapping into a default category should be a decision that is made judgmentally, not
built in to the information manufacturing process. Additional quality verification can be
provided through trending analysis performed at various levels.
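The following sketch illustrates routing unexpected combinations to an exception file rather than defaulting them; the business-rule combinations and record layout are illustrative assumptions.

```python
# Minimal sketch of flagging unexpected input combinations instead of
# mapping them to a default category; rules are illustrative assumptions.

# Business rule: which (product, status) combinations may feed the calculation.
VALID_COMBOS = {("SECURED", "OPEN"), ("SECURED", "CLOSED"),
                ("STANDARD", "OPEN"), ("STANDARD", "CLOSED")}

def route(records):
    accepted, exceptions = [], []
    for rec in records:
        combo = (rec["product"], rec["status"])
        # Unexpected combinations go to an exception file for a judgmental
        # decision, rather than into a built-in default mapping.
        (accepted if combo in VALID_COMBOS else exceptions).append(rec)
    return accepted, exceptions

records = [{"product": "SECURED", "status": "OPEN"},
           {"product": "STANDARD", "status": "PENDING"}]
ok, bad = route(records)
print(len(ok), len(bad))   # 1 1 -- the PENDING record awaits review
```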
Integration
Integration, while the most difficult stage to implement, is actually fairly
straightforward to check. Generally, this involves answering two questions:
Are linkages across entities finding matches where they are supposed to find
matches?
Are the rows that they are being matched to actually the ones that they are
supposed to match to?
What this means is that any joins need to be verified bi-directionally to ensure that
everything that needs to match across tables does so correctly. While this is
conceptually the same as referential integrity, this does not necessarily mean building
referential integrity into the database load process. One problem is that we do not
necessarily want records to be rejected and potentially lost if there is not a match. The
key is to be able to identify those that do not match early in the process, to be able to
figure out why they did not match, and then to remedy the situation prior to the
continuation of production processing.
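A minimal sketch of bi-directional match verification follows; the key values are illustrative assumptions.

```python
# Minimal sketch of bi-directional match checking across two tables;
# key sets are illustrative assumptions.

def unmatched(left_keys, right_keys):
    """Return keys failing in each direction, before production continues."""
    left, right = set(left_keys), set(right_keys)
    return left - right, right - left

accounts = ["A1", "A2", "A3"]
customers_by_account = ["A1", "A2", "A4"]
no_customer, no_account = unmatched(accounts, customers_by_account)
print(no_customer)   # {'A3'}: account with no matching customer row
print(no_account)    # {'A4'}: customer linkage pointing at a missing account
```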
Information Assembly
Information assembly is the hardest process to check. Generally, the best you can do
here is reasonableness checking. The key is to continuously check the information sub-
assemblies as they are being manufactured. This allows you to pinpoint any specific
problems early in the processing where they are more easily detectable. A problem
which has a great impact on the value of a specific information sub-assembly may have
a fairly small but significant impact on the information deliverables that it rolls into, or
it may impact a small but important subset of the output deliverables. Checking only the
information deliverables themselves may not trigger red-flags, even though quality
issues exist.
Reasonableness checks can often be done by trending. In some cases, cross-element
checking can be done. For example, verification that two elements move in the same
direction across time periods can help to assure accuracy. Also, when trending totals, it
is often worthwhile to break out totals by various categoricals (product types, customer
types, etc). Often, problems that impact a small subset of the records that may be
invisible to the bottom line will be visible when looking a this type of meaningful
breakout of records.
Information Delivery
For information delivery, there is minimal additional processing done to data. It is
mainly an aggregation and transport process. However, there are still things that can
go wrong that must be specifically addressed. Records can be dropped as data
transmissions take place, or misinterpreted if the template through which records are
interpreted is different from the one through which they were written. If records are
being written to a repository, they can be rejected because of formatting errors,
duplicate keys, or referential integrity problems. The main check is to verify totals stored
or reported against the control totals associated with the information assembly process. For
cubes, we need to make sure dimension and metric totals
correspond with those in the atomic data.
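Here is a minimal sketch of verifying cube dimension totals against the atomic data; the rows and dimension are illustrative assumptions.

```python
# Minimal sketch of checking cube totals against atomic detail; rows and
# dimension are illustrative assumptions.
from collections import defaultdict

atomic = [("DDA", 100.0), ("DDA", 250.0), ("SAV", 75.0)]   # (product, balance)
cube_totals = {"DDA": 350.0, "SAV": 75.0}                  # loaded into the cube

recomputed = defaultdict(float)
for product, balance in atomic:
    recomputed[product] += balance

mismatches = {p: (cube_totals.get(p), recomputed[p])
              for p in set(cube_totals) | set(recomputed)
              if abs(cube_totals.get(p, 0.0) - recomputed[p]) > 0.005}
print(mismatches or "cube dimension totals match atomic data")
```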
Effective up-front validation will prevent expensive back-end cleanup. Often, the way that
data gets validated on an ongoing basis is that users get reports that look ‘funny’. A help-desk
request is placed to research the potential data anomaly. After a couple of weeks of research,
the root cause of the problem is uncovered. A request then has to be put in for the problem to
be corrected. In the meantime, the problem has permeated two months of tables, and has
dramatically impacted the company's ability to understand its customers. Compounding the
situation is that processes have already been executed and decisions made utilizing the invalid
data. The whole purpose of continuous checking of data is to find problems at the earliest
possible point, so that they can be corrected before propagating through the information
assembly line into the end-products. If a new product type has been identified or an error
introduced due to a systems conversion, this needs to be detected at data acquisition time. This
will allow corrections to be made to downstream processes to accommodate the changed data
before it turns into a data integrity problem.
Of course, it is not possible to fix all problems that arise. This becomes a business decision.
Certain problems may be sufficiently low impact that they are not worth the effort to fix.
Fixing other problems may involve significant delays in the overall process, which would be
more detrimental to the business than the problem data elements themselves.
The key here is to develop a decision-making process. Upon detecting any error in a data
element, an escalation procedure is invoked. This will involve the appropriate individuals to
determine whether to stop the assembly line, fix the problem, and live with the delays, whether
to do a partial fix or patch that might have less impact on the delivery schedule, or even
whether to just let it go. These decisions are very important ones, and must have the right level
of management involved who can make decisions that could involve substantial sums of
money or business risk.
For each data element, we can define four critical roles in the oversight process:
The data owner is the individual who serves as the primary contact for a data
element as it exists in its source system, or system of record.
A process owner manages a process that transports or transforms a piece of data.
The data steward is the individual who serves as the primary contact for a data
element from a decision support perspective. This person is responsible for the data
element in its various manifestations throughout a decision support environment,
and will be the ultimate decision maker in the event of any quality issues.
Concerned parties are individuals who have a vested interest in the accuracy and
timeliness of data, and have identified themselves as participants in the decision
making process relative to data integrity. They will provide input to the data steward
as to impacts on their specific reports or processes.
In the event of a data integrity issue, the data steward will be immediately contacted. The data
steward will then have the option of making an immediate decision, or involving any or all of
the users who have identified themselves as concerned parties for that data element. In
addition, the data owner may be engaged if the problem is actually with the source. Together
with the programmer or programmers involved, they will determine a course of action. From a
business perspective, the data steward will bear the responsibility for the outcome of this
decision.
Note that for this type of scenario to be possible, information management processes must be
highly flexible and transparent. This would allow for that specific piece of information to be
reprocessed and reloaded with a minimal amount of disruption. This has to be designed into
your ETL. When you are designing a car, you want to make sure that you do not have to take
the engine apart to be able to change the oil! Likewise, companies whose processes consist of
an entanglement of incomprehensible Cobol programs will probably not be able to respond in
real time to data problems. Companies that utilize high productivity ETL tools and who have
intelligently structured their transformation and transport mechanisms using a metadata-driven
development process should be able to do this effectively.
Information Planning and Project Portfolio Management
An Information Plan is the offspring of the strategy and architecture. The planning process
serves to devise a workable and cost-effective scenario for building out an infrastructure that
satisfies the business requirements. This is a roadmap that will identify not only the end-state
for the decision support environment, but also each intermediate state as a series of projects are
implemented that will achieve the intended objective. An information plan will start with the
end-state:
Data elements/subject areas to be captured, stored, and delivered
A list of planned user-accessible data repositories and structures, including
warehouse, data marts, and OLAP cubes
Access tools/analytical applications
Mapping data elements/subject areas to repositories and tables
Business rules for linking data within and across repositories
Once the deliverables are identified, the work needs to be partitioned into projects. The
partitioning of work into projects is a critical part of the plan for two reasons:
Relative timing of deliverables can have a huge impact on the way end-user
processes evolve and the amount of benefit achieved
Scope of work included in different projects can have a huge impact on design and
interoperability
When planning BI implementations, it is critical to maintain a broad enough scope so that you
can capture all sub-processes and all participants in the targeted analytical information
processes. Bottom-up planning focused around satisfying tactical needs for specific projects or
products can dominate prioritization and resource allocations. Requests for single-function,
point solutions can yield sub-optimal results by neglecting impacts and interactions associated
with cross-departmental processes. Both of these can lead to process dysfunctions:
Discontinuities that prevent individuals from working together properly
Inefficient/ineffective processes that evolve to conform to the information available
Let’s look at how scope can impact projects. Finance comes to the business intelligence team
and requests assistance in reengineering reports. They complain that the reports are too labor
intensive. By looking at their outputs, the BI group is able to automate the process and prepare
the same output with 30% of the effort.
While that seems like a significant accomplishment, it was not necessarily the best approach.
The report generated by finance was for use by the marketing department. Marketing went
through this report and picked out a few numbers, which they then manually integrated into their
spreadsheets along with some other information that they had to pull. Had the scope been
larger, the BI group would have understood the bigger picture of the project and recognized the
actual information end product of which this was merely a component. They would then have
reengineered the entire process to produce the marketing end product, automating the preparation
and collection of data and building in appropriate human checks where necessary.
Here is an example on the operational side. One credit card group manages credit line
increases, while another manages credit line decreases. Both are proposed as independent
projects by their respective business units, and implemented as completely independent
processes. Does this make sense? Maybe or maybe not. However, it is critical to assess them
to determine if there are sufficient synergies and integration points that would make it
beneficial to pull both into the same project. It may be that they can both be implemented into
a common event-driven process where, depending on what event precursors might occur, the
customer may have a model dynamically executed to assess whether an increase or decrease
would be necessary. It may be that the dynamics are so different that integration would supply
no benefit. Credit line increases may only need to be done in batch on a monthly basis.
However, if high-risk customers need to be evaluated for decreases more frequently, or
possibly continuously triggered by events, then keeping them separate could make sense. At
minimum, however, these processes need to be synchronized and potential interactions
understood. Having both the credit line increase and decrease processes acting on the same
person could result in the worst case scenario of that person’s credit line bouncing up and
down as each process runs. The key is that effective analytics up front to understand the
dynamics of these processes will allow for better operational decisions to be made.
From a timing perspective, the relative timing of project completion may yield unexpected and
unwanted results. Those of us who have been on the user side realize that when life gives you
lemons, you make lemonade. Whatever IT puts out there, good, bad, or indifferent, users will
figure out a way to duct-tape it together to somehow do their jobs. The impact is that if you
deliver a partial solution with the assumption that you will later provide the remainder of the
solution, you may find that processes (even extremely sub-optimal ones) that have developed
around the partial solution are so deeply entrenched that there will be reluctance to adapt to the
latter stages of the complete solution. When you plan deliverables, make sure you consider the
timing so that you do not create your own adoption obstacles.
For example, a company may have both a data warehouse and a summarized data mart which
are used by a specific department. A project was planned to add a number of new data
elements to the data warehouse, and it was subsequently planned to add them to the data mart
also. Based on project and resource scheduling, the data was scheduled to be added to the data
mart about six months after the data was included in the data warehouse.
By the time the six months had passed and data was available in the mart, the information
analysts in the end-user department had already created their own processes to leverage the
information in the data warehouse. At this point, they did not see any reason to have to change
things, since they were running overnight in batch anyway so the difference in performance
relative to their existing process was not meaningful. The managers were already accustomed to
looking at things in a certain way, and since the processes already worked there was no reason
to change. This is in spite of the fact that it is a more effective usage of time for the managers
to interact directly with an OLAP tool and the information analysts to be spending their time
on more organizationally productive endeavors.
This effect can be minimized by reducing implementation lag times. If there is a separate staff
assigned to data marts, the two development efforts should definitely be done in parallel with
just slightly staggered delivery times. If a metadata driven development process is used, much
of the work will be in analysis and design, which should be done together. This should
populate metadata, from which automated functions will create much of the application.
Actual data will be required for unit testing and beyond. Synchronizing this development will
produce the two sets of deliverables sufficiently close together to discourage trying to do
inappropriate development from the data warehouse.
Another possible course of action is to store data into the warehouse in such a manner that it
can be integrated virtually into the data mart. This would allow the view that the end user
interfaces with to be identical or at least very similar to what the data would look like if it was
physically in the data mart. By doing this, there would be no rush to physically integrate, other
than for improved performance and reduced resource usage.
The bottom line is, taking a large amount of work and partitioning it out into projects is not
easy. Larger project scopes provide broader process coverage and better handling of
information interconnection points. The smaller the scope, the faster completion occurs and
the earlier benefits can start accruing. Faster delivery keeps people engaged, creates
excitement around the new capabilities, and creates value faster. The ability to stage deliverables
is extremely important to the success of a BI initiative. To do this, the BI manager must be
able to understand the interconnections between systems to support linkages between
information activities, and make sure the deliverables for each project include the critical
connection points with other projects. As the information plan is developed, you will find that
the manner in which the BI effort is subdivided into projects may be almost as important to the
success of the project as the creativity and insight applied in devising the BI end state.
Organizational Culture and Change
In my opinion, the ideal situation for any BI manager is to come into an organization that has
virtually nothing. In this situation, there is tremendous flexibility and opportunity to create
efficient and effective processes according to your vision, and to mold the culture and
information paradigms of the organization into an ideal balance.
However, in any organization where BI has been entrenched for an extended period of time, BI
has a life of its own. This is what I call the self-perpetuating information culture. Imagine a
carpenter who initially learned how to use just a hammer and nails. No screws, bolts, or glue –
just hammer and nails. He becomes extremely proficient with his limited tool set. The result
is that any problem he approaches will be based on applying the hammer and nail solution
paradigm, and when he looks for tools, what he is really looking for is just better hammers and
better nails.
Similarly, back in the beginning of the BI world, there was an initial set of information
deliverables. Or maybe there was an initial set of individuals with a specific set of skills.
Nobody may remember how it started, but regardless of whether the chicken or egg came first,
what we have is a perpetuating lineage of chickens and eggs. The users with a specific skill set
will want their information in a certain way. End-user programmers will want information in
normalized tables that they can flexibly pull from. Star schema users will want to be able to
slice and dice their information through simplified interfaces. Each type of user will
continue to request information in the manner in which they are used to dealing with it.
Likewise, as information continues to be delivered according to that paradigm, people are hired
and skills developed in order to be able to handle that specific environment. Thus, a self-
perpetuating culture.
This culture will determine how projects are selected and prioritized. A culture dominated by
programmers will not care about simplified delivery, elegant data structures, or pre-computed
summaries. All they want is data. Their managers on the business side, who depend on them
for data (which further reinforces the self-perpetuating culture), will delegate to them to work
with the BI group to define projects. Therefore, in this type of culture, the BI manager will get
a laundry list of data elements, with instructions to “just put them out there, and make sure they
are right”.
A BI manager can approach this in two ways. He can breathe a sigh of relief because they
have made his life so easy, and just deliver the information as requested. Or he can delve
deeper to really understand the underlying information processes. Delving deeper can be
fraught with risk, and must be approached carefully. Programmers in business units can feel
very possessive about the processes that they develop, and they can be very threatened by
somebody who they feel may want to shake things up.
The BI manager must be able to trace the analytical information process flow to determine the
key actors and decision makers. He must be able to define a vision for what a new process
might look like. He must be able to communicate with both the business management and the
information practitioners. Most importantly, he must be able to articulate “what’s in it for me”
for all involved parties.
The business manager must understand the concept and ramifications of process. He must buy
into the notion of how value is added to information and how it is delivered most efficiently.
Both he and the programmers must also buy into the concept that by automating more
mundane data delivery processes, the programmers can spend their time on more value-added
types of activities, while the manager can get his data more easily and consistently through an
automated interface.
By far the most challenging problem occurs when processes cross organizational boundaries,
and the bulk of the cost and effort of change is borne by a different business unit than the one
receiving the bulk of the benefit. This requires a great deal of organizational/political savvy, in
conjunction with strong marketing, mediation, and negotiation skills. This must somehow be
reframed as a win-win situation, where the costs and benefits are more equitably distributed.
Foremost to remember, though, is the BI group is there to serve. It is more important to put
out solutions that will not blatantly clash with the prevailing culture and that will find
acceptance and be adopted, versus attempting to change the world (albeit for the better), but
winding up expending significant resources on something that will not be used. Processes in
many cases have to evolve incrementally. Systems and data structures change slightly; then
the staff will change slightly in response. Since the information users in the business units
have a huge amount of intellectual capital pertaining to the implementation details of the
existing analytical information processes, no strategy is going to work that does not enlist them
as partners.
However, if process change is in fact an objective of the organization, then both the business
and the BI group need to cooperate to make this happen. Cultural change and technology
change are both antecedents to process change. They provide an environment that enables and
nurtures change. Successful change requires that these be in alignment.
However, behavioral change will not take place without appropriate consequences that
motivate the change to occur. When changing information processes, there are two types of
consequences that are at work:
Innate consequences are impacts to the actor inherent in the new behaviors within
the context of the process being changed. These may relate to the ease or difficulty
of the new behaviors, and the perceived value-add to the quality of their work.
External consequences are impacts to the actor based on linkage of the new
behaviors to external penalties or rewards. These may be embodied in goals and
objectives incorporated by their managers into performance management systems, or
may be directly tied to incentive pay and bonuses.
It is important to understand that behavior changes may have unintended negative
consequences, and that even positive consequences must be carefully crafted to ensure that
they do not motivate unintended behaviors that undermine their true intent.
Behavioral changes tied to process changes require at minimum informal and possibly formal
action plans. All actors need to be identified, along with their intended behavioral changes. All
enabling antecedents must be determined for each person. Innate consequences then need to
be determined, both positive and negative. External consequences must be applied to
counteract any negative innate consequences or supplement any innate consequences that are
not sufficient to motivate change. By doing this for each person or set of homogeneous actors,
a plan can be initiated and monitored to ensure that change occurs as planned.
Change does not come easy. Overcoming inertia and resistance requires communication of a
vision that all involved parties can buy in to, and the down-and-dirty work of continuous
monitoring and persistent follow-up. This is what makes a leader.
Tactical Recommendations
Assuming you, as Business Intelligence manager, are now a convert and believe a process-
focused approach will improve the effectiveness of your organization and the enterprise as a
whole, what should you do now? While there are some agile, flexible, and progressive
organizations out there that can quickly adapt to a new paradigm, for most it will be a long,
hard slog. It is like turning the Titanic… keep pushing and eventually it will change direction.
In the interim, there are many tactical things that you can do to eliminate obstacles to the
development of efficient processes, even if you cannot directly engineer them:
Whatever KPIs or metrics senior managers use to direct the organization should be tied to
whatever analysis you perform. As you identify and evaluate strategies to change customer
behavior and improve performance, the key evaluation criterion is to approximate the
contribution to the KPI metrics that each change would generate. For example,
if risk-adjusted margin is one of the key driving metrics for managing a credit card portfolio,
each action you could take that would impact the cardholder base must be evaluated for its
potential contribution to overall risk-adjusted margin. If return on assets is a critical KPI,
understanding the contributions of specific customers and accounts to the overall return on
assets, and the impacts of changes in their behavior, requires the ability to calculate
return on assets at the individual
account level. As you plan campaigns, interest rate changes, or enhanced reward strategies, the
association of the organization's KPIs with individual accounts will directly connect your drill-
down to root causes and strategies for behavior modification back to the original performance
issues identified by the executives.
Of course, there are other benefits from defining common metrics and dimensions just once. It
improves consistency in information usage across and within business units and ensures
common information language. It also can have significant impacts on system load.
Calculations of these metrics may require the integration of information across a wide range of
subject areas and individual tables across the data warehouse, and can be extremely expensive
to run. By allowing users to ‘harvest’ pre-computed metrics rather than having to recompute
them from scratch whenever they are needed, you can achieve significant reductions in system
workload and improved turnaround times for the remaining workload.
Make sure common metrics and dimensions are defined
once (at most granular level) and shared.
You can look at an analytical information process as a mechanism whereby you continuously
drill from more generalized observations into more specific, actionable details. What this
means is that either implicit or explicit mechanisms for drilling into more detailed data must be
available. These drill mechanisms may be built into tools and largely transparent, or they can
be merely procedural, leveraging a common data language to allow the same selection criteria
that identified a specific organizational cell to be replicated to select individual accounts from
another data repository. The important thing is that each dimension that could be used for drill
back must be available (and computed uniformly!) in all repositories that could be utilized
together within the same analytical information process.
Data marts that exist solely for access by a specific tool can sometimes be an acceptable
answer, but in many cases introduce significant issues:
This could limit or even eliminate potential for joining data across repositories,
creating data and process discontinuities.
It could prevent other segments of users, or those with existing skillsets in other
tools, from effectively accessing this data, thereby creating an ‘island’ of
information.
It could force reengineering of the repository if the need arises to migrate from that
tool, either due to vendor problems or emergence of significantly better technology.
Before making a decision to implement this type of solution, make sure you understand the
process and data interoperability issues. Be sure to also look at more open solutions, to
determine whether slight differences in functionality can result in large improvements in
interoperability.
Design should include drill-back paths from dimensional
views back to detail wherever possible.
Avoid data that is tool specific.
Strategic Recommendations
Ideally, a partnership should exist between BI and enlightened managers that leverages
knowledge of processes to enhance the strategy planning process. The next step beyond
merely eliminating tactical process obstacles is active management and design of the analytical
information processes themselves as part of the strategy development process:
In reality, it all starts with the operational information processes. These are the processes that
actually create value for the organization, and are the ones that the business will focus on as it
prepares its strategy. Changes to these processes may be mandated by external requirements,
or they may be desired because of anticipated positive impacts they will have on the business.
Either way, the nature of the changes in operational information processes will drive the types
of new or enhanced business rules needed, which will then determine the needs for the
supporting analytical information processes.
Because processes may span multiple business units, they should be looked at across business
units instead of within business units. Even if the planning processes are independent, the
interconnection points between the business units need to be identified and planned for. Make
sure that as you plan, you consider:
New and enhanced information end-products and the information deliverables from
the BI environment needed to support them
Changes in information activities and in roles of different segments, and
implications for training, staffing, and tools
Changes in process/system interconnection points and communication media
From there, you will need to work backwards:
Develop strategic information plans within and across business functions according to a process-focused
future vision.
Leverage target vision for analytical information processes to drive information strategy, architecture,
and design.
As was discussed in the section on BI planning, you start with the metrics that will be needed
to support your strategic information plan, and the other information end-products needed to
drive the supporting analytical activities. These will then need to be associated with user
segments that will be performing these activities. Once this is done, you can derive the set of
information deliverables and delivery technologies from IT to support the generation of the
end-products and execution of the surrounding processes. It is then necessary to identify any
new and enhanced architectural components needed to support this, and map out the projects
that will generate the appropriate information and structures to make this happen.
To ensure that these projects are appropriately funded and prioritized, you must:
Measure the effectiveness of BI and Data Warehousing technology based on the business value of the underlying operational information processes.
Under a process-focused planning scenario, you will have the linkages to business processes
needed to measure value. You can drill back from BI projects to the operational information
processes that they impact, and even back to the production/delivery and financial control
processes that the operational information processes impact. This trace-back allows you to
identify deltas in revenue and profitability which can then be allocated back to the BI projects.
In general, there are two ways of tying Business Intelligence projects to the broader strategy
deliverables. The first is to consolidate BI into an overarching operational project (it can be
implemented separately as a subproject), so that the BI costs and benefits are subsumed in the
overall project costs and benefits. The prioritization of the overall project will drive the
prioritization of the BI sub-project.
If BI and associated operational changes are not inextricably tied together, then we need to
look at the marginal contribution to profitability generated by the enhanced BI capabilities.
This means that you look at the probable profitability of a project without the enhanced BI
capabilities added, and also with the enhanced BI capabilities added. The assumption is that
without the upgraded tracking and optimization capabilities afforded by Business Intelligence,
the operational process will not be as effective. The delta is the contribution to profitability of
the Business Intelligence project, so you can independently determine the return on investment
and prioritization of the BI initiative.
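A minimal sketch of this delta arithmetic follows; all dollar figures are illustrative assumptions.

```python
# Minimal sketch of the marginal-contribution arithmetic described above;
# all dollar figures are illustrative assumptions.

profit_with_bi    = 4_200_000   # projected profit with enhanced BI
profit_without_bi = 3_500_000   # projected profit without it
bi_cost           =   500_000   # cost of the BI portion of the project

bi_benefit = profit_with_bi - profit_without_bi   # delta attributed to BI
bi_roi = (bi_benefit - bi_cost) / bi_cost
print(f"BI benefit: ${bi_benefit:,}; ROI: {bi_roi:.0%}")   # $700,000; 40%
```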
Let’s look at an example. A credit card company is introducing a new ‘secured card’ product,
which allows people with sub-prime credit to have a credit card with a credit line secured by
the contents of a savings account. To support this, a number of BI enhancements need to be
implemented:
Data elements unique to secured credit cards (such as information on the linked
savings account) must be added.
Some specialized report templates need to be developed to manage the product.
Changes need to be made to high-level metrics to support this product.
In the consolidated prioritization scenario, the data warehouse work is integrated into the
overall project. A single go/no-go decision is made, which will include both the operational
and analytical work associated with this product. If the overall project meets the ROI
threshold, then the warehouse portion of the project will automatically be approved for
implementation. If the ROI of just the data warehouse portion is needed to determine sequence
of implementation within the data warehouse implementation queue, you can allocate a portion
of the total net benefits of the project to the data warehouse effort, possibly based on portion of
overall development cost associated with the data warehouse effort.
If you look at the projects separately, you will need to figure out how much of the anticipated
profitability generated by the secured card product will be lost if enhanced analytical
capabilities cannot be provided to appropriately monitor and optimize this portfolio of
accounts. This difference would be the benefit assigned to the BI effort, and would be used in
conjunction with the overall cost of the BI portion of the project to determine the ROI.
Note that an entire book could be written about project costing. There are numerous ways to deal
with the allocation of infrastructure costs, incremental DASD and processing, etc. Be sure to
include the entire costs of producing information end-products and not just the information
deliverables output from the information environment. This includes training and tool costs,
plus end-user development and operational efforts and the CPU/DASD needed to execute their
processes!
Finally, from a process perspective:
Continuously review end-to-end processes for efficiency and effectiveness, and optimize BI tools and structures to eliminate gaps and bottlenecks.
Continuous process improvement is extremely important to compete and win in this
marketplace. Those who use DMAIC (define/measure/analyze/improve/control) process
optimization methodologies such as Six Sigma will find that this Business Intelligence
paradigm fits very well into that framework.
Six Sigma can actually fit into this in two ways. The first is that Six Sigma can be used to
analyze the operational information processes to determine their degree of optimality and the
amount of opportunity that analytical information processes would have to improve their
performance. For example, a credit card company looking at fee waivers for late payment fees
may determine through Six Sigma that they are exceeding the percentage waivers of their
competition for similar products and customer segments, and therefore improvements in their
waiver strategy are necessary. This could result in a recommendation to provide additional
data to the BI environment or to make significant operational changes to the waiver
determination process.
Six Sigma could also be applied to the analytical information processes directly. It can be used
to track the flow of information as end-products are produced, and to identify gaps and
bottlenecks that are delaying and hampering the effectiveness of these processes. It can then
make recommendations as to how the analytical information processes can be reengineered to
eliminate the inefficiencies and be more effective.
The Relentless March of Technology
The BI technologist of today has an amazing array of technologies to capture and retrieve
information. The data warehouse is now just a part of the data reservoir and data lake.
Hadoop can be used to capture huge volumes of unstructured data. MPP technologies have
generated new design paradigms based on optimizing your data distribution and reducing
query spaces for finding your data, supplanting the need for the star schema databases that
were so effective for legacy database technologies. Fast data delivery through cubes has been
replaced by immediate availability of data via in-memory databases, which can pull your
answers from detail faster than a cube can retrieve summary information by dimension from
disk. The latest MPP platforms are expanding their parallel architectures to include columnar
and in-memory data storage to provide unheard of levels of performance, and the latest
federation technology can pull data from Hadoop, your data warehouse, or the cloud without
you being any the wiser as to where it is actually coming from.
New data sources exist that 15 years ago we would never have dreamed of. Social Networking
generates huge amounts of data that can be captured and mined to generate new insights and
help us better understand our markets relative to our product portfolio. Text, voice, and
unstructured data can be mined and can be combined with structured data to provide insights
beyond anything we could come up with before.
The immense data volumes available and speed of access will enable new business processes
that can dramatically enhance our ability to analyze and engage with customers and prospects.
Yet through all this, the basic concepts of BI remain unchanged – leverage information
resources to understand your business to optimize results, or understand your customers to best
engage with them at points of contact to drive their behaviors. Information processes still fall
within the same familiar patterns and structures, even as the information and technological
components are more sophisticated than ever. The business process models are not made obsolete
by the new technology – they are even more critical for pulling all these diverse pieces together
to give them purpose and meaning.
Conclusion
The concepts presented in this book are more directional than they are cookbook. I often think
of Business Intelligence as being as much art as science, and as much soft skill focused as it is
technology focused. Those who are planning and developing BI often have to work with
imperfect information and make decisions under uncertainty. They have to deal with people
with diverse and sometimes diametrically opposed needs and wants. They have to deal with
enthusiastic views of the future and vested interests in the past. The BI manager needs to know
when to drive, when to acquiesce, and when to be a diplomat. Change cannot be pushed on an
organization – it must be marketed and sold to an organization.
The BI manager will not be presented with problems where there is a definitive right or wrong
answer. He will select one of a broad range of possible approaches, and his effectiveness will
fall somewhere within a continuum from black to white encompassing all intermediate shades
of gray. Identical solutions applied in different situations could be effective or ineffective
depending on the context. The important thing is to be creative in coming up with ideas and
flexible in adapting to the needs and culture of the enterprise. In short, success comes to the
Business Intelligence manager who can somehow prod the rest of the enterprise into adopting
his ideas, so they can lead him in the direction he wants to go!