Making Sense of BUSINESS INTELLIGENCE
A Look at People, Processes, and Paradigms
Ralph L. Martino
4th Edition
Business Intelligence works because of technology, but is a
success because of people!

Table of Contents
Introduction
Background
Modeling the Enterprise
Operational Information Process Model
Proactive Event-Response Process Model
Analytical Information Process Model
Understanding your Information
Understanding your User Community
Mapping and Assessing Analytical Information Processes
Information Value-Chain
A look at Information Strategy
A look at Architectural Issues and Components
Information Manufacturing and Metadata
"Real Time" or Active Data Warehouse ETL
Data Quality Concepts and Processes
Information Planning and Project Portfolio Management
Organizational Culture and Change
Tactical Recommendations
Strategic Recommendations
The Relentless March of Technology
Conclusions
Introduction
Business Intelligence is many things to many people, and depending on whom you ask you
can get very different perspectives. The database engineer will talk with pride about the
number of terabytes of data or the number of queries serviced per month. The ETL staff will
talk about the efficiency of loading many gigabytes of data and regularly bettering load time
Service Level Agreements. The system administrators will speak proudly of system uptime
percentage and number of concurrent users supported by the system. Business Intelligence
managers see modeled databases organized around subject areas accessible through one or
more access tools, supporting query needs for a large number of active users.
In reality, these are necessary, but not sufficient, for Business Intelligence success. Those of
us who have worked in Business Intelligence long enough realize that its mission is not to implement
technology, but to drive the business. Technical implementation problems can take an
otherwise good BI strategy and turn it into a failure, but technical implementation perfection
cannot take a poorly conceived strategy and turn it into a success. In this book, we will start
by providing the business context and framework for understanding the role of BI and what it
will take to make it a business success. We will then look at how the BI strategy will drive the
assembly of architectural components into a BI infrastructure, and how the processes that
populate and manage this environment need to be structured.
As with all other investments that an organization makes, the expected results of investments
in data warehousing and business intelligence technology should be a direct contribution to the
bottom line, with a rate of return that meets or exceeds that of any other investment the
organization could make with that same funding. Unfortunately, with information, the
connection between investment and dollar results is often lost. Benefits are often regarded as
intangible or “fuzzy”.
The framework that I will present will assist those who are making strategy, architecture, and
investment decisions related to business intelligence to be able to make this connection. To do
this requires certain insights and certain tools. The tools we will be using here are models,
simplified views of organizations and processes that allow you to identify all of the ‘moving
parts’ and how they fit together. We will keep this at a very high level. We will be focusing
on the overall ecosystem of the forest rather than on the characteristics of the individual trees.
Understanding the big picture and the interrelationships between the major components is
critical to being able to avoid partial solutions that ignore key steps in the process. A bridge
that goes 90% of the way across a river gives you minimal value, since the process it needs to
support is enabling transport completely across a river. The key is to understand the context
when you are defining your business intelligence and data warehousing environment, and
design for that context.
Note that in many cases, not even your business partners really understand the context of
business intelligence. You can ask five different people, and they will give you five different
views of why the data warehouse exists, how it is used, and how it should be structured. This
is not reflective of the fact that information usage is random. It is reflective of the fact that
these individuals have different roles in systematized information-based processes, with
dramatically different needs. To understand business intelligence, we must delve into the roles
that these individuals play in the overall process, and look at how each interacts with
information in his own way. Hence, the information environment cannot be designed to be a
‘one-size-fits-all’ information source, but rather must conform to the diverse needs of different
individuals and facilitate their roles and activities.
Background
As Euclid did when he constructed his framework for geometry, we will build on some
fundamental premises. Let’s start with the very basics:
Data warehousing and business intelligence technologies are enablers. Putting data into
normalized tables that are accessible using some tool does not in and of itself create any value.
It must be accessed and utilized by the business community in order to create value. They
must not only use it, but use it effectively and successfully. This is similar to any other tool.
Phones only produce value when a person is able to successfully contact and communicate
with another intended person. Televisions only produce value when the content that is
delivered provides entertainment or useful information. Automobiles only produce value
when individuals and/or items are transported from one location to an intended second
location. The value is in the results, and not in the entity itself.
The deliverables or activities of an individual taken in isolation would create no more value
than a solitary brick outside of the context of a brick wall. In the context of an overarching
process, a series of data extraction/manipulation activities, analyses, and decisions together
have purpose and meaning, and can ultimately impact how the business operates and generate
incremental long-term value for the enterprise. Processes are unique to individual businesses,
and their efficiency and effectiveness are important determinants of the overall organizational
success. The complete set of business processes defines an organization, and is reflective of
its underlying character and culture.
Data Warehousing and Business Intelligence technology by itself does not produce business value. Business
information users produce value, with the technology as a tool and enabler that facilitates this.
People do not produce value in isolation - overarching information processes are the vehicles through which
their activities and deliverables find meaning and context and ultimately create value.
How an organization operates is based upon a spider web of dependencies, some of which are
almost chicken/egg types of recursive causality. Business intelligence is just one of these
interdependent pieces. As a result, even business intelligence processes must be viewed in
context of the broader organization, and can only be changed and enhanced to the extent that
the connection points that join this with other processes can be changed.
Information culture is a major determining factor as to the manner in which processes evolve.
A conservative culture is more prone to stepwise improvements, applying technology and
automation to try to do things in a similar fashion but somewhat faster and better. A dynamic
culture is more prone to adopt new paradigms and reengineer processes to truly take advantage
of the latest technologies. Process-focused cultures, where methodologies such
as Six Sigma are promoted and engrained into the mindsets of the employees, are more likely
to understand and appreciate the bigger picture of information processes and be more inclined
to use that paradigm for analyzing and improving their BI deployments.
Other factors related to cultural paradigms include governance and decision-making
paradigms, which will direct how people will need to work together and interact with
information. Even cultural issues such as how employees are evaluated and rewarded will
impact how much risk an employee is willing to take.
Operational paradigms of the organization relate to how it manages itself internally, plus how
it interfaces with suppliers, partners, and customers. What types of channels are used? What
processes are automated versus manual? What processes are real-time versus batch? While
these issues may not impact an organization’s intentions or interests relative to BI deployment,
they will impact the connection between BI and decision deployment points, and will impact
the breadth and effectiveness of potential decisioning applications.
As with any other systematized series of interactions, information processes have a tendency
to reach a stable equilibrium over time. This is not necessarily an optimal state, but a state in
which the forces pushing towards change are insufficient to overcome the process’s inherent
inertia. Forces of change may come from two different sources – organizational needs for
change to improve effectiveness or achieve new organizational goals, and a change in
underlying tools and infrastructure which enables new process enhancements and efficiencies.
Processes are designed and/or evolve in the context of organizational/operational paradigms, standards, and
culture, and adapt to whatever underlying infrastructure of tools and technology is in place.
The ability to control and manage process change is critical for an enterprise to be able to thrive in
a constantly changing competitive environment, and is a key determinant of success in the
development and deployment of Business Intelligence initiatives.
In this book we will look together at the big picture of business intelligence and how it fits in
the context of the overall enterprise. We will do this by focusing on the underpinnings of
business intelligence: people, processes, and paradigms.
Modeling the Enterprise
In our quest to define the context for Business Intelligence, we need to start at the top. The
first thing we will do is come up with an extremely simplified model of the enterprise. If you
reduce the enterprise to its essence, you wind up with three flows: funds, product, and
information. These flows are embodied in the three levels of this diagram:

[Diagram: three stacked layers. Marketing and distribution activities deliver products,
services, and experiences to customers; development and production activities sit alongside
them; financial control processes carry the flow of funds; and information processes, the flow
of data, form the foundation beneath all of them.]
Flow of funds relates to the collection and disbursement of cash. Cash flows out to purchase
resources, infrastructure, and raw materials to support production and distribution, and flows in
as customers pay for products and services. Processes in this category include payables and
receivables, payroll, and activities to acquire and secure funding for operations.
Development and production activities physically acquire and transport raw materials, and
assemble them into the finished product that is distributed to customers. For a financial
institution, it would consist of acquiring the funding for credit products and supporting the
infrastructure that executes fulfillment, transaction processing, and repayment.
Marketing and distribution activities consist of all activities needed to get product into the
hands of customers. It includes the identification of potential customers, the packaging of the
value proposition, the dissemination of information, and the delivery of the product to the
customer. For credit cards, it includes everything from product definition and pricing, to direct
mail to acquire customers, to assessing credit applications, to delivering the physical plastic.
In addition, it includes any post-sales activities needed to support the product and maintain
customer relationships.
Shown on the bottom, since these are the foundation for all other processes, are information
processes. These processes represent the capture, manipulation, transport, and usage of all data
throughout the enterprise, whether through computers or on paper. This data supports all other
activities, directing workflow and decisions, and enables all types of inward and outward
communication, including mandatory financial and regulatory reporting.
Of course, in the enterprise there are not three distinct parallel threads – information, financial,
and production/marketing processes are generally tightly integrated into complete business
processes. For example, in the business process of completing a sale, there is a flow of funds
component, flow of product component, and flow of information component, all working
together to achieve a business objective.
Our focus here will be on information processes. We will look at how they interact with other
processes, how they generate value, and how they are structured. We will start at the highest
level, where information processes are subdivided into two broad categories, operational
information processes and analytical information processes.
Operational Information Process Model
In its essential form, the operational information process can be modeled as follows:

[Diagram: Operational Information Processes. Operational data (entities/status and events) is
interpreted through business rules to drive three classes of processes: foundational processes
(product development/pricing, capacity/infrastructure planning, marketing strategy planning),
reactive processes (dynamic pricing/discounts, customer service support, collections
decisioning), and proactive processes (sales/customer management, production/inventory
management).]
Let’s break this out section by section. An organization is essentially a collection of
operational business processes that control the flow of information and product. What I
consider to be the intellectual capital of the enterprise, the distinguishing factor that separates it
from its competitors and drives its long-term success, is the set of business rules under which it
operates. Business rules drive its response to data, and dynamically control its workflows
based on data contents and changes. Business rules may be physically embodied in computer
code, procedures manuals, business rules repositories, or a person’s head. Wherever they are
located, they are applied either automatically or manually to interpret data and drive action.
The life-blood of any operational process is, of course, the data. I have broken this out into
two distinct data categories. The first describes all entities that the enterprise interacts with.
This could be their products, suppliers, customers, employees, or contracts. Included in this
description is the status of the entity relative to its relationship with the enterprise.
When the status of an entity changes, or an interaction occurs between the entity and the
enterprise, this constitutes an event. Events are significant because the enterprise must respond
to each event that occurs. Note that a response does not necessarily imply action – a response
could be intentional inaction. However, each time an event occurs, the enterprise must capture
it, interpret it, and determine what to do in a timeframe that is meaningful for that event.
There are certain organizational processes that must execute on a regular basis, being driven by
timing and need. These are the foundational processes. Included in these are the development
of strategy, the planning of new products and services, and the planning of capacities and
infrastructure. These processes keep the organization running.
In the operational information process model, there are two distinct scenarios for responding to
events. The first consists of what I refer to as reactive processes. A reactive process is when
the event itself calls for a response. It can be as simple as a purchase transaction, where money
is exchanged for a product or service. A more complex example from the financial services
industry could be when a credit card customer calls customer service and requests that his
interest rate be lowered to a certain level. The enterprise must have a process for making the
appropriate decision: whether to maintain the customer interest rate, lower it to what the
customer requests, or reduce it to some intermediate level.
Whatever decision is made, it will have long-term profitability implications. By reducing the
interest rate, total revenue for that customer is reduced, thereby lowering the profitability and
net present value of that customer relationship. However, by not lowering the rate, the
enterprise is risking the total loss of that customer to competitors. By leveraging profitability
and behavioral information in conjunction with optimized business rules, a decision will be
made that hopefully maximizes the expected outcome of the event-response transaction.
The second type of event-response process is what I call a proactive process. The distinction
between proactive and reactive processes is the nature of the triggering event. In a proactive
process, the event being responded to does not necessarily have any profound significance in
and of itself. However, through modeling and analysis it has been statistically identified as an
event precursor, which heralds the probable occurrence of a future event. Identifying that
event precursor gives the enterprise the opportunity to either take advantage of a positive future
event or to mitigate the impact of a negative future event.
For example, a credit card behavioral model has identified a change in a customer’s behavior
that indicates a significant probability of a future delinquency and charge-off. With this
knowledge, the company can take pre-emptive action to reduce its exposure to loss. It could
contact the customer to discuss the behaviors, apply an automatic credit line decrease, or
put the customer into a higher interest rate category. The action selected
would hopefully result in the least negative future outcome.
Note that without business rules that identify these events as event precursors, no response is
possible. In addition, other factors are involved in determining the effectiveness of the event-
response transaction. The first is latency time. A customer about to charge-off his account
may be inclined to run up the balance, knowing it will not be paid back anyway. Therefore,
the faster the response, the better the outcome will be for the company. Enterprise agility and
the ability to rapidly identify and respond to events are critical success factors.
Another factor that plays a huge role in the effectiveness of an event-response is data quality.
The business rules set data thresholds for event categorization and response calculation. The
nature or magnitude of the data quality problem may be sufficient to:
• Cause the precursor event to go undetected and hence unresponded to
• Change the magnitude of the event-response to one that is sub-optimal
• Cause an action to occur which is different from what is called for
This will result in a reduction in profitability and long-term value for the organization. Small
variations may not be sufficient to change the outcome. We will later discuss process
sensitivity to data quality variations and how to assess and mitigate this.
Operational information processes are implemented primarily using application software that
collects, processes, and stores information. This may be supplemented by business rules
repositories that facilitate the storage and maintenance of business rules. In certain event-
response processes, BI tools may also be utilized. This would be in the context of collecting
and presenting information from low-latency data stores, either by looking at a full data set or
isolating exceptions. This information is presented to analysts, who assimilate the information
from these reports and apply some sort of business rules, whether documented or intuitive.
This supports tactical management functions such as short term optimization of cash flows,
staffing, and production.
Our biggest focus will be on business rules. The validity of the business rules has a direct
impact on the appropriateness and business value of event-responses. In many cases, business
rules interact and not only need to be optimized as stand-alone entities, but also within the
context of all of the other business rules. This leads to the fundamental assertion:
Given a primary organizational metric and a defined set of environmental constraints, there is
a single set of business rules that maximizes organizational performance relative to that
metric. This optimal set changes as the environment changes.

In other words, you could theoretically ‘solve’ this as a constrained maximization problem.
You pick a single critical organizational metric, such as shareholder value. You identify all
constraints related to resource costing, customer behavior, competitive environment, funding
sources, etc. What this states is that there is a single combination of business rules that will
achieve the maximum value for that metric. There are several corollaries to this:
• Because of cross-impacts of business rules, you cannot achieve the optimal set by
optimizing each rule in isolation. Optimizing one rule may sub-optimize another.
• As you increase the number of business rules assessed simultaneously, the
complexity increases geometrically, becoming unwieldy very rapidly.
The unfortunate conclusion is that achieving exact optimality is a virtual impossibility,
although wisely applied analytics can get you close. Part of the art of analytics is
understanding which rules have sufficient cross-impacts that it makes sense to evaluate them
together, and which can be approximated as being independent to simplify the math. These
trade-offs are what make the human, judgmental aspects of analytics so important.
Of course, even if you were to somehow identify the optimum combination of business rules,
your work would not be done. Because the environmental constraints are continuously
changing, the business rules that optimize organizational performance will also need to change
accordingly. Optimization is a continuous process, not an event.
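To make the cross-impact corollary concrete, here is a minimal sketch in Python; the profit surface, the candidate rule values, and every number in it are invented purely for illustration:

```python
# Toy illustration of cross-impacting business rules: an interest-rate rule and a
# credit-line rule whose interaction term prevents tuning them in isolation.
from itertools import product

RATES = [0.10, 0.14, 0.18, 0.22]      # candidate interest-rate rules
LINE_MULTS = [1.0, 1.5, 2.0, 2.5]     # candidate credit-line multipliers

def profit(rate, line_mult):
    """Invented profit surface: each rule helps on its own, but charge-off
    losses explode when high rates are combined with large credit lines."""
    revenue = 1000 * rate + 300 * line_mult
    interaction_loss = 3000 * (rate * line_mult) ** 2
    return revenue - interaction_loss

# Optimizing each rule in isolation, holding the other at a default value...
best_rate = max(RATES, key=lambda r: profit(r, 1.0))          # -> 0.18
best_mult = max(LINE_MULTS, key=lambda m: profit(0.10, m))    # -> 2.5
print("independent:", (best_rate, best_mult), profit(best_rate, best_mult))  # 322.5

# ...lands on a worse combination than evaluating the rules jointly.
joint = max(product(RATES, LINE_MULTS), key=lambda rm: profit(*rm))
print("joint:      ", joint, profit(*joint))                  # (0.10, 2.5) 662.5
```

The independent choices are each locally sensible, yet their combination is dominated by the jointly selected pair, which is exactly why rules with strong cross-impacts must be evaluated together.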
Proactive Event-Response Process Model
Because most reactive information processes are handled by production applications and are
therefore not as interesting from a BI perspective, I would like to spend a little additional time
discussing proactive event-response processes. These are often referred to by industry thought
leaders as real-time information processes. Unfortunately, the perception most people have of
real-time misses the real point. Most people think of real-time in technological terms,
assuming it is synonymous with immediate access to events and data changes as they occur.
They associate it with a series of architectural components:
• Messaging allows events to be captured from applications/processes as they occur.
• Immediate access to live data allows quick decisions to be made and actions to be taken.
• Continuously operating engines for event capture, analysis, and response ensure quick turnaround.
However, the true essence of real-time, from a purely business perspective, is very different:

Real-time refers to the ability to respond to an event, change, or need in a timeframe that
optimizes business value.
Using this definition, real time often involves but no longer necessitates instantaneous
responses, nor is the focus around a specific technology set. Real-time now can be looked at in
purely business terms. Since we are now talking about optimizing business value, the
underlying issue becomes the maximization of net profitability, which is driven by its cost and
revenue components:
• The costs of integrating the information sources needed to make an optimal response
to the event, which are dependent on the underlying infrastructure, application
software, and architecture.
• The revenue produced through the implementation of that response, which is
dependent on the nature and distribution frequency of different possible event-
response outcomes.
Both costs and revenues are fairly complex. To facilitate analyzing and optimizing proactive
information processes, I have come up with some simple models. First, let’s break the
proactive event-response process out into a series of basic steps:
[Timeline: the event-response window opens when the precursor (trigger) event takes place and
closes at the predicted future event; its duration is the probabilistic lag between the two.
Within the window, the steps are: trigger event is detected and recorded; event is determined
to be significant; context is assembled for analysis; future event is predicted and required
action is determined; action is initiated; results of action are manifested.]
As you can see from this timeline, the event-response window begins with the occurrence of a
trigger event, which has been identified as a precursor to a future, predicted event. The event-
response window closes at the time that the future event is predicted to occur, since at that
point, you can no longer take any action that can impact the event or its results. Let us look
individually at these process components.
Trigger event is detected and recorded:
After a trigger event occurs, data describing this event must be generated and stored
somewhere. Event detection is when the knowledge that this trigger event has occurred
is available outside of the context of the operational application that captured it and
becomes commonly available knowledge. This may happen because an event record is
placed on a common data backbone or bus, or it may happen because an output record
from that application is ultimately written out to an operational data store for common
usage. In some cases, significant processing must be done to actually detect an event.
Because of limitations in the source systems for the data needed, it is possible that deltas
will have to be computed (differences in data value between two points in time) to
actually detect an event.
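As a minimal illustration of the delta computation just described, the following sketch compares two hypothetical daily balance snapshots and flags large single-day increases as candidate trigger events; the data layout, field names, and threshold are all assumptions:

```python
# Hypothetical daily snapshots: account_id -> end-of-day balance.
yesterday = {"A1": 4_200.00, "A2": 15_000.00, "A3": 800.00}
today     = {"A1": 54_700.00, "A2": 15_250.00, "A3": 790.00, "A4": 2_000.00}

TRIGGER_THRESHOLD = 25_000.00  # single-day balance increase worth examining

def detect_triggers(prev, curr, threshold):
    """Compute day-over-day deltas and return candidate trigger events."""
    events = []
    for account, balance in curr.items():
        delta = balance - prev.get(account, 0.0)  # new accounts: full balance is the delta
        if delta >= threshold:
            events.append({"account": account, "delta": delta})
    return events

print(detect_triggers(yesterday, today, TRIGGER_THRESHOLD))
# -> [{'account': 'A1', 'delta': 50500.0}]
```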
Event is determined to be significant:
Out of all the events for which a record is captured, only a small number of them will
actually have significance in terms of foretelling future events. A mechanism must be
in place to do preliminary filtering of these events, so that just the small subset of events
with the highest probability of having meaningful consequences is kept. Note that at
this stage, without any contextual information, it is difficult to ascertain significance of
an event with any accuracy, but at least a massive cutdown of the volume of events to
be further examined can occur.
Context is assembled for analysis:
While an individual event or piece of data by itself is not necessarily a reliable predictor
of an event, it does indicate a possibility that a certain scenario will exist that is a
precursor to that event. The scenario consists of the fact that that event occurred, plus a
complementary series of prior events and conditions that in total comprise a precursor
scenario. Once that single individual piece of the picture, the trigger event, is detected,
the data elements that comprise the remaining pieces must be pulled together for
complete evaluation within the context of a statistical model.
Future event is predicted and required action is determined:
After all data is assembled, it is run through the predictive model, generating probability
scores for one or more events. Depending on where these scores fall relative to
prescribed predictive thresholds, they will either be reflective of a non-predictive
scenario that does not require further action, or else will predict a future event and
prescribe an appropriate action to influence the future outcome.
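A minimal sketch of this scoring step follows; the scoring function, the input flags, and both thresholds are invented stand-ins for a real predictive model and its prescribed cutoffs:

```python
# Invented stand-in for a trained model: returns a probability that the
# predicted event (say, attrition or charge-off) occurs within the window.
def score(context):
    base = 0.05
    base += 0.40 if context["balance_runup"] else 0.0
    base += 0.30 if context["missed_payment"] else 0.0
    return min(base, 1.0)

ACTION_THRESHOLD = 0.60   # above this, prescribe an action
REVIEW_THRESHOLD = 0.35   # between the two, route for manual review

def decide(context):
    p = score(context)
    if p >= ACTION_THRESHOLD:
        return ("reduce_credit_line", p)   # prescribed pre-emptive action
    if p >= REVIEW_THRESHOLD:
        return ("manual_review", p)
    return ("no_action", p)                # non-predictive scenario

print(decide({"balance_runup": True, "missed_payment": True}))   # ('reduce_credit_line', 0.75)
print(decide({"balance_runup": False, "missed_payment": False})) # ('no_action', 0.05)
```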
Action is initiated:
All actions must be initiated in the operational world, via an appropriate system and/or
channel. Actions may include pricing updates, inventory orders, customer contacts, or
production adjustments. Actions may either be implemented:
• Manually - a list is generated and a human must act upon that list in order for any
action to take place.
• Automatically - data is transmitted to appropriate systems via automated interfaces,
with control reports for human validation. A person must intervene for an action not
to take place.
Results of action are manifested:
After an action is initiated, there will be some lag up until the time that it is actually
manifested. Actions are manifested when there is an interaction between the enterprise
and the person or entity being acted upon. For example, if the predicted event is the
customer’s need for a specific product and the action is to make an offer to a customer
to try to cross-sell a new product, the action manifestation is when the person receives
the offer in the mail and handles that piece of mail. If the predicted event is running out
of inventory and the action is to place an order for additional inventory, the action
manifestation is when the additional inventory is actually delivered.
As with any other process, the designer of the process has numerous decision points. Each
individual step in the process has a specific duration. This duration may be adjusted based on
process design, what types of software and infrastructure are involved, how much computing
resource is available, what type of staffing is assigned, etc. By understanding that the critical
time to consider is the full process cycle from trigger event occurrence to action manifestation,
and not just event detection, it is then apparent that trade-offs can be made as you allocate your
investment across the various response-process components. A certain level of investment
may cut event detection time by a few hours, but the same investment may accelerate action
initiation by a day or action manifestation by 2 days.
Note that while your event-response process is probably fairly predictable and should complete
in a specified amount of time with a fairly small variance, there is probably a much wider
variance in the size of the event-response window:

[Figure: a frequency distribution of event occurrence over time relative to the precursor
event, with the action manifestation point ahead of the mean event occurrence. When predicting
an event, a certain percentage of the time it will not actually happen; the remaining time, it
will occur according to a certain probability distribution. The action manifestation must occur
prior to the predicted event, with sufficient lead time to allow for a change in behavior.]

There should be just the right amount of time between action manifestation and mean event
occurrence. The lead time must be sufficient to provide adequate response time for a behavioral
change to occur. However, if action manifestation occurs too soon, you may wind up sending a
message before it actually has relevance for the recipient, thus reducing its impact, or you risk
spending money unnecessarily on compressing the process. To summarize the relationship
between action manifestation and predicted event occurrence:
• You gain benefit when your action manifests itself with enough lead time relative to
the predicted event to have the intended impact.
- For customer management processes, it also requires an appropriate and
receptive customer in order for value to be generated. Actions that do not
produce a response generate no value.
- Revenue reductions occur when costly offers are accepted by inappropriate
customers, thereby costing money without generating a return. If a credit
card company reduces the interest rate on an unprofitable customer to avert
probable attrition, this constitutes an action on an inappropriate customer.
They not only lose by keeping an unprofitable customer, they compound their
losses by further reducing interest revenue.
• Net gains must provide an appropriate return relative to development, infrastructure,
and operational expenses.
Environmental and process-related factors that will determine how effective your event-
response processes are and how much value they generate include:
• Operational effectiveness will determine how efficiently you can detect and respond
to events.
– Rapid and accurate execution of your operational processes
– High quality data being input into the process
– Efficient data interfaces and transfer mechanisms
• Quality of Analytics will determine the effectiveness of the business rules used to
drive your operational processes and how optimal your responses are.
– Accuracy of prediction: maximizing the probability that the condition you are
predicting actually occurs, thereby reducing “false positives” where you take
expensive action that is not needed.
– Accuracy of timing: narrowing the variance of the timing of the predicted
event, so that the action occurs with sufficient lead time to allow behavior
change to take place, but not so far in advance as to be irrelevant and
ineffective.
Because of the tradeoffs that need to be made, there is more involved in the model
development process than just producing a single deliverable. A wide range of predictive
models could be developed for the same usage, with varying input data and data latency, and
whose outputs have different statistical characteristics (accuracy of prediction and accuracy of
timing). Implementation and operational costs will vary for these. Optimization requires an
iterative development process, which generates and analyzes potential alternatives:
• Utilize statistical modeling to analyze a series of potential data input scenarios,
comparing the predictive precision of each scenario.
• Derive cost curve by looking at development/operational expense associated with
each scenario.
• Depending on the predictive accuracy of the models and on the timing relationship
between the original precursor event and the predicted event, the success of the
action will vary. Utilize this information for varying times to derive benefit curve.
You will find some general characteristics in your curves. In general, the further you try to
reduce latency and decrease response lag, the higher the cost. More data elements from more
diverse sources can also drive increased costs. Some sources are more expensive than others,
and this needs to be considered. At the same time, benefits will vary according to the
statistical predictive effectiveness of different model scenarios. Benefit also decreases based
on response lag, approaching zero as you near the predicted mean event time. The goal is to
identify the point where net return is maximized.
Graphically, it looks something like this:

[Graph: dollar cost and benefit curves plotted against decreasing data breadth, currency, and
action lead time, with the point where net return is maximized marked where the gap between
benefit and cost is widest.]
Essentially, what this says is that the most robust and elaborate solution may not be the one
that is most cost effective, and that the most important thing is to match the solution to the
dynamics of the decision process being supported.
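To make the trade-off concrete, here is a minimal sketch that evaluates invented cost and benefit curves over a handful of candidate response latencies and picks the one with the highest net return; the curve shapes follow the description above, but all constants are assumptions:

```python
import math

MEAN_EVENT_LAG_HRS = 120          # predicted event roughly 5 days after the trigger
candidate_latencies = [2, 6, 12, 24, 48, 72]   # hours from trigger to action manifestation

def cost(latency_hrs):
    """Invented cost curve: compressing the response process gets expensive fast."""
    return 500_000 / latency_hrs

def benefit(latency_hrs):
    """Invented benefit curve: value decays as manifestation nears the event time."""
    remaining = max(MEAN_EVENT_LAG_HRS - latency_hrs, 0)
    return 90_000 * (1 - math.exp(-remaining / 48))

scenarios = [(lat, benefit(lat) - cost(lat)) for lat in candidate_latencies]
best = max(scenarios, key=lambda s: s[1])
for lat, net in scenarios:
    print(f"latency {lat:3d}h  net return {net:12,.0f}")
print("best latency:", best[0], "hours")   # -> 48, not the fastest option
```

Note that the winning scenario is an interior point: the fastest response loses money to integration cost, and the slowest loses benefit to decay, which is the whole argument for matching the solution to the decision process.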
Some interesting examples of proactive event response processes come from the financial
services industry. One such example is trying to capture funds from customer windfalls. If a
customer receives a windfall payment, it will generally be deposited in his checking account.
It will sit there for a brief period of time, after which the customer will remove it to either
spend or invest. If the financial institution can detect that initial deposit, it is possible that they
could cross-sell a brokerage account, mutual fund, or other type of investment to this person.
The process will start out by looking for deposits over a specific threshold. This can be done
either by sifting through deposit records, or possibly by looking for a daily delta which shows a
large increase in balance. Once these are identified, context has to be collected for analysis.
This context could include the remainder of the banking relationship, the normal variance in
account balances, and some demographic information. Predictive modeling has indicated that
if the customer has a low normal variance (high normal variance means that he often makes
large deposits as part of his normal transaction patterns), does not already have an investment
account, has an income of between 30k and 70k, and has low to moderate non-mortgage debt,
he is probably a good prospect. A referral would then be sent to a sales representative from the
investment company, who would then contact him to try to secure his business.
Since modeling indicated that the money would probably be there five days before it is moved,
a response process that gets a referral out the next day and results in a customer contact by the
second day would probably have a high enough lead time. Therefore, identifying these large
deposits and identifying prospects for referrals in an overnight batch process is sufficiently
quick turnaround for this process.
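A minimal sketch of that overnight referral filter might look like the following; every field name, threshold, and the variance test itself are illustrative assumptions rather than the institution's actual rules:

```python
LARGE_DEPOSIT = 25_000  # illustrative windfall threshold

def is_referral_prospect(cust):
    """Apply the illustrative windfall rules from the example above."""
    return (
        cust["daily_delta"] >= LARGE_DEPOSIT          # large deposit detected overnight
        and cust["balance_stddev"] < 5_000            # low normal variance in balances
        and not cust["has_investment_account"]        # not already an investment customer
        and 30_000 <= cust["income"] <= 70_000        # target income band
        and cust["non_mortgage_debt"] < 15_000        # low to moderate debt
    )

customers = [
    {"id": "C1", "daily_delta": 60_000, "balance_stddev": 1_200,
     "has_investment_account": False, "income": 55_000, "non_mortgage_debt": 8_000},
    {"id": "C2", "daily_delta": 40_000, "balance_stddev": 22_000,   # routinely makes
     "has_investment_account": False, "income": 45_000, "non_mortgage_debt": 5_000},  # large deposits
]

referrals = [c["id"] for c in customers if is_referral_prospect(c)]
print(referrals)  # ['C1'] -- C2 fails the normal-variance test
```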
Another example shows that real-time processes do not necessarily need anything close to real-
time data. This relates to cross-selling to customers who make a service call. After handling
the service call, satisfied customers are given an offer to buy a new product. The way it works
is that on a regular basis, large batch jobs compute the probable products needed (if any) for
the customer base. These are kept in a database. When the customer calls, the database is
checked to see if there is any recommended product to be sold to that customer. If so, the
customer is verified to make sure there are no new derogatories on his record (missed
payments), and he has not already purchased that product. If neither of those are true, the
customer receives the offer. Results are subsequently tracked for the purpose of fine-tuning
the process.
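Here is a minimal sketch of that lookup-and-verify flow, with hypothetical dictionaries standing in for the batch-scored recommendation database and the operational checks:

```python
# Batch job output, recomputed regularly: customer -> recommended product, if any.
recommended = {"C1": "gold_card", "C2": "balance_protection"}

# Operational facts checked at call time.
new_derogatories = {"C2"}      # missed payments recorded since the batch ran
already_purchased = {("C2", "balance_protection")}

def offer_for(customer_id):
    """Return the offer to present after a satisfied service call, or None."""
    product = recommended.get(customer_id)
    if product is None:
        return None                                   # no recommendation on file
    if customer_id in new_derogatories:               # verify: no new missed payments
        return None
    if (customer_id, product) in already_purchased:   # verify: not already bought
        return None
    return product

print(offer_for("C1"))  # gold_card -- clean record, offer is presented
print(offer_for("C2"))  # None -- new derogatory since the batch ran
print(offer_for("C3"))  # None -- no recommendation on file
```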
There are, however, processes that are totally appropriate for real-time analysis, or at minimum
a combination of real-time with batch analysis. If somebody is on your web site, the key is to
get that person directed in real time to the offer he is most likely to want. This may be an
upsell, a cross-sell, or just a sale to somebody who comes in to “browse”. Real time analysis
would be very expensive, requiring that the person’s purchase, offer, and web session history
be available in an “in memory” database to analyze. A more cost effective model might be to
do batch analysis on a regular basis to determine a person’s current “state”, which is a
categorization that is computed based on all his prior history. The combination of this current
state with the recent events (what sequence of pages got the person to where he is, what is in
his cart, how was he directed to the site, etc) would then need to be analyzed to determine what
should be offered. This substantially reduces the data manipulation that must be done while
somebody is waiting for the next page to be served up.
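A minimal sketch of this hybrid approach follows; the precomputed states, session signals, and decision rules are all invented for illustration:

```python
# Precomputed nightly: each visitor's behavioral "state" from full history.
batch_state = {"V1": "price_sensitive_browser", "V2": "loyal_high_spender"}

def next_offer(visitor_id, session):
    """Combine the batch-computed state with lightweight live session events."""
    state = batch_state.get(visitor_id, "unknown")
    if state == "loyal_high_spender" and session["cart_value"] > 100:
        return "premium_upsell"
    if state == "price_sensitive_browser" and session["pages_viewed"] >= 5:
        return "discount_coupon"          # nudge a hesitant browser
    if session["referrer"] == "search_ad":
        return "landing_promotion"
    return "default_banner"

# Only the small, recent session payload is analyzed while the page waits.
print(next_offer("V2", {"cart_value": 240, "pages_viewed": 3, "referrer": "direct"}))
# -> 'premium_upsell'
```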
There is no shortage of architectural approaches to any given problem – the key will be
balancing operational effectiveness, business effectiveness, implementation cost, and
implementation time.
Analytical Information Process Model
The question then is, how do you identify this optimal set of business rules? Historically, this
has been done through intuition and anecdotal evidence. Today, staying ahead of competitors
requires that you leverage quantitative data analysis utilizing business intelligence technologies
to achieve the next level. This analysis and technology is incorporated into Analytical
Information Processes, which I define as follows:

Analytical Information Processes are iterative, closed-loop, collaborative workflows that
leverage knowledge to produce new and updated business rules. These processes consist of a
prescribed series of interrelated data manipulation and interpretation activities performed by
different participants in a logical sequence.
These processes are focused around understanding patterns, trends, and meaning imbedded
within the data. Even more importantly, they are oriented towards utilizing this understanding
as the basis for action, which in this context is the generation of new and enhanced business
rules. Viewed from the perspective of Operational Information Processes, they would look
like this at a high level:
[Diagram: Analytical Information Processes, focused around the optimization of business rules,
draw on Analytical Information Repositories that are fed from operational data (entities/status
and events). The business rules they produce feed back into the Operational Information
Processes: foundational (product development, capacity/infrastructure planning, marketing
strategy planning), reactive (dynamic pricing/discounts, customer service support, collections
decisioning), and proactive (sales/customer management, production/inventory management).]
As you can see from the prior diagram, the inputs for the Analytical Information Processes are
data stored in Analytical Information Repositories. These are distinct from the operational
databases in two ways:
• They provide sufficient history to be able to pick a point in time in your data, and
have enough history going backward from there to be able to discern meaningful
patterns and enough data going forward from there to allow for outcomes to be
determined (see the sketch after this list).
• They are optimized for the retrieval and integration of data into Information End
Products, which I define as the facts and context needed to make decisions, initiate
actions, or determine the next step in the process workflow.
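To illustrate the first property, here is a minimal sketch, with invented monthly data, of picking a point in time, building features from the history behind it, and labeling the outcome from the data ahead of it:

```python
# Invented monthly balance history per account, oldest to newest.
history = {
    "A1": [900, 950, 980, 400, 150, 0],      # balances drain away after month 3
    "A2": [500, 520, 560, 580, 600, 640],
}

OBS_POINT = 3   # index of the chosen point in time (start of month 4)

def build_row(balances):
    """Features come from the history behind the observation point;
    the outcome label comes from the months ahead of it."""
    lookback = balances[:OBS_POINT]        # pattern-discernment window
    lookahead = balances[OBS_POINT:]       # outcome-determination window
    features = {
        "latest_balance": lookback[-1],
        "trend": lookback[-1] - lookback[0],
    }
    label = int(lookahead[-1] == 0)        # did the account empty out?
    return features, label

for acct, bal in history.items():
    print(acct, build_row(bal))
# A1 ({'latest_balance': 980, 'trend': 80}, 1)
# A2 ({'latest_balance': 560, 'trend': 60}, 0)
```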
The following diagram illustrates the true role of a BI team. Its goal is to create a system of
user-tool-data interactions that enable the creation, usage, and communication of information
end products to support effective execution of analytical information processes:
[Diagram: Users interact with each other to implement analytical information processes, the
activities that together will optimize business rules and generate improved profitability or
competitive advantage. Each user interacts with his data through a tool suite (query tools,
reporting tools, OLAP tools, analytical tools), which accesses data structures within a
multi-tiered information environment (extreme volume, low latency; high volume with quick
access for analytics; low volume with immediate access for real-time decisions).]
Part of the problem with assessing and truly understanding analytical information processes is
that these processes can be very complex, and often are ad-hoc and poorly documented.
Without a framework for simplifying, systematizing, and organizing these processes into
understandable components, they can be completely overwhelming. Faced with this
complexity, many project managers responsible for data warehouse requirements gathering
will generally just ignore the details of the business processes themselves, and focus on the
simplistic question ‘what data elements do you want?’ If you design a warehouse with the
focus on merely delivering pieces of data, and neglect to ascertain how it will be used, then
your result may be a system that is difficult, time consuming, or even impossible to use by its
intended users for its intended purpose.
Understanding the nature of information processes is therefore critical for success. If we look
closely at the type of processes that are performed that fall within the decision support realm,
we can actually notice some significant commonalities across processes. My assertion is that
virtually all analytical information processes can be decomposed into a common sequence of
sub-processes. These sub-processes have a specific set of inputs, outputs, and data
analysis/manipulation activities associated with them. This also implies that specific sub-
processes can be mapped to roles, which are performed by specific segments of the
information user community, and which require specific repository types and tools. The
Analytical Information Process Model decomposes projects into a sequence of five standard
components, or sub-processes:
1. Problem/Opportunity Identification
2. Drill-down to determine root causes
3. Identify/select behaviors & strategy for change
4. Implement strategy to induce changes
5. Measure behavioral changes/assess results
A detailed description of each sub-process is as follows:
Sub-process 1 – Problem/Opportunity Identification
In this process component, the goal is to achieve a high-level view of the organization.
The metrics here tend to be directional, allowing overall organizational health and
performance to be assessed. In many cases, leading indicators are used to predict
future performance. The organization is viewed across actionable dimensions that will
enable executives to identify and pinpoint potential problems or opportunities. The
executives will generally look for problem cells (intersections of dimensions) where
performance anomalies have occurred or where they can see possible opportunities.
These may be exceptions or statistical outliers, or could even be reasonable results that
are just intuitively unexpected or inconsistent with other factors.
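As a minimal illustration of flagging problem cells, the following sketch applies a simple two-standard-deviation outlier test to invented metric values by (segment, region) cell; real exception logic would be richer:

```python
import statistics

# Invented performance metric (e.g., a profitability index) by dimension cell.
cells = {
    ("gold", "east"): 104, ("gold", "west"): 98, ("gold", "south"): 101,
    ("basic", "east"): 97, ("basic", "west"): 55,   # <- the anomaly
    ("basic", "south"): 99,
}

values = list(cells.values())
mean, stdev = statistics.mean(values), statistics.stdev(values)

# Flag cells more than two standard deviations from the mean as candidates
# for drill-down -- a crude stand-in for the exception logic described above.
outliers = {cell: v for cell, v in cells.items() if abs(v - mean) > 2 * stdev}
print(outliers)  # {('basic', 'west'): 55}
```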
Sub-process 2 - Drill Down to Determine Root Causes
Here, analysts access more detailed information to determine the ‘why’s. This is done
by drilling into actionable components of the high level metrics at a granular level, and
examining the set of individuals comprising the populations identified in the cells
targeted for action. The end-product of this step is to discover one or more root causes
of the problems identified or opportunities for improvement, and to assess which of
these issues to address. For example, if we identify a profitability problem with holders
of a specific product, the drivers of profitability would be things like retention rates,
balances, channel usage, transaction volumes, fees/waivers, etc. By pulling together a
view of all the business drivers that contribute to a state of business, we can produce a
list of candidate business drivers that we could potentially manipulate to achieve our
desired results. Once we have collected the information on candidate business drivers,
the decision needs to be made of which to actually target. There are a number of
factors that need to be considered, including sensitivity (amount of change in the
business driver needed to effect a certain change in your performance metric), cost, and
risk factors. The output from this sub-process will be a target set of business drivers to
manipulate, a target population that they need to be manipulated for, and some high-
level hypotheses on how to do it.
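A minimal sketch of weighing candidate drivers on the factors just named might look like this; the scoring formula and every number are invented for illustration:

```python
# Invented candidate drivers for a product-profitability problem.
# sensitivity: metric improvement per unit of driver change
# cost: expense to achieve one unit of driver change
# risk: probability the manipulation backfires (used as a haircut)
candidates = [
    {"driver": "retention_rate", "sensitivity": 9.0, "cost": 4.0, "risk": 0.20},
    {"driver": "fee_waivers",    "sensitivity": 3.0, "cost": 1.0, "risk": 0.05},
    {"driver": "channel_mix",    "sensitivity": 6.0, "cost": 5.0, "risk": 0.30},
]

def priority(c):
    """Crude expected-return-per-cost score for ranking drivers."""
    return c["sensitivity"] * (1 - c["risk"]) / c["cost"]

for c in sorted(candidates, key=priority, reverse=True):
    print(f'{c["driver"]:15s} score={priority(c):.2f}')
# fee_waivers     score=2.85
# retention_rate  score=1.80
# channel_mix     score=0.84
```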
Sub-process 3 - Identify/Select Behaviors & Strategy for Change
This sub-process probes into the next level, which is to understand the actual set of
interacting behaviors that affect business drivers, and determine how to manipulate
those behaviors. For those who are familiar with theories of behavior, this is an
application of the ABC theory: antecedent => behavior => consequence. What this
means is that in order to initiate a behavior, it is first necessary to create antecedents,
or enablers of the behavior. This could include any type of communication or offers
relative to the desired behavior. To motivate the behavior, one must devise
consequences for performing and possibly for not performing the behavior. This could
include incentive pricing/punitive pricing, special rewards, upgrades, etc. Assessing
this requires complex models which predict behavioral responses, additionally taking
into account how certain actions performed on our customer base can have a series of
cascading impacts, affecting both the desired behavior and also potentially producing
other side effects. From an information perspective, this is by far the most complex and
least predictable task, and often requires deep drill into the data warehouse, sifting
through huge volumes of detailed behavioral information.
Sub-process 4 - Implement
Ability to implement is perhaps the most critical but least considered part of the entire
process. This criticality is due to the fact that the value of an action decreases as the
duration of time from the triggering event increases. This duration has two
components. The first is the analytical delay, which is the time it takes to determine
what action to take. The second is the execution delay, the time for the workflow that
deploys the required antecedents and/or consequences into the operational
environment.
Implementation is often a complex activity, requiring not only information from the
decision support environment (data marts, data warehouse, and operational data
stores), but also processes to transport this back into the operational environment.
Because time to market is a critical consideration in being able to gain business value
from these strategies, special mechanisms may need to be developed in advance to
facilitate a rapid deployment mode for these strategies. Generally, this is a very
collaborative effort, but is focused around the ability of the information management
and application systems programming staffs to be able to execute. There could be a
wide variation in this time, depending on what needs to be done. Changes to the
programming of a production system could take months. Generation of a direct mail
campaign could take weeks. Updating a table or rules repository entry may take hours
or minutes.
Sub-process 5 - Assess Direct Results of Actions
There are two key assessments that need to be made after a tactic is implemented. The
first is whether or not it actually produced the anticipated behaviors. Generally,
behaviors are tracked in the actual impacted population plus a similarly profiled but
unimpacted control group to determine the magnitude of the behavioral change that
occurred. In addition to understanding what happened (or did not happen), it is also
critical to understand why. There could have been problems with execution, data
quality, or the strategy itself that caused the results to differ from the expectations. The
output from this step is essentially the capture of organizational learnings, which
hopefully will be analyzed to allow the organization to do a better job in the future of
developing and implementing strategies.
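As a minimal illustration of the control-group comparison, the following sketch computes the behavioral change attributable to the tactic as the treated group's change minus the control group's change; the retention numbers are invented:

```python
# Invented retention rates before and after the tactic, for the impacted
# population and a similarly profiled but unimpacted control group.
treated = {"before": 0.880, "after": 0.915}
control = {"before": 0.882, "after": 0.890}

treated_change = treated["after"] - treated["before"]   # ~0.035
control_change = control["after"] - control["before"]   # ~0.008 (background drift)

# Difference-in-differences: the change attributable to the tactic itself.
lift = treated_change - control_change
print(f"attributable retention lift: {lift:.3f}")        # 0.027
```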
Because most business processes are cyclical, you end where you began, assessing the current
state of business to determine where you are relative to your goals.
To illustrate how this maps to specific activities, I have chosen the marketing function for a
financial institution. I have taken an illustrative set of measures and activities that occur and
shown how they map into the five sub-processes:

1. Business Performance Management: financial ratios; profitability; retention/attrition
rates; risk profile.
2. Drill-down to root causes/business drivers: product utilization and profitability; channel
utilization and profitability; customer relationship measurements; attriter/retained customer
profiling; profitable customer profiling; transaction breakdowns; fees paid vs. waived.
3. Assess/select behaviors to manipulate: statistical analysis; data mining; predictive model
development; what-if analysis; intuitive assessment of information.
4. Implement strategy to alter behaviors: direct mail; telemarketing; feedback to customer
contact points; changes in pricing, credit lines, and service levels; changes to customer
scores, segments, or tags.
5. Measure strategy effectiveness: measure new and closed account activity; measure changes
in balances; measure changes in transaction behavior; measure changes in attrition and
retention rates; measure collections rates.
Let’s look at an example of how operational and analytical information processes are
interrelated. My team was involved with a project that actually had both operational and
analytical components. Our task was to access the data warehouse for the retail portion of the
bank (demand deposits, CDs, car loans, mortgages, etc.), and pull together information on the
customer’s overall relationship. This consisted of overall profitability, breadth of relationship,
and derogatories (late payments). This information was to be ported over to the credit card
data warehouse platform, after which it would be funneled into two different directions. The
data would be shipped to the operational system used to support credit card customer service,
where it would be displayed on a screen that supports various operational information
processes. In addition, it would go into the credit card data warehouse, where it would be
accumulated over time in the analytical environment.
By moving the data into the data warehouse, it could be integrated with organizational metrics
and dimensions and used in the execution of analytical information processes. These processes
would be used to devise new or enhanced business rules, so that operational processes such as
credit line management, interest rate management, customer retention, and even cross-sells,
could leverage the additional information. These business rules could either be incorporated
directly into the customer service application (via scripting), or else could be incorporated into
procedures manuals and training. As you collect more historical data, your analytical
information processes will yield continuously improved business rules. This is because of two
factors: models would work better with a longer time series of information, and you
additionally have the benefit of the feedback loop as you assess the results of the application of
prior business rules and apply those learnings.
Understanding your Information
Not all information is created equal. Different types of information have different roles in
analytical information processes. Different roles mean that it flows through the environment
differently, is stored differently, and is used differently. At a high level, I use this taxonomy to
describe the different classes of information to capture and manage:
Performance Metrics: A small set of high-level measures, generally utilized by senior managers
to evaluate and diagnose organizational performance; in addition to current performance
indicators, leading indicators may be included to determine performance trend. Examples
include profitability, customer retention, risk-adjusted margins, ROA, ROI.

Organizational Dimensions: Standardized, actionable views of an organization which allow
managers to pinpoint the subsets of the customers or products where there might be performance
issues. Examples include business units, segments, profitability tiers, collection buckets,
geography.

Actionable Measures that Drive Performance: These represent the actual root causes of
performance, at a sufficiently low level that actions can be devised which can directly affect
them. Analysts can then make the determination of which of these measures should be targeted
for improvement in their strategies to impact the high level organizational performance across
the appropriate dimensions. Examples include interest income, transaction volume, transaction
fees, late fees, new account sales, balances, balance increases.

Behavioral Descriptors/Measures: Customer behaviors related to purchases, transactions, or
requests for service link back to the measures that drive performance. Strategy development
consists of deciding which of these behaviors to modify, and how to do it. As behaviors are
modified, assessment must be made of both intended and unintended consequences. Examples
include product usage, account purchases, channel access, transaction behavior, payments made.

Facts: Reflect the current or prior state of an entity or its activities/changes. This would
include purchases, balances, etc.

Context: The frame of reference used to evaluate facts for meaning and relevance. Includes
forecasts, trends, industry averages, etc.
Facts and context are descriptors that permeate all other information categories. Whether
describing metrics, business drivers, or behaviors, you would present facts about business
entities framed in context. Facts and context apply equally well whether looking at a single
cell of the organization or looking at the enterprise at the macro level.
An extremely important concept here is that of the information end-product. An information
end-product is the direct input into a decision or trigger for an action. An information end-
product may consist of metrics, business drivers, or behaviors. It will contain all needed facts
and relevant context. It will be presented in such a fashion as to be understandable and
facilitate analysis and interpretation.
It is sometimes not clear what actually constitutes an information end-product. If an analyst
gets three reports, pulls some data from each of the three reports, and retypes it into a
spreadsheet so that he can make sense of it and make a decision, the spreadsheet is the end-product. In a
less intuitive but equally valid example, if the analyst took those same three reports and
analyzed the data in his head, his mental integration process and resulting logical view of the
data would be the information end-product. More complex information end-products could
include a customer stratification by a behavioral score, monthly comparisons of actual metrics
with forecasts, and customer retention by product and segment. Note also that a physical
information end-product with respect to one activity/decision point may additionally be an
input in the process of preparing a downstream information end-product.
Like any other consumer product, information has value because it fulfills a need of the user
within the context of a process. It is up to the developers of this information to ensure that it
can effectively produce that value. There are several determinants of information effectiveness
that have significant impacts on the ability of an organization to utilize the information and
ultimately produce real business value. The five broad categories impacting value of
information are:
Accuracy
Accuracy generally refers to the degree to which information reflects reality. This is the
most fundamental and readily understood property of information. Either it is correct,
or it is not. For example, if you have a decimal point shifted on an account balance, or
an invalid product code, then you do not have accurate information.
Completeness
Completeness implies that all members of the specified population are included. Causes
of incomplete data include applications that are not sourced, processing or transmission
errors that cause records to be dropped, and data omitted from specific data records by the
source system.
Usability
Usability is a much less tangible, but much more prevalent problem. It pertains to the
appropriateness of the information for its intended purposes. There are many problems
with information that could negatively impact its usability. Poor information could be
introduced right at the collection point. For example, freeform address lines may make
it very difficult to use the address information for specific applications. Certain fields
may be used for multiple purposes, causing conflicts. There could be formatting
inconsistencies or coding inconsistencies introduced by the application systems. This is
especially common when similar information is directed into the warehouse from
multiple source systems. Usability problems could also arise from product codes not
defined to an appropriate level of granularity, or defined inconsistently across systems.
Usability is even impacted when data mismatches cause inconsistencies in your ability
to join tables.
Timeliness
Timeliness is the ability to make information available to its users as rapidly as
possible. This enables a business to respond as rapidly as possible to business events.
For example, knowing how transaction volumes and retention rates responded to
punitive pricing changes for excessive usage will allow you to change or abort if it is
not having the predicted effect. In addition, knowing as early as possible that a
customer has opened new accounts and is now highly profitable will enable that
customer to be upgraded to a higher service level ASAP. Timeliness of information is
achieved by effectively managing critical paths in the information delivery process.
Cost-effectiveness
As with any other expense of operation, it is critical that the cost of collecting,
processing, delivering, and using information be kept to a minimum. This is critical for
maintaining a high level of return on investment. This means being efficient, both in
ETL process operations and in process development and enhancement.
These must be appropriately managed and balanced as the BI manager devises an information
delivery architecture and strategy.
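To make the first two determinants concrete, here is a minimal sketch, in Python, of how accuracy and completeness checks might be applied to records during a load. The field names, reference values, and thresholds are hypothetical, not taken from any specific implementation.

```python
# Hypothetical reference set for the accuracy check.
VALID_PRODUCT_CODES = {"GOLD", "PLATINUM", "REWARDS"}

def check_record(rec):
    """Accuracy: flag values that cannot reflect reality."""
    issues = []
    if rec["product_code"] not in VALID_PRODUCT_CODES:
        issues.append("invalid product code")
    if not 0 <= rec["balance"] <= 1_000_000:  # shifted decimal points land here
        issues.append("balance outside plausible range")
    return issues

def completeness_ratio(records, expected_count):
    """Completeness: fraction of the expected population actually received."""
    return len(records) / expected_count

records = [
    {"account": "A1", "product_code": "GOLD", "balance": 1520.00},
    {"account": "A2", "product_code": "XX",   "balance": 89_125_000.00},
]
for rec in records:
    print(rec["account"], check_record(rec))
print("completeness:", completeness_ratio(records, expected_count=2))
```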
Understanding your User Community
Prior to even contemplating databases, tools, and training, it is critical that an understanding be
developed of the actual people who are expected to be able to utilize and generate value from a
decision support environment. Just as companies segment their customer bases to identify
homogeneous groups for which they can devise a unique and optimal servicing strategy, so too
can your business intelligence user community be segmented and serviced accordingly. Like
customers, the information users have a specific set of needs and a desire to interact with
information suppliers in a certain way.
What I have done is to come up with a simple segmentation scheme that identifies four levels
of user, broken out by role and level of technical sophistication:
Level 1: Senior Managers and Strategists. Looking for a high-level view of the organization.
They generally require solutions (OLAP, dashboards, etc.) which entail minimal data
manipulation skills, often viewing prepared data/analysis or directly accessing information
through simple, pre-defined access paths.

Level 2: Business Analysts. Analysts who are more focused on the business than technology.
They can handle data that is denormalized, summarized, and consolidated into a small number
of tables accessed with a simple tool. They are generally involved with drilling down to find
the business drivers of performance problems. They often prepare data for strategists.

Level 3: Information Specialists. These are actual programmers who can use more
sophisticated tools and complex, generalized data structures to assemble dispersed data into
usable information. They may be involved with assessing the underlying behaviors that
impact business drivers, in strategy implementation, and in measurement of behaviors. They
may assist in the preparation of data for business analysts, strategists, or statisticians.

Level 4: Predictive Modelers and Statisticians. These are highly specialized analysts who can
use advanced tools to do data mining and predictive modeling. They need access to a wide
range of behavioral descriptors across extended timeframes. Their primary role is to identify
behaviors to change to achieve business objectives, and to select appropriate
antecedents/consequences to initiate the changes.
Let me be clear: this is a sample segmentation scheme. This specific breakout is not
as important as the fact that you must not only know and understand your users, but you must
be aware of the critical differentiation points that will direct how these users would like to and
would be able to interact with information. It is also important to remember that this must
apply to your end-state processes and not just your current state. This means that roles may be
invented that do not currently exist, and those roles must be accounted for in any segmentation
scheme.
Note that while user segmentation is very important from a planning and design perspective,
real users may not fall neatly into these well defined boxes. In reality, there is a continuum of
roles and skill levels, so be prepared to deal with a lot of gray. Many people will naturally map
into multiple segments because of the manner in which their role has evolved within the
process over time. Many of the information users that I have dealt with have the analytical
skills of a business analyst and the technical skills of an information analyst. They would feel
indignant if a person were to try to slot them in one box or the other. The key point to be made
here is that role-based segmentation will be the driving force behind the design of information
structures and BI interfaces. The important thing is that you design these for the role and not
the current person performing that role. A person performing a business analyst role should
utilize information structures and tools appropriate for that role, even if that person’s skill level
is well beyond that. This will provide much more process flexibility as people move to
different roles.
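As a hedged illustration of designing for the role rather than the person, the following sketch shows how a role-based segmentation might be captured as configuration that drives which data structures and tools each role is provisioned with. The segment names echo the sample scheme above; the structure and tool lists are assumptions for illustration.

```python
# A hypothetical role-to-provisioning map: design is driven by the role,
# not by the skill level of the person currently performing it.
ROLE_PROFILES = {
    "senior_manager": {
        "structures": ["multi-dimensional cube", "scorecard mart"],
        "tools": ["dashboard", "OLAP viewer"],
    },
    "business_analyst": {
        "structures": ["star schema mart", "summary tables"],
        "tools": ["general query tool", "OLAP tool"],
    },
    "information_specialist": {
        "structures": ["normalized warehouse"],
        "tools": ["procedural language", "general query tool"],
    },
    "statistician": {
        "structures": ["behavioral history tables"],
        "tools": ["data mining/statistical suite"],
    },
}

def provision(role):
    """Look up the structures and tools appropriate to a role."""
    return ROLE_PROFILES[role]

print(provision("business_analyst"))
```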
One of the biggest mistakes in developing a data warehouse is to provide a huge and complex
entanglement of information, and expect that by training hundreds of people, usage of this
monstrosity will permeate corporate culture and processes. When only a tiny minority
of those who were trained actually access the databases and use the information (and those
people were the ones who already had expertise in using information prior to development of
the warehouse), the developers then assume that this is a user information adoption problem.
Their solution: apply more resources to marketing the data warehouse and to training users.
Unfortunately, training will only get people so far. Some people do not have the natural
aptitudes and thought processes that are necessary to being successful knowledge workers. In
addition, many people have absolutely no desire to become skilled technicians with
information and tools. No amount of training and support will change this.
The key point to remember is, you are supplying a service to information users, who are your
customers. You must therefore start with the knowledge of who your customers are, what they
are capable of doing, and what they have to accomplish. You then apply this information by
delivering to them things that they need and can actually use. If you are supplying a product or
service that they either do not need, or do not have the current or potential skills and aptitudes
to actually use, there will not be adoption and the system you are building will fail. Business
Intelligence needs to adapt to the user community, and not vice-versa.
Mapping and Assessing Analytical Information Processes
In order to be able to evaluate and improve your analytical information processes, it is essential
that there be some way to capture and communicate these processes in a meaningful way. To
do this, I came up with the Analytical Information Process Matrix. With user segments
representing rows and sub-processes as columns, this graphical tool allows individual activities
within the process to be mapped to each cell. In the diagram below, you can see some
examples of the types of activities that might be mapped into each cell:
Although this representation is in two dimensions, it can easily be extrapolated to three
dimensions to accommodate multi-departmental processes, so that activities and participants
can be tied to their specific departments.
The matrix columns correspond to the five sub-processes of the closed loop:

1. WHAT are the performance issues? Deliver and assess performance metrics.
2. WHY is this situation occurring? Drill down and research to find root causes.
3. HOW can we improve performance? Analyze alternatives and devise an action plan.
4. IMPLEMENT the action plan! Interface with processes and channels.
5. LEARN and apply to future strategies. Measure changes to assess effectiveness.

The rows correspond to the user segments: managers/strategists, business analysts,
information specialists, and statistical analysts. A wide range of possible roles exists as you
design your closed-loop analytical processes. Example activities mapped into the cells
include: assessing performance and identifying opportunities; performance reporting;
collecting, analyzing, and presenting metrics; developing and researching hypotheses;
transactional and behavioral reporting; data integration and complex drill-down; what-if
analysis; mining data for opportunities; developing behavioral models and statistical profiles;
developing alternative strategies and selecting the optimal strategy; creating transport
mechanisms and interfaces; and collecting and assessing results.
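For those who prefer working code to diagrams, here is one hypothetical way to represent the matrix itself: a mapping from (segment, sub-process) pairs to the activities in that cell. The specific cell assignments below are illustrative examples, not a definitive mapping.

```python
# The Analytical Information Process Matrix as a data structure: rows are
# user segments, columns are the five sub-processes, and each cell holds
# the activities mapped to it. Cell contents here are illustrative only.
SUB_PROCESSES = ["WHAT", "WHY", "HOW", "IMPLEMENT", "LEARN"]

matrix = {
    ("Managers/Strategists", "WHAT"): ["Assess performance and identify opportunities"],
    ("Business Analysts", "WHY"): ["Develop and research hypotheses"],
    ("Information Specialists", "HOW"): ["Perform what-if analysis"],
    ("Statistical Analysts", "HOW"): ["Develop behavioral models",
                                      "Select optimal strategy"],
    ("Information Specialists", "IMPLEMENT"): ["Create transport mechanisms and interfaces"],
    ("Business Analysts", "LEARN"): ["Collect and assess results"],
}

def activities(segment, sub_process):
    """Return the activities mapped into one cell of the matrix."""
    return matrix.get((segment, sub_process), [])

print(activities("Statistical Analysts", "HOW"))
```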
To graphically map the process, the specific information activities are incorporated as boxes in
the diagram, along with arrows representing the outputs of a specific activity which are
connected as inputs to the subsequent activity. This is a very simple example:
This shows types of activities at a high level. Within each box could actually be numerous
individual information activities. The question then is: How much detail do you actually need
to do a meaningful process mapping? While more detail is definitely better, getting mired in
too much detail can lead to ‘analysis paralysis’. As long as you can identify the key
communication lines and data handoff points, you can derive meaningful benefit from a high-
level mapping. The key is to integrate this with the information taxonomy to identify the
information that corresponds with each box.
A sample analytical information process flow scenario might look like this:
managers/strategists assess performance using OLAP or a dashboard (WHAT are the
performance issues?); business analysts use query/reporting capabilities to drill down and
research root causes (WHY is this situation occurring?); information specialists execute
complex data integration while statistical analysts perform data mining, statistical analysis,
and scenario evaluation, leading to a decision on appropriate actions (HOW can we improve
performance?); the action plan is implemented through interfaces with processes and channels
(IMPLEMENT); and the results are analyzed through reporting and data manipulation to
measure changes and assess effectiveness (LEARN and apply to future strategies).
For example, in the ‘assess performance’ box, the key is to identify the meaningful, high-level
metrics that will be used for gauging organizational health and identifying opportunities. This
should be a relatively small number of metrics, since too many metrics can lead to confusion.
If the goal is to optimize performance, taking a certain action can move different metrics
different amounts, and possibly even in opposite directions. Simplicity and clarity are
achieved by having a primary metric for the organization, with supplemental metrics that align
with various aspects of organizational performance, and leading indicators that give an idea of
what performance might look like in the future. In addition, you need standardized dimensions
across which these metrics can be viewed, which can enable you to pinpoint where within the
organization, customer base, and product portfolio there are issues.
Once you know what needs to be delivered, you then need to understand the segment that will
be accessing the information to determine how best to deliver it. The managers and strategists
who are looking at organizational performance at a high level will need to do things like
identify exceptions, drill from broader to more granular dimensions, and be able to
communicate with the analysts who will help them research problems. A specific data
architecture and tool set will be needed to support their activities.
Business analysts need to be able to drill into the actionable performance drivers that constitute
the root causes for performance issues. For example, in a Credit Card environment, risk-
adjusted margin is a critical high level metric looked at by senior management. In our
implementation, we included roughly 40 different component metrics, which are sufficiently
granular to be actionable. The components include each individual type of fee (annual, balance
consolidation, cash advance, late, over-limit), statistics on waivers and reversals, information
on rates and balances subjected to those rates, cost of funds, insurance premiums, charge-
offs/delinquencies, and rebates/rewards. By drilling into these components, changes in risk-
adjusted margin can be investigated to determine if a meaningful pattern exists that would
explain why an increase or decrease has occurred within any cell or sub-population of the total
customer base. By analyzing these metrics, analysts can narrow down the root causes of
performance issues and come up with hypotheses for correcting them. Doing this requires
more complex access mechanisms and flexible data structures, while still maintaining fairly
straightforward data access paths.
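As a sketch of how such a decomposition might look in practice, the following example breaks a risk-adjusted margin figure into a handful of hypothetical components (a small subset of the roughly 40 mentioned above) and compares two periods component by component to see what moved the metric.

```python
# Hypothetical components of risk-adjusted margin for one cell of the
# customer base (a small subset of the ~40 components described above).
components = {
    "interest_income":  120_000.0,
    "late_fees":         14_000.0,
    "cash_advance_fees":  6_500.0,
    "cost_of_funds":    -48_000.0,
    "charge_offs":      -31_000.0,
    "rewards_expense":  -12_500.0,
}

# The high-level metric is the sum of its actionable components.
risk_adjusted_margin = sum(components.values())
print(f"risk-adjusted margin: {risk_adjusted_margin:,.0f}")

# Drill-down: compare against a prior period, component by component,
# to see which drivers moved the metric.
prior_period = dict(components, charge_offs=-22_000.0)
for name, value in components.items():
    delta = value - prior_period[name]
    if delta:
        print(f"{name} changed by {delta:+,.0f}")
```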
The next level down consists of measures of behavior, which is generally the realm of
statistical modelers. Because there are so many different types of behaviors, this is the most
difficult set of activities to predict and support. Behaviors include whether individual
customers pay off their whole credit card bill or just make partial payments, whether they call
customer service regularly or sporadically, whether they pay their credit card bill by the
internet or by mail, whether they use their credit cards for gas purchases or grocery purchases,
and countless other possibilities. The issue here is to determine which of these behaviors could
be changed in order to impact the business driver(s) identified as the root causes by the
analysts. The key is to determine antecedents and consequences such that the desired change
in behavior is maximized, without incurring negative changes in other behaviors that would
counteract the positive change. To do this requires access to detailed behavioral data over long
periods of time.
As an example, suppose senior management has identified a performance issue with the
gasoline rewards credit card, which had consistently met profitability forecasts in the past but
has suddenly shown a noticeable reduction in profitability. After drilling into the problem,
analysts identified the issue as decreased balances and usage combined with increased attrition
among the customers who had been the most profitable. Because this coincided with the
marketing of a new rewards product by a competitor that provided a 5% instead of a 3%
reward, the hypothesis was that some of our customers were being lured by this competitor’s
offer. Some customers were keeping their cards and using them less, while others were
actually closing their accounts.
Through statistical analysis, we then needed to figure out:
- What would we need to do to keep our active, profitable customers?
- Is there anything we could do for those customers already impacted?
- Given the revenue impacts of increasing rewards, what do we offer, and to what
subset of customers, that will maximize our risk-adjusted margin?
Once the course of action was determined, next came the implementation phase. For this,
there might be three hypothetical actions:
- Proactively interface with the direct marketing channel to offer an enhanced rewards
product to the most profitable customers impacted.
- Closely monitor transaction volumes of the next tier of customers, and use statement
stuffers to offer them temporary incentives if their volumes start decreasing.
- Identify any remaining customers who would be eligible for temporary incentives or
a possible product upgrade if they contact customer service with the intention of
closing their account.
Implementation of this strategy would therefore require that information be passed to three
separate channels via direct feeds of account lists and supporting data.
Once implementation is complete, the final sub-process is monitoring. The behavior of the
people who were impacted is tracked and compared with control groups (customers meeting
the same profile but not targeted for action) to determine the changes in behavior motivated by
the offer. Based on the tracking, appropriate learnings are captured that can assist in the
development of future response strategies.
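A minimal sketch of that comparison, with invented numbers: the mean change in a behavior for the targeted group is compared against a matched control group to estimate the lift attributable to the offer.

```python
# Invented per-customer balance changes since the offer went out.
treated_deltas = [210.0, -40.0, 385.0, 120.0]   # targeted customers
control_deltas = [-95.0, -120.0, 10.0, -60.0]   # same profile, not targeted

def mean(values):
    return sum(values) / len(values)

# The behavior change motivated by the offer is estimated as the
# difference between the treated and control group averages.
lift = mean(treated_deltas) - mean(control_deltas)
print(f"estimated balance lift attributable to the offer: {lift:+.2f}")
```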
As with any iterative process, with the next reporting cycle senior management will be able to
look at the key metrics across the dimensions of the organization, and be able to ascertain if
overall performance goals have been attained.
As you can see from the previous example, the process requires the involvement of various
individuals and execution of numerous information activities. In any given sub-process,
individuals may take on roles of producers or consumers of information end-products (or
sometimes both). Effective information transfer capabilities and consistent information must
be available across all process steps to ensure that collaboration among individuals and
interfaces between activities within and across sub-processes occur accurately and efficiently.
For example, when a manager specifies a cell, this corresponds to an intersection of
dimensions. These same dimensions must be understood by the analyst and must be
incorporated with the exact same meanings into whatever data structures are being used by the
analysts. When the analyst identifies business drivers that are potential problems, these need to
be communicated to the statistical analysts. In addition, if a business analyst identifies a set of
customers that looks interesting, it must be possible to transfer that set directly to the statistical
analysts so that they can work with the same data. After a strategy is developed and
implementation occurs, the same data that drives implementation must be available within the
tracking mechanisms to ensure that the correct customers are tracked and accurate learnings
occur.
As we map out these information flows, we will be looking for the following types of issues:
- Too many assembly stages can make the process of pulling together information
excessively labor intensive, extend information latency, and increase the likelihood of
errors being introduced. In many cases, excessive assembly stages are not due to intent,
but to the fact that processes evolve and information deliverables take on roles that
were not initially intended. A rational look at a process can easily identify these
vestigial activities.
- Inefficient information hand-off points between stages can occur when information is
communicated on paper, or using ‘imprecise’ terminology. For example, if a
manager communicates that there is a problem in the New York consumer loan
portfolio, it could refer to loan customers with addresses in New York, loans which
were originated in New York branches, or loans that are currently domiciled and being
serviced through New York branches. It is extremely important that precise
terminology is used to differentiate similar organizational dimensions. It is also critical
that electronic communication be used where possible, so that specific metrics and
dimensions can be unambiguously captured, and data sets can be easily passed between
different individuals for subsequent processing (see the sketch after this list).
- Multiple people preparing the same information to be delivered to different users can
cause potential data inconsistencies and waste effort. This wasted effort may not be
limited to the production of data: it may also include extensive research that must be
done if two sources provide different answers to what appears to be the same question.
- Information gaps may exist where outputs from one process activity, when passed to
the next, do not map readily into the data being utilized in the subsequent step. This can
occur if data passes between two groups using different information repositories which
may include different data elements or have different data definitions for the same
element.
- Delivery that is ineffective or inconsistent with usage can cause excessive work to be
required to produce an end-product. This can occur when the data structure is too
complicated for the ability of the end user, requiring complex joins and manipulation to
produce results, or when intricate calculations need to be implemented. It can also
occur when the tool is too complicated for the intended user. An even worse
consequence is that the difficulty in data preparation may make the process vulnerable
to logic errors and data quality problems, thereby impacting the effectiveness of the
supported information processes.
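As a sketch of the electronic hand-off suggested in the second item above, the selection could be captured as explicit dimension names and values rather than as the ambiguous phrase "the New York consumer loan portfolio". The dimension names below are hypothetical.

```python
# An unambiguous, machine-readable hand-off specification. Exactly one
# interpretation of "New York" is stated as an explicit dimension.
handoff = {
    "metric": "risk_adjusted_margin",
    "population": {
        "customer_address_state": "NY",
        # alternatives that would otherwise be conflated:
        # "origination_branch_state": "NY",
        # "servicing_branch_state": "NY",
    },
    "time_period": "2004-Q3",
}

def selection_criteria(spec):
    """Render the population spec as a WHERE-style clause for downstream use."""
    return " AND ".join(f"{col} = '{val}'" for col, val in spec["population"].items())

print(selection_criteria(handoff))
```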
In addition to promoting efficiency, understanding processes helps in one of the most daunting
tasks of the BI manager – identifying and quantifying business benefits. With the process
model, you can tie information to a set of analytical processes that optimize business rules.
You can tie the business rules to the operational processes that produce value for the
organization, and you can estimate the delta in the value of the operational processes.
Otherwise, you get Business Intelligence and Data Warehousing projects assessed for approval
based on the nebulous and non-committal “improved information” justification. The problem
with this is that you do not have any valid means of determining the relative benefits of
different BI projects (or even of comparing BI projects with non-BI projects). As a result,
projects get approved based on:
- Fictional numbers
- Who has the most political clout
- Who talks the loudest
This is definitely not the way to ensure that you maximize the business benefits of your data
warehousing and business intelligence resources. If you, as a BI manager, are going to be
evaluated based on the contribution of your team to the enterprise, it is essential that you
enforce appropriate discipline in linking projects to benefits, so as to ensure that the
highest-value projects are implemented.
Information Value-Chain
Many of you are familiar with the concept of the Corporate Information Factory described by
Bill Inmon, which is well known in data warehousing circles. It depicts how data flows
through a series of processes, repositories, and delivery mechanisms en route to end users.
Driving these processes is the need to deliver information in a form suitable for its ultimate
purpose, which I model as an information value-chain. Rather than looking at how data flows
and is stored, the value chain depicts how value is added to source data to generate information
end-products.
Note that while traditional ‘manufacturing’ models consist only of the IT value-add steps (or
what is here referred to as architected value-add), this model looks at those components as only
providing a ‘hand-off’ to the end users. The users themselves must then take the delivered
information and expend whatever effort is needed to prepare the information end-products and
deploy them within the process. I tend to call the user value-add the ‘value gap’ because it
represents the gulf that has to be bridged in order for the business information users to be able
to perform their roles in their analytical information processes.
The value chain proceeds from raw data through four stages of architected environment
value-add (the information infrastructure), across the information hand-off point, and then
through user value-add (human effort) to the finished end-product:

1. Raw data from operational systems.
2. Integrational value-add (core ETL): consistency, rational entity-relationship structure,
and accessibility.
3. Computational value-add (analytical/aggregational engines and processes): metrics,
scoring, and segmentation at atomic levels.
4. Structural value-add: aggregation, summarization, and dimensionality.
5. BI tool value-add (user interface and delivery infrastructure): simplified semantics,
automated/pre-built data interaction capabilities, and visualization.
6. Information hand-off point: the boundary between the architected environment and
human effort.
7. User value-add (the value-gap): interacting with tools and coding data extract,
manipulation, and presentation processes.
8. Information end-products deployed in analytical processes.

There is a distinct value chain for the information end-products associated with each unique
information activity across your analytical processes.
When looking at the data manipulations needed to bridge the value-gap, you will find that a
substantial value-gap is not necessarily a bad thing, just like a small value-gap is not
necessarily a good thing. It is all relative to the dynamics of the overall process, the user
segments involved, and the organizational culture and paradigms. The key to the value gap is
that it should not just ‘happen’; it needs to be planned and managed. The process of
planning and managing the value gap corresponds to your information strategy.
Once a handoff point is established that specifies how the information is supposed to look to
the end-users, you then need to determine how best to generate those information deliverables.
The way this is done is to work backwards to assess how best to partition value-add among the
four environmental categories. Note that there are numerous trade-offs that are associated with
different categories of environmental value-add. The process of assessing and managing these
trade-offs in order to define the structure of your environment corresponds to the development
of your information architecture.
An information plan is then needed to move from a strategy and architecture to
implementation. Input into the planning process consists of the complete set of data
deliverables and architectural constructs. The planning process will then partition them out
into a series of discrete projects that will ultimately achieve the desired end-state and in the
interim provide as much value early on as possible.
A look at Information Strategy
The key issues associated with devising and implementing an information strategy are related
to managing the value gap. This gap must be bridged by real people, whose ability to
manipulate data is constrained by their skills and aptitudes. They must access information
elements using a specific suite of tools. They will have certain individual information accesses
that they must perform repeatedly, and certain ones that are totally unpredictable that they may
do once and never again. The nature of this gap will determine how much support and training
are needed, how effectively and reliably business processes can be implemented, and even
whether specific activities can or cannot be practically executed using the existing staff. By
understanding the value gap, cost-benefit decisions can be made which will direct the amount
of value that will need to be built into the pre-handoff information processes, and what is better
left to the ultimate information users.
When developing an information strategy, the first thing that needs to be documented is the
target end-state process vision. The information strategy needs to consider three sets of issues
and strike a workable balance:
- Users and Activities. Issues include: training/learning curve; activity/process redesign
required; realignment of roles required; acquisition of skilled resources required.
- Tools. Issues include: in-use already vs. acquire; wide vs. narrow scope; power vs. ease
of use; best-of-breed vs. integrated suite.
- Information. Issues include: development time and cost; load and data availability
timing; flexibility vs. value added.
Based on these issues, it is apparent that:
- Strategy development is focused on mapping the information end-products associated
with analytical information processes back to a set of information deliverables
corresponding to a set of information structures and delivery technologies.
- Implicit in strategy development is the resolution of cost-benefit issues surrounding
technology and systems choices.
- Responsibility for strategy is shared by both business functions and IT, and has close
ties to architecture development.
Once you have the target processes documented and activities identified, you will see that
strategy development is essentially the recognition and resolution of process tradeoffs. The
types of trade-offs you will have to consider will include:
- Trade-off of environmental value-add for user value-add
- Trade-off of dynamic computations within tools versus static computations within
data structures
- Trade-off of breadth of tool capabilities with ease of usage and adoption
- Trade-off of segment focus available with multiple targeted tools versus reduced
costs associated with fewer tools
- Trade-off of development complexity with project cost and completion time
- Trade-off of ETL workload with data delivery time
The trade-off of environmental value-add with user value-add is critical to the success or
failure of a BI initiative. To start off, a complete user inventory would need to be undertaken
to segment users based on their current skill levels. This would then need to be mapped into
roles in the end-state process. This will allow you to assess:
- Current user capabilities and the degree of productivity that can be expected.
- What training, coaching, and experience are necessary to advance users from their
current skill level to where they need to be to fulfill their intended process roles.
- Critical skill gaps that cannot be filled by the existing user community.
By shifting the information hand-off point to the right, users will need less technical skill to
generate their information end-products. This would reduce the need for training and
enhancing skills through hiring. However, this potentially increases development complexity
and ETL workload, which would increase development cost and data delivery times.
Another huge issue which will impact the trade-off of environmental value-add versus user
value-add is the stability of the information end-products, which is a critical consideration for
organizations that already have a population of skilled information users. Each value-add
scenario has both drawbacks and benefits associated with it. The key is to balance
reliability and organizational leverage against cost and expediency.
Those who have been involved with companies with a strong information culture know that
information users can be extremely resourceful. Having previously operated on the user side
and made extensive use of fourth-generation languages, I can attest to the results that
can be achieved by applying brute force to basic data. Since users can dynamically alter their
direction at the whim of their management, this is by far the most expedient way to get
anything done. It is also the least expensive (on a departmental level), since user development
does not carry with it the rigors of production implementation. Unfortunately, this has some
negative implications:
- Each user department must have a set of programming experts, forming
decentralized islands of skill and knowledge.
- Much is done that is repetitive across and within these islands. This is both labor
intensive and promotes inconsistency.
- Service levels across the organization are widely variable, and internal bidding wars
may erupt for highly skilled knowledge workers.
- User-developed processes may not be adequately tested or be sufficiently reliable to
have multi-million dollar decisions based on them.
- Documentation may be scant, and detailed knowledge may be limited to one or a
small number of individuals, thereby promoting high levels of dependency and
incurring significant risks.
Therefore, while expedient at the departmental level, this carries with it high costs at the
overall organizational level.
Building information value into production processes is a much more rigorous undertaking,
which carries with it its own benefits and drawbacks. It requires that much thought and effort
be expended up front in understanding user processes and anticipating their ongoing needs
over time. Therefore, this is a very deliberate process as opposed to an expedient one. It
requires significant process design, programming, and modeling work to produce the correct
information and store it appropriately in repositories for user access. It also entails risk, since
if the analysis done is poor or if the business radically changes, the value added through the
production processes may be obsolete and not useful after a short period of time, thereby never
recouping sunk costs.
However, there are also extremely positive benefits of implementing value-added processes in
a production environment.
- It reduces the value-gap that users must bridge, allowing user departments to utilize
less technically skilled individuals. This results in less need for training and
maximizes the ability to leverage existing staffing.
- It increases consistency, by providing standard, high-level information building
blocks that can be incorporated directly into user information processes without
having to be rebuilt each time they are needed.
- It is reliable, repeatable, and controlled, thereby reducing integrity risk.
- It provides a metadata infrastructure which captures and communicates the
definition and derivation of each data element, thus minimizing definitional
ambiguity and simplifying research.
- It can dramatically reduce resource needs across the entire organization, both human
and computing, versus the repeated independent implementation of similar processes
across multiple departments.
Note that no reasonable solution will involve either all ‘user build’ or all ‘productionization’.
The key is understanding the trade-offs, and balancing the two. As you move more into
production, you increase your fixed costs. You will be doing additional development for these
information processes, and will operate those processes on an ongoing basis, expending
manpower and computing resources. The results will need to be stored, so there will be a
continuing cost for a potentially large amount of DASD (disk storage). When processes are built within the
user arena, costs are variable and occur only when and if the processes are actually executed.
However, these costs can quickly accumulate due to multiple areas doing the same or similar
processes, and they entail more risk of accuracy and reliability problems. This can actually
mean an even larger amount of DASD, since the same or similar data may actually be stored
repeatedly. The trade-off is that sufficient usage must be made of any production
summarizations and computations so that the total decrease in user costs and risks provides
adequate financial returns on the development and operational cost.
Depending on technologies used, data structures accessed, and complexity of algorithms,
performing the same set of calculations in a production environment versus an ad-hoc (user)
environment can take several times longer to implement, assuming implementers of equivalent
skill levels. This difference will be related to formalized requirements gathering, project
methodology compliance, metadata requirements, documentation standards, production library
management standards, and more rigorous design, testing, and data integrity management
standards.
Savings occur due to the greater ease of maintenance of production processes. Since
derivation relationships within metadata can enable you to do an impact analysis and identify
things that are impacted by upstream changes, it is much easier to keep production processes
updated as inputs change over time. Also, built-in data integrity checking can often detect
errors prior to delivery of data to user applications, thereby avoiding reruns and reducing the
probability of bad data going out. For ad-hoc processes, the author must somehow find out
about data changes in advance, or else data problems may propagate through these processes
and may not be captured until after information has already been delivered to decision makers.
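A minimal sketch of such an impact analysis, assuming the derivation relationships are available as a simple parent-to-child mapping: a breadth-first walk from the changed element finds everything downstream that must be reviewed. The element names are invented.

```python
from collections import deque

# Hypothetical derivation relationships: element -> elements derived from it.
derived_from = {
    "src.balance":        ["dw.account_balance"],
    "dw.account_balance": ["mart.avg_balance_12m", "mart.balance_tier"],
    "mart.balance_tier":  ["dash.profitability_by_tier"],
}

def impacted(changed_element):
    """Breadth-first walk of the derivation graph from a changed element."""
    seen, queue = set(), deque([changed_element])
    while queue:
        for child in derived_from.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

# Everything that must be reviewed if the source balance field changes:
print(impacted("src.balance"))
```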
In some cases, trade-offs made will impact tool and data communication expenses. If data is
delivered in a relatively complete form, it merely needs to be harvested. This generally means
that what the user will pull from the information environment is either highly summarized data
or a small targeted subset of the record population. In situations where the data is raw and
substantial value needs to be added by the users in order to make the information useful, large
amounts of data may need to be either downloaded to a PC or transmitted to a mid-range or a
server for further manipulation. This can dramatically impact data communications bandwidth
requirements and related costs.
For end-products that tend to change over time, consider providing to users a set of stable
components that they can dynamically assemble into their needed end-products. Changing
production processes requires significant lead time. If certain analytical metrics can potentially
change frequently, attempting to keep up with the changes could bog down your ETL
resources and not be sufficiently responsive. A lot depends on the types of changes that could
occur, since some types of change could be handled merely by making the process table or
rules driven. For changes that mandate recoding, efficiency in making changes is related to the
nature of the technology used for ETL. In many cases, the usage of automated tools can
dramatically reduce turnaround time and resources for system changes and enhancements.
Regardless, the need for flexibility must always be considered when determining what
deliverables need to be handed over to end users and in what format.
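As a sketch of the table- or rules-driven approach just mentioned, metric definitions can live in data rather than code, so that certain classes of change become a rule edit instead of a recoding effort. The metric and field names are hypothetical.

```python
# Metric definitions held as data: adding or removing a component field
# changes the metric without recoding the process that computes it.
metric_rules = {
    "total_fees": ["late_fee", "over_limit_fee", "annual_fee"],
    # adding "cash_advance_fee" above is a rule edit, not a code change
}

def compute_metrics(record, rules):
    """Compute each rule-defined metric as the sum of its component fields."""
    return {name: sum(record.get(field, 0.0) for field in fields)
            for name, fields in rules.items()}

record = {"late_fee": 35.0, "over_limit_fee": 29.0, "annual_fee": 0.0}
print(compute_metrics(record, metric_rules))
```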
In addition to the issue of whether to do calculations in production or leave them to the user, an
even more vexing issue is how to structure data for retrieval. Complexity of access paths is
often a more impenetrable barrier to information adoption than having to do
complex calculations. If tables are highly normalized, an extended series of table joins might
be necessary in order to pull together data needed for analysis. To simplify the user interface,
there are two alternatives. We can bury the joins into a tool interface to try to make them as
simple and transparent as possible, or else we can introduce denormalized tables. Tool-based
dynamic joins may be more flexible, but do not provide any performance benefit.
Denormalized tables provide a significant performance benefit, but at the cost of additional
DASD and the requirement of recoding ETL and restructuring databases if significant changes
occur in the business that require different views of data. Again, there will generally not be an
either/or solution, but rather a blending that takes into account which things are most stable and
which things require quickest access.
Critical decisions will need to be made with respect to tools. Tools with more power tend to be
harder to learn. In some cases, tools that are provided that are not consistent with the
corresponding user segment’s technical abilities can cause adoption resistance and ultimate
failure. There are trade-offs that need to be made with the number of tools. The more
individual tools in the suite, the more targeted they can be towards their intended segment.
However, this leads to increased support and training costs and reduced staff mobility and
interchangeability. In some cases, a single-vendor suite can be used that provides targeting of
capabilities while simplifying adoption by providing a consistent look and feel. This may
result in a compromise in functionality, since many of the best of breed individual solutions do
not have offerings that cover the complete spectrum of end-user requirements.
In a dynamic business environment, time is a critical factor. Here we need to look at time from
two perspectives. The first is the time it takes to implement a new capability. From the time a
need is recognized that requires an information solution, the clock is ticking. Every week and
month that we are waiting for that solution to be implemented, we are missing the opportunity
to generate value and gain competitive advantage. It is therefore critical to recognize the
implementation time as a critical consideration, and be willing to assess trade-offs that deliver
less, but do it faster. Likewise, when detecting events or evaluating results of decisions and
actions, information latency is a critical consideration. Hours and minutes may have
significant value. Again, it may be beneficial to make tradeoffs between latency and value-add
in order to expedite the delivery of information and improve responsiveness to events and
changes.
By the time you are finished with your strategy, you should have determined:
- What are the optimal information hand-off points to produce the needed
end-products?
- What information accesses are repetitive versus sporadic?
- What are the information clusters, or information that tends to be needed together?
- What are the different access paths (selection criteria/drill sequence) needed at the
various hand-off points?
- How do we support information and data flows between activities?
- How are people mapped to tools and data, and how will the needed skills be
acquired?
- What are the various entities, events, and relationships for which data must be
captured?
- What are the trade-offs associated with time versus capabilities?
With this information, we can now begin to work with the architects to establish a solution
architecture.
A look at Architectural Issues and Components
In the Information Architecture, you essentially define the interrelationships between data
sources, data management processes, and information repositories, and select appropriate
technologies and paradigms for implementation. Included would be a series of philosophical
directives that will determine what is stored, how processes are implemented, how metadata
(data about data) is managed, plus a series of technical directives related to platforms,
software, data communications, user and programmer tools, etc. Note that an architecture is a
means to an end. The end is the ability to implement your information strategy as quickly,
efficiently, and cost-effectively as possible. When developing an architecture, there are
numerous factors that must be considered:
- Scalability, or the ability to grow as the business grows.
- Throughput, or the ability to move and transform data quickly enough to satisfy
business timing needs.
- Complexity/reliability, which will ultimately impact data integrity and the amount of
effort that must be expended to operate the processes.
- Human productivity and implementation issues, which will impact the efficiency
with which your current staff (or target staff) will be able to develop, maintain, and
fix processes.
- Enterprise integration issues, or how the business intelligence technology suite maps
into the overall technology set employed by the enterprise.
Note that strategy and architecture must converge at a single point. Strategy works backwards
from users and activities to identify the appropriate information deliverables to support process
execution. Architecture works forward from available data and building blocks to construct a
framework for implementing a set of information deliverables. The implication here is that
neither can be done in a vacuum. The starting point is always the set of information processes
that must be supported. This will drive a first pass at an information strategy, and the needs
represented by the strategy will provide the basic requirements for a technical architecture.
Technical and business practicalities will force compromises in both the strategy and
architecture, but in the end a consistent and workable scenario must be the result.
The first thing I would like to do is basically inventory the various architectural components
that may be assembled to create a Business Intelligence and Data Warehousing architecture. I
have divided these components into three sets:
- Database Structures
- Environments/Platforms
- User access tools
The following charts identify these components:
Database Structures to support integrational, computational, and structural value-add

Operational Data Store
  Where applicable: Minimal-latency data provides quick feedback as to changes in specific
  behaviors being monitored.
  Limitations: Rapid turnaround limits possible value-add.
  Suitable process phases/segments: May be used by information specialists for
  sub-processes 4 and 5.

Multi-dimensional Database
  Where applicable: Provides fast and easy access to multidimensional information for
  interactive analysis and drill-down.
  Limitations: Becomes unwieldy with numerous or large dimensions.
  Suitable process phases/segments: May be used by managers or some business analysts
  for sub-process 1.

Star Schema Mart
  Where applicable: Allows flexible, dimensional access to detailed or summary data, and
  enables drill-through from a multidimensional database.
  Limitations: Complex to build, possible redundancy, limited ability to store individual
  event data.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Automatic/Manual Summary Tables
  Where applicable: Improves performance of frequently submitted queries.
  Limitations: Not suitable for a constantly changing query mix.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Process/Subject Focused Data Mart
  Where applicable: Segregates, restructures, and/or aggregates relevant data for
  performance and access simplicity.
  Limitations: Introduces difficulty if external information must be integrated, plus possible
  redundancy.
  Suitable process phases/segments: May be used by information analysts for
  sub-processes 1 and 2.

Normalized Analytical Data Warehouse
  Where applicable: Most flexible means of storing data, reflecting the natural structure of
  the data itself. Supports diverse retrieval patterns and entry points.
  Limitations: Introduces the largest value gap, requiring the largest user value-add and
  carrying the most information usage risk.
  Suitable process phases/segments: May be used by information specialists and statistical
  modelers for sub-processes 1-4.
Environments/Platforms to support BI Tool value-add

Portal
  Where applicable: Unified front-end that allows consolidated authentication and access
  control for all BI capabilities.
  Limitations: Not suitable for desktop applications.
  Suitable process phases/segments: All sub-processes and segments.

Dashboard/Scorecard
  Where applicable: Consolidates high-level performance metrics or behavioral measures
  into an easily interpreted visual format.
  Limitations: Information may still require interpretation from data specialists.
  Suitable process phases/segments: Suitable for use by managers in sub-process 1.

Analytical Workspace
  Where applicable: Used to integrate and consolidate data from multiple
  sources/environments to support complex analysis and modeling.
  Limitations: Excessive dependency on custom integration may result in duplication of
  effort and inconsistency.
  Suitable process phases/segments: Can be used by information specialists and modelers
  for sub-processes 3 and 4, and possibly others.

Report Management and Distribution Library
  Where applicable: Delivery of detailed reports to a wide range of users on a
  need-to-know basis.
  Limitations: Reports are not readily manipulated by users if a different view is needed.
  Suitable process phases/segments: For data distribution to managers and business
  analysts for sub-process 5, and possibly 1.

Analytical Application
  Where applicable: Custom-coded or purchased application which computes and presents
  appropriate metrics for a specific business.
  Limitations: May be difficult to incorporate metrics and process interfaces unique to your
  business.
  Suitable process phases/segments: May be used by managers and business analysts for
  sub-processes 1 and 2.

Analytical Query Environment
  Where applicable: Web- or desktop-based access to one or more BI tools for accessing
  data. Allows for storage/sharing of queries and results.
  Limitations: A single environment may not completely support a diverse tool suite.
  Suitable process phases/segments: May be used by business analysts and information
  analysts for sub-processes 1, 2, and 5.
Data Delivery and Manipulation Tools to support BI Tool value-add

Data Mining and Statistical Analysis Tool
  Where applicable: Used to cluster/segment the customer population and predict future
  behavior based on historical data.
  Limitations: Requires significant skill to execute and interpret analysis.
  Suitable process phases/segments: May be used by information specialists for
  sub-process 3.

Dashboard Design/Delivery
  Where applicable: Used to populate and format dashboard/scorecard delivery
  applications.
  Limitations: Often linked to specific query tools and not sufficiently flexible.
  Suitable process phases/segments: Used by information specialists to prepare data for
  sub-processes 1 and 5.

Procedural Programming Language
  Where applicable: Used in development of complex processes, including data integration,
  what-if analysis, and modeling.
  Limitations: Requires much skill, and introduces the possibility of redundant efforts,
  inconsistency, and errors.
  Suitable process phases/segments: May be used by information specialists for
  sub-processes 1-5.

OLAP Tool
  Where applicable: Allows data to be flexibly viewed across dimensions at multiple levels,
  with drill-down into more detail.
  Limitations: Requires an MDDB or star/snowflake schema.
  Suitable process phases/segments: May be used by managers or business analysts for
  sub-processes 1, 2, and 5.

Report Creation and Management Tool
  Where applicable: Creation of highly formatted report outputs, which can then be saved
  and distributed.
  Limitations: Formatting is static; reports must be recoded to look at alternate views of
  data.
  Suitable process phases/segments: Used by information specialists to prepare data for
  sub-processes 1 and 5.

General Query Tool
  Where applicable: Allows preparation and submission of SQL using a simplified interface
  with a semantic layer.
  Limitations: May be difficult to do complex manipulations, multi-step processes, etc.
  Suitable process phases/segments: May be used by business analysts and information
  specialists for sub-processes 1, 2, and 5.
Let’s discuss some of the issues to be faced when trying to assemble these pieces into a
cohesive information architecture. What I would like to do is evaluate this from the
perspective of the five sub-processes of your analytical information processes.
Let’s start from the beginning, which is the distribution of broad organizational metrics in
support of high-level performance management. The strategy will drive whether you will have
managers directly accessing the data through scorecards, dashboards, and/or OLAP, or whether
they will have analysts prepare decks in which the appropriate information is filtered,
massaged, interpreted, and delivered in a customized fashion. If a decision is made for
automated delivery of information, then all metrics must be calculated in advance, stored, and
be available dimensionally. Depending on data volumes, data interactivity required, and
performance constraints, this may be stored in a star schema or in cubes. A portal must be
selected to deliver this information (which will also be leveraged within the other sub-
processes), as well as mechanisms to deliver the scorecard information and to populate it.
Rather than attempting a custom solution, you can deploy one of numerous industry-specific
analytical applications to compute, store, and deliver metrics. These are pre-
packaged vendor (or custom) applications which prepare metrics and analytics and deliver
them using a series of existing templates and report formats. They will often have capabilities
suited both to managers for performance management, and some limited capabilities for
business analysts to do drill down to root causes. Analytical applications are generally a quick
way to catch up, but also may not be sufficiently tailored to your internal business processes to
maximize your competitive advantage.
For the second sub-process, drill-down to root causes, we will need to decompose the high
level metrics delivered to senior managers into a robust set of component metrics. These will
generally be stored in a star schema, with a wide range of meaningful dimensions. This would
include the same organizational dimensions looked at by senior managers, but also others used
to divide customers into more actionable sub-classes based on current and historical behavior
patterns. To supplement the calculated metrics, drill through from the star schema metrics
table into detail data should be enabled to create an expanded ‘virtual’ data mart, which allows
access to a wider range of data. Depending on the nature of the process, additional data marts
can supplement this, providing information highly specific to individual activities.
Analysts performing this sub-process will need flexible tools, which will allow OLAP access
to dimensional views, plus more generalized query capabilities. These generalized query
capabilities should be sufficiently broad that they can accommodate the star/snowflake
schemas, normalized data warehouse, and even generalized data sets created by information
analysts that integrate data warehouse and other external data sources.
One of the things you will notice is that I identified the data warehouse as being normalized.
There is a substantial school of thought that proposes that all data warehouses be modeled
dimensionally as star/snowflake schema structures with a fact table and standardized
dimensions. I am a firm believer that once you get down into the realm of analyzing and
predicting behaviors, the multiplicity of access patterns, and in many cases the use of facts
themselves rather than dimensions as entry points into the data, substantially reduce the
benefits of dimensional structures and actually make normalized structures more intuitive
to use.
A large part of your information architecture is the identification of unique data marts that can
be applied to specific user segments and sets of information activities. These marts will be
optimized for their intended usage, so that the specific supported activities can be executed
very simply and/or very quickly. There are a number of different approaches that can be used
optimize data marts for their intended usage:
Computation and storage of summary information, particularly across transactions
and events that pertain to a specific entity. This can be implemented in static
(incorporated into the original schema) or dynamic (materialized query tables
derived as needed by analyzing data access patterns) structures
Integration of time-series information, so that a series of 12 or 24 instances of the
same metric corresponding to different time periods can be co-located in a single
record to simplify and speed up the extraction of historical trend information (see
the sketch following this list).
Extracting just a subset of information relevant to the specific information activities,
thereby simplifying access and improving performance due to narrowing of data
scope.
Optimizing access paths for frequent information entry points through the
implementation of multi-dimensional models, either through a star/snowflake
schema or a multi-dimensional database.
Denormalization to reduce joins
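As an illustration of the time-series co-location technique from the list above, here is a minimal sketch using pandas; the account and balance data are invented for the example.

```python
import pandas as pd

# Fold monthly observations of one metric into a single wide record per
# account, so a trend query reads one row instead of twelve.
long = pd.DataFrame({
    "account_id": [101] * 3 + [102] * 3,
    "month":      ["2024-01", "2024-02", "2024-03"] * 2,
    "balance":    [500.0, 520.0, 480.0, 9000.0, 9100.0, 8800.0],
})

wide = long.pivot(index="account_id", columns="month", values="balance")
wide.columns = [f"balance_{m}" for m in wide.columns]  # balance_2024-01, ...
print(wide.reset_index())
```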
My recommendation is that multidimensional access be used for metrics-based data marts,
supporting managers/strategists and business analysts. This is the stage of the process where
the data entry points (what you are selecting on) and access patterns are most predictable.
This can be implemented via multi-dimensional databases for managers and strategists, who
need quick but fairly standardized access to pinpoint performance issues across the dimensions
of the organization. An underlying star schema could then provide the flexibility and drill
through capability needed for business analysts to be able to drill down to the next level of root
causes.
Decisions will need to be made as to the degree of denormalization to be introduced into the
overall system architecture. In denormalized tables, you are accepting redundancy in exchange
for efficiency. This can significantly increase DASD storage costs, and also introduce the
possibility of inconsistency within the database. In addition, pulling together data from entities
that have one-to-many or many-to-many relationships can still be complicated even after
denormalization, and users must take care when reporting on data elements that might
repeat (i.e., customer-related data elements on account-related tables). My recommendation for
handling the integration of data is as follows:
The data warehouse should generally be normalized and always be at the lowest level of
detail. This ensures ease of update, a single source for each data element, and
consistency across tables.
Denormalization should generally take place through the development of data marts and
OLAP solutions based directly on needs of individual user segments in support of their
specific information activities.
Virtual data marts can be extremely effective. By joining co-hosted or federated
normalized tables into the main fact table of a star schema data mart, you can take
advantage of dimensional access paths into the data while eliminating the need for
duplicate development to load both marts and the warehouse.
By using denormalization for highly specific applications, you can potentially focus on small
subsets of data elements or records, you can better understand usage patterns and build around
those patterns, and you can verify usability through having a specific group of users do the
acceptance testing.
For sub-process three, which is the identification of specific behaviors that can be changed to
drive changes to the root causes and subsequently to the high-level performance issues, there
may be a variety of both internal and external sources of data. The primary internal sources of
data will be the normalized data warehouse, and possibly a ‘behaviors mart’ that captures
standardized measures of common behaviors that support production models. To support the
development and testing of behavioral analytic processes, you will generally need an analytical
workspace. An analytical workspace is critical for environments with a significant population
of skilled information analysts. This is a shared environment where users can dynamically
integrate data from multiple sources, and be able to run sophisticated data mining and
clustering software to identify patterns in the data. This scenario works best when the
environment is used for dynamic data integration and analysis to leverage external data
sources, and for research and development of new statistical models. A large temptation is to
leverage this environment for the production execution of scoring processes and behavioral
models. While this offers some short-term expediency, the lack of production controls and the
elevated risk of data quality problems due to uncommunicated data changes tend to more
than outweigh any benefits.
Because of trends in hardware costs, the big trade-off here is whether to utilize high-power
workstations on each person’s desk, or create a shared Symmetric Multi-Processor
environment in which multiple users share the same large computing resource. Again, the
architects will need to look at the dynamics of how people collaborate and share data. A high-
degree of collaboration and numerous shared data sources would tend to point to a single SMP
environment, while more independent work with communication primarily of small sets of
end-products would point more towards a networked series of workstations. In general, a
single SMP environment can take individual, parallelizable jobs and execute them faster. In
some cases, this can also be done with networked workstations by setting them up as a grid,
but this is of benefit only for algorithms that are grid-suitable and requires technology that is
not as mature.
The tools required for this range from powerful procedural programming type languages that
support data integration, scoring, and advanced computations, to advanced statistical and data
mining software applications. These tools would need flexible access to data. This would
include both the data warehouse and external data sources. A ‘behaviors mart’ developed in
support of production models would also be of significant benefit in model development by
serving as a source of standard behavioral measures at an atomic level. For a financial
institution, this mart could include facts such as percentage of different transaction types
handled by different channels, number of months since last late payment, monthly variance of
deposit balances, and numerous others. Leveraging these measures will improve productivity
and consistency for modeling activities, and will facilitate the movement of models into a
production environment.
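A minimal sketch of how such standardized behavioral measures might be computed, assuming invented input tables and measure names (channel mix percentages and months since last late payment):

```python
import pandas as pd

# Populate a 'behaviors mart' with standardized measures at the customer
# level. Inputs, measure names, and the reporting month are illustrative.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "channel":     ["branch", "online", "online", "atm", "online"],
})
payments = pd.DataFrame({
    "customer_id":     [1, 2],
    "last_late_month": ["2023-09", "2024-01"],
})

# Percentage of transactions handled by each channel, one row per customer.
channel_mix = (txns.groupby("customer_id")["channel"]
                   .value_counts(normalize=True)
                   .unstack(fill_value=0.0)
                   .add_prefix("pct_"))

# Number of months since the last late payment, relative to a reporting month.
as_of = pd.Period("2024-06", freq="M")
payments["months_since_late"] = [
    (as_of - pd.Period(m, freq="M")).n for m in payments["last_late_month"]
]

behaviors_mart = channel_mix.join(payments.set_index("customer_id"))
print(behaviors_mart)
```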
Implementation requires that the results of the strategy development be packaged and
transmitted to the point of execution. If the result is a direct mailing, the identifying
information for the individuals being contacted and the specifics of the message/offer must be
transmitted to whoever will be doing the fulfillment. If the result is a pricing change, any
communication of the changes must be implemented through a communications channel, and
the pricing data must be updated in the appropriate system. The architect must understand how
communication and data transfers need to take place to make the execution of these business
rules changes as quick and accurate as possible.
A potentially critical piece of the puzzle is the operational data store, for storing near real-time
data on significant events and behaviors. While many now are more inclined to integrate this
information into the data warehouse itself as part of an ‘active data warehousing’ paradigm, the
key is not so much where it is but how it can be used. Depending on the degree to which this
information needs to be integrated with other data warehouse information, it may be sufficient
to have a separate ODS which can be dynamically linked to the data warehouse via a federated
middleware scenario. Integrating this data directly into the data warehouse itself provides for
tighter coupling of information and processes and allows for more robust and better performing
integration of current data with historical context.
The ODS can actually serve multiple duties. From an analytical information process
perspective, it will support the measurement of behaviors in the final sub-process, allowing
quick feedback as to the effectiveness of the strategy and actions and the ability to assess and
re-apply learnings. This requires that the ODS have access to tagged sets of customers or
accounts to allow the behaviors to be measured for those specific subsets. It also requires that
all ODS data be consistent with data warehouse data both in terms of completeness and data
definitions. For specific operational reporting requirements, small ‘oper-marts’ can be
extracted from this data to simplify specific types of regular reporting processes. From an
operational perspective, the ODS can be used to collect and filter events that can drive event-
triggered operational information processes.
Delivery of operational reporting for the fifth sub-process may be effectively implemented by a
report management and distribution infrastructure. Some reporting packages allow highly
formatted reports to be created, stored in a library, and then distributed either through a push or
a publish-subscribe scenario via the web. User IDs will limit what any individual user has
access to. Because of the inability of users to effectively interact with data using this scenario,
it is generally good for simple, repetitive processes like tracking of behaviors.
When considering tools, the important thing is to understand the implicit mapping of tools to
user segments and activities, which will drive the critical capabilities and usability parameters.
Always evaluate a tool on the subset of capabilities and characteristics that are important to the
segments and activities for which it is intended to be used, not necessarily for its broader
spectrum of capabilities. For example, you may have a list of 50 capabilities that may be
incorporated into a query tool. You look at who will be using it (based on user segment), and
what types of activities will need to be executed. This will drive the specific capabilities that
are actually relevant, and also allow you to weight them as to their relative importance. If it
has been decided to go with separate OLAP and query tools, then you do not need a query tool
with OLAP capabilities. The OLAP capabilities would be weighted zero for that evaluation,
and a separate OLAP tool evaluation would be undertaken. However, if interoperability of
OLAP and query tools is essential due to the manner in which those users will be working
together, then it may be necessary to actually select a single tool to do both, or to select a
single vendor suite that encompasses both to ensure seamless integration.
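A minimal sketch of this weighting idea, with invented capability names, weights, and ratings (the zero weight on OLAP reflects the separate-tool decision described above):

```python
# Capability-weighted tool scoring: weights come from the user segments and
# activities the tool must serve; irrelevant capabilities are weighted zero.
weights = {"ad_hoc_sql": 5, "star_schema_support": 4, "olap": 0}
tools = {
    "Tool A": {"ad_hoc_sql": 8, "star_schema_support": 6, "olap": 9},
    "Tool B": {"ad_hoc_sql": 7, "star_schema_support": 9, "olap": 2},
}

def weighted_score(ratings: dict) -> float:
    total = sum(weights.values())
    return sum(weights[c] * ratings.get(c, 0) for c in weights) / total

for name, ratings in sorted(tools.items(), key=lambda t: -weighted_score(t[1])):
    print(f"{name}: {weighted_score(ratings):.1f}")
```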
The key to tools evaluation is that you evaluate tool scenarios (i.e., plausible combinations of
tools that would cover the spectrum of your requirements), rather than just individual tools in
isolation. This will ensure that all needed capabilities are covered, and that tool
interoperability is appropriately considered. This way you could readily identify suite benefits:
Individual productivity
Data sharing and transfer capabilities
Match to processing requirements
It will also allow you to look at the overall cost associated with the whole suite, which would
include:
Combined infrastructure
Licensing
Training
Ongoing support
Here is a good generalized approach for mapping of business requirements to information
structures, which will then drive the tool selection process:
Business Requirement to Architecture mapping:
Identify high-level metrics that can be used to gauge the performance of the organization,
and dimensions that enable pinpointing of issues and targeting of accountability →
Multi-dimensional summary (cube)
For each metric, identify the key components (or drivers) at a high enough level to be
meaningful, but at a low enough level to drive strategic and tactical actions →
Denormalized, pre-computed, summarized information
For all entities, determine the key behavioral components that will impact their relationship
with the company and their resulting cost to serve, revenue stream, and profitability, and use
this to develop strategies → Normalized detail, integrated for key entities
The following diagram shows a possible information architecture that has components that
support each individual sub-process and user segment. It starts with managers and strategists,
who can access their data through scorecards, dashboards, and standard reports. These are
powered by cubes that enable extremely fast response times. Analysts can access the cubes, or
drill back even further to a star-schema metrics mart, which allows much more flexibility in
terms of the conformed dimensions used and number of metrics accessible. This also supports
drill-through back to the normalized data warehouse, or virtual data mart views. Finally, a
normalized data warehouse is leveraged for modeling and complex data manipulation.
[Figure: Sample Segment/Activity Focused Information Environment. An enterprise warehouse
of normalized detail data feeds an analytics engine that generates all metrics and summaries
once the needed warehouse data is loaded. A ROLAP metrics/scores table surrounded by
standard dimensions supports filtering, aggregation, and analytical processing; cubes are built
around customer metrics and dimensions from the ROLAP structure, plus external data such as
program targets, response rates, and industry statistics; applications deliver dashboards and
standard reports. The figure annotates three perspectives. ETL perspective: production ETL
processes, plus production data collection, reporting, and presentation processes. Data access
perspective: relational access using any table as a starting point yields flexibility; ROLAP
access reaches metrics via any standard dimension, with drill-through to the warehouse
providing a 'virtual extended data mart'; cube access, geared towards repetitive data needs,
provides high performance and integrates non-customer/external data; applications distribute
data to completely non-technical staff, including senior managers, via a portal. Process
perspective: highly skilled technicians do detailed behavioral reporting, correlation searches,
'what if' analysis, and predictive modeling; skilled analysts research performance anomalies
and drill into root causes down to the individual account level; analysts use a simple interface
to evaluate programs or segments and identify 'cells' (intersections of dimensions) with
performance issues; managers get a highly visual, intuitive view of summary-level data to spot
areas warranting further research.]
Information Manufacturing and Metadata
What most Business Intelligence managers do not realize is that they are not in the
programming business – they are in the manufacturing and distribution business! There is no
conceptual difference between providing users with information versus providing consumers
with a broad suite of tangible or intangible products. Essentially, the BI/DW group collects raw
materials (data) from its providers (source application systems). It goes through a
manufacturing process which integrates and synthesizes the data to produce information
deliverables. It accumulates this information in a bulk warehouse, and can then pass this on to
different types of retail outlets (data marts and OLAP), organized around convenience and ease
of access. Finally, customers either access information directly through self-service delivery
channels (tools), or have information provided to them through value-added resellers
(programming specialists), who prepare and deliver spreadsheets, reports, decks, etc.
In spite of all of our advances in tools, the basic concepts behind how we approach the whole
process of information systems are often archaic. Data warehousing organizations often
develop their information as a series of threads. A project will be defined that identifies the
data elements that must be produced as output, and how they will be delivered. A group of
programmers is assigned, who will build a beginning-to-end process which handles all of the
required inputs, collects all of the needed data, pulls it together into temporary files, and finally
produces outputs. This is done independently of the other projects that are going on. What
kinds of problems does this lead to?
There is potential replication of effort across project teams within the IW
organization.
Processes produced could be inefficient due to touching the same data multiple times
across projects.
Process structures are often left to the discretion of the implementing team, and may
be inconsistent.
This is not conducive to implementing broad and consistent data quality checking
and correction.
Because many people are touching the same data, it is much more difficult to make appropriate
corrections to programs in response to input data changes. Also, we are leaving ourselves open
to inconsistencies across data repositories and even tables.
Manufacturing can inspire us about how things can be done better. First of all, the manufacture
of an end-product is not done in a vacuum by an isolated team. Manufacturing is focused
around maximizing production efficiency. It means taking raw inputs and producing sub-
assemblies, which can then be inventoried and used in building higher level sub-assemblies. In
this scenario, the key is designing reusable, general purpose components. When producing a
car on an assembly line, components are incrementally added until finally you have your
finished product. If you are building a coupe, sedan, and convertible of the same model, you
share as many components and assembly processes as possible on the same assembly line,
diverging only where necessary to support fundamental differences in the end products.
Having three different and independently developed assembly lines for the three automobiles
would dramatically drive up costs due to the proliferation of additional parts that have to be
inventoried, the additional people needed to operate the assembly lines, and the reduced
flexibility.
This analogy can be extended to cover synchronous, period-based EDW data updates (i.e., data
is updated for the same period across sources at the same time). The core processes which
produce information can also be organized into an assembly line, which progressively builds
information by combining atomic data element instances into increasingly more complex
information sub-assemblies. Again, our objective would be to reuse as much as possible across
processes to improve productivity and control costs, diverging only where there are
fundamental differences in the output information. Intermediate results are not only
permanently saved, but integrated into our data stores and described in the metadata repository.
This allows them to be easily reused across existing processes and leveraged as new processes
(supporting models, data marts, reporting processes, etc.) are developed. This gives you
economies of scale, eliminating the need to replicate that portion of the design and
development effort across the other product lines. It also promotes consistency, since if you get
it right once, it will be correct in every end-product of which it is a part.
When utilizing an assembly line to produce an end product, the production process is divided
among a series of discrete assembly stations. At each one, activities are performed in the
proper sequence and the end-product is incrementally built. The actions being done together at
the same assembly station are selected because of some natural connection. After the activities
of one station are concluded, quality is verified and the product moves on to the next station.
In ETL design, I refer to the individual steps as layers. As data passes through each successive
layer, value is added that moves the data closer to an information deliverable. Through the
organization of processes into layers, you are trying to achieve:
Grouping of like processes together
Minimization of number of external touch points
Elimination or reduction of repeated handling of data
There is no single solution for how things should be organized; it depends on the nature of
the end-to-end processing required for any given company. The following, however, is a
good general-purpose layering scenario that most Business Intelligence teams can adopt and
modify to organize their ETL processes (a minimal pipeline skeleton follows the list):
1. Data Acquisition, or the extraction of data from the operational environment and
transport of data to the analytical environment.
2. Data Commonization, or standardization across diverse sources
3. Data Calculation, or creation of meaningful new elements
4. Data Integration, or creating/validating data linkages and populating tables
5. Information Assembly, or summarizing across tables to create complex metrics and
aggregations
6. Information Delivery, or populating summarized/aggregated structures for quick and
easy access by end-users
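As a rough illustration (not a production design), here is a minimal Python skeleton of the six layers, where each layer consumes only its predecessor's output; the record fields and logic are invented:

```python
# Each layer is a function that depends only on its input, mirroring the
# assembly-line principle. Bodies are placeholders for real ETL jobs.
def acquire(sources):    return [rec for src in sources for rec in src]                       # 1
def commonize(records):  return [{**r, "code": str(r["code"]).upper()} for r in records]      # 2
def calculate(records):  return [{**r, "balance": sum(r["components"])} for r in records]     # 3
def integrate(records):  return {(r["system"], r["acct"]): r for r in records}                # 4
def assemble(warehouse): return {"total_balance": sum(r["balance"] for r in warehouse.values())}  # 5
def deliver(metrics):    print("mart row:", metrics)                                          # 6

sources = [[{"system": "A", "acct": 1, "code": "dd", "components": [100.0, 25.0]}],
           [{"system": "B", "acct": 7, "code": "sv", "components": [40.0]}]]
deliver(assemble(integrate(calculate(commonize(acquire(sources))))))
```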
The six layers are meant to drive how you structure your ETL processes and how you organize
your support team. At a high level, the flow of data would look something like this:
[Figure: Analytical Information Environment, a high-level look at the synchronous, periodic
ETL process flow. Operational data sources feed (1) data acquisition into the landing area;
(2) commonization and (3) calculation populate the initial staging area; (4) integration loads
the data warehouse alongside prior-month reference data; (5) information assembly and
(6) information delivery populate data marts and cubes, which users reach through data access
tools and platforms. Data quality monitoring spans the layers and records quality statistics in
the metadata repository.]
In this diagram, the process layers are shown in olive and the repositories (including metadata)
in aqua. Also shown are data quality monitoring, which integrates with the other layers, and
information delivery via data access tools.
In the section on the Information Value Chain, we identified a series of four components that
describe how value is added to information in the process of preparing information
deliverables. All integrational, computational, and structural value-add is actually produced
through the ETL process. While often excluded from traditional views of ETL, I include data
access tool and platform support within the final ETL layer, Information Delivery. The
following table cross-references the activities of each ETL layer with the value-add
components of the information value chain:
Data Acquisition (integrational value-add): interface with production systems and acquire
extracts; land extracts on the ETL platform.
Data Commonization: map disparate systems into common definitions and formats and
compute commonized values (integrational value-add); build and load common data staging
tables (structural value-add).
Calculation: compute new relevant data elements locally, within tables (computational
value-add); compute primary and secondary keys as necessary (integrational value-add).
Integration: populate primary warehouse tables with normalized data (integrational and
structural value-add).
Information Assembly: compute critical business metrics and measures globally, across tables
and time periods (computational value-add); compute conformed facts and conformed
dimensions for multidimensional structures, plus any summarization across dimensions
(structural value-add).
Information Delivery: populate marts and cubes (structural value-add); update tool semantic
layers and prepare reports, templates, and dashboards (BI tool value-add).
Let us look in more detail at the specific activities that you will need to incorporate into the
various ETL layers:
Data Acquisition
This is the most fundamental layer of processing, but in many ways it is the most critical.
Those responsible for acquisition must obtain data from all internal and external sources and
populate it into the decision support environment. This is the starting point from which
everything else progresses. There are three key responsibilities for this layer:
Developing and operating the processes that collect and transport data from production
application sources into the decision support environment.
Maintaining the lines of communication with the production staff so that they are aware of all
changes, and then altering metadata and passing information on changes to subsequent
processing layers. Note that this is the only layer that performs this external communication.
Maintaining data transport checks to ensure that everything is extracted and received in the
landing area completely.
It is the vigilance of these individuals that will determine the ability of the process to respond
to changes in data and to eliminate potential data quality problems. The output from this layer
is the loading of images of the extract data into the analytical environment landing area.
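A minimal sketch of such a transport check, assuming the source sends a manifest of control totals alongside the extract (the manifest format is an assumption, not a standard):

```python
# Confirm that what landed matches the control totals sent with the extract.
def verify_landing(manifest: dict, landed_records: list) -> None:
    actual_count = len(landed_records)
    actual_sum = round(sum(r["amount"] for r in landed_records), 2)
    if actual_count != manifest["record_count"]:
        raise ValueError(f"count mismatch: {actual_count} vs {manifest['record_count']}")
    if actual_sum != manifest["amount_total"]:
        raise ValueError(f"control-total mismatch: {actual_sum} vs {manifest['amount_total']}")

manifest = {"record_count": 2, "amount_total": 125.50}
landed = [{"amount": 100.00}, {"amount": 25.50}]
verify_landing(manifest, landed)  # raises if the extract arrived incomplete
print("landing verified")
```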
Data Commonization
This step is extremely important for large companies that have multiple legacy application
systems covering the same function, such as a bank built on acquisitions and mergers that may
have several deposit systems, but it is also useful for ensuring data consistency across different
functional applications. Data commonization refers to identifying identically purposed data
elements that may have different formats or code values on different systems, and converting
them into a single, unified format and definition. Note that commonization should only occur
when multiple systems can be mapped into data elements that mean exactly the same thing. If
there are any definitional differences, they should be mapped into different data elements to
ensure that downstream coding recognizes the differences. For example, in banking, one
deposit system might store month-to-date debits, while another might store statement-cycle-
to-date debits. My recommendation would be to keep them in two different data elements
rather than trying to combine them into a single one at this level. This gives the downstream
processes more flexibility in terms of how these differences should be handled. The basic
responsibilities of this layer are:
Developing and maintaining the processes that provide consistently defined and formatted data
that facilitates downstream activities.
Making sure any upstream changes are reflected in the outputs from this layer.
Performing data integrity testing to ensure that no quality issues were introduced within this
layer.
Note that the outputs from this layer will be stored in a set of staging tables along with all
information pulled directly from the production systems via the landing area.
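A minimal sketch of commonization, with invented code mappings; note that the two differently defined debit counters are deliberately kept as separate elements:

```python
# Identically purposed status codes from two deposit systems map to one
# standard code set. Mappings and field names are illustrative.
STATUS_MAP = {
    "sysA": {"O": "OPEN", "C": "CLOSED"},
    "sysB": {"1": "OPEN", "2": "CLOSED", "3": "DORMANT"},
}

def commonize(record: dict) -> dict:
    out = {"account_id": record["account_id"],
           "status": STATUS_MAP[record["source"]][record["status"]]}
    # Do NOT merge these: they mean different things on different systems.
    if record["source"] == "sysA":
        out["mtd_debits"] = record["debits"]
    else:
        out["cycle_to_date_debits"] = record["debits"]
    return out

print(commonize({"source": "sysA", "account_id": 1, "status": "O", "debits": 4}))
print(commonize({"source": "sysB", "account_id": 2, "status": "3", "debits": 9}))
```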
Calculation
Once data is in a common format, basic calculations can be executed. In the most efficient
scenario, a single pass is made through each table, during which all calculations that do not
require information from other tables occur. For example, in banking, the outstanding balance
of a loan may actually be comprised of a dozen different component variables, which need to
be numerically combined to provide the overall balance. Note that some of these calculations
will be necessary in order to compute keys and support subsequent integration and assembly
steps. Basic responsibilities of this layer are:
Developing and maintaining the processes that perform a series of basic, common calculations
that will be used downstream.
Making sure any upstream changes (i.e., changes to source data files, including new codes and
changed formats) are reflected in the calculation algorithms, and that changes to the algorithms
are communicated downstream.
Performing data integrity testing to ensure that all calculations have executed error-free.
Outputs from this step will generally be stored in the staging tables along with the
commonized data.
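A minimal sketch of the single-pass idea, with invented component fields for a loan record:

```python
# All within-table calculations for a loan record happen in one traversal.
def calculate_row(row: dict) -> dict:
    # Outstanding balance combined from its component variables.
    row["outstanding_balance"] = (row["principal"] + row["accrued_interest"]
                                  + row["fees_due"] - row["unapplied_payments"])
    # A key computed here so the integration layer can link tables later.
    row["loan_key"] = f"{row['system']}-{row['loan_no']:08d}"
    return row

loans = [{"system": "LN1", "loan_no": 42, "principal": 9500.0,
          "accrued_interest": 31.25, "fees_due": 15.0, "unapplied_payments": 100.0}]
staged = [calculate_row(dict(r)) for r in loans]  # one pass through the table
print(staged[0]["loan_key"], staged[0]["outstanding_balance"])
```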
Integration
The integration layer is where we populate the keys for all tables, and ensure that all linkages
of data, whether across entities within the current period or across time periods, work
correctly. This ensures referential integrity across the entire data warehouse. Note that
strategies for developing keys can vary sharply, depending on the nature of the input data. In
some cases, a single data element can be used as a key field with no modification. In other
instances, multiple data elements may be concatenated to form a compound key, or may be
used to calculate a new, non-intelligent key. Basic responsibilities of this layer are as follows:
Developing and maintaining the integration and cross-reference processes that compute key
fields.
Making sure any upstream changes are reflected in key fields.
Performing bi-directional tests to ensure that all keys are defined consistently across tables that
need to be joined, and that the current time period can be joined back properly to prior time
periods.
Loading data into relational tables in the data warehouse. At this point, data is loaded into the
live data warehouse tables.
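A minimal sketch of the bi-directional referential integrity test described above, using invented compound keys:

```python
# Every child key must resolve to a parent, and (where business rules demand
# it) every parent should appear in the child table. Contents are illustrative.
customers = {("CIS", "C100"), ("CIS", "C200")}           # compound natural keys
accounts = [{"cust_key": ("CIS", "C100"), "acct": 1},
            {"cust_key": ("CIS", "C300"), "acct": 2}]     # C300 is an orphan

orphans = [a for a in accounts if a["cust_key"] not in customers]
childless = customers - {a["cust_key"] for a in accounts}
print("orphan accounts:", orphans)        # fails the forward join
print("childless customers:", childless)  # fails the reverse join
```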
Information Assembly
The information assembly layer consists of any type of calculation, summarization, or
aggregation that is implemented across entities or across time periods. Note that assembly is
by nature a very broad term. Outputs from this process are the metrics, segments, and scores
that users will actually incorporate into their business processes. These information
deliverables are populated into the data warehouse, and may subsequently make their way into
data marts, OLAP cubes, and/or reports. The information assembly layer performs the
following functions:
Builds and maintains production processes that summarize and aggregate data, leveraging all
raw inputs and information sub-assemblies available throughout the information environment
to build information deliverables.
Makes changes to processes based on any changes to inputs or changes in user
requirements/specifications.
Validates all computations and summaries to ensure consistency with inputs and history.
This area is the most visible to users, and especially to senior managers, since it is responsible
for the calculation of KPIs and other business-critical metrics. Once this step is completed,
information is ready to be organized and structured for delivery.
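A minimal sketch of an assembly-layer metric built across time periods (a 12-month average balance), with invented data:

```python
from collections import defaultdict

# One row per customer per period, as it would come from the warehouse.
monthly_balances = [
    {"customer": 1, "period": f"2024-{m:02d}", "balance": 100.0 + m} for m in range(1, 13)
]

sums, counts = defaultdict(float), defaultdict(int)
for row in monthly_balances:
    sums[row["customer"]] += row["balance"]
    counts[row["customer"]] += 1

avg_12m = {cust: sums[cust] / counts[cust] for cust in sums}
print(avg_12m)  # a deliverable metric, ready to be populated into the warehouse
```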
Information Delivery
Information delivery optimally structures information and stores it in a location where users
can retrieve and manipulate it using their tools of choice. This includes information that is
delivered via:
Data marts/OLAP cubes
In-memory databases
Executive information systems, dashboards, and scorecards
Web-delivered reports
The manner in which information is delivered will depend on the type of information, the type
of users, and the type of usage. Delivered information is built from standard metrics and
dimensions that are calculated in the information assembly stage and stored in the data
warehouse. In some cases, those metrics may be supplemented by customized metrics prepared
specifically for a report or data mart. My recommendation is that this be avoided unless there
is absolute certainty that this information will never need to be shared. The same information
often needs to be delivered in multiple places to satisfy different user needs, and not having it
prepared and stored in the data warehouse or mart structures may limit flexibility and result in
replication and inconsistency.
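A minimal sketch of the delivery layer populating a pre-aggregated mart table from standard warehouse metrics, with an invented schema:

```python
import sqlite3

# Dashboards read the mart table directly, with no further joins needed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warehouse_metrics (region TEXT, product TEXT, revenue REAL);
INSERT INTO warehouse_metrics VALUES
  ('East', 'loans', 10.0), ('East', 'deposits', 4.0), ('West', 'loans', 7.0);
CREATE TABLE mart_region_summary AS
  SELECT region, SUM(revenue) AS revenue
  FROM warehouse_metrics GROUP BY region;
""")
print(con.execute("SELECT * FROM mart_region_summary").fetchall())
```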
Implementation of this discrete-layer paradigm can significantly improve the internal
efficiency of an information management department:
Each layer is dependent only on the inputs it receives from its sources, not on how
the sources operate.
Roles are precise and well-defined.
The manager of each layer is empowered to search for efficiencies across a wide
range of homogeneous processes.
The quality can be verified before and after each layer to ensure that no errors are
introduced, and the manager of each layer is held accountable for its outputs.
An extremely critical part of the whole development and deployment process is metadata.
Metadata actually serves a dual role, supporting both information users and information
developers. For information developers, metadata is used to manage the entire inventory of
data inputs, transformations, and intermediate results. Information that must be captured
includes:
Business definitions, in sufficient detail to allow users to completely understand a
data element, its applications, and its implications.
Transformation rules that dictate exactly how a data element is computed, along
with bi-directional derivation linkages between any data element and its
components.
A complete set of code values and formats.
In the context of the information manufacturing process, it is critical that not only the
deliverables that appear in the final repositories be documented, but also all of the
intermediate data elements. This supports the reuse of these data elements in downstream
calculations, and supports the continuity of the 'impact chain', which identifies the
information deliverables that are impacted when source data elements change.
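A minimal sketch of metadata entries that record derivation linkages so the impact chain can be walked; the fields and element names are illustrative, and the traversal assumes the linkages are acyclic:

```python
from dataclasses import dataclass, field

@dataclass
class ElementMeta:
    name: str
    business_definition: str
    transformation_rule: str = ""
    components: list = field(default_factory=list)  # upstream inputs

repo = {
    "mtd_debits": ElementMeta("mtd_debits", "Month-to-date debit count"),
    "avg_debits_12m": ElementMeta("avg_debits_12m", "12-month average debits",
                                  "mean of last 12 mtd_debits", ["mtd_debits"]),
    "activity_score": ElementMeta("activity_score", "Customer activity score",
                                  "model over averages", ["avg_debits_12m"]),
}

def impacted_by(source: str) -> set:
    """All deliverables downstream of a changed source element."""
    hits = {name for name, m in repo.items() if source in m.components}
    for h in set(hits):
        hits |= impacted_by(h)
    return hits

print(impacted_by("mtd_debits"))  # {'avg_debits_12m', 'activity_score'}
```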
In many cases, ETL development tools have their own metadata management. This allows
source data to be queried and profiled, with the statistical information about this data
incorporated into the metadata repository along with information on its definition and origin.
This profiling can show how categoricals are split among their discrete values, or show the
statistical distribution of a numeric value.
All information environments must be organized around a metadata repository to maintain
critical knowledge about the data. With an appropriate metadata and ETL development tool,
we can implement a highly efficient information engineering process. We start with an
information blueprint, which involves designing delivered information elements, identifying
the series of components that need to be collected or assembled to build them, and designing
the processes that synthesize them. This blueprint is then embodied in the metadata. Thus,
metadata should not just be a mirror of the development process; it should be the driver of the
development process.
When producing metadata for a project, you start with the business definitions for the
information deliverables that will be presented to end-users. From there, the information
engineering process works backwards to identify the various information sub-assemblies that
should be intermediate components of these deliverables. We start with all of the sub-
assemblies that are pulled together in the information assembly layer. As these are identified,
some sub-assemblies can be acquired from other processes and used as-is, and some will need
to be developed. These may also require a set of linkages that must be developed by the
integration layer, and inputs that must be computed in the calculation layer. Each layer then
sets the requirements for what it must receive from the prior layer.
However, this is just a first iteration. Once you get down to the acquisition layer, the data
elements you need may not be there, or may not be exactly what you wanted. At this point,
different sourcing scenarios are proposed, and their effects are bubbled up through the process
to determine the impacts. A negotiation process then ensues to determine the actual inputs to
be used and the actual hand-off points and sub-assemblies that will be developed.
Implicit in the definitions of the various information sub-assemblies is an information
manufacturing process that builds the information. Processing considerations may therefore
cause the level at which information sub-assemblies are defined to be changed, or may
necessitate additional intermediate information components. In conjunction with the definition
of data elements, high level process flows must be developed which identify inputs,
programs/processes, and outputs.
Once this is completed, the next phase is to completely populate the metadata repository with
all definitions, transformation rules, and derivation linkages. To ensure that metadata is
consistent with what is delivered, the metadata should directly serve as the programming
specifications for the data elements that are being produced. This applies to changed data
elements as well. The metadata should be rewritten to incorporate the latest
definitions, calculations, and components, and this should form the basis for any programming
changes.
Once metadata is captured, technicians from each implementation layer go through the process
of defining their roles and outputs to create the required elements. They then develop any new
transformation programs/processes, or plan changes to existing ones as defined in the
information flow documentation. Note that ideally, the tools used for ETL implementation
should be integrated with the metadata, so the transformation rules captured in the metadata
can be readily converted into the code for implementation.
“Real Time” or Active Data Warehouse ETL
In essence, a “real time” or active data warehouse is a hybrid of an operational data store and a
data warehouse. It supports the organization's current view, showing the most recent
transactions or events, while also supporting a historical view of longer-term trends.
I put real time into quotes because whether it is truly real time or just close to real time is a
matter to be determined based on planned usage and associated cost-benefit. In my experience,
real time has proven to be too expensive and of insufficient value to justify. When loading a
data warehouse, the cost per record loaded is generally inversely proportional to the number
loaded together. Therefore, low-latency micro-batching can be a highly cost effective
alternative to real-time loading of individual transactions.
The benefits of this type of scenario are substantial for a data warehouse mature enough to
take advantage of them. Daily or even intra-day status of sales, inventories, or transactions can
be tracked, and events can be detected and used to trigger quick responses.
Information can be dynamically provided to people accessing web sites according to not only
their historical accesses, but also their current clickstream!
By nature, this type of update scenario (which I refer to as low-latency asynchronous updates),
carries with it some additional complexities that we do not need to worry about in the
synchronous update scenario. Because it is asynchronous, you cannot cross-validate data in
different tables because they could be in different states, depending on what has or has not
been updated. This means that as you update, you can only validate data locally, or on the
specific record(s) you are adding or changing.
To provide a degree of control and confidence in the data similar to what my original process
provides, we need to modify it into a two-step process. The first is a data acquisition step,
which takes new data and adds it into the data warehouse. The second is a computational step,
where metrics are computed and aggregations/summaries are made. In this second step, data
can be “checkpointed” so that cross-table synchronization and data validation can take place.
This cross-table synchronization could include referential integrity verification, as well as
consistency checking of data across related tables. Results can be noted in data quality reports,
and can optionally result in questioned data being removed from production tables and placed
into “suspense” tables to be researched before being returned to production. Once appropriate
data corrections are made, assembly of key metrics can proceed, and then data aggregations
and summarizations can occur in preparation for delivery. The following two diagrams show a
high-level look at the two stages of processing of low-latency asynchronous data feeds.
Stage 1 Processes
Low-latency data can come from application systems in one of two ways:
1. It can be pre-consolidated into micro-batches at frequent intervals, which can be processed
and loaded as batches. This is represented by the path starting with (1b).
2. It can be placed into queues where it can be subscribed to and read. Data from a queue can
then either be direct loaded (the process-and-load step incorporates all commonization,
calculation, local validity testing, and loading into the database) or else collected into a
micro-batch in the landing area and then batch processed and loaded (see the sketch below).
This is represented by the path starting with (1a).
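A minimal sketch of path (1a)'s collector, flushing a micro-batch when either a size or a latency threshold is reached; the thresholds are illustrative tuning parameters:

```python
import time
from queue import Queue, Empty

# Read transactions from a queue; flush a micro-batch on size or latency.
def collect_micro_batches(q: Queue, max_rows: int = 500, max_wait_s: float = 5.0):
    batch, deadline = [], time.monotonic() + max_wait_s
    while True:
        try:
            batch.append(q.get(timeout=max(0.0, deadline - time.monotonic())))
        except Empty:
            pass  # latency threshold reached with nothing new on the queue
        if len(batch) >= max_rows or time.monotonic() >= deadline:
            if batch:
                yield batch  # hand off to the batch process-and-load step
            batch, deadline = [], time.monotonic() + max_wait_s

q = Queue()
for i in range(3):
    q.put({"txn_id": i})
print(next(collect_micro_batches(q, max_rows=3)))
```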
[Figure: Analytical Information Environment, a high-level look at the low-latency
asynchronous ETL process flow, stage 1. Path (1b): micro-batches from operational systems
are acquired into the landing area, then flow through (2) commonization and (3) calculation in
the initial staging area before the (4a) load into the data warehouse. Path (1a): queued
transactions are either collected from the queue into a micro-batch, or processed and loaded
directly (steps 1/2/3/4a combined). Data quality monitoring records quality statistics in
metadata; users reach the warehouse through data access tools and platforms.]
Stage 2 Processes
This stage is where the asynchronously received data is checkpointed to verify internal
consistency:
1. Bi-directional joins are tested to ensure referential integrity.
2. For related data in related tables, data cross-checks can be implemented to verify
consistency.
Once data quality and consistency are confirmed, data assembly and delivery steps can be
performed as in the synchronous, periodic update scenario.
[Figure: Analytical Information Environment, a high-level look at the low-latency
asynchronous ETL process flow, stage 2. (4) Referential integrity and cross-table validation
run against the data warehouse and prior-checkpoint reference data, feeding data quality
monitoring and quality statistics in metadata; (5) information assembly and (6) information
delivery then populate data marts and cubes for users via data access tools and platforms.]
Data Quality Concepts and Processes
Data quality is something that many talk about, but few are highly effective at. Just as with the
manufacturing of consumer products, quality orientation must permeate the entire information
manufacturing process. This means that all involved, from beginning to end, must think in
terms of quality as a primary objective.
Unfortunately, quality is often the least understood and lowest priority aspect of information
delivery. Information is a tenuous concept, and quality of information is even more tenuous.
Many organizations do not have even basic quantitative operational quality metrics such as
defect counts or defect rates, let alone a true understanding of the cost of poor quality or the
impacts of defects on the business value of the information deliverables. Thus, it is much easier
for a data warehousing group to concentrate on measurable things like delivery dates, number
of data elements, or number of terabytes.
The big question on data quality is: How do you quantify its impacts? It is only through
quantification of the benefits that are at risk that you can determine what an appropriate level
of expenditure is to identify and correct data quality problems. Let’s start by looking at the four
different ways in which data quality problems can cost an organization real dollars:
They may prevent you from applying information in specific ways that would
generate profitability, either based on known data problems or on a general lack of
faith in the data based on prior problems and perceived poor quality.
The process of correcting the data quality problems may delay implementation or
execution of critical information processes, which may defer the benefit stream and
reduce its long-term value.
They may incur a significant repair cost on the back end by forcing users to install
temporary (or even permanent) workarounds into their processes.
They may impact the accuracy of the business rules derived from analytical
information processes, which reduce profitability by causing you to take incorrect
actions.
The worst case is having data problems that nobody knows about. These can cause your
analytical and/or operational information processes to work incorrectly, thereby reducing or
even totally negating their benefit. For example, flawed customer data and/or householding
algorithms can cause a customer’s relationship to be misrepresented in the data warehouse.
This can cause a top-tier customer to be treated like a bottom-tier customer, and can result in
increased attrition. Likewise, it can cause customers to get the wrong direct marketing
solicitations sent to them. It can cause pre-approved loan offerings to be made to customers
who have already defaulted on loans, or cause top prospects to be bypassed in a marketing
campaign.
When devising an integrity strategy, you must approach it from a business perspective. You
must weigh the benefits in terms of increased information reliability (or at least being able to
identify problems before they negatively impact your processes), versus the costs incurred in
implementing and operating the data quality checks. It is never a matter of whether or not to
build in data quality checking, since there will always be critical data elements that are worth
checking no matter what the cost. Rather, the decisions to be made will generally consist of
what to check and where in the information assembly line to check it.
The way to assess the downstream organizational costs of potential data integrity problems is
to look at each data element independently. There are two key factors that must be assessed:
the aggregate value dependent on that data element, and its associated risk factors. The
following figure illustrates the error propagation path.
To determine aggregate business value at risk relative to an information deliverable, we must
link the data element to all of the analytical and operational information processes that are
dependent on it. From there, it is necessary to estimate the impact of a data integrity problem
on the ability to produce value from that process. This takes into account four factors:
The total amount of profitability generated by the business process if all of the
information end products upon which it is based are correct.
The sensitivity (rate of decrease) in business value associated with moving from an
optimal to a sub-optimal decision
The sensitivity of decisions to changes in the information end products upon which
they are based.
The sensitivity of the information end products to changes in the inputs that are used
to calculate them.
Error Propagation Chain:
Error condition: invalid or unexpected inputs, or interactions between data and ETL processes
→ Delivered information: deviations in data elements populated to the warehouse and marts
→ Information end products: incorrect results from end-user data preparation processes
→ Decisions/actions: sub-optimal decisions or actions
→ Enterprise value added: reduced business value!
This applies to both analytical and operational information processes. In an analytical process,
data quality issues could cause invalid business rules to be derived, which will reduce the
benefits of operational information processes. In addition, poor quality of inputs into the
operational information processes themselves will also diminish their value, even if the
business rules are perfect.
Understanding how defects propagate through processes allows you to understand and estimate
value at risk. Data defects ultimately impact the outputs from user processes (information end
products), which then have a cascading effect on the decisions made and resulting business
value. In some cases, changes in inputs will have minimal impact on the business value of the
back-end decisions. These are low impact elements. High impact elements are those that will
yield a significant change in the resulting decisions and business value. This relationship can
be referred to as the value sensitivity associated with a data element, which is the rate at which
value is lost from a business process as that data element diverges from correctness. For
example, assume you have 100,000 customers, of varying profitability levels. 10,000 of the
highest profitability customers (over $1,000 per year) should qualify for high tier service,
which costs the company an additional $100 per customer per year. If data problems caused
you to erroneously target an additional 5,000 low profitability customers to receive the high-
tier service, $500,000 would be spent unnecessarily. If data problems caused the 10,000 high
profitability customers to be targeted for low tier service and attrition increased by 5%, we
could lose $500,000. $500,000 is the value at risk for those two defect scenarios. To fully
assess value at risk, you must estimate value at risk across the entire set of decisions being
made on the basis of the information end product and of the full spectrum of defect scenarios.
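A minimal sketch of totaling value at risk across the two defect scenarios above (the probabilities are invented for illustration; the text itself quantifies only the dollar exposures):

```python
# Value at risk per defect scenario, following the worked example above.
scenarios = [
    {"name": "5,000 low-profit customers wrongly upgraded",
     "cost": 5_000 * 100, "probability": 0.10},            # $500,000
    {"name": "10,000 high-profit customers downgraded, 5% extra attrition",
     "cost": int(10_000 * 0.05 * 1_000), "probability": 0.10},  # $500,000
]
for s in scenarios:
    print(f"{s['name']}: value at risk ${s['cost']:,}")
expected_loss = sum(s["cost"] * s["probability"] for s in scenarios)
print(f"probability-weighted exposure: ${expected_loss:,.0f}")
```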
Note that there may actually be multiple intermediate levels of information subcomponents
between the initial inputs from source systems and the information deliverables. Deviations in
subcomponents may either originate at that point due to faulty logic, or may be the result of
incorrect inputs to the computations being correctly processed. These deviations will
propagate as variances in delivered information, which ultimately impact the information end
products (via user processes) and the business processes they drive. The relative rate of change
of delivered information versus its components can be thought of as the computational
sensitivity. For example, profitability is computed as revenue minus expenses. For a slight
percentage change in revenue, there is actually a multiplicative effect on the percentage change
in profitability. This corresponds to a profitability calculation having a high degree of
computational sensitivity with respect to revenue. If we look at the average of 12 months of
balances, the sensitivity of the average to a deviation in the current month is only about 8%
(1/12). The product of the computational sensitivities of the sequence of information
subcomponents leading to an information deliverable, multiplied by the computational
sensitivity associated with the user process creating the end product, yields the aggregate
sensitivity of the information end product to that input element.
For any data element, the integrity risk factor is an estimation of the probability that the data
element, as delivered to end users, will diverge from a correct and usable form. Note the
inclusion of usability in this definition, since receiving correct information with unexpected
formatting or alignment differences may be just as bad as receiving incorrect information. For
a data element such as a balance that passes through virtually unchanged from the source
system to the data warehouse, the integrity risk of the deliverable is approximately equal to the
integrity risk of the source element. For a computed data element, it is the sum across all
inputs of the integrity risk associated with that input multiplied by the computational
sensitivity of the computed data element with respect to that input, plus the risk factor
associated with faulty computational logic.
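A minimal sketch of this propagation arithmetic, with invented sensitivities and risk factors:

```python
# Aggregate sensitivity is the product of computational sensitivities along
# the chain of sub-assemblies, times the user-process sensitivity.
chain_sensitivities = [1.0 / 12, 3.0]   # a 12-month average, then a 3x leverage step
user_process_sensitivity = 1.0
aggregate_sensitivity = user_process_sensitivity
for s in chain_sensitivities:
    aggregate_sensitivity *= s
print(f"aggregate sensitivity: {aggregate_sensitivity:.2f}")  # 0.25

# A computed element's integrity risk: sum of input risks weighted by the
# element's sensitivity to each input, plus the risk of faulty logic.
inputs = [{"risk": 0.02, "sensitivity": 3.0}, {"risk": 0.01, "sensitivity": 1.0}]
logic_risk = 0.005
element_risk = sum(i["risk"] * i["sensitivity"] for i in inputs) + logic_risk
print(f"computed element integrity risk: {element_risk:.3f}")  # 0.075
```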
Let’s first look at integrity risk that originates at the point of data capture. This is related to the
nature of the data validation that takes place upon input. One factor impacting validation is the
degree of granularity with which data is input. For example, granularity is increased when you
have separate data elements for city, state, and zip code, versus a free form address line. These
elements can then be individually verified for correctness. Risks here can be estimated in
either or both of two possible ways. Existing data anomalies can be identified within the data
stores by programmatically analyzing the contents of individual data elements. In addition, an
assessment can be made of the input controls to determine what types of errors are possible.
Generally, data elements used in actual operations (such as deposit balances or fee totals for a
bank) will be very reliable. Data elements captured for informational purposes only will
generally have much higher integrity risk, since there is often neither the focus nor the
capability to ensure correctness.
Many data quality issues are introduced after data acquisition. These generally consist of:
Errors in design assumptions that cause erroneous outputs for unexpected inputs.
Errors in coding that cause erroneous outputs for infrequently occurring/untested
input combinations.
Modifications in data element content from source systems for which ETL
programming changes were not made or were made incorrectly.
Incomplete transport of data due to hardware or software issues
Omission or inaccessibility of records due to processing errors or key duplication
problems
The point to remember here is that the more transport and transformation stages something
goes through, the higher the probability of a problem. Every time data is touched in any way,
there is the possibility of an error being introduced. This likelihood increases as complex
calculations are made such as customer profitability, which may pull in a wide range of data
elements from many sources.
Data quality assessment is almost as much art as it is science. While it is possible to recognize
data that is wrong, it is impossible to guarantee actual correctness of data. This requires that
you have access to the right answers to compare things to, which is generally not the case.
However, there are three things that you can assess to increase your comfort level with data
and verify that it is probably good:
Consistency with sources, to determine that no additional problems have been
introduced in any intervening transport or transformation function.
Plausibility, which pertains to whether any individual value encountered is a
possible one based on the business rules associated with that data element. For
example, negative interest rates are not plausible, nor are rates above legal limits.
Reasonableness, which pertains to whether or not the value is consistent with history
and with other values of related data elements. For example, a 10-fold increase in
balances is not reasonable. It could be an error resulting from a shifted decimal
place.
When evaluating data for these three qualities, the strategy you use will vary depending on the
type of data you are assessing. I place data into four broad categories:
Freeform Text
This is data that is input into large text fields. While documented business rules exist for
populating these fields, there is often little or no programmatic checking done to ensure
that they are being populated properly. This type of data may include name, address, or memo
fields. Generally, the only checks you can do on these are whether the field is populated
and whether it is correctly justified. In some cases, it may be
possible to check for the proper positioning of embedded substrings.
Dates
Dates should be checked for plausibility, to determine that they are of the correct
format (i.e., the month is 12 or less and the day does not exceed the maximum for that
month). In addition, they need to be checked for reasonableness. For example, a closed
date should be null if there is an open status and it should be populated with a valid
date if there is a closed status. In addition, logical sequencing must be assessed to
ensure that certain dates are prior to other dates as mandated by business rules. For
example, a closed date should be subsequent to the open date.
Categoricals
These are any data elements that place an entity into a category. This could be product
codes, status codes, etc. Note that categoricals must be included in a pre-defined set of
valid values, based on business rules. Therefore, plausibility checks will consist of
verifying that the value of that categorical is included in that list of acceptable values,
and verifying that it conforms to business rules relative to other related data elements.
An unknown value could mean either that the list needs to be updated or that bad data has
been introduced at the source. There are two types of reasonableness
checks that could be utilized. The first is to compare the categorical in each record
with the prior version, to determine that the change of state is valid according to
current business rules. The second is to look at the overall distribution of values across
the entire population to determine consistency with historical norms.
Measured Values
This pertains to any numeric quantity that communicates the activity or current state
associated with an entity. Using a banking example, this would include the current
balance associated with a deposit account, or the number of debits that were processed
last month. With measured values, the data from the source can be assumed to be
correct, since the operational world will monitor this for accuracy. The key concern is
to make sure that as data transport and transformation occur, errors are not
introduced. Plausibility checks can include decimal positioning and valid sign.
Reasonableness checking can include comparisons to other related data elements and
trending over time.
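As a rough illustration of how checks for these categories might be coded, consider the following sketch; the field names, status list, and 10-fold jump threshold are all illustrative assumptions.

```python
# Minimal sketch of plausibility/reasonableness checks by data category.
# Field names, valid values, and thresholds are illustrative assumptions.
from datetime import date

VALID_STATUS = {"OPEN", "CLOSED"}          # categorical: pre-defined valid set

def check_categorical(value):
    # Plausibility: value must come from the pre-defined list.
    return value in VALID_STATUS

def check_dates(open_date, closed_date, status):
    # Reasonableness: closed date null for open accounts, populated and
    # subsequent to the open date for closed accounts.
    if status == "OPEN":
        return closed_date is None
    return closed_date is not None and closed_date >= open_date

def check_balance(balance, prior_balance, max_ratio=10.0):
    # Plausibility: valid sign; reasonableness: no 10-fold jump versus
    # history (possibly a shifted decimal place).
    if balance < 0:
        return False
    if prior_balance and abs(balance) > max_ratio * abs(prior_balance):
        return False
    return True

print(check_categorical("CLOSED"))                                # True
print(check_dates(date(2020, 1, 5), date(2021, 3, 1), "CLOSED"))  # True
print(check_balance(125_000.0, 11_900.0))           # False: ~10x jump
```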
The most difficult type of data to verify is computed data. This is especially true when
evaluating data that was arrived at through a complex series of summarizations and
computations, such as customer profitability. The key here is to determine plausibility and
reasonability of individual information sub-assemblies, or the components that are used to
build the information finished products. This checking is continued until you finally check the
finished products.
Key fields for relational tables may be a combination of computed data and directly sourced
data. Note that keys have an additional layer of complexity, in that they have to be determined
independently for different entities, and yet still need to be consistent across entities. Individual
key elements can be checked for consistency, plausibility, and reasonableness, and then
matched across tables to ensure that the linkages function properly.
Trending in data validation is a very straightforward concept. Simply stated, it refers to the
modeling of historical data into a predictive function that is used to extrapolate the next data
point, and the measurement and interpretation of the degree of deviation of the actual value
from that predicted data point. Deviations above a threshold quantity or percentage qualify as
anomalies, and trigger an action. Although trending applies specifically to measured (or
computed) quantities or counts, trend information broken out by categoricals can be used to
verify the consistency of the categorical definitions across time.
Of course, while the concept is simple, implementation can be complex and many
considerations must be balanced. The first trending issue to be decided is whether to have
automated or manual trend evaluation. Manual (judgmental) trend evaluation is merely the
display of N consecutive historical values for a data element, which are then visually compared
to the current value. The person evaluating the data will then make a judgment call as to
whether the data falls within reasonable bounds or not. This method carries with it a high level
of subjectivity, labor intensiveness, and potential for human error. This may be a good
stop-gap, but it is definitely sub-optimal as a long-term approach.
Automatic trending is a mathematical approach which utilizes a simple predictive (or possibly
heuristic) algorithm for determining the probable next value in a time series, and a probable
range based on the historical volatility of the variable. The probable range is used to define one
or more thresholds or trigger points. Comparing the actual data value for that time period with
the trigger points will determine what level of data integrity problem exists and what potential
actions might be taken. The manner in which the historical data can be utilized to predict the
current value of data may be either simple or complex, rough or precise. All make assumptions
about the overall direction of data movement. Some of the possible prediction methods are:
Assume this month’s data will be approximately equal to last month’s data, so use
the prior value for comparison.
Assume this month’s data will be the average of the last N months
Assume linear growth and compute the slope of the line connecting the last data
month with the period N months prior, and then extrapolate forward
Assume linear growth and do a linear regression on the last N months, which can
then be extrapolated forward
Assume compounded growth, and compute the compound growth rate between the
last data month and the period N months prior, and apply this compound growth rate
to the current month
Identify annual cyclical patterns, superimposing this pattern on an annual growth
rate gleaned from analysis of the data, and use this to predict the next period
Prediction methods can be made increasingly complex, but often with rapidly diminishing
returns.
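Here is a minimal sketch of a few of the simpler methods from this list, applied to a short monthly series; the sample data and function names are illustrative assumptions.

```python
# Minimal sketch of some prediction methods listed above; the balance
# series is an illustrative assumption.

def predict_prior(history):
    # Use last month's value as this month's prediction.
    return history[-1]

def predict_average(history, n):
    # Average of the last N months.
    return sum(history[-n:]) / n

def predict_linear(history, n):
    # Slope between the last month and the month N periods prior,
    # extrapolated one period forward.
    slope = (history[-1] - history[-1 - n]) / n
    return history[-1] + slope

def predict_compound(history, n):
    # Compound growth rate between the last month and N months prior,
    # applied for one more period.
    rate = (history[-1] / history[-1 - n]) ** (1.0 / n)
    return history[-1] * rate

balances = [100.0, 103.0, 106.1, 109.3, 112.6, 116.0]
print("prior:", predict_prior(balances))
print("average:", round(predict_average(balances, 3), 2))
print("linear:", round(predict_linear(balances, 3), 2))
print("compound:", round(predict_compound(balances, 3), 2))
```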
Volatility is the propensity of a value to fluctuate around its statistical trend line. Volatile data
is subject to large swings on a continual basis, which makes it much harder to distinguish
normal fluctuations from data integrity problems. Volatility is the basis for defining
threshold/action-trigger points, and can be estimated by using variance calculations or by
averaging the absolute value of the displacement of each individual period value from the
modeled value for that period (trend line or curve). Understanding the natural variations in
values would allow you to establish ‘trigger points’ that warn you when variations exceed
historical norms.
Approaching it from the other direction, you could disregard actual historical fluctuation
measurement and determine trigger points based on perceived variability of the data element
and on potential impacts that fluctuations would have on delivered information and business
processes.
Trigger points could be defined to support an escalating series of actions based on potential
criticality of problems:
First trigger point (Yellow) would identify changes in values that are on the outer
fringes of normality or inner fringes of abnormality, and may or may not represent
potential problems.
Final trigger point (Red) would identify changes in values that are clearly beyond
normal and demand immediate attention.
By doing this, separate ‘red alert’ reports and ‘yellow alert’ reports or email alerts can be
generated, and specific manual review processes can then be built around these two lists.
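The following sketch shows one way volatility-derived trigger points could be implemented, using average absolute displacement from a naive prior-value predictor; the yellow and red multipliers and the sample series are illustrative assumptions.

```python
# Minimal sketch of volatility-based trigger points; multipliers and
# sample data are illustrative assumptions.

def volatility(history, predict):
    # Average absolute displacement of each value from its modeled value.
    devs = [abs(actual - predict(history[:i]))
            for i, actual in enumerate(history) if i > 0]
    return sum(devs) / len(devs)

def alert_level(actual, predicted, vol, yellow=2.0, red=4.0):
    # Classify the deviation of the actual value from the predicted one.
    deviation = abs(actual - predicted)
    if deviation > red * vol:
        return "RED"
    if deviation > yellow * vol:
        return "YELLOW"
    return "GREEN"

predict = lambda h: h[-1]                  # naive prior-value predictor
history = [100.0, 102.0, 101.0, 104.0, 103.0]
vol = volatility(history, predict)
print(alert_level(130.0, predict(history), vol))   # RED: far outside norms
print(alert_level(104.5, predict(history), vol))   # GREEN
```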
In addition to the type of trending and volatility measurement that is done, a critical design
decision is the level of granularity at which the trending is to be done. Trending can be done at:
The entity level
The application system level
The product, organizational unit, or customer type level
The account or customer level
Obviously, as you get to the lower levels of granularity for trending, the amount of work and
complexity go up. However, there is much to be gained by going to the lower levels of
granularity:
Errors that might cancel each other out at the high level are easily detectable at the
lower levels of granularity
The lower levels of granularity support customized trend analysis for different sub-
groups
Low levels of granularity can still be summarized for reporting at higher levels
The lower levels of granularity simplify research into problems
For example, if you are tracking trends at the application system level, you would not notice
that the balances for a specific product went dramatically up, with a corresponding decrease in
the balances for another product. For this, you would need to trend a summarization by product
type. Trending each account individually, identifying the ‘alert level’ (red, yellow, or green)
associated with each account, and then summarizing that at the file level (including the total
number of red, yellow, or green account categorizations) provides the most insight into what is
happening in the data. It is also possible to apply different trending mechanisms at the record
level. In banking, for example, business checking accounts might be trended and evaluated for
volatility differently from retail, and senior checking accounts in Florida (subject to large
seasonal fluctuations) might be treated differently from the rest of retail accounts. This level of
granularity also facilitates research, since a statistical profile can be made of the ‘red’ accounts
that show up in the report to research what might be causing the problem.
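A minimal sketch of account-level alerting summarized at the file level might look like this; the account data and percentage thresholds are illustrative assumptions, and a production version would use the volatility-based thresholds discussed earlier.

```python
# Minimal sketch of per-account trending summarized at the file level;
# account data and percentage thresholds are illustrative assumptions.
from collections import Counter

accounts = {
    "A1": ([500.0, 510.0, 505.0], 512.0),
    "A2": ([900.0, 905.0, 910.0], 9150.0),   # shifted decimal place?
    "A3": ([120.0, 118.0, 121.0], 119.0),
}

def account_alert(history, current, yellow=0.05, red=0.20):
    # Compare the current value to the prior value; simple percentage
    # deviations stand in for volatility-derived thresholds here.
    deviation = abs(current - history[-1]) / abs(history[-1])
    return "RED" if deviation > red else "YELLOW" if deviation > yellow else "GREEN"

levels = {acct: account_alert(h, cur) for acct, (h, cur) in accounts.items()}
print(levels)                      # per-account alert levels
print(Counter(levels.values()))   # file-level summary: counts by level
```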
Quality must be approached as an end-to-end strategy. It is only as strong as the weakest link
in the chain. The key is to understand where certain errors are most likely to occur, where they
can be most readily detected, where the data is most likely to be able to be fixed, and what
error detection mechanisms appropriately balance costs and risk. Note that the earlier in the
process a problem is detected, the easier it is to correct. Also, identifying problems later in the
process may entail significant rework if the problem has to be corrected in an earlier step, and
all downstream activities need to be redone. The following is a basic scenario for the types of
data quality checking that could occur in each information processing layer:
Data Acquisition
For Data Acquisition, there are three key objectives of data quality testing:
Identifying errors originating from the source systems, whether calculated or
user input.
Identifying valid data changes that will have downstream impacts as data
flows through the information assembly process.
Ensuring that all data is successfully assimilated into the decision support
environment.
This generally requires both plausibility and limited reasonability checking of all
critical data acquired. For all categoricals, every value must be compared with the
current listing of acceptable values for that data element. Exception reports should
identify any new categoricals introduced into the data (as well as any that disappeared),
for the purpose of determining if that is actually a valid change or represents an error
condition, and to determine what, if anything, needs to be done about it. This requires
coordination with all downstream ETL process managers.
To confirm that all records were transmitted successfully between the operational and
decision support environments, record counts must be generated both upon extraction
from the operational databases and upon loading into the decision support landing areas,
and compared to ensure completeness of transfer. In addition, control totals should be
compared to ensure that critical data elements are passed accurately.
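A minimal sketch of this count and control-total reconciliation follows; the figures and tolerance are illustrative assumptions.

```python
# Minimal sketch of extract-versus-load reconciliation; counts, totals,
# and tolerance are illustrative assumptions.

def reconcile(extract_count, load_count, extract_total, load_total,
              tolerance=0.005):
    issues = []
    if extract_count != load_count:
        issues.append(f"record count mismatch: {extract_count} extracted, "
                      f"{load_count} loaded")
    if abs(extract_total - load_total) > tolerance:
        issues.append(f"control total mismatch: {extract_total} vs {load_total}")
    return issues

# Counts and balance control totals captured at extraction and after load.
print(reconcile(1_000_000, 999_998, 5_432_100.25, 5_431_800.25))
print(reconcile(1_000_000, 1_000_000, 5_432_100.25, 5_432_100.25))  # []
```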
Data Commonization
There are two different data quality objectives associated with commonization:
Making sure the correct inputs are mapped into the correct commonized data
elements.
Making sure there are no records dropped or omitted.
The strategy used for checking will depend on the type of transformation that occurs.
For example, if you are mapping one set of categoricals into a different set, then a
plausibility check might be to verify that the combination of input and output is in a set
of valid combinations. A reasonability check might be to look at the distribution of the
transformed categorical in this data month, and compare it with the prior month.
For totals where only formatting was changed, plausibility can be verified by looking at
sign and number of decimal places, while reasonability can be verified by comparing
the pre- and post-transformation totals.
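Here is a minimal sketch of both commonization checks; the mapping table, product codes, and distributions are illustrative assumptions.

```python
# Minimal sketch of commonization checks; mapping table and codes are
# illustrative assumptions.
from collections import Counter

# Plausibility: each (source code, commonized code) pair must be valid.
VALID_MAPPINGS = {("CHK", "DDA"), ("SAV", "SAV"), ("MMA", "SAV")}

def check_mapping(source_code, common_code):
    return (source_code, common_code) in VALID_MAPPINGS

def distribution_shift(this_month, last_month):
    # Reasonability: compare this month's distribution of the commonized
    # categorical with the prior month's.
    cur, prior = Counter(this_month), Counter(last_month)
    return {code: cur[code] / len(this_month) - prior[code] / len(last_month)
            for code in set(cur) | set(prior)}

print(check_mapping("CHK", "SAV"))                       # False: invalid pair
print(distribution_shift(["DDA"] * 70 + ["SAV"] * 30,
                         ["DDA"] * 50 + ["SAV"] * 50))   # DDA up 20 points
```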
Calculations
In this phase, the primary objective is to determine that calculations are being done
properly. Calculations are somewhat difficult to verify, since there is nothing that you
can directly compare them to. Plausibility can be determined by checking formatting,
sign, and decimal points on numeric outputs, and checking output categoricals against
a list of possible values. Reasonability checking can be done by checking to make sure
that there is consistency between related data elements, which generally entails
verifying that the combination of data element contents that are being input to create
the calculation are covered by a business rule. This might involve determining if a set
of related categoricals represents a valid combination, if input values fall into correct
ranges, or if certain fields may or may not be populated based on the contents of other
elements. Unexpected combinations should be flagged as an error and those records
placed in an exception file, rather than being mapped into a default category. Any
mapping into a default category should be a decision that is made judgmentally, not
built in to the information manufacturing process. Additional quality verification can be
provided through trending analysis performed at various levels.
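The following sketch illustrates routing unexpected combinations to an exception file rather than defaulting them; the business-rule combinations and record layout are illustrative assumptions.

```python
# Minimal sketch of flagging unexpected input combinations instead of
# mapping them to a default category; rules are illustrative assumptions.

# Business rule: which (product, status) combinations may feed the calculation.
VALID_COMBOS = {("SECURED", "OPEN"), ("SECURED", "CLOSED"),
                ("STANDARD", "OPEN"), ("STANDARD", "CLOSED")}

def route(records):
    accepted, exceptions = [], []
    for rec in records:
        combo = (rec["product"], rec["status"])
        # Unexpected combinations go to an exception file for a judgmental
        # decision, rather than into a built-in default mapping.
        (accepted if combo in VALID_COMBOS else exceptions).append(rec)
    return accepted, exceptions

records = [{"product": "SECURED", "status": "OPEN"},
           {"product": "STANDARD", "status": "PENDING"}]
ok, bad = route(records)
print(len(ok), len(bad))   # 1 1 -- the PENDING record awaits review
```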
Integration
Integration, while the most difficult stage to implement, is actually fairly
straightforward to check. Generally, this involves answering two questions:
Are linkages across entities finding matches where they are supposed to find
matches?
Are the rows that they are being matched to actually the ones that they are
supposed to match to?
What this means is that any joins need to be verified bi-directionally to ensure that
everything that needs to match across tables does so correctly. While this is
conceptually the same as referential integrity, this does not necessarily mean building
referential integrity into the database load process. One problem is that we do not
necessarily want records to be rejected and potentially lost if there is not a match. The
key is to be able to identify those that do not match early in the process, to be able to
figure out why they did not match, and then to remedy the situation prior to the
continuation of production processing.
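A minimal sketch of bi-directional match verification follows; the key values are illustrative assumptions.

```python
# Minimal sketch of bi-directional match checking across two tables;
# key sets are illustrative assumptions.

def unmatched(left_keys, right_keys):
    """Return keys failing in each direction, before production continues."""
    left, right = set(left_keys), set(right_keys)
    return left - right, right - left

accounts = ["A1", "A2", "A3"]
customers_by_account = ["A1", "A2", "A4"]
no_customer, no_account = unmatched(accounts, customers_by_account)
print(no_customer)   # {'A3'}: account with no matching customer row
print(no_account)    # {'A4'}: customer linkage pointing at a missing account
```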
Information Assembly
Information assembly is the hardest process to check. Generally, the best you can do
here is reasonableness checking. The key is to continuously check the information sub-
assemblies as they are being manufactured. This allows you to pinpoint any specific
problems early in the processing where they are more easily detectable. A problem
which has a great impact on the value of a specific information sub-assembly may have
a fairly small but significant impact on the information deliverables that it rolls into, or
it may impact a small but important subset of the output deliverables. Checking only the
information deliverables themselves may not trigger red-flags, even though quality
issues exist.
Reasonableness checks can often be done by trending. In some cases, cross-element
checking can be done. For example, verification that two elements move in the same
direction across time periods can help to assure accuracy. Also, when trending totals, it
is often worthwhile to break out totals by various categoricals (product types, customer
types, etc). Often, problems that impact a small subset of the records that may be
invisible to the bottom line will be visible when looking a this type of meaningful
breakout of records.
Information Delivery
For information delivery, there is minimal additional processing done to data. It is
mainly an aggregation and transport process. However, there are still things that can
go wrong that must be specifically addressed. Records can be dropped as data
transmissions take place, or misinterpreted if the template through which records are
interpreted is different from the one through which they were written. If records are
being written to a repository, they can be rejected because of formatting errors,
duplicate keys, or referential integrity problems. The main check is to verify totals stored
or reported against the control totals associated with the information assembly process. For
cubes, we need to make sure dimension and metric totals
correspond with those in the atomic data.
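Here is a minimal sketch of verifying cube dimension totals against the atomic data; the rows and dimension are illustrative assumptions.

```python
# Minimal sketch of checking cube totals against atomic detail; rows and
# dimension are illustrative assumptions.
from collections import defaultdict

atomic = [("DDA", 100.0), ("DDA", 250.0), ("SAV", 75.0)]   # (product, balance)
cube_totals = {"DDA": 350.0, "SAV": 75.0}                  # loaded into the cube

recomputed = defaultdict(float)
for product, balance in atomic:
    recomputed[product] += balance

mismatches = {p: (cube_totals.get(p), recomputed[p])
              for p in set(cube_totals) | set(recomputed)
              if abs(cube_totals.get(p, 0.0) - recomputed[p]) > 0.005}
print(mismatches or "cube dimension totals match atomic data")
```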
Effective up-front validation will prevent expensive back-end cleanup. Often, the way that
data gets validated on an ongoing basis is that users get reports that look ‘funny’. A help-desk
request is placed to research the potential data anomaly. After a couple of weeks of research,
the root cause of the problem is uncovered. A request then has to be put in for the problem to
be corrected. In the meantime, the problem has permeated two months of tables, and has
dramatically impacted the company's ability to understand its customers. Compounding the
situation is that processes have already been executed and decisions made utilizing the invalid
data. The whole purpose of continuous checking of data is to find problems at the earliest
possible point, so that they can be corrected before propagating through the information
assembly line into the end-products. If a new product type has been identified or an error
introduced due to a systems conversion, this needs to be detected at data acquisition time. This
will allow corrections to be made to downstream processes to accommodate the changed data
before it turns into a data integrity problem.
Of course, it is not possible to fix all problems that arise. This becomes a business decision.
Certain problems may be sufficiently low impact that they are not worth the effort to fix.
Fixing other problems may involve significant delays in the overall process, which would be
more detrimental to the business than the problem data elements themselves.
The key here is to develop a decision-making process. Upon detecting any error in a data
element, an escalation procedure is invoked. This will involve the appropriate individuals to
determine whether to stop the assembly line, fix the problem, and live with the delays, whether
to do a partial fix or patch that might have less impact on the delivery schedule, or even
whether to just let it go. These decisions are very important ones, and must have the right level
of management involved who can make decisions that could involve substantial sums of
money or business risk.
For each data element, we can define four critical roles in the oversight process:
The data owner is the individual who serves as the primary contact for a data
element as it exists in its source system, or system of record.
A process owner manages a process that transports or transforms a piece of data.
The data steward is the individual who serves as the primary contact for a data
element from a decision support perspective. This person is responsible for the data
element in its various manifestations throughout a decision support environment,
and will be the ultimate decision maker in the event of any quality issues.
Concerned parties are individuals who have a vested interest in the accuracy and
timeliness of data, and have identified themselves as participants in the decision
making process relative to data integrity. They will provide input to the data steward
as to impacts on their specific reports or processes.
In the event of a data integrity issue, the data steward will be immediately contacted. The data
steward will then have the option of making an immediate decision, or involving any or all of
the users who have identified themselves as concerned parties for that data element. In
addition, the data owner may be engaged if the problem is actually with the source. Together
with the programmer or programmers involved, they will determine a course of action. From a
business perspective, the data steward will bear the responsibility for the outcome of this
decision.
Note that for this type of scenario to be possible, information management processes must be
highly flexible and transparent. This would allow for that specific piece of information to be
reprocessed and reloaded with a minimal amount of disruption. This has to be designed into
your ETL. When you are designing a car, you want to make sure that you do not have to take
the engine apart to be able to change the oil! Likewise, companies whose processes consist of
an entanglement of incomprehensible Cobol programs will probably not be able to respond in
real time to data problems. Companies that utilize high productivity ETL tools and who have
intelligently structured their transformation and transport mechanisms using a metadata-driven
development process should be able to do this effectively.
Information Planning and Project Portfolio Management
An Information Plan is the offspring of the strategy and architecture. The planning process
serves to devise a workable and cost-effective scenario for building out an infrastructure that
satisfies the business requirements. This is a roadmap that will identify not only the end-state
for the decision support environment, but also each intermediate state as a series of projects are
implemented that will achieve the intended objective. An information plan will start with the
end-state:
Data elements/subject areas to be captured, stored, and delivered
A list of planned user-accessible data repositories and structures, including
warehouse, data marts, and OLAP cubes
Access tools/analytical applications
Mapping data elements/subject areas to repositories and tables
Business rules for linking data within and across repositories
Once the deliverables are identified, the work needs to be partitioned into projects. The
partitioning of work into projects is a critical part of the plan for two reasons:
Relative timing of deliverables can have a huge impact on the way end-user
processes evolve and the amount of benefit achieved
Scope of work included in different projects can have a huge impact on design and
interoperability
When planning BI implementations, it is critical to maintain a broad enough scope so that you
can capture all sub-processes and all participants in the targeted analytical information
processes. Bottom-up planning focused around satisfying tactical needs for specific projects or
products can dominate prioritization and resource allocations. Requests for single-function,
point solutions can yield sub-optimal results by neglecting impacts and interactions associated
with cross-departmental processes. Both of these can lead to process dysfunctions:
Discontinuities that prevent individuals from working together properly
Inefficient/ineffective processes that evolve to conform to the information available
Let’s look at how scope can impact projects. Finance comes to the business intelligence team
and requests assistance in reengineering reports. They complain that the reports are too labor
intensive. By looking at their outputs, the BI group is able to automate the process and prepare
the same output with 30% of the effort.
While that seems like a significant accomplishment, it was not necessarily the best approach.
The report generated by finance was for use by the marketing department. Marketing went
through this report and picked out a few numbers, which they then manually integrated into their
spreadsheets along with some other information that they had to pull. Had the scope been
larger, the BI group would have understood the bigger picture of the project and recognized the
actual information end product of which this was merely a component. They would then have
reengineered the entire process to produce the marketing end product, automating the preparation
and collection of data and building in appropriate human checks where necessary.
Here is an example on the operational side. One credit card group manages credit line
increases, while another manages credit line decreases. Both are proposed as independent
projects by their respective business units, and implemented as completely independent
processes. Does this make sense? Maybe or maybe not. However, it is critical to assess them
to determine if there are sufficient synergies and integration points that would make it
beneficial to pull both into the same project. It may be that they can both be implemented into
a common event-driven process where, depending on what event precursors might occur, the
customer may have a model dynamically executed to assess whether an increase or decrease
would be necessary. It may be that the dynamics are so different that integration would supply
no benefit. Credit line increases may only need to be done in batch on a monthly basis.
However, if high-risk customers need to be evaluated for decreases more frequently, or
possibly continuously triggered by events, then keeping them separate could make sense. At
minimum, however, these processes need to be synchronized and potential interactions
understood. Having both the credit line increase and decrease processes acting on the same
person could result in the worst case scenario of that person’s credit line bouncing up and
down as each process runs. The key is that effective analytics up front to understand the
dynamics of these processes will allow for better operational decisions to be made.
From a timing perspective, the relative timing of project completion may yield unexpected and
unwanted results. Those of us who have been on the user side realize that when life gives you
lemons, you make lemonade. Whatever IT puts out there, good, bad, or indifferent, users will
figure out a way to duct-tape it together to somehow do their jobs. The impact is that if you
deliver a partial solution with the assumption that you will later provide the remainder of the
solution, you may find that processes (even extremely sub-optimal ones) that have developed
around the partial solution are so deeply entrenched that there will be reluctance to adapt to the
latter stages of the complete solution. When you plan deliverables, make sure you consider the
timing so that you do not create your own adoption obstacles.
For example, a company may have both a data warehouse and a summarized data mart which
are used by a specific department. A project was planned to add a number of new data
elements to the data warehouse, and it was subsequently planned to add them to the data mart
also. Based on project and resource scheduling, the data was scheduled to be added to the data
mart about six months after the data was included in the data warehouse.
By the time the six months had passed and data was available in the mart, the information
analysts in the end-user department had already created their own processes to leverage the
information in the data warehouse. At this point, they did not see any reason to have to change
things, since they were running overnight in batch anyway so the difference in performance
relative to their existing process was not meaningful. The managers were already accustomed to
looking at things in a certain way, and since the processes already worked there was no reason
to change. This is in spite of the fact that it is a more effective usage of time for the managers
to interact directly with an OLAP tool and the information analysts to be spending their time
on more organizationally productive endeavors.
This effect can be minimized by reducing implementation lag times. If there is a separate staff
assigned to data marts, the two development efforts should definitely be done in parallel with
just slightly staggered delivery times. If a metadata driven development process is used, much
of the work will be in analysis and design, which should be done together. This should
populate metadata, from which automated functions will create much of the application.
Actual data will be required for unit testing and beyond. Synchronizing this development will
produce the two sets of deliverables sufficiently close together to discourage trying to do
inappropriate development from the data warehouse.
Another possible course of action is to store data into the warehouse in such a manner that it
can be integrated virtually into the data mart. This would allow the view that the end user
interfaces with to be identical or at least very similar to what the data would look like if it was
physically in the data mart. By doing this, there would be no rush to physically integrate, other
than for improved performance and reduced resource usage.
The bottom line is, taking a large amount of work and partitioning it out into projects is not
easy. Larger project scopes provide broader process coverage and better handling of
information interconnection points. The smaller the scope, the faster completion occurs and
the earlier benefits can start accruing. Faster delivery keeps people engaged, creates
excitement around the new capabilities, and creates value faster. The ability to stage deliverables
is extremely important to the success of a BI initiative. To do this, the BI manager must be
able to understand the interconnections between systems to support linkages between
information activities, and make sure the deliverables for each project include the critical
connection points with other projects. As the information plan is developed, you will find that
the manner in which the BI effort is subdivided into projects may be almost as important to the
success of the project as the creativity and insight applied in devising the BI end state.
Organizational Culture and Change
In my opinion, the ideal situation for any BI manager is to come into an organization that has
virtually nothing. In this situation, there is tremendous flexibility and opportunity to create
efficient and effective processes according to your vision, and to mold the culture and
information paradigms of the organization into an ideal balance.
However, in any organization where BI has been entrenched for an extended period of time, BI
has a life of its own. This is what I call the self-perpetuating information culture. Imagine a
carpenter who initially learned how to use just a hammer and nails. No screws, bolts, or glue –
just hammer and nails. He becomes extremely proficient with his limited tool set. The result
is that any problem he approaches will be based on applying the hammer and nail solution
paradigm, and when he looks for tools, what he is really looking for is just better hammers and
better nails.
Similarly, back in the beginning of the BI world, there was an initial set of information
deliverables. Or maybe there was an initial set of individuals with a specific set of skills.
Nobody may remember how it started, but regardless of whether the chicken or egg came first,
what we have is a perpetuating lineage of chickens and eggs. The users with a specific skill set
will want their information in a certain way. End-user programmers will want information in
normalized tables that they can flexibly pull from. Star schema users will want to be able to
slice and dice their information through simplified interfaces. Each type of user will
continue to request information in the manner in which they are used to dealing with it.
Likewise, as information continues to be delivered according to that paradigm, people are hired
and skills developed in order to be able to handle that specific environment. Thus, a self-
perpetuating culture.
This culture will determine how projects are selected and prioritized. A culture dominated by
programmers will not care about simplified delivery, elegant data structures, or pre-computed
summaries. All they want is data. Their managers on the business side, who depend on them
for data (which further reinforces the self-perpetuating culture), will delegate to them to work
with the BI group to define projects. Therefore, in this type of culture, the BI manager will get
a laundry list of data elements, with instructions to “just put them out there, and make sure they
are right”.
A BI manager can approach this in two ways. He can breathe a sigh of relief because they
have made his life so easy, and just deliver the information as requested. Or he can delve
deeper to really understand the underlying information processes. Delving deeper can be
fraught with risk, and must be approached carefully. Programmers in business units can feel
very possessive about the processes that they develop, and they can be very threatened by
somebody who they feel may want to shake things up.
The BI manager must be able to trace the analytical information process flow to determine the
key actors and decision makers. He must be able to define a vision for what a new process
might look like. He must be able to communicate with both the business management and the
information practitioners. Most importantly, he must be able to articulate “what’s in it for me”
for all involved parties.
The business manager must understand the concept and ramifications of process. He must buy
into the notion of how value is added to information and how it is delivered most efficiently.
Both he and the programmers must also buy into the concept that by automating more
mundane data delivery processes, the programmers can spend their time on more value-added
types of activities, while the manager can get his data more easily and consistently through an
automated interface.
By far the most challenging problem occurs when processes cross organizational boundaries,
and the bulk of the cost and effort of change is borne by a different business unit than the one
receiving the bulk of the benefit. This requires a great deal of organizational/political savvy, in
conjunction with strong marketing, mediation, and negotiation skills. This must somehow be
reframed as a win-win situation, where the costs and benefits are more equitably distributed.
Foremost to remember, though, is the BI group is there to serve. It is more important to put
out solutions that will not blatantly clash with the prevailing culture and that will find
acceptance and be adopted, versus attempting to change the world (albeit for the better), but
winding up expending significant resources on something that will not be used. Processes in
many cases have to evolve incrementally. Systems and data structures change slightly; then
the staff will change slightly in response. Since the information users in the business units
have a huge amount of intellectual capital pertaining to the implementation details of the
existing analytical information processes, no strategy is going to work that does not enlist them
as partners.
However, if process change is in fact an objective of the organization, then both the business
and the BI group need to cooperate to make this happen. Cultural change and technology
change are both antecedents to process change. They provide an environment that enables and
nurtures change. Successful change requires that these be in alignment.
However, behavioral change will not take place without appropriate consequences that
motivate the change to occur. When changing information processes, there are two types of
consequences that are at work:
Innate consequences are impacts to the actor inherent in the new behaviors within
the context of the process being changed. These may relate to the ease or difficulty
of the new behaviors, and the perceived value-add to the quality of their work.
External consequences are impacts to the actor based on linkage of the new
behaviors to external penalties or rewards. These may be embodied in goals and
objectives incorporated by their managers into performance management systems, or
may be directly tied to incentive pay and bonuses.
It is important to understand that behavior changes may have unintended negative
consequences, and that even positive consequences must be carefully crafted to ensure that
they do not motivate unintended behaviors that undermine their true intent.
Behavioral changes tied to process changes require at minimum informal and possibly formal
action plans. All actors need to be identified, along with their intended behavioral changes. All
enabling antecedents must be determined for each person. Innate consequences then need to
be determined, both positive and negative. External consequences must be applied to
counteract any negative innate consequences or supplement any innate consequences that are
not sufficient to motivate change. By doing this for each person or set of homogeneous actors,
a plan can be initiated and monitored to ensure that change occurs as planned.
Change does not come easy. Overcoming inertia and resistance requires communication of a
vision that all involved parties can buy in to, and the down-and-dirty work of continuous
monitoring and persistent follow-up. This is what makes a leader.
Tactical Recommendations
Assuming you, as Business Intelligence manager, are now a convert and believe a process-
focused approach will improve the effectiveness of your organization and the enterprise as a
whole, what should you do now? While there are some agile, flexible, and progressive
organizations out there that can quickly adapt to a new paradigm, for most it will be a long,
hard slog. It is like turning the Titanic… keep pushing and eventually it will change direction.
In the interim, there are many tactical things that you can do to eliminate obstacles to the
development of efficient processes, even if you cannot directly engineer them:
Whatever KPIs or metrics senior managers use to direct the organization should be tied to
whatever analysis you perform. As you identify and evaluate strategies to change customer
behavior and improve performance, the key evaluation criterion is to approximate the
contribution to the KPI metrics that each change would generate. For example,
if risk-adjusted margin is one of the key driving metrics for managing a credit card portfolio,
each action you could take that would impact the cardholder base must be evaluated for its
potential contribution to overall risk-adjusted margin. If return on assets is a critical KPI,
understanding the contributions of specific customers and accounts to the overall return on
assets, and the impacts of changes in their behavior, requires the ability to calculate
return on assets at the individual
account level. As you plan campaigns, interest rate changes, or enhanced reward strategies, the
association of the organization's KPIs with individual accounts will directly connect your drill-
down to root causes and strategies for behavior modification back to the original performance
issues identified by the executives.
Of course, there are other benefits from defining common metrics and dimensions just once. It
improves consistency in information usage across and within business units and ensures
common information language. It also can have significant impacts on system load.
Calculations of these metrics may require the integration of information across a wide range of
subject areas and individual tables across the data warehouse, and can be extremely expensive
to run. By allowing users to ‘harvest’ pre-computed metrics rather than having to recompute
them from scratch whenever they are needed, you can achieve significant reductions in system
workload and improved turnaround times for the remaining workload.
Make sure common metrics and dimensions are defined
once (at most granular level) and shared.
You can look at an analytical information process as a mechanism whereby you continuously
drill from more generalized observations into more specific, actionable details. What this
means is that either implicit or explicit mechanisms for drilling into more detailed data must be
available. These drill mechanisms may be built into tools and largely transparent, or they can
be merely procedural, leveraging a common data language to allow the same selection criteria
that identified a specific organizational cell to be replicated to select individual accounts from
another data repository. The important thing is that each dimension that could be used for drill
back must be available (and computed uniformly!) in all repositories that could be utilized
together within the same analytical information process.
Data marts that exist solely for access by a specific tool can sometimes be an acceptable
answer, but in many cases introduce significant issues:
This could limit or even eliminate potential for joining data across repositories,
creating data and process discontinuities.
It could prevent other segments of users, or those with existing skillsets in other
tools, from effectively accessing this data, thereby creating an ‘island’ of
information.
It could force reengineering of the repository if the need arises to migrate from that
tool, either due to vendor problems or emergence of significantly better technology.
Before making a decision to implement this type of solution, make sure you understand the
process and data interoperability issues. Be sure to also look at more open solutions, to
determine whether slight differences in functionality can result in large improvements in
interoperability.
Design should include drill-back paths from dimensional
views back to detail wherever possible.
Avoid data that is tool specific.
Strategic Recommendations
Ideally, a partnership should exist between BI and enlightened managers that leverages
knowledge of processes to enhance the strategy planning process. The next step beyond
merely eliminating tactical process obstacles is active management and design of the analytical
information processes themselves as part of the strategy development process:
In reality, it all starts with the operational information processes. These are the processes that
actually create value for the organization, and are the ones that the business will focus on as it
prepares its strategy. Changes to these processes may be mandated by external requirements,
or they may be desired because of anticipated positive impacts they will have on the business.
Either way, the nature of the changes in operational information processes will drive the types
of new or enhanced business rules needed, which will then determine the needs for the
supporting analytical information processes.
Because processes may span multiple business units, they should be looked at across business
units instead of within business units. Even if the planning processes are independent, the
interconnection points between the business units need to be identified and planned for. Make
sure that as you plan, you consider:
New and enhanced information end-products and the information deliverables from
the BI environment needed to support them
Changes in information activities and in roles of different segments, and
implications for training, staffing, and tools
Changes in process/system interconnection points and communication media
From there, you will need to work backwards:
Develop strategic information plans within and across business functions according to a process-focused
future vision.
Leverage target vision for analytical information processes to drive information strategy, architecture,
and design.
As was discussed in the section on BI planning, you start with the metrics that will be needed
to support your strategic information plan, and the other information end-products needed to
drive the supporting analytical activities. These will then need to be associated with user
segments that will be performing these activities. Once this is done, you can derive the set of
information deliverables and delivery technologies from IT to support the generation of the
end-products and execution of the surrounding processes. It is then necessary to identify any
new and enhanced architectural components needed to support this, and map out the projects
that will generate the appropriate information and structures to make this happen.
To ensure that these projects are appropriately funded and prioritized, you must:
Measure the effectiveness of BI and Data Warehousing technology based on the business value of the underlying operational information processes.
Under a process-focused planning scenario, you will have the linkages to business processes
needed to measure value. You can drill back from BI projects to the operational information
processes that they impact, and even back to the production/delivery and financial control
processes that the operational information processes impact. This trace-back allows you to
identify deltas in revenue and profitability which can then be allocated back to the BI projects.
In general, there are two ways of tying Business Intelligence projects to the broader strategy
deliverables. The first is to consolidate BI into an overarching operational project (it can be
implemented separately as a subproject), so that the BI costs and benefits are subsumed in the
overall project costs and benefits. The prioritization of the overall project will drive the
prioritization of the BI sub-project.
If BI and associated operational changes are not inextricably tied together, then we need to
look at the marginal contribution to profitability generated by the enhanced BI capabilities.
This means that you look at the probable profitability of a project without the enhanced BI
capabilities added, and also with the enhanced BI capabilities added. The assumption is that
without the upgraded tracking and optimization capabilities afforded by Business Intelligence,
the operational process will not be as effective. The delta is the contribution to profitability of
the Business Intelligence project, so you can independently determine the return on investment
and prioritization of the BI initiative.
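A minimal sketch of this delta arithmetic follows; all dollar figures are illustrative assumptions.

```python
# Minimal sketch of the marginal-contribution arithmetic described above;
# all dollar figures are illustrative assumptions.

profit_with_bi    = 4_200_000   # projected profit with enhanced BI
profit_without_bi = 3_500_000   # projected profit without it
bi_cost           =   500_000   # cost of the BI portion of the project

bi_benefit = profit_with_bi - profit_without_bi   # delta attributed to BI
bi_roi = (bi_benefit - bi_cost) / bi_cost
print(f"BI benefit: ${bi_benefit:,}; ROI: {bi_roi:.0%}")   # $700,000; 40%
```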
Let’s look at an example. A credit card company is introducing a new ‘secured card’ product,
which allows people with sub-prime credit to have a credit card with a credit line secured by
the contents of a savings account. To support this, a number of BI enhancements need to be
implemented:
Data elements unique to secured credit cards (such as information on the linked
savings account) must be added.
Some specialized report templates need to be developed to manage the product.
Changes need to be made to high-level metrics to support this product.
In the consolidated prioritization scenario, the data warehouse work is integrated into the
overall project. A single go/no-go decision is made, which will include both the operational
and analytical work associated with this product. If the overall project meets the ROI
threshold, then the warehouse portion of the project will automatically be approved for
implementation. If the ROI of just the data warehouse portion is needed to determine sequence
of implementation within the data warehouse implementation queue, you can allocate a portion
of the total net benefits of the project to the data warehouse effort, possibly based on portion of
overall development cost associated with the data warehouse effort.
If you look at the projects separately, you will need to figure out how much of the anticipated
profitability generated by the secured card product will be lost if enhanced analytical
capabilities cannot be provided to appropriately monitor and optimize this portfolio of
accounts. This difference would be the benefit assigned to the BI effort, and would be used in
conjunction with the overall cost of the BI portion of the project to determine the ROI.
Note that an entire book could be written about project costing. There are numerous ways to deal
with the allocation of infrastructure costs, incremental DASD and processing, etc. Be sure to
include the entire costs of producing information end-products and not just the information
deliverables output from the information environment. This includes training and tool costs,
plus end-user development and operational efforts and the CPU/DASD needed to execute their
processes!
Finally, from a process perspective:
Continuously review end-to-end processes for efficiency and effectiveness, and optimize BI tools and structures to eliminate gaps and bottlenecks.
Continuous process improvement is extremely important to compete and win in this
marketplace. Those who use DMAIC (define/measure/analyze/improve/control) process
optimization methodologies such as Six Sigma will find that this Business Intelligence
paradigm fits very well into that framework.
Six Sigma can actually fit into this in two ways. The first is that Six Sigma can be used to
analyze the operational information processes to determine their degree of optimality and the
amount of opportunity that analytical information processes would have to improve their
performance. For example, a credit card company looking at fee waivers for late payment fees
may determine through Six Sigma that they are exceeding the percentage waivers of their
competition for similar products and customer segments, and therefore improvements in their
waiver strategy are necessary. This could result in a recommendation to provide additional
data to the BI environment or to make significant operational changes to the waiver
determination process.
Six Sigma could also be applied to the analytical information processes directly. It can be used
to track the flow of information as end-products are produced, and to identify gaps and
bottlenecks that are delaying and hampering the effectiveness of these processes. It can then
make recommendations as to how the analytical information processes can be reengineered to
eliminate the inefficiencies and be more effective.
The Relentless March of Technology
The BI technologist of today has an amazing array of technologies to capture and retrieve
information. The data warehouse is now just a part of the data reservoir and data lake.
Hadoop can be used to capture huge volumes of unstructured data. MPP technologies have
generated new design paradigms based on optimizing your data distribution and reducing
query spaces for finding your data, supplanting the need for the star schema databases that
were so effective for legacy database technologies. Fast data delivery through cubes has been
replaced by immediate availability of data via in-memory databases, which can pull your
answers from detail faster than a cube can retrieve summary information by dimension from
disk. The latest MPP platforms are expanding their parallel architectures to include columnar
and in-memory data storage to provide unheard of levels of performance, and the latest
federation technology can pull data from Hadoop, your data warehouse, or the cloud without
you being any the wiser as to where it is actually coming from.
New data sources exist that 15 years ago we would never have dreamed of. Social Networking
generates huge amounts of data that can be captured and mined to generate new insights and
help us better understand our markets relative to our product portfolio. Text, voice, and
unstructured data can be mined and can be combined with structured data to provide insights
beyond anything we could come up with before.
The immense data volumes available and speed of access will enable new business processes
that can dramatically enhance our ability to analyze and engage with customers and prospects.
Yet through all this, the basic concepts of BI remain unchanged – leverage information
resources to understand your business to optimize results, or understand your customers to best
engage with them at points of contact to drive their behaviors. Information processes still fall
within the same familiar patterns and structures, even as the information and technological
components are more sophisticated than ever. The business process models are not made obsolete
by the new technology – they are even more critical for pulling all these diverse pieces together
to give them purpose and meaning.
Conclusion
The concepts presented in this book are more directional than they are cookbook. I often think
of Business Intelligence as being as much art as science, and as much soft skill focused as it is
technology focused. Those who are planning and developing BI often have to work with
imperfect information and make decisions under uncertainty. They have to deal with people
with diverse and sometimes diametrically opposed needs and wants. They have to deal with
enthusiastic views of the future and vested interests in the past. The BI manager needs to know
when to drive, when to acquiesce, and when to be a diplomat. Change cannot be pushed on an
organization – it must be marketed and sold to an organization.
The BI manager will not be presented with problems where there is a definitive right or wrong
answer. He will select one of a broad range of possible approaches, and his effectiveness will
fall somewhere within a continuum from black to white encompassing all intermediate shades
of gray. Identical solutions applied in different situations could be effective or ineffective
depending on the context. The important thing is to be creative in coming up with ideas and
flexible in adapting to the needs and culture of the enterprise. In short, success comes to the
Business Intelligence manager who can somehow prod the rest of the enterprise into adopting
his ideas, so they can lead him in the direction he wants to go!