Data Warehousing Lecture XVI Dr. Javed Ali Baloch.

Outline

• Hybrid OLAP (HOLAP) or Desktop OLAP (DOLAP)

• The HOLAP Architecture

• HOLAP Development Issues

• Data Design & Preparation

Hybrid OLAP (HOLAP) or Desktop OLAP (DOLAP)

• HOLAP is meant to provide portability to users of OLAP.

• HOLAP provides limited analysis capability, either directly against RDBMS products or by using an intermediate MOLAP server.

• HOLAP tools deliver selected data directly from the DBMS or via a MOLAP server to the desktop (or local server) in the form of a data cube, where it is stored, analyzed and maintained locally.

HOLAP Development Issues

• The architecture results in significant data redundancy and may cause problems for networks that support many users.

• The ability of each user to build a custom data cube may cause a lack of data consistency among users.

Data Design & Preparation

• The DW feeds data to the OLAP system.

• In the MOLAP model, multidimensional databases store the data fed from the DW in the form of multidimensional cubes.

• In the ROLAP model, data is pushed into the OLAP system with cubes created dynamically on the fly.

• Thus, the sequence of the flow of data is from the operational source systems to the DW & from there to the OLAP systems.

• Why not build the OLAP system on top of the operational source systems?

– An OLAP system needs transformed & integrated data.
– An OLAP system needs extensive historical data.
– An OLAP system requires data in multidimensional representation.
– Different departments require data from different operational systems.

• The techniques for preparing OLAP data for a particular department, e.g. marketing:

– Define Subset: Select the subset of detailed data that the marketing department is interested in.

– Summarize: Aggregate the data in the way the marketing department needs.

– De-normalize: Combine relational tables in exactly the way the marketing department needs.

– Calculate & Derive

– Index: Choose the attributes that are appropriate for marketing to build indexes on.
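The preparation techniques above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the tables (sales, product, marketing_sales), the column names, and all values are hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical warehouse tables and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER,
                    region TEXT, quantity INTEGER, unit_price REAL);
INSERT INTO product VALUES (1, 'Bread', 'Food'), (2, 'Milk', 'Food'),
                           (3, 'Soap', 'Household');
INSERT INTO sales VALUES
  (1, 1, 'North', 10, 2.0),
  (2, 2, 'North',  5, 1.0),
  (3, 1, 'South',  4, 2.0),
  (4, 3, 'North',  7, 3.0);
""")

# Define subset (WHERE), de-normalize (JOIN), calculate & derive (revenue)
# and summarize (GROUP BY) in a single statement.
conn.execute("""
CREATE TABLE marketing_sales AS
SELECT p.category, p.name AS product, s.region,
       SUM(s.quantity) AS total_quantity,
       SUM(s.quantity * s.unit_price) AS revenue
FROM sales s JOIN product p ON p.product_id = s.product_id
WHERE p.category = 'Food'
GROUP BY p.category, p.name, s.region
""")

# Index: the attributes marketing is assumed to filter on most often.
conn.execute("CREATE INDEX idx_marketing_region ON marketing_sales (region, product)")

rows = conn.execute(
    "SELECT product, region, total_quantity, revenue FROM marketing_sales "
    "ORDER BY product, region").fetchall()
for row in rows:
    print(row)
```

The resulting marketing_sales table is the kind of department-specific extract that would then be loaded into the OLAP system.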

Data Warehousing Lecture XVII


Outline

• Data Mining

• Decision support progress to Data Mining

• Data Mining Defined

• The Knowledge Discovery Process

DATA MINING

• Data Mining is used in just about every area of business, from sales and marketing to new product development, inventory management and human resources.

• In today’s world, an organization generates more information in a week than most people can read in a lifetime. It is humanly impossible to decipher and interpret all that data to find useful information.

• Data Mining enables companies to find answers and discover patterns in their customer data.

Decision support progress to Data Mining

[Figure: decision support progresses through five stages. Early file-based systems (basic accounting data; no decision support), database systems (operational systems data; primitive decision support), the data warehouse (data for decision support; true decision support), OLAP systems (data for multidimensional analysis; complex analysis & calculations), and data mining applications (selected and extracted data; knowledge discovery).]

Data Mining Defined

• Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.

• Data Mining centers around the automated discovery of new facts and relationships in data.

• With traditional query tools, you search for known information. Data mining tools enable you to uncover hidden information.

• The assumption is that more useful knowledge lies hidden beneath the surface.

The Knowledge Discovery Process

• Data Mining discovers knowledge or information that you never knew was present in your data.

• The uncovered hidden knowledge manifests itself as relationships or patterns.

Data Warehousing Lecture XVIII


Outline

• Relationships

• Patterns

• Knowledge Discovery Phases

Relationships

• Suppose on the way home you visit the nearby supermarket to pick up bread, milk, and a few other “things”. What other things? You are not sure.

• While you fetch the milk container, you happen to see a pack of assorted cheeses close by. Yes, you want that.

• You pause and watch the next five customers also reach for the cheese pack. Coincidence?

• Now on to the bread shelf. As you get your bread, a bag of potato chips catches your eye. Why not get that bag of potato chips? Now the customer behind you also wants bread & chips. Coincidence? Not necessarily.

• It is possible that this supermarket is part of a national chain that uses data mining.

• The data mining tools have discovered the relationship between bread and chips and between milk and cheese packs.

• So the items must have been deliberately placed in close proximity.

• Data Mining discovers relationships of this type.

• The relationships may be between two or more different objects along with the time dimension.

• Discovery of relationships is a key result of data mining.

Patterns

• Pattern discovery is another outcome of data mining operations.

• Consider a credit card company trying to discover the pattern of usage that usually warrants an increase in the credit limit or a card upgrade.

• The company would then know which of its customers should be offered a card upgrade, and when.

• The data mining tools mine the usage patterns of thousands of card-holders and discover the pattern of usage that is likely to produce results in a marketing campaign.

Knowledge Discovery Phases

• Step 1: Define Business Objectives: Determine whether you really need a data mining solution. State your objectives. Are you looking to improve your direct marketing campaigns? Do you want to detect fraud in credit usage? etc

• Step 2: Prepare Data: consists of data selection, preprocessing of data and data transformation. Include appropriate metadata.

• Step 3: Perform Data Mining: the knowledge discovery engine applies the selected algorithm to the prepared data. The output from this step is a set of relationships or patterns.

• Step 4: Evaluate Results: In this step, you examine all the resulting patterns. Apply a filtering mechanism & select only the promising patterns to be presented & applied.

• Step 5: Present Discoveries: may be in the form of visual navigation, charts, graphs or free-form text. Presentation may also include storing interesting discoveries in a knowledge base for repeated use.

• Step 6: Incorporate Usage of Discoveries: Assemble the results in the best way so that they can be exploited to improve the business.

Data Warehousing Lecture XIX


Outline

• OLAP Versus Data Mining

• Data Mining & the Data Warehouse

• Major Data Mining Techniques

OLAP Versus Data Mining

• In an OLAP analysis session, the analyst has some prior knowledge of what he or she is looking for.

• OLAP helps the user to analyze the past & gain insights.

• In OLAP, the analyst drives the process while using OLAP tools.

• In data mining, the analyst has no prior knowledge of what the results are likely to be.

• Data Mining helps the user predict the future.

• In data mining, the analyst prepares the data and “sits back” while the tools drive the process.

OLAP Versus Data Mining: Features

• Motivation for information request
– OLAP: What is happening in the enterprise?
– Data Mining: Predict the future based on why this is happening.

• Data granularity
– OLAP: Summary data.
– Data Mining: Detailed transaction-level data.

• Number of business dimensions
– OLAP: Limited number of dimensions.
– Data Mining: Large number of dimensions.

• Number of dimension attributes
– OLAP: Small number of attributes.
– Data Mining: Many dimension attributes.

• Sizes of datasets for the dimensions
– OLAP: Not large for each dimension.
– Data Mining: Usually very large for each dimension.

• Analysis approach
– OLAP: User-driven interactive analysis.
– Data Mining: Data-driven automatic knowledge discovery.

• Analysis techniques
– OLAP: Multidimensional, drill-down, and slice & dice.
– Data Mining: Prepare data, launch mining tool & sit back.

• State of the technology
– OLAP: Mature & widely used.
– Data Mining: Still emerging.

Data Mining & the Data Warehouse

• Data Mining algorithms need large amounts of data, especially at the detailed level. Most DWs contain data at the lowest level of granularity.

• Data Mining flourishes on integrated & cleansed data. If your ETL functions were carried out properly, your DW contains such data, very suitable for data mining.

• The infrastructure of DW is already robust, with parallel processing technology & powerful relational database systems. Because such scalable hardware is already in place, no new investment is needed to support data mining.

Major Data Mining Techniques

• Data mining covers a broad range of techniques, including:

– Cluster Detection
– Decision Trees
– Memory-Based Reasoning
– Link Analysis
– Neural Networks
– Genetic Algorithms, etc.

• Various data mining techniques are applicable to each type of function.

• These techniques consist of the specific algorithms that can be used for each function.



Outline

• Cluster Detection

• A Clustering Example

• Clusters with two variables

• Forming Clusters

• Centroids and cluster boundaries

Cluster Detection

• Clustering means forming groups.

• Clustering helps you take specific & proper action for the individual pieces that make up the cluster.

• The cluster detection algorithm searches for groups or clusters of data elements that are similar to one another.

• You expect similar customers or similar products to behave in the same way. Then you can take a cluster & do something useful with it.

A Clustering Example

• Consider the example of a specialty store owner in a resort community who wants to cater to the neighborhood by stocking the right type of products.

• The store owner has data about the age group & income level of each of the people who frequently visit the store.

• Using these two variables, the store owner can put the customers into 4 clusters: wealthy retirees staying in resorts, middle-aged weekend golfers, wealthy young people with club memberships, and low-income clients who happen to stay in the community.

Clusters with two variables

Forming Clusters

• Suppose you want to market to the customers & you are prepared to run marketing campaigns for 15 different groups.

• Fifteen initial records (called “seeds”) are chosen as the first set of centroids based on the best guesses.

• One seed represents one set of values for all the dimension variables chosen for the customer record.

• In the next step, the algorithm assigns each customer record in the database to a cluster based on the seed to which it is closest.

• Closeness is based on the nearness of the values of the set of all dimension variables in a record to the values in the seed record.

• The first set of 15 clusters is now formed.

• Then the algorithm calculates the centroid or mean for each of the first set of 15 clusters.

• The next iteration then starts. Each customer record is rematched with the new set of centroids and cluster boundaries are redrawn.

• After a few iterations the final clusters emerge.
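The seed, assign, and recompute loop described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes Euclidean distance as the measure of closeness, uses 2 seeds instead of 15 to keep the example small, and the (age, income) records are hypothetical.

```python
import math

def closest(record, centroids):
    """Index of the centroid nearest to the record (Euclidean distance assumed)."""
    return min(range(len(centroids)), key=lambda i: math.dist(record, centroids[i]))

def form_clusters(records, seeds, iterations=10):
    centroids = [tuple(seed) for seed in seeds]
    for _ in range(iterations):
        # Assign every record to the cluster whose centroid is closest.
        clusters = [[] for _ in centroids]
        for record in records:
            clusters[closest(record, centroids)].append(record)
        # Recompute each centroid as the mean of the records in its cluster.
        new_centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # cluster boundaries have stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customer records: (age, income in thousands).
records = [(25, 30), (27, 32), (30, 35), (60, 90), (62, 95), (65, 100)]
seeds = [records[0], records[3]]  # "best guess" initial centroids
centroids, clusters = form_clusters(records, seeds)
print(centroids)
print(clusters)
```

In practice the variables would usually be scaled first, so that a variable with a large numeric range such as income does not dominate the distance.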

Centroids and cluster boundaries

Task

• Design 4 clusters for the students of 10cse selecting their final-year projects, and list the cluster formation rules.



Outline

• Decision Trees

• Decision Tree Modeling

• Decision Tree Example

• Task

Decision Trees

• This technique applies to classification and prediction.

• The major attraction of decision trees is their simplicity. By following the tree, you can decipher the rules and understand why a record is classified in a certain way.

• Decision trees represent rules. You can use these rules to retrieve records falling into a certain category.

• It is a rooted tree in which each internal node corresponds to a decision, with a subtree at these nodes for each possible outcome of the decision.

• Decision trees can be used to model problems in which a series of decisions leads to a solution.

• The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.

• A decision tree represents a series of questions. Each question determines what follow-up question is best to be asked next.

• Good questions produce a short series.

• Trees are drawn with the root at the top and the leaves at the bottom, an unnatural convention.

• The question at the root must be the one that best differentiates among the target classes.

• A database record enters the tree at the root node. The record works its way down until it reaches a leaf. The leaf node determines the classification of the record.

Decision Tree model

• A Decision Tree Model is a computational model consisting of three parts:

– Decision Tree
– Algorithm to create the tree
– Algorithm that applies the tree to data

• Creation of the tree is the most difficult part.

• Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).

Decision Tree Example

• Data

height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B

First split, on hair:

hair
  dark:  short, blue = B; tall, blue = B; tall, brown = B
  red:   tall, blue = A
  blond: short, blue = A; tall, brown = B; tall, blue = A; short, brown = B

• This split completely classifies dark-haired and red-haired people.

• It does not completely classify blond-haired people. More work is required.

Second split, of the blond branch on eyes:

eyes
  blue:  short = A; tall = A
  brown: tall = B; short = B

• The decision tree is complete because:

1. All 8 cases appear at nodes.
2. At each node, all cases are in the same class (A or B).

Final tree:

hair
  dark:  B
  red:   A
  blond: eyes
    blue:  A
    brown: B
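The example can be reproduced with a short sketch of the two algorithms of a decision tree model: one that creates the tree and one that applies it to a record. The lecture does not say how the question at each node is chosen, so this sketch assumes an information-gain (ID3-style) criterion. The function names are hypothetical.

```python
import math
from collections import Counter

ATTRIBUTES = ("height", "hair", "eyes")
DATA = [  # the eight records from the example; the last field is the class
    ("short", "blond", "blue", "A"), ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"), ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"), ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"), ("short", "blond", "brown", "B"),
]

def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, index):
    """Group the rows by the value of the attribute at position `index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups

def build(rows, indexes):
    classes = {row[-1] for row in rows}
    if len(classes) == 1:            # pure node: becomes a leaf
        return classes.pop()
    if not indexes:                  # no attribute left: majority class
        return Counter(row[-1] for row in rows).most_common(1)[0][0]

    def remainder(i):                # weighted entropy left after splitting on i
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, i).values())

    best = min(indexes, key=remainder)   # lowest remainder = highest gain
    rest = [i for i in indexes if i != best]
    return {ATTRIBUTES[best]: {value: build(group, rest)
                               for value, group in split(rows, best).items()}}

def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[record[attribute]]
    return tree

tree = build(DATA, [0, 1, 2])
print(tree)
print(classify(tree, {"height": "tall", "hair": "blond", "eyes": "brown"}))
```

Under this criterion the root question is hair and only the blond branch needs a second question on eyes, which matches the tree derived by hand.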

Task

Design a decision tree for a customer planning to purchase a car. Make sure you use the different deciding factors on which the customer would base the decision. The example should show a proper classification.



Outline

• Memory based reasoning (MBR)

• MBR applications

• MBR Challenges

Memory Based Reasoning

• Would you rather go to an experienced doctor or to a novice? Of course, the answer is obvious.

• Why? Because the experienced doctor treats you and cures you based on his or her experience. The doctor knows what worked in the past in several cases when the symptoms were similar to yours.

• We are all good at making decisions on the basis of our experiences.

• We depend on the similarities of the current situation to what we know from past experience.

• The same principles apply to the memory-based reasoning (MBR) algorithm.

• Our ability to reason from experience depends on our ability to recognize appropriate examples from the past, such as:

– Traffic patterns/routes
– Movies
– Food

• We identify similar example(s) and apply what we know/learned to the current situation.

• These similar examples in MBR are referred to as neighbors.

• MBR uses known instances of a model to predict unknown instances.

• This data mining technique maintains a dataset of known records.

• When a new record arrives for evaluation, the algorithm finds neighbors similar to the new record, then uses the characteristics of the neighbors for prediction and classification.

• When a new record arrives at the data mining tool, first the tool calculates the “distance” between this record and the records in the training dataset.

• The results determine which data records in the training dataset qualify to be considered as neighbors to the incoming data record.

• Next, the algorithm uses a combination function to combine the results of the various distance functions to obtain the final answer.

• The distance function and the combination function are key components of the memory-based reasoning technique.
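The roles of the distance function and the combination function can be sketched as below. This is a minimal illustration under assumptions: Euclidean distance and a majority vote are common choices but are not prescribed by the lecture, and the training records (transaction amount, hour of day) and their labels are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function: Euclidean distance between two numeric records (assumed)."""
    return math.dist(a, b)

def combine(neighbors):
    """Combination function: majority vote over the neighbors' known classes (assumed)."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def mbr_classify(training, new_record, k=3):
    # Rank the known records by distance to the new record and keep the k nearest.
    neighbors = sorted(training, key=lambda item: distance(item[0], new_record))[:k]
    return combine(neighbors)

# Hypothetical training dataset of known records: (amount, hour of day) -> class.
training = [
    ((20, 10), "ok"), ((25, 11), "ok"), ((30, 12), "ok"),
    ((900, 3), "fraud"), ((950, 2), "fraud"), ((880, 4), "fraud"),
]
first = mbr_classify(training, (910, 3))
second = mbr_classify(training, (22, 9))
print(first, second)
```

The number of neighbors k is the third design choice; it is fixed at 3 here only for illustration.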

MBR Challenges

• Choosing appropriate historical data for use in training

• Choosing the most efficient way to represent the training data

• Choosing the distance function, combination function, and the number of neighbors

MBR Applications

• Fraud detection

• Customer response prediction

• Medical treatments

• Classifying responses

– MBR can process free-text responses and assign codes



Outline

• Link Analysis

– Associations Discovery
– Sequential Pattern Discovery
– Similar Time Sequence Discovery

Link Analysis

• The link analysis technique mines relationships and discovers knowledge.

• For example, if you look at the supermarket sale transactions for one day, why are skim milk and brown bread found in the same transaction about 80% of the time?

• Is there a strong relationship between the two products in the supermarket basket? If so, can these two products be promoted together?

• Are there more such combinations? How can we find such links or affinities?

• Link analysis techniques have 3 types of applications:

1. Associations discovery
2. Sequential pattern discovery
3. Similar time sequence discovery

Associations Discovery

• Associations are affinities between items.

• Association discovery algorithms find combinations where the presence of one item suggests the presence of another.

• When you apply these algorithms to the shopping transactions at a supermarket, they will uncover affinities among products that are likely to be purchased together.

• Association rules represent such affinities.

Associations Discovery

• The figure shows an association rule and the annotated parts of the rule.

• The two parts, the support factor and the confidence factor, indicate the strength of the association.

• Rules with high support and confidence factor values are more valid, relevant, and useful.
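The support and confidence factors can be computed as in the sketch below. It assumes the usual definitions: support is the fraction of all transactions that contain every item in the rule, and confidence is the fraction of transactions containing the first item that also contain the second. The five baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical supermarket baskets.
transactions = [
    {"skim milk", "brown bread"},
    {"skim milk", "brown bread", "eggs"},
    {"skim milk"},
    {"brown bread", "chips"},
    {"skim milk", "brown bread", "chips"},
]

rule_support = support(transactions, {"skim milk", "brown bread"})
rule_confidence = confidence(transactions, {"skim milk"}, {"brown bread"})
print(rule_support, rule_confidence)
```

Here three of the five baskets contain both items, and three of the four baskets with skim milk also contain brown bread.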

Sequential Pattern Discovery

• These algorithms discover patterns where one set of items follows another specific set.

• Time plays a role in these patterns.

• Suppose you want the algorithm to discover the buying sequence of products.

• The sale transactions form the dataset for the data mining operation.

• The data elements in the sale transaction may consist of the date and time of the transaction, the products bought during the transaction, and the identification of the customer who bought the items.

• A sample set of these transactions and the results of applying the algorithm are shown in the figure.
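Since the figure is not reproduced here, the idea can be sketched with a small hypothetical dataset. The sketch counts, for every ordered pair of products, how many customers bought the first product in an earlier transaction than the second. This simple pair counting is an assumption chosen for illustration, not the specific algorithm the lecture refers to, and the customers, dates, and products are invented.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sale transactions: (customer id, ISO date, products bought).
sales = [
    ("c1", "2024-01-05", {"printer"}),
    ("c1", "2024-01-20", {"ink", "paper"}),
    ("c2", "2024-02-01", {"printer", "paper"}),
    ("c2", "2024-02-15", {"ink"}),
    ("c3", "2024-03-03", {"ink"}),
    ("c3", "2024-03-10", {"printer"}),
]

def sequential_patterns(sales, min_customers=2):
    # Order each customer's baskets by date (ISO dates sort chronologically).
    by_customer = defaultdict(list)
    for customer, _date, items in sorted(sales, key=lambda s: (s[0], s[1])):
        by_customer[customer].append(items)
    # (earlier item, later item) -> set of customers who show that sequence.
    followers = defaultdict(set)
    for customer, baskets in by_customer.items():
        for i, earlier in enumerate(baskets):
            for later in baskets[i + 1:]:
                for pair in product(earlier, later):
                    followers[pair].add(customer)
    return {pair: len(customers) for pair, customers in followers.items()
            if len(customers) >= min_customers}

patterns = sequential_patterns(sales)
print(patterns)
```

Only the sequence "printer, then ink" is shared by at least two customers in this data, so it is the only pattern reported.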

Similar Time Sequence Discovery

• This technique finds a sequence of events and then comes up with other, similar sequences of events.

• For example, in retail department stores, this data mining technique comes up with a second department that has a sales stream similar to the first.

• Finding similar sequential price movements of stock is another application of this technique.
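One way to sketch the department-store example is to compare sales streams with a similarity measure. The lecture does not name a measure, so the Pearson correlation coefficient is used here as an assumption. The department names and weekly figures are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def most_similar(target, streams):
    """Name of the stream whose movements best match the target stream."""
    others = [name for name in streams if name != target]
    return max(others, key=lambda name: pearson(streams[target], streams[name]))

# Hypothetical weekly sales per department.
weekly_sales = {
    "toys":      [10, 12, 15, 30, 45, 20],
    "games":     [20, 25, 29, 61, 88, 41],
    "gardening": [40, 38, 30, 12, 8, 25],
}
match = most_similar("toys", weekly_sales)
print(match)
```

Correlation ignores the absolute sales level and compares only the shape of the two streams, which is why a department with roughly double the sales can still be the closest match.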



Outline

• Artificial Intelligence for Data Mining

• Neural Networks

• Neural Network Characteristics

• Anatomy of a Neural Network

• Neural Network Model

• How a Neural Network Works?

• Advantages and Disadvantages

Artificial Intelligence for Data Mining

• Neural networks are useful for data mining and decision-support applications.

• People are good at generalizing from experience.

• Computers excel at following explicit instructions over and over.

• Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains.

Neural Networks

• Neural networks mimic the human brain by learning from a training dataset and applying the learning to generalize patterns for classification and prediction.

• These algorithms are effective when the data is shapeless and lacks any apparent pattern.

• The basic unit of an artificial neural network is modeled by looking at the neurons in the brain.

• This unit is known as a node and is one of the two main structures of the neural network model.

• The other structure is the link that corresponds to the connection between neurons in the brain.

Neural Network Characteristics

• Neural networks are useful for pattern recognition or data classification, through a learning process.

• Neural networks simulate biological systems, where learning involves adjustments to the synaptic connections between neurons

Anatomy of a Neural Network

• Neural Networks map a set of input-nodes to a set of output-nodes

• Number of inputs/outputs is variable

• The Network itself is composed of an arbitrary number of nodes with an arbitrary topology

[Figure: a neural network mapping Input 0, Input 1, ..., Input n to Output 0, Output 1, ..., Output m.]

Neural Network Model

How a Neural Network Works?

• The neural network receives values of the variables or predictors at the input nodes.

• If there are 15 different predictors, then there are 15 input nodes.

• Weights may be applied to the predictors to condition them properly.

• There may be several inner layers operating on the predictors and they move from node to node until the discovered result is presented at the output node.

• The inner layers are also known as hidden layers because as the input dataset is running through many iterations, the inner layers rehash the predictors over and over again.
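The flow of predictors from the input nodes through a hidden layer to the output node can be sketched as a forward pass. This is a minimal illustration under assumptions: the sigmoid activation, the 3-2-1 topology, and the hand-picked weights are all hypothetical. In a real network the weights would be learned from the training dataset, which this sketch does not show.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each node sums its weighted inputs, adds a bias, applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]

def forward(predictors, layers):
    """Pass the predictor values through every layer in turn."""
    values = predictors
    for weights, biases in layers:
        values = layer(values, weights, biases)
    return values

# Hypothetical network: 3 input nodes -> 2 hidden nodes -> 1 output node.
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                              # output layer
]
output = forward([1.0, 0.5, 2.0], network)
print(output)
```

With three predictors there are three input nodes, and the single value in the result is what the output node presents.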

Advantages and Disadvantages

• Advantages

– Adapt to unknown situations
– Robustness: fault tolerance due to network redundancy
– Autonomous learning and generalization

• Disadvantages

– Not exact
– Large complexity of the network structure



Link Analysis Example



The Wellmeadows Hospital Case Study

Introduction

• The Wellmeadows Case Study describes a small hospital located in Edinburgh.

• The Wellmeadows Hospital, which specializes in health care for the elderly, requires a database comprised of data recorded, maintained, and accessed by the hospital.

• The objective is to create this database with as much functionality and as little redundancy as possible.

• Successful projects begin with requirements gathering.

• However, in this case, Wellmeadows performed its own requirements gathering procedures.

• These requirements are summarized on the basis of the different entities.

• Merely reading the material that should be contained in a database is not enough; every sentence has to be analyzed and noted before continuing.

Identify Entity Types

• The Wellmeadows Case Study identifies fifteen distinct entity types.

Wards

• There are 17 wards containing a total of 240 beds. Each ward has a unique ward number and name (for example, Orthopaedic), a location (for example, E Block), and a telephone extension.

Staff

• The Staff entity is by far the most complex, consisting of multiple personnel of different rank (for example, senior and junior doctors, consultants, physiotherapists).

• The three main positions are: the Medical Director, the Personnel Officer and the Charge Nurse.

• The Medical Director has overall responsibility of management for the hospital.

• The Personnel Officer is responsible for ensuring that the appropriate number and type of staff are associated with the correct ward or out-patient clinic.

• The Charge Nurse is responsible for overseeing the day-to-day operation of the ward/clinic. This includes allocating a budget and tracking resources such as beds and supplies.

Staff Form

Wellmeadows Hospital Staff Form

Staff Number: S011

• Personal Details: First Name, Last Name, Address, Tel. No., DOB, Sex, NIN

• Position, Current Salary, Salary Scale

• Allocated to Ward, Hours/Week, Permanent or Temporary, Paid Weekly or Monthly

• Qualification(s): Type, Date, Institution

• Work Experience: Position, Start Date, Finish Date, Organization

Note: Please enter additional qualifications/work experience overleaf.

Staff Qualifications & Work Experience

• Each member of staff can have more than one qualification and more than one period of work experience.

• There exists a one-to-many relationship between staff and their qualifications & work experience.

• Therefore we require two separate tables, StaffQualification and StaffWorkexperience, for staff qualifications and staff work experience respectively.
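The one-to-many design can be sketched with Python's built-in sqlite3 module. The table names StaffQualification and StaffWorkexperience come from the slide, while the Staff table name, the column names, the key choices, and the sample values are assumptions loosely based on the fields of the staff form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Column names and keys are assumptions derived from the staff form fields.
conn.executescript("""
CREATE TABLE Staff (
    staffNo   TEXT PRIMARY KEY,
    firstName TEXT,
    lastName  TEXT
);
CREATE TABLE StaffQualification (
    staffNo     TEXT NOT NULL REFERENCES Staff(staffNo),
    type        TEXT NOT NULL,
    date        TEXT,
    institution TEXT,
    PRIMARY KEY (staffNo, type)
);
CREATE TABLE StaffWorkexperience (
    staffNo      TEXT NOT NULL REFERENCES Staff(staffNo),
    position     TEXT NOT NULL,
    startDate    TEXT NOT NULL,
    finishDate   TEXT,
    organization TEXT,
    PRIMARY KEY (staffNo, position, startDate)
);
""")

# Invented sample values: one member of staff with two qualifications.
conn.execute("INSERT INTO Staff VALUES ('S011', 'Jane', 'Doe')")
conn.executemany(
    "INSERT INTO StaffQualification VALUES (?, ?, ?, ?)",
    [("S011", "BSc", "1990-06-01", "Example University"),
     ("S011", "MSc", "1992-06-01", "Example University")],
)
count = conn.execute(
    "SELECT COUNT(*) FROM StaffQualification WHERE staffNo = 'S011'"
).fetchone()[0]
print(count)
```

Each row in the two child tables points back to exactly one member of staff through staffNo, which is how the one-to-many relationship is represented.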

Patients

• Each patient has a unique patient number and a record of their personal information.

Patient Registration Form

Wellmeadows Hospital Patient Registration Form

Patient Number: P01234

• Personal Details: First Name, Last Name, Address, Sex, DOB, Marital Status, Tel. No.

• Next-of-Kin Details: Full Name, Relationship, Address, Tel. No.

• Local Doctor Details: Full Name, Clinic No., Address, Tel. No.

• Date Registered

Patient Appointments

• Each referred patient is given an appointment, which is recorded and has a unique appointment number.

• The details of each patient’s appointment are recorded and include the name and staff number of the consultant undertaking the examination, the date and time of the appointment, and the examination room.

• As a result of the examination, the patient is either recommended to attend the out-patient clinic or is placed on a waiting list until a bed can be found in an appropriate ward.

Out-Patients

• Each out-patient has a unique patient number and a record of their personal information.

In-Patients

• Each in-patient has a unique patient number and a record of their personal information.

In-Patient Form

Wellmeadows Hospital Patient Allocation

Page _______ Week Beginning __________

• Personal Details: Ward Number, Ward Name, Location, Charge Nurse, Staff Number, Tel Extn.

• Columns for each patient: Patient No., Name, On Waiting List, Expected Stay (Days), Date Placed, Date Leave, Actual Leave, Bed No.

Patient Medication

• Whenever a patient is prescribed medication, the details are recorded.

Patient Medication Form

Surgical & Non-Surgical Supplies

• The Wellmeadows Hospital maintains a central stock of surgical (e.g. syringes, sterile dressings) and non-surgical (e.g. plastic bags, aprons) supplies.

Pharmaceutical Supplies

• For each pharmaceutical supply (e.g. antibiotics, painkillers) there is a detailed recording.

Ward Requisitions

• Forms used to order supplies held by the hospital.

Requisition Form

Wellmeadows Hospital Central Store Requisition Form

Requisition Number: ___________

• Ward Number, Ward Name

• Requisitioned By, Requisition Date

• Received By: __________________ Date Received: __________________

Suppliers

• Each supplier of surgical/non-surgical and pharmaceutical supplies has a unique number and details of the transaction.

Wellmeadows Hospital OLTP Model

Possible Reports from OLTP System

• Search for staff who have particular qualifications or previous work experience.

• Produce a report listing the details of staff allocated to each ward.

• Produce a report listing the details of patients referred to a particular ward.

• Produce a report listing the details of patients currently located in a particular ward.

• Produce a report listing the details of patients currently on the waiting list for a particular ward.

• Produce a report listing the details of medication given to a particular patient.

• Produce a report listing the details of supplies provided to a specific ward.
