Data Warehousing Lecture XVI Dr. Javed Ali Baloch
Jan 02, 2016
Outline
• Hybrid OLAP (HOLAP) or Desktop OLAP (DOLAP)
• The HOLAP Architecture
• HOLAP Development Issues
• Data Design & Preparation
Hybrid OLAP (HOLAP) or Desktop OLAP (DOLAP)
• HOLAP is meant to provide portability to users of OLAP.
• HOLAP provides limited analysis capability, either directly against RDBMS products or by using an intermediate MOLAP server.
• HOLAP tools deliver selected data directly from the DBMS or via a MOLAP server to the desktop (or local server) in the form of a data cube, where it is stored, analyzed and maintained locally.
HOLAP Development Issues
• The architecture results in significant data redundancy and may cause problems for networks that support many users.
• Ability of each user to build a custom data-cube may cause a lack of data consistency among users.
Data Design & Preparation
• The DW feeds data to the OLAP system.
• In the MOLAP model, multidimensional databases store the data fed from the DW in the form of multidimensional cubes.
• In the ROLAP model, data is pushed into the OLAP system with cubes created dynamically on the fly.
• Thus, the sequence of the flow of data is from the operational source systems to the DW & from there to the OLAP systems.
• Why not build the OLAP system on top of the operational source systems?
– An OLAP system needs transformed & integrated data.
– An OLAP system needs extensive historical data.
– An OLAP system requires data in multidimensional representation.
– Different departments require data from different operational systems.
• The techniques for preparing OLAP data for a particular department, e.g. marketing, are:
– Define Subset: Select the subset of detailed data that marketing is interested in.
– Summarize: Aggregate the data in the way the marketing department needs.
– De-normalize: Combine relational tables in exactly the way the marketing department needs.
– Calculate & Derive
– Index: Choose the attributes that are appropriate for marketing to build indexes on.
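The preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names (`product_id`, `region`, `units`, `unit_price`) and the helper `prepare_marketing_data` are hypothetical, and the steps are reordered slightly so that each one has the data it needs.

```python
from collections import defaultdict

# Hypothetical detailed sales rows and a product lookup table from the DW.
sales = [
    {"product_id": 1, "region": "North", "units": 10, "unit_price": 2.5},
    {"product_id": 1, "region": "South", "units": 4, "unit_price": 2.5},
    {"product_id": 2, "region": "North", "units": 6, "unit_price": 4.0},
    {"product_id": 3, "region": "North", "units": 1, "unit_price": 9.0},
]
products = {
    1: {"name": "Bread", "category": "Bakery"},
    2: {"name": "Cheese", "category": "Dairy"},
    3: {"name": "Wrench", "category": "Hardware"},
}


def prepare_marketing_data(sales, products, categories):
    # De-normalize: attach the product attributes to each sales row.
    joined = [{**row, **products[row["product_id"]]} for row in sales]
    # Define subset: keep only the categories marketing is interested in.
    subset = [row for row in joined if row["category"] in categories]
    # Calculate & derive: revenue is not stored, so compute it.
    for row in subset:
        row["revenue"] = row["units"] * row["unit_price"]
    # Summarize: aggregate revenue by (category, region).
    summary = defaultdict(float)
    for row in subset:
        summary[(row["category"], row["region"])] += row["revenue"]
    # Index: a dictionary keyed on category gives fast lookup by that attribute.
    index = defaultdict(list)
    for row in subset:
        index[row["category"]].append(row)
    return dict(summary), dict(index)


summary, index = prepare_marketing_data(sales, products, {"Bakery", "Dairy"})
```

Here `summary` holds one revenue total per category and region, and `index` plays the role of an index on the category attribute. A real OLAP feed would do the same work with SQL or an ETL tool.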
Data Warehousing Lecture XVII
Outline
• Data Mining
• Decision support progress to Data Mining
• Data Mining Defined
• The Knowledge Discovery Process
DATA MINING
• Data Mining is used in just about every area of business, from sales and marketing to new product development to inventory management and human resources.
• In today’s world, an organization generates more information in a week than most people can read in a lifetime. It is humanly impossible to decipher and interpret all that data to find useful information.
• Data Mining enables companies to find answers and discover patterns in their customer data.
Decision support progress to Data Mining
[Figure: the progression of decision support from early systems to data mining]
• Early file-based systems: basic accounting data; no decision support.
• Database systems: operational systems data; primitive decision support.
• Data warehouse: data for decision support; true decision support.
• OLAP systems: data for multidimensional analysis; complex analysis & calculations.
• Data mining applications: selected and extracted data; knowledge discovery.
Data Mining Defined
• Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.
• Data Mining centers around the automated discovery of new facts and relationships in data.
• With traditional query tools, you search for known information. Data mining tools enable you to uncover hidden information.
• The assumption is that more useful knowledge lies hidden beneath the surface.
The Knowledge Discovery Process
• Data Mining discovers knowledge or information that you never knew was present in your data.
• The uncovered hidden knowledge manifests itself as relationships or patterns.
Data Warehousing Lecture XVIII
Outline
• Relationships
• Patterns
• Knowledge Discovery Phases
Relationships
• Suppose on the way home you visit the nearby supermarket to pick up bread, milk, and a few other “things”. What other things? You are not sure.
• While you fetch the milk container, you happen to see a pack of assorted cheeses close by. Yes, you want that.
• You pause and watch the next five customers also reach for the cheese pack. Coincidence?
• Now on to the bread shelf. As you get your bread, a bag of potato chips catches your eye. Why not get that bag of potato chips? Now the customer behind you also wants bread & chips. Coincidence? Not necessarily.
Relationships
• It is possible that this supermarket is part of a national chain that uses data mining.
• The data mining tools have discovered the relationship between bread and chips and between milk and cheese packs.
• So the items must have been deliberately placed in close proximity.
• Data Mining discovers relationships of this type.
• The relationships may be between two or more different objects along with the time dimension.
• Discovery of relationships is a key result of data mining.
Patterns
• Pattern discovery is another outcome of data mining operations.
• Consider a credit card company trying to discover the pattern of usage that usually warrants an increase in the credit limit or a card upgrade.
• They would then know which of their customers should be offered a card upgrade, and when.
• The data mining tools mine the usage patterns of thousands of card-holders and discover the potential pattern of usage that will produce results in a marketing campaign.
Knowledge Discovery Phases
• Step 1: Define Business Objectives: Determine whether you really need a data mining solution. State your objectives. Are you looking to improve your direct marketing campaigns? Do you want to detect fraud in credit usage? etc
• Step 2: Prepare Data: consists of data selection, preprocessing of data and data transformation. Include appropriate metadata.
• Step 3: Perform Data Mining: the knowledge discovery engine applies the selected algorithm to the prepared data. The output from this step is a set of relationships or patterns.
• Step 4: Evaluate Results: In this step, you examine all the resulting patterns. Apply filtering mechanism & select only the promising patterns to be presented & applied.
• Step 5: Present Discoveries: may be in the form of visual navigation, charts, graphs or free-form text. Presentation may also include storing interesting discoveries in a knowledge base for repeated use.
• Step 6: Incorporate Usage of Discoveries: Assemble the results in the best way so that they can be exploited to improve the business.
Data Warehousing Lecture 19
Outline
• OLAP Versus Data Mining
• Data Mining & the Data Warehouse
• Major Data Mining Techniques
OLAP Versus Data Mining
• In an OLAP analysis session, the analyst has some prior knowledge of what he or she is looking for.
• OLAP helps the user to analyze the past & gain insights.
• In OLAP, the analyst drives the process while using OLAP tools.
• In data mining, the analyst has no prior knowledge of what the results are likely to be.
• Data Mining helps the user predict the future.
• In data mining, the analyst prepares the data and “sits back” while the tools drive the process.
OLAP Versus Data Mining: Features
• Motivation for information request
– OLAP: What is happening in the enterprise?
– Data mining: Predict the future based on why this is happening.
• Data granularity
– OLAP: Summary data.
– Data mining: Detailed transaction-level data.
• Number of business dimensions
– OLAP: Limited number of dimensions.
– Data mining: Large number of dimensions.
• Number of dimension attributes
– OLAP: Small number of attributes.
– Data mining: Many dimension attributes.
• Sizes of datasets for the dimensions
– OLAP: Not large for each dimension.
– Data mining: Usually very large for each dimension.
• Analysis approach
– OLAP: User-driven interactive analysis.
– Data mining: Data-driven automatic knowledge discovery.
• Analysis techniques
– OLAP: Multidimensional, drill-down, and slice & dice.
– Data mining: Prepare data, launch mining tool & sit back.
• State of the technology
– OLAP: Mature & widely used.
– Data mining: Still emerging.
Data Mining & the Data Warehouse
• Data Mining algorithms need large amounts of data, more so at the detailed level. Most DWs contain data at the lowest level of granularity.
• Data Mining flourishes on integrated & cleansed data. If your ETL functions were carried out properly, your DW contains such data, very suitable for data mining.
• The infrastructure of DW is already robust, with parallel processing technology & powerful relational database systems. Because such scalable hardware is already in place, no new investment is needed to support data mining.
Major Data Mining Techniques
• Data mining covers a broad range of techniques, including:
– Cluster Detection
– Decision Trees
– Memory-Based Reasoning
– Link Analysis
– Neural Networks
– Genetic Algorithms, etc.
• Various data mining techniques are applicable to each type of function.
• These techniques consist of the specific algorithms that can be used for each function.
Data Warehousing Lecture 20
Outline
• Cluster Detection
• A Clustering Example
• Clusters with two variables
• Forming Clusters
• Centroids and cluster boundaries
Cluster Detection
• Clustering means forming groups.
• Clustering helps you take specific & proper action for the individual pieces that make up the cluster.
• The cluster detection algorithm searches for groups or clusters of data elements that are similar to one another.
• You expect similar customers or similar products to behave in the same way. Then you can take a cluster & do something useful with it.
A Clustering Example
• Consider the example of a specialty store owner in a resort community who wants to cater to the neighborhood by stocking the right type of products.
• The store owner has data about the age group & income level of each of the people who frequently visit the store.
• Using these two variables, the store owner can put the customers into 4 clusters: wealthy retirees staying in resorts, middle-aged weekend golfers, wealthy young people with club memberships, and low-income clients who happen to stay in the community.
Clusters with two variables
Forming Clusters
• Suppose you want to market to the customers & you are prepared to run marketing campaigns for 15 different groups.
• Fifteen initial records (called “seeds”) are chosen as the first set of centroids based on the best guesses.
• One seed represents one set of values for all the dimension variables chosen for the customer record.
• In the next step, the algorithm assigns each customer record in the database to a cluster based on the seed to which it is closest.
Forming Clusters
• Closeness is based on the nearness of the values of the set of all dimension variables in a record to the values in the seed record.
• The first set of 15 clusters is now formed.
• Then the algorithm calculates the centroid or mean for each of the first set of 15 clusters.
• The next iteration then starts. Each customer record is rematched with the new set of centroids and cluster boundaries are redrawn.
• After a few iterations the final clusters emerge.
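The cluster-forming procedure described above (choose seeds, assign each record to the closest seed, recompute the centroids, repeat) is commonly known as k-means. A minimal sketch follows. It assumes Euclidean distance as the measure of closeness, uses two seeds instead of 15 to keep the example small, and invents the customer records. The function names `closest` and `form_clusters` are illustrative only.

```python
import math


def closest(record, centroids):
    # Index of the centroid nearest to the record (Euclidean distance).
    return min(range(len(centroids)), key=lambda i: math.dist(record, centroids[i]))


def form_clusters(records, seeds, iterations=10):
    centroids = [tuple(s) for s in seeds]
    assignment = []
    for _ in range(iterations):
        # Assign every record to the cluster whose centroid is closest.
        assignment = [closest(r, centroids) for r in records]
        # Recompute each centroid as the mean of its assigned records.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [r for r, a in zip(records, assignment) if a == i]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # the cluster boundaries no longer move
        centroids = new_centroids
    return centroids, assignment


# Hypothetical (age, income in thousands) records and two seed records.
customers = [(25, 30), (27, 32), (30, 28), (62, 90), (65, 95), (70, 88)]
centroids, assignment = form_clusters(customers, seeds=[(25, 30), (70, 88)])
```

In practice the variables would usually be scaled first, so that one variable with a large range (such as income) does not dominate the distance.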
Centroids and cluster boundaries
Task
• Design 4 clusters for the students of 10cse selecting their final-year projects, and list the cluster formation rules.
Data Warehousing Lecture 21
Outline
• Decision Trees
• Decision Tree Modeling
• Decision Tree Example
• Task
Decision Trees
• This technique applies to classification and prediction.
• The major attraction of decision trees is their simplicity. By following the tree, you can decipher the rules and understand why a record is classified in a certain way.
• Decision trees represent rules. You can use these rules to retrieve records falling into a certain category.
Decision Trees
• It is a rooted tree in which each internal node corresponds to a decision, with a subtree at these nodes for each possible outcome of the decision.
• Decision trees can be used to model problems in which a series of decisions leads to a solution.
• The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.
Decision Trees
• A decision tree represents a series of questions. Each question determines what follow-up question is best to be asked next.
• Good questions produce a short series.
• Trees are drawn with the root at the top and the leaves at the bottom, an unnatural convention.
• The question at the root must be the one that best differentiates among the target classes.
• A database record enters the tree at the root node. The record works its way down until it reaches a leaf. The leaf node determines the classification of the record.
Decision Tree model
• A Decision Tree Model is a computational model consisting of three parts:
– Decision Tree
– Algorithm to create the tree
– Algorithm that applies the tree to data
• Creation of the tree is the most difficult part.
• Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).
Decision Tree Example
• Data
height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B
First split, on hair:
• dark: short, blue = B; tall, blue = B; tall, brown = B
• red: tall, blue = A
• blond: short, blue = A; tall, brown = B; tall, blue = A; short, brown = B
• This split completely classifies dark-haired and red-haired people.
• It does not completely classify blond-haired people. More work is required.
Second split, on eyes, for blond-haired people:
• hair = dark: short, blue = B; tall, blue = B; tall, brown = B
• hair = red: tall, blue = A
• hair = blond, eyes = blue: short = A; tall = A
• hair = blond, eyes = brown: tall = B; short = B
• The decision tree is complete because:
1. All 8 cases appear at nodes.
2. At each node, all cases are in the same class (A or B).
Final decision tree:
• hair = dark: B
• hair = red: A
• hair = blond: test eyes
– eyes = blue: A
– eyes = brown: B
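The finished tree from this example, and the step that applies it to data, can be sketched as follows. The nested-dictionary representation and the function name `classify` are choices made for this illustration only. The eight records are the ones from the data table.

```python
# The eight records from the example: (height, hair, eyes, class).
records = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]

# An internal node names the attribute it tests and maps each outcome to
# a subtree or to a leaf; a leaf is simply the class label.
tree = {
    "attribute": "hair",
    "branches": {
        "dark": "B",
        "red": "A",
        "blond": {"attribute": "eyes", "branches": {"blue": "A", "brown": "B"}},
    },
}


def classify(node, record):
    # The record enters at the root and works its way down to a leaf.
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


predictions = [
    classify(tree, {"height": h, "hair": hair, "eyes": e}) for h, hair, e, _ in records
]
```

Height never appears in the tree: hair and eyes alone are enough to separate classes A and B for these eight records.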
Task
Design a decision tree for a customer planning to purchase a car. Make sure you use the different deciding factors on which the customer would base the decision; the example should show a proper classification.
Data Warehousing Lecture 22
Outline
• Memory based reasoning (MBR)
• MBR applications
• MBR Challenges
Memory Based Reasoning
• Would you rather go to an experienced doctor or to a novice? Of course, the answer is obvious.
• Why? Because the experienced doctor treats you and cures you based on his or her experience. The doctor knows what worked in the past in several cases when the symptoms were similar to yours.
• We are all good at making decisions on the basis of our experiences.
• We depend on the similarities of the current situation to what we know from past experience.
• The same principles apply to the memory-based reasoning (MBR) algorithm.
• Our ability to reason from experience depends on our ability to recognize appropriate examples from the past:
– Traffic patterns/routes
– Movies
– Food
• We identify similar example(s) and apply what we know/learned to the current situation.
• These similar examples in MBR are referred to as neighbors.
• MBR uses known instances of a model to predict unknown instances.
• This data mining technique maintains a dataset of known records.
• When a new record arrives for evaluation, the algorithm finds neighbors similar to the new record, then uses the characteristics of the neighbors for prediction and classification.
• When a new record arrives at the data mining tool, first the tool calculates the “distance” between this record and the records in the training dataset.
• The results determine which data records in the training dataset qualify to be considered as neighbors to the incoming data record.
• Next, the algorithm uses a combination function to combine the results of the various distance functions to obtain the final answer.
• The distance function and the combination function are key components of the memory-based reasoning technique.
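The two key components named above can be sketched as a small nearest-neighbor classifier. This is an illustration under assumptions, not the only way MBR is implemented: the distance function is taken to be Euclidean distance, the combination function a majority vote among the k nearest neighbors, and the training records are invented.

```python
import math
from collections import Counter


def distance(a, b):
    # Distance function: Euclidean distance over the numeric fields.
    return math.dist(a, b)


def combine(neighbor_labels):
    # Combination function: majority vote among the neighbors.
    return Counter(neighbor_labels).most_common(1)[0][0]


def mbr_classify(training, new_record, k=3):
    # training is a list of (features, label) pairs with known outcomes.
    ranked = sorted(training, key=lambda item: distance(item[0], new_record))
    neighbors = ranked[:k]
    return combine([label for _, label in neighbors])


# Hypothetical training dataset: (age, income in thousands) -> response.
training = [
    ((25, 30), "no"),
    ((28, 35), "no"),
    ((30, 33), "no"),
    ((55, 80), "yes"),
    ((60, 85), "yes"),
    ((58, 90), "yes"),
]
prediction = mbr_classify(training, (57, 82), k=3)
```

The number of neighbors `k` and both functions are exactly the choices the technique leaves open, so they are kept as separate, replaceable pieces.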
MBR Challenges
• Choosing appropriate historical data for use in training
• Choosing the most efficient way to represent the training data
• Choosing the distance function, combination function, and the number of neighbors
MBR Applications
• Fraud detection
• Customer response prediction
• Medical treatments
• Classifying responses – MBR can process free-text responses and assign codes
Data Warehousing Lecture 23
Outline
• Link Analysis
– Associations Discovery
– Sequential Pattern Discovery
– Similar Time Sequence Discovery
Link Analysis
• The link analysis technique mines relationships and discovers knowledge.
• For example, if you look at the supermarket sale transactions for one day, why are skim milk and brown bread found in the same transaction about 80% of the time?
• Is there a strong relationship between the two products in the supermarket basket? If so, can these two products be promoted together?
• Are there more such combinations? How can we find such links or affinities?
• Link analysis techniques have 3 types of applications:
1. Associations discovery
2. Sequential pattern discovery
3. Similar time sequence discovery
Associations Discovery
• Associations are affinities between items.
• Association discovery algorithms find combinations where the presence of one item suggests the presence of another.
• When you apply these algorithms to the shopping transactions at a supermarket, they will uncover affinities among products that are likely to be purchased together.
• Association rules represent such affinities.
Associations Discovery
• The figure represents an association rule and the annotated parts of the rule.
• Two parts, the support factor and the confidence factor, indicate the strength of the association.
• Rules with high support and confidence factor values are more valid, relevant, and useful.
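The support and confidence factors can be computed directly from a set of transactions. The sketch below uses the usual definitions: support is the fraction of all transactions that contain every item in the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The baskets and function names are invented for illustration.

```python
def count(transactions, items):
    # Number of transactions that contain every item in `items`.
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    # Fraction of all transactions that contain the whole item set.
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)


# Hypothetical market baskets.
transactions = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"milk", "cheese"},
    {"bread", "chips"},
    {"milk", "bread", "chips"},
]
rule_support = support(transactions, {"milk", "bread"})
rule_confidence = confidence(transactions, {"milk"}, {"bread"})
```

For the rule "milk implies bread", three of the five baskets contain both items and four contain milk, which gives a support of 3/5 and a confidence of 3/4.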
Sequential Pattern Discovery
• These algorithms discover patterns where one set of items follows another specific set.
• Time plays a role in these patterns.
• Suppose you want the algorithm to discover the buying sequence of products.
• The sale transactions form the dataset for the data mining operation.
• The data elements in the sale transaction may consist of the date and time of the transaction, the products bought during the transaction, and the identification of the customer who bought the items.
• A sample set of these transactions and the results of applying the algorithm are shown in the figure.
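A simplified sketch of the idea follows. It is not a full sequential-pattern algorithm: it only counts, for each ordered pair of products, how many customers bought the first product in an earlier transaction and the second in a later one. The transactions, the threshold `min_customers` and the function name are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def sequential_patterns(transactions, min_customers=2):
    # Group each customer's transactions so they can be ordered by date.
    history = defaultdict(list)
    for customer, date, items in transactions:
        history[customer].append((date, set(items)))
    counts = defaultdict(int)
    for visits in history.values():
        visits.sort(key=lambda v: v[0])
        seen = set()
        # combinations() on the sorted list yields (earlier, later) pairs.
        for (_, first), (_, later) in combinations(visits, 2):
            for a in first:
                for b in later:
                    if a != b:
                        seen.add((a, b))
        for pair in seen:
            counts[pair] += 1  # each customer counts once per pair
    return {pair: n for pair, n in counts.items() if n >= min_customers}


# Hypothetical sale transactions: (customer id, ISO date, products bought).
transactions = [
    ("c1", "2016-01-02", ["desktop PC"]),
    ("c1", "2016-02-10", ["printer"]),
    ("c2", "2016-01-05", ["desktop PC", "mouse"]),
    ("c2", "2016-03-01", ["printer"]),
    ("c3", "2016-01-09", ["printer"]),
    ("c3", "2016-02-20", ["desktop PC"]),
]
patterns = sequential_patterns(transactions)
```

Only the sequence "desktop PC, then printer" is shared by at least two customers here, so it is the only pattern kept.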
Similar Time Sequence Discovery
• This technique finds a sequence of events and then comes up with other similar sequences of events.
• For example, in retail department stores, this data mining technique comes up with a second department that has a sales stream similar to the first.
• Finding similar sequential price movements of stock is another application of this technique.
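One simple way to sketch this is to compare sales streams by correlation and pick the stream that moves most like a given one. Pearson correlation is an assumption made for this illustration, since the slides do not name a similarity measure. The department figures and function names are invented.

```python
import math


def correlation(xs, ys):
    # Pearson correlation: +1 means the two series move together exactly.
    # Assumes neither series is constant (non-zero variance).
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def most_similar(target, streams):
    # Name of the other stream whose movement best matches the target's.
    others = {name: s for name, s in streams.items() if name != target}
    return max(others, key=lambda name: correlation(streams[target], others[name]))


# Hypothetical weekly sales per department.
streams = {
    "toys": [10, 12, 15, 30, 45],
    "games": [20, 25, 29, 61, 88],
    "garden": [40, 38, 30, 12, 5],
}
match = most_similar("toys", streams)
```

The same comparison could be applied to stock price series instead of department sales.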
Data Warehousing Lecture 24
Outline
• Artificial Intelligence for Data Mining
• Neural Networks
• Neural Network Characteristics
• Anatomy of a Neural Network
• Neural Network Model
• How a Neural Network Works?
• Advantages and Disadvantages
Artificial Intelligence for Data Mining
• Neural networks are useful for data mining and decision-support applications.
• People are good at generalizing from experience.
• Computers excel at following explicit instructions over and over.
• Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains.
Neural Networks
• Neural networks mimic the human brain by learning from a training dataset and applying the learning to generalize patterns for classification and prediction.
• These algorithms are effective when the data is shapeless and lacks any apparent pattern.
• The basic unit of an artificial neural network is modeled by looking at the neurons in the brain.
• This unit is known as a node and is one of the two main structures of the neural network model.
• The other structure is the link that corresponds to the connection between neurons in the brain.
Neural Network Characteristics
• Neural networks are useful for pattern recognition or data classification, through a learning process.
• Neural networks simulate biological systems, where learning involves adjustments to the synaptic connections between neurons
Anatomy of a Neural Network
• Neural Networks map a set of input nodes to a set of output nodes.
• The number of inputs/outputs is variable.
• The network itself is composed of an arbitrary number of nodes with an arbitrary topology.
[Figure: a neural network mapping input nodes (Input 0, Input 1, ..., Input n) to output nodes (Output 0, Output 1, ..., Output m)]
Neural Network Model
How a Neural Network Works?
• The neural network receives values of the variables or predictors at the input nodes.
• If there are 15 different predictors, then there are 15 input nodes.
• Weights may be applied to the predictors to condition them properly.
• There may be several inner layers operating on the predictors and they move from node to node until the discovered result is presented at the output node.
• The inner layers are also known as hidden layers because as the input dataset is running through many iterations, the inner layers rehash the predictors over and over again.
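The flow from input nodes through a hidden layer to an output node can be sketched as a forward pass. The sketch below is illustrative only: the sigmoid activation, the bias terms, the layer sizes and all weight values are assumptions, and in practice the weights are learned from the training dataset rather than written by hand.

```python
import math


def sigmoid(x):
    # Squashes any value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    # Each node sums its weighted inputs, adds a bias, and applies the
    # activation function.
    return [
        sigmoid(sum(w * v for w, v in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(predictors, layers):
    # Values move from layer to layer until they reach the output nodes.
    values = predictors
    for weights, biases in layers:
        values = layer(values, weights, biases)
    return values


# Hypothetical network: 3 input nodes, one hidden layer of 2 nodes, and
# 1 output node. Every weight and bias here is made up.
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),  # output layer
]
output = forward([1.0, 0.5, -1.5], network)
```

With 15 predictors, the first layer would simply take 15 input values and each hidden node would carry 15 weights.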
Advantages and Disadvantages
• Advantages
– Adapt to unknown situations
– Robustness: fault tolerance due to network redundancy
– Autonomous learning and generalization
• Disadvantages
– Not exact
– Large complexity of the network structure
Data Warehousing Lecture 25
Link Analysis Example
Data Warehousing Lecture 26
The Wellmeadows Hospital Case Study
Introduction
• The Wellmeadows Case Study describes a small hospital located in Edinburgh.
• The Wellmeadows Hospital, which specializes in health care for the elderly, requires a database comprised of data recorded, maintained, and accessed by the hospital.
• The objective is to create this database with as much functionality and as little redundancy as possible.
Introduction
• Successful projects begin with requirements gathering.
• However, in this case, Wellmeadows performed its own requirements gathering procedures.
• These requirements are summarized on the basis of the different entities.
• Merely reading the material that should be contained in a database is not enough; every sentence has to be analyzed and noted before continuing.
Identify Entity Types
• The Wellmeadows Case Study identifies fifteen distinct entity types.
Wards
• There are 17 wards with a total of 240 beds. Each ward has a unique ward number & name (for example, Orthopaedic), a location (for example, E Block), and a telephone extension.
Staff
• The Staff entity is by far the most complex, consisting of multiple personnel of different rank (for example, senior and junior doctors, consultants, physiotherapists).
• The three main positions are: the Medical Director, the Personnel Officer and the Charge Nurse.
• The Medical Director has overall responsibility of management for the hospital.
• The Personnel Officer is responsible for ensuring that the appropriate number and type of staff are associated with the correct ward or out-patient clinic.
• The Charge Nurse is responsible for overseeing the day-to-day operation of the ward/clinic. This includes allocating a budget and tracking resources such as beds and supplies.
Staff Form
Wellmeadows Hospital Staff Form
Staff Number: S011
• Personal Details: First Name, Last Name, Address, Tel. No., Sex, DOB, NIN
• Position, Current Salary, Salary Scale
• Allocated to Ward, Hours/Week, Permanent or Temporary, Paid Weekly or Monthly
• Qualification(s): Type, Date, Institution
• Work Experience: Position, Start Date, Finish Date, Organization
Note: Please enter additional qualifications/work experience overleaf.
Staff Qualifications & Work Experience
• Each member of staff can have more than one qualification and more than one work experience entry.
• There exists a one-to-many relationship between staff and their qualifications & work experience.
• Therefore we require two separate tables, StaffQualification and StaffWorkexperience, for staff qualifications and staff work experience respectively.
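The one-to-many design can be sketched with Python's built-in `sqlite3` module. The table names StaffQualification and StaffWorkexperience come from the slide and the staff number S011 from the Staff Form. Every column name and the sample rows are assumptions made for this illustration, loosely following the fields on the form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Staff (
        staffNo   TEXT PRIMARY KEY,
        firstName TEXT,
        lastName  TEXT
    );
    -- One row per qualification; many rows may refer to one staff member.
    CREATE TABLE StaffQualification (
        staffNo     TEXT REFERENCES Staff(staffNo),
        qualType    TEXT,
        qualDate    TEXT,
        institution TEXT
    );
    -- One row per previous post held by a staff member.
    CREATE TABLE StaffWorkexperience (
        staffNo      TEXT REFERENCES Staff(staffNo),
        position     TEXT,
        startDate    TEXT,
        finishDate   TEXT,
        organization TEXT
    );
    """
)
# Hypothetical sample data: one staff member with two qualifications.
conn.execute("INSERT INTO Staff VALUES ('S011', 'Example', 'Person')")
conn.executemany(
    "INSERT INTO StaffQualification VALUES (?, ?, ?, ?)",
    [
        ("S011", "BSc", "1990-07-01", "Example University"),
        ("S011", "MSc", "1992-07-01", "Example University"),
    ],
)
rows = conn.execute(
    "SELECT s.staffNo, q.qualType FROM Staff s "
    "JOIN StaffQualification q ON q.staffNo = s.staffNo ORDER BY q.qualType"
).fetchall()
```

Joining Staff to StaffQualification returns one row per qualification, which is what the one-to-many relationship implies.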
Patients
• Each patient has a unique patient number and a record of their personal information.
Patient Registration Form
Wellmeadows Hospital Patient Registration Form
Patient Number: P01234
• Personal Details: First Name, Last Name, Address, Tel. No., Sex, DOB, Marital Status
• Date Registered
• Next-of-Kin Details: Full Name, Relationship, Address, Tel. No.
• Local Doctor Details: Full Name, Clinic No., Address, Tel. No.
Patient Appointments
• Each referred patient is given an appointment, which is recorded and has a unique appointment number.
• The details of each patient’s appointment are recorded and include the name and staff number of the consultant undertaking the examination, the date and time of the appointment, and the examination room.
• As a result of the examination, the patient is either recommended to attend the out-patient clinic or is placed on a waiting list until a bed can be found in an appropriate ward.
Out-Patients
• Each out-patient has a unique patient number and a record of their personal information.
In-Patients
• Each in-patient has a unique patient number and a record of their personal information.
In-Patient Form
Wellmeadows Hospital Patient Allocation
• Header fields: Ward Number, Ward Name, Location, Charge Nurse, Staff Number, Tel Extn., Page, Week Beginning
• Columns for each patient: Patient No., Name, On Waiting List, Expected Stay (Days), Date Placed, Date Leave, Actual Leave, Bed No.
Patient Medication
• Whenever a patient is prescribed medication, the details are recorded.
Patient Medication Form
Surgical & Non-Surgical Supplies
• The Wellmeadows Hospital maintains a central stock of surgical (e.g. syringes, sterile dressings) and non-surgical (e.g. plastic bags, aprons) supplies.
Pharmaceutical Supplies
• For each pharmaceutical supply (e.g. antibiotics, painkillers) there is a detailed recording.
Ward Requisitions
• Forms used to order supplies held by the hospital.
Requisition Form
Wellmeadows Hospital Central Store Requisition Form
• Requisition Number: ___________
• Ward Number, Ward Name
• Requisitioned By, Requisition Date
• Received By: __________________ Date Received: __________________
Suppliers
• Each supplier of surgical/non-surgical and pharmaceutical supplies has a unique number and details of the transaction.
Wellmeadows Hospital OLTP Model
Possible Reports from OLTP System
• Search for staff who have particular qualifications or previous work experience.
• Produce a report listing the details of staff allocated to each ward.
• Produce a report listing the details of patients referred to a particular ward.
• Produce a report listing the details of patients currently located in a particular ward.
Possible Reports from OLTP System
• Produce a report listing the details of patients currently on the waiting list for a particular ward.
• Produce a report listing the details of medication given to a particular patient.
• Produce a report listing the details of supplies provided to a specific ward.
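One of the reports above, staff allocated to each ward, can be sketched as a single join using Python's built-in `sqlite3` module. The case study does not give the schema, so the table names, column names and sample rows below are all hypothetical. Only the ward name Orthopaedic and the staff number S011 appear in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Ward (wardNo INTEGER PRIMARY KEY, wardName TEXT);
    CREATE TABLE Staff (
        staffNo  TEXT PRIMARY KEY,
        lastName TEXT,
        position TEXT,
        wardNo   INTEGER REFERENCES Ward(wardNo)
    );
    INSERT INTO Ward VALUES (11, 'Orthopaedic'), (12, 'Example Ward');
    INSERT INTO Staff VALUES
        ('S011', 'Alpha', 'Charge Nurse', 11),
        ('S012', 'Beta', 'Staff Nurse', 11),
        ('S013', 'Gamma', 'Staff Nurse', 12);
    """
)


def staff_by_ward(conn):
    # Report: details of the staff allocated to each ward, ward by ward.
    query = """
        SELECT w.wardName, s.staffNo, s.lastName, s.position
        FROM Ward w JOIN Staff s ON s.wardNo = w.wardNo
        ORDER BY w.wardNo, s.staffNo
    """
    return conn.execute(query).fetchall()


report = staff_by_ward(conn)
```

The other reports follow the same pattern, each joining the tables that hold the details being listed.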