Data Mining: A Closer Look
Kritda Vavnilanon
Outline
• Objectives
• Data Mining Strategies
• Supervised Data Mining Techniques
• Association Rules
• Clustering Techniques
• Evaluating Performance
• Summary
• Questions
Objectives
• Determine an appropriate data mining
strategy for a specific problem.
• Know about several data mining
techniques and how each technique builds
a generalized model to represent data.
• Understand how a confusion matrix is
used to help evaluate supervised learner
models.
Objectives
• Understand basic techniques for
evaluating supervised learner models with
numeric output.
• Know how measuring lift can be used to
compare the performance of several
competing supervised learner models.
• Understand basic techniques for
evaluating unsupervised learner models.
Data Mining Strategies
• Supervised learning builds models by using input attributes to predict output attribute values.
• Output attributes are also known as dependent variables, as their outcome depends on the values of one or more input attributes.
• Input attributes are referred to as independent variables.
Data Mining Strategies
• Classification is probably the best
understood of all data mining strategies.
• Classification tasks have three common
characteristics:
– Learning is supervised.
– The dependent variable is categorical.
– The emphasis is on building models able to
assign new instances to one of a set of well-
defined classes.
Data Mining Strategies
• Some example classification tasks include the
following:
– Determine those characteristics that differentiate
individuals who have suffered a heart attack from those
who have not.
– Develop a profile of a “successful” person.
– Determine if a credit card purchase is fraudulent.
– Classify a car loan applicant as a good or a poor credit
risk.
– Develop a profile to differentiate female and male stroke
victims.
Data Mining Strategies
• Estimation determines a value for an unknown output attribute.
• However, unlike classification, the output attribute for an estimation problem is numeric.
– Estimate the number of minutes before a thunderstorm will reach a given location.
– Estimate the salary of an individual who owns a sports car.
– Estimate the likelihood that a credit card has been stolen.
– Estimate the length of a gamma ray burst.
Data Mining Strategies
• Prediction: It is not easy to differentiate
prediction from classification or estimation.
• Unlike a classification or estimation model,
the purpose of a predictive model is to
determine future outcome rather than current
behavior.
• The output attribute(s) of a predictive model
can be categorical or numeric.
Data Mining Strategies
• Determine whether a credit card customer is
likely to take advantage of a special offer
made available with their credit card billing.
• Predict next week’s closing price for the Dow
Jones Industrial Average.
• Forecast which telephone subscribers are
likely to change providers during the next
three months.
Data Mining Strategies
• IF 169 <= Maximum Heart Rate <= 202 THEN
Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
• IF Thal = Rev and Chest Pain Type =
Asymptomatic THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
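The two percentages reported with each rule can be computed directly from labeled data. The sketch below assumes the usual definitions, which this section does not spell out: rule accuracy is the share of instances satisfying the IF part that belong to the THEN class, and rule coverage is the share of THEN-class instances that satisfy the IF part. The records and the helper `rule_stats` are hypothetical and are not taken from the heart patient dataset.

```python
def rule_stats(records, antecedent, target_class, class_key="class"):
    """Return (accuracy, coverage) for the rule IF antecedent THEN class = target_class.

    Assumed definitions (not stated in the slides):
      accuracy = matching instances in the target class / all matching instances
      coverage = matching instances in the target class / all target-class instances
    """
    matched = [r for r in records if antecedent(r)]
    in_class = [r for r in records if r[class_key] == target_class]
    hits = [r for r in matched if r[class_key] == target_class]
    accuracy = len(hits) / len(matched) if matched else 0.0
    coverage = len(hits) / len(in_class) if in_class else 0.0
    return accuracy, coverage


# Hypothetical toy records, for illustration only.
records = [
    {"max_heart_rate": 175, "class": "Healthy"},
    {"max_heart_rate": 180, "class": "Healthy"},
    {"max_heart_rate": 150, "class": "Healthy"},
    {"max_heart_rate": 172, "class": "Sick"},
    {"max_heart_rate": 120, "class": "Sick"},
    {"max_heart_rate": 130, "class": "Sick"},
]

# IF 169 <= Maximum Heart Rate <= 202 THEN Concept Class = Healthy
accuracy, coverage = rule_stats(
    records, lambda r: 169 <= r["max_heart_rate"] <= 202, "Healthy"
)
print(f"Rule accuracy: {accuracy:.2%}  Rule coverage: {coverage:.2%}")
```

On these toy records three instances satisfy the precondition and two of them are healthy, while two of the three healthy instances are covered, so both measures come out at two thirds.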
Data Mining Strategies
• WARNING 1: Have your maximum heart rate checked on a regular basis. If your maximum heart rate is low, you may be at risk of having a heart attack! (Predictive)
• WARNING 2: If you have a heart attack, expect your maximum heart rate to decrease. (Classification)
Data Mining Strategies
• With unsupervised clustering we are without a dependent variable to guide the learning process.
• Rather, the learning program builds a knowledge structure by using some measure of cluster quality to group instances into two or more classes.
• A primary goal of an unsupervised clustering strategy is to discover concept structures in data.
Data Mining Strategies
• Common uses of unsupervised clustering
include:
– Determine if meaningful relationships in the form
of concepts can be found in the data
– Evaluate the likely performance of a supervised
learner model
– Determine a best set of input attributes for
supervised learning
– Detect outliers
Data Mining Strategies
• Unsupervised clustering can also help detect any atypical instances present in the data.
• Atypical instances are referred to as outliers.
• Outliers can be of great importance and should be identified whenever possible.
• Statistical mining applications frequently remove outliers.
• With data mining, the outliers might be just those instances we are trying to identify.
Data Mining Strategies
• The purpose of market basket analysis is to find interesting relationships among retail products.
• The results of a market basket analysis help retailers design promotions, arrange shelf or catalog items, and develop cross-marketing strategies.
• Association rule algorithms are often used to apply a market basket analysis to a set of data.
Supervised Data Mining Techniques
• A data mining technique is used to apply a data mining strategy to a set of data.
• A specific data mining technique is defined by an algorithm and an associated knowledge structure such as a tree or a set of rules.
• Production Rules: Any decision tree can be translated into a set of production rules.
• However, we do not need an initial tree structure to generate production rules. RuleMaker, the production rule generator that comes with your iDA software, uses ratios together with mathematical set theory operations to create rules from spreadsheet data.
• Earlier in this chapter you saw two rules generated by RuleMaker for the heart patient dataset. Let’s apply RuleMaker to the credit card promotion data.
Supervised Data Mining Techniques
A combination of one or more of the dataset
attributes differentiates between Acme Credit
Card Company card holders who have taken
advantage of a life insurance promotion and
those card holders who have chosen not to
participate in the promotional offer.
Supervised Data Mining Techniques
• IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
• IF Sex = Male & Income Range = 40–50K
THEN Life Insurance Promotion = No
Rule Accuracy: 100.00%
Rule Coverage: 50.00%
Supervised Data Mining Techniques
• IF Credit Card Insurance = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 33.33%
• IF Income Range = 30–40K & Watch Promotion = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 33.33%
Supervised Data Mining Techniques
• A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain.
• As the human brain contains billions of neurons and a typical neural network has fewer than one hundred nodes, the comparison is somewhat superficial.
• However, neural networks have been successfully applied to problems across several disciplines and for this reason are quite popular in the data mining community.
Supervised Data Mining Techniques
• Neural networks come in many shapes and forms and can be constructed for supervised learning as well as unsupervised clustering.
• In all cases the values input into a neural network must be numeric.
• The feed-forward network is a popular supervised learner model.
Supervised Data Mining Techniques
• Figure 2.2 shows a fully connected feed-forward
neural network consisting of three layers.
• With a feed-forward network the input attribute
values for an individual instance enter at the input
layer and pass forward through the network to the
output layer.
• The output layer may contain one or several nodes.
• The output layer of the network shown in Fig. 2.2
contains two nodes.
Supervised Data Mining Techniques
• Therefore the output of the neural network will be an ordered pair of values.
• Neural networks operate in two phases.
• The first phase is called the learning phase. During network learning, the input values associated with each instance enter the network at the input layer. One input layer node exists for each input attribute contained in the data.
Supervised Data Mining Techniques
• The actual output value for each instance is computed and compared with the desired network output. Any error between the desired and computed output is propagated back through the network by changing connection-weight values.
• Training terminates after a certain number of iterations or when the network converges to a predetermined minimum error rate. During the second phase of operation, the network weights are fixed and the network is used to classify new instances.
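The second phase, in which the weights are fixed and an instance is passed through the network, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sigmoid activation, the absence of bias terms, the layer sizes, and every weight value are hypothetical choices, and the learning phase (error propagation and weight updates) is not shown.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1); an assumed activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights):
    """Compute one fully connected layer.

    weights[j][i] is the connection weight from input node i to node j.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in weights
    ]


def feed_forward(inputs, hidden_weights, output_weights):
    """Pass one instance from the input layer through a hidden layer to the output layer."""
    hidden = layer_output(inputs, hidden_weights)
    return tuple(layer_output(hidden, output_weights))


# Hypothetical fixed weights: 3 input nodes, 2 hidden nodes, 2 output nodes.
hidden_weights = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
output_weights = [[0.1, -0.2], [0.3, 0.6]]

# Inputs must be numeric; one value per input attribute.
outputs = feed_forward([1.0, 0.4, 0.7], hidden_weights, output_weights)
print(outputs)  # an ordered pair, one value per output node
```

Because the output layer has two nodes, the result is an ordered pair of values, each between 0 and 1 under the sigmoid assumption.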
Supervised Data Mining Techniques
• Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute.
• A linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values. Here is a linear regression equation for the data in Table 2.3:
life insurance promotion = 0.5909 × (credit card insurance) – 0.5455 × (sex) + 0.7727
Supervised Data Mining Techniques
• life insurance promotion = 0.5909(0) – 0.5455(0) + 0.7727 = 0.7727
• Because 0.7727 is close to 1, this individual is likely to take advantage of the promotional offer.
• Although regression can be nonlinear, the most popular use of regression is for linear modeling.
• Linear regression is appropriate provided the data can be accurately modeled with a straight line function.
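The regression equation for Table 2.3 can be evaluated for any instance once the two inputs are coded numerically. The sketch below assumes a 0/1 coding for both attributes; the slides do not say which category maps to which value, so the function name and argument names are illustrative only.

```python
def life_insurance_promotion(credit_card_insurance, sex):
    """Evaluate the linear regression equation given for Table 2.3.

    Both inputs are assumed to be coded as 0 or 1; the mapping from
    categories to codes is not given in the slides.
    """
    return 0.5909 * credit_card_insurance - 0.5455 * sex + 0.7727


# The worked case from the slides: both inputs coded as 0.
score = life_insurance_promotion(credit_card_insurance=0, sex=0)
print(round(score, 4))
```

With both inputs at 0 only the constant term remains, which reproduces the value 0.7727 computed above.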
Association Rules
• As the name implies, association rule mining techniques are used to discover interesting associations between attributes contained in a database.
• Unlike traditional production rules, association rules can have one or several output attributes. Also, an output attribute for one rule can be an input attribute for another rule.
Association Rules
• Association rules are a popular technique for
market basket analysis because all possible
combinations of potentially interesting
product groupings can be explored.
• For this reason a limited number of attributes
are able to generate hundreds of association
rules.
Association Rules
• IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes
• IF Sex = Male & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = No
• IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
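Because many candidate rules are generated, each one is normally scored before it is kept. The sketch below uses support and confidence, the standard measures for association rules; they are not named in this section, so treat them as an assumption. The records are hypothetical, and the rule has two output attributes, as association rules allow.

```python
def rule_measures(records, antecedent, consequent):
    """Return (support, confidence) for IF antecedent THEN consequent.

    antecedent and consequent are dicts mapping attribute name to required value.
      support    = instances matching both sides / all instances
      confidence = instances matching both sides / instances matching the antecedent
    """
    def satisfies(record, conditions):
        return all(record.get(k) == v for k, v in conditions.items())

    n_antecedent = sum(1 for r in records if satisfies(r, antecedent))
    n_both = sum(
        1 for r in records if satisfies(r, antecedent) and satisfies(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy records, for illustration only.
records = [
    {"sex": "Female", "age": "over40", "cc_ins": "No", "life_ins": "Yes"},
    {"sex": "Female", "age": "over40", "cc_ins": "No", "life_ins": "Yes"},
    {"sex": "Female", "age": "over40", "cc_ins": "Yes", "life_ins": "Yes"},
    {"sex": "Male", "age": "over40", "cc_ins": "No", "life_ins": "No"},
    {"sex": "Female", "age": "under40", "cc_ins": "No", "life_ins": "No"},
]

# IF Sex = Female & Age = over40
# THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
support, confidence = rule_measures(
    records,
    {"sex": "Female", "age": "over40"},
    {"cc_ins": "No", "life_ins": "Yes"},
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```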
Clustering Techniques
• Several unsupervised clustering techniques
can be identified.
• One common technique is to apply some
measure of similarity to divide instances into
disjoint partitions. The partitions are
generalized by computing a group mean for
each cluster or by listing a most typical subset
of instances from each cluster.
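One well-known instance of this partition-and-summarize approach is K-Means, shown here as a sketch; the slides do not name a specific algorithm, and the use of squared Euclidean distance as the similarity measure, the starting centers, and the data points are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points (smaller means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Group mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centers, max_iterations=100):
    """Partition points into disjoint clusters and summarize each by its group mean."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iterations):
        # Assign every instance to its most similar (nearest) center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)), key=lambda i: squared_distance(p, centers[i])
            )
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep empty clusters in place.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Hypothetical two-attribute instances and starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(centers)
print(clusters)
```

The returned centers are the group means that generalize each partition; listing the instances closest to each center would give the "most typical subset" alternative.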
Clustering Techniques
• A second approach is to partition data in a
hierarchical fashion where each level of the
hierarchy is a generalization of the data at
some level of abstraction.
• One of the unsupervised clustering models
that comes with your iDA software tool is a
hierarchical clustering system.
Clustering Techniques
By applying unsupervised clustering to the
instances of the Acme Credit Card Company
database, we will find a subset of input
attributes that differentiate card holders who
have taken advantage of the life insurance
promotion from those card holders who have
not accepted the promotional offer.
Clustering Techniques
• IF Sex = Female & 43 >= Age >= 35 & Credit
Card Insurance = No THEN Class = 3
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
Evaluating Performance
• The most critical of all the steps in the data mining
process.
• A common sense approach to evaluating supervised
and unsupervised learner models.
• Three general questions:
– Will the benefits received from a data mining project more
than offset the cost of the data mining process?
– How do we interpret the results of a data mining session?
– Can we use the results of a data mining process with
confidence?
Evaluating Performance
• Is there knowledge about projects similar to the
proposed project? What are the success rates and
costs of projects similar to the planned project?
• What is the current form of the data to be analyzed?
Does the data exist or will it have to be collected?
When a wealth of data exists and is not in a form
amenable for data mining, the greatest project cost
will fall under the category of data preparation. In
fact, a larger question may be whether to develop a
data warehouse for future data mining projects.
Evaluating Performance
• Who will be responsible for the data mining
project? How many current employees will be
involved? Will outside consultants be hired?
• Is the necessary software currently available?
If not, will the software be purchased or
developed? If purchased or developed, how
will the software be integrated into the current
system?
Evaluating Performance
• Evaluating Supervised Learner Models
• Classification correctness is best calculated
by presenting previously unseen data in the
form of a test set to the model being
evaluated.
• Test set model accuracy can be summarized
in a table known as a confusion matrix.
Evaluating Performance
• A generic confusion matrix for the three-class case is shown in the table below. Rows give the actual class and columns give the computed class.
• Values along the main diagonal give the total number of correct classifications for each class. For example, a value of 15 for C11 means that 15 class C1 test set instances were correctly classified.
• Values other than those on the main diagonal represent classification errors. To illustrate, suppose C12 has the value 4. This means that four class C1 instances were incorrectly classified as belonging to class C2.
            Computed C1   Computed C2   Computed C3
Actual C1   C11           C12           C13
Actual C2   C21           C22           C23
Actual C3   C31           C32           C33
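A confusion matrix of this shape can be tallied from the actual and computed class of each test set instance. The sketch below follows the convention described above, with rows for the actual class and columns for the computed class; the class labels and test results are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Build a matrix where entry [i][j] counts class i instances classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of test set instances on the main diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical test set results, for illustration only.
classes = ["C1", "C2", "C3"]
actual = ["C1", "C1", "C1", "C2", "C2", "C3", "C3", "C3"]
predicted = ["C1", "C2", "C1", "C2", "C2", "C3", "C1", "C3"]

matrix = confusion_matrix(actual, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```

Here the single off-diagonal count in the first row is a C1 instance misclassified as C2, the same kind of error that C12 records in the generic table.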
Evaluating Performance
• Two-Class Error Analysis
• Evaluating Numeric Output
• The mean absolute error for a set of test data is computed by finding the average absolute difference between computed and desired outcome values.
• In a similar manner, the mean squared error is simply the average squared difference between computed and desired outcome values.
• For the best test set accuracy, we want the smallest possible value for each measure.
• The root mean squared error (rms) is simply the square root of a mean squared error value. Rms is frequently used as a measure of test set accuracy with feed-forward neural networks.
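The three numeric error measures follow directly from their definitions. The sketch below is a plain implementation over paired lists of computed and desired values; the sample values are hypothetical.

```python
import math


def mean_absolute_error(computed, desired):
    """Average absolute difference between computed and desired values."""
    return sum(abs(c - d) for c, d in zip(computed, desired)) / len(computed)


def mean_squared_error(computed, desired):
    """Average squared difference between computed and desired values."""
    return sum((c - d) ** 2 for c, d in zip(computed, desired)) / len(computed)


def root_mean_squared_error(computed, desired):
    """Square root of the mean squared error (rms)."""
    return math.sqrt(mean_squared_error(computed, desired))


# Hypothetical test set outputs, for illustration only.
computed = [2.0, 4.0, 7.0, 8.0]
desired = [1.0, 4.0, 5.0, 8.0]

mae = mean_absolute_error(computed, desired)
mse = mean_squared_error(computed, desired)
rms = root_mean_squared_error(computed, desired)
print(mae, mse, rms)
```

Squaring weights the single error of 2 more heavily than the error of 1, which is why the mean squared error (1.25) exceeds the mean absolute error (0.75) on this data.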
Evaluating Performance
• Comparing Models by Measuring Lift
Lift = P(Ci | Sample) / P(Ci | Population)
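Lift is the ratio of two proportions, so it can be computed from four counts. In the sketch below the function name, its parameters, and the mailing figures are hypothetical; the sample stands for the instances a model selects as most likely to belong to class Ci.

```python
def lift(sample_hits, sample_size, population_hits, population_size):
    """Lift = P(Ci | Sample) / P(Ci | Population)."""
    p_sample = sample_hits / sample_size
    p_population = population_hits / population_size
    return p_sample / p_population


# Hypothetical figures: 1,000 responders among 100,000 customers (1%),
# and 540 responders among the 20,000 customers a model selects (2.7%).
model_lift = lift(sample_hits=540, sample_size=20_000,
                  population_hits=1_000, population_size=100_000)
print(round(model_lift, 2))
```

A lift above 1 means the model's sample is richer in class Ci than the population as a whole; when comparing competing models on the same sample size, the one with the higher lift is preferred.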
Evaluating Performance
[Figure not reproduced; axis label: % Sampled]
Evaluating Performance
• Unsupervised Model Evaluation
• Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.
• A general technique is to employ supervised learning to evaluate an unsupervised clustering.
Evaluating Performance
• All unsupervised clustering techniques compute some measure of cluster quality.
• A common technique is to calculate the summation of squared error differences between the instances of each cluster and their corresponding cluster center.
• Smaller values for sums of squared error differences indicate clusters of higher quality. However, for a detailed evaluation of unsupervised clustering, it is supervised learning that comes to the rescue.
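The sum of squared error measure can be sketched as follows. The cluster center is taken to be the mean of the cluster's instances, which is an assumption about how the center is defined, and both example clusterings are hypothetical groupings of the same four points.

```python
def sum_squared_error(clusters):
    """Total squared distance between each instance and the mean of its cluster."""
    total = 0.0
    for points in clusters:
        # Cluster center assumed to be the attribute-wise mean of the cluster.
        center = [sum(coords) / len(points) for coords in zip(*points)]
        total += sum((x - m) ** 2 for p in points for x, m in zip(p, center))
    return total


# Two hypothetical clusterings of the same four instances.
tight = [[(1.0, 1.0), (1.0, 3.0)], [(8.0, 8.0), (10.0, 8.0)]]
loose = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 3.0), (10.0, 8.0)]]

print(sum_squared_error(tight))
print(sum_squared_error(loose))
```

The grouping that keeps nearby instances together yields the much smaller total, which is the sense in which smaller values indicate clusters of higher quality.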
Evaluating Performance
• Finally, a common misconception in the business world is
that data mining can be accomplished simply by choosing the
right tool, turning it loose on some data, and waiting for
answers to problems.
• This approach is doomed to failure. Machines are still
machines. It is the analysis of results provided by the human
element that ultimately dictates the success or failure of a
data mining project.
• A formal KDD process model such as the one described in
Chapter 5 will help provide more complete answers to the
questions posed at the beginning of this section.
Summary
• Data mining strategies include classification, estimation, prediction, unsupervised clustering, and market basket analysis.
• The output of a classification strategy is categorical.
• The output of an estimation strategy is numeric.
• A predictive strategy is used to design models for predicting future outcome.
• Unsupervised clustering strategies are employed to discover hidden concept structures in data as well as to locate atypical data instances.
• The purpose of market basket analysis is to find interesting relationships among retail products.
Summary
• Data mining techniques are defined by an algorithm and a knowledge structure.
• Common features that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric.
• Performance evaluation is probably the most critical of all the steps in the data mining process.
• Supervised model evaluation is often performed using a training/test set scenario.
Summary
• A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies.
• Unsupervised models support some measure of cluster quality that can be used for evaluative purposes. Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model.
Questions