Conventional Data Mining Techniques II
A B M Shawkat Ali
“Computers are useless. They can only give you answers.” – Pablo Picasso
My Request
“A good listener is not only popular everywhere, but after a while he gets to know something”
- Wilson Mizner
Association Rule Mining
• Features of association rule mining.
• Apriori: the most popular association rule mining algorithm.
• Association rule evaluation.
• Association rule mining using WEKA.
• Strengths and weaknesses of association rule mining.
• Applications of association rule mining.
• Affinity Analysis.
• Market Basket Analysis: which products go together in a basket?
– Uses: determine marketing strategy, plan promotions, shelf layout.
• Looks like production rules, but more than one attribute may appear in the consequent.
– IF customers purchase milk THEN they purchase bread AND sugar.
Association rules
Transaction data

Transaction ID   Itemset or Basket
01               {'webcam', 'laptop', 'printer'}
02               {'laptop', 'printer', 'scanner'}
03               {'desktop', 'printer', 'scanner'}
04               {'desktop', 'printer', 'webcam'}

Table 7.1 Transaction data
Rule for Support:
• The minimum percentage of instances in the database that contain all items listed in a given association rule.
Concepts of association rules
Example
• 5,000 transactions contain milk and bread in a set of 50,000
• Support => 5,000 / 50,000 = 10%
Rule for Confidence:
Given a rule of the form “If A then B”, the confidence of the rule is the conditional probability that B is true when A is known to be true.
Concepts of association rules
Example
• IF customers purchase milk THEN they also purchase bread:
– In a set of 50,000 transactions, 10,000 contain milk, and 5,000 of these also contain bread.
– Confidence => 5,000 / 10,000 = 50%
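Both measures fall straight out of transaction counts. A minimal Python sketch (the function names are ours) that reproduces the milk-and-bread figures above:

def support(count_a_and_b, total_transactions):
    """Fraction of all transactions that contain every item in the rule."""
    return count_a_and_b / total_transactions

def confidence(count_a_and_b, count_a):
    """Conditional probability of the consequent given the antecedent."""
    return count_a_and_b / count_a

total = 50_000          # all transactions
milk = 10_000           # transactions containing milk
milk_and_bread = 5_000  # transactions containing both milk and bread

print(support(milk_and_bread, total))    # 0.1 -> 10% support
print(confidence(milk_and_bread, milk))  # 0.5 -> 50% confidence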
Parameters of ARM
1. To find all items that appear frequently in transactions. The level of frequency of appearance is determined by a pre-specified minimum support count. Any item or set of items that occurs less frequently than this minimum support level is not included for analysis.
2. To find strong associations among the frequent items. The strength of the association is quantified by the confidence. Any association below a pre-specified level of confidence is not used to generate rules.
Relevance of ARM
• On Thursdays, grocery store consumers often purchase diapers and beer together.
• Customers who buy a new car are very likely to purchase an extended vehicle warranty.
• When a new hardware store opens, one of the most commonly sold items is toilet fittings.
Functions of ARM
• Finding the set of items that has significant impact on the business.
• Collating information from numerous transactions on these items from many disparate sources.
• Generating rules on significant items from counts in transactions.
Single-dimensional association rules
Transaction id   'webcam'  'laptop'  'printer'  'scanner'  'desktop'
01                  1         1         1          0          0
02                  0         1         1          1          0
03                  0         0         1          1          1
04                  1         0         1          0          1

Table 7.2 Boolean form of the transaction data.
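This Boolean form can be produced mechanically from the baskets of Table 7.1. A small Python sketch (the item ordering is ours):

transactions = {
    '01': {'webcam', 'laptop', 'printer'},
    '02': {'laptop', 'printer', 'scanner'},
    '03': {'desktop', 'printer', 'scanner'},
    '04': {'desktop', 'printer', 'webcam'},
}
items = ['webcam', 'laptop', 'printer', 'scanner', 'desktop']

# One-hot encode each basket: 1 if the item is present, 0 otherwise.
for tid, basket in sorted(transactions.items()):
    print(tid, [1 if item in basket else 0 for item in items])
# 01 [1, 1, 1, 0, 0]
# 02 [0, 1, 1, 1, 0]
# 03 [0, 0, 1, 1, 1]
# 04 [1, 0, 1, 0, 1]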
(cont.)
Multidimensional association rules
General considerations
• We are interested in association rules that show a lift in product sales, where the lift is the result of the product’s association with one or more other products.
• We are also interested in association rules that show a lower than expected confidence for a particular association.
Figure 7.1 Enumeration tree of transaction items of Table 7.1. In the left nodes, branches reduce by one at each downward progression, starting with 5 branches and ending with 1 branch, which is typical.
Association models
nCk = the number of combinations of n things taken k at a time, i.e. nCk = n! / (k! (n − k)!).
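For the five items of Table 7.1 this enumeration can be checked directly; a quick Python sanity check:

import math

n = 5  # the five items of Table 7.1
for k in range(1, n + 1):
    print(k, math.comb(n, k))  # 5, 10, 10, 5, 1
# 5 + 10 + 10 + 5 + 1 = 31 = 2**5 - 1 candidate itemsets in all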
Two other parameters:
• Improvement (IMP) = confidence of the rule / support of the consequent (i.e. the lift of the rule).
• Share (SH) = LMV / TMV,
where LMV = local measure value and TMV = total measure value.
• Low-support products are lumped into bigger categories, and high-support products are broken up into subgroups.
• Examples: different kinds of potato chips can be lumped with other munchies into snacks, and ice cream can be broken down into different flavours.
Large Datasets
• The number of combinations that can be generated from the transactions in an ordinary supermarket can run into the billions and trillions. The amount of computation required for association rule mining can therefore stretch any computer.
APRIORI algorithm
1. All singleton itemsets are candidates in the first pass. Any item that has a support value of less than a specified minimum is eliminated.
2. Selected singleton itemsets are combined to form two-member candidate itemsets. Again, only the candidates above the pre-specified support value are retained.
(cont.)
3. The next pass creates three-member candidate itemsets and the process is repeated. The process stops only when all large itemsets are accounted for.
4. Association Rules for the largest itemsets are created first and then rules for the subsets are created recursively.
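Steps 1–3 can be captured in a few lines of Python. This is a teaching sketch (the names are ours), not a tuned implementation; run on the four-transaction database D of the worked example that follows, it reproduces the itemsets of Figure 7.2:

from itertools import combinations

def apriori_itemsets(transactions, min_support):
    """Return every large (frequent) itemset with its support count."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # pass 1: singleton candidates
    large, k = {}, 1
    while level:
        # Scan: count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        # Select: keep only candidates meeting the minimum support count.
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        large.update(survivors)
        # Create: (k+1)-member candidates whose k-subsets are all large.
        k += 1
        merged = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = [c for c in merged
                 if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return large

D = [{2, 3}, {1, 3, 5}, {1, 2, 4}, {2, 3}]
print(apriori_itemsets(D, min_support=2))
# {1}: 2, {2}: 3, {3}: 3, {2, 3}: 2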
Database D:
T.ID   Items
01     2 3
02     1 3 5
03     1 2 4
04     2 3

Scan D – candidate 1-itemsets:
Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       1

Select (support ≥ 2) – large 1-itemsets:
Itemset   Support
{1}       2
{2}       3
{3}       3

Create – candidate 2-itemsets:
{1 2}, {1 3}, {2 3}

Scan D:
Itemset   Support
{1 2}     1
{1 3}     1
{2 3}     2

Select – large 2-itemsets:
Itemset   Support
{2 3}     2
Figure 7.2 Graphical demonstration of the working of the Apriori algorithm
APRIORI in Weka
Figure 7.3 Weka environment with market-basket.arff data file
Step 2
Figure 7.4 Spend98 attribute information visualisation.
Step 3
Figure 7.5 Target attributes selection through Weka
Step 4
Figure 7.6 Discretisation filter selection
Step 5
Figure 7.7 Parameter selections for discretisation.
Step 6
Figure 7.8 Discretisation activation
Discretised data visualisation
Figure 7.9 Discretised data visualisation
Step 7
Figure 7.10 Apriori algorithm selection from Weka for ARM
Recap
• What is association rule mining?
• Apriori: the most popular association rule mining algorithm.
• Applications of association rule mining.
The Clustering Task
• Unsupervised clustering technique
• Measures for clustering performance
• Clustering algorithms
• Clustering task demonstration using WEKA
• Applications, strengths and weaknesses of the algorithms
Clustering: Unsupervised learning
• Clustering is a very common technique that appears in many different settings (not necessarily in a data mining context):
– Grouping “similar products” together to improve the efficiency of a production line
– Packing “similar items” into a basket
– Grouping “similar customers” together
– Grouping “similar stocks” together
Sl. No.   Subject Code   Marks
1         COIT21002      85
2         COIS11021      78
3         COIS32111      75
4         COIT43210      83

Table 8.1 A simple unsupervised problem
A simple clustering example
Figure 8.1 Basic clustering for the data of Table 8.1. The X-axis is the serial number and the Y-axis is the marks.
Cluster representation
How many clusters can you form?
A A A A   K K K K   Q Q Q Q   J J J J
Figure 8.2 Simple playing card data
Distance measure
• The similarity is usually captured by a distance measure.
• The originally proposed measure of distance is the Euclidean distance:
d(X, Y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }

where X = (x_1, x_2, \ldots, x_n) and Y = (y_1, y_2, \ldots, y_n).
Figure 8.3 Euclidean distance D between two points A and B
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change (see the sketch below).
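A compact Python rendering of these five steps, using math.dist for the Euclidean distance of Figure 8.3; the toy points are ours, purely for illustration:

import math
import random

def kmeans(points, k, seed=0):
    """Plain k-means on numeric tuples; a teaching sketch."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # steps 1-2: pick K random centers
    while True:
        # Step 3: assign every instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:               # step 5: centers stopped moving
            return clusters, centers
        centers = new_centers

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2)[1])  # two well-separated cluster centers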
General considerations of K-means algorithm
• Requires real-valued data.
• We must pre-select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
Example 8.2
Let us consider the dataset of Example 8.1 to find two clusters using the k-means algorithm.
Step 1. Arbitrarily, let us choose the two cluster centers to be the data points P5 (5, 2) and P7 (1, 2). Their relative positions can be seen in Figure 8.6. We could have started with any two other points; for this small dataset the final clusters turn out the same, although in general k-means can be sensitive to the initial selection.
Step 2. Let us find the Euclidean distances of all the data points from these two cluster centers.
Step 2. (Cont.)
Step 3. The new cluster centres are:
Step 4. The distances of all data points from these new cluster centres are:
Step 4. (cont.)
Step 5. By the closest-centre criterion, P5 should be moved from C2 to C1, and the new clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}.
The new cluster centres are:
Step 6. We may repeat the computations of Step 4 and we will find that no data point switches clusters. Therefore, the iteration stops and the final clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}.
Density-based methods
Figure 8.8 (a) Three irregular shaped clusters (b) Influence curve of a point
Probability-based methods
• Expectation Maximization (EM) uses a Gaussian mixture model:• Guess initial values of all the parameters until a
termination criterion is achieved• Use the probability density function to compute
the cluster probability for each instance.• Use the probability score assigned to each
instance in the above step to re-estimate the parameters.
P(C_k \mid X_i) = \frac{ P(C_k)\, P(X_i \mid C_k) }{ P(X_i) }
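The cluster-probability step is just Bayes' rule over the mixture components. A minimal one-dimensional sketch (the two Gaussian components and their parameters are invented for illustration):

import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def cluster_probabilities(x, priors, mus, sigmas):
    """P(C_k | x) = P(C_k) P(x | C_k) / P(x) for every cluster k."""
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, mus, sigmas)]
    evidence = sum(joint)  # P(x): total probability over all clusters
    return [j / evidence for j in joint]

# Two hypothetical clusters with equal priors; x sits between their means.
print(cluster_probabilities(x=1.8, priors=[0.5, 0.5], mus=[0.0, 3.0], sigmas=[1.0, 1.0]))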
Clustering through Weka
Step 1.
Figure 8.9 Weka environment with credit-g.arff data
Step 2.
Figure 8.10 SimpleKMeans algorithm and its parameter selection
Step 3.
Figure 8.11 K-means clustering performance
Step 3. (cont.)
Figure 8.12 Weka result window
Cluster visualisation
Figure 8.13 Cluster visualisation
Individual cluster information
Figure 8.14 Cluster0 instances information
Step 4.
Figure 8.15 Cluster 1 instance information
Kohonen neural network
Figure 8.16 A Kohonen network with two input nodes and nine output nodes
Contains only an input layer and an output layer but no hidden layer.
The number of nodes in the output layer that finally capture instances determines the number of clusters in the data.
Kohonen self-organising maps:
Example 8.3
Inputs: Input 1 = 0.3, Input 2 = 0.6.
Weights: W11 = 0.1, W21 = 0.2 (to Output 1); W12 = 0.4, W22 = 0.5 (to Output 2).
Figure 8.17 Connections between input and output nodes of a neural network
Example 8.3 Cont.
The scoring for any output node k is done using the formula:

d_k = \sqrt{ \sum_i (I_i - W_{ik})^2 }

For Output 1: \sqrt{(0.3 - 0.1)^2 + (0.6 - 0.2)^2} = 0.447
For Output 2: \sqrt{(0.3 - 0.4)^2 + (0.6 - 0.5)^2} = 0.141
Example 8.3 cont.
w_{ij}(\text{new}) = w_{ij}(\text{current}) + \Delta w_{ij}

where \Delta w_{ij} = r\,(x_i - w_{ij}), x_i is the value of input node i, and r is the learning rate (0 < r \le 1).
Example 8.3 cont.
Assuming that the learning rate is 0.3, we update the weights of the winning node (Output 2):

\Delta W_{12} = 0.3\,(0.3 - 0.4) = -0.03, so W_{12}(\text{new}) = 0.4 - 0.03 = 0.37
\Delta W_{22} = 0.3\,(0.6 - 0.5) = 0.03, so W_{22}(\text{new}) = 0.5 + 0.03 = 0.53
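The whole example fits in a few lines of Python; the weight matrix below mirrors Figure 8.17, and the printed values reproduce 0.447, 0.141, 0.37 and 0.53:

import math

inputs = [0.3, 0.6]
# weights[i][j] = weight from input node i to output node j
weights = [[0.1, 0.4],
           [0.2, 0.5]]
r = 0.3  # learning rate

# Score each output node by its Euclidean distance from the input vector.
scores = [math.sqrt(sum((inputs[i] - weights[i][j]) ** 2 for i in range(2)))
          for j in range(2)]
print([round(s, 3) for s in scores])   # [0.447, 0.141] -> Output 2 wins

winner = scores.index(min(scores))
# Move the winner's weights towards the input vector.
for i in range(2):
    weights[i][winner] += r * (inputs[i] - weights[i][winner])
print([round(weights[i][winner], 2) for i in range(2)])  # [0.37, 0.53]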
Cluster validation
t-test and χ²-test
Validity in Test Cases
Strengths and weaknesses
• Unsupervised Learning
• Diverse Data Types
• Easy to Apply
• Similarity Measures
• Model Parameters
• Interpretation
Applications of clustering algorithms
• Biology
• Marketing Research
• Library Science
• City Planning
• Disaster Studies
• World Wide Web
• Social Network Analysis
• Image Segmentation
Recap
• What is clustering?
• K-means: the most popular clustering algorithm
• Applications of clustering techniques
The Estimation Task
• Assess the numeric value of a variable from other related variables.
• Predict the behaviour of one variable from the behaviour of related variables.
• Discuss the reliability of different methods of estimation and perform a comparative study.
What is estimation?
Finding the numeric value of an unknown attribute from observations made on other related attributes. The unknown attribute is called the dependent (or response or output) attribute (or variable) and the known related attributes are called the independent (or explanatory or input) attributes (or variables).
Figure 9.1a Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot
Figure 9.1b
Figure 9.1c
Figure 9.1d
Scatter Plot
[Scatter plot: BHP Share Price ($) on the X-axis against RIO Share Price ($) on the Y-axis]
Figure 9.1e Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot
Correlation coefficient
r = (covariance between the two variables) / (standard deviation of one variable × standard deviation of the other variable)
r = \frac{ \sum (X_i - \bar{X})(Y_i - \bar{Y}) }{ \sqrt{ \sum (X_i - \bar{X})^2 \cdot \sum (Y_i - \bar{Y})^2 } }
Scatter plots of X and Y variables and their correlation coefficients
Figure 9.2 Scatter plots of X and Y variables and their correlation coefficients
The CORREL Excel function
Figure 9.3 Microsoft Excel command for the correlation coefficient
Example 9.2
Date        Rainfall (mm/day)   Streamflow (mm/day)
23-6-1983   0.00                0.10
24-6-1983   1.64                0.07
25-6-1983   20.03               0.24
26-6-1983   9.20                0.33
27-6-1983   75.37               3.03
28-6-1983   50.13               15.20
29-6-1983   9.81                9.66
30-6-1983   1.02                4.01
1-7-1983    0.00                2.05
2-7-1983    0.00                1.32
Example 9.2 cont.
The computations can be done neatly in tabular form as given in the next slide:
(a) For the mean values:

\bar{X} = \frac{\sum X_i}{n} = \frac{167.2}{10} = 16.72, \qquad \bar{Y} = \frac{\sum Y_i}{n} = \frac{36.01}{10} = 3.601
Example 9.2 cont.
Therefore, the correlation coefficient,

r = \frac{495.08}{\sqrt{(5983.89)(226.06)}} = 0.43
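The figure is easy to verify in Python from the raw table (a direct transcription of Example 9.2's data):

import math

rain = [0.00, 1.64, 20.03, 9.20, 75.37, 50.13, 9.81, 1.02, 0.00, 0.00]
flow = [0.10, 0.07, 0.24, 0.33, 3.03, 15.20, 9.66, 4.01, 2.05, 1.32]

n = len(rain)
mx, my = sum(rain) / n, sum(flow) / n   # 16.72 and 3.601
sxy = sum((x - mx) * (y - my) for x, y in zip(rain, flow))  # ~495.08
sxx = sum((x - mx) ** 2 for x in rain)                      # ~5983.89
syy = sum((y - my) ** 2 for y in flow)                      # ~226.06

print(round(sxy / math.sqrt(sxx * syy), 2))  # 0.43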
Example 9.2 cont.
Therefore, the correlation coefficient,

r = \frac{1039.06}{\sqrt{(5673.24)(212.45)}} = 0.95
Linear means all exponents (powers) of x must be one, i.e. an exponent cannot be a fraction or a value greater than one, and there cannot be a product term of variables either.

f(x_1, x_2, x_3, \ldots, x_n) = a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n + c
Linear regression analysis
Fitting a straight line
y = m x + c
Suppose the line passes through two points A and B, where A is (x1,y1) and B is (x2, y2).
Then

\frac{y - y_1}{y_2 - y_1} = \frac{x - x_1}{x_2 - x_1}

which rearranges to

y = \frac{y_2 - y_1}{x_2 - x_1}\, x + \frac{x_1 y_2 - x_2 y_1}{x_1 - x_2}    (Eq. 9.3)
Example 9.3
Problem: The number of public servants claiming compensation for stress has been steadily rising in Australia. The number of successful claims in 1989-90 was 800 while in 1994-95 the figure was 1900. How many claims are expected in the year 2006-2007 if the growth continues steadily? If each claim costs an average of $24,000, what should be the budget allocation of Comcare in year 2006-2007 for stress-related compensation?
Therefore, using equation (9.3) we get:
\frac{Y - 1900}{1900 - 800} = \frac{X - 1995}{1995 - 1990}
Solving, we have Y = 220X – 437,000. If we now let X = 2007, we get the expected number of claims in the year 2006-2007. So the number of claims in the year 2006-2007 is expected to be 220(2007) – 437,000 = 4,540. At $24,000 per claim, Comcare's budget should be $108,960,000.
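A quick check of this arithmetic in Python:

x1, y1 = 1990, 800    # claims in 1989-90
x2, y2 = 1995, 1900   # claims in 1994-95

m = (y2 - y1) / (x2 - x1)      # slope: 220.0 extra claims per year
c = y1 - m * x1                # intercept: -437000.0

claims = m * 2007 + c
print(claims)                  # 4540.0
print(claims * 24_000)         # 108960000.0 dollars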
Example 9.3 cont.
Simple linear regression
Figure 9.6 Schematic representation of the simple linear regression model
Least squares criteria
\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum [Y_i - (b_0 + b_1 X_i)]^2

b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{Y} - b_1 \bar{X}

where

\bar{Y} = \frac{\sum Y_i}{n} (the average of all the Y values), \qquad \bar{X} = \frac{\sum X_i}{n} (the average of all the X values)

S_{xy} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} (the sum of cross-product deviations)

S_{xx} = \sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} (the sum of squared deviations for X)
State    No. of Inst., X   Membership, Y   X²     Y²             XY
NSW      17                5 987           289    3.58442×10⁷    101 779
QLD      11                5 950           121    3.54025×10⁷    65 450
SA       10                3 588           100    1.28737×10⁷    35 880
TAS      3                 1 356           9      1.83873×10⁶    4 068
VIC      41                14 127          1681   1.99572×10⁸    579 207
WA       9                 4 847           81     2.34934×10⁷    43 623
Others   11                3 893           121    1.51554×10⁷    42 823
Total    102               39 748          2402   3.241799×10⁸   872 830
Table 9.2 Unisuper membership by States
Example 9.5
Example 9.5 cont.
\bar{Y} = \frac{\sum Y_i}{n} = \frac{39\,748}{7} = 5\,678, \qquad \bar{X} = \frac{\sum X_i}{n} = \frac{102}{7} = 14.57

S_{xy} = 872\,830 - \frac{(39\,748)(102)}{7} = 293\,645

S_{xx} = 2402 - \frac{(102)^2}{7} = 915.7

b_1 = S_{xy} / S_{xx} = 293\,645 / 915.7 = 320.7

b_0 = \bar{Y} - b_1 \bar{X} = 5\,678 - (320.7)(14.57) = 1005

Therefore, the regression equation is Y = 320.7X + 1005.
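The same fit in Python, straight from the X and Y columns of Table 9.2:

X = [17, 11, 10, 3, 41, 9, 11]                   # institutions per state
Y = [5987, 5950, 3588, 1356, 14127, 4847, 3893]  # members per state

n = len(X)
sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n  # ~293645
sxx = sum(x * x for x in X) - sum(X) ** 2 / n                 # ~915.7

b1 = sxy / sxx                      # ~320.7
b0 = sum(Y) / n - b1 * sum(X) / n   # ~1005
print(round(b1, 1), round(b0))      # 320.7 1006 (the slide's intermediate rounding gives 1005)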
Type 'regression' under Help and then go to the LINEST function. Highlight the 'District office building data', copy it with Ctrl+C and paste it with Ctrl+V into your spreadsheet.
Multiple linear regression with Excel
Multiple regression
Y = \beta_0 + \beta_1 X_1^{a} + \beta_2 X_2^{b} + \beta_3 X_3^{c} + \cdots

where Y is the dependent variable; X_1, X_2, \ldots are independent variables; \beta_0, \beta_1, \ldots are regression coefficients; and a, b, \ldots are exponents (all equal to one in multiple linear regression).
Example 9.6
Period    No. of Private Houses   Average weekly earnings ($)   No. of persons in workforce (millions)   Variable home loan rate (%)
1986-87   83 973                  428                           5.6889                                    15.50
1987-88   100 069                 454                           5.8227                                    13.50
1988-89   128 231                 487                           6.0333                                    17.00
1989-90   96 390                  521                           6.1922                                    16.50
1990-91   87 038                  555                           6.0933                                    13.00
1991-92   100 572                 581                           5.8846                                    10.50
1992-93   113 708                 591                           5.8372                                    9.50
1993-94   123 228                 609                           5.9293                                    8.75
1994-95   111 966                 634                           6.1190                                    10.50
= LINEST (A2:A10,B2:D10,TRUE,TRUE)
Example 9.6 cont.
Figure 9.7 Demonstration of use of LINEST function
Hence, from the printout, the regression equation is the following:
H = 155914.8 + 232.2498 E – 36463.4 W + 3204.0441 I
The Ctrl and Shift keys must be kept depressed while striking the Enter key to get tabular output.
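The same coefficients can be recovered outside Excel with an ordinary least-squares solve. A Python sketch using NumPy (it should reproduce the LINEST output above up to rounding):

import numpy as np

H = [83973, 100069, 128231, 96390, 87038, 100572, 113708, 123228, 111966]
E = [428, 454, 487, 521, 555, 581, 591, 609, 634]                              # earnings
W = [5.6889, 5.8227, 6.0333, 6.1922, 6.0933, 5.8846, 5.8372, 5.9293, 6.1190]  # workforce
I = [15.50, 13.50, 17.00, 16.50, 13.00, 10.50, 9.50, 8.75, 10.50]             # loan rate

# Design matrix with an intercept column, then a least-squares solve.
A = np.column_stack([np.ones(len(H)), E, W, I])
coeffs, *_ = np.linalg.lstsq(A, np.array(H, dtype=float), rcond=None)
print(coeffs)  # intercept, then the coefficients of E, W and I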
Coefficient of determination
If the fit is perfect, the R2 value will be one and if there is no relationship at all, the R2 value will be zero.
R^2 = \frac{\text{Variation explained}}{\text{Total variation}} = 1 - \frac{ \sum (Y_i - \hat{Y}_i)^2 }{ \sum (Y_i - \bar{Y})^2 }
A regression equation cannot model discrete values. We get a better reflection of reality if we replace the actual value by its probability. The ratio of the probabilities of occurrence and non-occurrence takes us close to the actual value.
Logistic regression
Transforming the linear regression model
Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
The logistic regression model
p(y = 1 \mid x) = \frac{ e^{ax + c} }{ 1 + e^{ax + c} }

where e is the base of natural logarithms (often denoted as exp) and ax + c is the right-hand side of the regression equation in vector form.
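The model is a one-line function. A small Python sketch (the coefficients a and c are hypothetical, chosen only to trace the S-curve of Figure 9.8):

import math

def logistic(x, a=1.0, c=0.0):
    """p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))."""
    z = a * x + c
    return math.exp(z) / (1 + math.exp(z))

for x in [-4, -2, 0, 2, 4]:
    print(x, round(logistic(x), 3))
# probabilities climb along the S-curve from near 0, through 0.5 at x = 0, to near 1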
Logistic regression cont.
[Plot of P(y = 1 | X) against X values: the characteristic S-shaped logistic curve rising from 0 to 1]
The logistic regression equation
Figure 9.8 Graphical representation of the logistic regression equation
Regression in Weka
Figure 9.10 Selection of logistic function
Output from logistic regression
Figure 9.12 Output from logistic regression
Visualisation option of the results
Figure 9.13 Visualisation option of the results
Visual impression of data and clusters
Figure 9.14 Visual impression of data and clusters
Particular instance information
Figure 9.15 Information about a particular instance
Strengths and weaknesses
• Regression analysis is a powerful tool suited to linear relationships, but most real-world problems are nonlinear. The output is therefore often not exact, yet still useful.
• Regression techniques assume normality in the distribution of uncertainty and the instances are assumed to be independent of each other. This is not the case with many real problems.
Applications of regression algorithms
• Financial Markets
• Medical Science
• Retail Industry
• Environment
• Social Science