DATA MINING - EVALUATING CLUSTERING ALGORITHMS
MINISTRY OF EDUCATION, YOUTH, SCIENCE AND SPORTS, FRANCE
École Internationale des Sciences du Traitement de l'Information
Course: DATA MINING
Course Work on CLUSTERS
DEPARTMENT: BUSINESS ANALYTICS (M2)
COURSE TUTOR: PROF. MARIA MALEK
Submitted by: UDEH TOCHUKWU LIVINUS
Cergy Pontoise - 2014
1A. EVALUATING CAR ACCESSIBILITY USING THE K-MEANS ALGORITHM
K = 3
Seed = 10
  Within cluster sum of squared errors: 5709.0
  Incorrectly clustered instances: 807.0 (46.7014 %)
  Cluster 0 (unacc)     774 (45 %)
  Cluster 1 (acc)       600 (35 %)
  Cluster 2 (good)      354 (20 %)
Seed = 100
  Within cluster sum of squared errors: 5547.0
  Incorrectly clustered instances: 1020.0 (59.0278 %)
  Cluster 0 (acc)       813 (47 %)
  Cluster 1 (unacc)     555 (32 %)
  Cluster 2 (vgood)     360 (21 %)

K = 4
Seed = 10
  Within cluster sum of squared errors: 5390.0
  Incorrectly clustered instances: 979.0 (56.6551 %)
  Cluster 0 (unacc)     592 (34 %)
  Cluster 1 (acc)       557 (32 %)
  Cluster 2 (good)      327 (19 %)
  Cluster 3 (vgood)     252 (15 %)
Seed = 100
  Within cluster sum of squared errors: 5316.0
  Incorrectly clustered instances: 1093.0 (63.2523 %)
  Cluster 0 (acc)       697 (40 %)
  Cluster 1 (unacc)     496 (29 %)
  Cluster 2 (vgood)     346 (20 %)
  Cluster 3 (good)      189 (11 %)

K = 5
Seed = 10
  Within cluster sum of squared errors: 5106.0
  Incorrectly clustered instances: 1064.0 (61.5741 %)
  Cluster 0 (unacc)     543 (31 %)
  Cluster 1 (acc)       430 (25 %)
  Cluster 2 (good)      302 (17 %)
  Cluster 3 (no class)  227 (13 %)
  Cluster 4 (vgood)     226 (13 %)
Seed = 100
  Within cluster sum of squared errors: 4996.0
  Incorrectly clustered instances: 1162.0 (67.2454 %)
  Cluster 0 (acc)       586 (34 %)
  Cluster 1 (unacc)     424 (25 %)
  Cluster 2 (vgood)     310 (18 %)
  Cluster 3 (no class)  174 (10 %)
  Cluster 4 (good)      234 (14 %)
ANALYSIS OF THE RESULTS:
K-means minimizes the total squared distance from the instances to their cluster centres, but it converges to a local rather than a global minimum, so we tend to get different results when we vary the seed. From the table above, at K = 3 we obtained a smaller within-cluster sum of squared errors with seed = 100 than with seed = 10. However, the table shows an inverse relation with the number of incorrectly clustered instances: as the squared-error total decreases from one seed to the other, the count of incorrectly clustered instances increases, and vice versa. Hence we compare the similarity of the cluster assignments across runs rather than relying on the result of any single seed.
FIGURE 1.0
The figure below illustrates the clusters and instances summarized in the table above. The Y-axis represents the class value, while the X-axis represents the instance number. The colour represents the cluster, so we can inspect each cluster by selecting instances from the menu and compare their similarity before validating our decisions.
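To reproduce the seed-variation experiment outside Weka, a minimal Python sketch along the following lines could be used. The file name car.csv is a hypothetical stand-in for the car evaluation data; scikit-learn's KMeans plays the role of Weka's SimpleKMeans, and the nominal attributes are one-hot encoded so that Euclidean distance applies.

import pandas as pd
from sklearn.cluster import KMeans

# Load the car evaluation data (hypothetical file name) and one-hot encode
# the six nominal attributes; the class column is held out for evaluation.
df = pd.read_csv("car.csv")
X = pd.get_dummies(df.drop(columns="class"))

for k in (3, 4, 5):
    for seed in (10, 100):
        # n_init=1 mimics a single run from one seed, as in the table above
        km = KMeans(n_clusters=k, random_state=seed, n_init=1).fit(X)
        # inertia_ is the within-cluster sum of squared errors (WCSS)
        print(f"k={k} seed={seed} WCSS={km.inertia_:.1f}")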
1B. ESTIMATING CLUSTERS USING EXPECTATION MAXIMIZATION (EM CLUSTERING)

K = 3; Seed = 10
Incorrectly clustered instances: 519.0 (30.0347 %)

               Cluster
Attribute        0         1         2
               (0.51)    (0.29)    (0.2)
=========================================
buying
  vhigh      211.4062  134.0433   89.5505
  high       221.0605  127.5401   86.3994
  med        230.866   120.6598   83.4743
  low        221.0605  127.5401   86.3994
  [total]    884.3931  509.7833  345.8236
maint
  vhigh      220.5171  127.8878   86.595

Clustered Instances
0   1725 (100 %)
1      3 (  0 %)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0    1   <-- assigned to cluster
 1208    2 | unacc
  383    1 | acc
   69    0 | good
   65    0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc

K = 3; Seed = 100
Incorrectly clustered instances: 545.0 (31.5394 %)

               Cluster
Attribute        0         1         2
               (0.48)    (0.32)    (0.2)
=========================================
buying
  vhigh      209.4287  139.8151   85.7562
  high       198.5154  146.6941   89.7905
  med        209.4287  139.8151   85.7562
  low        220.8453  132.7588   81.3959
  [total]    838.2182  559.083   342.6988
maint
  vhigh      204.4103  138.1582   92.4315

Clustered Instances
0   1699 ( 98 %)
1     29 (  2 %)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0    1   <-- assigned to cluster
 1182   28 | unacc
  383    1 | acc
   69    0 | good
   65    0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc
K = 4; Seed = 10
Incorrectly clustered instances: 556.0 (32.1759 %)

               Cluster
Attribute        0         1         2         3
               (0.41)    (0.26)    (0.18)    (0.16)
====================================================
buying
  vhigh      171.8456  115.3853   81.9122   66.8569
  high       175.9113  111.256    79.4969   69.3358
  med        182.9562  108.0125   77.5667   67.4646
  low        173.1743  110.3401   79.4592   73.0264
  [total]    703.8874  444.9938  318.435   276.6838

Clustered Instances
0   1616 ( 94 %)
1    112 (  6 %)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0    1   <-- assigned to cluster
 1140   70 | unacc
  352   32 | acc
   59   10 | good
   65    0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc

K = 4; Seed = 100
Incorrectly clustered instances: 528.0 (30.5556 %)

               Cluster
Attribute        0         1         2         3
               (0.42)    (0.18)    (0.26)    (0.14)
====================================================
buying
  vhigh      179.4064   76.6192  118.9071   61.0673
  high       183.026    77.5366  112.1751   63.2622
  med        189.4165   80.2604  106.0872   60.2359
  low        183.1553   75.7134  110.2891   66.8422
  [total]    735.0042  310.1296  447.4586  251.4076

Clustered Instances
0   1718 ( 99 %)
2     10 (  1 %)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0    2   <-- assigned to cluster
 1200   10 | unacc
  384    0 | acc
   69    0 | good
   65    0 | vgood
Cluster 0 <-- unacc
Cluster 2 <-- No class
ANALYSIS OF THIS ALGORITHM:
This algorithm takes a probabilistic approach to clustering: it uses expectation maximization to assign instances to clusters. In the tables above, each attribute value carries an expected count; dividing a value by the [total] row of its cluster gives the probability of that attribute value within the cluster, and from these we can compute the probability of each cluster. The overall quality measure is the log likelihood. For nominal attributes the model stores the probability of each value, while for numeric attributes it stores the mean and standard deviation. EM is likewise an unsupervised learning algorithm. Compared with K-means, it produced fewer incorrectly clustered instances.
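As an illustration of this probabilistic approach, here is a hedged sketch using scikit-learn's GaussianMixture. Weka's EM models nominal attributes with discrete distributions, so a Gaussian mixture over the one-hot encoding is only an approximation; car.csv is the same hypothetical file as before.

import pandas as pd
from sklearn.mixture import GaussianMixture

df = pd.read_csv("car.csv")                    # hypothetical file name
X = pd.get_dummies(df.drop(columns="class"))   # one-hot encoded attributes

for seed in (10, 100):
    # diagonal covariances keep the model close to a naive per-attribute EM
    em = GaussianMixture(n_components=3, covariance_type="diag",
                         random_state=seed).fit(X)
    labels = em.predict(X)                     # hard cluster assignments
    # score() is the average log likelihood per sample, the overall
    # quality measure discussed above
    print(f"seed={seed} mean log likelihood={em.score(X):.5f}")
    print(pd.Series(labels).value_counts())    # instances per cluster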
2. CLASSIFICATION
A. K-NEAREST NEIGHBOUR (KNN)
K = 1
Use training set:            Correctly Classified Instances   1728 (100 %)
                             Incorrectly Classified Instances    0 (0 %)
Cross validation (10 fold):  Correctly Classified Instances   1616 (93.5185 %)
                             Incorrectly Classified Instances  112 (6.4815 %)

                Use training set                Cross validation (10 fold)
Class           Precision  Recall (TPR)  FPR    Precision  Recall (TPR)  FPR
unacc           1          1             0      0.973      0.998         0.066
acc             1          1             0      0.818      0.911         0.058
good            1          1             0      1          0.188         0
vgood           1          1             0      1          0.708         0
Weighted Avg.   1          1             0      0.94       0.935         0.059

K = 5
Use training set:            Correctly Classified Instances   1664 (96.2963 %)
                             Incorrectly Classified Instances   64 (3.7037 %)
Cross validation (10 fold):  Correctly Classified Instances   1616 (93.5185 %)
                             Incorrectly Classified Instances  112 (6.4815 %)

                Use training set                Cross validation (10 fold)
Class           Precision  Recall (TPR)  FPR    Precision  Recall (TPR)  FPR
unacc           0.988      1             0.029  0.973      0.998         0.066
acc             0.883      0.961         0.036  0.818      0.911         0.058
good            1          0.435         0      1          0.188         0
vgood           1          0.846         0      1          0.708         0
Weighted Avg.   0.965      0.963         0.028  0.94       0.935         0.059

K = 20
Use training set:            Correctly Classified Instances   1337 (77.3727 %)
                             Incorrectly Classified Instances  391 (22.6273 %)
Cross validation (10 fold):  Correctly Classified Instances   1327 (76.794 %)
                             Incorrectly Classified Instances  401 (23.206 %)

                Use training set                Cross validation (10 fold)
Class           Precision  Recall (TPR)  FPR    Precision  Recall (TPR)  FPR
unacc           0.813      1             0.539  0.802      1             0.575
acc             0.531      0.331         0.083  0.528      0.299         0.077
good            0          0             0      0          0             0
vgood           0          0             0      1          0.031         0
Weighted Avg.   0.687      0.774         0.396  0.717      0.768         0.42
REMARK:
From the table above, with K = 1 on the training set every instance was correctly classified, so precision and recall were both 1 for every class: each training instance is its own nearest neighbour, so nothing can be misclassified. We can see this in the figure below:
IBK 1.0
When we choose K = 5 or K = 20 instead, approximately 4 % and 23 % of the instances respectively were misclassified. This reflects noise in the dataset: with K = 5, the five nearest neighbours vote and the majority class among them is used to classify the unknown instance. In the figure below the misclassified points are represented by coloured rectangles.
IBK 1.1
Applying the cross-validation procedure divides the instances into 10 equal-sized folds: in each round 90 % of the instances are used for training and the remaining 10 % for testing, and at the end the performance of the 10 resulting classifiers is averaged. Similar results were obtained for K = 1 and K = 5: approximately 6 % of the instances were misclassified, compared with evaluation on the whole training set. The figure below visualizes the cross-validation evaluation of the training data.
IBK 1.2
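The same evaluation can be sketched in scikit-learn, with KNeighborsClassifier playing the role of Weka's IBk (the hypothetical car.csv file as before):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("car.csv")                    # hypothetical file name
X = pd.get_dummies(df.drop(columns="class"))
y = df["class"]

for k in (1, 5, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    # accuracy on the full training set
    train_acc = knn.fit(X, y).score(X, y)
    # 10-fold cross-validation: each fold trains on ~90 % and tests on ~10 %
    cv_acc = cross_val_score(knn, X, y, cv=10).mean()
    print(f"k={k} training accuracy={train_acc:.4f} CV accuracy={cv_acc:.4f}")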
In summary, this method builds no explicit model; the stored instances themselves are used to make predictions. As the value of K increases, the percentage of misclassified points also increases; a larger K improves accuracy only when the instances are noisy. Increasing K far enough drives the classifier toward the baseline majority-class prediction, and for this dataset that baseline corresponds to an error of approximately 30 %. K-nearest neighbour is a good method, although it is very slow: it has to scan the entire set of training instances before it can make each prediction.
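To make the "no model, scan everything" point concrete, here is a toy brute-force 1-NN sketch (not Weka's IBk implementation): every prediction visits every stored training instance, which is exactly why the method is slow.

import numpy as np

def predict_1nn(X_train, y_train, x):
    """Classify x by the label of its single nearest training instance."""
    # one distance computation per stored instance: the whole training set
    # is scanned for every prediction, so there is no model to speak of
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]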
B. DECISION TREE ALGORITHMS (ID3 AND J48)

ID3 (USE TRAINING SET)
Correctly Classified Instances     1728 (100 %)
Incorrectly Classified Instances      0 (0 %)
=== Confusion Matrix ===
    a    b    c    d   <-- classified as
 1210    0    0    0 |  a = unacc
    0  384    0    0 |  b = acc
    0    0   69    0 |  c = good
    0    0    0   65 |  d = vgood

ID3 (CROSS VALIDATION, 10 FOLD)
Correctly Classified Instances     1544 (89.3519 %)
Incorrectly Classified Instances     61 (3.5301 %)
UnClassified Instances              123 (7.1181 %)
=== Confusion Matrix ===
    a    b    c    d   <-- classified as
 1171   28    3    0 |  a = unacc
    7  292    9    4 |  b = acc
    0    5   37    5 |  c = good
    0    0    0   44 |  d = vgood
J48 (USE TRAINING SET)
Correctly Classified Instances     1664 (96.2963 %)
Incorrectly Classified Instances     64 (3.7037 %)
Number of Leaves : 131
Size of the tree : 182
=== Confusion Matrix ===
    a    b    c    d   <-- classified as
 1182   25    3    0 |  a = unacc
   10  370    2    2 |  b = acc
    0    9   57    3 |  c = good
    0    4    6   55 |  d = vgood

J48 (CROSS VALIDATION, 10 FOLD)
Correctly Classified Instances     1596 (92.3611 %)
Incorrectly Classified Instances    132 (7.6389 %)
Number of Leaves : 131
Size of the tree : 182
=== Confusion Matrix ===
    a    b    c    d   <-- classified as
 1164   43    3    0 |  a = unacc
   33  333   11    7 |  b = acc
    0   17   42   10 |  c = good
    0    3    5   57 |  d = vgood
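scikit-learn has no J48, but a CART tree grown with the entropy criterion is a rough analogue for sketching the same comparison; it will not reproduce Weka's leaf and tree-size counts exactly (hypothetical car.csv as before).

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("car.csv")                    # hypothetical file name
X = pd.get_dummies(df.drop(columns="class"))
y = df["class"]

# entropy-based splits, loosely analogous to J48's information-gain criterion
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
print("training accuracy:", tree.fit(X, y).score(X, y))
print("10-fold CV accuracy:", cross_val_score(tree, X, y, cv=10).mean())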
ONE R ALGORITHM
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 1210 70.0231 %
Incorrectly Classified Instances 518 29.9769 %
Kappa statistic 0
Mean absolute error 0.1499
Root mean squared error 0.3871
Relative absolute error 65.4574 %
Root relative squared error 114.5023 %
Total Number of Instances 1728
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.7 1 0.824 0.5 unacc
0 0 0 0 0 0.5 acc
0 0 0 0 0 0.5 good
0 0 0 0 0 0.5 vgood
Weighted Avg. 0.7 0.7 0.49 0.7 0.577 0.5
=== Confusion Matrix ===
a b c d <-- classified as
1210 0 0 0 | a = unacc
384 0 0 0 | b = acc
69 0 0 0 | c = good
65 0 0 0 | d = vgood
PRISM
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 1728 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1728
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 unacc
1 0 1 1 1 1 acc
1 0 1 1 1 1 good
1 0 1 1 1 1 vgood
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b c d <-- classified as
1210 0 0 0 | a = unacc
0 384 0 0 | b = acc
0 0 69 0 | c = good
0 0 0 65 | d = vgood
CONCLUSION
The best algorithms in terms of precision are Prism and ID3: on the training set both classified every instance correctly, with no misclassifications. J48 is also a good algorithm, but it is less suited to large datasets, as we saw about 4 % misclassified instances here. All of these are examples of supervised learning and can be used to support various decisions. The decision-tree algorithms use entropy as the basis for choosing their splits.
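As a worked example of the entropy measure just mentioned, the class distribution of this dataset (1210 unacc, 384 acc, 69 good, 65 vgood, from the confusion matrices above) gives:

import math

counts = [1210, 384, 69, 65]   # unacc, acc, good, vgood
total = sum(counts)            # 1728 instances
entropy = -sum(c / total * math.log2(c / total) for c in counts)
print(f"class entropy = {entropy:.3f} bits")   # approximately 1.21 bits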
3. ASSOCIATION RULES
A. Apriori
=======
Minimum support: 0.1 (173 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Size of set of large itemsets L(2): 52
Size of set of large itemsets L(3): 11
Best rules found:
1. persons=2 576 ==> class=unacc 576 conf:(1)
2. safety=low 576 ==> class=unacc 576 conf:(1)
3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)
4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)
5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)
6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)
7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1)
8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1)
9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1)
10. persons=more safety=low 192 ==> class=unacc 192 conf:(1)
B. Apriori
=======
Minimum support: 0.1 (173 instances)
Minimum metric <confidence>: 0.5
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Large Itemsets L(1):
buying=vhigh 432
buying=high 432
buying=med 432
buying=low 432
maint=vhigh 432
maint=high 432
maint=med 432
maint=low 432
doors=2 432
doors=3 432
doors=4 432
doors=5more 432
persons=2 576
persons=4 576
persons=more 576
lug_boot=small 576
lug_boot=med 576
lug_boot=big 576
safety=low 576
safety=med 576
safety=high 576
class=unacc 1210
class=acc 384
Size of set of large itemsets L(2): 52
Large Itemsets L(2):
buying=vhigh class=unacc 360
buying=high class=unacc 324
buying=med class=unacc 268
buying=low class=unacc 258
maint=vhigh class=unacc 360
maint=high class=unacc 314
maint=med class=unacc 268
maint=low class=unacc 268
doors=2 class=unacc 326
doors=3 class=unacc 300
doors=4 class=unacc 292
doors=5more class=unacc 292
persons=2 lug_boot=small 192
persons=2 lug_boot=med 192
persons=2 lug_boot=big 192
persons=2 safety=low 192
persons=2 safety=med 192
persons=2 safety=high 192
persons=2 class=unacc 576
persons=4 lug_boot=small 192
persons=4 lug_boot=med 192
persons=4 lug_boot=big 192
persons=4 safety=low 192
persons=4 safety=med 192
persons=4 safety=high 192
persons=4 class=unacc 312
persons=4 class=acc 198
persons=more lug_boot=small 192
persons=more lug_boot=med 192
persons=more lug_boot=big 192
persons=more safety=low 192
persons=more safety=med 192
persons=more safety=high 192
persons=more class=unacc 322
persons=more class=acc 186
lug_boot=small safety=low 192
lug_boot=small safety=med 192
lug_boot=small safety=high 192
lug_boot=small class=unacc 450
lug_boot=med safety=low 192
lug_boot=med safety=med 192
lug_boot=med safety=high 192
lug_boot=med class=unacc 392
lug_boot=big safety=low 192
lug_boot=big safety=med 192
lug_boot=big safety=high 192
lug_boot=big class=unacc 368
safety=low class=unacc 576
safety=med class=unacc 357
safety=med class=acc 180
safety=high class=unacc 277
safety=high class=acc 204
Size of set of large itemsets L(3): 11
Large Itemsets L(3):
persons=2 lug_boot=small class=unacc 192
persons=2 lug_boot=med class=unacc 192
persons=2 lug_boot=big class=unacc 192
persons=2 safety=low class=unacc 192
persons=2 safety=med class=unacc 192
persons=2 safety=high class=unacc 192
persons=4 safety=low class=unacc 192
persons=more safety=low class=unacc 192
lug_boot=small safety=low class=unacc 192
lug_boot=med safety=low class=unacc 192
lug_boot=big safety=low class=unacc 192
Best rules found:
1. persons=2 576 ==> class=unacc 576 conf:(1)
2. safety=low 576 ==> class=unacc 576 conf:(1)
3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)
4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)
5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)
6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)
7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1)
8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1)
9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1)
10. persons=more safety=low 192 ==> class=unacc 192 conf:(1)
11. lug_boot=small safety=low 192 ==> class=unacc 192 conf:(1)
12. lug_boot=med safety=low 192 ==> class=unacc 192 conf:(1)
13. lug_boot=big safety=low 192 ==> class=unacc 192 conf:(1)
14. buying=vhigh 432 ==> class=unacc 360 conf:(0.83)
15. maint=vhigh 432 ==> class=unacc 360 conf:(0.83)
16. lug_boot=small 576 ==> class=unacc 450 conf:(0.78)
17. doors=2 432 ==> class=unacc 326 conf:(0.75)
18. buying=high 432 ==> class=unacc 324 conf:(0.75)
19. maint=high 432 ==> class=unacc 314 conf:(0.73)
20. doors=3 432 ==> class=unacc 300 conf:(0.69)
21. safety=high class=unacc 277 ==> persons=2 192 conf:(0.69)
22. lug_boot=med 576 ==> class=unacc 392 conf:(0.68)
23. doors=4 432 ==> class=unacc 292 conf:(0.68)
24. doors=5more 432 ==> class=unacc 292 conf:(0.68)
25. lug_boot=big 576 ==> class=unacc 368 conf:(0.64)
26. buying=med 432 ==> class=unacc 268 conf:(0.62)
27. maint=med 432 ==> class=unacc 268 conf:(0.62)
28. maint=low 432 ==> class=unacc 268 conf:(0.62)
29. safety=med 576 ==> class=unacc 357 conf:(0.62)
30. persons=4 class=unacc 312 ==> safety=low 192 conf:(0.62)
31. buying=low 432 ==> class=unacc 258 conf:(0.6)
32. persons=more class=unacc 322 ==> safety=low 192 conf:(0.6)
33. persons=more 576 ==> class=unacc 322 conf:(0.56)
34. persons=4 576 ==> class=unacc 312 conf:(0.54)
35. safety=med class=unacc 357 ==> persons=2 192 conf:(0.54)
36. class=acc 384 ==> safety=high 204 conf:(0.53)
37. lug_boot=big class=unacc 368 ==> persons=2 192 conf:(0.52)
38. lug_boot=big class=unacc 368 ==> safety=low 192 conf:(0.52)
39. class=acc 384 ==> persons=4 198 conf:(0.52)
From the comparison table, the best algorithm depends on the type of problem the researcher is given. If a dataset comes with few or minimal conditions, association rules can be applied to generate the various possible rules or outcomes; the outcomes depend on the confidence threshold and on the number of rules you want to generate. Association rule mining is therefore well suited to unsupervised learning.
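As a sketch, the Apriori runs above could be reproduced in Python with the mlxtend library; one-hot columns stand in for Weka's attribute=value items (hypothetical car.csv as before).

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("car.csv")                 # hypothetical file name
onehot = pd.get_dummies(df).astype(bool)    # attribute=value items, incl. class

# minimum support 0.1 and minimum confidence 0.9, as in run A above
frequent = apriori(onehot, min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules.sort_values("confidence", ascending=False)
           [["antecedents", "consequents", "support", "confidence"]].head(10))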
Conversely, Prism is good when you are given a set of conditions. It performs best here because it minimizes errors and returns the best possible alternatives. However, it is limited when it comes to complex decision-making processes.
J48
This algorithm builds a decision tree using a divide-and-conquer strategy. It is not the ideal method for complex decision-making processes, since many errors may be generated in the classification process. It is likewise an example of a supervised learning (classification) algorithm.