

Page 1:

Conventional Data Mining Techniques II

A B M Shawkat Ali


PowerPoint permissions

Cengage Learning Australia hereby permits the usage and posting of our copyright-controlled PowerPoint slide content for all courses wherein the associated text has been adopted. PowerPoint slides may be placed on course management systems that operate under a controlled environment (access restricted to enrolled students, instructors and content administrators). Cengage Learning Australia does not require a copyright clearance form for the usage of PowerPoint slides as outlined above. Copyright © 2007 Cengage Learning Australia Pty Limited

Computers are useless. They can only give you answers. –Pablo Picasso


Page 2:

My Request

“A good listener is not only popular everywhere, but after a while he gets to know something”

- Wilson Mizner

Page 3:

Association Rule Mining


Page 4:

Objectives

On completion of this lecture you should know:

• Features of association rule mining.
• Apriori: the most popular association rule mining algorithm.
• Association rule evaluation.
• Association rule mining using WEKA.
• Strengths and weaknesses of association rule mining.
• Applications of association rule mining.

Page 5:

Association rules

• Affinity Analysis
• Market Basket Analysis: Which products go together in a basket?
  – Uses: determine marketing strategy, plan promotions, shelf layout.
• Looks like production rules, but more than one attribute may appear in the consequent.
  – IF customers purchase milk THEN they purchase bread AND sugar

Page 6:

Transaction data

Transaction ID   Itemset or Basket
01               {‘webcam’, ‘laptop’, ‘printer’}
02               {‘laptop’, ‘printer’, ‘scanner’}
03               {‘desktop’, ‘printer’, ‘scanner’}
04               {‘desktop’, ‘printer’, ‘webcam’}

Table 7.1 Transaction data
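The itemset supports used throughout this chapter can be computed directly from Table 7.1. A minimal sketch (the dictionary layout and function name are my own):

```python
# Transactions from Table 7.1, keyed by transaction ID.
transactions = {
    "01": {"webcam", "laptop", "printer"},
    "02": {"laptop", "printer", "scanner"},
    "03": {"desktop", "printer", "scanner"},
    "04": {"desktop", "printer", "webcam"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions.values() if itemset <= basket)
    return hits / len(transactions)

print(support({"printer"}, transactions))            # 1.0  (100%)
print(support({"webcam"}, transactions))             # 0.5  (50%)
print(support({"webcam", "printer"}, transactions))  # 0.5  (50%)
```

The printed values agree with the itemset support table later in the chapter.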

Page 7:

Concepts of association rules

Support:

• The minimum percentage of instances in the database that contain all items listed in a given association rule.

Page 8:

Example

• 5,000 transactions contain milk and bread in a set of 50,000

• Support => 5,000 / 50,000 = 10%

Page 9:

Concepts of association rules

Confidence:

• Given a rule of the form “If A then B”, confidence is the conditional probability that B is true when A is known to be true.

Page 10:

Example

• IF customers purchase milk THEN they also purchase bread:
  – In a set of 50,000, there are 10,000 transactions that contain milk, and 5,000 of these also contain bread.
  – Confidence => 5,000 / 10,000 = 50%
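Putting the two worked examples together, the arithmetic is (numbers from the slides; variable names are mine):

```python
total = 50_000          # all transactions in the database
milk = 10_000           # transactions containing milk
milk_and_bread = 5_000  # transactions containing both milk and bread

support = milk_and_bread / total    # 0.10 -> 10%
confidence = milk_and_bread / milk  # 0.50 -> 50%

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```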

Page 11:

Parameters of ARM

1. To find all items that appear frequently in transactions. The level of frequency of appearance is determined by a pre-specified minimum support count.

Any item or set of items that occurs less frequently than this minimum support level is not included for analysis.

Page 12:

2. To find strong associations among the frequent items. The strength of the association is quantified by the confidence. Any association below a pre-specified level of confidence is not used to generate rules.

Page 13:

Relevance of ARM

• On Thursdays, grocery store consumers often purchase diapers and beer together.

• Customers who buy a new car are very likely to purchase an extended vehicle warranty.

• When a new hardware store opens, one of the most commonly sold items is toilet fittings.

Page 14:

Functions of ARM

• Finding the set of items that has significant impact on the business.

• Collating information from numerous transactions on these items from many disparate sources.

• Generating rules on significant items from counts in transactions.

Page 15:

Single-dimensional association rules

Transaction ID   ‘webcam’   ‘laptop’   ‘printer’   ‘scanner’   ‘desktop’
01               1          1          1           0           0
02               0          1          1           1           0
03               0          0          1           1           1
04               1          0          1           0           1

Table 7.2 Boolean form of the transaction data

(cont.)

Page 16:
Page 17:

Multidimensional association rules

Page 18:

General considerations

• We are interested in association rules that show a lift in product sales, where the lift is the result of the product’s association with one or more other products.

• We are also interested in association rules that show a lower than expected confidence for a particular association.

Page 19:

Itemset                                                 Support in %
‘Webcam’                                                50%
‘Laptop’                                                50%
‘Printer’                                               100%
‘Scanner’                                               50%
‘Desktop’                                               50%
{‘webcam’, ‘laptop’}                                    25%
{‘webcam’, ‘printer’}                                   50%
{‘webcam’, ‘scanner’}                                   0%
{‘webcam’, ‘desktop’}                                   25%
{‘laptop’, ‘printer’}                                   50%
{‘laptop’, ‘scanner’}                                   25%
{‘laptop’, ‘desktop’}                                   0%
{‘printer’, ‘scanner’}                                  50%
{‘printer’, ‘desktop’}                                  50%
{‘scanner’, ‘desktop’}                                  25%
{‘webcam’, ‘laptop’, ‘printer’}                         25%
{‘webcam’, ‘laptop’, ‘scanner’}                         0%
{‘webcam’, ‘laptop’, ‘desktop’}                         0%
{‘laptop’, ‘printer’, ‘scanner’}                        25%
{‘laptop’, ‘printer’, ‘desktop’}                        0%
{‘printer’, ‘scanner’, ‘desktop’}                       25%
{‘webcam’, ‘laptop’, ‘printer’, ‘scanner’, ‘desktop’}   0%

Page 20:

Enumeration tree

Level 1: W, L, P, S, D
Level 2: WL, WP, WS, WD, LP, LS, LD, PS, PD, SD
Level 3: WLP, WLS, WLD, WPS, WPD, WSD, LPS, LPD, LSD, PSD
Level 4: WLPS, WLPD, WLSD, WPSD, LPSD
Level 5: WLPSD

Figure 7.1 Enumeration tree of the transaction items of Table 7.1. In the left nodes, branches reduce by one for each downward progression, starting with 5 branches and ending with 1 branch.

Page 21:

Association models

nCk = the number of combinations of n things taken k at a time.
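For n distinct items there are nCk candidate itemsets of size k, and 2^n − 1 non-empty itemsets in total. A quick check for the five items of Table 7.1, using Python's `math.comb`, reproduces the level sizes of the enumeration tree:

```python
from math import comb

n = 5  # webcam, laptop, printer, scanner, desktop
level_sizes = [comb(n, k) for k in range(1, n + 1)]
print(level_sizes)       # [5, 10, 10, 5, 1]
print(sum(level_sizes))  # 31 == 2**n - 1
```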

Page 22:

Two other parameters

• Improvement (IMP) = Support(Antecedent & Consequent) / ( Support(Antecedent) × Support(Consequent) )

• Share (SH): SH(Xi, G) = LMV(Xi, G) / TMV

where LMV = local measure value and TMV = total measure value.
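As a quick illustration of IMP, the numbers below reuse the earlier milk-and-bread example, except Support(bread), which is an assumed figure for this sketch:

```python
def improvement(sup_both, sup_antecedent, sup_consequent):
    """IMP = Support(A & B) / (Support(A) * Support(B)).
    IMP > 1 means the rule does better than chance."""
    return sup_both / (sup_antecedent * sup_consequent)

# Support(milk) = 0.20 and Support(milk & bread) = 0.10 come from the
# worked example; Support(bread) = 0.25 is an assumed value.
print(improvement(0.10, 0.20, 0.25))  # 2.0
```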

Page 23:

IMP and SH measure

Transaction ID   ‘Yogurt’ (A)   ‘Cheese’ (B)   ‘Rice’ (C)   ‘Corn’ (D)
T1               2              0              5            10
T2               0              3              0            5
T3               5              2              20           0
T4               3              10             0            12
T5               0              0              10           13

Table 7.4 Market transaction data

(cont.)

Page 24:

Itemset   ‘Yogurt’ (A)   ‘Cheese’ (B)   ‘Rice’ (C)   ‘Corn’ (D)   Itemset
          LMV / SH       LMV / SH       LMV / SH     LMV / SH     LMV / SH
A         10 / 0.10      -              -            -            10 / 0.10
B         -              15 / 0.15      -            -            15 / 0.15
C         -              -              35 / 0.35    -            35 / 0.35
D         -              -              -            40 / 0.40    40 / 0.40
AB        8 / 0.08       12 / 0.12      -            -            20 / 0.20
AC        7 / 0.07       -              25 / 0.25    -            32 / 0.32
AD        5 / 0.05       -              -            22 / 0.22    27 / 0.27
BC        -              2 / 0.02       20 / 0.20    -            22 / 0.22
BD        -              13 / 0.13      -            17 / 0.17    30 / 0.30
CD        -              -              15 / 0.15    23 / 0.23    38 / 0.38
ABC       5 / 0.05       2 / 0.02       20 / 0.20    -            27 / 0.27
ABD       3 / 0.03       10 / 0.10      -            12 / 0.12    25 / 0.25
ACD       2 / 0.02       -              5 / 0.05     10 / 0.10    17 / 0.17
BCD       -              -              -            -            0 / 0.00
ABCD      -              -              -            -            0 / 0.00

Table 7.5 Share measurement

Page 25:

Taxonomies

• Low-support products are lumped into bigger categories and high-support products are broken up into subgroups.

• For example, different kinds of potato chips can be lumped with other munchies into snacks, while ice cream can be broken down into different flavours.

Page 26:

Large Datasets

• The number of combinations that can be generated from the transactions in an ordinary supermarket can run into the billions or trillions. The amount of computation required for association rule mining can therefore stretch any computer.

Page 27:

APRIORI algorithm

1. All singleton itemsets are candidates in the first pass. Any item that has a support value of less than a specified minimum is eliminated.

2. Selected singleton itemsets are combined to form two-member candidate itemsets. Again, only the candidates above the pre-specified support value are retained.

(cont.)

Page 28:

3. The next pass creates three-member candidate itemsets and the process is repeated. The process stops only when all large itemsets are accounted for.

4. Association Rules for the largest itemsets are created first and then rules for the subsets are created recursively.
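The candidate-generation passes above (minus the final rule-generation step) can be sketched compactly; the toy database below is the one used in Figure 7.2, and a minimum support count of 2 is assumed:

```python
# Database D from Figure 7.2, items coded as integers.
D = [{2, 3}, {1, 3, 5}, {1, 2, 4}, {2, 3}]
MIN_SUPPORT = 2  # minimum support count

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in D if itemset <= t)

# Pass 1: singleton candidates -> large 1-itemsets.
items = sorted(set().union(*D))
large = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUPPORT]
all_large = list(large)
k = 2
while large:
    # Combine surviving itemsets into k-member candidates, then prune.
    candidates = {a | b for a in large for b in large if len(a | b) == k}
    large = [c for c in candidates if count(c) >= MIN_SUPPORT]
    all_large.extend(large)
    k += 1

for s in all_large:
    print(sorted(s), count(s))  # {1}, {2}, {3} and {2 3}, as in Figure 7.2
```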

Page 29:

Database D:

T.ID   Items
01     2 3
02     1 3 5
03     1 2 4
04     2 3

Scan D:

Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       1

Select:

Large itemset   Support
{1}             2
{2}             3
{3}             3

Create:

Candidate 2-itemsets: {1 2}, {1 3}, {2 3}

Scan D:

Itemset   Support
{1 2}     1
{1 3}     1
{2 3}     2

Select:

Large itemset   Support
{2 3}           2

Figure 7.2 Graphical demonstration of the working of the Apriori algorithm

Page 30:

APRIORI in Weka

Figure 7.3 Weka environment with market-basket.arff data file

Page 31:

Step 2

Figure 7.4 Spend98 attribute information visualisation.

Page 32:

Step 3

Figure 7.5 Target attributes selection through Weka

Page 33:

Step 4

Figure 7.6 Discretisation filter selection

Page 34:

Step 5

Figure 7.7 Parameter selections for discretisation.

Page 35:

Step 6

Figure 7.8 Discretisation activation

Page 36:

Discretised data visualisation

Figure 7.9 Discretised data visualisation

Page 37:

Step 7

Figure 7.10 Apriori algorithm selection from Weka for ARM

Page 38:

Step 8

Figure 7.11 Apriori output

Page 39:

Associator output

1. ‘Dairy’='(-inf-1088.666667]' ‘Deli’='(-inf-1169.666667]' 847 ==> ‘Bakery’='(-inf-1316.666667]' 833 conf:(0.98)

Page 40:

Strengths and weaknesses

• Easy Interpretation
• Easy Start
• Flexible Data Formats
• Simplicity
• Exponential Growth in Computations
• Lumping
• Rule Selection

Page 41:

Applications of ARM

• Store Layout Changes
• Cross/Up Selling
• Disaster Weather Forecasting
• Remote Sensing
• Gene Expression Profiling

Page 42:

Recap

• What is association rule mining?
• Apriori: the most popular association rule mining algorithm.
• Applications of association rule mining.

Page 43:

The Clustering Task


Page 44:

Objectives

On completion of this lecture you should know:

• Unsupervised clustering technique
• Measures for clustering performance
• Clustering algorithms
• Clustering task demonstration using WEKA
• Applications, strengths and weaknesses of the algorithms

Page 45:

Clustering: Unsupervised learning

• Clustering is a very common technique that appears in many different settings (not necessarily in a data mining context):
  – Grouping “similar products” together to improve the efficiency of a production line
  – Packing “similar items” into a basket
  – Grouping “similar customers” together
  – Grouping “similar stocks” together

Page 46:

A simple clustering example

Sl. No.   Subject Code   Marks
1         COIT21002      85
2         COIS11021      78
3         COIS32111      75
4         COIT43210      83

Table 8.1 A simple unsupervised problem

Page 47:

Cluster representation

Figure 8.1 Basic clustering for the data of Table 8.1. The X-axis is the serial number and the Y-axis is the marks.

Page 48:

How many clusters can you form?

A A A A   K K K K   Q Q Q Q   J J J J

Figure 8.2 Simple playing card data

Page 49:

Distance measure

• The similarity is usually captured by a distance measure.

• The original proposed measure of distance is the Euclidean distance:

d(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )

where X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n).

Page 50:

Figure 8.3 Euclidean distance D between two points A and B

Page 51:

Other distance measures

• City-block (Manhattan) distance: Σ_i |x_i − y_i|

• Chebychev distance: max_i |x_i − y_i|

• Power distance: ( Σ_i |x_i − y_i|^p )^(1/r)

This is the Minkowski distance when p = r.
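The distance measures above can be written directly as functions (a sketch; inputs are assumed to be equal-length numeric sequences):

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):  # city-block distance
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p, r):
    # Reduces to the Minkowski distance when p == r.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)

x, y = (1.0, 2.0), (4.0, 6.0)
print(euclidean(x, y))             # 5.0
print(manhattan(x, y))             # 7.0
print(chebychev(x, y))             # 4.0
print(power_distance(x, y, 2, 2))  # 5.0, same as Euclidean
```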

Page 52:

Distance measure for categorical data

• Percent disagreement: (number of x_i ≠ y_i) / n

Page 53:

Types of clustering

• Hierarchical Clustering– Agglomerative– Divisive

Page 54:

Agglomerative clustering

1. Place each instance into a separate partition.

2. Until all instances are part of a single cluster:

a. Determine the two most similar clusters.

b. Merge the clusters chosen into a single cluster.

3. Choose a clustering formed by one of the step 2 iterations as a final result.
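The three steps above can be sketched in pure Python, merging on the Euclidean distance between cluster centroids (the points and the "most similar" criterion are illustrative; other linkage criteria are common):

```python
def centroid(cluster):
    """Mean point of a cluster of equal-length tuples."""
    return tuple(sum(p[d] for p in cluster) / len(cluster)
                 for d in range(len(cluster[0])))

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Step 1: place each instance into a separate partition (points are made up).
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 4.5), (9.0, 9.0)]
clusters = [[p] for p in points]
merges = []  # snapshot after each step-2 iteration

# Step 2: until all instances are in one cluster, merge the two most similar.
while len(clusters) > 1:
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: dist(centroid(clusters[ij[0]]),
                                          centroid(clusters[ij[1]])))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    merges.append([sorted(c) for c in clusters])

# Step 3: choose one intermediate clustering as the result, e.g. two clusters.
print(merges[-2])
```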

Page 55:

Dendrogram

Figure: a dendrogram over instances 1, 2, 3, …, 28, 29, 30. The agglomerative direction runs bottom-up (merging), while the divisive direction runs top-down (splitting).

Page 56:

Example 8.1

Figure 8.5 Hypothetical data points for agglomerative clustering

Page 57:

Example 8.1 cont.

Step 1
C = {{P1}, {P2}, {P3}, {P4}, {P5}, {P6}, {P7}, {P8}}

Step 2
C1 = {{P2}, {P3}, {P4}, {P5}, {P6}, {P1, P7}, {P8}}

Step 3
C2 = {{P3}, {P2, P4}, {P5}, {P6}, {P1, P7}, {P8}}

Page 58:

Example 8.1 cont.

Step 4
C3 = {{P2, P3, P4}, {P5}, {P6}, {P1, P7}, {P8}}

Step 5
C4 = {{P2, P3, P4}, {P5}, {P1, P6, P7}, {P8}}

Step 6
C5 = {{P2, P3, P4}, {P5}, {P1, P6, P7, P8}}

Step 7
C6 = {{P2, P3, P4}, {P1, P5, P6, P7, P8}}

Page 59:

Agglomerative clustering: An example


Figure 8.6 Hierarchical clustering of the data points of Example 8.1

Page 60:

Dendrogram of the example

P3 P4 P2 P8 P7 P1 P6 P5

Figure 8.7 The dendrogram of the data points of Example 8.1

Page 61:

Types of clustering cont.

• Non-Hierarchical Clustering– Partitioning methods– Density-based methods– Probability-based methods

Page 62:

Partitioning methods

The K-Means Algorithm:

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center.

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3 and 4 until the cluster centers do not change.
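The steps above can be sketched in pure Python (the sample data and function names are illustrative):

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly choose k points as the initial cluster centres.
    centres = rng.sample(points, k)
    while True:
        # Step 3: assign each instance to its closest cluster centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        # Step 4: recompute each cluster centre as the mean of its members.
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centres[i]
               for i, c in enumerate(clusters)]
        # Step 5: stop when the centres no longer change.
        if new == centres:
            return clusters, centres
        centres = new

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centres = kmeans(data, k=2)
print(centres)  # two centres, one per obvious group
```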

Page 63:

General considerations of K-means algorithm

• Requires real-valued data.
• We must pre-select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.

Page 64:

Example 8.2

Let us consider the dataset of Example 8.1 to find two clusters using the k-means algorithm.

Step 1. Arbitrarily, let us choose two cluster centres to be the data points P5 (5, 2) and P7 (1, 2). Their relative positions can be seen in Figure 8.6. We could have started with any two other points; for this small dataset the algorithm converges to the same final clusters, although in general the result of k-means can depend on the initial selection of centres.

Page 65:

Step 2. Let us find the Euclidean distances of all the data points from these two cluster centers.

Page 66:

Step 2. (Cont.)

Page 67:

Step 3. The new cluster centres are:

Page 68:

Step 4. The distances of all data points from these new cluster centres are:

Page 69:

Step 4. (cont.)

Page 70:

Step 5. By the closest-centre criterion, P5 should be moved from C2 to C1, and the new clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}.

The new cluster centres are:

Page 71:

Step 6. We may repeat the computations of Step 4 and we will find that no data point will switch clusters. Therefore, the iteration stops and the final clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}.

Page 72:

Density-based methods

Figure 8.8 (a) Three irregularly shaped clusters C1, C2 and C3 (b) Influence curve of a point

Page 73:

Probability-based methods

• Expectation Maximization (EM) uses a Gaussian mixture model. Repeat until a termination criterion is achieved:
  – Guess initial values of all the parameters.
  – Use the probability density function to compute the cluster probability for each instance.
  – Use the probability score assigned to each instance in the above step to re-estimate the parameters.

P(Ck | Xi) = P(Ck) P(Xi | Ck) / P(Xi)
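As an illustration of the cluster-probability step, here is the Bayes computation for a one-dimensional mixture of two Gaussians (all parameter values below are invented for the example):

```python
from math import exp, pi, sqrt

def gauss(x, mu, sigma):
    """Gaussian probability density at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Assumed mixture parameters: P(C1) = P(C2) = 0.5, C1 ~ N(0, 1), C2 ~ N(4, 1).
priors = [0.5, 0.5]
params = [(0.0, 1.0), (4.0, 1.0)]

def cluster_probs(x):
    """P(Ck | x) = P(Ck) * P(x | Ck) / P(x), with P(x) summed over all k."""
    joint = [p * gauss(x, mu, s) for p, (mu, s) in zip(priors, params)]
    px = sum(joint)
    return [j / px for j in joint]

print(cluster_probs(1.0))  # heavily favours cluster 1
print(cluster_probs(2.0))  # [0.5, 0.5] by symmetry
```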

Page 74:

Clustering through WekaStep 1.

Figure 8.9 Weka environment with credit-g.arff data

Page 75:

Step 2.

Figure 8.10 SimpleKMeans algorithm and its parameter selection

Page 76:

Step 3.

Figure 8.11 K-means clustering performance

Page 77:

Step 3. (cont.)

Figure 8.12 Weka result window

Page 78:

Cluster visualisation

Figure 8.13 Cluster visualisation

Page 79:

Individual cluster information

Figure 8.14 Cluster0 instances information

Page 80:

Step 4.

Figure 8.15 Cluster 1 instance information

Page 81:

Kohonen neural network

Figure 8.16 A Kohonen network with two input nodes and nine output nodes


Page 82:

Kohonen self-organising maps:

• Contain only an input layer and an output layer, but no hidden layer.

• The number of nodes in the output layer that finally captures all the instances determines the number of clusters in the data.

Page 83:

Example 8.3

Inputs: Input 1 = 0.3, Input 2 = 0.6
Weights to Output 1: W11 = 0.1, W21 = 0.2
Weights to Output 2: W12 = 0.4, W22 = 0.5

Figure 8.17 Connections between input and output nodes of a neural network

Page 84:

Example 8.3 Cont.

The score for any output node j is computed from the inputs I_i and the weights W_ij as:

score_j = sqrt( Σ_i (I_i − W_ij)^2 )

Output node 1: sqrt( (0.3 − 0.1)^2 + (0.6 − 0.2)^2 ) = 0.447

Output node 2: sqrt( (0.3 − 0.4)^2 + (0.6 − 0.5)^2 ) = 0.141

Page 85:

Example 8.3 cont.

The winning node’s weights are then updated:

w_ij(new) = w_ij(current) + Δw_ij

where Δw_ij = r (I_i − w_ij) and r is the learning rate, 0 < r ≤ 1.

Page 86:

Example 8.3 cont.

03.0)4.03.0(3.012 W

03.0)5.06.0(3.022 W

37.003.04.0(new)12 W

53.003.05.0(new)22 W

Assuming that the learning rate is 0.3, we get:
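The scoring and weight-update computations of Example 8.3 can be reproduced in a few lines (inputs and weights as given in Figure 8.17):

```python
# Weights W[i][j]: input i -> output j, from Figure 8.17.
W = [[0.1, 0.4],   # input 1 -> output 1, output 2
     [0.2, 0.5]]   # input 2 -> output 1, output 2
inputs = [0.3, 0.6]
r = 0.3  # learning rate

# Score each output node; the node with the smallest score wins.
scores = [sum((inputs[i] - W[i][j]) ** 2 for i in range(2)) ** 0.5
          for j in range(2)]
print([round(s, 3) for s in scores])  # [0.447, 0.141] -> output 2 wins

winner = scores.index(min(scores))
# Move the winner's weights towards the input vector.
for i in range(2):
    W[i][winner] += r * (inputs[i] - W[i][winner])
print([round(W[i][winner], 2) for i in range(2)])  # [0.37, 0.53]
```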

Page 87:

Cluster validation

• t-test
• χ²-test
• Validity in test cases

Page 88:

Strengths and weaknesses

• Unsupervised Learning
• Diverse Data Types
• Easy to Apply
• Similarity Measures
• Model Parameters
• Interpretation

Page 89:

Applications of clustering algorithms

• Biology
• Marketing research
• Library science
• City planning
• Disaster studies
• World Wide Web
• Social network analysis
• Image segmentation

Page 90:

Recap

• What is clustering?
• K-means: the most popular clustering algorithm
• Applications of clustering techniques

Page 91:

The Estimation Task


Copyright © 2007 Cengage Learning Australia Pty Limited

Page 92:

Objectives

On completion of this lecture you should be able to:

• Assess the numeric value of a variable from other related variables.

• Predict the behaviour of one variable from the behaviour of related variables.

• Discuss the reliability of different methods of estimation and perform a comparative study.

Page 93:

What is estimation?

Finding the numeric value of an unknown attribute from observations made on other related attributes. The unknown attribute is called the dependent (or response or output) attribute (or variable) and the known related attributes are called the independent (or explanatory or input) attributes (or variables).

Page 94:

Scatter Plots and Correlation

Table 9.1 Weekly closing stock prices (in dollars) at the Australian Stock Exchange

Week ending | ASX | BHP | RIO
1-1-2006 | 33.70 | 23.35 | 68.80
8-1-2006 | 34.95 | 23.73 | 70.50
15-1-2006 | 34.14 | 24.66 | 74.00
22-1-2006 | 34.72 | 26.05 | 76.10
29-1-2006 | 34.61 | 25.53 | 74.75
5-2-2006 | 34.28 | 24.75 | 74.40
12-2-2006 | 33.24 | 23.88 | 71.65
19-2-2006 | 33.14 | 24.55 | 72.20
26-2-2006 | 31.08 | 24.34 | 70.35
5-3-2006 | 31.72 | 23.37 | 67.50
12-3-2006 | 33.30 | 24.70 | 71.25
19-3-2006 | 32.60 | 25.92 | 75.23
26-3-2006 | 32.70 | 28.00 | 78.85
2-4-2006 | 33.20 | 29.50 | 83.70
9-4-2006 | 32.70 | 29.75 | 82.32
16-4-2006 | 32.50 | 30.68 | 83.06

Page 95:

Figure 9.1a Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot

Page 96:

Figure 9.1b

Page 97:

Figure 9.1c

Page 98:

Figure 9.1d

Page 99:

Figure 9.1e Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot (RIO share price ($) on the vertical axis against BHP share price ($) on the horizontal axis)

Page 100:

Correlation coefficient

r = Covariance between the two variables / [(Standard deviation of one variable) × (Standard deviation of the other variable)]

r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σ(Xᵢ − X̄)² · Σ(Yᵢ − Ȳ)² )
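This formula can be sketched directly in Python; here it is applied, as a check, to the rainfall and streamflow data of Example 9.2 that appears later in the lecture (the function name correlation is illustrative):

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Rainfall and streamflow (mm/day) from the Example 9.2 table.
rain = [0.00, 1.64, 20.03, 9.20, 75.37, 50.13, 9.81, 1.02, 0.00, 0.00]
flow = [0.10, 0.07, 0.24, 0.33, 3.03, 15.20, 9.66, 4.01, 2.05, 1.32]
r = correlation(rain, flow)   # ≈ 0.43, as computed on Page 105
```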

Page 101:

Scatter plots of X and Y variables and their correlation coefficients

Figure 9.2 Scatter plots of X and Y variables and their correlation coefficients

Page 102:

CORREL xls function

Figure 9.3 Microsoft Excel command for the correlation coefficient

Page 103:

Example 9.2

Date | Rainfall (mm/day) | Streamflow (mm/day)
23-6-1983 | 0.00 | 0.10
24-6-1983 | 1.64 | 0.07
25-6-1983 | 20.03 | 0.24
26-6-1983 | 9.20 | 0.33
27-6-1983 | 75.37 | 3.03
28-6-1983 | 50.13 | 15.20
29-6-1983 | 9.81 | 9.66
30-6-1983 | 1.02 | 4.01
1-7-1983 | 0.00 | 2.05
2-7-1983 | 0.00 | 1.32

Page 104:

Example 9.2 cont.

The computations can be done neatly in tabular form as given in the next slide:

(a) For the mean values:

X̄ = ΣXᵢ/n = 167.2/10 = 16.72
Ȳ = ΣYᵢ/n = 36.01/10 = 3.601

Page 105:

Example 9.2 cont.

Therefore, the correlation coefficient:

r = 495.08 / √((5983.89) × (226.06)) = 0.43

Page 106:

Example 9.2 cont.

Therefore, the correlation coefficient:

r = 1039.06 / √((5673.24) × (212.45)) = 0.95

Page 107: Conventional Data Mining Techniques II A B M Shawkat Ali 1 PowerPoint permissions Cengage Learning Australia hereby permits the usage and posting of our.

Linear means all exponents (powers) of x must be one, i.e., it cannot be a fraction or a value greater than 1. There cannot be a product term of variables as well.

cnxnaxaxaxanxxxxf .......)...,,( 332211321

Linear regression analysis

Page 108:

Fitting a straight line

y = mx + c

Suppose the line passes through two points A and B, where A is (x₁, y₁) and B is (x₂, y₂). Then:

(y − y₁)/(y₁ − y₂) = (x − x₁)/(x₁ − x₂)

which rearranges to:

y = [(y₁ − y₂)/(x₁ − x₂)] x + (x₁y₂ − x₂y₁)/(x₁ − x₂)    (Eq. 9.3)

Page 109:

Example 9.3

Problem: The number of public servants claiming compensation for stress has been steadily rising in Australia. The number of successful claims in 1989-90 was 800 while in 1994-95 the figure was 1900. How many claims are expected in the year 2006-2007 if the growth continues steadily? If each claim costs an average of $24,000, what should be the budget allocation of Comcare in year 2006-2007 for stress-related compensation?

Page 110:

Example 9.3 cont.

Therefore, using equation (9.3) we get:

(Y − 1900)/(1900 − 800) = (X − 1995)/(1995 − 1990)

Solving, we have Y = 220X − 437,000. If we now let X = 2007, we get the expected number of claims in the year 2006-2007: 220(2007) − 437,000 = 4,540. At $24,000 per claim, Comcare's budget should be $108,960,000.
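Eq. 9.3 and this worked example can be verified with a small Python sketch (the helper name line_through is my own):

```python
def line_through(p1, p2):
    """Slope m and intercept c of the line through two points (Eq. 9.3)."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y1 - y2) / (x1 - x2)
    c = (x1 * y2 - x2 * y1) / (x1 - x2)
    return m, c

# Example 9.3: 800 claims in 1989-90 and 1900 claims in 1994-95.
m, c = line_through((1990, 800), (1995, 1900))   # m = 220, c = -437000
claims_2007 = m * 2007 + c                       # 4540 expected claims
budget = claims_2007 * 24000                     # $108,960,000
```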

Page 111:

Simple linear regression

Figure 9.6 Schematic representation of the simple linear regression model

Page 112:

Least squares criteria

Minimise the sum of squared errors:

Σeᵢ² = Σ(Yᵢ − Ŷᵢ)² = Σ[Yᵢ − (b₀ + b₁Xᵢ)]²

This gives the estimates:

b₁ = Sxy/Sxx
b₀ = Ȳ − b₁X̄

where

Ȳ = ΣYᵢ/n (average of all the y values)
X̄ = ΣXᵢ/n (average of all the x values)
Sxy = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = ΣXᵢYᵢ − (ΣXᵢ)(ΣYᵢ)/n (sum of cross-product deviations)
Sxx = Σ(Xᵢ − X̄)² = ΣXᵢ² − (ΣXᵢ)²/n (sum of the squared deviations for X)

Page 113:

Example 9.5

Table 9.2 Unisuper membership by States

State | No. of Inst., X | Membership, Y | X² | Y² | XY
NSW | 17 | 5 987 | 289 | 3.58442×10⁷ | 101 779
QLD | 11 | 5 950 | 121 | 3.54025×10⁷ | 65 450
SA | 10 | 3 588 | 100 | 1.28737×10⁷ | 35 880
TAS | 3 | 1 356 | 9 | 1.83873×10⁶ | 4 068
VIC | 41 | 14 127 | 1681 | 1.99572×10⁸ | 579 207
WA | 9 | 4 847 | 81 | 2.34934×10⁷ | 43 623
Others | 11 | 3 893 | 121 | 1.51554×10⁷ | 42 823
Total | 102 | 39 748 | 2402 | 3.241799×10⁸ | 872 830

Page 114:

Example 9.5 cont.

Ȳ = ΣYᵢ/n = 39 748/7 = 5 678
X̄ = ΣXᵢ/n = 102/7 = 14.57

Sxy = ΣXY − (ΣX)(ΣY)/n = 872 830 − (102)(39 748)/7 = 293 645
Sxx = ΣX² − (ΣX)²/n = 2402 − (102)²/7 = 915.7

b₁ = Sxy/Sxx = 293 645/915.7 = 320.7
b₀ = Ȳ − b₁X̄ = 5 678 − (320.7)(14.57) = 1005

Therefore, the regression equation is Y = 320.7X + 1005.
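The whole calculation can be replayed in Python with the Sxy/Sxx formulas from the least squares slide (the helper name least_squares is illustrative; the exact intercept is about 1005.6, which the slide rounds to 1005 by using rounded intermediate values):

```python
def least_squares(xs, ys):
    """Simple linear regression: slope b1 and intercept b0."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b1 = sxy / sxx
    b0 = sum(ys) / n - b1 * sum(xs) / n
    return b1, b0

# Table 9.2: institutions (X) and membership (Y) by State.
X = [17, 11, 10, 3, 41, 9, 11]
Y = [5987, 5950, 3588, 1356, 14127, 4847, 3893]
b1, b0 = least_squares(X, Y)   # b1 ≈ 320.7, b0 ≈ 1005
```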

Page 115:

Multiple linear regression with Excel

Type 'regression' under Help and then go to the LINEST function. Highlight the 'District office building data' example, copy it with Ctrl+C, and paste it into your spreadsheet with Ctrl+V.

Page 116:

Multiple regression

Y = β₀ + β₁X₁ᵃ + β₂X₂ᵇ + β₃X₃ᶜ + ...

where Y is the dependent variable; X₁, X₂, ... are independent variables; β₀, β₁, ... are regression coefficients; and a, b, ... are exponents. For multiple linear regression, all exponents equal one.

Page 117:

Example 9.6

Period | No. of Private Houses | Average weekly earnings ($) | No. of persons in workforce (in millions) | Variable home loan rate (in %)
1986-87 | 83 973 | 428 | 5.6889 | 15.50
1987-88 | 100 069 | 454 | 5.8227 | 13.50
1988-89 | 128 231 | 487 | 6.0333 | 17.00
1989-90 | 96 390 | 521 | 6.1922 | 16.50
1990-91 | 87 038 | 555 | 6.0933 | 13.00
1991-92 | 100 572 | 581 | 5.8846 | 10.50
1992-93 | 113 708 | 591 | 5.8372 | 9.50
1993-94 | 123 228 | 609 | 5.9293 | 8.75
1994-95 | 111 966 | 634 | 6.1190 | 10.50

= LINEST(A2:A10, B2:D10, TRUE, TRUE)

Page 118:

Example 9.6 cont.

Figure 9.7 Demonstration of use of LINEST function

Hence, from the printout, the regression equation is:

H = 155914.8 + 232.2498 E – 36463.4 W + 3204.0441 I

where H is the number of private houses, E the average weekly earnings, W the workforce size and I the home loan rate.

The Ctrl and Shift keys must be kept depressed while striking the Enter key to get tabular output.
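For readers without Excel, the same kind of least-squares fit can be sketched by solving the normal equations XᵀXb = Xᵀy directly. This is an illustrative, unverified sketch: the helper names solve and fit are my own, and the coefficients it produces are not checked here against the LINEST printout above.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit(rows, ys):
    """Least-squares fit via the normal equations (intercept first)."""
    X = [[1.0] + row for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

# Example 9.6: earnings, workforce, loan rate -> number of private houses.
predictors = [[428, 5.6889, 15.50], [454, 5.8227, 13.50], [487, 6.0333, 17.00],
              [521, 6.1922, 16.50], [555, 6.0933, 13.00], [581, 5.8846, 10.50],
              [591, 5.8372, 9.50], [609, 5.9293, 8.75], [634, 6.1190, 10.50]]
houses = [83973, 100069, 128231, 96390, 87038, 100572, 113708, 123228, 111966]
beta = fit(predictors, houses)   # [intercept, b_E, b_W, b_I]
```

Because the fit includes an intercept, its residual sum of squares can never exceed the total sum of squares about the mean.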

Page 119:

Coefficient of determination

If the fit is perfect, the R2 value will be one and if there is no relationship at all, the R2 value will be zero.

R² = Variation explained / Total variation = 1 − Σ(Yᵢ − Ŷᵢ)² / Σ(Yᵢ − Ȳ)²
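A minimal Python sketch of this definition (the function name r_squared is my own) confirms the two limiting cases mentioned above:

```python
def r_squared(ys, yhats):
    """Coefficient of determination: 1 - SSE / SST."""
    ybar = sum(ys) / len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    sst = sum((y - ybar) ** 2 for y in ys)
    return 1 - sse / sst

perfect = r_squared([1, 2, 3], [1, 2, 3])     # perfect fit -> 1.0
mean_only = r_squared([1, 2, 3], [2, 2, 2])   # no relationship (predict the mean) -> 0.0
```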

Page 120:

Logistic regression

A regression equation cannot model discrete (categorical) values directly. We get a better reflection of the reality if we replace the actual values by the probability of occurrence. The ratio of the probabilities of occurrence and non-occurrence (the odds) directs us close to the actual value.

Page 121:

Transforming the linear regression model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.

Page 122:

The logistic regression model

p(y = 1 | x) = c = eᵃˣ / (1 + eᵃˣ)

where e is the base of natural logarithms (often denoted exp) and ax is the right-hand side of the regression equation in vector form.
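The model can be sketched directly from this formula (the function name p_y1 is illustrative): at ax = 0 the probability is 0.5, and it saturates towards 0 and 1 for large negative and positive linear scores.

```python
import math

def p_y1(ax):
    """Logistic model: conditional probability that y = 1 given the linear score ax."""
    return math.exp(ax) / (1 + math.exp(ax))

print(p_y1(0))    # 0.5 at the midpoint of the S-curve
print(p_y1(4))    # close to 1 for a large positive score
print(p_y1(-4))   # close to 0 for a large negative score
```

This is the S-shaped curve shown in Figure 9.8.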

Page 123:

Logistic regression cont.

Figure 9.8 Graphical representation of the logistic regression equation: P(y=1|X) rises as an S-shaped curve from 0 to 1 as X runs from −10 to 10.

Page 124:

Regression in Weka

Figure 9.10 Selection of logistic function

Page 125:

Output from logistic regression

Figure 9.12 Output from logistic regression

Page 126:

Visualisation option of the results

Figure 9.13 Visualisation option of the results

Page 127:

Visual impression of data and clusters

Figure 9.14 Visual impression of data and clusters

Page 128:

Particular instance information

Figure 9.15 Information about a particular instance

Page 129:

Strengths and weaknesses

• Regression analysis is a powerful tool suited to linear relationships, but most real-world problems are nonlinear. The output is therefore often not exact, but still useful.

• Regression techniques assume that the errors are normally distributed and that the instances are independent of each other. This is not the case with many real problems.

Page 130:

Applications of regression algorithms

• Financial markets
• Medical science
• Retail industry
• Environment
• Social science

Page 131:

Recap

• What is estimation?

• How to solve the estimation problem?

• Applications of regression analysis.