
Table of Contents

Chapter 4. Clustering and Association Analysis
4.1. Cluster Analysis or Clustering
4.1.1. Distance and similarity measurement
4.1.2. Clustering Methods
4.1.3. Partition-based Methods
4.1.4. Hierarchical-based clustering
4.1.5. Density-based clustering
4.1.6. Grid-based clustering
4.1.7. Model-based clustering
4.2. Association Analysis and Frequent Pattern Mining
4.2.1. Apriori algorithm
4.2.2. FP-Tree algorithm
4.2.3. CHARM algorithm
4.2.4. Association Rules with Hierarchical Structure
4.2.5. Efficient Association Rule Mining with Hierarchical Structure
4.3. Historical Bibliography
Exercise


Chapter 4. Clustering and Association Analysis

This chapter presents two unsupervised tasks, namely clustering and association rule mining. Clustering groups input data into a number of groups without examples or clues, while association rule mining finds frequently co-occurring events using correlation analysis. This chapter provides basic techniques for clustering and association rule mining, in that order.

4.1. Cluster Analysis or Clustering

Unlike classification, cluster analysis (also called clustering or data segmentation) handles data objects whose class labels are unknown or not given. Clustering is known as unsupervised learning, and it does not rely on predefined classes or class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by examples (like classification). Although classification is an effective means for distinguishing groups or classes of objects, it often requires a costly process of labeling a large set of training tuples or patterns, which the classifier uses to model each group. Therefore, it is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. In other words, we can analyze data objects explicitly if we know the class label for each object (classification tasks), but it is quite common in large databases that objects are not assigned any label (clustering tasks), since the process of assigning class labels is very costly. In such situations, where we have a set of objects (or data objects) without any class label, one often needs to arrange them into groups based on their similarity in order to learn some structure within the object collection. Moreover, a cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. As a by-product of clustering, it is also possible to detect outliers, i.e., objects that deviate from the norm. One additional advantage of such a clustering-based process is that it is adaptable to changes and helps figure out useful features that distinguish different groups.

In data mining, efforts have focused on finding methods for efficient and effective cluster

analysis in large databases. Active themes of research focus on the scalability of clustering

methods, the effectiveness of methods for clustering complex shapes and types of data, high-

dimensional clustering techniques, and methods for clustering mixed numerical and categorical

data in large databases.

Clustering attempts to group a set of given data objects into classes or clusters, so that

objects within a cluster have high similarity in comparison to one another but are very dissimilar

to objects in other clusters. Similarities or dissimilarities are assessed based on the attribute

values describing the objects where distance measures are used. There are many clustering

approaches including partitioning methods, hierarchical methods, density-based methods, grid-

based methods, model-based methods, methods for high-dimensional data (such as frequent

pattern–based methods), and constraint-based clustering.

Conceptually, cluster analysis is an important human activity. Early in childhood, we learn

how to distinguish between dogs and cats, between birds and fish, or between animals and plants,

by continuously improving subconscious clustering schemes. With automated clustering, dense

and sparse regions are identified in object space and, therefore, overall distribution patterns and

interesting correlations among data attributes are discovered. Clustering has its roots in many

areas, including data mining, statistics, information theory and machine learning. It has been

widely used in numerous applications, such as market research, pattern recognition, data

analysis, biological discovery, and image processing. In business, clustering can help marketing


people discover distinct groups in their customer bases and characterize customer groups based

on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies,

categorize genes with similar functionality, and gain insight into structures inherent in

populations. Clustering may also assist in the identification of areas of similar land usage in an

earth observation database and in the identification of groups of houses in a city according to

house type, value, and geographic location, as well as the identification of groups of automobile

insurance policy holders with a high average claim cost. It can also be used to help classify online

documents, e.g. web documents, for information discovery.

As one major data mining function, cluster analysis can be used as a stand-alone tool to gain

insight into the distribution of data, to observe the characteristics of each cluster, and to focus on

a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step

for other algorithms, such as characterization, attribute subset selection, and classification, which

would then operate on the detected clusters and the selected attributes or features. In the context of the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been

extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster

analysis tools based on k-means, k-medoids, and several other methods have also been built into

many statistical analysis software packages or systems, such as DBMiner, Weka, KNIME,

RapidMiner, SPSS, and SAS. In machine learning, clustering is an example of unsupervised

learning.

In clustering, there are different ways in which the result of clustering can be expressed. Firstly, the groups that are identified may be exclusive, so that any instance belongs to only one group. Secondly, it is also possible to have clusters that are overlapping, i.e., an instance may fall into several groups. Thirdly, the clusters may be probabilistic, in the sense that an instance may belong to each group with a certain probability. Fourthly, the clusters may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further, perhaps all the way down to individual instances. Which of these possibilities should be applied depends on the constraints and purposes of clustering. Two more factors are whether clusters are formed in discrete domains (categorical or nominal attributes), numeric domains (integer- or real-valued attributes), or hybrid domains, and whether clusters are formed incrementally or statically.

The following is a list of typical issues that we need to consider for clustering in real applications. Firstly, practical clustering techniques are required to be able to handle large-scale data, say millions of data objects, since in a real application we usually have such large amounts of data. Secondly, clustering techniques need to handle various types of attributes, including interval-based (numerical) data, binary data, categorical or nominal data, and ordinal data, or mixtures of these data types. Thirdly, without any constraint, most clustering techniques will naturally try to find clusters that are spherical in shape. However, intuitively and practically, clusters could be of any shape, and clustering techniques should be able to detect clusters of arbitrary shape. Fourthly, clustering techniques should autonomously determine the suitable number of clusters and their associated elements without any required extra steps. Fifthly, clustering should be able to detect and eliminate noisy data, which are outliers or erroneous data. Sixthly, it is necessary to enable incremental clustering that is insensitive to the order of input data. Many techniques cannot incorporate newly inserted data into existing clustering structures but must build clusters from scratch, and most of them are sensitive to the order of input data. Seventhly, in many cases it is inevitable to handle data with high dimensions, since the data in many tasks have objects with a large number of attributes. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed. Eighthly, it is often necessary in practice to perform clustering under various kinds of constraints. For example, when we are assigned to locate a number of convenience stores, we may cluster residents by taking their static characteristics into account, as well as considering geographical constraints such as the city's rivers, ponds, roads, and highway networks, and the type and number of residents per cluster (customers for a store). A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints. Ninthly, the selection of clustering techniques and features has to be done with consideration of interpretability and usability. Clustering may need to be tied to specific purposes, which match semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

The next sections describe distance or similarity, which is the basis for clustering similar

objects and then illustrate well-known types of clustering methods, including partition-based

methods, hierarchical methods, density-based methods, grid-based methods, and model-based

methods. Special topics on clustering in high-dimensional space, constraint-based clustering, and

outlier analysis are partially discussed.

4.1.1. Distance and similarity measurement

To enable the process of clustering, a form of distance or similarity needs to be defined to determine which pairs of objects are close to each other. In general, a data object is characterized by a set of attributes or features. These can be of various types, including interval-scaled variables; binary variables; categorical, ordinal, and ratio-scaled variables; or combinations of these variable types.

Data objects can be represented in the form of a mathematical schema, as shown in Section 2.1 and summarized briefly as follows. Assume that a dataset D is composed of n instance objects o1, ..., on, each of which, oi, is represented by a vector of the values ai1, ..., aip of p attributes (namely A1, ..., Ap). In other words, a dataset can be viewed as a matrix with n rows and p columns, called a data matrix, as in Figure 4-1. Each instance object oi is characterized by the values of attributes that express different aspects of the instance.

        A1       A2       …   Ak       …   Ap
o1      a11      a12      …   a1k      …   a1p
o2      a21      a22      …   a2k      …   a2p
…       …        …        …   …        …   …
oi      ai1      ai2      …   aik      …   aip
…       …        …        …   …        …   …
on      an1      an2      …   ank      …   anp

D = (o1, o2, …, on)T  … a dataset: a set of n data objects
oi = (ai1, ai2, …, aip) … an object characterized by p attribute values

Figure 4-1: A matrix representation of data objects with their attribute values.

Conceptually, we can measure the dissimilarity (also called distance) or similarity between any pair of objects. The result can be represented in the form of a two-dimensional matrix, called a dissimilarity (distance) matrix or a similarity matrix. This matrix holds an object-by-object structure and stores a collection of proximity values (dissimilarity or similarity values) for all pairs of objects. It is represented by an n-by-n matrix, as shown in Figure 4-2.


In Figure 4-2 (a), d(i, j) expresses the difference or dissimilarity between objects i and j. In most works, d(i, j) is defined to be a nonnegative number. It is close to 0 when objects i and j are very similar or close to each other, and it becomes a larger number when they are more different from each other. In general, the symmetry d(i, j) = d(j, i) is assumed, and the difference between two identical objects is set to zero, i.e., d(i, i) = 0. On the other hand, in Figure 4-2 (b), s(i, j) expresses the similarity between objects i and j. Normally, s(i, j) is a number that takes a maximum of 1 when objects i and j are identical, and it becomes a smaller number when they are farther from each other. As with the dissimilarity measure, it is natural to assume the symmetry s(i, j) = s(j, i), and the similarity between two identical objects is set to one, i.e., s(i, i) = 1.

To calculate distance or similarity, since there are various types of attributes, the measurement of the distance between two objects, say oi and oj, becomes an important issue. The point is how to fairly define the distance for interval-scaled attributes, binary attributes, categorical attributes, ordinal attributes, and ratio-scaled attributes, since the distance (or similarity) between two objects, i.e., d(i, j) or s(i, j) in Figure 4-2, naturally comes from the combination of the distances of their attributes. As a naive way, in many works the combination is defined by a summation over the per-attribute distances, $d(i,j) = \sum_{k=1}^{p} d_k(i,j)$ (or, for similarity, $s(i,j) = \sum_{k=1}^{p} s_k(i,j)$). While different types of attributes may have different distance (or similarity) calculations, it is possible to classify them into two main approaches.

        o1        o2        …   oi        …   on
o1      0         d(1,2)    …   d(1,i)    …   d(1,n)
o2      d(2,1)    0         …   d(2,i)    …   d(2,n)
…       …         …         0   …         …   …
oi      d(i,1)    d(i,2)    …   0         …   d(i,n)
…       …         …         …   …         0   …
on      d(n,1)    d(n,2)    …   d(n,i)    …   0

(a) Dissimilarity (distance) matrix (minimum value = 0)

        o1        o2        …   oi        …   on
o1      1         s(1,2)    …   s(1,i)    …   s(1,n)
o2      s(2,1)    1         …   s(2,i)    …   s(2,n)
…       …         …         1   …         …   …
oi      s(i,1)    s(i,2)    …   1         …   s(i,n)
…       …         …         …   …         1   …
on      s(n,1)    s(n,2)    …   s(n,i)    …   1

(b) Similarity matrix (maximum value = 1)

Figure 4-2: A dissimilarity (distance) matrix and similarity matrix

The first approach is to normalize all attributes into a fixed standard scale, say 0.0 to 1.0 (or -1.0

to 1.0) and then use a distance measure, such as Euclidean distance or Manhattan distance, or use

a similarity measure, such as cosine similarity, to determine the distance between a pair of

objects. The detail of this approach is described in Section 2.5.2. The second approach is to use

different measurements for different types of attributes as follows.
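As a small illustration of the first approach, the following Python sketch (not part of the original text; the data values are made up) min-max normalizes a toy data matrix and then builds the n-by-n Euclidean dissimilarity matrix of Figure 4-2 (a).

import numpy as np

# Toy data matrix: n = 4 objects, p = 2 attributes (hypothetical values).
D = np.array([[1.0, 8.0],
              [2.0, 7.0],
              [8.0, 2.0],
              [9.0, 3.0]])

# Min-max normalization of each attribute into the range 0.0 to 1.0.
D_norm = (D - D.min(axis=0)) / (D.max(axis=0) - D.min(axis=0))

# n-by-n Euclidean dissimilarity matrix d(i, j), as in Figure 4-2 (a).
n = len(D_norm)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist[i, j] = np.sqrt(np.sum((D_norm[i] - D_norm[j]) ** 2))

print(np.round(dist, 3))   # zero diagonal, symmetric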


The following summarizes, for each attribute type, a suitable way to measure distance or similarity. (A short code sketch combining these per-type measures is given after this list.)

1. Interval-scaled attributes

[Normalization Step]
Option 1: transform the original value $a_{ik}$ into the standardized value $z_{ik} = (a_{ik} - m_k)/s_k$, where $s_k$ is the mean absolute deviation of the attribute $A_k$, $s_k = \frac{1}{n}\sum_{i=1}^{n}|a_{ik} - m_k|$, and $m_k$ is the mean value of the attribute $A_k$, $m_k = \frac{1}{n}\sum_{i=1}^{n} a_{ik}$.
Option 2: transform the original value $a_{ik}$ into the standardized value $z_{ik} = (a_{ik} - m_k)/\sigma_k$, where $\sigma_k$ is the standard deviation of the attribute $A_k$ and $m_k$ is the mean value of the attribute $A_k$.

[Distance Measurement Step]
For the distance measurement in Figure 4-2, we can use a standard measure such as the Euclidean distance or the Manhattan distance between two objects, say oi and oj, as follows.

Euclidean distance: $d(i,j) = \sqrt{\sum_{k=1}^{p}(a_{ik} - a_{jk})^2}$

Manhattan distance: $d(i,j) = \sum_{k=1}^{p}|a_{ik} - a_{jk}|$

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function.

1. $d(i,j) \geq 0$: the distance is a nonnegative number.
2. $d(i,i) = 0$: the distance from an object to itself is zero.
3. $d(i,j) = d(j,i)$: the distance is a symmetric function.
4. $d(i,j) \leq d(i,h) + d(h,j)$: the distance satisfies the triangular inequality; the direct distance from object i to object j is not larger than an indirect detour over any other object h.


1. Interval-scaled attributes (continued)

It is also possible to measure similarity instead of distance. As for Figure 4-2, two common similarity measures are the dot product and the cosine similarity. Their formulae are given below. When the object vectors are normalized to unit length, the dot product and the cosine similarity become identical.

Dot product: $s(i,j) = \sum_{k=1}^{p} a_{ik}\, a_{jk}$

Cosine similarity: $s(i,j) = \dfrac{\sum_{k=1}^{p} a_{ik}\, a_{jk}}{\sqrt{\sum_{k=1}^{p} a_{ik}^2}\;\sqrt{\sum_{k=1}^{p} a_{jk}^2}}$

Note that the dot product has no bound, but the cosine similarity ranges between -1 and 1 (in this task, between 0 and 1).

2. Categorical attributes

A categorical attribute can be viewed as a generalization of a binary attribute in that it can take more than two states. For example, 'product brand' is a categorical attribute that may take one value from a set of more than two possible values, say Oracle, Microsoft, Google, and Facebook. For the distance on a categorical attribute, it is possible to follow the same approach as for an interval-scaled attribute by setting the per-attribute distance to 0 if two objects have the same value and to 1 otherwise. That is, when the value of a categorical attribute A of two objects oi and oj is the same, the dissimilarity and the similarity between these two objects for that attribute are set to 0 and 1, respectively; otherwise they are 1 and 0, respectively:

dA(i,j) = 0 and sA(i,j) = 1 when the objects oi and oj have the same value for attribute A.
dA(i,j) = 1 and sA(i,j) = 0 when the objects oi and oj have different values for attribute A.

3. Ordinal attributes

A discrete ordinal attribute lies between a categorical attribute and a numeric-valued attribute in the sense that an ordinal attribute has a fixed number of discrete values (like a categorical attribute), but they can be ordered in a meaningful sequence (like a numeric-valued attribute). An ordinal attribute is useful for recording subjective assessments of qualities that cannot be measured objectively. For example, 'height' can be high, middle, or low, and 'weight' can be heavy, medium, or light. This ordinal property presents continuity on an unknown scale, but its actual magnitude is not known. To handle the scale of an ordinal attribute, we can treat an ordinal variable by normalization as follows. First, the values of the ordinal attribute $A_k$ are mapped to ranks $r_{ik} \in \{1, \ldots, M_k\}$, where $M_k$ is the number of ordered states. Second, each value of the ordinal attribute is mapped to a value $z_{ik}$ between 0 and 1 as follows:

$z_{ik} = \dfrac{r_{ik} - 1}{M_k - 1}$

Third, the dissimilarity can then be computed using any of the distance measures for interval-scaled variables, using $z_{ik}$ to represent the value for the i-th object.


4. Binary attributes

Even though it is possible to use the same approach as with interval-scaled attributes, treating binary variables as if they were interval-scaled may not be suitable, since it may lead to improper clustering results. Here, it is necessary to use methods specific to binary data for computing dissimilarities. As one approach, a dissimilarity matrix can be calculated from the given binary data. Normally we consider all binary variables to have the same weight. With this setting, a 2-by-2 contingency table can be constructed to calculate the dissimilarity between object i and object j as follows, where a is the number of attributes that equal 1 for both objects, b is the number that equal 1 for object i but 0 for object j, c is the number that equal 0 for object i but 1 for object j, and d is the number that equal 0 for both objects.

                    Object j
                    1       0       Total
Object i    1       a       b       a+b
            0       c       d       c+d
            Total   a+c     b+d     a+b+c+d

Two types of dissimilarity measures for binary attributes are the symmetric binary dissimilarity and the asymmetric binary dissimilarity:

Symmetric binary dissimilarity: $d(i,j) = \dfrac{b + c}{a + b + c + d}$

Asymmetric binary dissimilarity: $d(i,j) = \dfrac{b + c}{a + b + c}$

The asymmetric binary dissimilarity is used when the positive and negative outcomes of a binary attribute are not equally important, such as the positive and negative outcomes of a disease test. That is, the value 1 of an attribute has a different importance level from the value 0 of that attribute. For example, we may give more importance to the outcome of HIV positive (1), which occurs rarely, and less importance to the outcome of HIV negative (0), which is far more common. As the negation of these dissimilarities, we can calculate the symmetric binary similarity and the asymmetric binary similarity (the latter is also known as the Jaccard coefficient) as follows:

Symmetric binary similarity: $s(i,j) = 1 - d(i,j) = \dfrac{a + d}{a + b + c + d}$

Asymmetric binary similarity: $s(i,j) = 1 - d(i,j) = \dfrac{a}{a + b + c}$

5. Ratio-scaled attributes

A ratio-scaled attribute takes a positive value on a nonlinear scale, such as an exponential scale, roughly following a formula of the form $x = A e^{Bt}$. Here, A is a positive numeric constant, B is a numeric constant, and t is the focused variable. It is not good to treat ratio-scaled attributes like interval-scaled attributes, since the scale is likely to be distorted by the exponential growth. There are two common methods to compute the dissimilarity between objects described by ratio-scaled attributes.

1. The first method is to apply a logarithmic transformation to the value of a ratio-scaled attribute for object i, say $x_i$, by using the formula $y_i = \log(x_i)$. The transformed value $y_i$ can then be treated as an interval-valued attribute. However, it may be more suitable to use other transformations, such as a log-log transformation, depending on the definition of the attribute.

2. The second method is to treat $x_i$ as a continuous ordinal attribute and treat its rank as an interval-valued attribute.
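To make the per-type measures above concrete, the following Python sketch (an illustration, not taken from the text) implements one per-attribute dissimilarity for each attribute type and combines them with the naive equal-weight summation mentioned earlier; the attribute names, level ordering, and example values are all assumptions.

import math

def interval_d(zx, zy):
    # Interval-scaled: absolute difference of already-standardized values.
    return abs(zx - zy)

def categorical_d(x, y):
    # Categorical: 0 if the values match, 1 otherwise.
    return 0.0 if x == y else 1.0

def ordinal_d(x, y, levels):
    # Ordinal: map ranks to [0, 1] via z = (r - 1) / (M - 1), then take the difference.
    m = len(levels)
    return abs(levels.index(x) - levels.index(y)) / (m - 1)

def asymmetric_binary_d(xs, ys):
    # Binary (asymmetric): d = (b + c) / (a + b + c); negative matches d are ignored.
    a = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    return (b + c) / (a + b + c) if (a + b + c) > 0 else 0.0

# (For a ratio-scaled attribute, one would first log-transform the value,
#  e.g. math.log(x), and then apply interval_d to the transformed values.)

# Two toy objects with mixed attribute types (all values are hypothetical).
o_i = {"income_z": 0.52, "brand": "Oracle", "height": "middle", "tests": [1, 0, 1]}
o_j = {"income_z": 0.61, "brand": "Google", "height": "high",   "tests": [1, 1, 0]}

d = (interval_d(o_i["income_z"], o_j["income_z"])
     + categorical_d(o_i["brand"], o_j["brand"])
     + ordinal_d(o_i["height"], o_j["height"], levels=["low", "middle", "high"])
     + asymmetric_binary_d(o_i["tests"], o_j["tests"]))
print(d)   # naive equal-weight sum of the per-attribute dissimilarities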

4.1.2. Clustering Methods

Although there are many existing clustering algorithms, the major clustering methods can be classified into partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Their details are discussed below.


1. Partitioning methods

A partitioning method divides n objects (data tuples) into k partitions of the data, where each partition represents a cluster and k ≤ n. This method needs to specify k, the number of partitions, beforehand. Usually clustering assigns each object to only one cluster, but it is possible to allow an object to be assigned to several clusters, as in fuzzy partitioning techniques. The steps of partitioning methods are as follows. First, the partitioning method assigns each object, randomly or heuristically, to a cluster as an initial partition. Here, each cluster has at least one object assigned from the beginning. Second, the method relocates objects among clusters iteratively, attempting to improve the partitioning result by moving objects from one group to another based on a predefined criterion, namely that objects in the same cluster are close to each other, whereas they are far apart or very different from objects in different clusters. At present, there are a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. The partitioning methods seem to work well for constructing a number of spherical-shaped clusters in small- to medium-sized databases. However, these methods need some modification to deal with clusters of complex shapes and for clustering very large data sets.

2. Hierarchical methods

Unlike a partitioning method, a hierarchical method does not specify the number of

clusters beforehand but attempts to create a hierarchical structure for the given set of data

objects. The two types of hierarchical methods are agglomerative and divisive. The

agglomerative method is the bottom-up approach where the process starts with each

object forming a separate group and then successively merges the objects or groups that

are close to one another, until all of the groups are merged into one (the topmost level of

the hierarchy), or until a termination condition holds. On the other hand, the divisive

method is the top-down approach where the procedure begins with all of the objects in the

same cluster and then for each successive iteration, a cluster is split up into smaller

clusters, until eventually each object is in one cluster, or until a termination condition holds.

Even though hierarchical methods have low computation costs because they avoid examining a combinatorial number of different merging or splitting choices, they may suffer from erroneous decisions at each merging or splitting step, since once a step is done, it can never be undone. To avoid this problem, two solutions are (1) to perform careful analysis

of linkages among objects at each hierarchical partitioning, as in Chameleon, or (2) to

integrate hierarchical agglomeration and other approaches by first using a hierarchical

agglomerative algorithm to group objects into microclusters, and then performing

macroclustering on the microclusters using another clustering method such as iterative

relocation, as in BIRCH.

3. Density-based methods

Most methods which use distance between objects for clustering will tend to find clusters

with spherical shape. However, in general clusters can have arbitrary shape. Towards this,

it is possible to apply the notion of density to cluster objects into any shape. The general

idea is to start from a single point in a cluster and then to grow the given cluster as long as

the density (number of data points or objects) in the “neighborhood” exceeds a threshold.

For each data point within a given cluster, the neighborhood of a given radius has to

contain at least a minimum number of points. A density-based method tries to filter out


noise (outliers) and discover clusters of arbitrary shape. Examples of the density-based approach are DBSCAN, its extension OPTICS, and DENCLUE.

4. Grid-based methods

While point-to-point (object pair similarity) calculation in most clustering methods seems

slow, a grid-based method first divides the object space into a finite number of cells as a

grid structure. Then clustering operations are applied on this grid structure. Grid-based

methods are superior in their fast processing time. Rather than on the number of data objects, the time complexity depends only on the number of cells in each dimension of the quantized space. A typical grid-based method is STING. It is also possible to combine the grid-based and density-based approaches, as done in WaveCluster.

5. Model-based methods

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. The model may form clusters by constructing a density function that reflects the spatial distribution of the data points. With statistical criteria, we can automatically detect the number of clusters and obtain more robustness. Some well-known model-based methods are EM, COBWEB, and SOM.

It is hard to say which type of clustering fits a given task best. The solution depends both on the type of data available and on the particular purpose of the application. It is possible to explore the methods one by one, inspect their resulting clusters, and compare them to find the most practical one. Some clustering methods may combine the ideas of several clustering methods and become mixed-type clustering. Moreover, some clustering tasks, such as text clustering or DNA microarray clustering, may have high dimension, causing difficulty in clustering since the data become sparse. Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low.

For this task, two influential subspace clustering methods are CLIQUE and PROCLUS. Rather than

searching over the entire data space, they search for clusters in subspaces (or subsets of

dimensions) of the data. Frequent pattern–based clustering, another clustering methodology,

extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses

such patterns to group objects and generate meaningful clusters. pCluster is an example of

frequent pattern–based clustering that groups objects based on their pattern similarity. Beyond

simple clustering, constraint-based clustering performs clustering under user-specified or

application-oriented constraints. Users may have some preferences on clustering data and

specify them as constraints in the clustering process. A constraint is a user’s expectation or describes

“properties” of the desired clustering results. For example, objects in a space are clustered under

the existence of obstacles or they are clustered when some objects are known to be in or not in

the same cluster.

4.1.3. Partition-based Methods

A partitioning method divides n objects (data tuples) into k partitions of the data, where each partition represents a cluster and k ≤ n. This method needs to specify k, the number of partitions, beforehand. As the most classic partitioning method, the k-means method receives in advance the number of clusters k to construct. With this parameter, k points are chosen at random as cluster centers.


Next, all instances are assigned to their closest cluster center according to an ordinary distance metric, such as the Euclidean distance. Then the centroid, or mean, of the instances in each cluster is calculated and taken as the new center value for the respective cluster. The whole process is repeated with the new cluster centers. Finally, iteration

continues until the same points are assigned to each cluster in consecutive rounds, at which stage

the cluster centers have stabilized and will remain the same forever. In summary, four steps of

the k-means algorithm are as follows.

1. Partition objects into k non-empty subsets

2. Compute seed points as the centroids of the clusters of the current partition. The

centroid is the center (mean point) of the cluster.

3. Assign each object to the cluster with the nearest seed point.

4. Go back to Step 2; stop when there are no more new assignments, that is, when no member changes its group.

The k-means method can be described formally as follows. Given a set of objects, denoted by T = {o1, o2, …, on}, each object oi is represented by a p-dimensional attribute vector oi = (ai1, ai2, …, aip), depicting the measured values of the p attributes A1, …, Ap, without any class label. Suppose that the cluster label can take k possible values, so that the output is a set of clusters C = {c1, c2, …, ck}. Algorithm 4.1 shows a pseudocode of the k-means method.

Algorithm 4.1. k-means Algorithm

Input: T is a dataset, where T = {o1, o2, …, on}, and k is the number of clusters.
Output: A set of clusters C = {c1, c2, …, ck}, where each element cj is a cluster with its members.
Procedure:
(1) FOREACH oi ∈ T { cluster(oi) ← random(1..k); }        // Randomly assign oi to a cluster
(2) WHILE some members change their groups {
(3)   FOREACH cj ∈ C { mj ← centroid(cj); }               // Calculate the centroid of each cluster
(4)   FOREACH oi ∈ T {
(5)     FOREACH cj ∈ C { dij ← d(oi, mj); }               // Calculate the distance to each cluster
(6)     j* ← argminj dij                                  // Select the best cluster for oi
(7)     cluster(oi) ← cj* ; }                             // Assign oi to the best cluster
(8) }
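The following Python sketch (not from the original text) follows the spirit of Algorithm 4.1: random initial assignment, then alternating centroid computation and re-assignment until no object changes its cluster. The toy data points are made up.

import math
import random

def kmeans(objects, k, max_iter=100, seed=0):
    # Minimal k-means sketch following Algorithm 4.1.
    rng = random.Random(seed)
    # (1) Randomly assign every object to one of the k clusters.
    assign = [rng.randrange(k) for _ in objects]
    centroids = []
    for _ in range(max_iter):
        # (3) Compute the centroid (mean point) of each cluster.
        centroids = []
        for j in range(k):
            members = [o for o, a in zip(objects, assign) if a == j]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append(list(rng.choice(objects)))   # re-seed an empty cluster
        # (5)-(7) Re-assign each object to the cluster with the nearest centroid.
        new_assign = []
        for o in objects:
            dists = [math.dist(o, c) for c in centroids]
            new_assign.append(dists.index(min(dists)))
        # (2) Stop when no member changes its group.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

data = [(1, 8), (2, 7), (2, 10), (3, 9), (8, 2), (9, 3), (10, 4), (11, 2)]
labels, centers = kmeans(data, k=2)
print(labels, centers)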


Figure 4-3: A graphical example of k-means clustering

For more clarity, Figure 4-3 and Figure 4-4 show a graphical example of the k-means algorithm and its calculation at each step, respectively. Compared with other methods, the k-means clustering method is simple and effective. Its principle is to swap the clusters' members among clusters until the total distance from each of the cluster's points to its center becomes minimal and no more swapping is needed. However, it is possible to have situations in which k-means fails to find a good clustering, as in Figure 4-5. This figure shows a local optimum due to improper initial clusters. We can view these four objects as arranged at the vertices of a rectangle in two-dimensional space. In the figure, the two initial clusters are A and B, where P1 and P3 are in cluster A, and P2 and P4 are grouped in cluster B. Graphically, the two initial cluster centers fall at the middle points of the long sides. This clustering result seems stable. However, the two natural clusters should be formed by grouping together the two vertices at either end of a short side; that is, P1 and P2 should be in cluster A, and P3 and P4 in cluster B. In the k-means method, the final clusters are quite sensitive to the initial cluster centers. Completely different clustering results may be obtained even if slight changes are made in the initial random cluster assignment. To increase the chance of finding a global minimum, one can execute the algorithm several times with different initial choices and choose the best final result, the one with the smallest total distance.


(Figure 4-4 tabulates the calculation for each of the four rounds: for every object, its x and y coordinates, its distance to the current centroids of clusters A and B, its assigned cluster, and the updated centroid coordinates of A and B after the round.)

Figure 4-4: Numerical calculation of the k-means clustering in Figure 4-3


Figure 4-5: Local optimum due to improper initial clusters.

Many variants of the basic k-means method have been developed. Some of them first produce a hierarchical clustering result (shown in the next section) with a cutting point at k groups and then perform k-means clustering on that result. However, all of these methods still leave open the question of how large k should be, and it is hard to estimate the likely number of clusters. One solution is to try different values of k and choose the best clustering result, i.e., the one with the largest intra-cluster similarity and the smallest inter-cluster similarity. Another solution for finding k is to begin with a few clusters and determine whether it is worth splitting them. For example, we can choose k = 2, perform k-means clustering until it terminates, and then consider splitting each cluster.
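As a sketch of this restart strategy (reusing the kmeans function, the data list, and the math import from the sketch after Algorithm 4.1; the number of restarts is arbitrary), one can keep the run with the smallest total within-cluster distance:

def total_within_cluster_distance(objects, labels, centroids):
    # Sum of distances from each object to the centroid of its own cluster.
    return sum(math.dist(o, centroids[c]) for o, c in zip(objects, labels))

best = None
for seed in range(10):                     # several different random initializations
    labels, centers = kmeans(data, k=2, seed=seed)
    score = total_within_cluster_distance(data, labels, centers)
    if best is None or score < best[0]:
        best = (score, labels, centers)

print("best total distance:", best[0])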

4.1.4. Hierarchical-based clustering

As one of the common clustering methods, hierarchical clustering groups data objects into a tree of clusters, without a predefined number of clusters. Its basic operation is to merge similar objects or object groups, or to split dissimilar objects or object groups into different groups. However, the most serious drawback of a pure hierarchical clustering method is its inability to reassign objects once a merge or split decision has been executed. If a particular merge or split decision later turns out to be wrong, the method cannot correct it. To address this, it is possible to incorporate some iterative relocation mechanism into the original version. In general, the two types of hierarchical clustering methods are agglomerative and divisive, depending on whether the hierarchical structure (tree) is formed in a bottom-up (merging) or top-down (splitting) style. In the bottom-up fashion, the agglomerative hierarchical clustering approach starts with each object in its own cluster and then successively merges these atomic clusters into larger and larger clusters, until all of the objects are included in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this type. However, there may be various definitions of intercluster similarity. On the other hand, in the top-down fashion, the divisive hierarchical clustering approach performs the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.


Agglomerative versus divisive hierarchical clustering

While two directions of hierarchical clusterings are top-down and bottom-up, it is possible to

represent both of them in the form of a tree structure. In general, such tree structure is called a

dendrogram. It is commonly used to represent the process of hierarchical clustering. It shows

how objects are grouped together step by step. Figure 4-6 shows a dendrogram for seven objects

in (a) agglomerative clustering and (b) divisive clustering. Here, Step 0 is the initial stage while

Step 6 is the final stage when a single cluster is constructed. The agglomerative hierarchical

clustering places each object into a cluster with its own. Then it tries to merge step-by-step

according to some criterion. For example, in Figure 4-6 (a) at the step 4, the agglomerative

hierarchical clustering method attempts to merge {a,b,c} with {d} and form {a,b,c,d} in the

bottom-up manner. It is also known as AGNES (AGglomerative NESting). In Figure 4-6 (b) at the

step 3, the divisive hierarchical clustering method tries to divide {e,f,g} into {e,f} and {g}. This

approach is also called DIANA (DIvisive ANAlysis). However, in either agglomerative or divisive

hierarchical clustering, the user can specify the desired number of clusters as a termination

condition. That is, it is possible to terminate at any step to obtain clustering results. If the user

requests three clusters, the agglomerative method will terminate at Step 4 while the divisive

method will terminate at Step 2.


Figure 4-6: Dendrogram: Agglomerative vs. Divisive Clustering
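For readers who prefer a library, the following sketch (an illustration, not part of the text) performs agglomerative clustering with SciPy and cuts the resulting dendrogram at three clusters, mirroring the AGNES example above; the seven toy points stand in for objects a-g and are made up.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Seven toy objects in two dimensions (hypothetical stand-ins for a-g).
X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 0.9],
              [2.0, 1.5],
              [5.0, 5.0], [5.1, 5.2], [6.0, 5.8]])

# Agglomerative (bottom-up) clustering; Z records every merge step of the dendrogram.
Z = linkage(X, method='single')

# Terminate early by requesting three clusters, as in the example above.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)   # one flat cluster label per object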

Distance measurement among clusters

As stated in Section 4.1.1, there are several possible distance definitions to express the distance between two single objects. However, in hierarchical clustering, an additional requirement is to define the distance between two clusters, each of which may include more than one object. Four widely used measures are single linkage, complete linkage, centroid comparison, and element comparison. Figure 4-7 shows a graphical representation of these four methods. The formulation of each measure can be defined as follows.

No. Cluster distance: Definition

1. Single linkage (minimum distance): $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$

2. Complete linkage (maximum distance): $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$

3. Centroid comparison (mean distance): $d_{mean}(C_i, C_j) = d(m_i, m_j)$, where $m_i$ is the centroid of $C_i$ and $m_j$ is the centroid of $C_j$.

4. Element comparison (average distance): $d_{avg}(C_i, C_j) = \dfrac{1}{n_i n_j}\sum_{p \in C_i}\sum_{q \in C_j} d(p, q)$, where $n_i$ and $n_j$ are the numbers of objects in $C_i$ and $C_j$.

Figure 4-7: Graphical representation of four definitions of cluster distances

Firstly, for the single linkage, if we use the minimum distance, $d_{min}$, to measure the distance between clusters, the algorithm is sometimes called a nearest-neighbor clustering algorithm. In this method, the clustering process is terminated when the distance between the nearest clusters exceeds an arbitrary threshold. It is possible to view the data points as nodes of a graph, where edges form a path between the nodes in a cluster. When two clusters, $C_i$ and $C_j$, are merged, an edge is added between the nearest pair of nodes in $C_i$ and $C_j$. This merging process results in a tree-like graph. An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also known as a minimal spanning tree algorithm.


Secondly, for the complete linkage, an algorithm uses the maximum distance, $d_{max}$, to measure the distance between clusters. Also called a farthest-neighbor clustering algorithm, the clustering process is terminated when the maximum distance between the nearest clusters exceeds a predefined threshold. In this method, each cluster can be viewed as a complete subgraph in which edges connect all of the nodes in the cluster. The distance between two clusters is determined by the most distant nodes in the two clusters. These farthest-neighbor algorithms aim to increase the diameter of the clusters as little as possible at each iteration. This approach performs well when the true clusters are compact and approximately equal in size; otherwise, the clusters produced can be meaningless. The nearest-neighbor clustering and the farthest-neighbor clustering represent two extreme ways of defining the distance between clusters, and both are quite sensitive to outliers and noisy data.

Rather than these two methods, it is sometimes better to use the mean or average distance instead, in order to compromise between the minimum and maximum distances and to reduce sensitivity to outliers and noise. The mean distance comes from calculating the centroid of each cluster and then measuring the distance between a pair of centroids. Also known as centroid comparison, this method is computationally simple and cheap. Every time two clusters are merged (for agglomerative clustering) or a cluster is split (for divisive clustering), a new centroid is calculated for the newly merged cluster, or two centroids are calculated for the newly split clusters. However, agglomerative clustering is slightly simpler than divisive clustering, since a so-called weighted combination technique can be used to calculate the centroid of a newly merged cluster, whereas this cannot be applied in the divisive approach. For both methods, it is necessary to calculate the distance between the new centroid(s) and the other centroids.

Compared to the centroid comparison, the element comparison calculates the distance between two clusters by finding the average distance over all pairs of elements in those two clusters. This is much more computationally expensive than the centroid comparison. However, the average distance is advantageous in that it can handle categorical as well as numeric data: the mean vector for categorical data can be difficult or impossible to define, but the average distance can still be computed.
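The four cluster-distance definitions can be computed directly, as in the following sketch (not from the text; the two toy clusters are made up):

import math
from itertools import product

def centroid(cluster):
    # Mean point of a cluster.
    return [sum(col) / len(cluster) for col in zip(*cluster)]

ci = [(1.0, 2.0), (2.0, 2.0), (1.5, 3.0)]      # cluster C_i
cj = [(8.0, 8.0), (9.0, 7.5)]                  # cluster C_j

pair_dists = [math.dist(p, q) for p, q in product(ci, cj)]
print("single linkage  (min):", min(pair_dists))
print("complete linkage(max):", max(pair_dists))
print("centroid (mean)      :", math.dist(centroid(ci), centroid(cj)))
print("element (average)    :", sum(pair_dists) / len(pair_dists))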

Problems in the hierarchical approach

While hierarchical clustering is simple, it has a drawback in how to select points to merge or split.

Each merging or splitting step is important since the next step will proceed based on the newly

generated clusters and it will never reconsider the result of the previous steps or swap objects

between clusters. If the previous merge or split decisions are not well determined, low-quality

clusters may be generated. To solve this problem, it is possible to perform multiple-phase

clustering by incorporating other clustering techniques into hierarchical clustering. Three

common methods are BIRCH, ROCK, and Chameleon. BIRCH partitions objects hierarchically using tree structures whose leaf nodes (or low-level nonleaf nodes) are treated as microclusters, depending on the scale of resolution, and then applies other clustering algorithms to perform macroclustering on the microclusters. ROCK merges clusters based on their interconnectivity after hierarchical clustering. Chameleon explores dynamic modeling in

hierarchical clustering.

4.1.5. Density-based clustering

Unlike partition-based or hierarchical-based clustering which tends to discover clusters with a

spherical shape, density-based clustering methods have been designed to discover clusters with

arbitrary shape. In this approach, dense regions of objects in the data space will be separated by

regions of low density. As density-based methods, DBSCAN grows clusters according to a density-


based connectivity analysis. OPTICS extends DBSCAN to produce a cluster ordering obtained

from a wide range of parameter settings. DENCLUE clusters objects based on a set of density

distribution functions. In this book, we describe DBSCAN.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN is a density-based clustering algorithm that groups regions containing objects with sufficiently high density into clusters. By this nature, it discovers clusters of arbitrary shape in spatial databases with noise. In this method, a cluster is defined as a maximal set of density-connected points. The following are definitions of the concepts used in the method.

ε-neighborhood: The set of objects within a radius ε of a given object is called the ε-neighborhood of the object.

Core object: When an object has an ε-neighborhood containing at least a minimum number of objects, MinObjs, it is called a core object.

Directly density-reachable: Given a set of objects D, an object p is directly density-reachable from an object q if p is within the ε-neighborhood of q, and q is a core object.

Density-reachable: An object p is density-reachable from an object q with respect to ε and MinObjs in a set of objects D if there is a chain of objects p1, …, pm, where p1 = q and pm = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinObjs, for 1 ≤ i ≤ m−1.

Density-connected: An object p is density-connected to an object q with respect to ε and MinObjs in a set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinObjs.

A density-based cluster: A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.

Note that density reachability is the transitive closure of direct density reachability, and this

relationship is asymmetric. Only core objects are mutually density reachable. Density

connectivity, however, is a symmetric relation.
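As a small illustration of the ε-neighborhood and core-object definitions (the coordinates, ε, and MinObjs below are made up and are not the data of Figure 4-8):

import math

def eps_neighborhood(points, idx, eps):
    # All other points within radius eps of points[idx].
    return [j for j, q in enumerate(points) if j != idx and math.dist(points[idx], q) <= eps]

points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.0), (1.1, 0.8), (5.0, 5.0)]
eps, min_objs = 0.5, 3

for i in range(len(points)):
    neighbors = eps_neighborhood(points, i, eps)
    status = "core" if len(neighbors) >= min_objs else "not core"
    print(i, neighbors, status)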

Figure 4-8: An example of density-based clustering


For example, given the twenty objects (a - t) shown in Figure 4-8, the neighbor elements for each object are shown in the left list in the figure. They are derived based on the chosen radius ε. Moreover, the core objects (marked with '*') are a, b, d, f, g, i, j, k, m, o, p, q, and s, since each of them has at least three neighbors (MinObjs = 3).

The directly density-reachable objects of each core object (q) can be listed as follows.

Core object (q): directly density-reachable objects
a*: b, c, d
b*: a, d, f
d*: a, b, c, f, h
f*: b, d, h
g*: e, i, j, k
i*: g, j, k
j*: e, g, i, k, m
k*: i, j, m, o, p
m*: k, o, p
o*: k, m, p, q, s
p*: k, m, o, q
q*: o, p, s, t
s*: o, q, t

The density reachable objects of each core object (q) can be listed as follows.

Core object (q): density-reachable objects
a*: b, c, d, f, h
b*: a, c, d, f, h
d*: a, b, c, f, h
f*: a, b, c, d, h
g*: e, i, j, k, m, o, p, q, s, t
i*: e, g, j, k, m, o, p, q, s, t
j*: e, g, i, k, m, o, p, q, s, t
k*: e, g, i, j, m, o, p, q, s, t
m*: e, g, i, j, k, o, p, q, s, t
o*: e, g, i, j, k, m, p, q, s, t
p*: e, g, i, j, k, m, o, q, s, t
q*: e, g, i, j, k, m, o, p, s, t
s*: e, g, i, j, k, m, o, p, q, t

The clustering result can be defined by the density-connected property. In this example, the result is two clusters, as listed below, and the objects that become noise are l, n and r.

Cluster 1: a, b, c, d, f, h
Cluster 2: e, g, i, j, k, m, o, p, q, s, t

When we apply a spatial index, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Without any index, it is O(n^2). With appropriate settings of the user-defined parameters ε and MinObjs, the algorithm is effective at finding arbitrary-shaped clusters.
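To make the procedure concrete, the following is a minimal Python sketch of the DBSCAN idea as defined above (not the optimized, index-based implementation). The names dbscan, eps, min_objs and dist are our own illustrative choices, and dist can be any distance function, e.g. Euclidean distance.

    from collections import deque

    def dbscan(points, eps, min_objs, dist):
        """Minimal DBSCAN sketch: points is a dict {name: coordinates},
        dist is a distance function; returns (cluster labels, noise set)."""
        # epsilon-neighborhood of each object (the object itself is not counted,
        # following the definitions in the text)
        neighbors = {p: [q for q in points
                         if q != p and dist(points[p], points[q]) <= eps]
                     for p in points}
        core = {p for p in points if len(neighbors[p]) >= min_objs}

        labels, cluster_id = {}, 0
        for p in core:
            if p in labels:
                continue
            cluster_id += 1                  # start a new cluster from an unassigned core object
            queue = deque([p])
            while queue:                     # expand by direct density-reachability
                q = queue.popleft()
                if q in labels:
                    continue
                labels[q] = cluster_id
                if q in core:                # only core objects propagate reachability
                    queue.extend(neighbors[q])
        noise = set(points) - set(labels)    # objects in no cluster are noise
        return labels, noise

Applied to the twenty objects of Figure 4-8 with the same ε and MinObjs = 3, such a sketch reproduces the two clusters and the noise objects l, n and r listed above.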

4.1.6. Grid-based clustering

In contrast with partition-based, hierarchical-based and density-based methods, the grid-based clustering approach uses a multiresolution grid data structure to quantize the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed. This approach aims to improve the processing time. Its time complexity is typically independent of the number of data objects; instead, it depends on the number of cells in each dimension of the quantized space. Some typical grid-based methods are STING, WaveCluster and CLIQUE. Here, STING explores statistical information stored in the grid cells, WaveCluster clusters objects using a wavelet transformation method, and CLIQUE represents a grid- and density-based approach for clustering in high-dimensional data space.

STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the spatial area is divided

into rectangular cells. There are usually several levels of such rectangular cells corresponding to


different levels of resolution, and these cells form a hierarchical structure: each cell at a high level

is partitioned to form a number of cells at the next lower level. Statistical information regarding

the attributes in each grid cell (such as the mean, maximum, and minimum values) is

precomputed and stored. These statistical parameters are useful for query processing, as

described below.

Figure 4-9: Grid structure in grid-based clustering

Figure 4-9 shows a hierarchical structure for STING clustering. Statistical parameters and

characteristics of higher-level cells can easily be computed from those of the lower-level cells.

These parameters can be the attribute-independent parameters such as count; the attribute-

dependent parameters such as mean, stdev (standard deviation), min (minimum), max

(maximum); and the distribution type of the attribute value in the cell such as normal, uniform,

exponential, or none (if the distribution is unknown). When the data are loaded into the database,

the parameters (e.g., count, mean, stdev, min, and max) of the bottom-level cells can be calculated

directly from the data.

Since STING uses a multiresolution approach to cluster analysis, the quality of STING

clustering depends on the granularity of the lowest level of the grid structure. When the

granularity is too fine, the cost of processing will increase substantially. On the other hand, when

the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis.

Because STING does not take the spatial relationship between the children and their neighboring cells into consideration when constructing a parent cell, the shapes of the resultant clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. Owing to this characteristic, the quality and accuracy of the clusters may be lower, as a trade-off for the fast processing time.
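The following Python sketch illustrates the bottom level of a STING-like grid: quantizing 2-D points into cells, precomputing per-cell statistics, and aggregating them into parent cells. It is only an illustration of the statistics idea under our own assumptions (the names build_grid_stats, merge_to_parent and cell_size are illustrative), not the full STING query-processing algorithm, and it omits the distribution-type parameter.

    import math
    from collections import defaultdict

    def build_grid_stats(points, cell_size):
        """Quantize 2-D points into square cells and precompute count/mean/min/max."""
        cells = defaultdict(list)
        for x, y in points:
            cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append((x, y))
        stats = {}
        for cell, pts in cells.items():
            xs = [p[0] for p in pts]
            ys = [p[1] for p in pts]
            stats[cell] = {"count": len(pts),
                           "mean": (sum(xs) / len(pts), sum(ys) / len(pts)),
                           "min": (min(xs), min(ys)),
                           "max": (max(xs), max(ys))}
        return stats

    def merge_to_parent(stats):
        """Aggregate 2x2 groups of child cells into parent cells, showing how
        higher-level parameters are computed from lower-level ones."""
        parents = defaultdict(lambda: {"count": 0, "sum": [0.0, 0.0],
                                       "min": [math.inf, math.inf],
                                       "max": [-math.inf, -math.inf]})
        for (i, j), s in stats.items():
            p = parents[(i // 2, j // 2)]
            p["count"] += s["count"]
            p["sum"][0] += s["mean"][0] * s["count"]
            p["sum"][1] += s["mean"][1] * s["count"]
            p["min"] = [min(p["min"][0], s["min"][0]), min(p["min"][1], s["min"][1])]
            p["max"] = [max(p["max"][0], s["max"][0]), max(p["max"][1], s["max"][1])]
        return {c: {"count": p["count"],
                    "mean": (p["sum"][0] / p["count"], p["sum"][1] / p["count"]),
                    "min": tuple(p["min"]), "max": tuple(p["max"])}
                for c, p in parents.items()}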

4.1.7. Model-based clustering

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. Some well-known model-based methods are EM, COBWEB and SOM.


EM Algorithm: Expectation-Maximization method

The EM (Expectation-Maximization) algorithm (Algorithm 4.2) is a popular iterative refinement algorithm used in several applications, such as speech recognition and image processing. It was developed to estimate suitable values of unknown parameters by iteratively maximizing the expected likelihood.

Algorithm 4.2. The EM Algorithm

1. Initialization step: To obtain the seed for probability calculation, we start by making an initial guess of the parameter vector. There are several possible choices for this. As one simple approach, the method randomly partitions the objects into k groups and then, for each group (cluster), calculates its mean (its center), similar to k-means partitioning.

2. Repetition step: To improve the initial cluster parameters, we iteratively refine the parameters (or clusters) using two steps: an expectation step and a maximization step.

(a) Expectation Step

The probability that each object x_i belongs to a cluster C_k is defined as follows.

    P(C_k | x_i) = P(C_k) P(x_i | C_k) / P(x_i)

Here, P(x_i | C_k) is the probability that the object x_i occurs in the cluster C_k. We can define it as N(x_i; m_k, s_k), which follows the normal distribution (i.e., Gaussian distribution) with the mean of the cluster (m_k) and the standard deviation of the cluster (s_k). P(C_k) is the prior probability that the cluster C_k will occur; without bias, we can set all clusters to have the same prior probability. The denominator P(x_i) does not depend on any cluster and can therefore be ignored when comparing clusters. Finally, it is possible to use only P(x_i | C_k) P(C_k) as the (unnormalized) probability that the object x_i belongs to the cluster C_k.

(b) Maximization Step

Using the above probability estimates, we can re-estimate or refine the model parameters. For example, the mean of each cluster is re-estimated as

    m_k = Σ_i P(C_k | x_i) x_i / Σ_i P(C_k | x_i),

and the standard deviation is re-estimated analogously from the weighted deviations. As its name suggests, this step maximizes the likelihood of the distributions given the data.


Although there have been several variants of the EM method, most of them can be viewed as an extension of the k-means algorithm, which assigns an object to the cluster that is the closest to it, based on the cluster mean or the cluster representative. Instead of assigning each object to a single dedicated cluster, the EM method assigns each object to a cluster according to a weight representing the probability of membership. That is, no rigid boundaries between clusters are defined. Afterwards, new means are computed based on weighted measures.

Since we do not know beforehand which objects should be grouped together, EM begins with an initial estimate (or guess) for each parameter in the mixture model (collectively referred to as the parameter vector). The parameters can be set by randomly grouping objects into k clusters, or by simply selecting k objects from the data set to serve as the cluster means. After this initial setting, the EM algorithm iteratively rescores the objects against the mixture density produced by the parameter vector. The rescored objects are then used to update the parameter estimates. During calculation, each object is assigned a probability of how likely it is to belong to a given cluster. The details of the algorithm were given in Algorithm 4.2 above.

While the EM algorithm is simple, easy to implement and converges quickly, it sometimes falls into a local optimum. Convergence is guaranteed for certain forms of optimization functions, with a computational complexity of O(d n t), where d is the number of input features, n is the number of objects, and t is the number of iterations.
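As an illustration of the two steps, the following is a compact Python sketch of EM for k one-dimensional Gaussian clusters with equal priors. It is a sketch under these simplifying assumptions, and the names em_gaussian_1d, data and iterations are ours, chosen for illustration.

    import math
    import random

    def em_gaussian_1d(data, k, iterations=50):
        """EM sketch for k one-dimensional Gaussian clusters with equal priors;
        returns the estimated means and standard deviations."""
        means = random.sample(list(data), k)      # initial guess: k objects as the means
        stds = [1.0] * k

        def density(x, mu, sigma):                # Gaussian density N(x; mu, sigma)
            sigma = max(sigma, 1e-6)
            return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

        for _ in range(iterations):
            # Expectation: membership weight of each object for each cluster
            weights = []
            for x in data:
                p = [density(x, means[j], stds[j]) for j in range(k)]
                total = sum(p) or 1e-12
                weights.append([pj / total for pj in p])
            # Maximization: re-estimate means and standard deviations from the weights
            for j in range(k):
                wj = [w[j] for w in weights]
                total = sum(wj) or 1e-12
                means[j] = sum(w * x for w, x in zip(wj, data)) / total
                var = sum(w * (x - means[j]) ** 2 for w, x in zip(wj, data)) / total
                stds[j] = math.sqrt(var)
        return means, stds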

Sometimes known as Bayesian clustering, such methods focus on the computation of class-conditional probability density. They are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data.

Conceptual Clustering

Unlike conventional clustering which does not focus on detailed description of a cluster,

conceptual clustering forms a conceptual tree by also considering characteristic descriptions for

each group, where each group corresponds to a node (concept or class) in the tree. In other

words, conceptual clustering has two steps; clustering and characterization. Clustering quality is

not solely a function of the individual objects but also the generality and simplicity of the derived

concept descriptions. Most conceptual clustering methods uses probability measurements to

determine the concepts or clusters. As an example of this type, COBWEB is a popular and simple

method of incremental conceptual clustering. Its input objects are expressed by categorical

attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.

Figure 4-10 shows an example of a classification tree for a set of animal data. A classification

tree differs from a decision tree in the sense that intermediate nodes in a classification tree specify a concept, whereas those in a decision tree indicate an attribute test. In a classification tree, each node refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. In summary, COBWEB works as follows. Given a set of objects, each object is represented by an n-dimensional attribute vector depicting the measured values of n attributes A_1, A_2, ..., A_n of the object. The Bayesian (statistical) classifier assigns (or predicts) a class C_k to an object when that class has the highest posterior probability P(C_k | x) over the others, conditioned on the object's attribute values x = (x_1, x_2, ..., x_n). That is, the Bayesian classifier predicts that the object belongs to the class with the maximum posterior probability.


Figure 4-10: A classification tree. This figure is based on (Fisher, 1987)

The probabilistic description includes the probability of the concept, P(C_k), and conditional probabilities of the form P(A_i = v_ij | C_k), where A_i = v_ij is an attribute-value pair (that is, the i-th attribute takes its j-th possible value) and C_k is the concept class. Normally, the

counts are accumulated and stored at each node for probability calculation. The sibling nodes at

a given level of a classification tree form a number of partitions. To classify an object using a

classification tree, a partial matching function is employed to descend the tree along a path of

best matching nodes. COBWEB uses a heuristic evaluation measure called category utility to

guide construction of the tree. For a partition {C_1, C_2, ..., C_n}, category utility (CU) is defined as follows.

    CU = (1/n) Σ_k P(C_k) [ Σ_i Σ_j P(A_i = v_ij | C_k)^2 - Σ_i Σ_j P(A_i = v_ij)^2 ]

where n is the number of nodes (also called concepts, or categories) forming the partition {C_1, C_2, ..., C_n} at the given level of the tree. In other words, category utility measures the increase in the expected number of attribute values that can be correctly guessed given the partition, Σ_i Σ_j P(A_i = v_ij | C_k)^2, over the expected number of correct guesses with no such knowledge, Σ_i Σ_j P(A_i = v_ij)^2.

As an incremental approach, when a new object comes to COBWEB, it descends the tree

along an appropriate path, updates counts along the way and tries to search for the best node to

place the object in. This decision is done by selecting the situation that has the highest category

utility of the resulting partition. Indeed, COBWEB also computes the category utility of the

partition that would result if a new node were to be created for the object. Therefore, the object

is placed in an existing class, or a new class is created for it, based on the partition with the

highest category utility value. COBWEB has the ability to automatically adjust the number of

classes in a partition. That is, there is no need to specify the number of clusters in advance, as k-means or hierarchical clustering requires.
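The following Python sketch shows how the category utility of a given partition can be computed from attribute-value counts. The function name category_utility and the small animal-like records are hypothetical illustrations, not taken from Figure 4-10.

    from collections import Counter

    def category_utility(partition, attributes):
        """Category utility of a partition given as a list of clusters, where each
        cluster is a list of objects and each object is a dict {attribute: value}."""
        objects = [obj for cluster in partition for obj in cluster]
        n_total = len(objects)

        def guess_score(objs):
            # expected number of correctly guessed attribute values: sum of P(A_i = v_ij)^2
            score = 0.0
            for attr in attributes:
                counts = Counter(obj[attr] for obj in objs)
                score += sum((c / len(objs)) ** 2 for c in counts.values())
            return score

        baseline = guess_score(objects)            # with no knowledge of the partition
        cu = 0.0
        for cluster in partition:
            p_cluster = len(cluster) / n_total
            cu += p_cluster * (guess_score(cluster) - baseline)
        return cu / len(partition)

    # Hypothetical example records:
    birds = [{"cover": "feathers", "legs": 2}, {"cover": "feathers", "legs": 2}]
    mammals = [{"cover": "fur", "legs": 4}, {"cover": "fur", "legs": 2}]
    print(category_utility([birds, mammals], ["cover", "legs"]))

COBWEB evaluates such a score for each candidate placement of an incoming object (including the creation of a new node) and keeps the placement with the highest value.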

However, the COBWEB operators are highly sensitive to the input order of the objects. To address this problem, COBWEB has two additional operators, called merging and splitting. When an object is incorporated, the two best hosts are considered for merging into a single class. Moreover, COBWEB considers splitting the children of the best host among the existing categories, based on category utility. The merging and splitting operators implement a bidirectional search: a merge can undo a previous split, and a split can likewise be undone by a later merge.


However, COBWEB still has a number of limitations. Firstly, it assumes that the probability distributions of separate attributes are statistically independent of one another, which is not always true. Secondly, it is expensive to store the probability distribution representation of clusters,

especially when the attributes have a large number of values. The time and space complexities

depend not only on the number of attributes, but also on the number of values for each attribute.

Moreover, the classification tree is not height-balanced for skewed input data, which may cause

the time and space complexity to degrade dramatically.

As an extension to COBWEB, CLASSIT deals with continuous (or real-valued) data. It stores a

continuous normal distribution (i.e., mean and standard deviation) for each individual attribute

in each node and applies a generalized category utility measure by an integral over continuous

attributes instead of a sum over discrete attributes as in COBWEB. While conceptual clustering is

popular in the machine learning community, both COBWEB and CLASSIT suffer when clustering large databases.

4.2. Association Analysis and Frequent Pattern Mining

Another form of knowledge that we can mine from data is frequent patterns or associations

which include frequent itemsets, subsequences, substructures and association rules. For example,

in a supermarket database, a set of items such as bread and butter is likely to appear frequently together. Such a set is called a frequent itemset. Moreover, in an electronics shop database, there may be a frequent subsequence in which a PC, then a digital camera, and then a memory card are bought in that order. This is called a frequent subsequence. A substructure is a more complex

pattern. It can refer to different structural forms, such as subgraphs, subtrees, or sublattices,

which may be combined with itemsets or subsequences. If a substructure is found frequently,

that substructure is called a frequent structured pattern. Such frequent patterns are important in

mining associations, correlations, and many other interesting relationships among data.

Moreover, again in a supermarket database, if a customer buys ice, he or she is likely to also buy water. This relationship is called an association rule.

After performing frequent itemset mining, we can use the result to discover associations and

correlations among items in large transactional or relational data sets. The discovery of

interesting correlation relationships among huge amounts of business transaction records can

help in many business decision-making processes, such as catalog design, cross-marketing, and

customer shopping behavior analysis. A typical example of frequent itemset mining and

association rule mining is market basket analysis. Its formulation can be summarized as follows.

Let I = {i1, i2, ..., im} be a set of possible items, T = {t1, t2, ..., tn} be a set of database transactions where each transaction t ⊆ I includes a number of items, and an itemset A ⊆ I be a set of items. A transaction t is said to contain the itemset A if and only if A ⊆ t. Let T_A be the set of the transactions that contain A. The support of an itemset A can be defined as the number of transactions that include all items in A, divided by the total number of transactions. It corresponds to the probability that A will occur in a transaction (i.e., P(A)):

    support(A) = |T_A| / |T| = P(A)

An itemset A is called a frequent itemset if and only if the support of A is greater than or equal to a threshold called the minimum support (minsup), i.e., support(A) ≥ minsup. Note that an itemset represents a set of items. When an itemset contains k items, it is called a k-itemset. For example, in a supermarket database, a set such as {bread, butter} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, and the support is its ratio compared to the total number of transactions.


An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅ (i.e., A and B are two non-overlapping itemsets). The support of the rule is denoted by support(A ⇒ B). It corresponds to support(A ∪ B), the support of the union of the itemsets A and B. This is taken to be the probability P(A ∪ B). Moreover, as another measure, the confidence of the rule is denoted by confidence(A ⇒ B), specifying the percentage of transactions in T containing A that also contain B. It corresponds to the conditional probability P(B | A). Its formal description is as follows.

    confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)

An association rule is called a frequent rule if and only if A ∪ B is a frequent itemset and its confidence is greater than or equal to a threshold called the minimum confidence (minconf), i.e., confidence(A ⇒ B) ≥ minconf.

Besides support and confidence, another important measure is lift. Theoretically, if the value of lift is lower than one, i.e., lift(A ⇒ B) < 1, the probability that the conclusion (B) will occur under the condition (A), i.e., P(B | A), is lower than the probability of the conclusion without the precondition, i.e., P(B). Such a rule is meaningless. Therefore, we always expect an association rule with a lift larger than or equal to a threshold called the minimum lift, minlift (typically at least 1). The formal description of lift is shown below.

    lift(A ⇒ B) = P(B | A) / P(B) = confidence(A ⇒ B) / support(B)
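The three measures can be computed directly from a transaction list. The following Python helpers are an illustrative sketch (the function names support, confidence and lift are ours); the transactions are the six retailing transactions of Figure 4-11 (a).

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of the itemset."""
        itemset = set(itemset)
        return sum(itemset <= set(t) for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """confidence(A => B) = support(A u B) / support(A)."""
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    def lift(lhs, rhs, transactions):
        """lift(A => B) = confidence(A => B) / support(B)."""
        return confidence(lhs, rhs, transactions) / support(rhs, transactions)

    # The six retailing transactions of Figure 4-11 (a):
    T = [
        {"coke", "ice", "paper", "shoes", "water"},
        {"ice", "orange", "shirt", "water"},
        {"paper", "shirt", "water"},
        {"coke", "orange", "paper", "shirt", "water"},
        {"ice", "orange", "shirt", "shoes", "water"},
        {"paper", "shirt", "water"},
    ]
    print(support({"orange", "shirt"}, T))        # 0.5
    print(confidence({"orange"}, {"shirt"}, T))   # 1.0
    print(lift({"orange"}, {"shirt"}, T))         # 1.2

These values match the entries for the rule orange ⇒ shirt in Figure 4-11 (c).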

In general, as a process to find frequent association rules, association rule mining can be

viewed as a two-step process.

1. Find all frequent itemsets.

2. Generate strong association rules from the frequent itemsets.

The first step finds, from the transactional database, the set of frequent itemsets which occur at least as frequently as a predetermined minimum support (minsup), while the second step generates strong association rules from the frequent itemsets obtained in the first step. The strong association rules are required to have support no less than the minimum support (minsup), confidence no less than the minimum confidence (minconf), and lift no less than the minimum lift (minlift).

Since, in principle, the second step is much less costly than the first, the overall performance of mining association rules is normally dominated by the first step. One research issue in mining frequent itemsets from a large data set is that a huge number of itemsets may be generated when a low minimum support is used. Furthermore, given a frequent itemset, each of its subsets is frequent as well, so a long itemset contains a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset with a length of 30 includes up to

    C(30,1) + C(30,2) + ... + C(30,30) = 2^30 - 1 ≈ 1.07 x 10^9

frequent sub-itemsets. The first term, C(30,1) = 30, comes from the frequent 1-itemsets, the second term, C(30,2) = 435, from the frequent 2-itemsets, and so on. To solve the problem of such an extremely large number of itemsets, the concepts of closed frequent itemset and maximal frequent itemset are introduced. Here, we describe frequent itemsets and association rules as well as closed frequent itemsets and maximal frequent itemsets, using the example in Figure 4-11.


Transaction ID | Items
1 | coke, ice, paper, shoes, water
2 | ice, orange, shirt, water
3 | paper, shirt, water
4 | coke, orange, paper, shirt, water
5 | ice, orange, shirt, shoes, water
6 | paper, shirt, water

(a) A toy example of a transactional database for retailing

Itemset | Trans | Freq.
coke | 14 | 2
ice | 125 | 3
orange | 245 | 3
paper | 1346 | 4
shirt | 23456 | 5
shoes | 15 | 2
water | 123456 | 6
ice, orange | 25 | 2
ice, paper | 1 | 1
ice, shirt | 25 | 2
ice, water | 125 | 3
orange, paper | 4 | 1
orange, shirt | 245 | 3
orange, water | 245 | 3
paper, shirt | 346 | 3
paper, water | 1346 | 4
shirt, water | 23456 | 5
orange, shirt, water | 245 | 3
paper, shirt, water | 346 | 3

(b) Frequent itemsets (minimum support = 3, i.e., 50%); itemsets with a frequency below 3 are infrequent

Here, the candidate association rules derived from the frequent itemsets, with their supports, confidences and lifts, are listed below.

No. | Rule | Support | Confidence | Lift
1 | ice ⇒ water | 3/6=0.50 | 3/3=1.00 | (3/3)/(6/6)=1.0
2 | water ⇒ ice | 3/6=0.50 | 3/6=0.50 | (3/6)/(3/6)=1.0
3 | orange ⇒ shirt | 3/6=0.50 | 3/3=1.00 | (3/3)/(5/6)=1.2
4 | shirt ⇒ orange | 3/6=0.50 | 3/5=0.60 | (3/5)/(3/6)=1.2
5 | orange ⇒ water | 3/6=0.50 | 3/3=1.00 | (3/3)/(6/6)=1.0
6 | water ⇒ orange | 3/6=0.50 | 3/6=0.50 | (3/6)/(3/6)=1.0
7 | paper ⇒ shirt | 3/6=0.50 | 3/4=0.75 | (3/4)/(5/6)=0.9
8 | shirt ⇒ paper | 3/6=0.50 | 3/5=0.60 | (3/5)/(4/6)=0.9
9 | paper ⇒ water | 4/6=0.67 | 4/4=1.00 | (4/4)/(6/6)=1.0
10 | water ⇒ paper | 4/6=0.67 | 4/6=0.67 | (4/6)/(4/6)=1.0
11 | shirt ⇒ water | 5/6=0.83 | 5/5=1.00 | (5/5)/(6/6)=1.0
12 | water ⇒ shirt | 5/6=0.83 | 5/6=0.83 | (5/6)/(5/6)=1.0
13 | orange, shirt ⇒ water | 3/6=0.50 | 3/3=1.00 | (3/3)/(6/6)=1.0
14 | orange, water ⇒ shirt | 3/6=0.50 | 3/4=0.75 | (3/4)/(5/6)=0.9
15 | shirt, water ⇒ orange | 3/6=0.50 | 3/5=0.60 | (3/5)/(3/6)=1.2
16 | water ⇒ orange, shirt | 3/6=0.50 | 3/6=0.50 | (3/6)/(3/6)=1.0
17 | shirt ⇒ orange, water | 3/6=0.50 | 3/5=0.60 | (3/5)/(3/6)=1.2
18 | orange ⇒ shirt, water | 3/6=0.50 | 3/3=1.00 | (3/3)/(5/6)=1.2
19 | paper, shirt ⇒ water | 3/6=0.50 | 3/3=1.00 | (3/3)/(6/6)=1.0
20 | paper, water ⇒ shirt | 3/6=0.50 | 3/4=0.75 | (3/4)/(5/6)=0.9
21 | shirt, water ⇒ paper | 3/6=0.50 | 3/5=0.60 | (3/5)/(4/6)=0.9
22 | water ⇒ paper, shirt | 3/6=0.50 | 3/6=0.50 | (3/6)/(3/6)=1.0
23 | shirt ⇒ paper, water | 3/6=0.50 | 3/5=0.60 | (3/5)/(4/6)=0.9
24 | paper ⇒ shirt, water | 3/6=0.50 | 3/4=0.75 | (3/4)/(5/6)=0.9

(c) Association rules (minimum confidence = 66.67%)

Figure 4-11: Frequent itemsets and frequent rules (association rules)


Given the transaction database in Figure 4-11 (a), we can find a set of frequent itemsets shown in

Figure 4-11 (b) when the minimum support is set to 0.5. The set of frequent rules (association

rules) is found as displayed in Figure 4-11 (c) when the minimum confidence is set to 0.66.

Moreover, a valid rule needs to have a lift of at least 1.0. In this example, the thirteen frequent

itemsets can be summarized in Figure 4-12.

No. Frequent Itemset Transaction Set Support

1. ice 125 3/6

2. orange 245 3/6

3. paper 1346 4/6

4. shirt 23456 5/6

5. water 123456 6/6

6. ice, water 125 3/6

7. orange, shirt 245 3/6

8. orange, water 245 3/6

9. paper, shirt 346 3/6

10. paper, water 1346 4/6

11. shirt, water 23456 5/6

12. orange, shirt, water 245 3/6

13. paper, shirt, water 346 3/6

Figure 4-12: Summary of frequent itemsets with their transaction sets and supports.

The concepts of closed frequent itemsets and maximal frequent itemsets are defined as follows.

[Closed Frequent Itemset]

An itemset X is closed in a data set S if there exists no proper super-itemset Y (X ⊂ Y) such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.

[Maximal Frequent Itemset]

An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.

Normally, it is possible to recover the whole set of frequent itemsets, together with their supports, from the set of closed frequent itemsets, but this is not possible from the set of maximal frequent itemsets. The set of closed frequent itemsets contains complete information regarding its corresponding frequent itemsets, whereas the set of maximal frequent itemsets registers only the support of the maximal frequent itemsets, so the supports of their frequent subsets cannot be recovered exactly.
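For small databases, both notions can be checked directly by brute force. The following Python sketch (function names frequent_itemsets and closed_and_maximal are ours, and the enumeration is only feasible for toy examples) tests each frequent itemset against its frequent supersets.

    from itertools import combinations

    def frequent_itemsets(transactions, minsup_count):
        """Enumerate all frequent itemsets of a small database by brute force,
        returning a dict {frozenset: support count}."""
        items = sorted(set().union(*transactions))
        freq = {}
        for k in range(1, len(items) + 1):
            for candidate in combinations(items, k):
                c = frozenset(candidate)
                count = sum(c <= t for t in transactions)
                if count >= minsup_count:
                    freq[c] = count
        return freq

    def closed_and_maximal(freq):
        """Closed: no proper superset with the same support.
        Maximal: no proper superset that is frequent at all."""
        closed, maximal = set(), set()
        for x, sup in freq.items():
            supersets = [y for y in freq if x < y]
            if not any(freq[y] == sup for y in supersets):
                closed.add(x)
            if not supersets:
                maximal.add(x)
        return closed, maximal

Applied to the transactions of Figure 4-11 (a) with a minimum support count of 3, this reproduces the six closed frequent itemsets listed in Figure 4-13.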

Using the above example, the closed frequent itemsets are shown in Figure 4-13. Here, the frequent itemsets which are not closed are indicated by strikethrough. There are six closed frequent itemsets. The column 'Transaction Set' indicates the set of transactions that include the frequent itemset. For example, the frequent itemset {orange, water} is contained in the 2nd, 4th and 5th transactions of Figure 4-11 (a). Moreover, the frequent itemset {orange} is not closed since it has the same transaction set as its superset, the frequent itemset {orange, shirt, water}. Note that the smallest closed itemset that includes the frequent itemset {orange} is {orange, shirt, water}; therefore, the closure of {orange} is {orange, shirt, water}.


No. No. (closed) Frequent Itemset Transaction Set Support

1. ice 125 3/6

2. orange 245 3/6

3. paper 1346 4/6

4. shirt 23456 5/6

5. 1 water 123456 6/6

6. 2 (ice, water), (ice) 125 3/6

7. orange, shirt 245 3/6

8. orange, water 245 3/6

9. paper, shirt 346 3/6

10. 3 (paper, water), (paper) 1346 4/6

11. 4 (shirt, water), (shirt) 23456 5/6

12. 5 (orange, shirt, water),

(orange, water), (orange,

shirt), (orange)

245 3/6

13. 6 (paper, shirt, water),

(paper, shirt)

346 3/6

Figure 4-13: Six closed itemsets with their transaction sets and supports.

In general, the set of closed frequent itemsets contains complete information regarding the

frequent itemsets. For example, the transaction set and support of the frequent itemset {orange},

{orange, shirt} and {orange, water} can be found to be equivalent to those of the closed frequent

itemset {orange, shirt, water}, since there is no smaller closed frequent itemset that includes them other than {orange, shirt, water}. That is, TransactionSet({orange}) = TransactionSet({orange, shirt}) = TransactionSet({orange, water}) = TransactionSet({orange, shirt, water}).

Moreover, in this example, the maximal frequent itemsets are {ice, water}, {orange, shirt, water} and {paper, shirt, water}, since no superset of any of these itemsets is frequent.

For the association rules (Figure 4-11 (c)), we can discard the 2nd, 4th, 6th, 8th, 15th, 16th, 17th, 21st, 22nd and 23rd rules since their confidences are lower than the minimum confidence, i.e., 0.67. Moreover, the 7th, 8th, 14th, 20th, 21st, 23rd and 24th rules are discarded since their lifts are lower than 1.0. Finally, we have ten frequent rules left.

As stated above, to find frequent association rules, association rule mining can be viewed as a

two-step process; (1) Find all frequent itemsets, and (2) Generate strong association rules from

the frequent itemsets. The second step is relatively straightforward, while the first step consumes more time and space. The following subsections describe a set of existing algorithms, namely Apriori, FP-Tree and CHARM, which efficiently discover the frequent itemsets.

4.2.1. Apriori algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. As its name suggests, the algorithm uses prior knowledge of frequent itemset properties to prune infrequent candidate itemsets and thereby avoid counting their supports. It explores the space of itemsets iteratively in a level-wise style, where k-itemsets are used to explore (k+1)-itemsets. In the first step, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item. The items that satisfy the minimum support are collected as a set denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Each step of finding Lk requires one full database scan.


To improve the efficiency of the level-wise generation of frequent itemsets, an important

property called the Apriori property is used to reduce the search space. The Apriori property

states that "all non-empty subsets of a frequent itemset must also be frequent." Semantically, if an itemset X does not satisfy the minimum support threshold, minsup (i.e., X is not frequent, support(X) < minsup), then none of its supersets will ever be frequent. In other words, even if we add an item i to the itemset X, the resulting itemset X ∪ {i} can never occur more frequently than X, i.e., support(X ∪ {i}) ≤ support(X). In conclusion, X ∪ {i} is not frequent if X is not frequent. That is, support(X) < minsup implies support(X ∪ {i}) < minsup.

This Apriori property is based on a partial order over itemsets. It is called anti-monotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well; the property is monotonic in the context of failing a test. Exploiting this property, the two-step process can be defined as follows.

[Join Step]

Generate Ck, a set of candidate k-itemsets, by joining Lk-1 with itself. Then determine which members of Ck are frequent to obtain the set of all frequent k-itemsets, Lk. Here, if the items within a transaction or itemset are sorted in lexicographic order, then we can easily generate Ck from Lk-1 by joining two itemsets in Lk-1 which share a common (k-2)-item prefix. For example (using hypothetical items), if the 2-itemsets {a, b} and {a, c} are both in L2, they share the same prefix {a}, so we can generate {a, b, c} as a candidate in C3. Note that this join step implicitly uses the Apriori property.

[Prune Step]

After the first step (join step), a set of candidate k-itemsets, Ck, has been generated. Among them, we have to identify which ones are frequent (count no less than the minimum support) and which are not. In other words, Ck is a superset of Lk (the frequent k-itemsets). While members of Ck may or may not be frequent, all of the frequent k-itemsets are included in Ck. That is, Lk ⊆ Ck.

A scan of the database to determine the count of each candidate in Ck results in the determination of Lk. In many cases, the set of candidates is huge, resulting in heavy computation. To reduce the size of Ck, the Apriori property is also used in addition to the join step: all non-empty subsets of a frequent itemset must also be frequent, so any candidate with an infrequent (k-1)-subset can be removed. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

Algorithm 4.3 shows the pseudocode of the Apriori algorithm for discovering frequent itemsets for mining Boolean association rules. In the pseudocode, the function GenerateCandidate corresponds to the join step, while the function CheckNoInfrequentSubset corresponds to the pruning step. The following shows an example of the Apriori algorithm with six retailing transactions, where minsup = 0.5 and minconf = 0.67.


Algorithm 4.3: The Apriori Algorithm

OBJECTIVE: Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:  D       # a database of transactions
        minsup  # minimum support count
Output: L       # frequent itemsets in D

Procedure Main():
 1: L1 = Find_frequent_1_Itemset(D)        # find frequent 1-itemsets from D
 2: for (k = 2; Lk-1 ≠ ∅; k++) {
 3:   Lk = ∅;
 4:   Ck = GenerateCandidate(Lk-1);        # generate candidates
 5:   foreach transaction t ∈ D {          # scan D for counts
 6:     Ct = Subset(Ck, t);                # find which candidates are contained in t
 7:     foreach candidate c ∈ Ct
 8:       count[c] = count[c] + 1;         # add one to the count of c
 9:   }
10:   foreach c in Ck {                    # check all c in Ck
11:     if (count[c] >= minsup)            # if c reaches the minimum support count
12:       Lk = Lk ∪ {c};                   # add c to the set of frequent k-itemsets
13:   }
14: }
15: return L = ∪k Lk                       # return the union of all frequent k-itemsets

Procedure GenerateCandidate(Lk-1):         # Lk-1 holds the frequent (k-1)-itemsets
 1: Ck = ∅;
 2: foreach itemset x1 ∈ Lk-1 {
 3:   foreach itemset x2 ∈ Lk-1 {
 4:     if ( (x1[1]=x2[1]) & (x1[2]=x2[2]) & ... & (x1[k-2]=x2[k-2]) &
 5:          (x1[k-1]<x2[k-1]) ) {         # join two itemsets sharing a (k-2)-prefix
 6:       for (i = 1; i <= k-2; i++)
 7:         c[i] = x1[i];                  # copy the common prefix
 8:       c[k-1] = x1[k-1];
 9:       c[k]   = x2[k-1];
10:       if ( CheckNoInfrequentSubset(c, Lk-1) )
11:         Ck = Ck ∪ {c};                 # keep only candidates with no infrequent subset
12:     }
13:   }
14: }
15: return Ck;

Procedure CheckNoInfrequentSubset(c, Lk-1):   # c: a candidate k-itemset
 1: foreach (k-1)-subset s of c {
 2:   if s ∉ Lk-1
 3:     return FALSE;
 4: }
 5: return TRUE;


Transaction ID | Items
1 | coke, ice, paper, shoes, water
2 | ice, orange, shirt, water
3 | paper, shirt, water
4 | coke, orange, paper, shirt, water
5 | ice, orange, shirt, shoes, water
6 | paper, shirt, water

Step 1: Scan the database to count the frequency of the 1-itemsets and generate the set of the candidate 1-itemset, C1.

Output: C1

Itemset Support
{coke} 2

{ice} 3

{orange} 3

{paper} 4

{shirt} 5

{shoes} 2

{water} 6

Step 2: From the set of the candidate 1-itemset C1, identify the frequent ones and generate the frequent 1-itemset, L1. Here the infrequent ones are omitted.

Output:
C1: {coke} 2, {ice} 3, {orange} 3, {paper} 4, {shirt} 5, {shoes} 2, {water} 6
L1: {ice} 3, {orange} 3, {paper} 4, {shirt} 5, {water} 6

Step 3: From the set of frequent 1-itemsets L1, generate the set of candidate 2-itemsets, C2, and count their frequencies.

Output:
L1: {ice} 3, {orange} 3, {paper} 4, {shirt} 5, {water} 6
C2: {ice,orange} 2, {ice,paper} 1, {ice,shirt} 2, {ice,water} 3, {orange,paper} 1, {orange,shirt} 3, {orange,water} 3, {paper,shirt} 3, {paper,water} 4, {shirt,water} 5


Step 4: From the set of candidate 2-itemsets C2, identify the frequent ones and generate the set of frequent 2-itemsets, L2. Here the infrequent ones are omitted.

Output:
C2: {ice,orange} 2, {ice,paper} 1, {ice,shirt} 2, {ice,water} 3, {orange,paper} 1, {orange,shirt} 3, {orange,water} 3, {paper,shirt} 3, {paper,water} 4, {shirt,water} 5
L2: {ice,water} 3, {orange,shirt} 3, {orange,water} 3, {paper,shirt} 3, {paper,water} 4, {shirt,water} 5

Step 5: From the set of frequent 2-itemsets L2, generate the set of candidate 3-itemsets, C3, and count their frequencies. Here, perform the join and prune steps as described above.

Output:
L2: {ice,water} 3, {orange,shirt} 3, {orange,water} 3, {paper,shirt} 3, {paper,water} 4, {shirt,water} 5
C3: {orange,shirt,water} 3, {paper,shirt,water} 3

For the join step, {orange,shirt} and {orange,water} can be joined to generate {orange,shirt,water}, and {paper,shirt} and {paper,water} can be joined to generate {paper,shirt,water}. For the prune step, since {shirt,water} is also frequent (it exists in L2), every 2-item subset of {orange,shirt,water} and of {paper,shirt,water} is frequent, so neither candidate is pruned.

Step 6: From the set of candidate 3-itemsets C3, identify the frequent ones and generate the set of frequent 3-itemsets, L3. Here, all candidate 3-itemsets are frequent.

Output:
C3: {orange,shirt,water} 3, {paper,shirt,water} 3
L3: {orange,shirt,water} 3, {paper,shirt,water} 3

Step 7: From the set of frequent 3-itemsets L3, try to generate the set of candidate 4-itemsets, C4, but no candidate can be generated.

Output:
L3: {orange,shirt,water} 3, {paper,shirt,water} 3
C4: (empty)


Finally, the set of frequent itemsets is L = L1 ∪ L2 ∪ L3. They can be listed as follows.

Itemset Support
{ice} 3
{orange} 3
{paper} 4
{shirt} 5
{water} 6
{ice,water} 3
{orange,shirt} 3
{orange,water} 3
{paper,shirt} 3
{paper,water} 4
{shirt,water} 5
{orange,shirt,water} 3
{paper,shirt,water} 3

To improve the efficiency of Apriori-based mining, many variations of the original algorithm have been proposed. Some common techniques are hash-based frequency counting, transaction reduction, partitioning, sampling, and dynamic itemset counting.
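As a compact illustration of the level-wise process (not an optimized implementation), the following Python sketch mirrors the join, prune and count steps; the function name apriori and the set-based join are our own simplifications.

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """Level-wise Apriori sketch; transactions is a list of sets and
        minsup_count is the minimum support count. Returns {frozenset: count}."""
        def count_candidates(candidates):
            counts = {c: 0 for c in candidates}
            for t in transactions:                     # one database scan per level
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            return {c: n for c, n in counts.items() if n >= minsup_count}

        items = {frozenset([i]) for t in transactions for i in t}
        frequent = count_candidates(items)             # L1
        all_frequent, k = dict(frequent), 2

        while frequent:
            prev = list(frequent)
            # join step: combine (k-1)-itemsets whose union has exactly k items
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # prune step: every (k-1)-subset of a candidate must be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
            frequent = count_candidates(candidates)    # Lk
            all_frequent.update(frequent)
            k += 1
        return all_frequent

Applied to the six transactions of Figure 4-11 (a) with minsup_count = 3, this reproduces the thirteen frequent itemsets listed above.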

4.2.2. FP-Tree algorithm

Although the Apriori algorithm reduces the number of candidates generated, it may still generate a large number of candidates. Moreover, it may need to repeatedly scan the database and check a large set of candidates by pattern matching; it is costly to go over each transaction in the database to determine the support of the candidate itemsets. To solve these issues, a method called frequent-pattern growth (FP-growth) was proposed to avoid candidate generation. Applying a divide-and-conquer strategy, FP-growth proceeds as follows.

1. Scan the database once to find which 1-itemsets are frequent and which ones are not.

2. Eliminate the infrequent 1-itemsets from each transaction.

3. Compress the transactions in the database into a frequent-pattern tree (FP-tree).

4. Divide the compressed database into a set of conditional databases, a special kind of

projected database. Each conditional database is associated with one frequent item or

“pattern fragment.”

5. Mine each conditional database recursively, constructing further conditional databases as needed.

The following displays an example of FP-growth, given eight transactions of office material sales.

Here, minsup is set to 50%.

Transaction ID Items

1 Binder, Clip, Memo, Paper, Scissors, Stapler

2 Card, Clip, Pad, Paper, Punch, Tape

3 Binder, Clip, Pad, Paper, Pin, Ruler, Tape

4 Clip, Memo, Paper, Pin, Stapler, Tape

5 Card, Memo, Pad, Ruler, Scissors, Tape

6 Card, Memo, Pad, Paper, Punch, Ruler, Stapler

7 Binder, Clip, Memo, Paper, Stapler, Tape

8 Clip, Pad, Paper, Ruler, Scissors, Stapler, Tape


The database can be transformed to vertical format as follows.

Item Transaction ID Set

Binder 137

Card 256

Clip 123478

Memo 14567

Pad 23568

Paper 1234678

Pin 34

Punch 26

Ruler 3568

Scissors 158

Stapler 14678

Tape 234578

The infrequent 1-itemsets are Binder, Card, Pin, Punch and Scissors. They are eliminated from

the database as follows.

Transaction ID Items

1 Clip, Memo, Paper, Stapler

2 Clip, Pad, Paper, Tape

3 Clip, Pad, Paper, Ruler, Tape

4 Clip, Memo, Paper, Stapler, Tape

5 Memo, Pad, Ruler, Tape

6 Memo, Pad, Paper, Ruler, Stapler

7 Clip, Memo, Paper, Stapler, Tape

8 Clip, Pad, Paper, Ruler, Stapler, Tape

Then, for each transaction, the remaining items are ordered by their frequencies (the most frequent item first); in this example the order used is Paper, Clip, Tape, Memo, Pad, Ruler, Stapler.

Transaction ID Items

1 Paper, Clip, Memo, Stapler

2 Paper, Clip, Tape, Pad

3 Paper, Clip, Tape, Pad, Ruler

4 Paper, Clip, Tape, Memo, Stapler

5 Tape, Memo, Pad, Ruler

6 Paper, Memo, Pad, Ruler, Stapler

7 Paper, Clip, Tape, Memo, Stapler

8 Paper, Clip, Tape, Pad, Ruler, Stapler

Each reduced transaction is used to construct a frequent-pattern (FP) tree as follows.

1. Create the root of the tree with the label of (null).

2. Scan database D a second time.

3. The items in each transaction are processed in the transaction order and a branch is

created for each transaction, sharing common prefixes where possible. Finally, the tree is constructed.

4. Later, links are created among nodes with the same label.

The following snapshots (a)-(h) illustrate the process of creating the FP-tree from the first transaction to the eighth transaction in order.


[Snapshots (a)-(f): the FP-tree after inserting transactions 1 to 6]


[Snapshots (g)-(h): the FP-tree after inserting transactions 7 and 8]

Then the links are created from the header table to nodes and among nodes with the same label.

From the FP-tree, we can mine frequent itemsets as follows. First of all, we start from each

frequent 1-itemset, construct its conditional pattern base, then construct its (conditional) FP-

tree, and perform mining recursively on such a tree. An example of the conditional pattern bases and conditional FP-trees is shown in the following table.


1-itemset | Conditional Pattern Base | Conditional FP-tree | Frequent Patterns Generated
Ruler (R) | {PP,C,T,P:2}, {PP,M,P:1}, {T,M,P:1} | {P:4} | {P,R:4}
Stapler (S) | {PP,C,M:1}, {PP,C,T,P,R:1}, {PP,C,T,M:2}, {PP,M,P,R:1} | {PP:4, C:4} | {PP,S:4}, {C,S:4}, {PP,C,S:4}
Pad (P) | {PP,C,T:3}, {PP,M:1}, {T,M:1} | {PP:4, T:3}, {T:1} | {PP,P:4}, {T,P:4}
Memo (M) | {PP,C:1}, {PP,C,T:2}, {PP:1}, {T:1} | {PP:4} | {PP,M:4}
Tape (T) | {PP,C:5} | {PP:5, C:5} | {PP,T:5}, {C,T:5}, {PP,C,T:5}
Clip (C) | {PP:6} | {PP:6} | {PP,C:6}
Paper (PP) | - | - | -

For each frequent 1-itemset, the conditional pattern bases are shown in the second column. From the conditional pattern bases, we can construct the conditional FP-trees shown in the third column. From each conditional FP-tree, we can recursively mine the frequent patterns shown in the last column. If the conditional FP-tree consists of a single path, all combinations of its items are generated. For example, Stapler (S) has the single-path conditional FP-tree {PP:4, C:4}, so the three possible combinations {PP,S}, {C,S} and {PP,C,S} are generated.

In general, the FP-growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix. By using the frequency order in the tree construction, it offers good selectivity and substantially reduces the search cost. However, when the database is large, it is sometimes unrealistic to construct the main FP-tree in memory. One common solution is to partition the database into a set of projected databases, and then construct an FP-tree and mine it in each projected database. Such a process can be applied recursively to any projected database whose FP-tree still cannot fit in main memory. The performance of FP-growth has been reported to be about an order of magnitude faster than that of the Apriori algorithm.
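The following Python sketch shows the core data structure: building the FP-tree from reordered transactions and extracting a conditional pattern base. The names FPNode, build_fp_tree and conditional_pattern_base are ours, and the alphabetical tie-breaking of equally frequent items is an arbitrary choice that may order items differently from the example above.

    from collections import Counter, defaultdict

    class FPNode:
        """A node of a frequent-pattern tree: an item, a count, and child links."""
        def __init__(self, item=None, parent=None):
            self.item, self.parent, self.count = item, parent, 0
            self.children = {}

    def build_fp_tree(transactions, minsup_count):
        """Drop infrequent items, reorder each transaction by descending item
        frequency, and insert it into the tree as a shared-prefix path."""
        freq = Counter(item for t in transactions for item in t)
        freq = {i: c for i, c in freq.items() if c >= minsup_count}

        root, header = FPNode(), defaultdict(list)   # header: item -> nodes with that label
        for t in transactions:
            kept = [i for i in t if i in freq]
            kept.sort(key=lambda i: (-freq[i], i))   # most frequent first
            node = root
            for item in kept:                        # follow or create the branch for this path
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header

    def conditional_pattern_base(item, header):
        """Collect the prefix paths (with counts) that end at the given item."""
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            base.append((list(reversed(path)), node.count))
        return base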

4.2.3. CHARM algorithm

As stated previously, a huge number of frequent itemsets is usually generated, especially when the minsup threshold is set low or when long patterns exist in the data set. As in our previous example, Figure 4-13 showed that even though the closed frequent itemsets form a reduced set of the patterns generated in frequent itemset mining, they preserve the complete information regarding the set of frequent itemsets. That is, from the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and their supports. Therefore, in most cases it is more practical to mine the set of closed frequent itemsets rather than the set of all frequent itemsets. A well-known method is the CHARM algorithm. The CHARM algorithm utilizes the vertical database format, rather than the horizontal one. Figure 4-14 shows the horizontal and vertical formats of the database and the reduced set of items when minsup is set to 50%, i.e., three transactions.

The pseudo-code of the CHARM algorithm is summarized in Algorithm 4.4. In the algorithm, the main function (CHARM) calls the CHARM_EXTEND function to extend the lattice by adding new nodes. The CHARM_PROPERTY function checks whether a newly created node has the same tidset (transaction ID set) as those of its parents or not. If it is the same, the parent is replaced with the new node or removed from the lattice.


Transaction ID | Items
1 | coke, ice, paper, shoes, water
2 | ice, orange, shirt, water
3 | paper, shirt, water
4 | coke, orange, paper, shirt, water
5 | ice, orange, shirt, shoes, water
6 | paper, shirt, water

(a) Horizontal format

Itemset | Trans | Freq.
coke | 14 | 2
ice | 125 | 3
orange | 245 | 3
paper | 1346 | 4
shirt | 23456 | 5
shoes | 15 | 2
water | 123456 | 6

(b) Vertical format

Itemset | Trans | Freq.
ice | 125 | 3
orange | 245 | 3
paper | 1346 | 4
shirt | 23456 | 5
water | 123456 | 6

(c) Reduced set of items

Figure 4-14: A database in horizontal and vertical format, with its reduced set of items.

Algorithm 4.4. The CHARM algorithm

CHARM(D, minsup):
1. Nodes = { Xi × t(Xi) : Xi is a single item and |t(Xi)| ≥ minsup }
2. CHARM_EXTEND(Nodes, C = ∅)

CHARM_EXTEND(Nodes, C):
1. for each Xi × t(Xi) in Nodes
2.   NewN = ∅ ; X = Xi
3.   for each Xj × t(Xj) in Nodes, with Xj following Xi,
4.     let X = Xi ∪ Xj and Y = t(Xi) ∩ t(Xj)
5.     if |Y| ≥ minsup then
6.       CHARM_PROPERTY(Nodes, NewN, X, Y, Xi, Xj)
7.   if NewN ≠ ∅ then CHARM_EXTEND(NewN, C)
8.   C = C ∪ {X}            # add X if it is not subsumed by an existing closed itemset

CHARM_PROPERTY(Nodes, NewN, X, Y, Xi, Xj):     # X = Xi ∪ Xj ; Y = t(Xi) ∩ t(Xj)
1. if t(Xi) = t(Xj) then
2.   Remove Xj from Nodes
3.   Replace all Xi with X
4. else if t(Xi) ⊂ t(Xj) then
5.   Replace all Xi with X
6. else if t(Xi) ⊃ t(Xj) then
7.   Remove Xj from Nodes
8.   Add X × Y to NewN
9. else
10.  Add X × Y to NewN


The following is an example of the complete lattice.

The CHARM algorithm starts from the root node and then processes the leftmost branches in order. While going down a branch, the CHARM_PROPERTY function is used to check whether lower nodes should be created, or whether parent nodes should be replaced or eliminated. The following illustrates the CHARM algorithm mining closed frequent itemsets from the supermarket database. First, five nodes are created under the root node as follows.

Next, the leftmost node in the first level is expanded to generate nodes in the second level. Some nodes have a support lower than minsup (= 3). The node {I,W} has the same tidset as its father {I}; therefore, the father {I} is replaced by the child {I,W}.


Next, the second node in the first level is expanded to generate nodes in the second level. The last node {S,W} has the same tidset as its father {S}; therefore, the father {S} is replaced by the child {S,W}. Moreover, the first node {S,O} has the same tidset as its mother {O}; therefore, the mother {O} is eliminated.

The first node in the second level is expanded to generate a node {S,PO,P,W} in the third level.

However, its support is lower than the minsup.

Next, the fourth node in the first level is expanded to generate a node in the second level. The node {P,W} has the same tidset as its father {P}; therefore, the father {P} is replaced by the child {P,W}.


Finally, the resulting tree includes six nodes (closed itemsets). Their closed frequent itemsets, tidsets, covered frequent itemsets and supports are shown in a table.

The CHARM algorithm thus generates a reduced set of itemsets in the form of closed itemsets. Intuitively, the denser the database, the more compact the representation provided by the closed itemsets.
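The closure idea behind CHARM can be illustrated with tidsets directly: the closure of an itemset is the set of items shared by all transactions in its tidset. The following brute-force Python sketch (function name closed_itemsets is ours) demonstrates this idea for small databases; it is not the CHARM search itself, which avoids enumerating all candidates.

    def closed_itemsets(transactions, minsup_count):
        """Compute the closed frequent itemsets of a small database by taking, for
        each frequent candidate, the intersection of its supporting transactions."""
        from itertools import combinations

        items = sorted(set().union(*transactions))
        tids = {i: {t for t, tr in enumerate(transactions) if i in tr} for i in items}

        closed = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):
                tidset = set.intersection(*(tids[i] for i in cand))   # supporting transactions
                if len(tidset) < minsup_count:
                    continue
                # closure: items present in every supporting transaction
                closure = frozenset(i for i in items if tidset <= tids[i])
                closed[closure] = len(tidset)
        return closed

For the database of Figure 4-14, with a minimum support count of 3 this yields the same six closed itemsets that CHARM finds above.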

4.2.4. Association Rules with Hierarchical Structure

While mining frequent itemsets and association rules is a common task in association analysis, another interesting task is to mine multilevel associations, where a concept hierarchy is assumed. Tasks related to multilevel association rules involve concepts at different levels of abstraction. In many cases, it is hard to find strong associations among data items at low or primitive levels of abstraction with enough support, due to the sparsity of data at those levels. As an alternative, it is


possible to discover strong associations at higher levels of abstraction. Such high-level associations may represent common-sense knowledge, although they can still be useful or even novel to particular users. At this point, we may need to handle multiple levels of abstraction with sufficient flexibility for easy traversal among different abstraction spaces. Figure 4-15 shows an example of a concept hierarchy of products, sales transactions and their modified transactions. Figure 4-16 shows the process of mining association rules with the concept hierarchy. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Items (data) can be generalized by replacing low-level concepts with their higher-level concepts, or ancestors, from the concept hierarchy.

(a) Concept hierarchy of products

TID | Items
1 | apple, orange, paper, shirt, water
2 | coke, orange, paper, ruler, water
3 | apple, coke, paper, shoes, water
4 | coke, orange, paper, shirt
5 | apple, orange, water
6 | apple, ruler, shirt, shoes, water
7 | coke, orange, paper, shirt, shoes
8 | coke, orange, paper, shirt
9 | ruler, shirt, shoes, water
A | coke, orange, paper, ruler

(b) An example of sales transactions

TID | Items
1 | fruit, stationary, clothing, drink
2 | drink, fruit, stationary
3 | fruit, drink, stationary, clothing
4 | drink, fruit, stationary, shirt
5 | fruit, drink
6 | fruit, stationary, clothing, drink
7 | drink, orange, stationary, clothing
8 | drink, orange, stationary, clothing
9 | stationary, clothing, drink
A | drink, fruit, stationary

(c) The modified sales transactions, obtained by replacing each item with its higher-level concept

Figure 4-15: An example of concept hierarchy of products, sales transactions and their

modified transactions

In this example, assume that the minimum support and the minimum confidence are set to 0.5 and 0.8, respectively. In Figure 4-15, (a) illustrates the concept hierarchy of products, (b) shows the transactions, and (c) shows the modified transactions where each item is replaced by its higher-level concept. From the modified transactions, it is possible to mine frequent itemsets as listed in Figure 4-16 (a), and then the frequent rules can be found as shown in Figure 4-16 (b). Since all single higher-level items (e.g., clothing, drink, fruit, and stationary) are frequent, we then mine frequent itemsets and association rules at the lower level. From the original transactions in Figure 4-15 (b), the frequent itemsets and rules can be found as listed in Figure 4-16 (c) and (d), respectively.

Items | Transactions | No.
clothing | 136789 | (1-1)
drink | 123456789A | (1-2)
fruit | 123456A | (1-3)
stationary | 12346789A | (1-4)
clothing, drink | 136789 | (2-1)
clothing, fruit | 136 | (2-2)
clothing, stationary | 136789 | (2-3)
drink, fruit | 123456A | (2-4)
drink, stationary | 12346789A | (2-5)
fruit, stationary | 12346A | (2-6)
clothing, drink, stationary | 136789 | (2-7)
drink, fruit, stationary | 12346A | (2-8)
clothing, drink, fruit, stationary | 136 | (2-9)

(a) Frequent itemsets when the minimum support is set to 50% (one level higher)

Rule | Confidence
clothing ⇒ drink | 6/6=1.00
drink ⇒ clothing | 6/10=0.60
clothing ⇒ stationary | 6/6=1.00
stationary ⇒ clothing | 6/9=0.67
drink ⇒ fruit | 7/10=0.70
fruit ⇒ drink | 7/7=1.00
drink ⇒ stationary | 9/10=0.90
stationary ⇒ drink | 9/9=1.00
fruit ⇒ stationary | 6/7=0.86
stationary ⇒ fruit | 6/9=0.67
clothing, drink ⇒ stationary | 6/6=1.00
clothing, stationary ⇒ drink | 6/6=1.00
drink, stationary ⇒ clothing | 6/9=0.67
drink, fruit ⇒ stationary | 6/7=0.86
drink, stationary ⇒ fruit | 6/9=0.67
fruit, stationary ⇒ drink | 6/6=1.00

(b) Possible association rules when the minimum confidence is set to 80% (one level higher)

Items | Transactions | No.
apple | 1356 | (1-1)
coke | 23478A | (1-2)
orange | 124578A | (1-3)
paper | 123478A | (1-4)
ruler | 269A | (1-5)
shirt | 146789 | (1-6)
shoes | 3679 | (1-7)
water | 123569 | (1-8)
coke, orange | 2478A | (2-1)
coke, paper | 23478A | (2-2)
coke, shirt | 478 | (2-3)
coke, water | 23 | (2-4)
orange, paper | 12478A | (2-5)
orange, shirt | 1478 | (2-6)
orange, water | 125 | (2-7)
paper, shirt | 1478 | (2-8)
paper, water | 123 | (2-9)
shirt, shoes | 679 | (2-10)
shirt, water | 169 | (2-11)
coke, orange, paper | 2478A | (3-1)

(c) Frequent itemsets when the minimum support is set to 50% (itemsets with fewer than five supporting transactions are infrequent)

Rule | Confidence
coke ⇒ orange | 5/6=0.83
orange ⇒ coke | 5/7=0.71
coke ⇒ paper | 6/6=1.00
paper ⇒ coke | 6/7=0.86
orange ⇒ paper | 6/7=0.86
paper ⇒ orange | 6/7=0.86
coke, orange ⇒ paper | 5/5=1.00
coke, paper ⇒ orange | 5/6=0.83
orange, paper ⇒ coke | 5/6=0.83

(d) Possible association rules when the minimum confidence is set to 80%

Figure 4-16: An example of mining association rules with the hierarchical structure
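The generalization step that produces the higher-level transactions can be sketched in a few lines of Python. The item-to-concept mapping below is inferred from Figures 4-15 (a) and (c) and should be read as an illustrative assumption; the function names generalize, frequent_single_items and items_to_examine are ours.

    from collections import Counter

    # Product-to-concept mapping inferred from Figure 4-15 (illustrative assumption)
    HIERARCHY = {
        "apple": "fruit", "orange": "fruit",
        "coke": "drink", "water": "drink",
        "paper": "stationary", "ruler": "stationary",
        "shirt": "clothing", "shoes": "clothing",
    }

    def generalize(transactions, hierarchy):
        """Replace each item by its ancestor concept (items without an ancestor stay as-is)."""
        return [{hierarchy.get(item, item) for item in t} for t in transactions]

    def frequent_single_items(transactions, minsup):
        """Return the items whose relative support reaches minsup."""
        counts = Counter(i for t in transactions for i in t)
        return {i for i, c in counts.items() if c / len(transactions) >= minsup}

    # Top-down use: mine the generalized level first, then descend only below the
    # concepts that are frequent at the higher level.
    def items_to_examine(transactions, hierarchy, minsup):
        frequent_concepts = frequent_single_items(generalize(transactions, hierarchy), minsup)
        return {item for item in hierarchy if hierarchy[item] in frequent_concepts}

This top-down filtering is exactly the strategy described in the next paragraph: counts are first accumulated at the top concept level, and lower-level items are examined only under concepts that remain frequent.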


These multiple-level (or multilevel) association rules can be mined efficiently from data at multiple levels of abstraction using concept hierarchies under a support-confidence framework. As shown above, a top-down strategy can generally be used together with existing association rule mining algorithms, such as Apriori, FP-tree, CHARM and their variations. That is, during mining, the counts are accumulated for the calculation of frequent itemsets at each concept level, starting at the top concept level and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found. For the example in Figure 4-16, the frequent itemsets and rules in (a)-(b) are mined first, before mining the frequent itemsets and rules in (c)-(d). A number of variations to this approach can be applied. The major ones are enumerated as follows.

1. Uniform Minimum Support

Under uniform minimum support for all levels, the same minimum support threshold is used when mining at each level of abstraction. For the above example, a minimum support threshold of 0.5 is used for all levels. The mining results for levels 1 and 2 are shown in Figure 4-16.

In this example, when a minimum support threshold of 0.5 is used for ‘apple,’ ‘orange,’ ‘coke,’ ‘water,’ ‘ruler,’ ‘paper,’ ‘shoes,’ and ‘shirt,’ the infrequent items, i.e., ‘apple (sup=4/10),’ ‘ruler (sup=4/10),’ and ‘shoes (sup=4/10),’ are eliminated. Moreover, when the same threshold is applied to ‘fruit,’ ‘drink,’ ‘stationary,’ and ‘clothing,’ all general concepts are found to be frequent, even though these three subitems are not.

Searching under a uniform minimum support threshold is simple. When we apply an Apriori-like optimization technique, we can restrict the search so that it avoids examining itemsets containing any item whose ancestors do not have minimum support. This is possible because the frequent itemsets at a higher level are found before those at a lower level.
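A minimal sketch of this pruning step, assuming a hypothetical child-to-parent map and the set of concepts already found frequent at the higher level:

    def ancestors(item, parent):
        """Collect all ancestors of an item by walking a child -> parent map."""
        found = []
        while item in parent:
            item = parent[item]
            found.append(item)
        return found

    def prune_candidates(candidates, parent, frequent_higher):
        """Drop any candidate itemset containing an item whose ancestor did not
        reach minimum support at the already-mined higher level."""
        return [c for c in candidates
                if all(a in frequent_higher for i in c for a in ancestors(i, parent))]

    PARENT = {"coke": "drink", "water": "drink", "shirt": "clothing", "shoes": "clothing"}
    # Suppose, hypothetically, that only 'drink' was frequent at the higher level:
    print(prune_candidates([("coke", "water"), ("coke", "shirt")], PARENT, {"drink"}))
    # -> [('coke', 'water')]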

However, the uniform support approach has a number of drawbacks. For example, if the

minimum support threshold is set too high, it could miss some meaningful associations

occurring at low abstraction levels. If the threshold is set too low, it may generate many

uninteresting associations occurring at high abstraction levels.

2. Reduced Minimum Support

Using reduced minimum support at lower levels (referred to as reduced support), each

level of abstraction has its own minimum support threshold. The deeper the level of

abstraction, the smaller the corresponding threshold is. For example, given the following


hierarchy with reduced minimum support, we can mine items at different levels with

different thresholds.

In this hierarchy, the minimum support thresholds for levels 1, 2 and 3 are 0.5, 0.8 and 0.9, respectively. With these thresholds, as shown in Figure 3.52, only the two higher concepts ‘drink’ and ‘stationary’ are considered; in Figure 3.52 (d)-(e), only frequent itemsets and frequent rules involving ‘drink’ and ‘stationary’ are considered. Moreover, the level-1 items can be mined in the same way but with a lower support (i.e., 0.5), as shown in Figure 3.52 (f)-(g).
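A minimal sketch of the reduced-support setting, using the thresholds quoted above (level 1 denotes the leaf items, and the higher concepts are assumed here to sit at level 2); the supports are taken from the tidsets in Figure 4-16 (a):

    # Per-level minimum support: level 1 (leaf items) 0.5, level 2 0.8, level 3 0.9.
    MINSUP_BY_LEVEL = {1: 0.5, 2: 0.8, 3: 0.9}

    def is_frequent(support, level, thresholds=MINSUP_BY_LEVEL):
        """Each abstraction level is judged against its own threshold."""
        return support >= thresholds[level]

    # At the higher level, 'drink' (10/10) and 'stationary' (9/10) survive a 0.8
    # threshold, while 'fruit' (7/10) and 'clothing' (6/10) do not.
    for concept, sup in [("drink", 1.0), ("stationary", 0.9),
                         ("fruit", 0.7), ("clothing", 0.6)]:
        print(concept, is_frequent(sup, level=2))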

3. Level-cross filtering by single items

Similar to reduced minimum support at lower levels, each level of abstraction has its own minimum support threshold, and the deeper the level of abstraction, the smaller the corresponding threshold. In addition, the higher-level mining results are used as a filter. For example, given the following hierarchy with reduced minimum support, we can eliminate some trivial itemsets and rules as follows. In this hierarchy, the minimum support thresholds for levels 1, 2 and 3 are 0.5, 0.8 and 0.9, respectively. With these thresholds, only the two higher concepts ‘drink’ and ‘stationary’ are considered, while ‘fruit’ and ‘clothing’ are eliminated. Therefore, ‘apple,’ ‘orange,’ ‘shoes,’ and ‘shirt’ are not considered for mining. In this way, a pruning mechanism is provided.
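A minimal sketch of this filter, using the child-to-parent groupings implied by the text (coke and water under drink, ruler and paper under stationary, apple and orange under fruit, shoes and shirt under clothing):

    PARENT = {"apple": "fruit", "orange": "fruit", "coke": "drink", "water": "drink",
              "ruler": "stationary", "paper": "stationary",
              "shoes": "clothing", "shirt": "clothing"}

    def filter_items_by_parent(items, parent, frequent_parents):
        """Level-cross filtering by single items: a lower-level item is examined
        only when its parent concept passed the higher-level minimum support."""
        return [i for i in items if parent.get(i) in frequent_parents]

    survivors = filter_items_by_parent(
        ["apple", "orange", "coke", "water", "ruler", "paper", "shoes", "shirt"],
        PARENT, frequent_parents={"drink", "stationary"})
    print(survivors)   # ['coke', 'water', 'ruler', 'paper']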


4. Level-cross filtering by k-itemset

This case is similar to level-cross filtering by single items, but it filters k-itemsets instead of single items: if a k-itemset at the higher level does not pass minimum support, the k-itemsets at the lower level under those concepts will not be examined. For example, given the following hierarchy with reduced minimum support, we examine all 2-itemsets at the lower level in case (a): since ‘drink, stationary’ passes the minimum support, ‘coke, ruler,’ ‘coke, paper,’ ‘water, ruler’ and ‘water, paper’ will be examined. On the other hand, we eliminate all 2-itemsets at the lower level in case (b): since ‘drink, fruit’ does not pass the minimum support, ‘coke, apple,’ ‘coke, orange,’ ‘water, apple’ and ‘water, orange’ are pruned out without examination.

(a) The higher concept passes minimum support: all lower items are examined.

(b) The higher concept does not pass minimum support: all lower items are pruned out.
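A minimal sketch of the k-itemset version, reusing the same groupings; case (a) expands the children of a frequent higher-level 2-itemset, while case (b) prunes them without counting:

    from itertools import product

    CHILDREN = {"drink": ["coke", "water"], "stationary": ["ruler", "paper"],
                "fruit": ["apple", "orange"]}

    def expand_if_parent_frequent(parent_itemset, children_of, frequent_parents):
        """Generate lower-level k-itemsets only when the higher-level k-itemset
        itself passed minimum support; otherwise prune the whole family."""
        if tuple(sorted(parent_itemset)) not in frequent_parents:
            return []
        return [set(combo) for combo in product(*(children_of[c] for c in parent_itemset))]

    FREQUENT_PAIRS = {("drink", "stationary")}       # ('drink', 'fruit') failed
    print(expand_if_parent_frequent(("drink", "stationary"), CHILDREN, FREQUENT_PAIRS))
    # case (a): coke/ruler, coke/paper, water/ruler, water/paper are examined
    print(expand_if_parent_frequent(("drink", "fruit"), CHILDREN, FREQUENT_PAIRS))
    # case (b): [] -- coke/apple, coke/orange, water/apple, water/orange are pruned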

5. Controlled level-cross filtering by 1-itemset (single item) or k-itemset

This case is similar to level-cross filtering by single item or k-itemset but there is a separate

threshold for setting the condition to examine the lower level. This threshold is set

separately from the minimum support. The following shows the case of 1-itemset (single

item) and the case of 2-itemset.


(a) Controlled level-cross filtering by 1-itemset (single item)

(b) Controlled level-cross filtering by k-itemset
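A minimal sketch of the controlled variant, with an assumed higher-level minimum support of 0.8 and a separate, looser "level passage threshold" of 0.6 (both values are illustrative); the supports again come from Figure 4-16 (a):

    MINSUP_HIGH, PASSAGE_THRESHOLD = 0.8, 0.6     # assumed values for illustration

    def should_descend(parent_support, passage_threshold=PASSAGE_THRESHOLD):
        """Descending to the lower level is controlled by a separate passage
        threshold rather than by the minimum support itself."""
        return parent_support >= passage_threshold

    for concept, sup in [("drink", 1.0), ("stationary", 0.9),
                         ("fruit", 0.7), ("clothing", 0.6)]:
        frequent = sup >= MINSUP_HIGH
        print(concept, "frequent:", frequent, "descend:", should_descend(sup))
    # 'fruit' is not frequent at the higher level, yet its children may still be
    # examined because it clears the looser passage threshold.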

4.2.5. Efficient Association Rule Mining with Hierarchical Structure

Association rule mining (ARM) is a process of finding the set of all subsets of items (called itemsets) that frequently occur in the database records or transactions, and then extracting rules that tell us how a subset of items influences the presence of another subset. However, such association rules may not provide the desired knowledge, since they are limited to the granularity of the individual items. For example, a rule "5% of customers who buy wheat bread also buy chocolate milk" is less expressive and less useful than the more general rule "30% of customers who buy bread also buy milk". For this purpose, generalized association rule mining (GARM) was developed, which uses the information of a pre-defined taxonomy over the items. The taxonomy is a piece of knowledge, e.g., the classification of the products (or items) into brands, categories, product groups, and so forth. Given a taxonomy in which only leaf nodes (leaf items) are present in the transactional database, more informative, intuitive and flexible rules (called generalized association rules) can be mined from the database.


Generalized Association Rules and Generalized Frequent Itemsets

With the presence of a concept hierarchy or taxonomy, the formal problem description of generalized association rule mining differs from that of ordinary association rule mining. For clarity, all explanations in this section are illustrated with the example shown in Figure 4-17. Let T be a concept hierarchy or taxonomy: a directed acyclic graph on items whose edges represent is-a relationships, e.g., Figure 4-17 (a). The items in T are composed of a set of leaf items and a set of non-leaf items.

(a) concept hierarchy or taxonomy

Horizontal database:  TID 1: ACDE,  TID 2: ABC,  TID 3: BCDE,  TID 4: ACD,  TID 5: ABCDE,  TID 6: BCDE
Vertical database:  A: 1245,  B: 2356,  C: 123456,  D: 13456,  E: 1356
Extended vertical database:  A: 1245,  B: 2356,  C: 123456,  D: 13456,  E: 1356,  U: 123456,  V: 123456,  W: 13456

(b) horizontal database (left) vs. vertical database (middle) vs. extended vertical database (right)

Figure 4-17: An example of mining generalized association rules
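The three formats in Figure 4-17 (b) can be built mechanically. The sketch below assumes one particular set of child-to-parent edges (A and B under U, C under V, D and E under W) that reproduces the extended tidsets shown in the figure; the exact taxonomy of Figure 4-17 (a) is not reproduced here.

    from collections import defaultdict

    HORIZONTAL = {1: "ACDE", 2: "ABC", 3: "BCDE", 4: "ACD", 5: "ABCDE", 6: "BCDE"}
    PARENT = {"A": "U", "B": "U", "C": "V", "D": "W", "E": "W"}   # assumed edges

    def to_vertical(horizontal):
        """Vertical format: each item maps to the tidset of transactions containing it."""
        vertical = defaultdict(set)
        for tid, items in horizontal.items():
            for item in items:
                vertical[item].add(tid)
        return dict(vertical)

    def extend_with_taxonomy(vertical, parent):
        """Extended vertical format: every ancestor inherits the tids of its descendants."""
        extended = {item: set(tids) for item, tids in vertical.items()}
        for item, tids in vertical.items():
            node = item
            while node in parent:
                node = parent[node]
                extended.setdefault(node, set()).update(tids)
        return extended

    extended = extend_with_taxonomy(to_vertical(HORIZONTAL), PARENT)
    print(sorted(extended["W"]))   # [1, 3, 4, 5, 6], i.e. W = 13456 as in the figure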

Let I be a set of distinct items, composed of the leaf items and the non-leaf items of the taxonomy, and let the set of transaction identifiers (tids) be given. In this example, I = {A, B, C, D, E, U, V, W} and the tids are {1, 2, 3, 4, 5, 6}. A subset of I is called an itemset and a subset of the tids is called a tidset. Normally, a transactional database is represented in the horizontal database format, where each transaction corresponds to an itemset, as shown in the left table of Figure 4-17 (b). An alternative is the vertical database format, where each item corresponds to the tidset of the transactions that contain it, as shown in the middle table of Figure 4-17 (b). Note that the original database contains only leaf items. The original vertical database can be extended to cover the non-leaf items, where a transaction containing an item also supports that item's ancestors in the taxonomy, as shown in the right table of Figure 4-17 (b); the resulting binary relation between items and tids is called an extended database, and an item x is said to be supported by a tid y when the pair (x, y) appears in this relation. Here, apart from the concrete elements of I, lower-case letters are used to denote items and upper-case letters to denote itemsets. For two items x and y, x is an ancestor of y (conversely, y is a descendant of x) when there is a path from x to y in the taxonomy; the sets of ancestors and descendants of each item can be read directly from the taxonomy in Figure 4-17 (a).


A generalized itemset is an itemset in which no element is an ancestor of any other element; for example, an itemset combining a leaf item with a non-leaf item from a different branch of the taxonomy is a generalized itemset, whereas an itemset containing both an item and one of its ancestors is not. Every generalized itemset is an element of the power set of I, and the set of all generalized itemsets is finite. The support of a generalized itemset G, denoted sup(G), is defined as the percentage of the number of transactions in which G occurs as a subset to the total number of transactions. Any generalized itemset G is called a generalized frequent itemset (GFI) when its support is at least a user-specified minimum support (minsup) threshold.
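A minimal sketch of sup(G) on the extended vertical database of Figure 4-17 (b): the support of a generalized itemset is the size of the intersection of its members' tidsets divided by the total number of transactions (the particular itemset below is only an illustration):

    # Extended vertical tidsets copied from Figure 4-17 (b).
    EXTENDED = {"A": {1, 2, 4, 5}, "B": {2, 3, 5, 6}, "C": {1, 2, 3, 4, 5, 6},
                "D": {1, 3, 4, 5, 6}, "E": {1, 3, 5, 6},
                "U": {1, 2, 3, 4, 5, 6}, "V": {1, 2, 3, 4, 5, 6}, "W": {1, 3, 4, 5, 6}}
    N_TRANSACTIONS = 6

    def support(itemset):
        """sup(G): fraction of transactions containing every item of G."""
        common = set.intersection(*(EXTENDED[i] for i in itemset))
        return len(common) / N_TRANSACTIONS

    MINSUP = 0.5
    G = {"A", "W"}          # no element is an ancestor of another (illustrative choice)
    print(support(G), support(G) >= MINSUP)   # 0.5 True -> a generalized frequent itemset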

In GARM, a meaningful rule is an implication of the form A ⇒ B, where A and B are generalized itemsets that share no items and no item in B is an ancestor of any item in A; a rule whose consequent contains an ancestor of an item in its antecedent is meaningless, because its support is redundant with that of the rule without the ancestor. The support of a rule A ⇒ B, defined as sup(A ⇒ B) = sup(A ∪ B), is the percentage of the number of transactions containing both A and B to the total number of transactions. The confidence of the rule, defined as conf(A ⇒ B) = sup(A ∪ B) / sup(A), is the conditional probability that a transaction contains B, given that it contains A. A meaningful rule is called a generalized association rule (GAR) when its confidence is at least a user-specified minimum confidence (minconf) threshold. The task of GARM is to discover all GARs whose supports and confidences are at least minsup and minconf, respectively.
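As a small numeric illustration on the same example (a hypothetical rule A ⇒ W, with tidsets A = 1245 and W = 13456): sup(A ⇒ W) = |{1, 4, 5}| / 6 = 0.5 and conf(A ⇒ W) = 0.5 / (4/6) = 0.75, so the rule is reported only when minsup ≤ 0.5 and minconf ≤ 0.75.

    # A hypothetical rule A => W checked against minsup/minconf thresholds
    # (tidsets as above: A = {1,2,4,5}, W = {1,3,4,5,6}, 6 transactions).
    SUP_A, SUP_A_UNION_W, N = 4, 3, 6

    def passes(sup_union, sup_antecedent, n, minsup, minconf):
        """A rule is reported when sup(A U B)/n >= minsup and
        sup(A U B)/sup(A) >= minconf."""
        return (sup_union / n >= minsup) and (sup_union / sup_antecedent >= minconf)

    print(SUP_A_UNION_W / N, SUP_A_UNION_W / SUP_A)                   # 0.5 0.75
    print(passes(SUP_A_UNION_W, SUP_A, N, minsup=0.5, minconf=0.7))   # True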

Here, it is possible to consider two relationships, namely subset-superset and ancestor-

descendant relationships, based on lattice theory. Similar to ARM, GARM occupies the subset-

superset relationship which represents a lattice of generalized itemsets. As the second

relationship, an ancestor-descendant relationship is originally introduced to represent a set of k-

generalized itemset taxonomies. By these relationships, it is possible to mine a smaller set of

generalized closed frequent itemsets instead of mining a large set of conventional generalized

frequent itemsets. Two algorithms called SET and cSET are introduced to mine generalized

frequent itemsets and generalized closed frequent itemsets, respectively. In a number of experiments, SET and cSET outperform previous well-known algorithms in both computational time and memory utilization; the number of generalized closed frequent itemsets is much smaller than the number of generalized frequent itemsets.

4.3. Historical Bibliography

As unsupervised learning, clustering has been studied extensively in many disciplines due to its

broad applications. Several textbooks are dedicated to the methods of cluster analysis, including

Hartigan (1975), Jain and Dubes (1988), Kaufman and Rousseeuw (1990), and Arabie, Hubert,

and De Soete (1996). Many survey articles on different aspects of clustering methods include

those done by Jain, Murty and Flynn (1999) and Parsons, Haque, and Liu (2004). As a

partitioning method, Lloyd (1957) and later MacQueen (1967) introduced the k-means algorithm.

Later, Bradley, Fayyad, and Reina (1998) proposed a k-means–based scalable clustering

algorithm. Instead of using means, Kaufman and Rousseeuw (1990) proposed to use the object nearest to the mean as the cluster center and called the method the k-medoids algorithm, with two versions, PAM and CLARA. To cluster categorical data (in contrast with numerical data),

Chaturvedi, Green, and Carroll (1994, 2001) proposed the k-modes clustering algorithm. Later

Huang (1998) independently proposed the k-modes (for clustering categorical data) and k-

prototypes (for clustering hybrid data) algorithms. As an extension to CLARA, the CLARANS


algorithm was later proposed by Ng and Han (1994). Ester, Kriegel, and Xu (1995) proposed

efficient spatial access techniques, such as R*-tree and focusing techniques, to improve the

performance of CLARANS. An early survey of agglomerative hierarchical clustering algorithms

was conducted by Day and Edelsbrunner (1984). Kaufman and Rousseeuw (1990) also

introduced agglomerative hierarchical clustering, such as AGNES, and divisive hierarchical

clustering, such as DIANA. Later, Zhang, Ramakrishnan, and Livny (1996) proposed the BIRCH

algorithm which integrates hierarchical clustering with distance-based iterative relocation or

other nonhierarchical clustering methods to improve the clustering quality of hierarchical

clustering methods. The BIRCH algorithm partitions objects hierarchically using tree structures

whose leaf nodes (or low-level nonleaf nodes) are treated as microclusters, depending on the scale of

resolution. After that, it applies other clustering algorithms to perform macroclustering on the

microclusters. Proposed by Guha, Rastogi, and Shim (1998, 1999), CURE and ROCK utilized

linkage or nearest-neighbor analysis and its transformation to improve the conventional

hierarchical clustering. Exploring dynamic modeling in hierarchical clustering, the Chameleon

was proposed by Karypis, Han, and Kumar (1999). As an early density-based clustering method,

Ester, Kriegel, Sander, and Xu (1996) proposed DBSCAN, the first algorithm to utilize density in clustering, which requires a few parameters to be specified. After that, Ankerst, Breunig, Kriegel, and Sander (1999) proposed a cluster-ordering method, namely OPTICS, which facilitates density-based clustering without relying on a single global parameter setting. Almost at the same time, Hinneburg and Keim (1998) proposed the DENCLUE algorithm, which uses a set of

density distribution functions to glue similar objects together. As a grid-based multi-resolution

approach, STING was proposed by Wang, Yang, and Muntz (1997) to cluster objects using

statistical information collected in grid cells. Instead of the original feature space, Sheikholeslami,

Chatterjee, and Zhang (1998) applied wavelet transform to implement a multi-resolution

clustering method, namely WaveCluster, which is a combination of grid- and density-based

approach. As another hybrid of a grid- and density-based approach, CLIQUE was designed based

on Apriori by Agrawal, Gehrke, Gunopulos, and Raghavan (1998) to cope with high-dimensional

clustering using dimension-growth subspace clustering. As model-based clustering, Dempster,

Laird, and Rubin (1977) proposed a well-known statistics-based method, namely the EM

(Expectation-Maximization) algorithm. Handling missing data in EM methods was presented by

Lauritzen (1995). As a variant of the EM algorithm, AutoClass was proposed by Cheeseman and

Stutz (1996) with incorporation of Bayesian Theory. While conceptual clustering was first

introduced by Michalski and Stepp (1983), a popular example is COBWEB, invented by Fisher

(1987). A succeeding version is CLASSIT by Gennari, Langley, and Fisher (1989). The task of

association rule mining was first introduced by Agrawal, Imielinski, and Swami (1993). The

Apriori algorithm for frequent itemset mining and a method to generate association rules from

frequent itemsets were presented by Agrawal and Srikant (1994a, 1994b). Agrawal and Srikant

(1994b), Han and Fu (1995), and Park, Chen, and Yu (1995) described transaction reduction

techniques in their papers. Later, Pasquier, Bastide, Taouil, and Lakhal (1999) proposed a

method to mine frequent closed itemsets, namely A-Close, based on the Apriori algorithm. Later, Pei,

Han, and Mao (2000) proposed CLOSET, an efficient closed itemset mining algorithm based on

the frequent pattern growth method. As a further refined algorithm, CLOSET+ was invented by

Wang, Han, and Pei (2003). Savasere, Omiecinski, and Navathe (1995) introduced the

partitioning technique. Toivonen (1996) explored sampling techniques, while Brin, Motwani,

Ullman, and Tsur (1997) provided a dynamic itemset counting approach. Han, Pei, and Yin

(2000) proposed the FP-growth algorithm, a pattern-growth approach for mining frequent

itemsets without candidate generation. Grahne and Zhu (2003) introduced FPClose, a prefix-tree-

based algorithm for mining closed itemsets using the pattern-growth approach. Zaki (2000)


proposed an approach for mining frequent itemsets by exploring the vertical data format, called

ECLAT. Zaki and Hsiao (2002) presented an extension for mining closed frequent itemsets with

the vertical data format, called CHARM. Bayardo (1998) gave the first study on mining max-

patterns. Multilevel association mining was studied in Han and Fu (1995) and Srikant and

Agrawal (1995, 1997). In Srikant and Agrawal (1995, 1997), five algorithms named Basic,

Cumulate, Stratify, Estimate and EstMerge were proposed. These algorithms use the horizontal database format and a breadth-first search strategy, like Apriori-based algorithms. Later, Hipp, Myka, Wirth, and Guntzer (1998) proposed a method, namely Prutax, that uses hash-tree checking with the vertical database format to avoid generating meaningless itemsets and thereby reduce the computational time needed for multiple scans of the database. Lui and Chung (2000) proposed

an efficient method to discover generalized association rules with multiple minimum supports. A

parallel algorithm for generalized association rule mining (GARM) has also been proposed by

Shintani and Kitsuregawa (1998). Some recent applications that utilize GARM are shown by

Michail (2000) and Hwang and Lim (2002). Later, Sriphaew and Theeramunkong (2002, 2003,

2004) introduced two types of constraints on two generalized itemset relationships, called

subset-superset and ancestor-descendant constraints, to mine only a small set of generalized

closed frequent itemsets instead of mining a large set of conventional generalized frequent

itemsets. Two algorithms, named SET and cSET, are proposed by Sriphaew and Theeramunkong

(2004) to efficiently find generalized frequent itemsets and generalized closed frequent itemsets,

respectively.


Exercise

1. Apply the k-means algorithm to cluster the following data.

AreaType Humidity Temperature

Sea 75 40

Mountain 40 20

Mountain 45 25

Mountain 70 40

Sea 70 25

Here, use the following approaches as the method for clustering

Distance: Euclidean distance  d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )

For nominal attributes: use 0 and 1. For numeric attributes: use decimal scaling normalization.
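A small helper sketch for setting up Exercise 1 (not a solution): AreaType is coded as 0/1 and the numeric attributes are normalized by decimal scaling before Euclidean distances are computed; the particular 0/1 coding is an arbitrary choice.

    import math

    def decimal_scale(values):
        """Decimal scaling: divide by 10^j, where j is the number of digits of the
        largest absolute value, so every scaled value is below 1 in magnitude."""
        j = len(str(int(max(abs(v) for v in values))))
        return [v / 10 ** j for v in values]

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    area        = [1, 0, 0, 0, 1]                       # Sea = 1, Mountain = 0
    humidity    = decimal_scale([75, 40, 45, 70, 70])   # -> 0.75, 0.40, ...
    temperature = decimal_scale([40, 20, 25, 40, 25])

    points = list(zip(area, humidity, temperature))
    print(euclidean(points[0], points[1]))              # distance between rows 1 and 2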

2. Apply the hierarchical-based clustering for the previous problem.

3. Explain the merits and demerits of the grid-based clustering.

4. Apply DBSCAN to cluster the following data points. Here, set the minimum number of objects to three and the radius of the neighborhood to ε.

5. Explain the concept of conceptual clustering and EM algorithms, including their merits and

demerits.

6. Assume that the database is as follows. Find frequent itemsets and frequent rules when

minimum support and minimum confidence are set to 60% and 80%, respectively. Show the

process of Apriori, FP-Tree and CHARM methods.

TID ITEMS

1 apple, bacon, bread, pizza, potato, tuna, water

2 bacon, bread, cookie, corn, nut, shrimp, water

3 apple, bread, cookie, nut, pizza, potato, shrimp, tuna, water

4 apple, bacon, bread, cookie, nut, potato, water

5 bread, cookie, corn, nut, pizza, potato, water

6 apple, cookie, corn, nut, potato, shrimp, water

7 bacon, bread, cookie, corn, nut, tuna, water

8 apple, bread, nut, potato, shrimp, water


7. From the table in the previous question, describe how to calculate supports, confidence,

interestingness (correlation) and conviction. Enumerate a number of frequent rules and a

number of interesting rules.

8. Explain applications of generalized association rules in sales databases and mobile

environment.
