HINDUSTHAN COLLEGE OF ARTS AND SCIENCE AUTONOMOUS
(Affiliated to Bharathiar University)
Behind Nava India, Coimbatore - 641028.
DEPARTMENT OF COMPUTER APPLICATIONS (BCA)
III BCA
DATAMINING NOTES – UNIT III
Prepared By
Dr.Sasikala,Asst Prof
Mr.Aravind,Asst Prof
Mrs.Saranya,Asst Prof
UNIT III
Clustering: Introduction – Similarity and Distance Measures – Outliers – Hierarchical Algorithms – Partitional Algorithms. Association – advanced association rule techniques – measuring the quality of rules.
CLUSTERING
INTRODUCTION
Clustering Example
Clustering Houses
Clustering vs. Classification
No prior knowledge
o Number of clusters
o Meaning of clusters
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering
Clustering Problem: Definition
Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k. A cluster Kj contains precisely those tuples mapped to it. Unlike the classification problem, the clusters are not known a priori.
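As a small illustration (the tuples and the assignment below are made up for this example, not taken from the notes), a clustering with k = 2 can be stored simply as a mapping from each tuple to its cluster label:

# Hypothetical database D of five tuples and a mapping f: D -> {1, 2}.
D = ["t1", "t2", "t3", "t4", "t5"]
f = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 2}

# Each cluster Kj contains precisely the tuples mapped to label j.
clusters = {j: [t for t in D if f[t] == j] for j in (1, 2)}
print(clusters)  # {1: ['t1', 't2'], 2: ['t3', 't4', 't5']}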
Types of Clustering
Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created.
Incremental – Each element handled one at a time.
Simultaneous – All elements handled together.
Overlapping/Non-overlapping
Clustering Approaches
Cluster Parameters
SIMILARITY AND DISTANCE MEASURES
Since clustering is the grouping of similar instances/objects, some sort of measure that can determine whether two objects are similar or dissimilar is required. There are two main types of measures used to estimate this relation: distance measures and similarity measures.
Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances xi and xj as d(xi, xj). A valid distance measure should be symmetric and attain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:
1. Triangle inequality: d(xi, xk) <= d(xi, xj) + d(xj, xk) for all xi, xj, xk in S.
2. Identity: d(xi, xj) = 0 implies xi = xj for all xi, xj in S.
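As a sketch of a metric distance measure (not part of the original notes), the Euclidean distance is symmetric, is zero only for identical vectors, and satisfies the triangle inequality; the small check below assumes numeric feature vectors of equal length:

import math

def euclidean(x, y):
    # Euclidean distance between two equal-length numeric vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

xi, xj, xk = (1.0, 2.0), (4.0, 6.0), (0.0, 0.0)
print(euclidean(xi, xj))                                           # 5.0
print(euclidean(xi, xj) == euclidean(xj, xi))                      # symmetry
print(euclidean(xi, xk) <= euclidean(xi, xj) + euclidean(xj, xk))  # triangle inequality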
An alternative concept to that of the distance is the similarity function s(xi, xj) that compares the two vectors xi and xj (Duda et al., 2001). This function should be symmetric (namely s(xi, xj) = s(xj, xi)), should have a large value when xi and xj are somehow "similar", and should attain its largest value for identical vectors.
A similarity function where the target range is [0, 1] is called a dichotomous similarity function. In fact, the methods described in the previous sections for calculating the "distances" in the case of binary and nominal attributes may be considered as similarity functions rather than distances.
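One simple way to obtain such a similarity function (a sketch, not a formula given in the notes) is to turn a distance into a value in (0, 1], so that identical vectors get the largest value 1:

import math

def similarity(x, y):
    # Symmetric similarity: 1 for identical vectors, shrinking toward 0
    # as the Euclidean distance between x and y grows.
    return 1.0 / (1.0 + math.dist(x, y))

print(similarity((1.0, 2.0), (1.0, 2.0)))  # 1.0
print(similarity((1.0, 2.0), (4.0, 6.0)))  # 1 / 6 = 0.166...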
OUTLIERS
Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Outlier detection and analysis is an interesting data mining task referred to as outlier mining or outlier analysis.
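A minimal distance-based sketch of outlier detection (the rule of thumb and the sample data are assumptions made for illustration, not a method prescribed by the notes) flags a point as an outlier when it lies much farther from the mean than the points do on average:

import math

def flag_outliers(points, factor=2.0):
    # Flag points whose distance from the mean exceeds `factor` times
    # the average distance of all points from the mean.
    n = len(points)
    dim = len(points[0])
    mean = tuple(sum(p[i] for p in points) / n for i in range(dim))
    dists = [math.dist(p, mean) for p in points]
    avg = sum(dists) / n
    return [p for p, d in zip(points, dists) if d > factor * avg]

data = [(1, 1), (2, 1), (1, 2), (2, 2), (25, 30)]
print(flag_outliers(data))  # [(25, 30)]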
Distance Between Clusters
Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
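The four inter-cluster distances above can be sketched as follows, assuming clusters are given as lists of numeric points (the two sample clusters are invented for the example):

import math
from itertools import product

def single_link(A, B):
    # Smallest distance between any point of A and any point of B.
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    # Largest distance between any point of A and any point of B.
    return max(math.dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    # Average over all pairwise distances between A and B.
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_dist(A, B):
    # Distance between the centroids (mean points) of A and B.
    cA = tuple(sum(p[i] for p in A) / len(A) for i in range(len(A[0])))
    cB = tuple(sum(p[i] for p in B) / len(B) for i in range(len(B[0])))
    return math.dist(cA, cB)

K1 = [(0, 0), (1, 0)]
K2 = [(3, 0), (5, 0)]
print(single_link(K1, K2), complete_link(K1, K2),
      average_link(K1, K2), centroid_dist(K1, K2))  # 2.0 5.0 3.5 3.5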
HIERARCHICAL CLUSTERING
Clusters are created in levels actually creating sets of clusters at each level.
1. Agglomerative
Initially each item in its own cluster
Iteratively clusters are merged together
Bottom Up
2. Divisive
Initially all items in one cluster
Large clusters are successively divided
Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
Leaf – individual clusters
Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.
Levels of Clustering
Agglomerative Example
Distance matrix for items A–E:
     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0
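The same matrix can also be handed to a library routine as a sketch; this assumes SciPy is available and uses its single-link agglomerative clustering, cutting the dendrogram at distance 1 just to show the first merges:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Distance matrix for items A..E from the example above.
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

# Single-link (minimum distance) agglomerative clustering.
Z = linkage(squareform(D), method="single")

# Flat clusters at threshold 1: A and B end up together,
# C and D together, and E on its own at this level.
print(fcluster(Z, t=1, criterion="distance"))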
MST Example
Agglomerative Algorithm
Single Link
Single link clustering views all items as a graph with links (distances) between them and finds the maximal connected components in this graph. Two clusters are merged if there is at least one edge that connects them. Threshold distances are used at each level. The approach can be agglomerative or divisive.
Single Link Clustering
PARTITIONING ALGORITHMS
Partitional Clustering
Nonhierarchical: creates clusters in one step as opposed to several steps. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. Usually deals with static sets.
MST Algorithm
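The notes do not spell the MST algorithm out, but a common minimum-spanning-tree clustering sketch is: build an MST over the items, then remove the k - 1 longest MST edges so that k connected components remain.

import math

def mst_clusters(points, k):
    # Build a minimum spanning tree with Prim's algorithm, then keep only
    # the lightest n - k edges (i.e. drop the k - 1 heaviest); the remaining
    # connected components are the clusters.
    n = len(points)
    in_tree = {0}
    edges = []                         # MST edges as (weight, i, j)
    while len(in_tree) < n:
        w, i, j = min((math.dist(points[i], points[j]), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((w, i, j))
        in_tree.add(j)
    kept = sorted(edges)[: n - k]      # drop the k - 1 heaviest MST edges

    # Label connected components over the kept edges (simple union-find).
    labels = list(range(n))
    def find(x):
        while labels[x] != x:
            x = labels[x]
        return x
    for _, i, j in kept:
        labels[find(i)] = find(j)
    return [find(i) for i in range(n)]

data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(mst_clusters(data, k=3))  # first two points share a label, next two share a label, (20, 20) is alone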
Squared Error
Minimizes the squared error: the total squared distance between each item and the center of its cluster.
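Written out explicitly (a standard formulation, not reproduced from the notes), for clusters K1, …, Kk with centers C1, …, Ck the squared error to be minimized is:

se(K) = \sum_{j=1}^{k} \sum_{t \in K_j} \lVert t - C_j \rVert^{2}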
Squared Error Algorithm
K-Means
Initial set of clusters is randomly chosen. Iteratively, items are moved among sets of clusters until the desired set is reached. A high degree of similarity among elements in a cluster is obtained. Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
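A compact K-means sketch (an illustration only: it assumes numeric vectors with Euclidean distance and, to keep the example deterministic, seeds the means with the first k points instead of a random choice):

import math

def kmeans(points, k, iters=100):
    # Seed the means with the first k points (a deterministic stand-in
    # for the random initial clusters described above).
    means = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[j].append(p)
        # Update step: each mean mi becomes (1/m)(ti1 + ... + tim).
        new_means = [
            [sum(c) / len(cluster) for c in zip(*cluster)] if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:   # converged: the means no longer change
            break
        means = new_means
    return clusters, means

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
clusters, means = kmeans(data, k=2)
print(clusters)
print(means)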