Copyright Jiawei Han, modified by Charles Ling for CS411a
Course Outline
Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining techniques
Trends and research issues
Data Mining and Warehousing: Session 7
Clustering Analysis
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
What Is Clustering?
Clustering is a process of partitioning a set of data (or objects)
into a set of meaningful sub-classes, called clusters.
May help users understand the natural grouping or
structure in a data set.
Cluster: a collection of data objects that are “similar” to one
another and thus can be treated collectively as one group.
Clustering: unsupervised classification: no predefined classes.
Used either as a stand-alone tool to get insight into data
distribution or as a preprocessing step for other algorithms.
What Is Good Clustering?
A good clustering method will produce high quality
clusters in which:
the intra-class (that is, intra-cluster) similarity is high.
the inter-class (that is, inter-cluster) similarity is low.
The quality of a clustering result also depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data Mining
Scalability
Dealing with different types of attributes
Discovery of clusters with arbitrary shape
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Interpretability and usability.
Applications of Clustering
Clustering has wide applications in
Pattern Recognition
Spatial Data Analysis:
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining.
Image Processing
Economic Science (especially market research)
WWW:
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth observation database.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
City-planning: Identifying groups of houses according to their house type, value, and geographical location.
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Some popular ones include: Minkowski distance:
$$d(i,j) = \left(|x_{i1}-x_{j1}|^{q} + |x_{i2}-x_{j2}|^{q} + \cdots + |x_{ip}-x_{jp}|^{q}\right)^{1/q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^{2} + |x_{i2}-x_{j2}|^{2} + \cdots + |x_{ip}-x_{jp}|^{2}}$$
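The Minkowski family described on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function name `minkowski_distance` and the sample objects `i` and `j` are assumptions made for the example, not part of the original slides.

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    # Sum the q-th powers of the per-dimension differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional objects used only for illustration.
i = (1.0, 2.0)
j = (4.0, 6.0)

manhattan = minkowski_distance(i, j, 1)  # q = 1: |1-4| + |2-6|
euclidean = minkowski_distance(i, j, 2)  # q = 2: sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Using a single function with q as a parameter mirrors the slide: Manhattan and Euclidean distance are the q = 1 and q = 2 special cases.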
Measure Similarity
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Values should be scaled (normalized to the range 0-1).
Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough”: the answer is typically highly subjective.
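A small sketch of the scaling and weighting advice on this slide, assuming min-max scaling for the 0-1 normalization and a weighted Manhattan distance as the measure. The slides name neither choice, and the attribute values and weights below are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_manhattan(x, y, weights):
    """Manhattan distance in which each variable contributes according to its weight."""
    return sum(w * abs(a - b) for a, b, w in zip(x, y, weights))


# Hypothetical attributes on very different scales.
ages = [20, 30, 60]
incomes = [20000, 50000, 120000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

# Objects 0 and 1 after scaling, with equal (assumed) weights on both variables.
obj0 = (scaled_ages[0], scaled_incomes[0])
obj1 = (scaled_ages[1], scaled_incomes[1])
d01 = weighted_manhattan(obj0, obj1, weights=(0.5, 0.5))
print(d01)
```

Without the scaling step, the income differences would dominate the distance simply because income is measured in larger units.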
Binary, Nominal, Continuous variables
Binary variable: d = 0 if x = y; d = 1 otherwise.
Nominal variables: more than 2 states, e.g., red, yellow, blue, green.
Simple matching: $d(i,j) = \frac{p-u}{p}$, where u is the number of matches and p is the total number of variables.
Also, one can use a large number of binary variables.
Continuous variables: d = |x - y|, with scaling and normalization.
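The simple matching measure for nominal variables can be sketched as follows, reading the slide's formula as d(i, j) = (p - u) / p with u the number of matching variables and p the total number of variables. The function names and the sample objects are assumptions for illustration.

```python
def binary_distance(x, y):
    """Distance between two values of a binary variable: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


def simple_matching_distance(obj_i, obj_j):
    """Simple matching dissimilarity d(i, j) = (p - u) / p for nominal variables."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must describe the same variables")
    p = len(obj_i)                                       # total number of variables
    u = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # number of matches
    return (p - u) / p


# Hypothetical objects described by three nominal variables.
first = ("red", "small", "round")
second = ("red", "large", "round")
d = simple_matching_distance(first, second)  # 2 of 3 variables match
print(d)
```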
Five Categories of Clustering Methods
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion.
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
– Global optimal: exhaustively enumerate all partitions.
– Heuristic methods: k-means and k-medoids algorithms.
– k-means (MacQueen’67): Each cluster is represented by the center of the cluster.
– k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster.
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more new assignments are made.
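The four steps can be sketched in plain Python as below. Several details are assumptions the slides do not fix: the initial partition is a round-robin assignment, nearness is measured by squared Euclidean distance, a cluster that becomes empty keeps its previous centroid, and `max_iterations` is a safety cap added for the example.

```python
def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of objects")
    # Step 1: partition the objects into k nonempty subsets (round-robin, an assumption).
    labels = [idx % k for idx in range(len(points))]
    centers = [None] * k
    for _ in range(max_iterations):
        # Step 2: compute the centroid of each cluster in the current partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centers[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, 2)
print(labels, centers)
```

On this toy data the round-robin start mixes the two groups, and one reassignment pass is enough to separate them; on harder data the result depends on the initial partition.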
[Figure: four scatter-plot panels, both axes running from 0 to 10; the plotted points did not survive extraction.]
Comments on the K-Means Method
Strength of the k-means:
– Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n.
– Often terminates at a local optimum.
Weakness of the k-means:
– Applicable only when mean is defined; then what about categorical data?
– Need to specify k, the number of clusters, in advance.
– Unable to handle noisy data and outliers.
– Not suitable to discover clusters with non-convex shapes.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters.
To achieve this goal, only the definition of distance between any two objects is needed.
PAM (Partitioning Around Medoids, 1987):
– starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
– PAM works effectively for small data sets, but does not scale well for large data sets.
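The swap idea behind PAM can be sketched as follows. This is a simplified illustration rather than the published algorithm: it takes the first k objects as the initial medoids (an assumption) and accepts any swap that lowers the total distance, where the original procedure evaluates swaps more carefully. The function names and data are hypothetical.

```python
def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist):
    medoids = list(points[:k])  # assumed initialization: the first k objects
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing one medoid by a non-medoid object.
                trial = medoids[:mi] + [candidate] + medoids[mi + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:  # keep the swap only if the total distance improves
                    medoids, best, improved = trial, cost, True
    return medoids, best


# Hypothetical 1-dimensional data; only a pairwise distance function is required.
points = [1, 2, 3, 10, 11, 12]
medoids, cost = pam(points, 2, lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```

Every pass re-evaluates the total distance for each possible swap, which is why the approach becomes expensive on large data sets, as the slide notes.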
Two Types of Hierarchical Clustering Algorithms
Agglomerative (bottom-up): merge clusters iteratively.
– Start by placing each object in its own cluster.
– Merge these atomic clusters into larger and larger clusters until all objects are in a single cluster.
– Most hierarchical methods belong to this category. They differ only in their definition of between-cluster similarity.
Divisive (top-down): split a cluster iteratively.
– It does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces.
– Divisive methods are not generally available, and rarely have been applied.
Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.
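[Figure: agglomerative (AGNES) clustering of objects a, b, c, d, e from Step 0 to Step 4: a and b merge into {a, b}, d and e into {d, e}, c joins to form {c, d, e}, and finally all objects form {a, b, c, d, e}. Divisive (DIANA) clustering runs the same sequence in reverse, from Step 4 back to Step 0.]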
More on Hierarchical Clustering Methods
Between-cluster similarity:
– Minimal distance
– Maximal distance
– Center distance
Major weakness of agglomerative clustering methods:
– do not scale well: time complexity of at least O(n^2), where n is the number of total objects
– can never undo what was done previously.
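A naive agglomerative procedure with the three between-cluster distances listed on this slide can be sketched as below. It is deliberately simple and recomputes every pairwise cluster distance on each merge, so it is slower than the O(n^2) lower bound mentioned above. The `target_clusters` stopping rule, the function names, and the sample coordinates are assumptions for illustration.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def center(cluster):
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def minimal_distance(c1, c2):
    """Distance between the closest pair of objects, one from each cluster."""
    return min(euclidean(a, b) for a in c1 for b in c2)


def maximal_distance(c1, c2):
    """Distance between the farthest pair of objects, one from each cluster."""
    return max(euclidean(a, b) for a in c1 for b in c2)


def center_distance(c1, c2):
    """Distance between the mean points of the two clusters."""
    return euclidean(center(c1), center(c2))


def agglomerative(points, between_cluster_distance, target_clusters=1):
    clusters = [[p] for p in points]  # start with each object in its own cluster
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters under the chosen distance and merge it.
        _, a, b = min(
            (between_cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)  # a merge is never undone
    return clusters


# Hypothetical 1-dimensional objects a, b, c, d, e.
points = [(0,), (0.5,), (5,), (7,), (8,)]
clusters = agglomerative(points, minimal_distance, target_clusters=2)
print(clusters)
```

Passing `maximal_distance` or `center_distance` instead changes only how the closest pair of clusters is chosen; the merge loop stays the same.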
Integration of hierarchical clustering with distance-based method:
What Is Outlier Discovery?
What are outliers?
– A set of objects that are considerably dissimilar from the remainder of the data.
– Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem:
– Given: data points.
– Find: the top n outlier points.
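The slide states the problem but not a method. As one possible illustration, the sketch below scores each point by the distance to its k-th nearest neighbor and returns the n highest-scoring points. This scoring rule, the parameter `k`, and the sample data are all assumptions and are not taken from the slides.

```python
def top_n_outliers(points, n, k=2):
    """Return the n points whose k-th nearest neighbor is farthest away."""
    if len(points) <= k:
        raise ValueError("need more than k points")

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scores = []
    for idx, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        neighbor_distances = sorted(
            dist(p, q) for j, q in enumerate(points) if j != idx
        )
        scores.append((neighbor_distances[k - 1], idx))
    scores.sort(reverse=True)  # largest k-th neighbor distance first
    return [points[idx] for _, idx in scores[:n]]


# Hypothetical data: four points close together and one far away.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
outliers = top_n_outliers(data, n=1)
print(outliers)
```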
Session 6: Association Analysis
What is association analysis?
Mining single-dimensional Boolean association
rules in transactional databases
Mining multi-level association rules
What Is an Association Rule?
Given:
– A database of customer transactions.
– Each transaction is a list of items (purchased by a customer in a visit).
Find all rules that correlate the presence of one set of items with that of another set of items.
– Example: 98% of people who purchase tires and auto accessories also get automotive services done.
Any number of items in the consequent/antecedent of a rule.
Possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).
Application Examples
Market Basket Analysis:
– * ⇒ Maintenance Agreement: what the store should do to boost Maintenance Agreement sales.
– Home Electronics ⇒ *: what other products the store should stock up on if it has a sale on Home Electronics.
Attached mailing in direct marketing.
Detecting “ping-pong”ing of patients:
– transaction: patient
– item: doctor/clinic visited by a patient
– support of a rule: number of common patients
Rule Measures: Support and Confidence
Find all the rules X & Y ⇒ Z with minimum confidence and support:
– support, s: probability that a transaction contains {X, Y, Z}
– confidence, c: conditional probability that a transaction having {X, Y} also contains Z.
Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F
Let minimum support be 50% and minimum confidence be 50%; then we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
[Figure: Venn diagram with regions labeled “Customer buys diaper”, “Customer buys both”, and “Customer buys beer”.]
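The support and confidence figures for the four-transaction example can be checked with a short Python sketch. The transaction data comes from the slide; the function names and the dictionary layout are assumptions for illustration.

```python
# Transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Conditional probability that a transaction with the antecedent also has the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s_ac = support({"A", "C"}, transactions)       # 2 of 4 transactions
c_a_c = confidence({"A"}, {"C"}, transactions)  # 2 of the 3 transactions with A
c_c_a = confidence({"C"}, {"A"}, transactions)  # 2 of the 2 transactions with C
print(s_ac, c_a_c, c_c_a)
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidence differs because the denominators, support({A}) and support({C}), differ.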
Mining Association Rules -- Example
For rule A ⇒ C:
– support = support({A, C}) = 50%
– confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle: Any subset of a frequent itemset must be frequent.
Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F
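The Apriori principle suggests a level-wise search: build candidate itemsets of size k only from frequent itemsets of size k - 1, and discard any candidate that has an infrequent subset. The sketch below is a minimal illustration of that idea on the slide's four transactions, not an optimized implementation; the function name `apriori` and its return format are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (a frozenset) to its support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        current_set = set(current)
        # Join step: combine frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step (Apriori principle): every (size - 1)-subset must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in current_set for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return frequent


# Transactions from the slide.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent = apriori(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a 50% minimum support, the frequent itemsets found are {A}, {B}, {C}, and {A, C}, which agrees with the support of 50% for {A, C} computed on the slide.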