Chapter 8 The k-Means Algorithm and Genetic Algorithm

Data Warehouse and Data Mining, Chapter 8

Contents

• k-Means algorithm

• Genetic algorithm

• Rough set approach

• Fuzzy set approaches

The K-Means Algorithm

The K-Means algorithm is a simple yet effective statistical clustering technique.

Here is the algorithm:

1. Choose a value for K, the total number of clusters to be determined.

2. Choose K instances (data points) within the dataset at random. These are the initial cluster centers.

3. Use simple Euclidean distance to assign the remaining instances to their closest cluster center.

4. Use the instances in each cluster to calculate a new mean for each cluster.

5. If the new mean values are identical to the mean values of the previous iteration, the process terminates. Otherwise, use the new means as cluster centers and repeat steps 3-5.
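The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `seed` parameter, the sample points, and the choice to keep the old center of a cluster that becomes empty are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, seed=0):
    """Cluster `points` (a list of numeric tuples) into k groups.

    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Step 2: choose K instances at random as the initial cluster centers.
    centers = rng.sample(points, k)
    while True:
        # Step 3: assign each instance to its closest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 4: compute a new mean for each cluster from its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 5: terminate when the means are identical to the previous ones.
        if new_centers == centers:
            return centers, labels
        centers = new_centers


# Made-up data: two loose groups of points.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

Because the initial centers are chosen at random, different seeds can lead to different final clusters.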

The K-Means Algorithm: An Example Using K-Means

[Slides 5-6: example tables and figures not captured in this transcript.]
The K-Means Algorithm: General Considerations

[Slides 7-8: slide content not captured in this transcript.]
The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-dimensional space.

• The nearest neighbors are defined in terms of Euclidean distance.

• The target function can be discrete- or real-valued.

[Figure: a query point xq surrounded by positive (+) and negative (_) training examples]

• For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq.

• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
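The majority vote for a discrete-valued target can be sketched as follows. The function name `knn_classify` and the training points are made up for illustration; the data is arranged so that 1-NN and 5-NN disagree on the query point.

```python
import math
from collections import Counter


def knn_classify(training, xq, k):
    """training: list of (point, label) pairs.

    Returns the most common label among the k examples nearest to xq.
    """
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: one "-" example sits very close to the query.
training = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
            ((5.0, 5.0), "-"), ((5.5, 4.5), "-"), ((1.8, 1.6), "-")]
xq = (1.7, 1.5)
print(knn_classify(training, xq, k=1))  # "-": the single nearest example
print(knn_classify(training, xq, k=5))  # "+": three of the five nearest
```

A small k follows local detail (and noise) closely, while a larger k smooths the decision over more neighbors.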

Discussion on the k-NN Algorithm

• The k-NN algorithm for continuous-valued target functions
  – Calculate the mean value of the k nearest neighbors

• Distance-weighted nearest neighbor algorithm
  – Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors:

    $w \equiv \dfrac{1}{d(x_q, x_i)^2}$

  – Similarly, for real-valued target functions
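For a real-valued target, the inverse-square weight can be applied as a weighted mean of the k nearest values. The sketch below assumes that form; the name `weighted_knn_predict`, the one-dimensional sample data, and the handling of a zero distance (return the matching training value directly) are illustrative choices.

```python
import math


def weighted_knn_predict(training, xq, k):
    """training: list of (point, value) pairs.

    Each of the k nearest neighbors xi gets weight w = 1 / d(xq, xi)^2.
    """
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, value in nearest:
        d = math.dist(xi, xq)
        if d == 0.0:
            # Assumption: a query that coincides with a training point
            # simply takes that point's value (avoids division by zero).
            return value
        w = 1.0 / d ** 2
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Made-up one-dimensional data.
training = [((0.0,), 0.0), ((1.0,), 10.0), ((3.0,), 30.0)]
print(weighted_knn_predict(training, (2.0,), k=2))  # 20.0: two equidistant neighbors
print(weighted_knn_predict(training, (2.0,), k=3))  # pulled slightly toward the far point
```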

Genetic Learning

Here we present a basic genetic learning algorithm.

1. Initialize a population P of n elements, often referred to as chromosomes, as a potential solution.

2. Until a specified termination condition is satisfied:

a. Use a fitness function to evaluate each element of the current solution. If an element passes the fitness criteria, it remains in P.

b. The population now contains m elements (m<=n). Use genetic operators to create (n-m) new elements. Add the new elements to the population.
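The loop above can be sketched on a toy problem. Everything specific here is an assumption made for the example: the fitness function (count the 1-bits), the fitness criterion (be in the better half of the population), the termination condition (a fixed number of generations), and the 10% mutation rate.

```python
import random


def fitness(chromosome):
    # Hypothetical objective: the number of 1-bits in the chromosome.
    return sum(chromosome)


def genetic_search(n=20, length=12, generations=50, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize a population P of n random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    # Step 2: assumed termination condition is a fixed generation count.
    for _ in range(generations):
        # Step 2a: elements passing the fitness criterion remain in P
        # (assumed criterion: rank in the better half).
        population.sort(key=fitness, reverse=True)
        survivors = population[: n // 2]  # m elements, m <= n
        # Step 2b: create n - m new elements with genetic operators.
        children = []
        while len(survivors) + len(children) < n:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]          # single-point crossover
            if rng.random() < 0.1:               # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = genetic_search()
print(best, fitness(best))
```

Since the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.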

Genetic Learning: Genetic Algorithms and Supervised Learning

[Slides 13-17: figures and tables not captured in this transcript.]

Genetic Learning: Genetic Algorithms and Unsupervised Clustering

[Slides 18-19: figures and tables not captured in this transcript.]

Genetic Learning: General Considerations

Here is a list of considerations when using a problem-solving approach based on genetic learning:

Genetic algorithms are designed to find globally optimized solutions. However, there is no guarantee that a given solution is a global rather than a local optimum.

The fitness function determines the computational complexity of a genetic algorithm. A fitness function involving several calculations can be computationally expensive.

Genetic algorithms explain their results to the extent that the fitness function is understandable.

Transforming the data to a form suitable for a genetic algorithm can be a challenge.

Genetic Algorithms

• GA: based on an analogy to biological evolution

• Each rule is represented by a string of bits

• An initial population is created consisting of randomly generated rules

• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring

• The fitness of a rule is represented by its classification accuracy on a set of training examples

• Offspring are generated by crossover and mutation
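As a small illustration of a bit-string rule and its fitness, the sketch below assumes a hypothetical three-bit encoding: the first two bits give the required values of two Boolean attributes and the third bit gives the predicted class. Fitness is taken to be the accuracy on the training examples the rule covers, which is one possible reading of "classification accuracy".

```python
def rule_fitness(rule, examples):
    """rule: a string such as "101" (attribute 1 = 1, attribute 2 = 0, class = 1).

    examples: list of ((a1, a2), cls) pairs with 0/1 values.
    Returns the fraction of covered examples whose class matches the rule.
    """
    a1, a2, predicted = (int(bit) for bit in rule)
    covered = [cls for attrs, cls in examples if attrs == (a1, a2)]
    if not covered:
        return 0.0  # assumption: a rule that covers nothing scores zero
    return sum(cls == predicted for cls in covered) / len(covered)


# Made-up training examples.
examples = [((1, 0), 1), ((1, 0), 1), ((1, 0), 0), ((0, 1), 0)]
for rule in ("101", "100", "010", "111"):
    print(rule, rule_fitness(rule, examples))
```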

• Population-based technique for discovery of knowledge structures

• Based on the idea that evolution represents a search for an optimum solution set

• Massively parallel

The Vocabulary of GAs

• Population
  – Set of individuals, each represented by one or more strings of characters

• Chromosome
  – The string representing an individual

The Vocabulary of GAs, contd.

[Figure: the chromosome 011010 with one gene marked (allele = "0", locus = 5)]

• Gene: The basic informational unit on a chromosome

• Locus: The ordinal place on a chromosome where a specific gene is found

• Allele: The value of a specific gene

Genetic operators

• Reproduction
  – Increase representations of strong individuals

• Crossover
  – Explore the search space

• Mutation
  – Recapture “lost” genes due to crossover

Genetic operators illustrated...

Simple reproduction
  Parent 1: 011010    Offspring 1: 011010
  Parent 2: 000110    Offspring 2: 000110

Reproduction with crossover at locus 3
  Parent 1: 011010    Offspring 1: 011110
  Parent 2: 000110    Offspring 2: 000010

Simple reproduction with mutation at locus 3 for offspring 1
  Parent 1: 011010    Offspring 1: 010010
  Parent 2: 000110    Offspring 2: 000110
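The crossover and mutation cases on this slide (parents 011010 and 000110, locus 3) can be reproduced with two short functions. The function names are illustrative, and the locus is assumed to be counted from 1 starting at the left, which is the convention that matches the offspring shown.

```python
def crossover(parent1, parent2, locus):
    """Single-point crossover: keep the first `locus` genes, swap the rest."""
    child1 = parent1[:locus] + parent2[locus:]
    child2 = parent2[:locus] + parent1[locus:]
    return child1, child2


def mutate(chromosome, locus):
    """Flip the bit at the given locus (1-based, counted from the left)."""
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


parent1, parent2 = "011010", "000110"
print(crossover(parent1, parent2, 3))  # ('011110', '000010')
print(mutate(parent1, 3))              # 010010
```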

GAs rely on the concept of “fitness”

• Ability of an individual to survive into the next generation

• “Survival of the fittest”

• Usually calculated in terms of an objective fitness function
  – Maximization
  – Minimization
  – Other functions

Genetic Programming

• Based on adaptation and evolution

• Structures undergoing adaptation are computer programs of varying size and shape

• Computer programs are genetically “bred” over time

The Learning Classifier System

• Rule-based knowledge discovery and concept learning tool

• Operates by means of evaluation, credit assignment, and discovery applied to a population of “chromosomes” (rules) each with a corresponding “phenotype” (outcome)

Components of a Learning Classifier System

• Performance
  – Provides interaction between environment and rule base
  – Performs matching function

• Reinforcement
  – Rewards accurate classifiers
  – Punishes inaccurate classifiers

• Discovery
  – Uses the genetic algorithm to search for plausible rules

Rough Set Approach

• Rough sets are used to approximately or “roughly” define equivalence classes

• A rough set for a given class C is approximated by two sets:
  – a lower approximation (certain to be in C)
  – an upper approximation (cannot be described as not belonging to C)
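The two approximations can be computed directly from a partition of the data into equivalence classes (groups of instances that are indistinguishable on the available attributes). The sketch below uses made-up instance identifiers; the function name `approximations` is illustrative.

```python
def approximations(equivalence_classes, target):
    """Return the (lower, upper) approximations of the set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes:
        if block <= target:
            lower |= block  # every member is in C: certain to be in C
        if block & target:
            upper |= block  # at least one member is in C: cannot be ruled out
    return lower, upper


# Made-up partition of instances 1..7 and a class C.
blocks = [{1, 2}, {3, 4}, {5}, {6, 7}]
C = {1, 2, 3, 5}
lower, upper = approximations(blocks, C)
print(sorted(lower))  # [1, 2, 5]
print(sorted(upper))  # [1, 2, 3, 4, 5]
```

Instances 3 and 4 cannot be told apart, and only one of them is in C, so they appear in the upper approximation but not in the lower one.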

Fuzzy Set Approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, by using a fuzzy membership graph)

• Attribute values are converted to fuzzy values
  – e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated

• For a given new sample, more than one fuzzy value may apply

• Each applicable rule contributes a vote for membership in the categories

• Typically, the truth values for each predicted category are summed.
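A minimal sketch of the idea, using the income example: the membership breakpoints (20, 40, 60, 80, in thousands), the rule list, and the class names "approve" and "reject" are all hypothetical. Each rule votes for its predicted category with the truth value of its fuzzy condition, and the votes are summed per category.

```python
def income_memberships(income):
    """Hypothetical membership graph for income (in thousands)."""
    def falling(x, lo, hi):  # 1.0 at or below lo, 0.0 at or above hi
        return min(1.0, max(0.0, (hi - x) / (hi - lo)))

    def rising(x, lo, hi):   # 0.0 at or below lo, 1.0 at or above hi
        return min(1.0, max(0.0, (x - lo) / (hi - lo)))

    return {
        "low": falling(income, 20, 40),
        "medium": min(rising(income, 20, 40), falling(income, 60, 80)),
        "high": rising(income, 60, 80),
    }


def classify(income, rules):
    """Sum the truth values of all applicable rules per predicted category."""
    truth = income_memberships(income)
    votes = {}
    for fuzzy_value, predicted in rules:
        votes[predicted] = votes.get(predicted, 0.0) + truth[fuzzy_value]
    return votes


# Hypothetical rules: (fuzzy income value, predicted category).
rules = [("low", "reject"), ("medium", "approve"), ("high", "approve")]
print(income_memberships(30))  # {'low': 0.5, 'medium': 0.5, 'high': 0.0}
print(classify(70, rules))     # {'reject': 0.0, 'approve': 1.0}
```

An income of 70 is partly "medium" and partly "high", so two rules contribute to the same category.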

Chapter Summary

• The K-Means algorithm is a statistical unsupervised clustering technique.

• All input attributes to the algorithm must be numeric and the user is required to make a decision about how many clusters are to be discovered.

• The algorithm begins by randomly choosing one data point to represent each cluster.

• Each data instance is then placed in the cluster to which it is most similar.

• New cluster centers are computed and the process continues until the cluster centers do not change.

• The K-Means algorithm is easy to implement and understand. However, the algorithm
  – is not guaranteed to converge to a globally optimal solution,
  – lacks the ability to explain what has been found, and
  – is unable to tell which attributes are significant in determining the formed clusters.

• Despite these limitations, the K-Means algorithm is among the most widely used clustering techniques.

• Genetic algorithms apply the theory of evolution to inductive learning.

• Genetic learning can be supervised or unsupervised.

• Genetic learning is typically used for problems that cannot be solved with traditional techniques.

• A standard genetic approach to learning applies a fitness function to a set of data elements to determine which elements survive from one generation to the next.

• Those elements not surviving are used to create new instances to replace deleted elements.

• In addition to being used for supervised learning and unsupervised clustering, genetic techniques can be employed in conjunction with other learning techniques.

Key Terms

Affinity analysis. The process of determining which things are typically grouped together.

Confidence. Given a rule of the form “If A then B,” confidence is defined as the conditional probability that B is true when A is known to be true.

Crossover. A genetic learning operation that creates new population elements by combining parts of two or more elements from the current population.

Genetic algorithm. A data mining technique based on the theory of evolution.

Mutation. A genetic learning operation that creates a new population element by randomly modifying a portion of an existing element.

Selection. A genetic learning operation that adds copies of current population elements with high fitness scores to the next generation of the population.

Reference

Data Mining: Concepts and Techniques (Chapter 7 Slide for textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada