DATA MINING: CONCEPTS AND TECHNIQUES
UNIT-III
Part-I: Classification and Prediction
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
*Rule-based classification
Classification by backpropagation (Neural Network)
*Support Vector Machines (SVM)
*Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
*Prediction
*Accuracy and error measures
*Ensemble methods
*Model selection
Summary
Objectives
Classification vs. Prediction
Classification:
Predicts categorical class labels (discrete or nominal).
Classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data.
Prediction:
Models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications: credit approval, document categorization, target marketing, medical diagnosis, treatment effectiveness analysis, fraud detection.
Classification types
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified.
The test set is independent of the training set; otherwise over-fitting will occur.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
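As a concrete illustration of the two steps, here is a minimal sketch assuming scikit-learn and its bundled Iris data (both are illustrative choices, not part of the original slides; any classifier and dataset could be substituted):

```python
# A minimal sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage. Estimate accuracy on an independent test set,
# then classify unseen tuples if the accuracy is acceptable.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test-set accuracy: {accuracy:.2f}")
```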
Example-1: Model Construction
Example-1: Using the Model in Prediction
Example-2 Process (1): Model Construction
[Figure: Training Data → Classification Algorithm → Classifier (Model)]
Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
[Figure: Testing Data → Classifier → predicted label]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured? By the learned rule, yes.
How does classification work?
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Issues: Data Preparation
Data cleaning: preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection): remove irrelevant or redundant attributes.
Data transformation: generalize and/or normalize data.
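A small sketch of these preparation steps, assuming scikit-learn's preprocessing utilities (the tiny array is made-up illustration data):

```python
# Data preparation sketch: imputation for missing values,
# feature selection, and normalization.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0, 5.0],
              [2.0, np.nan, 5.0],
              [3.0, 180.0, 5.0]])

X = SimpleImputer(strategy="mean").fit_transform(X)   # data cleaning
X = VarianceThreshold().fit_transform(X)              # drop constant attributes
X = MinMaxScaler().fit_transform(X)                   # normalization to [0, 1]
print(X)
```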
Issues: Evaluating Classification Methods
Accuracy:
Classifier accuracy: predicting class labels.
Predictor accuracy: estimating the value of the predicted attribute.
Speed:
Time to construct the model (training time).
Time to use the model (classification/prediction time).
Robustness: handling noise and missing values.
Scalability: efficiency for disk-resident databases.
Interpretability: understanding and insight provided by the model.
Other measures: e.g., goodness of rules, such as decision tree size or compactness of classification rules.
Decision Tree Induction: Training Dataset
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
This follows an example of Quinlan’s ID3 (Playing Tennis)
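As an illustration, the dataset above can be fed to a decision tree learner. This is a sketch using scikit-learn, which builds binary trees over one-hot encodings rather than ID3's multiway categorical splits:

```python
# Fit an entropy-based decision tree to the buys_computer data above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])
X = pd.get_dummies(df.drop(columns="buys_computer"))  # one-hot encode categories
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, df["buys_computer"])
print(export_text(tree, feature_names=list(X.columns)))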
Decision Tree
Decision Tree Induction

Decision Tree Boundary
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
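A short sketch of these formulas in plain Python, computing Gain(age) for the buys_computer data above (the class-label partitions are read directly off the table):

```python
# Entropy and information gain for the 'age' attribute.
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# buys_computer labels partitioned by 'age' (from the table above).
partitions = {
    "<=30":  ["no", "no", "no", "yes", "yes"],
    "31…40": ["yes", "yes", "yes", "yes"],
    ">40":   ["yes", "yes", "no", "yes", "no"],
}
labels = [l for part in partitions.values() for l in part]

info_d = info(labels)                                            # ~0.940
info_age = sum(len(p) / len(labels) * info(p) for p in partitions.values())
print(f"Gain(age) = {info_d - info_age:.3f}")                    # ~0.246
```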
Gain Ratio for Attribute Selection (C4.5)
C4.5 (a successor of ID3) uses gain ratio to overcome information gain's bias toward attributes with many values:
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from m classes, the gini index is defined as:
    gini(D) = 1 - \sum_{i=1}^{m} p_i^2
If D is split on attribute A into two subsets D1 and D2:
    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity: \Delta gini(A) = gini(D) - gini_A(D)
The attribute giving the smallest gini_A(D) (largest impurity reduction) is chosen to split the node.
Comparisons of Attribute Selection Measures
The three measures generally return good results, but each has a bias:
Information gain: biased toward multivalued attributes.
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
Gini index: biased toward multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized, pure partitions.
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence.
C-SEP: performs better than information gain and gini index in certain cases.
G-statistic: has a close approximation to the χ2 distribution.
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes.
Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.
Decision Tree Induction [IMPORTANT]
EXAMPLE: Decision Tree Induction

EXAMPLE: Calculating Gain Ratio

EXAMPLE: Calculating Gini Index
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data.
Too many branches, some of which may reflect anomalies due to noise or outliers.
Poor accuracy for unseen samples.
Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
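As an illustration of postpruning, scikit-learn's cost-complexity pruning produces exactly such a sequence of progressively pruned trees. A sketch, using an arbitrary bundled dataset and a held-out validation split to pick the best pruned tree:

```python
# Postpruning via cost-complexity pruning: build the pruning path,
# then select the alpha with the best validation accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# ccp_alphas defines a sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the "best pruned tree" using data different from the training data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("Chosen tree has", best.tree_.node_count, "nodes")
```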
Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
Bayes' Theorem: Basics
Total probability theorem:
    P(B) = \sum_{i=1}^{M} P(B|A_i) P(A_i)
Bayes' theorem:
    P(H|X) = \frac{P(X|H) P(H)}{P(X)}
Let X be a data sample ("evidence"); its class label is unknown.
Let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X) (the posterior probability): the probability that the hypothesis holds given the observed data sample X.
P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, etc.
P(X): the probability that the sample data is observed.
P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31…40 with medium income.
Prediction Based on Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
    P(H|X) = \frac{P(X|H) P(H)}{P(X)}
Informally: posterior = likelihood × prior / evidence.
Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes.
Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost.
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
Suppose there are m classes C1, C2, …, Cm.
Classification derives the maximum posteriori, i.e., the maximal P(C_i|X). This follows from Bayes' theorem:
    P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized.
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D).
If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}).
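A small sketch of the Gaussian likelihood above; the mean 38 and standard deviation 12 are hypothetical values chosen only for illustration:

```python
# Gaussian likelihood g(x, mu, sigma) for a continuous-valued attribute.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# E.g., P(age = 35 | C_i) if ages in class C_i had mean 38 and std 12
# (hypothetical class statistics).
print(f"{gaussian(35, 38, 12):.4f}")
```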
Naïve Bayes Classifier: Training Dataset
Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
The training data is the buys_computer table shown earlier.
Naïve Bayes Classifier: An Example
P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X|buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class buys_computer = "yes".
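The hand computation above can be verified with a few lines of Python (a sketch; the probabilities are taken directly from the counts derived above):

```python
# Reproduce the naive Bayes computation for
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
p_yes, p_no = 9 / 14, 5 / 14

# Conditional probabilities counted from the training table.
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # ~0.044
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # ~0.019

score_yes = px_yes * p_yes                        # ~0.028
score_no = px_no * p_no                           # ~0.007
print("predict:", "yes" if score_yes > score_no else "no")
```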
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability
    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
will be zero.
Example: suppose a dataset with 1000 tuples where income = low occurs 0 times, income = medium 990 times, and income = high 10 times.
Use the Laplacian correction (Laplacian estimator): add 1 to each case.
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
The "corrected" probability estimates are close to their "uncorrected" counterparts.
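A one-screen sketch of the Laplacian correction applied to the income example above:

```python
# Laplacian correction: add 1 to each value's count and the number of
# distinct values (here 3) to the denominator.
counts = {"low": 0, "medium": 990, "high": 10}
n, v = sum(counts.values()), len(counts)
smoothed = {k: (c + 1) / (n + v) for k, c in counts.items()}
print(smoothed)  # low: 1/1003, medium: 991/1003, high: 11/1003
```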
Naïve Bayes Classifier: Comments
Advantages:
Easy to implement.
Good results obtained in most cases.
Disadvantages:
The class-conditional independence assumption causes loss of accuracy, because in practice dependencies exist among variables.
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by a naïve Bayes classifier.
How to deal with these dependencies? Bayesian belief networks.
Classification by Backpropagation

Backpropagation: a neural network learning algorithm.
Started by psychologists and neurobiologists to develop and test computational analogues of neurons.
A neural network: a set of connected input/output units where each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
Also referred to as connectionist learning due to the connections between units.
Neural Network as a Classifier

Weakness:
Long training time.
Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure".
Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Strength:
High tolerance to noisy data.
Ability to classify untrained patterns.
Well-suited for continuous-valued inputs and outputs.
Successful on a wide array of real-world data.
Algorithms are inherently parallel.
Techniques have been developed for extracting rules from trained neural networks.
A Neuron (= a Perceptron)

The n-dimensional input vector x is mapped to the output y by means of a scalar product followed by a nonlinear activation function f, for example:
    y = sign\left( \sum_{i=0}^{n} w_i x_i - \mu_k \right)
[Figure: inputs x_0 … x_n enter with weights w_0 … w_n; their weighted sum, offset by the bias μ_k, passes through the activation function f to produce the output y.]
A Multi-Layer Feed-Forward Neural Network

[Figure: input vector X enters the input layer; weights w_ij connect it through a hidden layer to the output layer, which emits the output vector.]

For a unit j, the net input and output are:
    I_j = \sum_i w_{ij} O_i + \theta_j
    O_j = \frac{1}{1 + e^{-I_j}}
The error terms are:
    Err_j = O_j (1 - O_j)(T_j - O_j)            (output layer, with target T_j)
    Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}   (hidden layer)
Weight and bias updates, with learning rate l:
    w_{ij} = w_{ij} + l \cdot Err_j \cdot O_i
    \theta_j = \theta_j + l \cdot Err_j
How Does a Multi-Layer Neural Network Work?

The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer.
They are then weighted and fed simultaneously to a hidden layer.
The number of hidden layers is arbitrary, although usually only one is used.
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
The network is feed-forward: none of the weights cycles back to an input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.
Defining a Network Topology

First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
Use one input unit per domain value, each initialized to 0.
For classification with more than two classes, use one output unit per class.
If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer; hence the name "backpropagation".
Steps:
Initialize weights (to small random numbers) and biases in the network.
Propagate the inputs forward (by applying the activation function).
Backpropagate the error (by updating weights and biases).
Check the terminating condition (e.g., the error is very small).
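A minimal NumPy sketch of these steps on the XOR problem; the topology, learning rate, and epoch count are illustrative choices, and convergence depends on the random initialization:

```python
# Backpropagation with sigmoid units, one hidden layer, per-tuple updates,
# implementing the equations from the previous slide.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)  # input -> hidden
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)  # hidden -> output
l = 0.5                                           # learning rate

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

for epoch in range(10000):
    for x, t in zip(X, T):
        # Propagate the inputs forward: I_j = sum_i w_ij O_i + theta_j.
        O1 = sigmoid(x @ W1 + b1)
        O2 = sigmoid(O1 @ W2 + b2)
        # Backpropagate the error.
        err2 = O2 * (1 - O2) * (t - O2)            # output layer
        err1 = O1 * (1 - O1) * (W2 @ err2)         # hidden layer
        # Update weights and biases: w_ij += l * Err_j * O_i.
        W2 += l * np.outer(O1, err2); b2 += l * err2
        W1 += l * np.outer(x, err1);  b1 += l * err1

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```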
Multilayer Neural Network
Lazy vs. Eager Learning

Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
Lazy learners spend less time in training but more time in predicting.
Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function; an eager learner must commit to a single hypothesis that covers the entire instance space.
Lazy Learner: Instance-Based Methods

Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
k-nearest neighbor: instances represented as points in a Euclidean space.
Locally weighted regression: constructs a local approximation.
Case-based reasoning: uses symbolic representations and knowledge-based inference.
The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function can be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point x_q among '+' and '-' training points, with the Voronoi cells that 1-NN induces.]
Example: K-Nearest Neighbor Classifier

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?
Example: K-Nearest Neighbor Classifier (distances)

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15.00
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.00
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David     37   50          2          ?

With k = 3, David's nearest neighbors are Rachel (15.00), John (15.16), and Nellie (15.74); the majority response is Yes, so David is classified as Yes.
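A sketch reproducing this example in Python (k = 3, plain Euclidean distance; note that without feature scaling the income attribute dominates the distance):

```python
# k-NN classification of David from the table above.
from math import dist
from collections import Counter

train = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

# Sort training points by Euclidean distance and keep the 3 nearest.
neighbors = sorted(train.items(), key=lambda kv: dist(kv[1][0], david))[:3]
for name, (point, label) in neighbors:
    print(f"{name}: distance {dist(point, david):.2f}, response {label}")

# Majority vote among the neighbors.
votes = Counter(label for _, (_, label) in neighbors)
print("Predicted response for David:", votes.most_common(1)[0][0])
```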
Genetic Algorithms (GA-Part-I)

Genetic algorithm: based on an analogy to biological evolution.
An initial population is created consisting of randomly generated rules.
Each rule is represented by a string of bits.
E.g., the rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string 100.
If an attribute has k > 2 values, k bits can be used.
Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring.
The fitness of a rule is measured by its classification accuracy on a set of training examples.
Offspring are generated by crossover and mutation.
The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold.
Slow, but easily parallelizable.
Genetic Algorithms (GA)

To use a genetic algorithm, you must encode solutions to your problem in a structure that can be stored in the computer. This object is a genome (or chromosome).
The genetic algorithm creates a population of genomes, then applies crossover and mutation to the individuals in the population to generate new individuals.
It uses various selection criteria to pick the best individuals for mating (and subsequent crossover).
Your objective function determines how 'good' each individual is.
Genetic Algorithms (GA)

The genetic algorithm is very simple, yet it performs well on many different types of problems.
There are many ways to modify the basic algorithm, and many parameters that can be 'tweaked'.
Basically, if you get the objective function, the representation, and the operators right, then variations on the genetic algorithm and its parameters will result in only minor improvements.
Representation

You can use any representation for the individual genomes in the genetic algorithm.
Holland worked primarily with strings of bits, but you can use arrays, trees, lists, or any other object.
You must define genetic operators (initialization, mutation, crossover, comparison) for whatever representation you decide to use.
Remember that each individual must represent a complete solution to the problem you are trying to optimize.
Mutation Operators (Trees)

These are sample tree mutation operators.
You can use more than one operator during an evolution.
The mutation operator introduces a certain amount of randomness into the search.
It can help the search find solutions that crossover alone might not encounter.
Crossover Operators

These are sample tree crossover operators.
Typically crossover is defined so that two individuals (the parents) combine to produce two more individuals (the children).
You can also define asexual crossover or single-child crossover.
The primary purpose of the crossover operator is to pass genetic material from the previous generation to the subsequent generation.
Mutation Operators (Lists)

These are sample list mutation operators.
Notice that lists may be fixed or variable length.
Order-based lists, in which the sequence is important and nodes cannot be duplicated during the genetic operations, are also common.
You can use more than one operator during an evolution.
The mutation operator introduces a certain amount of randomness into the search; it can help the search find solutions that crossover alone might not encounter.
113
Genetic Algorithms (GA)
Two of the most common genetic algorithm implementations are 'simple' and 'steady state'.
In simple state- It is a generational algorithm in which the entire population is replaced each generation.
The steady state genetic algorithm is used by the Genitor program. In this algorithm, only a few individuals are replaced each 'generation'. This type of replacement is often referred to as overlapping populations.
April 21, 2023DATA MINING CSE@HCST
114
April 21, 2023DATA MINING CSE@HCST
115
http://lancet.mit.edu/mbwall/presentations
April 21, 2023DATA MINING CSE@HCST
116
Genetic Algorithms (GA-Part-I)
Outline

Introduction to Genetic Algorithm (GA)
GA Components:
Representation
Recombination
Mutation
Parent Selection
Survivor Selection
Example
Introduction to GA (1)

[Figure: taxonomy of search techniques. Calculus-based techniques (e.g., Fibonacci search); enumerative techniques (BFS, DFS, dynamic programming); and guided random search techniques (tabu search, hill climbing, simulated annealing, evolutionary algorithms). Evolutionary algorithms include genetic programming and genetic algorithms.]
Introduction to GA (2)

"Genetic Algorithms are good at taking large, potentially huge search spaces and navigating them, looking for optimal combinations of things, solutions you might not otherwise find in a lifetime." (Salvatore Mangano, Computer Design, May 1995)
Originally developed by John Holland (1975).
The genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution.
Uses the concepts of "natural selection" and "genetic inheritance" (Darwin, 1859).
Use of GA

Widely used in business, science, and engineering.
Optimization and search problems.
Scheduling and timetabling.
Let’s Learn Biology (1)
Our body is made up of trillions of cells. Each cell has a core structure (nucleus) that contains your chromosomes.
Each chromosome is made up of tightly coiled strands of deoxyribonucleic acid (DNA). Genes are segments of DNA that determine specific traits, such as eye or hair color. You have more than 20,000 genes.
A gene mutation is an alteration in your DNA. It can be inherited or acquired during your lifetime, as cells age or are exposed to certain chemicals. Some changes in your genes result in genetic disorders.
Let’s Learn Biology (2) 123
Source: http://www.riversideonline.com/health_reference/Tools/DS00549.cfm
1101101April 21, 2023DATA MINING CSE@HCST
Let’s Learn Biology (3) 124
April 21, 2023DATA MINING CSE@HCST
Let’s Learn Biology (4)
Natural Selection Darwin's theory of evolution: only the organisms best
adapted to their environment tend to survive and transmit their genetic characteristics in increasing numbers to succeeding generations while those less adapted tend to be eliminated.
125
Source: http://www.bbc.co.uk/programmes/p0022nyy
April 21, 2023DATA MINING CSE@HCST
GA Is Inspired by Nature

A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators.
Nature vs. GA

The computer model introduces simplifications (relative to the real biological mechanisms), BUT surprisingly complex and interesting structures have emerged out of evolutionary algorithms.
High-Level Algorithm

produce an initial population of individuals
evaluate the fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine between individuals
    mutate individuals
    evaluate the fitness of the modified individuals
    generate a new population
end while
GA Components
Source: http://www.engineering.lancs.ac.uk
GA Components With Example

The MAXONE problem: suppose we want to maximize the number of ones in a string of L binary digits.
It may seem trivial because we know the answer in advance.
However, we can think of it as maximizing the number of correct answers, each encoded by 1, to L difficult yes/no questions.
GA Components: Representation (Encoding)

An individual is encoded (naturally) as a string of L binary digits.
Let's say L = 10. Then the integer 1 is encoded as 0000000001 (10 bits).
Initial Population

We start with a population of n random strings. Suppose that L = 10 and n = 6.
We toss a fair coin 60 times and get the following initial population:
s1 = 1111010101
s2 = 0111000101
s3 = 1110110101
s4 = 0100010011
s5 = 1110111101
s6 = 0100110000
Fitness Function: f()

The fitness of a string is the number of ones it contains:
f(s1) = f(1111010101) = 7
f(s2) = f(0111000101) = 5
f(s3) = f(1110110101) = 7
f(s4) = f(0100010011) = 4
f(s5) = f(1110111101) = 8
f(s6) = f(0100110000) = 3
Total fitness = 34
Selection (1)

Next we apply fitness-proportionate selection with the roulette wheel method: individual i is chosen with probability
    p(i) = f(i) / \sum_j f(j)
(the area of each wheel sector is proportional to the individual's fitness).
We repeat the extraction as many times as the number of individuals we need, so that the parent population keeps the same size (6 in our case).
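A small sketch of this selection step in Python; random.choices implements the weighted draw, and the fitness function is the MAXONE count of ones:

```python
# Fitness-proportionate (roulette wheel) selection:
# individual i is drawn with probability f(i) / sum_j f(j).
import random

def roulette_select(population, fitness, n):
    weights = [fitness(ind) for ind in population]  # choices() normalizes them
    return random.choices(population, weights=weights, k=n)

pop = ["1111010101", "0111000101", "1110110101",
       "0100010011", "1110111101", "0100110000"]
parents = roulette_select(pop, fitness=lambda s: s.count("1"), n=len(pop))
print(parents)
```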
Selection (2)

Suppose that, after performing selection, we get the following population:
s1` = 1111010101 (s1)
s2` = 1110110101 (s3)
s3` = 1110111101 (s5)
s4` = 0111000101 (s2)
s5` = 0100010011 (s4)
s6` = 1110111101 (s5)
Recombination (1)

Recombination is also known as crossover.
For each couple we decide, according to a crossover probability (for instance 0.6), whether to actually perform crossover or not.
Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`).
For each couple, we randomly extract a crossover point: for instance 2 for the first couple and 5 for the second.

Recombination (2)

Crossing (s1`, s2`) at point 2 and (s5`, s6`) at point 5:
s1` = 11|11010101 and s2` = 11|10110101 give s1`` = 1110110101, s2`` = 1111010101
s5` = 01000|10011 and s6` = 11101|11101 give s5`` = 0100011101, s6`` = 1110110011
s3`` = s3` = 1110111101 and s4`` = s4` = 0111000101 are copied unchanged.
Mutation (1)

Before applying mutation:        After applying mutation:
s1`` = 1110110101                s1``` = 1110100101
s2`` = 1111010101                s2``` = 1111110100
s3`` = 1110111101                s3``` = 1110101111
s4`` = 0111000101                s4``` = 0111000101
s5`` = 0100011101                s5``` = 0100011101
s6`` = 1110110011                s6``` = 1110110001
Mutation (2)

The final step is to apply random mutation: for each bit that we copy to the new population, we allow a small probability of error (for instance 0.1).
Mutation causes movement in the search space (local or global) and restores lost information to the population.
Fitness of the New Population

After applying mutation:
f(s1```) = f(1110100101) = 6
f(s2```) = f(1111110100) = 7
f(s3```) = f(1110101111) = 8
f(s4```) = f(0111000101) = 5
f(s5```) = f(0100011101) = 5
f(s6```) = f(1110110001) = 6
Total fitness = 37
Example (End)

In one generation, the total population fitness improved from 34 to 37, i.e., by about 9%.
At this point, we go through the same process all over again, until a stopping criterion is met.
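Pulling the whole walk-through together, here is a compact, self-contained sketch of a MAXONE run (bit-string encoding, roulette selection, one-point crossover, bit-flip mutation); a fixed generation count serves as a simple stopping criterion:

```python
# MAXONE genetic algorithm sketch.
import random

L, N, P_CROSS, P_MUT = 10, 6, 0.6, 0.1
fitness = lambda s: s.count("1")          # number of ones in the string

def evolve(pop):
    # Selection: fitness-proportionate (assumes total fitness > 0).
    total = sum(map(fitness, pop))
    parents = random.choices(pop, weights=[fitness(s) / total for s in pop], k=N)
    nxt = []
    for a, b in zip(parents[::2], parents[1::2]):
        if random.random() < P_CROSS:     # one-point crossover
            p = random.randrange(1, L)
            a, b = a[:p] + b[p:], b[:p] + a[p:]
        nxt += [a, b]
    # Mutation: flip each copied bit with probability P_MUT.
    flip = lambda c: c if random.random() > P_MUT else ("1" if c == "0" else "0")
    return ["".join(flip(c) for c in s) for s in nxt]

pop = ["".join(random.choice("01") for _ in range(L)) for _ in range(N)]
for gen in range(50):
    pop = evolve(pop)
print(max(pop, key=fitness), "fitness:", max(map(fitness, pop)))
```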
Distribution of Individuals

[Figure: distribution of individuals in generation 0 vs. generation N; over generations, the population concentrates around high-fitness regions of the search space.]
Issues

Choosing basic implementation options: representation; population size and mutation rate; selection and deletion policies; crossover and mutation operators.
Termination criteria.
Performance and scalability.
The solution is only as good as the evaluation function (often the hardest part).
When to Use a GA

Alternative solutions are too slow or overly complicated.
You need an exploratory tool to examine new approaches.
The problem is similar to one that has already been successfully solved using a GA.
You want to hybridize with an existing solution.
The benefits of GA technology meet key problem requirements.
Conclusion

Inspired by nature.
Has many areas of application.
GA is powerful.

END