Data Mining 2013
Section 8: Classification
Supervised Learning
Classification is a form of supervised learning, where we have prior information on what the result (or class/group) should look like. There are many situations in which classification or discrimination of data is required.
For example, when you apply for a loan, the lender will look at your credit, employment, and residence history in order to determine if you qualify; if you are suspected of having a particular disease, the doctor will take your temperature and blood pressure, check various other things - probably have some tests done - and then, based on all the data collected, determine the presence or absence of the disease.
In these (and many other) situations the decision is made on the basis of past experience, i.e. how have people with ‘similar’ histories behaved with respect to paying off their loans, or have people with ‘similar’ medical characteristics been shown to have the disease.
To determine such things, data can be collected on credit or medical characteristics - along with the results - and analysed to see if it is possible to classify someone as a good or bad risk (for defaulting on the loan or having the disease). We can look at the process as one of finding ‘features’ in the data that can be used to build a ‘classifier’. In so doing we hope to determine which features are important for the classification, and which values of those features lead to success or failure. Once such values are determined based on the known cases, we have a rule or “classifier” that can be used to predict the outcome/result for new cases.
You may recall that in clustering (unsupervised learning) we saw that sometimes the clustering was equivalent to the classification. This was due to the values of the features for clustering being the same as would be needed for classification.
When we do a classification, we need to be concerned with how good or “accurate” our method is. In the medical situation, if a person who has a disease is classified as not having it, the person could die. On the other hand, if a person who does not have the disease is classified as having it, a needless operation could result. Thus the seriousness of the misclassification is something we also wish to consider.
Logistic regression may be used to classify data into one of two groups where

$$Y = \begin{cases} 0 & \text{if we are in group 1} \\ 1 & \text{if we are in group 2} \end{cases}$$
and where $P(Y = 1) = p$ (called “success”) and $P(Y = 0) = q = (1 - p)$ (called “failure”). $Y$ is binomial$(1, p)$, i.e. $Y$ is binary and discrete. If we use $Y$ as the response variable, we violate the normality assumption associated with ordinary least squares linear regression (OLS). We note, however, that

$$E(Y) = 0 \cdot P(Y = 0) + 1 \cdot P(Y = 1) = p$$
Note $p \in (0, 1)$ and is continuous. This gives us a hint on how we might proceed. Often we assume that this probability $p$ can be modelled by a sigmoidal curve of the form

$$p = E(Y) = \left(1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}\right)^{-1}$$
We now linearize this logistic response function by modelling the $\log_e$ odds (called the logit), which is the $\log_e$ of the ratio of $p$, the probability of “success”, to $q = 1 - p$, the probability of “failure”, i.e.

$$\eta = \log_e \frac{p}{q} = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$

Since $\eta \in (-\infty, \infty)$, this is now formulated as a linear regression problem.
Once estimates for the $\beta$'s are obtained, it is possible to estimate the chance of “success” $p$ as
$$\hat{p} = \left(1 + \exp\left[-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k)\right]\right)^{-1} = \frac{1}{1 + \exp\left[-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k)\right]}$$
Once $\hat{p}$ exceeds some threshold (say 50%), we can decide to predict $Y = 1$ (i.e. a “success”) for that case.
We can assess how well our model fits by examining deviance. The deviance of a fitted model compares the log-likelihood of the fitted model to the log-likelihood of a model with $n$ parameters that fits the $n$ observations perfectly (i.e. a saturated model).
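In symbols (a standard definition, stated here for reference): if $\ell(\hat{\beta})$ is the log-likelihood of the fitted model and $\ell_{\text{sat}}$ that of the saturated model, then

$$D = 2\left[\ell_{\text{sat}} - \ell(\hat{\beta})\right]$$

so a small deviance indicates a fit close to the best achievable for these data.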
The following function computes the predicted value for $p$:
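The function itself is not shown in this transcript; a minimal sketch of such a function, assuming a coefficient vector beta = c(beta0, beta1, ..., betak) and a matrix of predictors X (both names are assumptions):

predict.p <- function(beta, X) {
  # linear predictor beta0 + beta1*X1 + ... + betak*Xk for each row of X
  eta <- beta[1] + as.matrix(X) %*% beta[-1]
  # inverse logit: p-hat = 1 / (1 + exp(-eta))
  as.vector(1 / (1 + exp(-eta)))
}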
Logistic regression uses the glm function (generalized linear models) that we used earlier for OLS regression. We tell it to do logistic regression with quasibinomial(link = "logit"):
> (log.1 <- glm(c(rep(0, sz.1), rep(1, sz.2)) ~ G.1.data[, 1] + G.1.data[, 2],
                quasibinomial(link = "logit")))

Call:  glm(formula = c(rep(0, sz.1), rep(1, sz.2)) ~ G.1.data[, 1] +
    G.1.data[, 2], family = quasibinomial(link = "logit"))

Coefficients:
  (Intercept)  G.1.data[, 1]  G.1.data[, 2]
       -46.17          32.66         -28.96

Degrees of Freedom: 899 Total (i.e. Null);  897 Residual
Null Deviance:      1237
Residual Deviance:  7.072e-07    AIC: NA
(Because we are creating the Gaussians on the fly, the coefficients etc. will differ.) We can get more information with:
> summary(log.1)

Call:
glm(formula = c(rep(0, sz.1), rep(1, sz.2)) ~ G.1.data[, 1] +
    G.1.data[, 2], family = quasibinomial(link = "logit"))

Deviance Residuals:

    Null deviance: 1.2365e+03  on 899  degrees of freedom
Residual deviance: 7.0719e-07  on 897  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 25
Because we wish to do several logistic regressions (and other types of classifications), we want a fairly general set of routines for displaying classifications of two-dimensional data.
The code in CreateGrid.r:

> #--------------------------------------
> # Set up the points at which to determine the values
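The rest of CreateGrid.r is not reproduced in the transcript; a minimal sketch of what such a grid routine might look like (the name create.grid and its arguments are assumptions):

create.grid <- function(data, n = c(100, 100)) {
  # an n[1] x n[2] grid of points covering the range of the training data
  xs <- seq(min(data[, 1]), max(data[, 1]), length.out = n[1])
  ys <- seq(min(data[, 2]), max(data[, 2]), length.out = n[2])
  list(x = xs, y = ys, grid = expand.grid(X = xs, Y = ys))
}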
plot.class.boundary (in the file example_display.r) is designed to show the boundaries between two or more classes. It takes the training data (together with their corresponding classes) along with expressions for the model and prediction. The model expression (see below) is evaluated to create a model; then the model is applied to predict the class for every point on the grid which has been created to cover the training data. This produces a Z value for each grid point, and the contour routine is used to draw curves where the class changes.
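The file example_display.r is not reproduced here; a hedged sketch of the idea just described (argument names and details are assumptions, and it relies on the create.grid sketch above):

plot.class.boundary <- function(train.data, train.class,
                                model.exp, predict.exp, n = c(100, 100)) {
  g <- create.grid(train.data, n)
  grid.data <- g$grid
  # the expressions are evaluated in this frame, so they may refer to
  # train.data, train.class, model and grid.data
  model <- eval(model.exp, environment())
  Z <- matrix(as.numeric(eval(predict.exp, environment())),
              nrow = length(g$x))          # predicted class per grid point
  plot(train.data, col = as.numeric(train.class) + 1,
       pch = as.character(train.class))    # numbers coloured by class
  contour(g$x, g$y, Z, add = TRUE, drawlabels = FALSE)  # class boundaries
}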
We will also make use of a function that will set up the arguments for the plotting as well as display information about the accuracy of the classification:

> example.display <- function(train.data, train.class, test.data, test.class,
In example.display the data is plotted as numbers, with both the colour and the number indicating the class. The plot.class.boundary function is used, and information about the misclassification rate is computed. Points that are misclassified in the training set and in the test set are displayed in blue, with a different plotting symbol distinguishing training from test points.
Note that this logistic regression can only distinguish between two classes.
We can repeat the example using our new display function:
> lr.1 <- example.display(G.1.data[, 1:2], c(rep(0, sz.1), rep(1, sz.2)), {}, {},
***** The model is *****
Call:  glm(formula = factor(train.class) ~ train.data[, 1] + train.data[, 2],
    family = quasibinomial(link = "logit"))
Coefficients:
Synthetic Examples

For all our problems in classification we will use some synthetic data sets to show how the different classifiers perform in different situations.
The data sets are created by the function create.synthetic:
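The body of create.synthetic is not shown in the transcript; a hedged sketch in its spirit (the argument names and defaults are assumptions), generating Gaussian clouds such as the G.1.data used above:

create.synthetic <- function(n, centre = c(0, 0), sd = 1) {
  # n points from a bivariate normal cloud at the given centre
  data.frame(X = rnorm(n, centre[1], sd),
             Y = rnorm(n, centre[2], sd))
}

# e.g. two clouds of sizes sz.1 and sz.2:
# G.1.data <- rbind(create.synthetic(sz.1, centre = c(3, 15), sd = 3),
#                   create.synthetic(sz.2, centre = c(10, -5), sd = 3))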
k-nearest neighbour classification is done for a test data set using a training set. For each point in the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification of a value in the test set is decided by majority vote of the training set vectors, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
Demonstration of what k-NN is all about:
We will create 3 classes from random normal data with $\sigma = 1$ and centers at $(3,3)$, $(5,5)$, and $(5,3)$:
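The generating code is not shown; a minimal sketch under those assumptions (the class sizes here are arbitrary):

library(class)                         # provides knn()
n  <- 50
d1 <- cbind(rnorm(n, 3), rnorm(n, 3))  # class 1, centre (3,3)
d2 <- cbind(rnorm(n, 5), rnorm(n, 5))  # class 2, centre (5,5)
d3 <- cbind(rnorm(n, 5), rnorm(n, 3))  # class 3, centre (5,3)
train <- rbind(d1, d2, d3)
cl    <- factor(rep(1:3, each = n))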
[Figure 19: Red was predicted. Figure 20: Separating lines.]
To illustrate, we could try a nearest neighbour method on our two-cloud data. We will determine the classification of the points $(-6,3), (-5,3), \dots, (20,3)$ based on their 3 nearest neighbours.
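The call itself is not shown in the transcript; it was presumably something like the following sketch, using knn from the class package:

library(class)
test.pts <- cbind(-6:20, 3)  # the points (-6,3), (-5,3), ..., (20,3)
nn.3 <- knn(train = G.1.data[, 1:2],
            test  = test.pts,
            cl    = factor(c(rep(0, sz.1), rep(1, sz.2))),  # training classes
            k     = 3)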
It turns out that the result is given in the form of a factor, so in order to make use of the information (for colouring points or doing numerical operations) we need to convert to numeric values:

> unclass(nn.3)
 [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
attr(,"levels")
[1] "0" "1"
> unclass(nn.3)[1] + 2
[1] 3
To use example.display, we usually need to set up the model (in this case there is no model) and the predictor, and run the examples:

> model.exp <- expression({})
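The matching predictor expression is not shown; it might look like this (an assumption, consistent with the plot.class.boundary sketch above, where the expressions can refer to train.data, train.class and grid.data):

predict.exp <- expression(
  # knn() both "fits" and predicts in a single call, hence the empty model
  knn(train.data, grid.data, cl = factor(train.class), k = 3)
)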
When we look at the 3 nearest neighbours for the overlapping clouds, we see that the separating curve is very twisty. This suggests that we are overfitting. Use of more neighbours will tend to smooth the curve.
> graphics.off()
> oldpar <- par(mfrow = c(2, 2))
> example.display(G.2.data[, 1:2], c(rep(0, sz.3), rep(1, sz.4)), {}, {}, c(100, 100), "5-NN",
We can also look at the effect of choice of neighbours on our synthetic data sets. As before, we create a simple menu function to enable us to run several examples:

> f.menu <- function(){
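The body of f.menu is not reproduced; a hedged sketch of such a menu loop (the choices and structure are assumptions):

f.menu <- function() {
  repeat {
    choice <- menu(c("1-NN", "3-NN", "7-NN"),
                   title = "Choose the number of neighbours (0 to quit)")
    if (choice == 0) break
    k <- c(1, 3, 7)[choice]
    # ... run example.display(...) here with the chosen k ...
  }
}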
Another method to discriminate (and hence classify) two groups (populations) of data is due to Sir R. A. Fisher. Fisher's idea was to transform a multivariate observation $x$ into a univariate observation $y$ such that the $y$'s derived from populations $\pi_i$ ($i = 1$ or $2$) were separated (or different) as much as possible. Fisher suggested using linear combinations of the variables in $x$ to create the $y$'s. We define
$$\mu_i = E(X \mid \pi_i), \quad i = 1, 2$$

and assume the covariance matrix of $X$,

$$\Sigma = E\left[(X - \mu_i)(X - \mu_i)^T\right], \quad i = 1, 2,$$

is the same (i.e. common) for the two populations. Then
$$Y = l^T X$$

where $l$ is a vector of weights, and $\mu_{iY} = l^T \mu_i$ and $\sigma_Y^2 = l^T \Sigma l$, so

$$\frac{\text{squared distance between the averages of } Y \text{ for the two populations}}{\text{variance of } Y} = \frac{(\mu_{1Y} - \mu_{2Y})^2}{\sigma_Y^2} = \frac{(l^T \delta)^2}{l^T \Sigma l}, \quad \text{where } \delta = \mu_1 - \mu_2.$$
This is maximized by setting

$$l = c\,\Sigma^{-1}(\mu_1 - \mu_2)$$

for any $c \neq 0$. Choosing $c = 1$ gives us
$$Y = l^T X = (\mu_1 - \mu_2)^T \Sigma^{-1} X,$$

which is called Fisher's linear discriminant function (LDA). We use this as a classification tool.
Note that this function transforms the midpoint $(\mu_1 + \mu_2)/2$ between the two multivariate population means $\mu_i$, $i = 1, 2$, into $m$ (which is the average of the 2 univariate means for $Y$), i.e.

$$m = (\mu_1 - \mu_2)^T \Sigma^{-1} \, \frac{\mu_1 + \mu_2}{2},$$
and any other point $x_0$ also gets transformed by this function. We assign $x_0$ to $\pi_1$ (population 1) if

$$(\mu_1 - \mu_2)^T \Sigma^{-1} x_0 \geq m$$

and to $\pi_2$ (population 2) if

$$(\mu_1 - \mu_2)^T \Sigma^{-1} x_0 < m.$$

(This results in a linear separator for the 2 multivariate populations.)
Relaxing the assumption of a common covariance matrix for the two populations results in a modification to LDA giving a quadratic discriminant rule (QDA), where we allocate $x_0$ to $\pi_1$ if

$$\left(\mu_1^T \Sigma_1^{-1} - \mu_2^T \Sigma_2^{-1}\right) x_0 \;-\; 0.5\, x_0^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) x_0 \;-\; 0.5 \left(\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2\right) \;-\; 0.5 \ln \frac{|\Sigma_1|}{|\Sigma_2|} \;\geq\; 0$$

and otherwise allocate $x_0$ to $\pi_2$. Hence we have QDA, which gives quadratic separation rather than linear separation of the 2 populations.
We illustrate this in R using:

> library(MASS)
As before, we can apply these to the cloud data:
> lda.c <- lda(G.1.data[, 1:2], c(rep(0, sz.1), rep(1, sz.2)))
Call:
lda(G.1.data[, 1:2], grouping = c(rep(0, sz.1), rep(1, sz.2)))

Prior probabilities of groups:
        0         1
0.5555556 0.4444444

Group means:
          X          Y
0  3.370369  14.904757
1 10.021932  -4.662915

Coefficients of linear discriminants:
> (qda.c <- qda(G.1.data[, 1:2], c(rep(0, sz.1), rep(1, sz.2))))
Call:
qda(G.1.data[, 1:2], grouping = c(rep(0, sz.1), rep(1, sz.2)))

Prior probabilities of groups:
attr(,"error")
[1] 0
***** Misclassified training cases *****
No misclassified cases
*****************************************
So far our examples have been in two dimensions. We could look at the flea beetles in six dimensions but, in order for us to visualize the results, we will restrict ourselves to three dimensions (using variables 1, 4, and 6).
The following function will determine the points at which the classes change in three-dimensional space, based on a classification model. We create a model (LDA or QDA) in the three-dimensional space. The idea is then to create a sequence of 30 grids on planes perpendicular to the d.flea[, 4] direction and, on each of these planes, use the model to compute the class at each of the grid points (as was done in the classification examples). Then we use the diff function along each row and column of the grid to determine points at which the class changes.
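The function itself is not reproduced in the transcript; a minimal sketch of the diff step on a single grid plane (the name find.changes and the representation of the classes as a matrix Z are assumptions):

find.changes <- function(Z) {
  # Z: matrix of predicted classes on one grid plane; the class changes
  # wherever diff() along a row or column is non-zero
  by.col <- apply(Z, 2, function(col) c(diff(col) != 0, FALSE))
  by.row <- t(apply(Z, 1, function(row) c(diff(row) != 0, FALSE)))
  which(by.col | by.row, arr.ind = TRUE)   # indices of boundary points
}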
For example, if the values at the grid points are: