Jan 12, 2016

Transcript
Page 1: PROJECT
Page 2: PROJECT

PROJECT

• Compare SVM and standard models (neural nets, etc.)

&

• Bagging + SVM + Data Reduction for fast and efficient learning on large data

By: Nitin Chaudhary    Prof.: Dr. Slobodan Vucetic

Page 3: PROJECT

INTRODUCTION

SVM: The goal in training a Support Vector Machine is to find the separating hyperplane with the largest margin; we expect that the larger the margin, the better the generalization of the classifier.

[Figure: two classes (y1, y2) separated by the optimal hyperplane; the points lying on the margin are the support vectors.]

Page 4: PROJECT

The support vectors are equally close to the hyperplane.

The support vectors are the training samples that define the optimal separating hyperplane and are the most difficult patterns to classify.

Informally speaking, they are the patterns most informative for the classification task.
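As a minimal sketch of this idea (using scikit-learn, not the course's MATLAB toolbox; the toy data is illustrative), we can train a linear SVM and inspect which training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors (the training points closest to the
# separating hyperplane) determine the decision boundary.
print(clf.support_vectors_)
```

Removing any non-support-vector point and retraining would leave the hyperplane unchanged, which is exactly why these points are the informative ones.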

Page 5: PROJECT

FORMULATIONS

1) C-Support Vector Classification

min_{w,b,ξ} (1/2) W^T W + C Σ_i ξ_i

2) nu-Support Vector Classification

min_{w,b,ξ,ρ} (1/2) W^T W − ν·ρ + (1/L) Σ_i ξ_i

Above are the primal problems for both classifications.
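To make the C-SVC objective concrete, here is a small sketch that evaluates (1/2) W^T W + C Σ ξ_i for a given weight vector and slack values (the numbers are purely hypothetical, chosen for easy arithmetic):

```python
import numpy as np

def c_svc_objective(w, xi, C):
    # C-SVC primal objective: (1/2) w^T w + C * sum of slacks
    return 0.5 * float(w @ w) + C * float(np.sum(xi))

w = np.array([3.0, 4.0])        # |w|^2 = 25, so margin term = 12.5
xi = np.array([0.2, 0.0, 0.3])  # total slack = 0.5
print(c_svc_objective(w, xi, C=10.0))  # 12.5 + 10*0.5 = 17.5
```

A larger C penalizes slack (constraint violations) more heavily, trading margin width for fewer training errors.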

Page 6: PROJECT

Remember the form of the polynomial kernel:

(Gamma * <X(:,i), X(:,j)> + Coefficient)^Degree

The form of the RBF (Gaussian) kernel:

exp(-Gamma * |X(:,i) - X(:,j)|^2)
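The two kernel forms above can be written directly as functions (a Python sketch mirroring the MATLAB-style expressions; parameter names follow the slides):

```python
import numpy as np

def poly_kernel(xi, xj, gamma, coef, degree):
    # (Gamma * <xi, xj> + Coefficient)^Degree
    return (gamma * float(xi @ xj) + coef) ** degree

def rbf_kernel(xi, xj, gamma):
    # exp(-Gamma * |xi - xj|^2)
    return float(np.exp(-gamma * np.sum((xi - xj) ** 2)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
print(poly_kernel(a, b, gamma=1.0, coef=1.0, degree=2))  # (2 + 1)^2 = 9.0
print(rbf_kernel(a, a, gamma=0.5))                       # exp(0) = 1.0
```

Both are symmetric in their two arguments, as any valid kernel must be.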

Wow! I remember these; they were taught in the CIS 595 class.

For the first part of the project I have used Pima.txt as my data set.

Page 7: PROJECT

Gamma: If the input value is zero, Gamma defaults to 1/(max_pattern_dimension) in the function. If the input value is non-zero, Gamma is used unchanged.

C: cost of constraint violation.
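The Gamma defaulting rule above amounts to a one-line check (a sketch; the helper name is mine, not from the toolbox):

```python
def resolve_gamma(gamma, max_pattern_dimension):
    # Gamma = 0 means "use the default 1/(max_pattern_dimension)";
    # any non-zero Gamma is passed through unchanged.
    return 1.0 / max_pattern_dimension if gamma == 0 else gamma

print(resolve_gamma(0, 8))       # default kicks in: 1/8 = 0.125
print(resolve_gamma(0.0001, 8))  # explicit value kept: 0.0001
```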

Page 8: PROJECT

METHODOLOGY

For nu-SVC and C-SVC using rbf or Gaussian Kernel

C         Gamma      ConfMat            Acc
1         1          [260 0; 123 0]     0.6789
100       0.0001     [216 27; 70 70]    0.7467
200       0.0001     [218 32; 66 67]    0.7441
1000      0.00001    [214 34; 60 75]    0.7546
0.00001   0.000001   [218 31; 49 85]    0.7911

Page 9: PROJECT

For nu-SVC or C-SVC using Polynomial Kernel

Gamma     C     Coeff   Degree   ConfMat            Acc
0.0001    100   2       3        [225 31; 54 73]    0.7781
0.00001   10    1       5        [232 22; 55 74]    0.7990

Page 10: PROJECT

For Neural Networks

Hidden Neurons   # of Training Iterations   Show   Max_fail   ConfMat            Accu
5                100                        10     5          [120 28; 32 49]    0.7380
8                100                        10     5          [134 16; 32 47]    0.7904
10               100                        10     50         [136 17; 25 51]    0.8166

Page 11: PROJECT

RESULTS

1) Maximum accuracy for nu-SVC with the RBF (Gaussian) kernel is 79.11%, at Gamma = 0.000001 and C = 0.00001.

2) Maximum accuracy for C-SVC with the RBF (Gaussian) kernel is 80.94%, at Gamma = 0.000001 and C = 100000.

3) Maximum accuracy for nu-SVC with the polynomial kernel is 79.90%, at Gamma = 0.00001, C = 10, Coeff = 1 and Degree = 5.

Page 12: PROJECT

RESULTS cont….

4) Maximum accuracy for C-SVC with the polynomial kernel is 80.68%, at Gamma = 0.00001, C = 100, Coeff = 10 and Degree = 3.

5) Maximum accuracy for neural networks is 81.66%, at number of hidden neurons = 10, number of training iterations = 100, show = 10, max_fail = 50.

Page 13: PROJECT

Bagging + SVM + Data Reduction for fast and efficient learning on large data

Goal: To perform Bagging on very large data sets for SVM.

The data set that I have used over here is cover_type.txt

Page 14: PROJECT

The test_covertype.txt file has 20 attributes and 7 classes, so the last 7 columns are class indicators. I have used class 2 (i.e., column 22) as the positive class and all the other classes as the negative class, which means I am dealing only with a binary classification problem.

Same is true for train_covertype.txt
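That label construction can be sketched as follows (toy stand-in data with the same column layout, not the real covertype file):

```python
import numpy as np

# Toy stand-in for covertype-style rows: 20 attribute columns followed
# by 7 one-hot class-indicator columns (1-based columns 21..27).
rng = np.random.default_rng(0)
attrs = rng.random((4, 20))
classes = np.eye(7)[[1, 0, 1, 3]]   # rows belong to classes 2, 1, 2, 4
data = np.hstack([attrs, classes])

# Class 2 lives in 1-based column 22, i.e. 0-based index 21:
# 1 = positive (class 2), 0 = negative (every other class).
y = (data[:, 21] == 1).astype(int)
print(y)  # [1 0 1 0]
```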

But what is Bagging?

Page 15: PROJECT

Bagging is a “bootstrap” ensemble method that creates the individuals for its ensemble by training each classifier on a random redistribution of the training set. Each classifier’s training set is generated by randomly drawing, with replacement, N examples, where N is the size of the original training set.
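Drawing one such bootstrap replicate is a single sampling call (a Python sketch; the slides' experiments used MATLAB):

```python
import numpy as np

# A bootstrap replicate: N indices drawn with replacement from a
# training set of size N, so duplicates are expected.
rng = np.random.default_rng(42)
N = 10
bootstrap_idx = rng.choice(np.arange(N), size=N, replace=True)

print(bootstrap_idx.shape)  # (10,) - same size as the training set
```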

But how do you actually do it?

Page 16: PROJECT

Methodology

1) Divide the data set into train_covertype.txt and test_covertype.txt.
2) First take 20% of train_covertype.txt and train an SVC using any one of the kernels; then use that particular kernel for the later experiments too.
3) Test the model with test_covertype.txt to get the predicted labels (predictions).
4) Record these predicted labels as the first column of the PreLabel matrix.
5) Repeat steps 2, 3 and 4 a few more times, say 3 or 5.
6) Record those predicted labels as the second, third and fourth columns, respectively.
7) Remember that every time we take 20% of train_covertype.txt after the first time, we draw it with replacement.
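The steps above can be sketched as follows (scikit-learn in place of the course's MATLAB tools; the function name and the two-blob toy data are mine, purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def bagged_svm_predictions(X_train, y_train, X_test, frac=0.2, rounds=5, seed=0):
    """Train `rounds` SVCs, each on a `frac`-sized sample of the training
    set drawn with replacement, and stack their test predictions
    column-by-column into a PreLabel matrix."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    k = max(1, int(frac * n))
    cols = []
    for _ in range(rounds):
        idx = rng.choice(n, size=k, replace=True)          # bootstrap sample
        clf = SVC(kernel="rbf", gamma="scale").fit(X_train[idx], y_train[idx])
        cols.append(clf.predict(X_test))                   # one column per round
    return np.column_stack(cols)

# Tiny demo on two well-separated blobs (toy data, not covertype):
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.0, 0.0], [5.0, 5.0]])
prelabel = bagged_svm_predictions(X_train, y_train, X_test, frac=0.5, rounds=5)
print(prelabel.shape)  # (2, 5): one column per bagging round
```

Each column of the returned matrix corresponds to one bagging round, exactly as in steps 4 and 6.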

Page 17: PROJECT

So, now we get a nice PreLabel matrix:

1 2 3 4 5

1 1 0 0 1

0 1 1 0 0

1 0 1 1 1

1 1 0 1 1

0 0 1 0 0

1 1 0 0 0

1 0 1 1 1

0 1 1 1 1

1 1 0 0 0   (and so on)

Page 18: PROJECT

Once I get the PreLabel matrix I need to get the predictions by majority vote:

Classifier:  1 2 3 4 5    Prediction
             1 1 0 0 1    ->  1
             0 1 1 0 0    ->  0
             1 0 1 1 1    ->  1
             1 1 0 1 1    ->  1
             0 0 1 0 0    ->  0
             1 1 0 0 0    ->  0
             1 0 1 1 1    ->  1
             0 1 1 1 1    ->  1
             1 1 0 0 0    ->  0
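The majority vote over the PreLabel matrix rows can be computed in one line (a Python sketch rather than the MATLAB used in the project; the matrix is the one from the slide):

```python
import numpy as np

prelabel = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0],
])

# A row is predicted 1 when more than half of the 5 classifiers say 1.
predictions = (prelabel.sum(axis=1) > prelabel.shape[1] // 2).astype(int)
print(predictions)  # [1 0 1 1 0 0 1 1 0]
```

With an odd number of classifiers there is never a tie, which is one reason to use 3 or 5 bagging rounds.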

Page 19: PROJECT

Now, with these predictions and the true labels from test_covertype.txt, I calculate the accuracy using the accuracy.m file provided in the class.

Similarly, I repeat the above steps exactly the same way, taking 30%, 40%, 50%, … of train_covertype.txt, and calculate the accuracy each time.
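The accuracy computation itself (what accuracy.m presumably does; this Python equivalent is my sketch, not the course's file) is just the fraction of matching labels:

```python
import numpy as np

def accuracy(predictions, true_labels):
    # Fraction of test examples where the bagged prediction
    # agrees with the true label.
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predictions == true_labels))

print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 4 of 5 match: 0.8
```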

Page 20: PROJECT

RESULTS

% of Train Data Set   Accuracy
0.01%                 0.5207 ± 0.0263
0.05%                 0.5977 ± 0.0122
0.1%                  0.6154 ± 0.0232
0.25%                 0.6659 ± 0.0075
0.5%                  0.7089 ± 0.0084
0.75%                 0.7156 ± 0.0027
1%                    0.7297 ± 0.0066
2.5%                  0.7457 ± 0.0047
5%                    0.7511 ± 0.0043

Page 21: PROJECT

RESULTS cont…

% of Train Data Set   Accuracy
7.5%                  0.7529 ± 0.0016
10%                   0.7530 ± 0.0029
20%                   0.7605 ± 0.0024
30%                   0.7619 ± 0.0026
40%                   0.7630 ± 0.0036
50%                   (takes too much time)

Page 22: PROJECT

RESULTS cont..

We plot a graph of the accuracy we obtained against the percentage of train_covertype.txt used.

[Figure: accuracy vs. % of train_covertype.txt]

Page 23: PROJECT

QUESTIONS?????

Page 24: PROJECT

Good Bye