Page 1: Supervised Learning: Regression, Classification; Linear regression, k-NN classification

Supervised Learning

Regression, Classification
Linear regression, k-NN classification

Debapriyo Majumdar

Data Mining – Fall 2014

Indian Statistical Institute Kolkata

August 11, 2014

Page 2: An Example: Size of Engine vs Power

[Scatter plot: Engine displacement (cc) on the x-axis vs Power (bhp) on the y-axis]

An unknown car has an engine of size 1800cc. What is likely to be the power of the engine?

Page 3: An Example: Size of Engine vs Power

[Scatter plot: Engine displacement (cc) vs Power (bhp); the power axis is marked as the target variable]

– Intuitively, the two variables have a relation
– Learn the relation from the given data
– Predict the target variable after learning

Page 4: Exercise: on a simpler set of data points

Predict y for x = 2.5

[Scatter plot: the four (x, y) training points]

x     y
1     1
2     3
3     7
4     10
2.5   ?

Page 5: Linear Regression

[Scatter plot: the training set of Engine displacement (cc) vs Power (bhp) points]

– Assume: the relation is linear
– Then, for a given x (= 1800), predict the value of y

Page 6: Linear Regression

[Scatter plot: the training set of Engine displacement (cc) vs Power (bhp) points]

– Linear regression: assume y = a . x + b
– Try to find suitable a and b

Optional exercise

Engine (cc)   Power (bhp)
800           60
1000          90
1200          80
1200          100
1200          75
1400          90
1500          120
1800          160
2000          140
2000          170
2400          180
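As an illustration (not part of the original slides), here is a minimal Python sketch that fits a and b to the table above by least squares and answers the 1800 cc question from Page 2; numpy's polyfit is one of many ways to do the fit:

```python
import numpy as np

# Engine displacement (cc) and power (bhp) from the table above
x = np.array([800, 1000, 1200, 1200, 1200, 1400, 1500, 1800, 2000, 2000, 2400])
y = np.array([60, 90, 80, 100, 75, 90, 120, 160, 140, 170, 180])

# Least-squares fit of y = a . x + b (a degree-1 polynomial)
a, b = np.polyfit(x, y, deg=1)
print(f"a = {a:.4f}, b = {b:.2f}")

# Predict the power of the unknown 1800 cc engine
print(f"predicted power at 1800 cc: {a * 1800 + b:.1f} bhp")
```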

Page 7: Exercise: using Linear Regression

– Define a regression line of your choice
– Predict y for x = 2.5

[Scatter plot: the four (x, y) training points]

x     y
1     1
2     3
3     7
4     10
2.5   ?

Page 8: Choosing the parameters right

– The data points: (x₁, y₁), (x₂, y₂), …, (xₘ, yₘ)
– The regression line: f(x) = y = a . x + b
– Least-squares cost function: J = Σᵢ (f(xᵢ) − yᵢ)²
– Goal: minimize J over choices of a and b

[Scatter plot: x vs y with a regression line, illustrating the deviations of the line from the data points]

Goal: minimizing the deviation from the actual data points
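For this single-variable case, setting the partial derivatives of J with respect to a and b to zero gives the standard closed-form least-squares solution (a textbook fact, not stated on the slide):

$$a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x}$$

where x̄ and ȳ are the means of the xᵢ and yᵢ. On the Page 4 exercise data this gives a = 3.1 and b = −2.5, hence the prediction 3.1 · 2.5 − 2.5 = 5.25 for x = 2.5.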

Page 9: How to Minimize the Cost Function?

– Goal: minimize J over all values of a and b
– Start from some a = a₀ and b = b₀; compute J(a₀, b₀)
– Simultaneously change a and b in the direction of the negative gradient, and eventually hope to arrive at an optimum
– Question: can there be more than one optimum?

[Surface plot of J over the (a, b) plane, with the negative-gradient direction indicated]
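A minimal gradient-descent sketch (illustrative Python, not from the slides; the learning rate and iteration count are arbitrary choices), run here on the Page 4 exercise data:

```python
import numpy as np

# The simple exercise data from Page 4
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 7.0, 10.0])

a, b = 0.0, 0.0                     # start from (a0, b0) = (0, 0)
lr = 0.01                           # learning rate (arbitrary choice)
for _ in range(5000):
    err = a * x + b - y             # f(x_i) - y_i for every point
    grad_a = 2.0 * (err * x).sum()  # dJ/da
    grad_b = 2.0 * err.sum()        # dJ/db
    a -= lr * grad_a                # step against the gradient
    b -= lr * grad_b
print(f"a = {a:.3f}, b = {b:.3f}")  # converges to a = 3.1, b = -2.5
```

For this least-squares cost, J is convex, so gradient descent has a single global optimum to find; more general cost functions can have several local optima, which is the point of the question above.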

Page 10: Another example

Given that a person’s age is 24, predict if (s)he has high blood sugar

– Discrete values of the target variable (Y / N)
– Many ways of approaching this problem

[Scatter plot: training set of Age (x-axis) vs High blood sugar, Y or N (y-axis)]

Page 11: Classification problem

One approach: what other data points are nearest to the new point?

Other approaches?

[Scatter plot: Age vs High blood sugar (Y/N), with the unknown point at age 24 marked with a ?]

Page 12: Classification Algorithms

– The k-nearest neighbor (k-NN) classification
– Naïve Bayes classification
– Decision Tree
– Linear Discriminant Analysis
– Logistic Regression
– Support Vector Machine

Page 13: Classification or Regression?

Given data about some cars: engine size, number of seats, petrol / diesel, has airbags or not, price

Problem 1: Given the engine size of a new car, what is likely to be the price?

Problem 2: Given the engine size of a new car, is it likely that the car runs on petrol?

Problem 3: Given the engine size, is it likely that the car has airbags?

Page 14: Classification

Page 15: Example: Age, Income and Owning a flat

[Scatter plot: training set of Age (x-axis) vs Monthly income in thousand rupees (y-axis); legend: owns a flat / does not own a flat]

Given a new person’s age and income, predict – does (s)he own a flat?

Page 16: Example: Age, Income and Owning a flat

[Scatter plot: training set of Age (x-axis) vs Monthly income in thousand rupees (y-axis); legend: owns a flat / does not own a flat]

Nearest neighbor approach: find the nearest neighbors of the new point among the known data points and check their labels

Page 17: Example: Age, Income and Owning a flat

[Scatter plot: training set of Age (x-axis) vs Monthly income in thousand rupees (y-axis); legend: owns a flat / does not own a flat]

The 1-Nearest Neighbor (1-NN) Algorithm:
– Find the closest point in the training set
– Output the label of the nearest neighbor

Page 18: The k-Nearest Neighbor Algorithm

[Scatter plot: training set of Age (x-axis) vs Monthly income in thousand rupees (y-axis); legend: owns a flat / does not own a flat]

The k-Nearest Neighbor (k-NN) Algorithm:
– Find the closest k points in the training set
– Take a majority vote among the labels of those k points
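A compact Python sketch of k-NN on this kind of data (illustrative only; the training points and the query are made-up numbers, not the ones on the slide):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (age, monthly income) points with flat-ownership labels
train_X = np.array([[25, 40], [30, 120], [45, 150], [50, 60], [35, 90]])
train_y = np.array(["N", "Y", "Y", "N", "Y"])
print(knn_predict(train_X, train_y, np.array([32, 100]), k=3))  # -> Y
```

Note that age and income live on very different scales, so in practice one would standardize the features before measuring distances; the next slide looks at the distance measures themselves.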

Page 19: Distance measures

How to measure distance to find the closest points?

– Euclidean distance between vectors x = (x₁, …, xₖ) and y = (y₁, …, yₖ): d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
– Manhattan distance: d(x, y) = Σᵢ |xᵢ − yᵢ|
– Generalized squared interpoint distance: d²(x, y) = (x − y)ᵀ S⁻¹ (x − y), where S is the covariance matrix; this is the Mahalanobis distance (1936)
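These translate directly into code; a small sketch (illustrative, with the covariance matrix S supplied by the caller):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def mahalanobis_sq(x, y, S):
    """Generalized squared interpoint distance; S is the covariance matrix."""
    d = x - y
    return d @ np.linalg.inv(S) @ d
```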

Page 20: Classification setup

– Training data / set: a set of input data points together with the given answers for those points
– Labels: the list of possible answers
– Test data / set: inputs to the classification algorithm for finding labels; used for evaluating the algorithm in case the answers are known (but not given to the algorithm)
– Classification task: determining the labels of the data points for which the label is not known or not passed to the algorithm
– Features: the attributes that represent the data

Page 21: Evaluation

– Test set accuracy: the correct performance measure
– Accuracy = (# of correct answers) / (# of all answers)
– Need to know the true test labels
  – Option: use the training set itself
  – Parameter selection (e.g., k for k-NN) by accuracy on the training set
– Overfitting: a classifier performs too well on the training set compared to new (unseen) test data
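To see why training-set accuracy misleads (an illustration, not from the slides): with k = 1, every training point is its own nearest neighbor, so the knn_predict sketch above scores 100% on the training set no matter how noisy the data is.

```python
# Reusing train_X, train_y and knn_predict from the k-NN sketch above:
# each training point is its own nearest neighbor when k = 1.
train_acc = np.mean([knn_predict(train_X, train_y, xi, k=1) == yi
                     for xi, yi in zip(train_X, train_y)])
print(train_acc)  # 1.0 regardless of how well the classifier generalizes
```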

Page 22: Better validation methods

Leave one out (see the sketch after this list):
– For each data point x of the training set D
– Construct training set D − {x} and test set {x}
– Train on D − {x}, test on x
– Overall accuracy = average over all such cases
– Expensive to compute

Hold-out set:
– Randomly choose x% (say 25-30%) of the training data and set it aside as the test set
– Train on the rest of the training data, test on the test set
– Easy to compute, but tends to have higher variance
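A leave-one-out sketch in Python (illustrative; it reuses the hypothetical knn_predict from the k-NN slide as the classifier):

```python
import numpy as np

def leave_one_out_accuracy(X, y, k=3):
    """Hold out each point in turn, train on the rest (D - {x}),
    test on the held-out point, and average the results."""
    correct = 0
    for i in range(len(X)):
        rest = np.delete(np.arange(len(X)), i)   # indices of D - {x_i}
        pred = knn_predict(X[rest], y[rest], X[i], k=k)
        correct += (pred == y[i])
    return correct / len(X)
```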

Page 23: The k-fold Cross Validation Method

– Randomly divide the training data D into k partitions D₁, …, Dₖ (a possibly equal division)
– For each fold Dᵢ:
  – Train a classifier with training data = D − Dᵢ
  – Test and validate with Dᵢ
– Overall accuracy: average accuracy over all folds
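A k-fold cross-validation sketch (illustrative Python; `classify(train_X, train_y, query)` stands for any classifier, e.g. the knn_predict sketch from earlier):

```python
import numpy as np

def k_fold_accuracy(X, y, classify, k=5, seed=0):
    """Randomly divide the data into k roughly equal partitions,
    train on D - D_i, test on D_i, and average the accuracies."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # D_1, ..., D_k
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = [classify(X[train], y[train], X[t]) for t in test]
        accs.append(np.mean([p == y[t] for p, t in zip(preds, test)]))
    return float(np.mean(accs))
```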

Page 24: References

– Lecture videos by Prof. Andrew Ng, Stanford University; available on Coursera (course: Machine Learning)
– Data Mining Map: http://www.saedsayad.com/