K Nearest Neighbor Presentation

Post on 13-Apr-2017


k Nearest Neighbor

Dessy Amirudin

May 2016

Data Science Indonesia Bootcamp

Introduction

Other Names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning

History of kNN

• Used in statistical estimation and pattern recognition since the beginning of the 1970s as a non-parametric technique.

• The predicted outcome for a new point is based on its k nearest neighbors among the stored examples.

• The nearest neighbors are determined by a distance measure.

Application

• Text mining
• Agriculture
• Finance
• Healthcare

Source: http://personalexcellence.co/

Distance

• Numerical Data

• Categorical Data

Euclidean distance:

$$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
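The Euclidean distance for numerical data can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `euclidean_distance` and the sample points are assumptions, not part of the original slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical example points (a 3-4-5 right triangle).
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```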

Distance – Text Mining

Hamming Distance

The Hamming distance between:
• "karolin" and "kathrin" is 3.
• "karolin" and "kerstin" is 3.
• 1011101 and 1001001 is 2.
• 2173896 and 2233796 is 3.
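The Hamming distance counts the positions at which two equal-length strings differ. A minimal Python sketch (the function name `hamming_distance` is an assumption for illustration) reproduces the four examples on this slide:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # True counts as 1, so summing the comparisons counts the mismatches.
    return sum(ca != cb for ca, cb in zip(a, b))

# The pairs listed on the slide.
pairs = [
    ("karolin", "kathrin"),
    ("karolin", "kerstin"),
    ("1011101", "1001001"),
    ("2173896", "2233796"),
]
for a, b in pairs:
    print(a, b, hamming_distance(a, b))  # distances: 3, 3, 2, 3
```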

Regression Formulation

kNN Regression

[Figure: two example plots, x-axis from 0 to 18]

kNN Regression

$$y' = \frac{1}{K} \sum_{i=1}^{K} y_i$$
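In kNN regression, the prediction is the average response of the K nearest training points. The Python sketch below illustrates this under stated assumptions: Euclidean distance is used, ties are broken by training order, and the function name `knn_regress` and the toy data are hypothetical. It is not the code from the exercise files.

```python
import math

def knn_regress(train_x, train_y, query, k):
    """Predict y for `query` as the mean y of its k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Average the responses of the k nearest neighbors.
    return sum(train_y[i] for i in nearest) / k

# Hypothetical one-dimensional training data.
train_x = [[1.0], [2.0], [3.0], [10.0]]
train_y = [2.0, 4.0, 6.0, 20.0]

print(knn_regress(train_x, train_y, [2.5], k=2))  # 5.0 (mean of 4.0 and 6.0)
```

With k equal to the number of training points, the prediction collapses to the overall mean of the responses, which is one way to see why a large K gives a very smooth, high-bias fit.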

Simple Linear Regression

Exercise 1
• Open “simple_regression.R”
• Create the simulated data
• Follow the instructions in the file

Simulated Data 1

MSE Plot Simple Regression

Plot with K=1

Plot with K=10

Plot with K=100

Simple Linear Regression

Introduce Non-Linearity

Introducing Non Linear Component

MSE Plot Non Linear Problem

Curse of Dimensionality

Exercise 2
• Open “boston_knn_class.R”
• Load the MASS library
• Load the “Boston” data
• Follow the steps in the file

kNN Tips
• Normalize the input variables
• Find the optimum value of K using cross-validation
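Both tips can be sketched together in Python. The example below assumes min-max scaling as the normalization and leave-one-out cross-validation with mean squared error as the selection criterion; the slides do not specify either choice. All function names and the toy data are hypothetical. For brevity the scaling is fitted on the full data set before cross-validation; a stricter setup would refit it inside each fold.

```python
import math

def min_max_normalize(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def knn_predict(train_x, train_y, query, k):
    """kNN regression: mean response of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def loocv_mse(xs, ys, k):
    """Leave-one-out cross-validated mean squared error for a given k."""
    total = 0.0
    for i in range(len(xs)):
        # Hold out point i and predict it from all the others.
        pred = knn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], k)
        total += (pred - ys[i]) ** 2
    return total / len(xs)

def best_k(xs, ys, candidates):
    """Candidate k with the lowest leave-one-out MSE."""
    return min(candidates, key=lambda k: loocv_mse(xs, ys, k))

# Hypothetical data: the second feature is on a much larger scale.
raw_x = [[1, 1000], [2, 2000], [3, 3000], [4, 4000], [5, 5000], [6, 6000]]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

xs = min_max_normalize(raw_x)
chosen = best_k(xs, ys, candidates=[1, 2, 3])
print("chosen k:", chosen)
```

Without normalization, the feature with the largest scale dominates the distance, so the neighbors are effectively chosen on that feature alone.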

Other experiment

Classification Formulation

kNN Classification

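$$y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$$

where $D_z$ is the set of the k nearest neighbors of the test point $z$, $v$ ranges over the class labels, and $I(\cdot)$ is the indicator function (1 if its argument is true, 0 otherwise). The predicted class is the one with the most votes among the neighbors.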
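kNN classification assigns the class that occurs most often among the k nearest neighbors (a majority vote). The Python sketch below illustrates this, assuming Euclidean distance; the function name `knn_classify` and the toy data are hypothetical. Ties are resolved in favor of the tied label seen first among the nearest neighbors, which is a simplification.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k):
    """Predict the majority class among the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the class labels of the k nearest neighbors.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-class training data in two dimensions.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_x, train_y, [1.0, 0.9], k=3))  # a
print(knn_classify(train_x, train_y, [5.0, 5.1], k=3))  # b
```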

Binary Classification

Exercise 3
• Open “logistic vs knn v2.R”
• Follow the steps in the file

Recall on Confusion Table

• Source: Wikipedia

Multi-class Classification

Exercise 4
• Open “multiclass.R”
• Follow the steps in the file

Assignment

Assignment – Due Next Week
• Increase the accuracy on the multi-class problem by 10%.
• In a Word document, describe the improvement you obtained, the method you used, and why it works or does not work.
• Submit your code and Word document to trainer.datascience@gmail.com before 23 May 2016 23:59:59.

Hint: You can increase the sample size

References

• James G., Witten D., Hastie T. and Tibshirani R. An Introduction to Statistical Learning. Springer. 2014.
