Page 1

Notes and Announcements

• Midterm exam: Oct 20, Wednesday, In Class

• Late Homeworks
– Turn in hardcopies to Michelle.
– DO NOT ask Michelle for extensions.
– Note down the date and time of submission.
– If submitting softcopy, email to 10-701 instructors list.
– Software needs to be submitted via Blackboard.

• HW2 out today – watch email

Page 2

Projects

Hands-on experience with machine learning algorithms – understand when they work and fail, and develop new ones!

Project ideas are online; discuss with the TAs – every project must have a TA mentor.

• Proposal (10%): Oct 11

• Mid-term report (25%): Nov 8

• Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium

• Final Project report (45%): Dec 6

Page 3

Project Proposal

• Proposal (10%): Oct 11

– 1 pg maximum

– Describe data set

– Project idea (approx two paragraphs)

– Software you will need to write.

– 1-3 relevant papers. Read at least one before submitting your proposal.

– Teammate. Maximum team size is 2. Describe the division of work.

– What is your project milestone for the mid-term report? Include experimental results.

Page 4

Recitation Tomorrow!


• Linear & Non-linear Regression, Nonparametric methods

• Strongly recommended!!

• Place: NSH 1507 (Note)

• Time: 5-6 pm

Page 5

Non-parametric methods

Kernel density estimate, kNN classifier, kernel regression

Aarti Singh

Machine Learning 10-701/15-781, Sept 29, 2010

Page 6

Parametric methods

• Assume some functional form (Gaussian, Bernoulli, multinomial, logistic, linear) for

– P(Xi|Y) and P(Y) as in Naïve Bayes

– P(Y|X) as in Logistic regression

• Estimate parameters (μ, σ², θ, w, b) using MLE/MAP and plug in

• Pro – need few data points to learn parameters

• Con – strong distributional assumptions, often not satisfied in practice

Page 7

Example


Hand-written digit images projected as points in a two-dimensional (nonlinear) feature space.

Page 8

Non-Parametric methods

• Typically don’t make any distributional assumptions

• As we have more data, we should be able to learn more complex models

• Let the number of parameters scale with the number of training data points

• Today, we will see some nonparametric methods for

– Density estimation

– Classification

– Regression

Page 9

Histogram density estimate

Partition the feature space into distinct bins with widths Δi and count the number of observations, ni, in each bin.

• Often, the same width is used for all bins, Δi = Δ.

• Δ acts as a smoothing parameter.

Image src: Bishop book
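To make the estimator concrete, here is a minimal numpy sketch, assuming equal-width bins and the standard form p̂(x) = ni/(nΔ) for x in bin i (the function and variable names are illustrative):

import numpy as np

def histogram_density(data, x, delta):
    # Histogram density estimate: p_hat(x) = n_i / (n * delta),
    # where n_i counts the training points in the bin containing x.
    n = len(data)
    bin_idx = np.floor(x / delta)                    # bin containing the query x
    n_i = np.sum(np.floor(data / delta) == bin_idx)  # points in that bin
    return n_i / (n * delta)

# Example: estimate the density of Gaussian samples at x = 0
data = np.random.randn(1000)
print(histogram_density(data, x=0.0, delta=0.5))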

Page 10

Effect of histogram bin width

# bins = 1/Δ (for a unit-length feature space)

Bias of histogram density estimate: E[p̂(x)] = Pi/Δ, where Pi is the probability mass in the bin containing x. Assuming the density is roughly constant in each bin (holds true if Δ is small), this is ≈ p(x), so the bias is small.

Page 11

Bias – Variance tradeoff

• Choice of #bins

• Bias – how close is the mean of the estimate to the truth
• Variance – how much does the estimate vary around its mean

# bins = 1/Δ

Small Δ, large #bins: "small bias, large variance" (p(x) approx constant per bin)

Large Δ, small #bins: "large bias, small variance" (more data per bin, stable estimate)

Bias-variance tradeoff

Page 12

Choice of #bins

# bins = 1/Δ

For fixed n, as Δ decreases, ni decreases.

MSE = Bias² + Variance

Image src: Bishop book, Larry's book

Page 13

Histogram as MLE

• Class of density estimates – constant on each bin, with parameters pj = density in bin j

Note that Σj pj Δ = 1, since the density must integrate to 1.

• Maximize the likelihood of the data under the probability model with parameters pj

• Show that the histogram density estimate is the MLE under this model – HW/Recitation

Page 14

• Histogram – blocky estimate

• Kernel density estimate aka “Parzen/moving window method”

Kernel density estimate

[Figure: two density-estimate plots over x ∈ [−5, 5]]

Page 15

Kernel density estimate

• More generally, with kernel K and bandwidth h:

p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h)

The boxcar kernel, K(u) = ½ · 1{−1 ≤ u ≤ 1}, recovers the moving-window estimate.

Page 16

Kernel density estimation


Gaussian bumps (red) around six data points and their sum (blue)

• Place small "bumps" at each data point, determined by the kernel function.

• The estimator consists of a (normalized) "sum of bumps".

• Note that where the points are denser the density estimate will have higher values.

Img src: Wikipedia
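As a minimal sketch of this estimator with a Gaussian kernel, assuming the standard form p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h) (names are illustrative):

import numpy as np

def kde(data, x, h):
    # Gaussian kernel density estimate at the query points x.
    u = (x[:, None] - data[None, :]) / h              # scaled distances to each X_i
    bumps = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return bumps.mean(axis=1) / h                     # average of bumps, scaled by 1/h

# Example: six data points, estimated on a grid (cf. the figure above)
data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
grid = np.linspace(-5, 10, 200)
density = kde(data, grid, h=1.0)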

Page 17

Kernels

Any kernel function that satisfies

K(x) ≥ 0 and ∫ K(x) dx = 1

can be used, so that the resulting estimate is itself a valid density.

Page 18

Kernels

• Finite support (e.g., boxcar) – only need local points to compute the estimate.

• Infinite support (e.g., Gaussian) – need all points to compute the estimate, but quite popular since smoother (10-702).

Page 19

Choice of kernel bandwidth

Image source: Larry's book – All of Nonparametric Statistics

[Figure: the "Bart-Simpson" density with kernel density estimates for a bandwidth that is too small, just right, and too large]

Page 20

Histograms vs. Kernel density estimation

Bin width Δ and bandwidth h play the same role: each acts as a smoother.

Page 21

Bias-variance tradeoff

• Simulations

Page 22

k-NN (Nearest Neighbor) density estimation

• Histogram, kernel density estimate: fix Δ, estimate the number of points within Δ of x (ni or nx) from the data.

• k-NN density estimate: fix nx = k, estimate Δ from the data (the volume of the ball around x that contains the k nearest training points).
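A minimal one-dimensional sketch, assuming the standard form p̂(x) = k/(n · volume of the ball that reaches the k-th neighbor); in 1-D that volume is 2Rk (names are illustrative):

import numpy as np

def knn_density(data, x, k):
    # k-NN density estimate: p_hat(x) = k / (n * 2 * R_k), where R_k is
    # the distance from x to its k-th nearest training point.
    n = len(data)
    dists = np.sort(np.abs(data - x))   # distances to all training points
    r_k = dists[k - 1]                  # distance to the k-th nearest neighbor
    return k / (n * 2 * r_k)            # ball of radius R_k has length 2*R_k in 1-D

data = np.random.randn(500)
print(knn_density(data, x=0.0, k=10))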

Page 23

k-NN density estimation

k acts as a smoother.

Not very popular for density estimation – expensive to compute, bad estimates.

But a related version for classification is quite popular…

Page 24

From Density Estimation to Classification

Page 25

k-NN classifier

[Figure: training documents in three classes – Sports, Science, Arts]

Page 26

k-NN classifier

[Figure: the same three classes – Sports, Science, Arts – plus an unlabeled test document]

Page 27

k-NN classifier (k=4)

[Figure: the test document with its k = 4 nearest neighbors inside a ball of radius Dk,x]

What should we predict? … Average? Majority? Why?


Page 28

k-NN classifier

• Optimal classifier:

f*(x) = arg max_y P(Y = y) p(X = x | Y = y)

• k-NN classifier: plug in the k-NN estimates P̂(Y = y) = ny/n and p̂(X = x | Y = y) = ky/(ny V), where ny = # total training pts of class y, ky = # training pts of class y that lie within the Dk ball around x, and V is the ball's volume. The ny and V cancel, leaving

f̂(x) = arg max_y ky (majority vote)
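A minimal majority-vote sketch, assuming Euclidean distance (names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k):
    # Predict the majority label among the k nearest training points.
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from x to every X_i
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])             # label counts inside the D_k ball
    return votes.most_common(1)[0][0]             # majority vote

# Example with three document classes
X_train = np.array([[0, 1], [1, 0], [5, 5], [6, 5], [0, 9], [1, 8]])
y_train = np.array(["Sports", "Sports", "Science", "Science", "Arts", "Arts"])
print(knn_classify(X_train, y_train, x=np.array([5.5, 5.2]), k=3))  # -> "Science"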

Page 29

1-Nearest Neighbor (kNN) classifier

[Figure: Sports / Science / Arts example with k = 1]

Page 30

2-Nearest Neighbor (kNN) classifier

[Figure: Sports / Science / Arts example with k = 2]

An even k is not used in practice (the vote can tie).

Page 31

3-Nearest Neighbor (kNN) classifier

[Figure: Sports / Science / Arts example with k = 3]

Page 32

5-Nearest Neighbor (kNN) classifier

[Figure: Sports / Science / Arts example with k = 5]

Page 33

What is the best K?

Bias-variance tradeoff:
• Larger K => predicted label is more stable (lower variance)
• Smaller K => predicted label is more accurate on average (lower bias)

Similar to density estimation

Choice of K - in next class …

Page 34

1-NN classifier – decision boundary

K = 1

Voronoi diagram

Page 35

k-NN classifier – decision boundary


• K acts as a smoother (Bias-variance tradeoff)

• Guarantee: as the number of training points n → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal (Bayes) error.

Page 36

Case Study: kNN for Web Classification

• Dataset – 20 News Groups (20 classes)

– Download: http://people.csail.mit.edu/jrennie/20Newsgroups/

– 61,118 words, 18,774 documents

– Class label descriptions

Page 37

Experimental Setup

• Training/Test sets:
– 50%-50% random split
– 10 runs
– report average results

• Evaluation criteria: accuracy

Page 38

Results: Binary Classes

[Figure: accuracy vs. k for three binary tasks – alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles]

Page 39

From Classification to Regression

Page 40

Temperature sensing

• What is the temperature in the room? → Average

• What is the temperature at location x? → "Local" average

Page 41

Kernel Regression

• Aka Local Regression

• Nadaraya-Watson kernel estimator:

f̂(x) = Σi wi(x) Yi, where wi(x) = K((x − Xi)/h) / Σj K((x − Xj)/h)

• Weight each training point based on distance to test point

• Boxcar kernel yields a local average over the points within bandwidth h of x
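A minimal numpy sketch of the Nadaraya-Watson estimator with a Gaussian kernel (names are illustrative):

import numpy as np

def nadaraya_watson(X, Y, x, h):
    # Kernel regression: weighted average of Y with weights
    # w_i = K((x - X_i)/h) / sum_j K((x - X_j)/h).
    u = (x - X) / h
    kernel = np.exp(-0.5 * u**2)   # Gaussian kernel (normalizing constant cancels)
    w = kernel / kernel.sum()      # normalized weights
    return np.dot(w, Y)            # weighted local average

# Example: noisy sine data, prediction at x = 1.0
X = np.linspace(0, 6, 100)
Y = np.sin(X) + 0.1 * np.random.randn(100)
print(nadaraya_watson(X, Y, x=1.0, h=0.3))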

Page 42

Kernels


Page 43

Choice of kernel bandwidth h

Image source: Larry's book – All of Nonparametric Statistics

[Figure: kernel regression fits with h = 1, 10, 50, and 200 – too small a bandwidth undersmooths, too large oversmooths, and in between is just right]

Choice of kernel is not that important.

Page 44

Kernel Regression as Weighted Least Squares

Weighted least squares: minimize Σi K((x − Xi)/h) (Yi − f(Xi))² over f

Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(Xi) = b (a constant).

Page 45

Kernel Regression as Weighted Least Squares

Set f(Xi) = b (a constant) and minimize Σi K((x − Xi)/h) (Yi − b)² over b. Notice that the minimizer is exactly the Nadaraya-Watson estimator.
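A short worked version of that minimization, writing wi for K((x − Xi)/h):

\[
J(b) = \sum_{i=1}^n w_i (Y_i - b)^2, \qquad
\frac{dJ}{db} = -2 \sum_{i=1}^n w_i (Y_i - b) = 0
\;\Longrightarrow\;
\hat{b} = \frac{\sum_i w_i Y_i}{\sum_j w_j}
\]

which is exactly the Nadaraya-Watson weighted average.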

Page 46

Local Linear/Polynomial Regression

Weighted least squares: minimize Σi K((x − Xi)/h) (Yi − f(Xi))²

Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set

f(Xi) = b0 + b1 (Xi − x) + … + bp (Xi − x)^p

(a local polynomial of degree p around x)

More in HW, 10-702 (statistical machine learning)
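A minimal local linear (p = 1) sketch via weighted least squares, assuming a Gaussian kernel (names are illustrative):

import numpy as np

def local_linear(X, Y, x, h):
    # Fit b0 + b1*(X_i - x) by kernel-weighted least squares;
    # the intercept b0 is the estimate f_hat(x).
    w = np.exp(-0.5 * ((x - X) / h) ** 2)            # kernel weights
    A = np.stack([np.ones_like(X), X - x], axis=1)   # design matrix [1, X_i - x]
    sw = np.sqrt(w)                                  # reweight rows for plain lstsq
    b, *_ = np.linalg.lstsq(sw[:, None] * A, sw * Y, rcond=None)
    return b[0]

X = np.linspace(0, 6, 100)
Y = np.sin(X) + 0.1 * np.random.randn(100)
print(local_linear(X, Y, x=1.0, h=0.5))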

Page 47

Summary

• Instance based/non-parametric approaches


Four things make a memory-based learner:

1. A distance metric: dist(x, Xi) – Euclidean (and many more)

2. How many nearby neighbors / what radius to look at: k, Δ/h

3. A weighting function (optional): W based on kernel K

4. How to fit with the local points: average, majority vote, weighted average, polynomial fit

Page 48

Summary

• Parametric vs Nonparametric approaches


Nonparametric models make very mild assumptions about the data distribution and provide good models for complex data.

Parametric models rely on very strong (simplistic) distributional assumptions

Nonparametric models (other than histograms) require storing and computing with the entire data set.

Parametric models, once fitted, are much more efficient in terms of storage and computation.

Page 49

What you should know…

• Histograms, kernel density estimation
– Effect of bin width / kernel bandwidth
– Bias-variance tradeoff

• K-NN classifier
– Nonlinear decision boundaries

• Kernel (local) regression
– Interpretation as weighted least squares
– Local constant/linear/polynomial regression
