Notes and Announcements
• Midterm exam: Oct 20, Wednesday, In Class
• Late Homeworks
– Turn in hardcopies to Michelle.
– DO NOT ask Michelle for extensions.
– Note down the date and time of submission.
– If submitting softcopy, email to 10-701 instructors list.
– Software needs to be submitted via Blackboard.
• HW2 out today – watch email
Projects
Hands-on experience with machine learning algorithms – understand when they work and fail, and develop new ones!
Project ideas are online; discuss with the TAs – every project must have a TA mentor.
• Proposal (10%): Oct 11
• Mid-term report (25%): Nov 8
• Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium
• Final Project report (45%): Dec 6
Project Proposal
• Proposal (10%): Oct 11
– 1 pg maximum
– Describe data set
– Project idea (approx two paragraphs)
– Software you will need to write.
– 1-3 relevant papers. Read at least one before submitting your proposal.
– Teammate. Maximum team size is 2. Describe the division of work.
– Project milestone for the mid-term report? Include experimental results.
Recitation Tomorrow!
• Linear & Non-linear Regression, Nonparametric methods
• Strongly recommended!!
• Place: NSH 1507 (Note)
• Time: 5-6 pm
Non-parametric methods
Kernel density estimate, kNN classifier, kernel regression
Aarti Singh
Machine Learning 10-701/15-781, Sept 29, 2010
Parametric methods
• Assume some functional form (Gaussian, Bernoulli, Multinomial, logistic, Linear) for
– P(Xi|Y) and P(Y) as in Naïve Bayes
– P(Y|X) as in Logistic regression
• Estimate parameters (μ, σ², θ, w, β) using MLE/MAP and plug in
• Pro – need few data points to learn parameters
• Con – Strong distributional assumptions, not satisfied in practice
Example
[Figure: hand-written digit images projected as points on a two-dimensional (nonlinear) feature space]
Non-Parametric methods
• Typically don’t make any distributional assumptions
• As we have more data, we should be able to learn more complex models
• Let number of parameters scale with number of training data
• Today, we will see some nonparametric methods for
– Density estimation
– Classification
– Regression
Histogram density estimate
Partition the feature space into distinct bins with widths Δᵢ and count the number of observations, nᵢ, in each bin.
• Often, the same width is used for all bins, Δᵢ = Δ.
• Δ acts as a smoothing parameter.
Image src: Bishop book
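A minimal sketch of the histogram estimator in Python (NumPy assumed; the bin count and the normal test data are illustrative choices, not from the slides):

```python
import numpy as np

def histogram_density(data, num_bins=10):
    """Histogram density estimate: constant on each bin,
    p_hat(x) = n_i / (n * width_i) for x in bin i."""
    counts, edges = np.histogram(data, bins=num_bins)
    widths = np.diff(edges)
    densities = counts / (len(data) * widths)  # height of each bin

    def p_hat(x):
        # locate the bin containing x; zero density outside the data range
        i = np.searchsorted(edges, x, side="right") - 1
        return densities[i] if 0 <= i < num_bins else 0.0

    return p_hat

# Usage: 500 standard-normal samples, 20 bins
rng = np.random.default_rng(0)
f = histogram_density(rng.normal(size=500), num_bins=20)
print(f(0.0), f(3.5))  # high near the mode, small or zero far out
```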
Effect of histogram bin width
# bins = 1/Δ (unit-length feature space)
Assuming the density is roughly constant in each bin (holds true if Δ is small), the histogram estimate is nearly unbiased: for x in bin i,
E[p̂(x)] = (1/Δᵢ) ∫_{bin i} p(u) du ≈ p(x).
Bias – Variance tradeoff
• Choice of #bins (# bins = 1/Δ)
• Bias – how close the mean of the estimate is to the truth
• Variance – how much the estimate varies around its mean

Small Δ, large #bins: "small bias, large variance" (p(x) approx constant per bin, but fewer points per bin)
Large Δ, small #bins: "large bias, small variance" (more data per bin, stable estimate)
Choice of #bins
Image src: Bishop book; Larry's book
# bins = 1/Δ
For fixed n, as Δ decreases, nᵢ decreases (fewer points per bin).
MSE = Bias² + Variance
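In equations, this is the standard histogram risk analysis from Larry's book (All of Nonparametric Statistics); a sketch, assuming a smooth density with square-integrable derivative:

```latex
% Integrated risk (MSE) of the histogram with bin width \Delta:
R(\Delta) \;\approx\;
\underbrace{\frac{\Delta^2}{12}\int \bigl(p'(u)\bigr)^2\,du}_{\text{Bias}^2}
\;+\;
\underbrace{\frac{1}{n\Delta}}_{\text{Variance}}
% Minimizing over \Delta balances the two terms:
\qquad\Longrightarrow\qquad
\Delta^* \propto n^{-1/3},
\quad
R(\Delta^*) = O\!\bigl(n^{-2/3}\bigr).
```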
Histogram as MLE
• Class of density estimates – constant on each bin; parameters pⱼ = density in bin j
• Note Σⱼ pⱼ Δⱼ = 1, since the estimate must integrate to 1
• Maximize the likelihood of the data under the probability model with parameters pⱼ
• Show that histogram density estimate is MLE under this model – HW/Recitation
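One way the argument can go (a sketch using a Lagrange multiplier for the integrate-to-one constraint; nⱼ denotes the count in bin j, as on the earlier slides):

```latex
% Log-likelihood of the n points under the piecewise-constant model,
% with n_j points falling in bin j:
\log L(p) = \sum_j n_j \log p_j,
\qquad \text{subject to } \sum_j p_j \Delta_j = 1.
% Stationarity of \sum_j n_j \log p_j - \lambda\bigl(\sum_j p_j \Delta_j - 1\bigr):
\frac{n_j}{\hat p_j} = \lambda \Delta_j
\;\Longrightarrow\;
\hat p_j = \frac{n_j}{\lambda \Delta_j},
\qquad \lambda = n \text{ (from the constraint)}
\;\Longrightarrow\;
\hat p_j = \frac{n_j}{n\,\Delta_j}.
```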
Kernel density estimate
• Histogram – blocky estimate
• Kernel density estimate aka "Parzen/moving window" method
• More generally, with a kernel function K and bandwidth h:

p̂(x) = (1/(nh)) Σᵢ₌₁ⁿ K((x − Xᵢ)/h)

[Figure: boxcar kernel, constant on the interval [−1, 1]]
Kernel density estimation
Gaussian bumps (red) around six data points and their sum (blue)
• Place small "bumps" at each data point, determined by the kernel function.
• The estimator consists of a (normalized) "sum of bumps".
• Note that where the points are denser the density estimate will have higher values.
Img src: Wikipedia
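A minimal sketch of a Gaussian-kernel density estimator (NumPy assumed; the six data points are hypothetical stand-ins for those in the figure):

```python
import numpy as np

def kde(data, h):
    """Kernel density estimate with a Gaussian kernel:
    p_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    data = np.asarray(data, dtype=float)
    n = len(data)

    def p_hat(x):
        u = (x - data) / h                                  # scaled distance to each point
        bumps = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # one Gaussian bump per point
        return bumps.sum() / (n * h)                        # normalized sum of bumps

    return p_hat

# Usage: six points; the estimate is larger where points are denser
f = kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], h=1.0)
print(f(-1.0), f(4.0))
```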
Kernels
Any kernel function K that satisfies K(u) ≥ 0 and ∫ K(u) du = 1 (and is typically symmetric about zero) can be used.
• Finite support (e.g., boxcar) – only need local points to compute the estimate
• Infinite support (e.g., Gaussian) – need all points to compute the estimate, but quite popular since the resulting estimate is smoother (10-702)
Choice of kernel bandwidth
Image source: Larry's book – All of Nonparametric Statistics
[Figure: Bart-Simpson density and kernel estimates with bandwidth too small (undersmoothed), just right, and too large (oversmoothed)]
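A quick way to see the effect, sketched with SciPy's gaussian_kde (the mixture below is my reconstruction of the Bart-Simpson density from Larry's book, and the three bandwidth factors are illustrative guesses):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Bart-Simpson density: half a broad N(0,1), half narrow spikes at -1,...,1
rng = np.random.default_rng(0)
broad = rng.normal(0.0, 1.0, size=500)
spikes = rng.normal(rng.choice([-1.0, -0.5, 0.0, 0.5, 1.0], size=500), 0.1)
data = np.concatenate([broad, spikes])

xs = np.linspace(-3, 3, 601)
for factor in (0.02, 0.2, 1.0):  # too small / roughly right / too large
    f = gaussian_kde(data, bw_method=factor)  # scalar -> used as bandwidth factor
    print(f"factor={factor}: peak of estimate ~ {f(xs).max():.2f}")
```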
Histograms vs. Kernel density estimation
Bin width Δ and bandwidth h play the same role: each acts as a smoother.
Bias-variance tradeoff
• Simulations
k-NN (Nearest Neighbor) density estimation
• Histogram / kernel density estimate: fix Δ, estimate the number of points within Δ of x (nᵢ or nₓ) from data
• k-NN density estimate: fix nₓ = k, estimate Δ from data (the volume of the ball around x that contains the k nearest training pts):

p̂(x) = k / (n · vol(ball around x))
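A one-dimensional sketch (NumPy assumed; in 1-D the "ball" is the interval of half-width equal to the k-th nearest-neighbor distance):

```python
import numpy as np

def knn_density(data, k):
    """k-NN density estimate: p_hat(x) = k / (n * V),
    where V = 2 * d_k is the length of the smallest interval
    around x containing the k nearest training points."""
    data = np.asarray(data, dtype=float)
    n = len(data)

    def p_hat(x):
        d_k = np.sort(np.abs(data - x))[k - 1]  # distance to k-th nearest point
        return k / (n * 2 * d_k)

    return p_hat

f = knn_density(np.random.default_rng(0).normal(size=500), k=10)
print(f(0.0), f(3.0))  # note the heavy tails: the estimate is never exactly zero
```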
k-NN density estimation
k acts as a smoother.
Not very popular for density estimation – expensive to compute, bad estimates
But a related version for classification quite popular…
From density estimation to classification
k-NN classifier
[Figure: training documents in feature space, labeled Sports, Science, Arts]
[Figure: the same training documents plus an unlabeled test document]
k-NN classifier (k=4)
[Figure: ball of radius D_{k,x} around the test document, containing its k = 4 nearest labeled neighbors]
What should we predict? … Average? Majority? Why?
k-NN classifier
• Optimal (Bayes) classifier:

f*(x) = arg maxᵧ P(Y = y) p(x | Y = y)

• k-NN classifier: plug in the k-NN estimates p̂(x | Y = y) = kᵧ / (nᵧ · vol(ball)) and P̂(Y = y) = nᵧ / n, where
nᵧ = # total training pts of class y
kᵧ = # training pts of class y that lie within the D_{k,x} ball
Since P̂(Y = y) p̂(x | Y = y) = kᵧ / (n · vol(ball)) ∝ kᵧ, this reduces to

f̂(x) = arg maxᵧ kᵧ (majority vote)
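A minimal sketch of the majority-vote rule (NumPy assumed; the toy 2-D features and labels are hypothetical, standing in for the document example above):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k):
    """k-NN classifier: majority vote among the k nearest training points,
    i.e. predict arg max over classes y of k_y."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to the test point
    nearest = np.argsort(dists)[:k]               # indices of k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)  # class counts k_y in the ball
    return votes.most_common(1)[0][0]             # class with largest k_y

# Usage on a toy setup
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2], [0.5, 0.5]])
y = np.array(["Sports", "Sports", "Science", "Science", "Arts"])
print(knn_classify(X, y, np.array([0.2, 0.8]), k=3))  # -> Sports (2 of 3 votes)
```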
1-Nearest Neighbor (kNN) classifier
[Figure: 1-NN prediction on the Sports / Science / Arts example]
2-Nearest Neighbor (kNN) classifier
[Figure: 2-NN prediction on the same example]
Even K is not used in practice (the two nearest neighbors can tie).
3-Nearest Neighbor (kNN) classifier
[Figure: 3-NN prediction on the same example]
5-Nearest Neighbor (kNN) classifier
[Figure: 5-NN prediction on the same example]
What is the best K?
Bias-variance tradeoff:
• Larger K ⇒ predicted label is more stable (lower variance)
• Smaller K ⇒ predicted label is more accurate on average (lower bias)
Similar to density estimation
Choice of K - in next class …
1-NN classifier – decision boundary
[Figure: K = 1 decision boundary – the Voronoi diagram of the training points]
k-NN classifier – decision boundary
• K acts as a smoother (Bias-variance tradeoff)
• Guarantee: As n → ∞, the error rate of the 1-nearest-neighbor classifier is never more than twice the optimal (Bayes) error.
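In symbols, this is the classical Cover & Hart (1967) result; stated here for binary classification, with R* denoting the Bayes risk:

```latex
% Asymptotic risk of the 1-NN classifier (two classes):
R^\ast \;\le\; R_{1\mathrm{NN}} \;\le\; 2R^\ast\bigl(1 - R^\ast\bigr) \;\le\; 2R^\ast.
```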