Slide 1
Local surrogates
- To model a complex wavy function we need a lot of data.
- Modeling a wavy function with high-order polynomials is inherently ill-conditioned.
- With a lot of data we normally predict function values using only nearby values.
- We may fit several local surrogates, as in the figure.
For example, if you have the price of gasoline every first of
the month from 2000 through 2009, how many values would you use to
estimate the price on June 15, 2007?
Linear regression using low order polynomials is ideal for
approximating simple functions when the data is contaminated with
substantial noise. In that case we may want many data points in order to filter out the noise, and this is done well by giving all the points equal weight. On the other hand, when we model a complex wavy function without much noise, we need a large number of points in order to capture the local behavior of the function.
It is possible to model a complex function with high-order polynomials, but this is inherently ill-conditioned. Even if we use orthogonal polynomials to alleviate the ill-conditioning, we often get poor accuracy.
Instead, with a lot of data about a wavy function, we should use
only data near the point where we want to estimate the function.
For example, if we have a table that lists the price of gasoline
every first of the month from 2000 to 2009, and we want to estimate
the price on June 15, 2007, we would probably interpolate linearly
using the values for June 1, 2007 and July 1, 2007.
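Concretely, since June 15 is 14 of the 30 days between June 1 and July 1, linear interpolation gives

p(June 15) ≈ p(June 1) + (14/30)·[p(July 1) − p(June 1)].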
The figure compares a single global surrogate fitted to the entire data set with three local surrogates fitted to the data in three different regions.
Slide 2
Popular local surrogates
- Moving least squares: weighting more heavily points near the prediction location.
- Radial basis neural network: regression with local functions that decay away from data points.
- Kriging: radial basis functions, but with a fitting philosophy based not on the error at data points but on the correlation between function values at near and far points.
In this lecture we will study two local surrogates, moving least squares and radial basis neural networks. Moving least squares performs weighted linear regression, with nearby points having larger weights than faraway points. Radial basis neural networks achieve a similar result by having the data values at the points multiply functions that peak at the data point and decay rapidly away from it. A third popular local surrogate, kriging, is covered in a separate lecture, because it is currently the most popular local surrogate and is more versatile than the other two. However, it is also much more computationally expensive, especially for a large number of data points.

Slide 3
Review of Linear Regression
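Presumably the review covers the standard least-squares result: with a surrogate \(\hat{y}(x) = \xi(x)^T \beta\) built from basis functions \(\xi\) (the notation here is mine), the coefficients that minimize the sum of squared residuals are

\[ \hat{\beta} = (X^T X)^{-1} X^T y, \]

where \(X\) is the matrix of basis functions evaluated at the data points.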
Slide 4
Moving least squares
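A sketch of the idea (the Gaussian weight function below is one common choice, assumed here rather than taken from the slide): at each prediction point x, moving least squares performs a weighted least-squares fit of a low-order polynomial, with weights that decay with distance from x,

\[ w_i(x) = \exp\!\left(-\frac{\|x - x_i\|^2}{\theta^2}\right), \]

so the coefficients are recomputed at every prediction point, and \(\theta\) controls how quickly faraway points lose influence.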
Slide 5
Weighted least squares
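In the standard formulation, weighted least squares minimizes the weighted sum of squared residuals; with \(W = \mathrm{diag}(w_i)\),

\[ \hat{\beta} = \arg\min_{\beta} \sum_i w_i \big(y_i - \xi(x_i)^T \beta\big)^2 = (X^T W X)^{-1} X^T W y. \]

With all weights equal this reduces to the ordinary least squares of the review slide.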
Slide 6
Six-hump camelback function
Definition:
Function fit with moving least squares using quadratic polynomials.
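The standard six-hump camelback function (presumably the definition intended above) is

\[ f(x_1, x_2) = \left(4 - 2.1 x_1^2 + \tfrac{1}{3} x_1^4\right) x_1^2 + x_1 x_2 + \left(-4 + 4 x_2^2\right) x_2^2, \]

usually considered on \(x_1 \in [-3, 3]\), \(x_2 \in [-2, 2]\).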
Slide 7
Effect of number of points and decay rate
Slide 8
Radial basis neural networks
(Figure: schematic of a radial basis network, with the input x feeding radial basis functions a1, a2, a3 that are combined through weights W1, W2, W3 and a bias b into the output; an inset shows the radial basis function, which peaks at 1 and equals 0.5 at inputs of ±0.833.)
Slide 9
In regression notation
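A sketch of the notation, assuming Matlab's radbas basis function exp(-n^2), whose input is scaled so that the function drops to 0.5 at a distance of one spread (hence the factor 0.8326 = sqrt(ln 2), matching the ±0.833 in the figure):

\[ \hat{y}(x) = b_0 + \sum_{i=1}^{n_b} W_i \exp\!\left(-\left(\frac{0.8326\,\|x - x_i\|}{\text{spread}}\right)^2\right), \]

where the exponentials play the role of basis functions and the weights \(W_i\) (plus the output bias \(b_0\)) are the regression coefficients.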
Slide 10
Example
Evaluate the function y = x + 0.5sin(5x) at 21 points in the interval [1,9], fit an RBF to it, and compare the surrogate to the function over the interval [0,10].
- A fit using the default options in Matlab achieves zero rms error by using all data points as basis functions (neurons).
- Very good interpolation, but even mild extrapolation is horrible.
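A minimal script reproducing this experiment (a sketch assuming the Neural Network Toolbox; with its default error goal of 0, newrb keeps adding neurons until the data are fitted exactly, here one neuron per point):

x = linspace(1,9,21);               % 21 data points in [1,9]
y = x + 0.5*sin(5*x);               % function values
net = newrb(x,y);                   % default goal = 0: exact interpolation
xt = linspace(0,10,201);            % test grid, including extrapolation
yt = sim(net,xt);                   % surrogate predictions
plot(xt, xt+0.5*sin(5*xt), xt, yt, '--', x, y, 'o')
legend('true function','RBF surrogate','data')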
Slide 11
Accept 0.1 mean squared error
net = newrb(x,y,0.1,1,20,1); (spread set to 1; 11 neurons were used)
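For reference, the arguments of newrb(P,T,goal,spread,MN,DF) are the data, the mean squared error goal (0.1 here), the spread of the radial basis functions (1), the maximum number of neurons (20), and the number of neurons between progress displays (1).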
With about half of the data points used as basis functions, the fit is more like polynomial regression. Interpolation is not as good, but the trend is captured, so extrapolation is not as disastrous. Obviously, if we just wanted to capture the trend, we would have been better off with a polynomial.
At the other extreme, we can allow a fairly large mean squared error and get a more robust fit in terms of the trend, at the expense of poorer interpolation. Here we specify a mean squared error of 0.1, which corresponds to an rms error of about 0.3.

As can be seen in the figure, the fit now captures the trend better, so that extrapolation is less risky. This is done by using only 11 of the 21 data points as neurons, so that the ratio between data points and coefficients is similar to what we normally look for in polynomial regression. However, this comes at the expense of much poorer accuracy in the interpolation range. Obviously, though, if we merely want to capture the overall trend, we should not be using a purely local surrogate like a radial basis neural network. Kriging, for example, uses the same local shape functions, but it also permits modeling a trend, so the local functions are used only to model the departure from the trend.

Slide 12
Too narrow a spread
net = newrb(x,y,0.1,0.2,20,1); (17 neurons used)
With a spread of 0.2 and the points 0.4 apart (21 points in [1,9]), the shape functions decay to less than 0.02 at the nearest point. This means that each data point is fitted individually, so we get spikes at the data points. A rule of thumb is that the spread should not be smaller than the distance to the nearest point.
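A quick check of that figure (assuming the shape function exp(-(d/spread)^2); Matlab's radbas includes the extra 0.8326 scaling, which would give about 0.06 instead):

exp(-(0.4/0.2)^2)   % = 0.0183, shape-function value at the nearest point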
Slide 13
Problems
1. Fit the example with weighted least squares. You can use Matlab's lscov to perform the fit; compare the result to the neural network fit. (A starting sketch follows below.)
2. Repeat the example with 41 points, experimenting with the parameters of newrb. How much of what you see did you expect?
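A possible starting point for the first problem (the quadratic basis, Gaussian weights, decay length, and prediction point are illustrative choices, not prescribed by the slides):

x = linspace(1,9,21)';              % the 21 data points of the example
y = x + 0.5*sin(5*x);               % function values
xp = 5;                             % hypothetical prediction point
theta = 1;                          % assumed decay length of the weights
A = [ones(size(x)) x x.^2];         % quadratic polynomial basis
w = exp(-((x - xp)/theta).^2);      % Gaussian weights centered at xp
beta = lscov(A, y, w);              % weighted least-squares coefficients
yp = [1 xp xp^2]*beta;              % moving least-squares prediction at xp

Sweeping xp over the interval gives the moving least-squares curve to compare against newrb.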