Global predictors of regression fidelity

A single number to characterize the overall quality of the surrogate.
Equivalence measures
  Coefficient of multiple determination
  Adjusted coefficient of multiple determination
Prediction accuracy measures
  Model independent: Cross validation error
  Model dependent: Standard error

This lecture is about obtaining measures that characterize the fidelity of a surrogate for predicting the behavior of future simulations. We limit ourselves here to global measures, meaning a single number that characterizes the overall fidelity.

The coefficient of multiple determination and its adjusted cousin measure the equivalence between the surrogate and the data in terms of variability. The first provides the fraction of the variability in the data captured by the surrogate. The second adjusts it in an attempt to estimate the fraction that will be captured by using the surrogate to predict values at other points. Good fidelity will be reflected in these coefficients being close to 1.

Prediction accuracy measures estimate the rms error to expect in predictions based on the surrogate. Cross validation error is a measure that can be applied to any surrogate, while the standard error applies only to linear regression with specific assumptions on the noise in the data. Good fidelity will be reflected in these errors being small compared to the average values in the data.

There are also measures that estimate the error in the coefficients and the error at a given point; these will be discussed in future lectures.

Linear Regression

Coefficient of multiple determination
Equivalence of the surrogate with the data is often measured by how much of the variance in the data is captured by the surrogate.

Coefficient of multiple determination and adjusted version
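As a rough illustration of the two coefficients, the sketch below uses the common textbook definitions: R2 = 1 - SSE/SST, and the adjusted version 1 - (1 - R2)(n - 1)/(n - n_coef). These forms and all names (`r_squared`, `n_coef`, the sample data) are assumptions for illustration and may differ in notation from the slide.

```python
import numpy as np

def r_squared(y, y_hat, n_coef):
    """Return (R2, adjusted R2) for data y and surrogate predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    ss_err = np.sum((y - y_hat) ** 2)       # variation not captured by the surrogate
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation of the data about its mean
    r2 = 1.0 - ss_err / ss_tot
    # Penalize the number of fitted coefficients (assumed textbook form).
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_coef)
    return r2, r2_adj

# Hypothetical data, fitted with a straight line y = b1 + b2*x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
r2, r2_adj = r_squared(y, X @ b, n_coef=2)
print(f"R2 = {r2:.4f}, adjusted R2 = {r2_adj:.4f}")
```

For a least-squares fit that includes a constant term, the adjusted value is never larger than R2, and the gap grows as the number of coefficients approaches the number of data points.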

R2 does not reflect accuracy
Compare y1=x to y2=0.1x plus the same noise (normally distributed with zero mean and standard deviation of 1).
Estimate the average errors between the function (red) and surrogate (blue).

R2=0.9785 (y1)

R2=0.3016 (y2)
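The comparison can be sketched numerically. The sample points (x = 1, 2, ..., 30), the random seed, and the helper name `fit_stats` are assumptions, so the numbers printed will not match the R2 values quoted above; the point is only that the same noise gives very different R2 but identical residuals.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
x = np.arange(1.0, 31.0)                 # assumed sample points x = 1..30
noise = rng.normal(0.0, 1.0, x.size)     # the same noise is added to both functions

X = np.column_stack([np.ones_like(x), x])   # straight-line surrogate

def fit_stats(y):
    """Least-squares line fit; return (R2, rms of the residuals)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return r2, np.sqrt(np.mean(resid ** 2))

r2_1, rms_1 = fit_stats(x + noise)          # y1 = x + noise
r2_2, rms_2 = fit_stats(0.1 * x + noise)    # y2 = 0.1x + noise
print(f"y1: R2 = {r2_1:.4f}, rms residual = {rms_1:.4f}")
print(f"y2: R2 = {r2_2:.4f}, rms residual = {rms_2:.4f}")
```

Because both true functions lie in the space spanned by the surrogate, the residuals depend only on the noise, so the rms residuals agree while R2 drops sharply for the function with the smaller range.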

Cross validation
Validation consists of checking the surrogate at a set of validation points.
This may be considered wasteful because we do not use all the points for fitting the best possible surrogate.
Cross validation divides the data into ng groups.
Fit the approximation to ng-1 groups, and use the last group to estimate the error. Repeat for each group.
When each group consists of one point, the error is often called PRESS (prediction error sum of squares).
Calculate the error at each point and then present the r.m.s. error.
For linear regression, it can be shown that these leave-one-out errors can be obtained from a single fit, without refitting for each omitted point.
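A minimal sketch of leave-one-out cross validation for linear regression, computed two ways: from the definition (refit with each point omitted) and from the well-known shortcut e_i / (1 - H_ii), where H = X (X^T X)^-1 X^T. The shortcut is the standard result for least squares; whether it is written this way on the slide is an assumption, and the data and function names are hypothetical.

```python
import numpy as np

def loo_errors_by_definition(X, y):
    """Refit without point i, then record the prediction error at point i."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = y[i] - X[i] @ b
    return errs

def loo_errors_by_formula(X, y):
    """Same errors from one fit: residual_i / (1 - H_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # projection ("hat") matrix
    resid = y - H @ y
    return resid / (1.0 - np.diag(H))

# Hypothetical data and a straight-line surrogate.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.3, 0.8, 2.4, 2.9, 4.2, 4.7])
X = np.column_stack([np.ones_like(x), x])

e_def = loo_errors_by_definition(X, y)
e_formula = loo_errors_by_formula(X, y)
press_rms = np.sqrt(np.mean(e_def ** 2))    # r.m.s. of the PRESS errors
print(e_def, e_formula, press_rms)
```

The definition-based loop works for any surrogate if the fitting call is swapped out; the shortcut applies only to linear least squares.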

Model based error for linear regression
The common assumptions for linear regression:
  The surrogate is in the functional form of the true function.
  The data is contaminated with normally distributed error with the same standard deviation at every point.
  The errors at different points are not correlated.
Under these assumptions, the noise standard deviation (called the standard error) is estimated from the residuals of the fit.

Similarly, the standard error in the coefficients is
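The two estimates can be sketched as follows, using the standard least-squares results: sigma_hat = sqrt(e^T e / (n_y - n_beta)) for the noise, and sigma_hat * sqrt(diag((X^T X)^-1)) for the coefficients. The symbol names and the small data set are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def standard_errors(X, y):
    """Return (coefficients, noise standard error, coefficient standard errors)."""
    n_y, n_beta = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    # Unbiased noise estimate: divide by the degrees of freedom, not by n_y.
    sigma_hat = np.sqrt(resid @ resid / (n_y - n_beta))
    se_b = sigma_hat * np.sqrt(np.diag(XtX_inv))
    return b, sigma_hat, se_b

# Hypothetical data with a straight-line surrogate y = b1 + b2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
b, sigma_hat, se_b = standard_errors(X, y)
print(b, sigma_hat, se_b)
```

Unlike cross validation, these estimates are only meaningful when the listed assumptions hold, in particular that the surrogate has the functional form of the true function.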

Comparison of errors

Top hat question
We sample the function y=x with noise at x=0, 1, 2 to get 0.5, 0.5, 2.5. Assume that the linear regression fit is y=0.8x.
What are the noise (epsilon), the discrepancy (e), the cross-validation error, and the actual error at x=2?

Prediction variance
Linear regression model

Define then

With some algebra

Standard error
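For reference, the standard derivation of the prediction variance under the linear regression assumptions can be sketched as below. The notation is assumed and may differ from the slide's: \xi_i are the basis functions, b the fitted coefficients, X the matrix of basis functions evaluated at the data points, and \mathbf{x}^{(m)} the vector of basis functions evaluated at the prediction point.

```latex
\hat{y}(\mathbf{x}) = \sum_{i=1}^{n_\beta} b_i\,\xi_i(\mathbf{x})
  = \mathbf{x}^{(m)T}\mathbf{b},
\qquad x^{(m)}_i = \xi_i(\mathbf{x})

\operatorname{Var}\!\left[\hat{y}(\mathbf{x})\right]
  = \mathbf{x}^{(m)T}\,\Sigma_b\,\mathbf{x}^{(m)}
  = \sigma^2\,\mathbf{x}^{(m)T}\left(X^T X\right)^{-1}\mathbf{x}^{(m)}

s_{\hat{y}} = \hat{\sigma}\,
  \sqrt{\mathbf{x}^{(m)T}\left(X^T X\right)^{-1}\mathbf{x}^{(m)}}
```

The second line uses \Sigma_b = \sigma^2 (X^T X)^{-1} for the covariance of the coefficients, and the third replaces the unknown noise \sigma with its estimate \hat{\sigma}.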

Example of prediction variance
For a linear polynomial response surface (RS) y=b1+b2x1+b3x2, find the prediction variance in the region

(a) For data at three vertices (omitting (1,1))

Interpolation vs. Extrapolation
At origin
At 3 vertices
At (1,1)
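A minimal sketch of case (a), assuming the region is the square -1 <= x1, x2 <= 1 with data at (-1,-1), (-1,1) and (1,-1). The region is an assumption; only the omitted vertex (1,1) is stated above. The function returns the standard error divided by the noise estimate, sqrt(xm^T (X^T X)^-1 xm), and its name is hypothetical.

```python
import numpy as np

def std_error_factor(points, x1, x2):
    """Standard error of the prediction at (x1, x2), in units of the noise estimate."""
    X = np.column_stack([np.ones(len(points)), points])   # rows are [1, x1, x2]
    xm = np.array([1.0, x1, x2])
    return np.sqrt(xm @ np.linalg.solve(X.T @ X, xm))

# Assumed data locations: three vertices of the square, omitting (1, 1).
three = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])

at_origin = std_error_factor(three, 0.0, 0.0)      # interpolation
at_vertex = std_error_factor(three, -1.0, -1.0)    # at a data point
at_missing = std_error_factor(three, 1.0, 1.0)     # extrapolation to the omitted vertex
print(at_origin, at_vertex, at_missing)
```

Under these assumptions the factor is smallest at the origin and largest at the omitted vertex, which is the interpolation versus extrapolation contrast the slide draws.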

Standard error contours

Data at four vertices
Now

And

Error at vertices

At the origin minimum is

How can we reduce error without adding points?
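The four-vertex case can be checked the same way. The sketch assumes the region is the square -1 <= x1, x2 <= 1 with data at all four vertices (±1, ±1); the grid resolution and names are arbitrary choices for illustration.

```python
import numpy as np

def std_error_factor(data_points, eval_points):
    """sqrt(xm^T (X^T X)^-1 xm) at each evaluation point, for y = b1 + b2*x1 + b3*x2."""
    X = np.column_stack([np.ones(len(data_points)), data_points])
    Xm = np.column_stack([np.ones(len(eval_points)), eval_points])
    M = np.linalg.inv(X.T @ X)
    return np.sqrt(np.einsum("ij,jk,ik->i", Xm, M, Xm))

# Assumed data locations: all four vertices of the square.
four = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])

g = np.linspace(-1.0, 1.0, 21)
grid = np.array([[a, b] for a in g for b in g])    # evaluation grid over the region

factor = std_error_factor(four, grid)
at_origin = std_error_factor(four, np.array([[0.0, 0.0]]))[0]
at_vertices = std_error_factor(four, four)
print(at_origin, at_vertices, factor.min(), factor.max())
```

With this design X^T X is diagonal, so the factor depends only on the distance from the origin; it is smallest at the origin and largest at the vertices.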

Graphical Comparison of Standard Errors
Panels: three points, four points.

A graphical comparison of the two cases shows that the effect of adding the point on the regions with low prediction variance is small. On the other hand, because we avoided extrapolation, the largest standard error was reduced by a factor of two.

Problems
The pairs (0,0), (1,1), (2,1) represent strain (millistrains) and stress (ksi) measurements.
  Estimate Young's modulus using regression.
  Calculate the error in Young's modulus using cross validation, both from the definition and from the formula on Slide 5 (cross validation).
Repeat the example of y=x, using only data at x=3,6,9,...,30. Use the same noise values as given for these points in the notes for Slide 4 (R2 does not reflect accuracy).
