Chapter 18,19.
Transcript
Page 1: Chapter 18,19

Page 2: Chapter 18,19

Building a model can be a never-ending process

IMPROVE THE MODEL

ADDING INTERACTIONS

Taking away variables

Doing transformation

Page 3: Chapter 18,19

How do we judge the quality of the model?

The answer: in relation to other models. The comparison can rest on

an analysis of residuals

the results of an ANOVA test or a Wald test

drop-in deviance

the AIC or BIC score

cross-validation error

bootstrapping

Page 4: Chapter 18,19

18.1. Residuals

Residuals are the difference between the actual response and the fitted values.

In the linear model formulation, the errors, akin to residuals, are normally distributed.

The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well.
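As a minimal sketch of the definition above, the following uses R's built-in mtcars data (an assumption made for illustration; the slides work with different data) to show that residuals are simply the actual response minus the fitted values.

```r
# Fit a simple linear model on the built-in mtcars data (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals by definition: actual response minus fitted values
manual_resid <- mtcars$mpg - fitted(fit)

# They agree with what residuals() extracts from the model object
all.equal(unname(manual_resid), unname(residuals(fit)))

# With an intercept in the model, least-squares residuals average to zero
mean(residuals(fit))
```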

Page 6: Chapter 18,19

Remember

In the coefficient plot, each coefficient is plotted as a point, with a thick line representing the one standard error confidence interval and a thin line representing the two standard error confidence interval. A vertical line marks 0. As a rule of thumb, if the two standard error confidence interval does not contain 0, the coefficient is statistically significant.
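The intervals behind such a plot can be computed by hand. The sketch below is an illustration on the built-in mtcars data (an assumption; the variable names are not from the slides) and applies the two standard error rule of thumb.

```r
# Illustrative model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

est <- coef(fit)                 # point estimates
se  <- sqrt(diag(vcov(fit)))     # standard errors of the coefficients

intervals <- data.frame(
  term     = names(est),
  estimate = est,
  low1  = est - se,     high1 = est + se,      # one standard error (thick line)
  low2  = est - 2 * se, high2 = est + 2 * se   # two standard errors (thin line)
)

# Rule of thumb: significant if the two standard error interval excludes 0
intervals$excludes_zero <- intervals$low2 > 0 | intervals$high2 < 0
intervals
```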

Page 9: Chapter 18,19

ggplot2 with linear regression

ggplot2 has a handy trick for dealing with lm models. We can use the model as the data source and ggplot2 "fortifies" it, creating new columns for easy plotting.

The basic structure for ggplot2 starts with the ggplot function, which at its most basic should take the data as its first argument. It can take more arguments, or fewer, but we will stick with that for now. After initializing the object, we add layers using the + symbol. To start, we will just discuss geometric layers such as points, lines and histograms. They are included using functions like geom_point, geom_line and geom_histogram. These functions take multiple arguments, the most important being which variable in the data gets mapped to which axis or other aesthetic using aes. Furthermore, each layer can have different aesthetic mappings and even different data.
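A hedged sketch of this pattern follows: an lm object is handed to ggplot as the data, and the fortified columns .fitted and .resid are mapped with aes. The mtcars model is an assumption for illustration, and the block is guarded so that it does nothing when ggplot2 is not installed. Recent ggplot2 releases may suggest broom::augment as an alternative to fortifying.

```r
# Illustrative model on built-in data (not the data used in the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # The model itself is the data source; ggplot2 fortifies it, adding
  # columns such as .fitted and .resid that aes() can refer to
  p <- ggplot(fit, aes(x = .fitted, y = .resid)) +
    geom_point() +                  # one geometric layer: the points
    geom_hline(yintercept = 0) +    # a second layer: reference line at 0
    labs(x = "Fitted values", y = "Residuals")
}
# print(p) draws the fitted-versus-residuals plot when p is not NULL
```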

Page 15: Chapter 18,19

Q-Q plot

If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution. Both the base graphics and ggplot2 versions are shown on the next slide.
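The quantities behind a Q-Q plot can be inspected directly in base R. This sketch (again on mtcars, an assumption for illustration) computes the standardized residuals and their theoretical normal quantiles without drawing anything.

```r
# Illustrative model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals from the fitted model
std_resid <- rstandard(fit)

# qqnorm returns the plotted coordinates:
#   x = theoretical normal quantiles, y = sample quantiles
qq <- qqnorm(std_resid, plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line
cor(qq$x, qq$y)

# To draw the plot itself: qqnorm(std_resid); qqline(std_resid)
```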

Page 17: Chapter 18,19

Histogram

Another diagnostic is a histogram of the residuals. This time we will not show the base graphics alternative, because a histogram is a standard plot that we have shown repeatedly. The residuals in the histogram are not normally distributed, meaning the model is not an entirely correct specification.

Page 19: Chapter 18,19

All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative.

Page 22: Chapter 18,19

where $n_i$ is the number of observations in group $i$, $\bar{Y}_i$ is the mean of group $i$, $\bar{Y}$ is the overall mean, $Y_{ij}$ is observation $j$ in group $i$, $N$ is the total number of observations and $K$ is the number of groups.
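The formula these symbols belong to does not appear in the transcript. Assuming it is the usual one-way ANOVA F statistic, which is the form the definitions above match, it reads:

```latex
F = \frac{\sum_{i} n_i \left(\bar{Y}_i - \bar{Y}\right)^2 / (K - 1)}
         {\sum_{ij} \left(Y_{ij} - \bar{Y}_i\right)^2 / (N - K)}
```

The numerator measures variation between the group means and the denominator measures variation within the groups.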

Page 23: Chapter 18,19

Apart from its use as a multisample test, ANOVA serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova returns a table of results that includes the residual sum of squares (RSS), a measure of error for which lower is better.
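A minimal sketch of such a comparison, using two nested models on the built-in mtcars data (the models are assumptions chosen for illustration):

```r
# Two nested candidate models (illustrative only)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

# Passing several model objects to anova returns one row per model
comparison <- anova(small, large)
comparison

# The RSS column holds the residual sum of squares; lower is better
comparison$RSS
```

Because the models are nested, the larger one can never have a higher RSS, which is why RSS alone tends to favor more complex models.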

Page 25: Chapter 18,19

AIC & BIC

Another metric is the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC (even if the values are negative) is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better.
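Both scores are available in base R. The sketch below compares two illustrative mtcars models (an assumption) and then rebuilds the scores from the log-likelihood to show what AIC and BIC compute.

```r
# Two candidate models (illustrative only)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

# Lower is better for both criteria
AIC(small, large)
BIC(small, large)

# The same scores by hand for one model
ll <- logLik(large)
p  <- attr(ll, "df")   # estimated parameters (for lm this counts the error variance)
n  <- nobs(large)      # number of observations

aic_by_hand <- -2 * as.numeric(ll) + 2 * p
bic_by_hand <- -2 * as.numeric(ll) + log(n) * p
```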

Page 26: Chapter 18,19

The formulas for AIC and BIC are:
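In their standard form, with $\mathcal{L}$ the maximized likelihood, $p$ the number of estimated parameters and $n$ the number of observations (symbol names assumed here, since the slide's equation is not in the transcript), the two criteria can be written as:

```latex
\mathrm{AIC} = -2\ln(\mathcal{L}) + 2p
\qquad
\mathrm{BIC} = -2\ln(\mathcal{L}) + \ln(n)\,p
```

Both reward a higher likelihood and penalize extra parameters; BIC penalizes them more heavily once $\ln(n) > 2$.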

Page 31: Chapter 18,19

Cross-Validation

Page 32: Chapter 18,19

The results from cv.glm include delta, which holds two numbers: the raw cross-validation error, based on the cost function computed over all the folds, and the adjusted cross-validation error. The cost function in this case is the mean squared error, a measure of correctness for an estimator, defined as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$. The second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold is all but one data point, with that one point held out. Leave-one-out cross-validation is very accurate but highly computationally intensive.
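A short sketch of cv.glm in use, assuming a Gaussian glm on the built-in mtcars data (the model is illustrative, not the one from the slides). The default cost function in cv.glm is the average squared error.

```r
library(boot)  # recommended package shipped with standard R installations

# cv.glm needs a glm object, so a linear model is fit with glm (illustrative)
fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian(link = "identity"))

set.seed(1)  # fold assignment is random
cv5 <- cv.glm(data = mtcars, glmfit = fit, K = 5)

# delta[1] is the raw cross-validation error, delta[2] the adjusted one
cv5$delta

# Leaving K at its default gives leave-one-out cross-validation
loo <- cv.glm(data = mtcars, glmfit = fit)
loo$delta
```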

Page 33: Chapter 18,19

Although we got a nice number for the error, it helps us only if we can compare it to the error of other models.

Page 37: Chapter 18,19

Bootstrapping

The idea is that we start with n rows of data. Some statistic (whether a mean, a regression or some arbitrary function) is applied to the data. Then the data are sampled with replacement, creating a new dataset. This new set still has n rows, except that some rows are repeated and other rows are entirely missing. The statistic is applied to this new dataset. The process is repeated R times (typically around 1,200), which generates an entire distribution for the statistic. This distribution can then be used to find the mean and confidence interval (typically 95%) for the statistic.

The boot package is a very robust set of tools for making the bootstrap easy to compute.
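The procedure can be sketched by hand in base R before reaching for boot. The statistic (a median) and the mtcars data are assumptions chosen for illustration, and the percentile interval is only one of several ways to form a confidence interval.

```r
set.seed(42)                 # resampling is random
n <- nrow(mtcars)            # n rows of data
R <- 1200                    # number of bootstrap replicates

boot_stats <- replicate(R, {
  # Sample n row indices with replacement: some rows repeat, others drop out
  idx <- sample(n, size = n, replace = TRUE)
  # Apply the statistic to the resampled rows
  median(mtcars$mpg[idx])
})

# The replicates form a distribution for the statistic
mean(boot_stats)
ci <- quantile(boot_stats, probs = c(0.025, 0.975))  # 95% percentile interval
ci
```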

Page 39: Chapter 18,19

The proper way to compute the batting average is to divide total hits by total at bats. This means we cannot simply run mean(h/ab) and sd(h/ab) to get the mean and standard deviation. Rather, the batting average is calculated as sum(h)/sum(ab), and its standard deviation is not easily calculated. This problem is a great candidate for the bootstrap.

We calculate the overall batting average with the original data. Then we sample n rows with replacement and calculate the batting average again. We do this repeatedly until a distribution is formed. Rather than doing this manually, though, we use boot.

The first argument to boot is the data. The second argument is the function that is to be computed on the data. This function must take at least two arguments: the original data and a vector of indices (or frequencies or weights).

The beautiful thing about the bootstrap is its near universal applicability. It can be used in just about any situation where an analytical solution is impractical or impossible.
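A sketch of the batting-average bootstrap with boot follows. The player data here are simulated stand-ins (an assumption; the slides use real baseball data), and the helper name bat_avg is hypothetical.

```r
library(boot)  # recommended package shipped with standard R installations

set.seed(7)
# Simulated stand-in data: at bats (ab) and hits (h) for 100 players
players <- data.frame(ab = sample(50:600, size = 100, replace = TRUE))
players$h <- rbinom(100, size = players$ab, prob = 0.26)

# Statistic for boot: first argument is the data, second the resampled indices
bat_avg <- function(data, indices, hits = "h", at_bats = "ab") {
  sum(data[indices, hits], na.rm = TRUE) /
    sum(data[indices, at_bats], na.rm = TRUE)
}

# R = 1200 replicates; stype = "i" means boot passes row indices
avg_boot <- boot(data = players, statistic = bat_avg, R = 1200, stype = "i")

avg_boot$t0       # batting average on the original data
sd(avg_boot$t)    # bootstrap estimate of its standard error
boot.ci(avg_boot, conf = 0.95, type = "norm")
```

The replicate values are stored in avg_boot$t, so hist(avg_boot$t) shows their distribution.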

Page 41: Chapter 18,19

Visualizing the distribution is as simple as plotting a histogram of the replicate results.

Page 42: Chapter 18,19

18.5. Stepwise Variable Selection

A common, though increasingly discouraged, way to select variables for a model is stepwise selection. This is the process of iteratively adding and removing variables from a model and testing the model at each step, usually using AIC.

Return to the book to see all results.
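For a rough idea of what this looks like in code, the sketch below runs base R's step function between an intercept-only model and a larger one. The mtcars variables are assumptions chosen for illustration, not the models from the book.

```r
# Smallest and largest models to consider (illustrative only)
null_model <- lm(mpg ~ 1, data = mtcars)
upper_scope <- ~ wt + hp + qsec + drat

# Start from the null model and add or remove terms, judging each step by AIC
chosen <- step(null_model,
               scope = list(lower = ~ 1, upper = upper_scope),
               direction = "both",
               trace = 0)   # trace = 0 suppresses the step-by-step printout

formula(chosen)   # variables that survived the selection
AIC(chosen)
```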

Page 43: Chapter 18,19

18.6. Conclusion

Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit such as ANOVA or more modern techniques like cross-validation.

The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate. All of these methods can help select which variables are included in a model and which are excluded.

Page 44: Chapter 18,19

Chapter 19. Regularization and Shrinkage

19.1. Elastic Net

The Elastic Net is a dynamic blending of lasso and ridge regression. The lasso uses an L1 penalty to perform variable selection and dimension reduction, while the ridge uses an L2 penalty to shrink the coefficients for more stable predictions.

Page 45: Chapter 18,19

The formula for the Elastic Net is:

where λ is a complexity parameter controlling the amount of shrinkage (0 is no penalty and ∞ is complete penalty),

α regulates how much of the solution is ridge versus lasso, with α = 0 being complete ridge and α = 1 being complete lasso, and

Γ, not seen here, is a vector of penalty factors (one value per variable) that multiplies λ for fine tuning of the penalty applied to each variable.
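The objective these symbols refer to is not in the transcript. Assuming the standard Gaussian-response form of the Elastic Net, with Γ omitted as the text indicates, it can be written as:

```latex
\min_{\beta_0,\,\beta}
\left[
  \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{T}\beta \right)^2
  + \lambda P_{\alpha}(\beta)
\right],
\qquad
P_{\alpha}(\beta) = (1 - \alpha)\,\frac{1}{2}\,\lVert \beta \rVert_{2}^{2}
                  + \alpha\,\lVert \beta \rVert_{1}
```

Setting α = 0 leaves only the squared L2 term (ridge), and α = 1 leaves only the L1 term (lasso), which matches the description above.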

Page 46: Chapter 18,19

Lasso vs ridge

Page 47: Chapter 18,19

glmnet

The glmnet package fits generalized linear models with the Elastic Net. It is designed for speed and for larger, sparser data. Where functions like lm and glm take a formula to specify the model, glmnet requires a matrix of predictors (including an intercept) and a response matrix.
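One way to produce those matrices is base R's model.matrix, sketched below on the built-in mtcars data (an assumption for illustration; the slides use ACS data). The glmnet call is guarded so the block still runs when the package is not installed.

```r
# model.matrix expands a formula into a numeric predictor matrix,
# including an "(Intercept)" column and dummy columns for factors
x <- model.matrix(mpg ~ wt + hp + factor(cyl), data = mtcars)
y <- as.matrix(mtcars$mpg)   # response as a one-column matrix

dim(x)
colnames(x)

enet <- NULL
if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 0.5 blends ridge and lasso equally (value chosen for illustration)
  enet <- glmnet::glmnet(x = x, y = y, family = "gaussian", alpha = 0.5)
}
```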

Page 48: Chapter 18,19

We will look at the American Community Survey (ACS) data for New York State. We will throw every possible predictor into the model and see which are selected.

Page 49: Chapter 18,19

λ controls the amount of shrinkage. By default, glmnet fits the regularization path on 100 different values of λ. The glmnet package has a function, cv.glmnet, that computes the cross-validation automatically. By default α = 1, meaning only the lasso is calculated. Selecting the best α requires an additional layer of cross-validation.

Page 51: Chapter 18,19

Visualizing where variables enter the model along the λ path can be illuminating.

Page 52: Chapter 18,19

Finding the optimal value of α requires an additional layer of cross-validation, and unfortunately glmnet does not do that automatically. This will require us to run cv.glmnet at various levels of α, which will take a fairly large chunk of time if performed sequentially, making this a good time to use parallelization. The most straightforward way to run code in parallel is to use the parallel, doParallel and foreach packages.

First, we build some helper objects to speed along the process.

When a two-layered cross validation is run, an observation should fall in the same fold each time, so we build a vector specifying fold membership.

We also specify the sequence of α values that foreach will loop over.

It is generally considered better to lean toward the lasso rather than the ridge, so we consider only α values greater than 0.5.
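The two helper objects might be built as in this sketch. The row count n_obs is a hypothetical stand-in for the number of rows in the predictor matrix, the seed is arbitrary, and starting the α grid at exactly 0.5 is an assumption.

```r
set.seed(2834673)   # arbitrary seed so fold membership is reproducible

n_obs   <- 1000     # hypothetical stand-in for the number of rows in the data
n_folds <- 5

# Fold membership: repeat 1..5 to cover every row, then shuffle,
# so each observation keeps the same fold in every cv.glmnet run
theFolds <- sample(rep(x = 1:n_folds, length.out = n_obs))
table(theFolds)

# Sequence of alpha values for foreach to loop over, leaning toward the lasso
alphas <- seq(from = 0.5, to = 1, by = 0.05)
alphas
```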

Page 53: Chapter 18,19

Before running a parallel job, a cluster (even on a single machine) must be started and registered with makeCluster and registerDoParallel. After the job is done, the cluster should be stopped with stopCluster.

Setting .errorhandling to "remove" means that if an error occurs, that iteration will be skipped. Setting .inorder to FALSE means that the order of combining the results does not matter and they can be combined whenever returned, which yields significant speed improvements. Because we are using the default combination function, list, which takes multiple arguments at once, we can speed up the process by setting .multicombine to TRUE. We specify in .packages that glmnet should be loaded on each of the workers, again leading to performance improvements. The operator %dopar% tells foreach to work in parallel. Parallel computing can be dependent on the environment, so we explicitly load some variables into the foreach environment using .export, namely acsX, acsY, alphas and theFolds.
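Put together, the settings described above might look like the following sketch. The acsX and acsY objects here are small simulated stand-ins (an assumption, as are the two workers, the Gaussian response and the three α values), and the block is skipped when any of the packages is missing.

```r
acsResults <- NULL
needed <- c("glmnet", "foreach", "doParallel")

if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  library(foreach)

  set.seed(1)
  # Simulated stand-ins for the predictor matrix and response
  acsX <- matrix(rnorm(200 * 5), nrow = 200)
  acsY <- as.numeric(acsX[, 1] - acsX[, 2] + rnorm(200))
  theFolds <- sample(rep(1:5, length.out = 200))
  alphas <- c(0.5, 0.75, 1)

  # Start and register a cluster (two workers chosen for illustration)
  cl <- parallel::makeCluster(2)
  doParallel::registerDoParallel(cl)

  acsResults <- foreach(i = seq_along(alphas),
                        .errorhandling = "remove",   # skip failed iterations
                        .inorder = FALSE,            # combine in arrival order
                        .multicombine = TRUE,        # list() takes many args
                        .export = c("acsX", "acsY", "alphas", "theFolds"),
                        .packages = "glmnet") %dopar% {
    # One cross-validated fit per alpha, reusing the same folds each time
    cv.glmnet(x = acsX, y = acsY, nfolds = 5,
              foldid = theFolds, alpha = alphas[i])
  }

  # Always stop the cluster when the job is done
  parallel::stopCluster(cl)
}
```

Because .inorder is FALSE, the elements of acsResults are not guaranteed to follow the order of alphas. foreach may also warn that some variables are already being exported, which is harmless here.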

Page 61: Chapter 18,19

19.2. Bayesian Shrinkage

Bayesian shrinkage is useful when a model is built on data that does not have a large enough number of rows for some combinations of the variables. For this example, we blatantly steal an example
