Stat 401XV Final Exam S17 - Iowa State Universityvardeman/stat401/Stat 401XV Final Exam S17... · Stat 401XV Final Exam ... The tree below was fit using rpart ... ("1" is the "y ≥

1

Stat 401XV Final Exam Spring 2017

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________ Name Signed Date

_________________________________________________________ Name Printed

ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit. Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit. SHOW YOUR WORK/EXPLAIN YOURSELF!

2

1. (6 pts) A so-called "k out of n system" will function provided at least k of its n components function. Consider a "4 out of 5 system" with independent components that each have reliability (probability of functioning) p. I need to know how large p must be in order to have overall system reliability (probability of functioning) .99. Set up an equation you could solve in order to find this p for me.

2. Customers arrive at a service counter with inter-arrival times (times between consecutive arrivals) modeled as independent exponential random variables with mean 1 min.

a) (4 pts) Under this model, what fraction of inter-arrival times are less than .5 min?

b) (7 pts) Under this model, approximate the probability that less than 50 customers arrive in a particular 60 minute period. (Hint: This is the probability that the sum of 50 inter-arrival times is larger than 60.)
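As an illustration of the "k out of n" idea in problem 1, the sketch below computes system reliability as a binomial upper-tail probability. The function name `k_of_n_reliability` and the trial value p = 0.9 are assumptions made only for this sketch and do not come from the exam. The `uniroot()` call shows one numerical way such an equation could be solved.

```r
# Reliability of a "k out of n" system with independent components,
# each functioning with probability p: P(at least k of n function).
# (Hypothetical helper name, chosen for this sketch.)
k_of_n_reliability <- function(p, k, n) {
  1 - pbinom(k - 1, size = n, prob = p)
}

# For a 4-out-of-5 system the same quantity is 5*p^4*(1-p) + p^5.
p_trial <- 0.9  # arbitrary trial value, for illustration only
by_formula <- 5 * p_trial^4 * (1 - p_trial) + p_trial^5
all.equal(k_of_n_reliability(p_trial, 4, 5), by_formula)

# Numerically solve reliability(p) = .99 for p between 0.5 and 1.
sol <- uniroot(function(p) k_of_n_reliability(p, 4, 5) - 0.99,
               interval = c(0.5, 0.9999), tol = 1e-10)
sol$root
```

The binomial form generalizes to any k and n, while the explicit polynomial makes the equation to be solved visible for the 4-out-of-5 case.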

3

3. A student project concerned measurement of resistivity of a type of copper wire at two different temperatures. Seven pieces of this wire were used in the study, and measured resistivities at 0.0 °C and at 21.8 °C are in the following table. (Units are $10^{-8}\ \Omega\cdot\text{m}$.)

Wire                    1     2     3     4     5     6     7
21.8 °C Resistivity   1.72  1.56  1.68  1.64  1.69  1.71  1.72
 0.0 °C Resistivity   1.52  1.44  1.52  1.52  1.56  1.49  1.56

a) (6 pts) Give and interpret a 95% lower confidence bound for the mean increase in resistivity of this wire associated with an increase in temperature from 0.0 °C to 21.8 °C. (PLUG IN COMPLETELY, but there is no need to simplify. Say what the "95%" means.)

b) (5 pts) Give a two-sided interval that you are "95% sure" will bracket 99% of measured increases in resistivity of this wire associated with an increase of temperature from 0.0 °C to 21.8 °C. (PLUG IN COMPLETELY, but there is no need to simplify.)

In a second study concerning resistivity of this wire, two different meters were both used in measuring resistance at 21.8 °C for the same $n = 70$ specimens. For 50 of the 70 specimens/trials, meter A produced a higher reading than did meter B.

c) (6 pts) Give a p-value for assessing whether there is clear evidence that the fraction of specimens for which meter A produces a higher reading than meter B exceeds .5.
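Because the same seven wires are measured at both temperatures, the data in problem 3 are naturally treated as paired. The sketch below shows one way the paired differences, a one-sided t lower bound, and a large-sample test of a proportion might be computed. It assumes approximately normal differences and uses variable names invented for this illustration.

```r
# Resistivity readings from the table in problem 3 (units 10^-8 ohm-m)
r_218 <- c(1.72, 1.56, 1.68, 1.64, 1.69, 1.71, 1.72)  # at 21.8 degrees C
r_000 <- c(1.52, 1.44, 1.52, 1.52, 1.56, 1.49, 1.56)  # at 0.0 degrees C

# Same 7 wires measured twice, so work with the paired differences.
d <- r_218 - r_000
n <- length(d)

# One-sided 95% lower confidence bound for the mean difference (t, n - 1 df).
lower_bound <- mean(d) - qt(0.95, df = n - 1) * sd(d) / sqrt(n)

# Large-sample z statistic and upper-tail p-value for
# H0: fraction = .5 versus Ha: fraction > .5, with 50 of 70 "A higher" trials.
p_hat <- 50 / 70
z <- (p_hat - 0.5) / sqrt(0.5 * 0.5 / 70)
p_value <- 1 - pnorm(z)
```

An exact binomial tail, `1 - pbinom(49, 70, 0.5)`, is an alternative to the normal approximation used here.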

4

4. Beginning on Page 8 there is R analysis of a partially replicated $2^3$ factorial experiment due to R. Snee treated in Engineering Statistics by Hogg and Ledolter. It concerned the effects of factors

Factor                        Levels
A-Polymer Type                Standard (1) vs New (But Expensive) (2)
B-Polymer Concentration       .01% (1) vs .04% (2)
C-Amount of an Additive       2 lb (1) vs 12 lb (2)

on $y$ = percentage impurity produced by a chemical process. Use that in the following questions.

a) (5 pts) Give "margins of error" based on 95% two-sided confidence limits to associate with the 8 sample means in the study. (Some of these "sample means" are based on only 1 observation.)

Where combination $n = 1$: ______________     Where combination $n = 2$: ______________

b) (6 pts) Give the value of an F statistic and degrees of freedom for testing the hypothesis that all 8 experimental combinations produce the same mean purity.

F = ___________________     d.f. = _______ , _______

c) (4 pts) Based on the last 3 runs of the lm() routine with these data, what model for y in terms of the experimental variables do you judge to be best? (Name and interpret values of detectable effects and say what other effects are not detectable.)

d) (4 pts) For the first case, the predicted value produced in the final lm() run is .895. If it were printed out, what would be the corresponding value for the next-to-final run? If it is .895 say why. If it is not .895 say why not.
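The residual standard error reported by the first two lm() runs (0.1225 on 3 degrees of freedom) can be cross-checked against the three sample standard deviations in the aggregate() output, since only the replicated combinations contribute to it. The sketch below does that pooling and then forms t-based margins of error. The helper name `margin` is made up for this illustration.

```r
# Sample standard deviations for the three replicated combinations
# (Types 2, 5, and 7 in the aggregate() output), each based on n = 2.
s <- c(0.14142136, 0.14142136, 0.07071068)
df_each <- c(1, 1, 1)  # n - 1 for each replicated combination

# Pooled standard deviation across the replicated combinations.
s_pooled <- sqrt(sum(df_each * s^2) / sum(df_each))
s_pooled  # compare with the residual standard error in the lm() output

# 95% two-sided margin of error for a sample mean based on n observations,
# using the pooled standard deviation and its degrees of freedom.
margin <- function(n) qt(0.975, df = sum(df_each)) * s_pooled / sqrt(n)
margin(1)
margin(2)
```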

5

5. There is a dataset on the UCI Machine Learning Data Set Repository that provides 1-10 quality ratings by experts ($y$) for wine samples and corresponding results of 11 chemical analyses ($\mathbf{x} = (x_1, x_2, \ldots, x_{11})$). This problem concerns data analysis for 1599 red wine samples. Beginning on Page 10 there is relevant R code and output.

Consider first a SLR analysis of the variable quality using the predictor variable alcohol. Below is a scatterplot for these variables and the least squares line through the data pairs. (The plotting locations have been randomly "jittered" slightly to minimize the visual effects of over-plotting.)

[Figure: jittered scatterplot of quality versus alcohol with the least squares line]

a) (4 pts) Say what the plot suggests about the appropriateness of the Gaussian simple linear regression model (particularly the modeling of "errors" $\varepsilon_i$).

b) (4 pts) Would you be willing to use a 95% prediction interval for the expert quality rating, $y$, of a new specimen with alcohol content $x = 11$ based on these data and the Gaussian SLR model? Explain.

c) (5 pts) Is there definitive evidence that average quality rating increases with alcohol content? Provide quantitative support for your answer based on the R output.

6

Suppose that one suspends any concerns about model assumptions and adopts the usual MLR model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{11} x_{11} + \varepsilon$

for quality rating as a function of the 11 chemical analysis results.

d) (4 pts) Interpret the fitted regression coefficient for $x_2$ = grams of acetic acid per cubic decimeter.

e) (8 pts) Give the value of an F statistic and degrees of freedom for judging whether after accounting for $x_{11}$ = alcohol content, the other 10 chemical analysis results add detectably to one's ability to predict quality rating.

F = _______________     d.f. = _______ , _______

Consider now only the issue of building an effective predictor of $y$ = quality rating. (Leave behind Gaussian model assumptions.)

f) (4 pts) Below is a table of some summaries for several linear predictors fit by least squares. Which linear predictor (set of chemical analysis terms) is most attractive and why?

Chemical Analysis Terms    $R^2$    $\sqrt{MSE}$   CV-RMSPE
1 through 11               .3606    .6480          .6504
2,3,5,6,7,9,10,11          .3599    .6477          .6491
2,5,6,7,9,10,11            .3595    .6477          .6489
2,5,7,9,10,11              .3572    .6487          .6495
2,5,7,10,11                .3515    .6514          .6519
2,7,10,11                  .3438    .6550          .6551
2,10,11                    .3359    .6587          .6587
2,11                       .3170    .6678          .6674
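When a full model and a nested reduced model are fit to the same responses, a partial F statistic can be written in terms of the two $R^2$ values. The sketch below applies that identity to the rounded $R^2$ values printed in the two lm() summaries for the wines data (alcohol only, and all 11 predictors), so the result is only approximate. The variable names are illustrative.

```r
# R-squared values as printed (rounded) in the R output for the wines data
r2_full    <- 0.3606  # quality ~ all 11 chemical analysis variables
r2_reduced <- 0.2267  # quality ~ alcohol only
n <- 1599             # number of red wine samples
p_full <- 11
p_reduced <- 1

# Partial F statistic for the 10 predictors added to alcohol:
# F = [(R2_full - R2_reduced) / q] / [(1 - R2_full) / (n - p_full - 1)]
df_num <- p_full - p_reduced
df_den <- n - p_full - 1
F_stat <- ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)

# Upper-tail p-value from the F distribution
p_value <- pf(F_stat, df_num, df_den, lower.tail = FALSE)
```

With the fitted model objects available, `anova(reduced_fit, full_fit)` would give the same comparison without rounding error (the object names here are hypothetical).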

7

g) (4 pts) Searching for an elastic net predictor for $y$ = quality rating based on the 11 predictors, the best CV-RMSPE available seems to be about .6502 for $\alpha \approx .0011$ and $\lambda \approx .03$. The predictions it produces are not much different from ordinary MLR. Why is this not surprising given the elastic net parameters and what you know about the MLR model from part f)?

h) (4 pts) There is code and output from train() in caret for k-nearest-neighbor and random forest predictors for $y$ = quality rating based on the 11 predictors. What value of "k" is best for the former and what value of "mtry" is best for the latter? How do these predictors compare to each other and to MLR predictors in terms of performance? (Give numerical support for your latter answer.)

i) (4 pts) The printout presents a scatterplot matrix and correlations between y and MLR, kNN, and random forest predictions. It seems impossible to improve much on the best of these predictors using a linear combination of them. Based on the information available to you, give rationale for this happening.

j) (6 pts) Rather than predict y, one could instead use a classification tree to identify chemical analysis vectors $(x_1, x_2, \ldots, x_{11})$ that produce $y \geq 7$. The tree below was fit using rpart (and cp = .022). What is the misclassification rate for this tree on the training set? Describe in simple terms what chemical analysis results it associates with a quality score of 7 or more.

[Figure: rpart classification tree]

("1" is the "$y \geq 7$" class and "to the left" is "the condition holds" circumstance.)

8

R Code and Output for Chemical Process Analyses

> Type<-c(1,2,2,3,4,5,5,6,7,7,8)
> PolyType<-c(1,2,2,1,2,1,1,2,1,1,2)
> PolyConc<-c(1,1,1,2,2,1,1,1,2,2,2)
> AddAmount<-c(rep(1,5),rep(2,6))
> Impurity<-c(1,1,1.2,.2,.5,.9,.7,1.1,.2,.3,.5)
> cbind(Type,PolyType,PolyConc,AddAmount,Impurity)
      Type PolyType PolyConc AddAmount Impurity
 [1,]    1        1        1         1      1.0
 [2,]    2        2        1         1      1.0
 [3,]    2        2        1         1      1.2
 [4,]    3        1        2         1      0.2
 [5,]    4        2        2         1      0.5
 [6,]    5        1        1         2      0.9
 [7,]    5        1        1         2      0.7
 [8,]    6        2        1         2      1.1
 [9,]    7        1        2         2      0.2
[10,]    7        1        2         2      0.3
[11,]    8        2        2         2      0.5
> aggregate(Impurity,by=list(Type),FUN="mean")
  Group.1    x
1       1 1.00
2       2 1.10
3       3 0.20
4       4 0.50
5       5 0.80
6       6 1.10
7       7 0.25
8       8 0.50
> aggregate(Impurity,by=list(Type),FUN="sd")
  Group.1          x
1       1         NA
2       2 0.14142136
3       3         NA
4       4         NA
5       5 0.14142136
6       6         NA
7       7 0.07071068
8       8         NA
> Type<-as.factor(Type)
> PolyType<-as.factor(PolyType)
> PolyConc<-as.factor(PolyConc)
> AddAmount<-as.factor(AddAmount)
>
> Snee<-data.frame(Type,PolyType,PolyConc,AddAmount,Impurity)
>
> summary(Snee)
      Type   PolyType PolyConc AddAmount    Impurity
 2      :2   1:6      1:6      1:5       Min.   :0.2000
 5      :2   2:5      2:5      2:6       1st Qu.:0.4000
 7      :2                               Median :0.7000
 1      :1                               Mean   :0.6909
 3      :1                               3rd Qu.:1.0000
 4      :1                               Max.   :1.2000
 (Other):2
>
> options(contrasts = rep("contr.sum", 2))
> Snee.out1<-lm(Impurity~Type,data=Snee)
> summary(Snee.out1)

9

Call:
lm(formula = Impurity ~ Type, data = Snee)

Residuals:
         1          2          3          4          5          6          7
-3.643e-17 -1.000e-01  1.000e-01  2.343e-19 -2.058e-17  1.000e-01 -1.000e-01
         8          9         10         11
-4.140e-17 -5.000e-02  5.000e-02 -8.439e-18

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.68125    0.03903  17.454  0.00041 ***
Type1        0.31875    0.11302   2.820  0.06672 .
Type2        0.41875    0.08455   4.953  0.01580 *
Type3       -0.48125    0.11302  -4.258  0.02375 *
Type4       -0.18125    0.11302  -1.604  0.20711
Type5        0.11875    0.08455   1.405  0.25479
Type6        0.41875    0.11302   3.705  0.03416 *
Type7       -0.43125    0.08455  -5.101  0.01457 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1225 on 3 degrees of freedom
Multiple R-squared: 0.9671, Adjusted R-squared: 0.8904
F-statistic: 12.61 on 7 and 3 DF, p-value: 0.0308

> Snee.out2<-lm(Impurity~PolyType*PolyConc*AddAmount,data=Snee)
> summary(Snee.out2)

Call:
lm(formula = Impurity ~ PolyType * PolyConc * AddAmount, data = Snee)

Residuals:
         1          2          3          4          5          6          7
 2.079e-17 -1.000e-01  1.000e-01 -4.723e-17  8.286e-18  1.000e-01 -1.000e-01
         8          9         10         11
 1.522e-17 -5.000e-02  5.000e-02  2.216e-17

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                     0.68125    0.03903  17.454  0.00041 ***
PolyType1                      -0.11875    0.03903  -3.042  0.05576 .
PolyConc1                       0.31875    0.03903   8.167  0.00384 **
AddAmount1                      0.01875    0.03903   0.480  0.66381
PolyType1:PolyConc1             0.01875    0.03903   0.480  0.66381
PolyType1:AddAmount1            0.01875    0.03903   0.480  0.66381
PolyConc1:AddAmount1            0.03125    0.03903   0.801  0.48188
PolyType1:PolyConc1:AddAmount1  0.03125    0.03903   0.801  0.48188
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1225 on 3 degrees of freedom
Multiple R-squared: 0.9671, Adjusted R-squared: 0.8904
F-statistic: 12.61 on 7 and 3 DF, p-value: 0.0308

> Snee.out3<-lm(Impurity~PolyType+PolyConc,data=Snee)
> summary(Snee.out3)

Call:
lm(formula = Impurity ~ PolyType + PolyConc, data = Snee)

Residuals:
     Min       1Q   Median       3Q      Max
-0.15926 -0.04074  0.01111  0.05000  0.14074

10

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.67407    0.02928  23.021 1.35e-08 ***
PolyType1   -0.12407    0.02928  -4.237  0.00285 **
PolyConc1    0.30926    0.02928  10.562 5.64e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09623 on 8 degrees of freedom
Multiple R-squared: 0.9459, Adjusted R-squared: 0.9324
F-statistic: 69.93 on 2 and 8 DF, p-value: 8.569e-06

> predict(Snee.out3)
        1         2         3         4         5         6         7         8
0.8592593 1.1074074 1.1074074 0.2407407 0.4888889 0.8592593 0.8592593 1.1074074
        9        10        11
0.2407407 0.2407407 0.4888889

R Code and Output for the Wines Data

> wines<-read.clipboard(header=TRUE,sep=",")
> wines$quality<-as.numeric(wines$quality)
> Good<-rep(0,1599)
> for (i in 1:1599) if (wines$quality[i] >6) Good[i]<-1
> GoodF<-as.factor(Good)
> Wines<-data.frame(wines,GoodF)
> summary(Wines)
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar     chlorides
 Min.   : 4.60   Min.   :0.1200    Min.   :0.000   Min.   : 0.900   Min.   :0.01200
 1st Qu.: 7.10   1st Qu.:0.3900    1st Qu.:0.090   1st Qu.: 1.900   1st Qu.:0.07000
 Median : 7.90   Median :0.5200    Median :0.260   Median : 2.200   Median :0.07900
 Mean   : 8.32   Mean   :0.5278    Mean   :0.271   Mean   : 2.539   Mean   :0.08747
 3rd Qu.: 9.20   3rd Qu.:0.6400    3rd Qu.:0.420   3rd Qu.: 2.600   3rd Qu.:0.09000
 Max.   :15.90   Max.   :1.5800    Max.   :1.000   Max.   :15.500   Max.   :0.61100
 free.sulfur.dioxide total.sulfur.dioxide    density           pH           sulphates
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
 1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
 Median :14.00       Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
 Mean   :15.87       Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
 3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
 Max.   :72.00       Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
    alcohol         quality      GoodF
 Min.   : 8.40   Min.   :3.000   0:1382
 1st Qu.: 9.50   1st Qu.:5.000   1: 217
 Median :10.20   Median :6.000
 Mean   :10.42   Mean   :5.636
 3rd Qu.:11.10   3rd Qu.:6.000
 Max.   :14.90   Max.   :8.000
> Wines2<-Wines[,1:12]
> summary(lm(quality~alcohol,data=Wines))

Call:
lm(formula = quality ~ alcohol, data = Wines)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8442 -0.4112 -0.1690  0.5166  2.5888

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.87497    0.17471   10.73   <2e-16 ***
alcohol      0.36084    0.01668   21.64   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7104 on 1597 degrees of freedom
Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16

11

> summary(lm(quality~.,data=Wines2))

Call:
lm(formula = quality ~ ., data = Wines2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.68911 -0.36652 -0.04699  0.45202  2.02498

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           2.197e+01  2.119e+01   1.036   0.3002
fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357
volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150
residual.sugar        1.633e-02  1.500e-02   1.089   0.2765
chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *
total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
density              -1.788e+01  2.163e+01  -0.827   0.4086
pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *
sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.648 on 1587 degrees of freedom
Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16

> lmTune<-train(y=Wines[,12],
+ x=Wines[,1:11],
+ method="lm",
+ preProcess = c("center","scale"),
+ trControl=trainControl(method="repeatedcv",repeats=100,number=10))
>
> lmTune
Linear Regression

1599 samples
  11 predictor

Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (10 fold, repeated 100 times)
Summary of sample sizes: 1440, 1439, 1439, 1438, 1438, 1441, ...
Resampling results:

  RMSE       Rsquared
  0.6505441  0.3539098

Tuning parameter 'intercept' was held constant at a value of TRUE

> lmPred<-predict(lmTune)
>
>
> knnTune<-train(y=Wines[,12],
+ x=Wines[,1:11],
+ method="knn",
+ preProcess = c("center","scale"),
+ tuneGrid=data.frame(.k=1:25),
+ trControl=trainControl(method="repeatedcv",repeats=100,number=10))
>
> knnTune
k-Nearest Neighbors

1599 samples
  11 predictor

Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (10 fold, repeated 100 times)

12

Summary of sample sizes: 1439, 1440, 1439, 1439, 1438, 1438, ...
Resampling results across tuning parameters:

  k   RMSE       Rsquared
   1  0.7456849  0.3215345
   2  0.6993469  0.3266718
   3  0.6850274  0.3247421
  . . .
  17  0.6568158  0.3439370
  18  0.6564992  0.3445679
  19  0.6566953  0.3441629
  20  0.6568225  0.3440396
  21  0.6568162  0.3442339
  22  0.6568163  0.3444893
  23  0.6567660  0.3448824
  24  0.6568525  0.3448659
  25  0.6569285  0.3448316

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 18.

> kNNPred<-predict(knnTune)
>
>
> ForestTune<-train(y=Wines[,12],
+ x=Wines[,1:11],
+ tuneGrid=data.frame(mtry=1:11),
+ method="rf",ntree=1000,
+ trControl=trainControl(method="oob"))
> ForestTune
Random Forest

1599 samples
  11 predictor

No pre-processing
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared
   1    0.5737527  0.4949185
   2    0.5639598  0.5120130
   3    0.5624150  0.5146828
   4    0.5610697  0.5170017
   5    0.5620331  0.5153416
   6    0.5612893  0.5166236
   7    0.5637885  0.5123095
   8    0.5623914  0.5147235
   9    0.5641200  0.5117357
  10    0.5645734  0.5109505
  11    0.5635874  0.5126574

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 4.

> rfPred<-predict(ForestTune)
>
> round(cor(cbind(quality,lmPred,kNNPred,rfPred)),2)
        quality lmPred kNNPred rfPred
quality    1.00   0.60    0.64   0.72
lmPred     0.60   1.00    0.87   0.84
kNNPred    0.64   0.87    1.00   0.88
rfPred     0.72   0.84    0.88   1.00
>
> pairs(cbind(quality,lmPred,kNNPred,rfPred))

[Figure: scatterplot matrix of quality, lmPred, kNNPred, and rfPred]

13
