Solutions to obligatorisk oppgave 1, STK2100
Vinnie Ko
May 14, 2018
Disclaimer: This document is made solely for my own personal use and can contain many errors.
Oppgave 1
(a)
$c_i$ can be seen as a categorical variable that indicates the $j$-th category, where $1 \leq j \leq K$. $x_{i,j}$ can then be seen as a dummy variable, which is defined as
$$x_{i,j} = \mathbb{1}_j(c_i) := \begin{cases} 1 & \text{if } c_i = j \\ 0 & \text{if } c_i \neq j. \end{cases}$$
This way of “dummy coding” results in the following mapping:
                    x_{i,1}   x_{i,2}   ···   x_{i,K-1}   x_{i,K}
When c_i = 1           1         0      ···       0          0
When c_i = 2           0         1      ···       0          0
  ⋮                    ⋮         ⋮                ⋮          ⋮
When c_i = K - 1       0         0      ···       1          0
When c_i = K           0         0      ···       0          1
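This mapping is exactly one-hot encoding. As a quick sketch (Python/NumPy, my own illustration — the assignment itself uses R), the dummy matrix can be built as:

```python
import numpy as np

def one_hot(c, K):
    """Build the n x K dummy matrix with x[i, j] = 1 iff c[i] == j + 1 (categories 1..K)."""
    c = np.asarray(c)
    X = np.zeros((len(c), K), dtype=int)
    X[np.arange(len(c)), c - 1] = 1
    return X

# Example with n = 4 observations and K = 3 categories:
X = one_hot([1, 3, 2, 3], K=3)
# Each row contains exactly one 1, in the column of that observation's category.
```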
The given models
$$Y_i = \beta_0 + \beta_2 x_{i,2} + \cdots + \beta_K x_{i,K} + \varepsilon_i \tag{1}$$
and
$$Y_i = \alpha_1 x_{i,1} + \cdots + \alpha_K x_{i,K} + \varepsilon_i \tag{2}$$
correspond to 2 different ways of using this “dummy coding”.
For each model, we can see that:
                    Under equation (1):        Under equation (2):
                    Y_i                        Y_i
When c_i = 1        β_0 + ε_i                  α_1 + ε_i
When c_i = 2        β_0 + β_2 + ε_i            α_2 + ε_i
  ⋮                   ⋮                          ⋮
When c_i = K - 1    β_0 + β_{K-1} + ε_i        α_{K-1} + ε_i
When c_i = K        β_0 + β_K + ε_i            α_K + ε_i
So, if $\alpha_j = \beta_0 + \beta_j$ with $\beta_1 = 0$, for $j \in \{1, \dots, K\}$, then model (1) and model (2) are the same.
Interpretation: $\alpha_j$ = average value of $\{Y_i \mid c_i = j\}$ in the data; $\beta_j$ = difference between the average of $\{Y_i \mid c_i = 1\}$ (the reference group) and the average of $\{Y_i \mid c_i = j\}$.
(b)
The design matrix for model (2):
$$X = \begin{pmatrix}
x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i,1} & \cdots & x_{i,j} & \cdots & x_{i,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n,1} & \cdots & x_{n,j} & \cdots & x_{n,K}
\end{pmatrix}$$

$$X^T X = \begin{pmatrix}
x_{1,1} & \cdots & x_{l,1} & \cdots & x_{n,1} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{1,j} & \cdots & x_{l,j} & \cdots & x_{n,j} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{1,K} & \cdots & x_{l,K} & \cdots & x_{n,K}
\end{pmatrix} \cdot \begin{pmatrix}
x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{l,1} & \cdots & x_{l,j} & \cdots & x_{l,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n,1} & \cdots & x_{n,j} & \cdots & x_{n,K}
\end{pmatrix}
= \begin{pmatrix}
\sum_{l=1}^{n} x_{l,1} x_{l,1} & \cdots & \sum_{l=1}^{n} x_{l,1} x_{l,j} & \cdots & \sum_{l=1}^{n} x_{l,1} x_{l,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sum_{l=1}^{n} x_{l,j} x_{l,1} & \cdots & \sum_{l=1}^{n} x_{l,j} x_{l,j} & \cdots & \sum_{l=1}^{n} x_{l,j} x_{l,K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sum_{l=1}^{n} x_{l,K} x_{l,1} & \cdots & \sum_{l=1}^{n} x_{l,K} x_{l,j} & \cdots & \sum_{l=1}^{n} x_{l,K} x_{l,K}
\end{pmatrix}$$
We can see that
$$(X^T X)_{i,j} = \sum_{l=1}^{n} x_{l,i} x_{l,j} = \begin{cases} \sum_{l=1}^{n} \mathbb{1}_j(c_l) & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
So, $X^T X$ is a diagonal matrix with diagonal elements $(X^T X)_{j,j} = \sum_{l=1}^{n} \mathbb{1}_j(c_l)$.
$$X^T y = \begin{pmatrix}
x_{1,1} & \cdots & x_{l,1} & \cdots & x_{n,1} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{1,j} & \cdots & x_{l,j} & \cdots & x_{n,j} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{1,K} & \cdots & x_{l,K} & \cdots & x_{n,K}
\end{pmatrix} \cdot \begin{pmatrix} y_1 \\ \vdots \\ y_l \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} \sum_{l=1}^{n} x_{l,1} y_l \\ \vdots \\ \sum_{l=1}^{n} x_{l,j} y_l \\ \vdots \\ \sum_{l=1}^{n} x_{l,K} y_l \end{pmatrix}
= \begin{pmatrix} \sum_{l=1}^{n} \mathbb{1}_1(c_l) y_l \\ \vdots \\ \sum_{l=1}^{n} \mathbb{1}_j(c_l) y_l \\ \vdots \\ \sum_{l=1}^{n} \mathbb{1}_K(c_l) y_l \end{pmatrix}
= \begin{pmatrix} \sum_{l: c_l = 1} y_l \\ \vdots \\ \sum_{l: c_l = j} y_l \\ \vdots \\ \sum_{l: c_l = K} y_l \end{pmatrix}$$
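These two identities — $X^T X$ diagonal with the category counts, and $X^T y$ containing the within-category sums — are easy to verify numerically. A small sketch (Python/NumPy, my own illustration):

```python
import numpy as np

c = np.array([1, 1, 2, 3, 3, 3])        # categories of the n = 6 observations
y = np.array([2., 4., 5., 1., 2., 3.])
K = 3

# Dummy/design matrix for model (2)
X = np.zeros((len(c), K))
X[np.arange(len(c)), c - 1] = 1

XtX = X.T @ X   # should be diag(n_1, n_2, n_3) = diag(2, 1, 3)
Xty = X.T @ y   # should be the per-category sums: (6, 5, 6)
```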
Now, we derive the least squares estimator of $\alpha$:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{K} x_{i,j} \alpha_j \Big)^2 = \left\| y - X\alpha \right\|^2 = (y - X\alpha)^T (y - X\alpha).$$
This leads us to:
$$\hat{\alpha} = \operatorname*{arg\,min}_{\alpha} \; (y - X\alpha)^T (y - X\alpha).$$
Differentiate:
$$\frac{\partial \mathrm{RSS}}{\partial \alpha} = \frac{\partial \left( y^T y - y^T X \alpha - \alpha^T X^T y + \alpha^T X^T X \alpha \right)}{\partial \alpha} = 0 - X^T y - X^T y + \left( X^T X + (X^T X)^T \right) \alpha = -2 X^T y + 2 X^T X \alpha.$$
This first derivative should equal $0$. So,
$$-2 X^T y + 2 X^T X \hat{\alpha} = 0 \;\Longrightarrow\; X^T X \hat{\alpha} = X^T y \;\Longrightarrow\; \hat{\alpha} = (X^T X)^{-1} X^T y.$$
Therefore, the least squares estimate for $\alpha$ is:
$$\hat{\alpha} = (X^T X)^{-1} X^T y = \begin{pmatrix}
\dfrac{\sum_{l: c_l = 1} y_l}{\sum_{l=1}^{n} \mathbb{1}_1(c_l)} \\
\vdots \\
\dfrac{\sum_{l: c_l = j} y_l}{\sum_{l=1}^{n} \mathbb{1}_j(c_l)} \\
\vdots \\
\dfrac{\sum_{l: c_l = K} y_l}{\sum_{l=1}^{n} \mathbb{1}_K(c_l)}
\end{pmatrix}.$$
We can easily see that $\hat{\alpha}_j$ is the sample mean of $\{y_i \mid c_i = j\}$.
$X^T X$ is a diagonal matrix. So, $X^T X$ is invertible if and only if all diagonal entries are non-zero. This means that we can obtain $\hat{\alpha}$ only if
$$\forall j \in \{1, \dots, K\}: \; \sum_{l=1}^{n} \mathbb{1}_j(c_l) \neq 0.$$
In other words, to compute $\hat{\alpha}$, we need at least one observation for every possible categorical value of $c_i$.
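To illustrate that $\hat{\alpha}_j$ is simply the mean of the observations in category $j$, here is a small numerical sketch (Python/NumPy, my own illustration):

```python
import numpy as np

c = np.array([1, 1, 2, 3, 3])
y = np.array([1., 3., 5., 10., 14.])
K = 3

X = np.zeros((len(c), K))
X[np.arange(len(c)), c - 1] = 1

# Least squares estimate: alpha-hat = (X^T X)^{-1} X^T y
alpha_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Group means computed directly
group_means = np.array([y[c == j].mean() for j in range(1, K + 1)])
# alpha_hat and group_means agree: (2, 5, 12)
```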
(c)
$$\beta_j = \begin{cases} \alpha_1 & \text{if } j = 0 \\ \alpha_j - \alpha_1 & \text{if } j \in \{1, \dots, K\} \end{cases} \tag{3}$$
$\hat{\alpha}$ is the least squares estimate of $\alpha$. And we know that there is a one-to-one mapping between $\alpha$ and $\beta$, given by (3). Therefore, $\hat{\beta}$ obtained through this one-to-one mapping is also a least squares estimate of $\beta$.
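The one-to-one mapping can also be checked numerically: fitting the reference-category parametrization (1) and the cell-means parametrization (2) on the same data gives $\hat{\beta}_0 = \hat{\alpha}_1$ and $\hat{\beta}_j = \hat{\alpha}_j - \hat{\alpha}_1$. A sketch (Python/NumPy, my own illustration):

```python
import numpy as np

c = np.array([1, 1, 2, 3, 3])
y = np.array([1., 3., 5., 10., 14.])
K = 3

D = np.zeros((len(c), K))
D[np.arange(len(c)), c - 1] = 1

# Model (2): y = alpha_1 x_1 + ... + alpha_K x_K
alpha_hat = np.linalg.lstsq(D, y, rcond=None)[0]

# Model (1): y = beta_0 + beta_2 x_2 + ... + beta_K x_K (category 1 is the reference)
X1 = np.column_stack([np.ones(len(c)), D[:, 1:]])
beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]

# beta_0 == alpha_1, and beta_j == alpha_j - alpha_1 for j >= 2
```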
(d)
The given alternative model:
$$Y_i = \gamma_0 + \gamma_1 x_{i,1} + \cdots + \gamma_K x_{i,K} + \varepsilon_i \tag{4}$$
This way of “dummy coding” results in the following mapping:

                 Under equation (1):      Under equation (2):   Under equation (4):
                 Y_i                      Y_i                   Y_i
When c_i = 1     β_0 + ε_i                α_1 + ε_i             γ_0 + γ_1 + ε_i
When c_i = 2     β_0 + β_2 + ε_i          α_2 + ε_i             γ_0 + γ_2 + ε_i
  ⋮                ⋮                        ⋮                     ⋮
When c_i = K     β_0 + β_K + ε_i          α_K + ε_i             γ_0 + γ_K + ε_i
This test is called the ‘model utility test’, and the corresponding hypotheses are:
- Model (1): $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ vs. $H_1$: at least one $\beta_j \neq 0$.
- Model (2): $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ vs. $H_1$: at least one $\alpha_j \neq 0$.
- Model (4): $H_0: \gamma_1 = \gamma_2 = \gamma_3 = 0$ vs. $H_1$: at least one $\gamma_j \neq 0$.
The F-test of all 3 models tells us to reject the null hypothesis. So, there is some form of difference between the 4 iron content categories in terms of their effect on the response variable.

If we want to check whether there is a difference between 2 specific iron content categories, we can use model (1) for hypothesis testing. This is because $\beta_j$ can be interpreted as the mean difference between the reference category and the category being compared.

Note that the coefficients of model (1) only show the mean difference between type 1 and the other types. This is because type 1 is the reference category in the model. If we, for example, wish to compare type 2 and type 3 iron contents, we can refit model (1) with type 2 or type 3 as the reference category.
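For reference, the model utility F-statistic compares the fitted model against the intercept-only model: $F = \frac{(\mathrm{TSS} - \mathrm{RSS})/(p-1)}{\mathrm{RSS}/(n-p)}$, where $p$ is the number of fitted parameters. A small simulation sketch (Python/NumPy, my own illustration on simulated data, not the actual assignment data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: K = 4 groups with clearly different means
K, n_per = 4, 10
c = np.repeat(np.arange(1, K + 1), n_per)
y = np.array([0., 2., 4., 6.])[c - 1] + rng.normal(0, 1, K * n_per)

n = len(y)
X = np.zeros((n, K))
X[np.arange(n), c - 1] = 1

# RSS of the group-means model and TSS of the intercept-only model
alpha_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = ((y - X @ alpha_hat) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

# Model utility F statistic with (K - 1) and (n - K) degrees of freedom
F = ((tss - rss) / (K - 1)) / (rss / (n - K))
# With clearly separated group means, F lands far above the 5% critical value (about 2.9 here).
```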
(h)
From the t-test of the coefficients of model (1), we can see that there is no significant difference between type 1 and type 2 iron contents. So, we can consider merging them into one type. This will result in a simpler model with only 3 categories.
F-statistic: 15.42 on 9 and 22 DF, p-value: 1.424e-07
It’s sensible to remove the variable with the highest p-value since, according to the Wald test, this variable has the lowest chance of having a significant (linear) relationship with logcost.
Compared to the model with all explanatory variables, the model without t1 has somewhat changed p-values. This is logical because the same variables appear in two different contexts in the two models. In general, one would expect that the explanatory variables that had higher correlation with t1 will have bigger changes in their p-values (usually decreased p-values). The reason is that these variables now take over the role of t1.
(d)
# This function fits lm() with the given data.
# It uses y.name as the response variable and all other variables as explanatory variables.
# It removes the variable with the highest p-value (according to the Wald test) from the model.
# It repeats this process until all variables in the model have significant p-values.
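The R function itself is not reproduced here, but the same idea can be sketched in Python with NumPy only (my own illustration; it uses a fixed |t| cutoff of 2 — roughly a 5% level — instead of exact Wald p-values, to avoid extra dependencies):

```python
import numpy as np

def backward_eliminate(X, y, names, t_cutoff=2.0):
    """Repeatedly drop the predictor whose t-statistic is smallest in absolute
    value, until every remaining predictor has |t| >= t_cutoff.
    X: n x p matrix without intercept; the intercept is always kept."""
    keep = list(range(X.shape[1]))
    n = len(y)
    while keep:
        Xd = np.column_stack([np.ones(n), X[:, keep]])
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
        resid = y - Xd @ beta
        s2 = resid @ resid / (n - Xd.shape[1])       # residual variance estimate
        cov = s2 * np.linalg.inv(Xd.T @ Xd)          # Cov(beta-hat)
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])     # t-statistics of the predictors
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_cutoff:
            break
        keep.pop(worst)                              # drop the least significant predictor
    return [names[i] for i in keep]

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)                  # truly related to y
x2 = rng.normal(size=n)                  # pure noise
y = 3.0 * x1 + rng.normal(size=n)
selected = backward_eliminate(np.column_stack([x1, x2]), y, ["x1", "x2"])
# With this strong signal, "x1" survives and "x2" is (almost surely) dropped.
```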
Figure 3: Diagnostic plot of the final model that contains only significant variables.
According to the $F$-test, at least one of the $\beta_j \neq 0$. The diagnostic plots of the model suggest that there is a slight non-linearity in the upper quantile of $y$. However, this non-linearity is caused by 2 data points. We can say more about the non-linearity when we have more data.
We have $\mathrm{MSE} = 0.0264$, but this doesn't tell much about how good the model is. Instead, we compute
$$R^2 = \frac{\operatorname{Var}(\hat{y})}{\operatorname{Var}(y)} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{n \cdot \mathrm{MSE}}{(n-1) \cdot \operatorname{Var}(y)}.$$
> # R-squared
> # Method 1
> 1 - n*MSE/((n-1)*var(Nuclear[,"logcost"]))
[1] 0.8095693
> # Method 2
> var(y.hat)/var(Nuclear[,"logcost"])
[1] 0.8095693
> # Method 3
> summary(best.model.Wald)$r.squared
[1] 0.8095693
So, the model explains about 80.96% of the variance in $y$, which is not bad. However, we used the same training data to measure the prediction performance of the model. This often gives too optimistic a result (i.e., an underestimated MSE) since we are using data that we have seen before. To prevent this problem, one can use a test set, which is not used during the model fitting process, to measure the prediction performance.
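A small simulation sketch (Python/NumPy, my own illustration) of this optimism: with many predictors and few observations, the training MSE is clearly smaller than the MSE on fresh test data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 25, 200, 20

beta = rng.normal(size=p)
X_tr = rng.normal(size=(n_train, p))
y_tr = X_tr @ beta + rng.normal(size=n_train)
X_te = rng.normal(size=(n_test, p))
y_te = X_te @ beta + rng.normal(size=n_test)

# Fit OLS on the training data only
b_hat = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

mse_train = np.mean((y_tr - X_tr @ b_hat) ** 2)  # optimistic (data seen during fitting)
mse_test = np.mean((y_te - X_te @ b_hat) ** 2)   # honest estimate on unseen data
# mse_train underestimates the prediction error; mse_test is larger.
```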
(f)
> best.model.AIC = stepAIC(full.model, direction = "backward")
Start: AIC=-105.01
logcost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n +
pt
Df Sum of Sq RSS AIC
- t1 1 0.00160 0.60603 -106.930
- bw 1 0.00345 0.60788 -106.832
<none> 0.60443 -105.014
- t2 1 0.04284 0.64727 -104.823
- pr 1 0.04826 0.65269 -104.556
- cum.n 1 0.06792 0.67235 -103.607
- ct 1 0.07781 0.68224 -103.140
- pt 1 0.08337 0.68781 -102.879
- date 1 0.19899 0.80343 -97.907
- ne 1 0.30859 0.91302 -93.815
- cap 1 0.68497 1.28940 -82.770
Step: AIC=-106.93
logcost ~ date + t2 + cap + pr + ne + ct + bw + cum.n + pt
Df Sum of Sq RSS AIC
- bw 1 0.00213 0.60816 -108.818
<none> 0.60603 -106.930
- t2 1 0.04135 0.64738 -106.818
- pr 1 0.04680 0.65283 -106.550
- cum.n 1 0.07045 0.67648 -105.411
- ct 1 0.07654 0.68257 -105.124
- pt 1 0.08216 0.68819 -104.862
- ne 1 0.31255 0.91858 -95.621
- date 1 0.54190 1.14793 -88.489
- cap 1 0.68916 1.29518 -84.627
Step: AIC=-108.82
logcost ~ date + t2 + cap + pr + ne + ct + cum.n + pt
Df Sum of Sq RSS AIC
<none> 0.60816 -108.818
- pr 1 0.05738 0.66554 -107.932
- t2 1 0.06379 0.67195 -107.626
- cum.n 1 0.06839 0.67656 -107.407
- ct 1 0.07440 0.68257 -107.124
- pt 1 0.08066 0.68882 -106.832
- ne 1 0.31375 0.92192 -97.505
- date 1 0.54592 1.15408 -90.318
- cap 1 0.68739 1.29556 -86.617
> best.model.BIC = stepAIC(full.model, direction = "backward", k = log(n))
Start: AIC=-88.89
logcost ~ date + t1 + t2 + cap + pr + ne + ct + bw + cum.n +
pt
Df Sum of Sq RSS AIC
- t1 1 0.00160 0.60603 -92.273
- bw 1 0.00345 0.60788 -92.175
- t2 1 0.04284 0.64727 -90.166
- pr 1 0.04826 0.65269 -89.899
- cum.n 1 0.06792 0.67235 -88.949
<none> 0.60443 -88.891
- ct 1 0.07781 0.68224 -88.482
- pt 1 0.08337 0.68781 -88.222
- date 1 0.19899 0.80343 -83.250
- ne 1 0.30859 0.91302 -79.158
- cap 1 0.68497 1.28940 -68.113
Step: AIC=-92.27
logcost ~ date + t2 + cap + pr + ne + ct + bw + cum.n + pt
Df Sum of Sq RSS AIC
- bw 1 0.00213 0.60816 -95.626
- t2 1 0.04135 0.64738 -93.626
- pr 1 0.04680 0.65283 -93.358
<none> 0.60603 -92.273
- cum.n 1 0.07045 0.67648 -92.219
- ct 1 0.07654 0.68257 -91.933
- pt 1 0.08216 0.68819 -91.670
- ne 1 0.31255 0.91858 -82.430
- date 1 0.54190 1.14793 -75.297
- cap 1 0.68916 1.29518 -71.435
Step: AIC=-95.63
logcost ~ date + t2 + cap + pr + ne + ct + cum.n + pt
Df Sum of Sq RSS AIC
- pr 1 0.05738 0.66554 -96.207
- t2 1 0.06379 0.67195 -95.900
- cum.n 1 0.06839 0.67656 -95.681
<none> 0.60816 -95.626
- ct 1 0.07440 0.68257 -95.398
- pt 1 0.08066 0.68882 -95.106
- ne 1 0.31375 0.92192 -85.779
- date 1 0.54592 1.15408 -78.592
- cap 1 0.68739 1.29556 -74.892
Step: AIC=-96.21
logcost ~ date + t2 + cap + ne + ct + cum.n + pt
Df Sum of Sq RSS AIC
- t2 1 0.02447 0.69001 -98.517
- cum.n 1 0.05351 0.71905 -97.198
<none> 0.66554 -96.207
- ct 1 0.10237 0.76791 -95.094
- pt 1 0.12015 0.78570 -94.361
- ne 1 0.28784 0.95339 -88.171
- date 1 0.49109 1.15664 -81.987
- cap 1 0.68019 1.34573 -77.141
Step: AIC=-98.52
logcost ~ date + cap + ne + ct + cum.n + pt
Df Sum of Sq RSS AIC
- cum.n 1 0.06006 0.75007 -99.312
<none> 0.69001 -98.517
- pt 1 0.11719 0.80720 -96.963
- ct 1 0.12931 0.81932 -96.486
- ne 1 0.27215 0.96216 -91.343
- date 1 0.46672 1.15673 -85.450
- cap 1 0.89456 1.58457 -75.379
Step: AIC=-99.31
logcost ~ date + cap + ne + ct + pt
Df Sum of Sq RSS AIC
<none> 0.75007 -99.312
- ct 1 0.09317 0.84324 -99.031
- ne 1 0.21478 0.96485 -94.720
- pt 1 0.37487 1.12494 -89.807
- date 1 0.55668 1.30675 -85.013
- cap 1 0.83451 1.58458 -78.845
>
> summary(best.model.AIC)
Call:
lm(formula = logcost ~ date + t2 + cap + pr + ne + ct + cum.n +
F-statistic: 25.5 on 5 and 26 DF, p-value: 2.958e-09
When we use AIC, we remove t1 and bw (in the order of removal) from the model that contains all explanatory variables. When we use BIC, we remove t1, bw, pr, t2, and cum.n (in the order of removal) from the model that contains all explanatory variables.
With the Nuclear data, the BIC's penalty term is bigger than that of AIC since $\ln(32) = 3.4657 > 2$. As a result, BIC penalizes bigger models more severely.
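For completeness, the two penalties can be compared directly: AIC adds $2k$ and BIC adds $\ln(n)\,k$ for a model with $k$ parameters. A trivial Python sketch (my own illustration):

```python
import math

n = 32            # number of observations in the Nuclear data
k = 9             # example: number of parameters in one candidate model

aic_penalty = 2 * k
bic_penalty = math.log(n) * k
# ln(32) ≈ 3.4657 > 2, so the BIC penalty exceeds the AIC penalty for every k >= 1.
```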
(g)
When stepAIC tests all the models that drop one variable, we can see that the ranking of those models is identical between AIC and BIC. This is because the penalty term is defined as the number of parameters times a constant that is the same across all models. So, the penalty term is the same for all models of the same dimension (within AIC or within BIC). Thus, the ranking is decided purely by the log-likelihood value, which is the same for AIC and BIC.
(h)
Table 1 shows which variables are (not) in the best model from each model selection framework. We see that the best model from the Wald test framework has the smallest number of variables. This is because AIC and BIC evaluate the model as a whole (in terms of KL divergence), while the Wald test framework evaluates the components of the model (i.e., the $\beta_j$) separately. So, even though a variable has a non-significant p-value, it can happen that this non-significant variable improves the model according to AIC or BIC.
In terms of the order in which the variables are removed, there is no difference between the frameworks (for the common variables that are removed). This can, however, differ from application to application.
Table 1: An overview of the variables in the best models from 3 different model selection frameworks. The numbers indicate the order in which the variables were removed from the model.

Model selection criterion   date  t1  t2  cap  pr  ne  ct  bw  cum.n  pt
Wald test                         1   4        3       6   2    5
AIC                               1                        2
BIC                               1   4        3           2    5
(i)
We know that the moment generating function is defined as $M_Y(t) = \mathrm{E}[e^{Yt}]$. For the normal distribution, the moment generating function is $M_Y(t) = \exp\!\left[\mu t + \sigma^2 t^2 / 2\right]$. Thus, $\mathrm{E}[e^Z] = M_Z(1) = \exp\!\left[\mu + 0.5\sigma^2\right]$.
We are given that $\theta = \mathrm{E}[Z]$ and $\eta = \mathrm{E}[e^Z]$.

The natural estimator of $\eta = \mathrm{E}[e^Z]$ can be obtained by replacing the expectation with its sample equivalent:
$$\hat{\eta}_3 = \frac{1}{n} \sum_{i=1}^{n} e^{z_i}.$$
However, since we know $\mathrm{E}[e^Z] = \exp\!\left[\theta + 0.5\sigma^2\right]$, we can estimate $\eta$ more directly with a plug-in estimator:
$$\hat{\eta}_2 = \exp\!\left[\hat{\theta} + 0.5\hat{\sigma}^2\right].$$
$\exp[\cdot]$ is not a linear operator, so it cannot be moved outside the expectation. But when an immature and naive statistician does this, he/she gets $\eta = \mathrm{E}[e^Z] = e^{\mathrm{E}[Z]} = e^{\theta}$, and by using the plug-in estimator, he/she gets $\hat{\eta} = e^{\hat{\theta}}$.

We also can't use the delta method, since we don't know $\operatorname{Var}(\hat{\sigma}^2)$ and $\operatorname{Cov}(\hat{\theta}, \hat{\sigma}^2)$.
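The gap between the correct target $\exp[\theta + 0.5\sigma^2]$ and the naive $e^{\theta}$ is easy to see in a quick Monte Carlo sketch (Python/NumPy, my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 0.5, 100_000

z = rng.normal(mu, sigma, size=n)

eta_mc = np.exp(z).mean()                  # sample analogue of E[e^Z] (eta-hat_3)
eta_true = np.exp(mu + 0.5 * sigma**2)     # correct value, exp(1.125) ≈ 3.08
eta_naive = np.exp(mu)                     # naive e^theta, exp(1) ≈ 2.72
# The Monte Carlo mean agrees with eta_true, not with the naive value.
```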
Oppgave 3
(a)
When one looks at the variance of the estimated parameters in exercise 2 (j), it ignores the uncertainty of the model itself and of the data. (Law of total variance: $\operatorname{Var}(Y) = \mathrm{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\mathrm{E}[Y \mid X])$.)
(b)
> # We only concentrate on AIC
> theta.hat = theta.hat.AIC
> sigma.hat = sigma.hat.AIC
> psi.hat = theta.hat + 0.5*sigma.hat^2
> eta.hat = eta.hat.AIC
>
> # (b)
> # A function that draws a bootstrap sample.
> # Then, it fits the model with the bootstrap sample.
> # Then, it performs variable selection based on backward AIC.
> # Then, it predicts from the fitted model with given newdata.
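The structure of such a bootstrap function can be sketched as follows (Python/NumPy, my own illustration; the AIC variable selection step inside the loop is replaced here by a plain OLS refit to keep the sketch short — in the actual solution each bootstrap fit also reruns backward AIC selection, as the comments above describe):

```python
import numpy as np

def bootstrap_predictions(X, y, x_new, B=200, rng=None):
    """Draw B bootstrap samples of (X, y), refit OLS each time,
    and return the B predictions at x_new."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # resample rows with replacement
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        preds[b] = x_new @ beta
    return preds

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
preds = bootstrap_predictions(X, y, x_new=np.array([1.0, 0.5, 0.5]), B=200, rng=rng)
# preds.std() estimates the uncertainty of the prediction across refits.
```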