Model Assessment & Selection
Prof. Liqing Zhang
Dept. Computer Science & Engineering, Shanghai Jiaotong University
Jan 01, 2016
23/04/20 Model Assessment & Selection 2
Outline
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods
Bias, Variance & Model Complexity
Bias, Variance & Model Complexity
• The standard of model assessment: the generalization performance of a learning method
  – Model: target $Y$, input $X$
  – Prediction model: $\hat f(X)$
  – Loss function:
$$
L(Y, \hat f(X)) =
\begin{cases}
(Y - \hat f(X))^2 & \text{squared error} \\
|Y - \hat f(X)| & \text{absolute error}
\end{cases}
$$
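The two regression losses can be sketched directly. This is an illustrative snippet, not from the slides; the function names are my own:

```python
# Illustrative sketch (names squared_error/absolute_error are assumptions,
# not from the slides): the two pointwise regression losses L(y, f_hat).
def squared_error(y, f_hat):
    return (y - f_hat) ** 2   # (Y - f_hat(X))^2

def absolute_error(y, f_hat):
    return abs(y - f_hat)     # |Y - f_hat(X)|

# A residual of 0.5 is penalized less by squared error than by absolute error
print(squared_error(3.0, 2.5))
print(absolute_error(3.0, 2.5))
```

Squared error is differentiable everywhere, which is why it dominates in regression fitting; absolute error is more robust to outliers.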
Bias, Variance & Model Complexity
• Error: training error, generalization error
$$
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i)) \quad \text{(training error)}
$$
$$
\mathrm{Err} = E[L(Y, \hat f(X))] \quad \text{(generalization error)}
$$
• Typical loss functions for a categorical response $G$:
$$
L(G, \hat G(X)) = I(G \ne \hat G(X)) \quad \text{(0-1 loss)}
$$
$$
L(G, \hat p(X)) = -2\sum_{k=1}^K I(G = k)\log \hat p_k(X) = -2\log \hat p_G(X) \quad \text{(log-likelihood)}
$$
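For a categorical response, the 0-1 and log-likelihood losses can be sketched as follows; this is an illustrative snippet with hypothetical helper names, not code from the slides:

```python
import math

# Sketch (helper names are assumptions): losses for a K-class response.
def zero_one_loss(g, g_hat):
    # I(G != G_hat(X))
    return 0 if g == g_hat else 1

def log_likelihood_loss(g, p_hat):
    # -2 * log p_hat_G(X); p_hat is a list of estimated class probabilities
    return -2.0 * math.log(p_hat[g])

p_hat = [0.7, 0.2, 0.1]
print(zero_one_loss(0, 0), zero_one_loss(0, 2))
# Small when the true class is given high probability, large otherwise
print(log_likelihood_loss(0, p_hat), log_likelihood_loss(2, p_hat))
```

Unlike 0-1 loss, the log-likelihood loss is smooth in the estimated probabilities, which is what makes it usable for the optimism and AIC arguments later in the deck.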
Bias-Variance Decomposition
• Basic model: $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$
• The expected prediction error of a regression fit $\hat f(X)$ at $X = x_0$:
$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \mathrm{Bias}^2 + \mathrm{Variance}
\end{aligned}
$$
• The more complex the model, the lower the (squared) bias but the higher the variance.
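The decomposition can be checked numerically. The sketch below uses a deliberately simple toy estimator of my own choosing (fit a constant, the sample mean, to data from $Y = f(X) + \varepsilon$) and estimates its bias and variance at a point by Monte Carlo:

```python
import random

# Monte-Carlo check of Err(x0) = sigma^2 + Bias^2 + Var for a toy
# estimator (an assumption for illustration): predict with c = mean(y).
random.seed(0)

def f(x):                      # true regression function
    return 2.0 * x

sigma = 1.0                    # noise standard deviation
x0, n, reps = 0.5, 50, 4000
xs = [i / n for i in range(n)]

preds = []
for _ in range(reps):
    ys = [f(x) + random.gauss(0.0, sigma) for x in xs]
    preds.append(sum(ys) / n)  # constant fit: same prediction at every x0

mean_pred = sum(preds) / reps
bias2 = (mean_pred - f(x0)) ** 2
var = sum((p - mean_pred) ** 2 for p in preds) / reps
err = sigma ** 2 + bias2 + var # expected prediction error at x0
print(round(bias2, 4), round(var, 4), round(err, 4))
```

Here the constant fit happens to have low bias at $x_0 = 0.5$ (the mean of $f$ over the design), and its variance is roughly $\sigma^2/n$; a more flexible estimator would trade these two terms the other way.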
Bias-Variance Decomposition
• For the k-nearest-neighbor regression fit, the prediction error is
$$
\mathrm{Err}(x_0) = E[(Y - \hat f_k(x_0))^2 \mid X = x_0]
= \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}
$$
• For the linear model fit $\hat f_p(x) = x^T \beta$:
$$
\mathrm{Err}(x_0) = E[(Y - \hat f_p(x_0))^2 \mid X = x_0]
= \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2
$$
where $h(x_0)$ is the vector of linear weights producing the fit at $x_0$.
Bias-Variance Decomposition
• The average in-sample error of the linear model:
$$
\frac{1}{N}\sum_{i=1}^N \mathrm{Err}(x_i)
= \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2
$$
  – The model complexity is directly related to the number of parameters $p$.
• For ridge regression, the averaged squared bias decomposes as
$$
E_{x_0}\big[f(x_0) - E\hat f_\alpha(x_0)\big]^2
= E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E\,x_0^T\hat\beta_\alpha\big]^2
= \mathrm{Ave[Model\ Bias]}^2 + \mathrm{Ave[Estimation\ Bias]}^2
$$
Bias-Variance Decomposition
• Schematic of the behavior of bias and variance
[Figure: model space and restricted model space, showing the truth, the closest fit in population, the closest fit, the regularized fit, and a realization; model bias, estimation bias, and estimation variance appear as distances between these points.]
Optimism of the Training Error Rate
• Training error < true error:
$$
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i)) \quad \text{(training error)}
$$
$$
\mathrm{Err} = E[L(Y, \hat f(X))] \quad \text{(true error)}
$$
• $\mathrm{Err}$ is extra-sample error
• The in-sample error:
$$
\mathrm{Err}_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^N E_{Y^{\mathrm{new}}}\big[L(Y_i^{\mathrm{new}}, \hat f(x_i))\big]
$$
• Optimism: $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}$
Optimism of the Training Error Rate
• For squared error, 0-1, and other loss functions:
$$
\omega = \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i), \qquad
E_y(\mathrm{Err}_{\mathrm{in}}) = E_y(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i)
$$
• If $\hat y_i$ is obtained by a linear fit with $d$ inputs or basis functions, this simplifies:
$$
\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2, \qquad
E_y(\mathrm{Err}_{\mathrm{in}}) = E_y(\overline{\mathrm{err}}) + 2\frac{d}{N}\sigma_\varepsilon^2
$$
• As the number of inputs or basis functions grows, the optimism increases; as the number of training samples grows, the optimism decreases.
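The identity $\sum_i \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$ can be illustrated by simulation. The sketch below is my own toy setup: an OLS fit of a single constant ($d = 1$), whose covariance sum should come out near $\sigma_\varepsilon^2$:

```python
import random

# Numerical illustration (a sketch, not from the slides): for an OLS fit
# with d parameters, sum_i Cov(yhat_i, y_i) = d * sigma^2.  Here d = 1
# (a single mean parameter), so the sum should be close to sigma^2 = 1.
random.seed(1)
sigma, N, reps = 1.0, 20, 20000
f0 = 3.0                            # true constant regression function

ys_all, yhat_all = [], []
for _ in range(reps):
    ys = [f0 + random.gauss(0.0, sigma) for _ in range(N)]
    yhat = sum(ys) / N              # OLS fit of a constant: yhat_i = mean(y)
    ys_all.append(ys)
    yhat_all.append(yhat)

# Monte-Carlo estimate of sum_i Cov(yhat_i, y_i)
mean_yhat = sum(yhat_all) / reps
cov_sum = 0.0
for i in range(N):
    mean_yi = sum(ys_all[r][i] for r in range(reps)) / reps
    cov_sum += sum((yhat_all[r] - mean_yhat) * (ys_all[r][i] - mean_yi)
                   for r in range(reps)) / reps
print(round(cov_sum, 3))            # should be near d * sigma^2 = 1.0
```

Each training point pulls its own prediction toward itself by $\sigma^2/N$ on average; summing over the $N$ points recovers $d\,\sigma^2$, which is exactly the optimism penalty $2d\sigma^2/N$ after scaling.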
Estimates of In-sample Prediction Error
• The general form of the in-sample estimate is
$$
\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \hat\omega
$$
• When $d$ parameters are fit under squared error loss, this gives the $C_p$ statistic:
$$
C_p = \overline{\mathrm{err}} + 2\frac{d}{N}\hat\sigma_\varepsilon^2
$$
• Use a log-likelihood function to estimate $\mathrm{Err}_{\mathrm{in}}$:
$$
-2\,E[\log \Pr\nolimits_{\hat\theta}(Y)] \approx -\frac{2}{N}E[\mathrm{loglik}] + 2\frac{d}{N}, \qquad
\mathrm{loglik} = \sum_{i=1}^N \log \Pr\nolimits_{\hat\theta}(y_i)
$$
  – This relationship leads to the Akaike Information Criterion.
Akaike Information Criterion
• The Akaike Information Criterion is a similar but more generally applicable estimate of $\mathrm{Err}_{\mathrm{in}}$
• Given a set of models $\hat f_\alpha(x)$ indexed by a tuning parameter $\alpha$, with training error $\overline{\mathrm{err}}(\alpha)$ and $d(\alpha)$ parameters:
$$
\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat\sigma_\varepsilon^2
$$
• $\mathrm{AIC}(\alpha)$ provides an estimate of the test error curve, and we choose the tuning parameter $\hat\alpha$ that minimizes it: $\hat f_{\hat\alpha}(x)$, $\hat\alpha = \arg\min_\alpha \mathrm{AIC}(\alpha)$.
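Minimizing $\mathrm{AIC}(\alpha)$ over a model family can be sketched with two nested models of my own choosing (a constant fit vs. a simple linear fit), with $\hat\sigma_\varepsilon^2$ taken from the larger model, which is one common convention:

```python
import random

# Sketch: pick a tuning parameter (here, polynomial degree 0 vs 1) by
# minimizing AIC(alpha) = err(alpha) + 2*(d(alpha)/N)*sigma_hat^2.
# The setup and sigma_hat^2 convention are assumptions for illustration.
random.seed(2)
N = 200
xs = [i / N for i in range(N)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.3) for x in xs]  # truth is linear

def fit_err(degree):
    """Training error (mean squared residual) of the degree-0 or degree-1 fit."""
    if degree == 0:                       # constant model, d = 1
        c = sum(ys) / N
        resid = [y - c for y in ys]
    else:                                 # simple linear model, d = 2
        mx, my = sum(xs) / N, sum(ys) / N
        b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        a = my - b * mx
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in resid) / N

sigma2_hat = fit_err(1)                   # low-bias model's error estimate
aic = {d: fit_err(d) + 2 * ((d + 1) / N) * sigma2_hat for d in (0, 1)}
best = min(aic, key=aic.get)
print(best)                               # the linear model should win here
```

The constant model pays a large training-error term because it cannot follow the trend, while the linear model's extra parameter costs only $2\hat\sigma_\varepsilon^2/N$, so AIC correctly prefers it.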
Akaike Information Criterion
• For the logistic regression model, using the binomial log-likelihood:
$$
\mathrm{AIC} = -\frac{2}{N}E[\mathrm{loglik}] + 2\frac{d}{N}
$$
• For the Gaussian model the AIC statistic equals the $C_p$ statistic:
$$
\mathrm{AIC} = \overline{\mathrm{err}} + 2\frac{d}{N}\hat\sigma_\varepsilon^2 = C_p
$$
Akaike Information Criterion
• Phoneme recognition example: the fit is expanded in $M$ basis functions,
$$
\hat f(f) = \sum_{m=1}^M h_m(f)\,\theta_m, \qquad d(\alpha) = M
$$
Effective number of parameters
• A linear fitting method:
$$
\hat{\mathbf y} = \mathbf S \mathbf y, \qquad \mathbf S \text{ an } N \times N \text{ matrix depending on the } x_i
$$
• Effective number of parameters:
$$
d(\mathbf S) = \mathrm{trace}(\mathbf S)
$$
  – If $\mathbf S$ is an orthogonal projection matrix onto a basis set spanned by $M$ features, then $\mathrm{trace}(\mathbf S) = M$
  – $\mathrm{trace}(\mathbf S)$ is the correct quantity to replace $d$ in the $C_p$ statistic
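For ridge regression the trace has a closed form. The sketch below works out the one-feature case of my own construction, where $\mathbf S = \mathbf x(\mathbf x^T\mathbf x + \lambda)^{-1}\mathbf x^T$ and so $\mathrm{trace}(\mathbf S) = \sum_i x_i^2 / (\sum_i x_i^2 + \lambda)$:

```python
# Sketch: effective degrees of freedom df(lambda) = trace(S_lambda) for
# ridge regression with a single centered feature (a toy case chosen
# for illustration, not from the slides).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
sxx = sum(x * x for x in xs)            # x^T x

def df_ridge(lam):
    # trace of S = x (x^T x + lambda)^(-1) x^T
    return sxx / (sxx + lam)

print(df_ridge(0.0))                    # lambda = 0: full OLS, one parameter
print(round(df_ridge(10.0), 3))         # shrinkage reduces the effective df
```

As $\lambda$ grows the effective number of parameters shrinks continuously from 1 toward 0, which is why $\mathrm{trace}(\mathbf S)$, not the raw parameter count, belongs in $C_p$.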
Bayesian Approach & BIC
• The Bayesian Information Criterion (BIC):
$$
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d
$$
• Gaussian model:
  – With variance $\hat\sigma_\varepsilon^2$,
$$
-2\,\mathrm{loglik} = \sum_i \big(y_i - \hat f(x_i)\big)^2 / \hat\sigma_\varepsilon^2
= N\,\overline{\mathrm{err}} / \hat\sigma_\varepsilon^2 \quad (\text{up to a constant})
$$
  – so
$$
\mathrm{BIC} = \frac{N}{\hat\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\frac{d}{N}\hat\sigma_\varepsilon^2\Big]
$$
  – BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$; since $e^2 \approx 7.4$, for $N > 7.4$ BIC penalizes complex models more heavily, tending to favor simpler models.
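The "2 versus $\log N$" comparison can be made concrete; this short check (my own illustration) shows the crossover at $N > e^2 \approx 7.4$:

```python
import math

# Sketch: BIC replaces AIC's factor 2 with log N, so for N > e^2 (about
# 7.4) the per-parameter BIC penalty is the heavier one.
def aic_penalty(d, N):
    return 2.0 * d / N

def bic_penalty(d, N):
    return math.log(N) * d / N

for N in (7, 8, 100):
    print(N, bic_penalty(1, N) > aic_penalty(1, N))
```

Because $\log N$ keeps growing while AIC's factor stays at 2, BIC selects ever simpler models as the sample grows, which is the source of its consistency property discussed on the next slides.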
Bayesian Model Selection
• BIC derived from Bayesian Model Selection
• Candidate models $\mathcal M_m$, $m = 1, \dots, M$, with model parameters $\theta_m$ and a prior distribution $\Pr(\theta_m \mid \mathcal M_m)$
• Posterior probability:
$$
\Pr(\mathcal M_m \mid \mathbf Z) \propto \Pr(\mathcal M_m)\Pr(\mathbf Z \mid \mathcal M_m)
= \Pr(\mathcal M_m)\int \Pr(\mathbf Z \mid \theta_m, \mathcal M_m)\Pr(\theta_m \mid \mathcal M_m)\, d\theta_m
$$
where $\mathbf Z = \{x_i, y_i\}_{i=1}^N$ represents the training data.
Bayesian Model Selection
• Compare two models $\mathcal M_m$ and $\mathcal M_\ell$:
$$
\frac{\Pr(\mathcal M_m \mid \mathbf Z)}{\Pr(\mathcal M_\ell \mid \mathbf Z)}
= \frac{\Pr(\mathcal M_m)}{\Pr(\mathcal M_\ell)} \cdot \frac{\Pr(\mathbf Z \mid \mathcal M_m)}{\Pr(\mathbf Z \mid \mathcal M_\ell)}
$$
• If the odds are greater than 1, model $\mathcal M_m$ is chosen; otherwise choose model $\mathcal M_\ell$
• Bayes factor:
$$
\mathrm{BF}(\mathbf Z) = \frac{\Pr(\mathbf Z \mid \mathcal M_m)}{\Pr(\mathbf Z \mid \mathcal M_\ell)}
$$
  – The contribution of the data to the posterior odds
Bayesian Model Selection
• If the prior over models is uniform, $\Pr(\mathcal M_m)$ is constant, and a Laplace approximation gives
$$
\log \Pr(\mathbf Z \mid \mathcal M_m) = \log \Pr(\mathbf Z \mid \hat\theta_m, \mathcal M_m) - \frac{d_m}{2}\log N + O(1)
$$
where $\hat\theta_m$ is the maximum-likelihood estimate of the model parameters and $d_m$ is the model dimension.
• Minimizing BIC is therefore equivalent to maximizing the (approximate) posterior model probability.
• Advantage: if the set of candidates contains the true model, then as the sample size tends to infinity, the probability that BIC selects the correct model tends to one.
Minimum Description Length (MDL)
• Origin: optimal coding
• Messages: z1 z2 z3 z4
• Code: 0 10 110 111
• Code 2: 110 10 111 0
• Principle: the most frequent messages use the shortest codes
• Probability of sending message $z_i$: $\Pr(z_i)$
• Shannon's theorem says to use code lengths $l_i = -\log_2 \Pr(z_i)$
Minimum Description Length (MDL)
$$
E(\mathrm{Length}) \ge -\sum_i \Pr(z_i)\log_2 \Pr(z_i)
$$
• Example: with $\Pr(z_i) = 1/2,\ 1/4,\ 1/8,\ 1/8$, equality holds.
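The slide's example can be checked directly: with probabilities that are powers of two, the Shannon code lengths are integers and the expected length meets the entropy bound exactly.

```python
import math

# Check of the slide's example: Pr = (1/2, 1/4, 1/8, 1/8) gives integer
# code lengths -log2 Pr(z_i) = (1, 2, 3, 3), and the expected length
# sum_i Pr(z_i) * l_i attains the entropy lower bound with equality.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [-math.log2(p) for p in probs]
expected_length = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)
print(lengths)
print(expected_length, entropy)   # both equal 1.75 bits
```

This matches the code on the previous slide (0, 10, 110, 111): the most probable message gets the one-bit codeword, and the average cost is 1.75 bits per message.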
Model Selection with MDL
• Model $\mathcal M$ with parameters $\theta$; inputs and outputs $\mathbf Z = (\mathbf X, \mathbf y)$
• Assume the model's outputs have conditional probability $\Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X)$
• Message length:
$$
\mathrm{length} = -\log \Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X) - \log \Pr(\theta \mid \mathcal M)
$$
  – the first term codes the deviation of the model from the actual target values; the second is the average code length for the model parameters.
Model Selection with MDL
• If $y \sim N(\theta, \sigma^2)$ and $\theta \sim N(0, 1)$, then
$$
\mathrm{length} = \mathrm{Const.} + \log\sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}
$$
  – a smaller $\sigma$ gives a shorter message length.
• The MDL principle: choose the model that minimizes the message length
$$
\mathrm{length} = -\log \Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X) - \log \Pr(\theta \mid \mathcal M)
$$
Vapnik-Chervonenkis Dimension
• Question: how should the number of parameters $d$ of a model be chosen?
• This parameter represents the model's complexity
• The VC dimension is an important measure of model complexity
• Example: a family of indicator functions $\{f(x, \alpha)\}$ with parameter vector $\alpha$. The linear indicator function
$$
f(x, \alpha) = I(\alpha_0 + \alpha_1^T x > 0), \qquad x \in \mathbb R^p,
$$
has $p + 1$ parameters. But $f(x, \alpha) = I(\sin(\alpha x) > 0)$, $x \in \mathbb R^1$, has only one parameter; what is its complexity?
VC Dimension
• The VC dimension of the class $\{f(x, \alpha)\}$ is defined as the largest number of points that can be shattered by members of $\{f(x, \alpha)\}$
• The class of lines in the plane has VC dimension 3
• The VC dimension of $\sin(ax)$ is infinite
[Figure: a rapidly oscillating sine curve on $[0, 1]$, illustrating that $I(\sin(ax) > 0)$ can shatter arbitrarily many points.]
VC Dimension
• The VC dimension of a class of real-valued functions $\{g(x, \alpha)\}$ is defined as the VC dimension of the indicator class $\{I(g(x, \alpha) - \beta > 0)\}$
• The VC dimension yields an estimate of the generalization error
• Suppose $\{f(x, \alpha)\}$ has VC dimension $h$ and the sample size is $N$. Then with probability at least $1 - \eta$:
$$
\text{(binary classification)} \quad
\mathrm{Err} \le \overline{\mathrm{err}} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\overline{\mathrm{err}}}{\epsilon}}\Big)
$$
$$
\text{(regression)} \quad
\mathrm{Err} \le \frac{\overline{\mathrm{err}}}{(1 - c\sqrt{\epsilon})_+}
$$
$$
\text{where} \quad \epsilon = a_1 \frac{h\,[\log(a_2 N / h) + 1] - \log(\eta / 4)}{N}
$$
Cross-Validation
• Split the data into K parts (here K = 5); each part serves once as the validation set while the remaining parts form the training set:

1        2        3          4        5
Train    Train    Validate   Train    Train
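The diagram can be sketched in code. This is a minimal illustration of K-fold cross-validation with a toy "model" of my own choosing (the training-set mean), not a procedure from the slides:

```python
# Sketch of K-fold cross-validation: each fold serves once as the
# validation set; the model fit here is just the training mean (a toy
# assumption) to keep the sketch minimal.
def k_fold_cv(data, k=5):
    n = len(data)
    fold_size = n // k
    errors = []
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        pred = sum(train) / len(train)          # "fit" on the training folds
        errors.append(sum((y - pred) ** 2 for y in val) / len(val))
    return sum(errors) / k                      # average validation error

data = [float(i % 10) for i in range(50)]
print(round(k_fold_cv(data, 5), 3))
```

Every observation is used for validation exactly once and for training K−1 times, so the averaged error is an almost-unbiased estimate of the error of a model trained on (K−1)/K of the data.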
Bootstrap Methods
• Basic idea: draw datasets by sampling with replacement from the training data, each dataset the same size as the original training set
• Generate B bootstrap datasets in this way
• How can these datasets be used for prediction?
• Bootstrap error estimate: if $\hat f^{*b}(x_i)$ is the prediction at $x_i$ from the model fit to the $b$-th bootstrap dataset,
$$
\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^B \sum_{i=1}^N L\big(y_i, \hat f^{*b}(x_i)\big)
$$
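The estimate $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ can be sketched directly. The "model" below is again a toy choice of mine (the bootstrap-sample mean) so the sketch stays self-contained:

```python
import random

# Sketch of the naive bootstrap error estimate: draw B datasets with
# replacement, fit on each (here: the sample mean, a toy model chosen
# for illustration), and average the loss over the ORIGINAL points.
random.seed(3)
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B, N = 200, len(data)

err_boot = 0.0
for _ in range(B):
    boot = [random.choice(data) for _ in range(N)]  # sample with replacement
    f_star = sum(boot) / N                          # fit on bootstrap set
    err_boot += sum((y - f_star) ** 2 for y in data) / N
err_boot /= B
print(round(err_boot, 3))
```

Note the weakness the slides lead up to: each bootstrap set overlaps heavily with the original sample (about 63.2% of points on average), so this naive estimate is optimistic, motivating leave-one-out and ".632" variants.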
Bootstrap Methods
• Schematic of the bootstrap process:
[Figure: training sample $\mathbf Z = (z_1, z_2, \dots, z_N)$ at the bottom; bootstrap samples $\mathbf Z^{*1}, \mathbf Z^{*2}, \dots, \mathbf Z^{*B}$ drawn from it by repeated resampling; statistics $S(\mathbf Z^{*1}), S(\mathbf Z^{*2}), \dots, S(\mathbf Z^{*B})$ computed from the replications.]
Summary
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods