Model Assessment & Selection
Prof. Liqing Zhang
Dept. Computer Science & Engineering, Shanghai Jiaotong University
Jan 01, 2016
23/04/20 Model Assessment & Selection 2
Outline
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods
Bias, Variance & Model Complexity
Bias, Variance & Model Complexity
• The standard of model assessment: the generalization performance of a learning method
  – Model: target $Y$, input $X$
  – Prediction model: $\hat f(X)$
  – Loss function:
$$
L(Y, \hat f(X)) =
\begin{cases}
(Y - \hat f(X))^2 & \text{squared error} \\
|Y - \hat f(X)| & \text{absolute error}
\end{cases}
$$
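The two regression losses can be sketched directly. This is an illustrative snippet, not from the slides; the function names are my own:

```python
# Illustrative sketch (names squared_error/absolute_error are assumptions,
# not from the slides): the two pointwise regression losses L(y, f_hat).
def squared_error(y, f_hat):
    return (y - f_hat) ** 2   # (Y - f_hat(X))^2

def absolute_error(y, f_hat):
    return abs(y - f_hat)     # |Y - f_hat(X)|

# A residual of 0.5 is penalized less by squared error than by absolute error
print(squared_error(3.0, 2.5))
print(absolute_error(3.0, 2.5))
```

Squared error is differentiable everywhere, which is why it dominates in regression fitting; absolute error is more robust to outliers.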
Bias, Variance & Model Complexity
• Error: training error, generalization error
$$
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i)) \quad \text{(training error)}
$$
$$
\mathrm{Err} = E[L(Y, \hat f(X))] \quad \text{(generalization error)}
$$
• Typical loss functions for a categorical response $G$:
$$
L(G, \hat G(X)) = I(G \ne \hat G(X)) \quad \text{(0-1 loss)}
$$
$$
L(G, \hat p(X)) = -2\sum_{k=1}^K I(G = k)\log \hat p_k(X) = -2\log \hat p_G(X) \quad \text{(log-likelihood)}
$$
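For a categorical response, the 0-1 and log-likelihood losses can be sketched as follows; this is an illustrative snippet with hypothetical helper names, not code from the slides:

```python
import math

# Sketch (helper names are assumptions): losses for a K-class response.
def zero_one_loss(g, g_hat):
    # I(G != G_hat(X))
    return 0 if g == g_hat else 1

def log_likelihood_loss(g, p_hat):
    # -2 * log p_hat_G(X); p_hat is a list of estimated class probabilities
    return -2.0 * math.log(p_hat[g])

p_hat = [0.7, 0.2, 0.1]
print(zero_one_loss(0, 0), zero_one_loss(0, 2))
# Small when the true class is given high probability, large otherwise
print(log_likelihood_loss(0, p_hat), log_likelihood_loss(2, p_hat))
```

Unlike 0-1 loss, the log-likelihood loss is smooth in the estimated probabilities, which is what makes it usable for the optimism and AIC arguments later in the deck.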
Bias-Variance Decomposition
• Basic model: $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$
• The expected prediction error of a regression fit $\hat f(X)$ at $X = x_0$:
$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \mathrm{Bias}^2 + \mathrm{Variance}
\end{aligned}
$$
• The more complex the model, the lower the (squared) bias but the higher the variance.
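The decomposition can be checked numerically. The sketch below uses a deliberately simple toy estimator of my own choosing (fit a constant, the sample mean, to data from $Y = f(X) + \varepsilon$) and estimates its bias and variance at a point by Monte Carlo:

```python
import random

# Monte-Carlo check of Err(x0) = sigma^2 + Bias^2 + Var for a toy
# estimator (an assumption for illustration): predict with c = mean(y).
random.seed(0)

def f(x):                      # true regression function
    return 2.0 * x

sigma = 1.0                    # noise standard deviation
x0, n, reps = 0.5, 50, 4000
xs = [i / n for i in range(n)]

preds = []
for _ in range(reps):
    ys = [f(x) + random.gauss(0.0, sigma) for x in xs]
    preds.append(sum(ys) / n)  # constant fit: same prediction at every x0

mean_pred = sum(preds) / reps
bias2 = (mean_pred - f(x0)) ** 2
var = sum((p - mean_pred) ** 2 for p in preds) / reps
err = sigma ** 2 + bias2 + var # expected prediction error at x0
print(round(bias2, 4), round(var, 4), round(err, 4))
```

Here the constant fit happens to have low bias at $x_0 = 0.5$ (the mean of $f$ over the design), and its variance is roughly $\sigma^2/n$; a more flexible estimator would trade these two terms the other way.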
Bias-Variance Decomposition
• For the k-nearest-neighbor regression fit, the prediction error is
$$
\mathrm{Err}(x_0) = E[(Y - \hat f_k(x_0))^2 \mid X = x_0]
= \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}
$$
• For the linear model fit $\hat f_p(x) = x^T \beta$:
$$
\mathrm{Err}(x_0) = E[(Y - \hat f_p(x_0))^2 \mid X = x_0]
= \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2
$$
where $h(x_0)$ is the vector of linear weights producing the fit at $x_0$.
Bias-Variance Decomposition
• The average in-sample error of the linear model:
$$
\frac{1}{N}\sum_{i=1}^N \mathrm{Err}(x_i)
= \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2
$$
  – The model complexity is directly related to the number of parameters $p$.
• For ridge regression, the averaged squared bias decomposes as
$$
E_{x_0}\big[f(x_0) - E\hat f_\alpha(x_0)\big]^2
= E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E\,x_0^T\hat\beta_\alpha\big]^2
= \mathrm{Ave[Model\ Bias]}^2 + \mathrm{Ave[Estimation\ Bias]}^2
$$
Bias-Variance Decomposition
• Schematic of the behavior of bias and variance
[Figure: model space and restricted model space, showing the truth, the closest fit in population, the closest fit, the regularized fit, and a realization; model bias, estimation bias, and estimation variance appear as distances between these points.]
Optimism of the Training Error Rate
• Training error < true error:
$$
\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i)) \quad \text{(training error)}
$$
$$
\mathrm{Err} = E[L(Y, \hat f(X))] \quad \text{(true error)}
$$
• $\mathrm{Err}$ is extra-sample error
• The in-sample error:
$$
\mathrm{Err}_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^N E_{Y^{\mathrm{new}}}\big[L(Y_i^{\mathrm{new}}, \hat f(x_i))\big]
$$
• Optimism: $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}$
Optimism of the Training Error Rate
• For squared error, 0-1, and other loss functions:
$$
\omega = \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i), \qquad
E_y(\mathrm{Err}_{\mathrm{in}}) = E_y(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i)
$$
• If $\hat y_i$ is obtained by a linear fit with $d$ inputs or basis functions, this simplifies:
$$
\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2, \qquad
E_y(\mathrm{Err}_{\mathrm{in}}) = E_y(\overline{\mathrm{err}}) + 2\frac{d}{N}\sigma_\varepsilon^2
$$
• As the number of inputs or basis functions grows, the optimism increases; as the number of training samples grows, the optimism decreases.
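The identity $\sum_i \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$ can be illustrated by simulation. The sketch below is my own toy setup: an OLS fit of a single constant ($d = 1$), whose covariance sum should come out near $\sigma_\varepsilon^2$:

```python
import random

# Numerical illustration (a sketch, not from the slides): for an OLS fit
# with d parameters, sum_i Cov(yhat_i, y_i) = d * sigma^2.  Here d = 1
# (a single mean parameter), so the sum should be close to sigma^2 = 1.
random.seed(1)
sigma, N, reps = 1.0, 20, 20000
f0 = 3.0                            # true constant regression function

ys_all, yhat_all = [], []
for _ in range(reps):
    ys = [f0 + random.gauss(0.0, sigma) for _ in range(N)]
    yhat = sum(ys) / N              # OLS fit of a constant: yhat_i = mean(y)
    ys_all.append(ys)
    yhat_all.append(yhat)

# Monte-Carlo estimate of sum_i Cov(yhat_i, y_i)
mean_yhat = sum(yhat_all) / reps
cov_sum = 0.0
for i in range(N):
    mean_yi = sum(ys_all[r][i] for r in range(reps)) / reps
    cov_sum += sum((yhat_all[r] - mean_yhat) * (ys_all[r][i] - mean_yi)
                   for r in range(reps)) / reps
print(round(cov_sum, 3))            # should be near d * sigma^2 = 1.0
```

Each training point pulls its own prediction toward itself by $\sigma^2/N$ on average; summing over the $N$ points recovers $d\,\sigma^2$, which is exactly the optimism penalty $2d\sigma^2/N$ after scaling.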
Estimates of In-sample Prediction Error
• The general form of the in-sample estimate is
$$
\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \hat\omega
$$
• When $d$ parameters are fit under squared error loss, this gives the $C_p$ statistic:
$$
C_p = \overline{\mathrm{err}} + 2\frac{d}{N}\hat\sigma_\varepsilon^2
$$
• Use a log-likelihood function to estimate $\mathrm{Err}_{\mathrm{in}}$:
$$
-2\,E[\log \Pr\nolimits_{\hat\theta}(Y)] \approx -\frac{2}{N}E[\mathrm{loglik}] + 2\frac{d}{N}, \qquad
\mathrm{loglik} = \sum_{i=1}^N \log \Pr\nolimits_{\hat\theta}(y_i)
$$
  – This relationship leads to the Akaike Information Criterion.
Akaike Information Criterion
• The Akaike Information Criterion is a similar but more generally applicable estimate of $\mathrm{Err}_{\mathrm{in}}$
• Given a set of models $\hat f_\alpha(x)$ indexed by a tuning parameter $\alpha$, with training error $\overline{\mathrm{err}}(\alpha)$ and $d(\alpha)$ parameters:
$$
\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat\sigma_\varepsilon^2
$$
• $\mathrm{AIC}(\alpha)$ provides an estimate of the test error curve, and we choose the tuning parameter $\hat\alpha$ that minimizes it: $\hat f_{\hat\alpha}(x)$, $\hat\alpha = \arg\min_\alpha \mathrm{AIC}(\alpha)$.
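Minimizing $\mathrm{AIC}(\alpha)$ over a model family can be sketched with two nested models of my own choosing (a constant fit vs. a simple linear fit), with $\hat\sigma_\varepsilon^2$ taken from the larger model, which is one common convention:

```python
import random

# Sketch: pick a tuning parameter (here, polynomial degree 0 vs 1) by
# minimizing AIC(alpha) = err(alpha) + 2*(d(alpha)/N)*sigma_hat^2.
# The setup and sigma_hat^2 convention are assumptions for illustration.
random.seed(2)
N = 200
xs = [i / N for i in range(N)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.3) for x in xs]  # truth is linear

def fit_err(degree):
    """Training error (mean squared residual) of the degree-0 or degree-1 fit."""
    if degree == 0:                       # constant model, d = 1
        c = sum(ys) / N
        resid = [y - c for y in ys]
    else:                                 # simple linear model, d = 2
        mx, my = sum(xs) / N, sum(ys) / N
        b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        a = my - b * mx
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in resid) / N

sigma2_hat = fit_err(1)                   # low-bias model's error estimate
aic = {d: fit_err(d) + 2 * ((d + 1) / N) * sigma2_hat for d in (0, 1)}
best = min(aic, key=aic.get)
print(best)                               # the linear model should win here
```

The constant model pays a large training-error term because it cannot follow the trend, while the linear model's extra parameter costs only $2\hat\sigma_\varepsilon^2/N$, so AIC correctly prefers it.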
Akaike Information Criterion
• For the logistic regression model, using the binomial log-likelihood:
$$
\mathrm{AIC} = -\frac{2}{N}E[\mathrm{loglik}] + 2\frac{d}{N}
$$
• For the Gaussian model the AIC statistic equals the $C_p$ statistic:
$$
\mathrm{AIC} = \overline{\mathrm{err}} + 2\frac{d}{N}\hat\sigma_\varepsilon^2 = C_p
$$
Akaike Information Criterion
• Phoneme recognition example: the fit is expanded in $M$ basis functions,
$$
\hat f(f) = \sum_{m=1}^M h_m(f)\,\theta_m, \qquad d(\alpha) = M
$$
Effective number of parameters
• A linear fitting method:
$$
\hat{\mathbf y} = \mathbf S \mathbf y, \qquad \mathbf S \text{ an } N \times N \text{ matrix depending on the } x_i
$$
• Effective number of parameters:
$$
d(\mathbf S) = \mathrm{trace}(\mathbf S)
$$
  – If $\mathbf S$ is an orthogonal projection matrix onto a basis set spanned by $M$ features, then $\mathrm{trace}(\mathbf S) = M$
  – $\mathrm{trace}(\mathbf S)$ is the correct quantity to replace $d$ in the $C_p$ statistic
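For ridge regression the trace has a closed form. The sketch below works out the one-feature case of my own construction, where $\mathbf S = \mathbf x(\mathbf x^T\mathbf x + \lambda)^{-1}\mathbf x^T$ and so $\mathrm{trace}(\mathbf S) = \sum_i x_i^2 / (\sum_i x_i^2 + \lambda)$:

```python
# Sketch: effective degrees of freedom df(lambda) = trace(S_lambda) for
# ridge regression with a single centered feature (a toy case chosen
# for illustration, not from the slides).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
sxx = sum(x * x for x in xs)            # x^T x

def df_ridge(lam):
    # trace of S = x (x^T x + lambda)^(-1) x^T
    return sxx / (sxx + lam)

print(df_ridge(0.0))                    # lambda = 0: full OLS, one parameter
print(round(df_ridge(10.0), 3))         # shrinkage reduces the effective df
```

As $\lambda$ grows the effective number of parameters shrinks continuously from 1 toward 0, which is why $\mathrm{trace}(\mathbf S)$, not the raw parameter count, belongs in $C_p$.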
Bayesian Approach & BIC
• The Bayesian Information Criterion (BIC):
$$
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d
$$
• Gaussian model:
  – With variance $\hat\sigma_\varepsilon^2$,
$$
-2\,\mathrm{loglik} = \sum_i \big(y_i - \hat f(x_i)\big)^2 / \hat\sigma_\varepsilon^2
= N\,\overline{\mathrm{err}} / \hat\sigma_\varepsilon^2 \quad (\text{up to a constant})
$$
  – so
$$
\mathrm{BIC} = \frac{N}{\hat\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\frac{d}{N}\hat\sigma_\varepsilon^2\Big]
$$
  – BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$; since $e^2 \approx 7.4$, for $N > 7.4$ BIC penalizes complex models more heavily, tending to favor simpler models.
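The "2 versus $\log N$" comparison can be made concrete; this short check (my own illustration) shows the crossover at $N > e^2 \approx 7.4$:

```python
import math

# Sketch: BIC replaces AIC's factor 2 with log N, so for N > e^2 (about
# 7.4) the per-parameter BIC penalty is the heavier one.
def aic_penalty(d, N):
    return 2.0 * d / N

def bic_penalty(d, N):
    return math.log(N) * d / N

for N in (7, 8, 100):
    print(N, bic_penalty(1, N) > aic_penalty(1, N))
```

Because $\log N$ keeps growing while AIC's factor stays at 2, BIC selects ever simpler models as the sample grows, which is the source of its consistency property discussed on the next slides.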
Bayesian Model Selection
• BIC derived from Bayesian Model Selection
• Candidate models $\mathcal M_m$, $m = 1, \dots, M$, with model parameters $\theta_m$ and a prior distribution $\Pr(\theta_m \mid \mathcal M_m)$
• Posterior probability:
$$
\Pr(\mathcal M_m \mid \mathbf Z) \propto \Pr(\mathcal M_m)\Pr(\mathbf Z \mid \mathcal M_m)
= \Pr(\mathcal M_m)\int \Pr(\mathbf Z \mid \theta_m, \mathcal M_m)\Pr(\theta_m \mid \mathcal M_m)\, d\theta_m
$$
where $\mathbf Z = \{x_i, y_i\}_{i=1}^N$ represents the training data.
Bayesian Model Selection
• Compare two models $\mathcal M_m$ and $\mathcal M_\ell$:
$$
\frac{\Pr(\mathcal M_m \mid \mathbf Z)}{\Pr(\mathcal M_\ell \mid \mathbf Z)}
= \frac{\Pr(\mathcal M_m)}{\Pr(\mathcal M_\ell)} \cdot \frac{\Pr(\mathbf Z \mid \mathcal M_m)}{\Pr(\mathbf Z \mid \mathcal M_\ell)}
$$
• If the odds are greater than 1, model $\mathcal M_m$ is chosen; otherwise choose model $\mathcal M_\ell$
• Bayes factor:
$$
\mathrm{BF}(\mathbf Z) = \frac{\Pr(\mathbf Z \mid \mathcal M_m)}{\Pr(\mathbf Z \mid \mathcal M_\ell)}
$$
  – The contribution of the data to the posterior odds
Bayesian Model Selection
• If the prior over models is uniform, $\Pr(\mathcal M_m)$ is constant, and a Laplace approximation gives
$$
\log \Pr(\mathbf Z \mid \mathcal M_m) = \log \Pr(\mathbf Z \mid \hat\theta_m, \mathcal M_m) - \frac{d_m}{2}\log N + O(1)
$$
where $\hat\theta_m$ is the maximum-likelihood estimate of the model parameters and $d_m$ is the model dimension.
• Minimizing BIC is therefore equivalent to maximizing the (approximate) posterior model probability.
• Advantage: if the set of candidates contains the true model, then as the sample size tends to infinity, the probability that BIC selects the correct model tends to one.
Minimum Description Length (MDL)
• Origin: optimal coding
• Messages: z1 z2 z3 z4
• Code: 0 10 110 111
• Code 2: 110 10 111 0
• Principle: the most frequent messages use the shortest codes
• Probability of sending message $z_i$: $\Pr(z_i)$
• Shannon's theorem says to use code lengths $l_i = -\log_2 \Pr(z_i)$
Minimum Description Length (MDL)
$$
E(\mathrm{Length}) \ge -\sum_i \Pr(z_i)\log_2 \Pr(z_i)
$$
• Example: with $\Pr(z_i) = 1/2,\ 1/4,\ 1/8,\ 1/8$, equality holds.
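The slide's example can be checked directly: with probabilities that are powers of two, the Shannon code lengths are integers and the expected length meets the entropy bound exactly.

```python
import math

# Check of the slide's example: Pr = (1/2, 1/4, 1/8, 1/8) gives integer
# code lengths -log2 Pr(z_i) = (1, 2, 3, 3), and the expected length
# sum_i Pr(z_i) * l_i attains the entropy lower bound with equality.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [-math.log2(p) for p in probs]
expected_length = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)
print(lengths)
print(expected_length, entropy)   # both equal 1.75 bits
```

This matches the code on the previous slide (0, 10, 110, 111): the most probable message gets the one-bit codeword, and the average cost is 1.75 bits per message.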
Model Selection with MDL
• Model $\mathcal M$ with parameters $\theta$; inputs and outputs $\mathbf Z = (\mathbf X, \mathbf y)$
• Assume the model's outputs have conditional probability $\Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X)$
• Message length:
$$
\mathrm{length} = -\log \Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X) - \log \Pr(\theta \mid \mathcal M)
$$
  – the first term codes the deviation of the model from the actual target values; the second is the average code length for the model parameters.
Model Selection with MDL
• If $y \sim N(\theta, \sigma^2)$ and $\theta \sim N(0, 1)$, then
$$
\mathrm{length} = \mathrm{Const.} + \log\sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}
$$
  – a smaller $\sigma$ gives a shorter message length.
• The MDL principle: choose the model that minimizes the message length
$$
\mathrm{length} = -\log \Pr(\mathbf y \mid \theta, \mathcal M, \mathbf X) - \log \Pr(\theta \mid \mathcal M)
$$
Vapnik-Chervonenkis Dimension
• Question: how should the number of parameters $d$ of a model be chosen?
• This parameter represents the model's complexity
• The VC dimension is an important measure of model complexity
• Example: a family of indicator functions $\{f(x, \alpha)\}$ with parameter vector $\alpha$. The linear indicator function
$$
f(x, \alpha) = I(\alpha_0 + \alpha_1^T x > 0), \qquad x \in \mathbb R^p,
$$
has $p + 1$ parameters. But $f(x, \alpha) = I(\sin(\alpha x) > 0)$, $x \in \mathbb R^1$, has only one parameter; what is its complexity?
VC Dimension
• The VC dimension of the class $\{f(x, \alpha)\}$ is defined as the largest number of points that can be shattered by members of $\{f(x, \alpha)\}$
• The class of lines in the plane has VC dimension 3
• The VC dimension of $\sin(ax)$ is infinite
[Figure: a rapidly oscillating sine curve on $[0, 1]$, illustrating that $I(\sin(ax) > 0)$ can shatter arbitrarily many points.]
VC Dimension
• The VC dimension of a class of real-valued functions $\{g(x, \alpha)\}$ is defined as the VC dimension of the indicator class $\{I(g(x, \alpha) - \beta > 0)\}$
• The VC dimension yields an estimate of the generalization error
• Suppose $\{f(x, \alpha)\}$ has VC dimension $h$ and the sample size is $N$. Then with probability at least $1 - \eta$:
$$
\text{(binary classification)} \quad
\mathrm{Err} \le \overline{\mathrm{err}} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\overline{\mathrm{err}}}{\epsilon}}\Big)
$$
$$
\text{(regression)} \quad
\mathrm{Err} \le \frac{\overline{\mathrm{err}}}{(1 - c\sqrt{\epsilon})_+}
$$
$$
\text{where} \quad \epsilon = a_1 \frac{h\,[\log(a_2 N / h) + 1] - \log(\eta / 4)}{N}
$$
Cross-Validation
• Split the data into K parts (here K = 5); each part serves once as the validation set while the remaining parts form the training set:

1        2        3          4        5
Train    Train    Validate   Train    Train
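The diagram can be sketched in code. This is a minimal illustration of K-fold cross-validation with a toy "model" of my own choosing (the training-set mean), not a procedure from the slides:

```python
# Sketch of K-fold cross-validation: each fold serves once as the
# validation set; the model fit here is just the training mean (a toy
# assumption) to keep the sketch minimal.
def k_fold_cv(data, k=5):
    n = len(data)
    fold_size = n // k
    errors = []
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        pred = sum(train) / len(train)          # "fit" on the training folds
        errors.append(sum((y - pred) ** 2 for y in val) / len(val))
    return sum(errors) / k                      # average validation error

data = [float(i % 10) for i in range(50)]
print(round(k_fold_cv(data, 5), 3))
```

Every observation is used for validation exactly once and for training K−1 times, so the averaged error is an almost-unbiased estimate of the error of a model trained on (K−1)/K of the data.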
Bootstrap Methods
• Basic idea: draw datasets by sampling with replacement from the training data, each dataset the same size as the original training set
• Generate B bootstrap datasets in this way
• How can these datasets be used for prediction?
• Bootstrap error estimate: if $\hat f^{*b}(x_i)$ is the prediction at $x_i$ from the model fit to the $b$-th bootstrap dataset,
$$
\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^B \sum_{i=1}^N L\big(y_i, \hat f^{*b}(x_i)\big)
$$
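The estimate $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ can be sketched directly. The "model" below is again a toy choice of mine (the bootstrap-sample mean) so the sketch stays self-contained:

```python
import random

# Sketch of the naive bootstrap error estimate: draw B datasets with
# replacement, fit on each (here: the sample mean, a toy model chosen
# for illustration), and average the loss over the ORIGINAL points.
random.seed(3)
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B, N = 200, len(data)

err_boot = 0.0
for _ in range(B):
    boot = [random.choice(data) for _ in range(N)]  # sample with replacement
    f_star = sum(boot) / N                          # fit on bootstrap set
    err_boot += sum((y - f_star) ** 2 for y in data) / N
err_boot /= B
print(round(err_boot, 3))
```

Note the weakness the slides lead up to: each bootstrap set overlaps heavily with the original sample (about 63.2% of points on average), so this naive estimate is optimistic, motivating leave-one-out and ".632" variants.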
Bootstrap Methods
• Schematic of the bootstrap process:
[Figure: training sample $\mathbf Z = (z_1, z_2, \dots, z_N)$ at the bottom; bootstrap samples $\mathbf Z^{*1}, \mathbf Z^{*2}, \dots, \mathbf Z^{*B}$ drawn from it by repeated resampling; statistics $S(\mathbf Z^{*1}), S(\mathbf Z^{*2}), \dots, S(\mathbf Z^{*B})$ computed from the replications.]
Summary
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods