Functional Analytic Approach to Model Selection

Masashi Sugiyama ([email protected])
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan

Talk given at the Max Planck Institute for Biological Cybernetics, April 3, 2003
2. Regression Problem

From the training examples $\{(x_i, y_i)\}_{i=1}^{n}$, obtain a learned function $\hat{f}$ that is as close to the target $f$ as possible.

- $f(x)$: learning target function
- $\hat{f}(x)$: learned function
- $\{(x_i, y_i)\}_{i=1}^{n}$: training examples, $y_i = f(x_i) + \epsilon_i$
- $\epsilon_i$: noise
3. Typical Method of Learning

- Linear regression model: $\hat{f}(x) = \sum_{j=1}^{p} \alpha_j \varphi_j(x)$
- Ridge regression: $\min_{\boldsymbol{\alpha}} \big[ \sum_{i=1}^{n} (\hat{f}(x_i) - y_i)^2 + \lambda \|\boldsymbol{\alpha}\|^2 \big]$
- $\{\alpha_j\}_{j=1}^{p}$: parameters to be learned
- $\{\varphi_j(x)\}_{j=1}^{p}$: basis functions
- $\lambda$: ridge parameter
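As a concrete illustration, here is a minimal ridge-regression sketch in Python. The Gaussian basis functions, their number and width, and the noisy sinc data are my own illustrative assumptions, not choices made in the talk.

```python
import numpy as np

def design_matrix(x, centers, width=0.3):
    """Phi[i, j] = phi_j(x_i): Gaussian basis functions (illustrative choice)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def ridge_fit(Phi, y, lam):
    """Ridge regression: solve (Phi^T Phi + lam I) alpha = Phi^T y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)                       # training inputs
y = np.sinc(x) + 0.1 * rng.standard_normal(50)   # noisy sinc target (assumed)
centers = np.linspace(-3, 3, 15)                 # basis-function centers
Phi = design_matrix(x, centers)
alpha = ridge_fit(Phi, y, lam=0.1)               # learned parameters
```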
4. Model Selection

(Figure: fits of the same data with models that are too simple, appropriate, and too complex; target function vs. learned function.)

The choice of the model heavily affects the learned function. (Here, "model" refers to, e.g., the ridge parameter $\lambda$.)
5. Ideal Model Selection

Determine the model such that a certain generalization error $J$ is minimized.

$J$: badness of $\hat{f}$
6. Practical Model Selection

However, the generalization error cannot be directly calculated, since it includes the unknown learning target function $f$.

Instead, determine the model such that an estimator $\hat{J}$ of the generalization error is minimized.

We want an accurate estimator $\hat{J}$ (this is not true for Bayesian model selection using the evidence).
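To make this recipe concrete, the sketch below selects the ridge parameter by minimizing a generalization error estimator over a grid of candidates. I use leave-one-out cross-validation here (an estimator mentioned later in the talk, not SIC itself), exploiting its closed form for linear smoothers; it reuses design_matrix, ridge_fit, Phi, and y from the sketch above.

```python
def loocv_ridge(Phi, y, lam):
    """Closed-form leave-one-out CV error for the ridge smoother."""
    p = Phi.shape[1]
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # hat matrix
    residuals = (y - H @ y) / (1.0 - np.diag(H))                     # LOO residuals
    return np.mean(residuals ** 2)

lams = np.logspace(-4, 2, 30)                       # candidate models
best_lam = min(lams, key=lambda l: loocv_ridge(Phi, y, l))
```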
7. Two Approaches to Estimating the Generalization Error (1)

Interested in typical-case performance: try to obtain unbiased estimators.

- $C_P$ (Mallows, 1973)
- Cross-validation
- Akaike information criterion (Akaike, 1974), etc.
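For instance, here is a sketch of Mallows' $C_P$ for the ridge smoother above, assuming a known noise variance sigma2 (Phi and y follow the earlier sketches):

```python
def mallows_cp(Phi, y, lam, sigma2):
    """C_P: training error plus the complexity correction 2*sigma2*df/n,
    where df is the trace of the hat matrix (effective degrees of freedom)."""
    n, p = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)
    rss = np.sum((y - H @ y) ** 2)
    return rss / n + 2.0 * sigma2 * np.trace(H) / n
```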
8. Two Approaches to Estimating the Generalization Error (2)

Interested in worst-case performance: try to obtain probabilistic upper bounds that hold with probability $1 - \delta$.

- VC-bound (Vapnik & Chervonenkis, 1974)
- Span bound (Chapelle & Vapnik, 2000)
- Concentration bound (Bousquet & Elisseeff, 2001), etc.
9. Popular Choices of Generalization Measure

- Risk: e.g., the expected squared error over test inputs
- Kullback-Leibler divergence between the target density $p$ and the learned density $\hat{p}$
10. Concerns with Existing Methods

- The approximations used often require a large (in the limit, infinite) number of training examples for their justification (asymptotic approximation), so these methods do not work with small samples.
- The generalization measure is integrated over the input distribution from which the training examples are drawn, so these methods cannot be used for transduction (estimating the error at a point of interest).
11. Our Interests

We are interested in:
- Estimating the generalization error with accuracy guaranteed for small (finite) samples
- Estimating the transduction error (the error at a point of interest)
- Investigating the role of unlabeled samples (samples $x_i$ without output values $y_i$)
12
:Functional Hilbert spaceWe assume
Our Generalization MeasureOur Generalization Measure
:Norm in the function space
13. Generalization Measure in a Functional Hilbert Space

A functional Hilbert space is specified by:
- a set of functions which span the space, and
- an inner product (and the induced norm).

Given a set of functions, we can design the inner product (and therefore the generalization measure) as desired.
14. Examples of the Norm

- Weighted distance in the input domain: $\|f\|^2 = \int |f(x)|^2\, w(x)\, dx$, where $w(x)$ is a weight function
- Weighted distance in the Fourier domain: $\|f\|^2 = \int |F(\omega)|^2\, w(\omega)\, d\omega$, where $F$ is the Fourier transform of $f$ and $w(\omega)$ is a weight function
- Sobolev norm: $\|f\|^2 = \sum_{k=0}^{s} \int |f^{(k)}(x)|^2\, dx$, where $f^{(k)}$ is the $k$-th derivative of $f$
15. Interesting Features

- When the weight function is the input density $p(x)$, the measure becomes the prediction error, and unlabeled samples can be used for estimating it.
- For transductive inference (given a test point), the weight can be concentrated at that point.
- For interpolation or extrapolation, the weight function can be chosen to emphasize the desired region.
16. Goal of My Talk

I suppose that you like the generalization measure defined in the functional Hilbert space,

$J = \|\hat{f} - f\|^2$ ($\|\cdot\|$: norm in the function space).

The goal of my talk is to give a method for estimating this generalization error.
17. Function Spaces for Learning

For further discussion, we have to specify the class of function spaces; we want this class to be as unrestrictive as possible.

A general function space such as $L_2$ is not suitable for learning problems, because the value of a function at a single point is not specified in $L_2$: two functions that differ only at one point $x_0$ have different values there, but they are treated as the same function in $L_2$.
18
A function space that is rather general and a value of a function at a point is specified is the reproducing kernel Hilbert space (RKHS).
RKHS has the reproducing kernelFor any fixed ,
is a function of in For any function in and any ,
Reproducing Kernel Hilbert Space
Reproducing Kernel Hilbert Space
:Inner product in the function space
19
Specified RKHS : Fixed
, : Mean , VarianceLinear estimation
e.g., ridge regression for linear model
Formulation of Learning ProblemFormulation of Learning Problem
:Linear operator
:Basis functions in
20. Sampling Operator

For any RKHS $H$, there exists a linear operator $A$ from $H$ to $\mathbb{R}^n$ such that

$A f = (f(x_1), \ldots, f(x_n))^\top$.

Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$, where

- $\otimes$: Neumann-Schatten product; for $g$ and $h$, $(g \otimes h)\, f = \langle f, h \rangle\, g$
- $e_i$: $i$-th standard basis vector in $\mathbb{R}^n$
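A small numeric sketch of the sampling operator acting on a function in the span of Gaussian kernels (the kernel choice and all numbers are illustrative): for $f = \sum_j c_j K(\cdot, z_j)$, $(Af)_i = \langle f, K(\cdot, x_i) \rangle = f(x_i)$, so $A$ acts through a kernel matrix.

```python
import numpy as np

def gauss_kernel(a, b, width=1.0):
    """Gaussian kernel matrix K(a_i, b_j) (illustrative choice)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(1)
z = rng.uniform(-2, 2, 5)          # expansion points of f
c = rng.standard_normal(5)         # coefficients: f = sum_j c_j K(., z_j)
x_pts = np.linspace(-2, 2, 8)      # sampling points x_1, ..., x_n

Af = gauss_kernel(x_pts, z) @ c    # (A f)_i = f(x_i)
```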
21. Formulation

(Diagram: the learning target function $f$ in the RKHS is mapped by the sampling operator $A$ (always linear) into the sample value space $\mathbb{R}^n$, noise is added, and the learning operator $X$ (assumed linear) maps the noisy sample values back to the learned function $\hat{f}$ in the RKHS.)

Generalization error: $\|\hat{f} - f\|^2$
22. Expected Generalization Error

We are interested in the typical performance, so we estimate the expected generalization error over the noise:

$J = E_\epsilon \|\hat{f} - f\|^2$ ($E_\epsilon$: expectation over the noise)

We do not take the expectation over the input points, so the criterion is data-dependent! We do not assume a particular distribution of the input points, which is advantageous in active learning!
23. Bias/Variance Decomposition

$E_\epsilon \|\hat{f} - f\|^2 = \|E_\epsilon \hat{f} - f\|^2 + E_\epsilon \|\hat{f} - E_\epsilon \hat{f}\|^2$ (bias + variance)

The variance term equals $\sigma^2\, \mathrm{tr}(X X^*)$ and is computable ($\sigma^2$: noise variance, $X^*$: adjoint of $X$, $E_\epsilon$: expectation over the noise). We want to estimate the bias!

(Diagram: bias and variance in the RKHS.)
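A Monte Carlo sketch of this decomposition for the ridge estimator from the earlier sketch, measuring the error with a plain squared distance on a dense grid (an assumption standing in for the RKHS norm):

```python
grid = np.linspace(-3, 3, 200)
Phi_grid = design_matrix(grid, centers)      # from the earlier sketch
f_true = np.sinc(grid)

fits = []
for _ in range(500):                         # repeated noise draws
    y_run = np.sinc(x) + 0.1 * rng.standard_normal(x.size)
    fits.append(Phi_grid @ ridge_fit(Phi, y_run, lam=0.1))
fits = np.array(fits)

bias2 = np.mean((fits.mean(axis=0) - f_true) ** 2)   # squared bias
variance = np.mean(fits.var(axis=0))                 # variance
expected_error = np.mean((fits - f_true) ** 2)       # total error
# expected_error matches bias2 + variance up to Monte Carlo error
```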
24. Tricks for Estimating the Bias

Suppose we have a linear operator $X_u$ that gives an unbiased estimate of $f$:

$E_\epsilon X_u y = f$ ($E_\epsilon$: expectation over the noise)

We use $\hat{f}_u = X_u y$ for estimating the bias of $\hat{f}$.

(Diagram: $\hat{f}$, $\hat{f}_u$, and $f$ in the RKHS.)

Sugiyama & Ogawa (Neural Computation, 2001)
25. Unbiased Estimator of the Bias

- Bias: $\|E_\epsilon \hat{f} - f\|^2$
- Rough estimate: $\|\hat{f} - \hat{f}_u\|^2$

Correcting the rough estimate for its own noise contribution gives an unbiased estimator of the bias:

$\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big)$

(Diagram: the estimate in the RKHS.)
26. Subspace Information Criterion (SIC)

SIC = (estimate of the bias) + (variance):

$\mathrm{SIC} = \|\hat{f} - \hat{f}_u\|^2 - \sigma^2\, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big) + \sigma^2\, \mathrm{tr}(X X^*)$

SIC is an unbiased estimator of the generalization error with finite samples:

$E_\epsilon\, \mathrm{SIC} = E_\epsilon \|\hat{f} - f\|^2$ ($E_\epsilon$: expectation over the noise)
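A hedged sketch of SIC for linear regression, assuming (as a simplification of the talk's RKHS setting) that the generalization measure is the squared Euclidean norm on the parameter vector and that the noise variance sigma2 is known. X is the ridge learning matrix, and Xu, the pseudoinverse of the design matrix, plays the role of the unbiased learning operator; Phi and y are from the earlier sketches.

```python
def sic(Phi, y, lam, sigma2):
    """Subspace information criterion sketch in parameter space."""
    p = Phi.shape[1]
    X = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # ridge: alpha_hat = X y
    Xu = np.linalg.pinv(Phi)                                   # unbiased: alpha_u = Xu y
    D = X - Xu
    bias_estimate = np.sum((D @ y) ** 2) - sigma2 * np.trace(D @ D.T)
    variance = sigma2 * np.trace(X @ X.T)
    return bias_estimate + variance

best_lam_sic = min(np.logspace(-4, 2, 30), key=lambda l: sic(Phi, y, l, 0.01))
```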
27. Obtaining the Unbiased Estimate

We need an operator $X_u$ that gives an unbiased estimate $\hat{f}_u$ of the learning target $f$.

- $X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the entire space $H$.
- When this is satisfied, $X_u$ is given by $A^{\dagger}$, the generalized inverse of the sampling operator $A$.
- Then we can enjoy all the features (unlabeled samples, transductive inference, etc.)!
28. Example of Using SIC: Standard Linear Regression

- Learning target function: $f(x) = \sum_{j=1}^{p} \alpha^*_j\, \varphi_j(x)$, where the parameters $\{\alpha^*_j\}$ are unknown
- Regression model: $\hat{f}(x) = \sum_{j=1}^{p} \hat{\alpha}_j\, \varphi_j(x)$, where $\{\hat{\alpha}_j\}$ are estimated linearly (e.g., by ridge regression)
29. Example (cont.)

- Generalization measure: a weighted norm on the functions, designed through a weight function
- If the design matrix has rank $p$ ($p$: number of basis functions), then the best linear unbiased estimator (BLUE) always exists
- In this case, SIC provides an unbiased estimate of the above generalization error
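A quick check of this rank condition for the earlier sketches (under the white-noise assumption, the ordinary least-squares solution via the pseudoinverse is the BLUE):

```python
n_tr, p_dim = Phi.shape
if np.linalg.matrix_rank(Phi) == p_dim:     # design matrix has full column rank
    alpha_blue = np.linalg.pinv(Phi) @ y    # best linear unbiased estimator
```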
30. Applicability of SIC

However, the design matrix has rank $p$ only if $p \le n$ ($p$: number of basis functions, $n$: number of training examples). Therefore, the target function has to be included in a rather small model, and the range of application of SIC is rather limited.
31. When the Unbiased Estimate Does Not Exist

$X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the whole space $H$. When this condition is not fulfilled, let us restrict ourselves to finding a learning result function from a subspace $S$, not from the entire RKHS $H$.

(Diagram: the subspace $S$ inside the RKHS $H$.)

Sugiyama & Müller (JMLR, 2002)
32. Essential Generalization Error

With $P$ the orthogonal projection onto the subspace $S$,

$\|\hat{f} - f\|^2 = \|\hat{f} - Pf\|^2 + \|Pf - f\|^2$ (essential + irrelevant constant)

Essentially, we are estimating the projection $Pf$: the target $f$ is just replaced by $Pf$.

(Diagram: the decomposition in the RKHS.)
33
Such exists if and only if the subspace
is included in the span of .
Unbiased Estimate of Projection
Unbiased Estimate of ProjectionIf a linear operator that gives an
unbiased estimate of the projection of the learning target is available, then SIC is an unbiased estimator of the essential generalization error.
e.g., kernel regression model
34. Restriction

However, another restriction arises: if the generalization measure is designed as desired, we have to use the kernel function induced by that generalization measure.
35. Restriction (cont.)

On the other hand, if a desired kernel function is used, then we have to use the generalization measure induced by that kernel; e.g., the generalization measure in a Gaussian RKHS heavily penalizes high-frequency components.
36. Summary of Usage of SIC

SIC essentially has two modes:
- For rather restricted linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error (expected test error), and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure has to be employed.
37. Simulation (1): Setting

- $H$: trigonometric polynomial RKHS, spanned by $\{1, \sin x, \cos x, \ldots, \sin Nx, \cos Nx\}$
- Generalization measure: the norm of this RKHS
- Learning target function $f$: a sinc-like function in $H$
38
Training examples :
Ridge regression is used for learning
Number of training examples:
Noise variance:
Number of trials:
Simulation (1): Setting (cont.)
Simulation (1): Setting (cont.)
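An illustrative reconstruction of this setup, reusing ridge_fit and rng from the earlier sketches (the order N, sample size, and noise level are my assumptions):

```python
def trig_design(x, N):
    """Design matrix for the trigonometric polynomial model of order N."""
    cols = [np.ones_like(x)]
    for k in range(1, N + 1):
        cols += [np.sin(k * x), np.cos(k * x)]
    return np.stack(cols, axis=1)

x_tr = rng.uniform(-np.pi, np.pi, 100)
y_tr = np.sinc(x_tr) + 0.1 * rng.standard_normal(x_tr.size)  # sinc-like target
alpha_tr = ridge_fit(trig_design(x_tr, N=15), y_tr, lam=0.1)
```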
39. Simulation (1-a): Using Unlabeled Samples

We estimate the prediction error using 1000 unlabeled samples.
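A sketch of the role the unlabeled samples play: the $p(x)$-weighted squared norm $\int |g(x)|^2\, p(x)\, dx$ can be approximated by an empirical average over unlabeled inputs drawn from $p(x)$ (the uniform input distribution below is an assumption):

```python
x_unlabeled = rng.uniform(-np.pi, np.pi, 1000)   # unlabeled inputs from p(x)

def weighted_sq_norm(g):
    """Monte Carlo approximation of the p(x)-weighted squared norm of g."""
    return np.mean(g(x_unlabeled) ** 2)
```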
40. Results: Unlabeled Samples

(Figure: criterion values as a function of the ridge parameter $\lambda$. The values can be negative, since some constants are ignored.)

41. Results: Unlabeled Samples

(Figure: further results as a function of the ridge parameter $\lambda$.)
42. Simulation (1-b): Transduction

We estimate the test error at a single test point.
43. Results: Transduction

(Figure: criterion values as a function of the ridge parameter $\lambda$.)

44. Results: Transduction

(Figure: further results as a function of the ridge parameter $\lambda$.)
45
:Gaussian RKHS
Learning target function : sinc function
Training examples :
We estimate
Simulation (2): Infinite Dimensional
RKHS
Simulation (2): Infinite Dimensional
RKHS
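A sketch of kernel ridge regression in a Gaussian RKHS, matching this setting (the kernel width, data, and noise level are illustrative assumptions; gauss_kernel and rng are from the sampling-operator sketch):

```python
def kernel_ridge(x_train, y_train, lam, width=1.0):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    Kmat = gauss_kernel(x_train, x_train, width)
    alpha = np.linalg.solve(Kmat + lam * np.eye(x_train.size), y_train)
    return lambda x_new: gauss_kernel(np.atleast_1d(x_new), x_train, width) @ alpha

x_tr2 = rng.uniform(-np.pi, np.pi, 50)
y_tr2 = np.sinc(x_tr2) + 0.1 * rng.standard_normal(50)
f_hat = kernel_ridge(x_tr2, y_tr2, lam=0.1)
print(f_hat(0.5))   # prediction at a test point
```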
46. Results: Gaussian RKHS

(Figure: criterion values as a function of the ridge parameter $\lambda$.)

47. Results: Gaussian RKHS

(Figure: further results as a function of the ridge parameter $\lambda$.)
48. Simulation (3): DELVE Data Sets

- $H$: Gaussian RKHS
- We choose the ridge parameter by:
  - SIC
  - Leave-one-out cross-validation
  - An empirical Bayesian method (marginal likelihood maximization) (Akaike, 1980)
- Performance is compared by the test error
49. Normalized Test Errors

(Table: normalized test errors on the DELVE data sets; red marks the best or comparable method at the 95% level of a t-test.)
50. Image Restoration

A degraded image is restored by a restoration filter (e.g., a Gaussian filter or a regularization filter) with a tunable parameter. We would like to determine the parameter values appropriately.

(Figure: restorations with too large, too small, and appropriate parameter values.)

Sugiyama et al. (IEICE Trans., 2001); Sugiyama & Ogawa (Signal Processing, 2002)
51. Formulation

(Diagram: the original image in a Hilbert space is mapped by the degradation operator, noise is added to yield the observed image, and the restoration filter maps the observation back to the restored image in the Hilbert space.)
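A toy 1-D analogue of this setup (all concrete choices here are illustrative, not the talk's filters): observed = A f + noise with a blur matrix A, restored with a Tikhonov-type regularization filter whose parameter mu is the quantity one would tune.

```python
import numpy as np

n = 64
t = np.arange(n)
A = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 2.0 ** 2))  # Gaussian blur
A /= A.sum(axis=1, keepdims=True)                             # normalize rows

rng2 = np.random.default_rng(3)
f_orig = (np.abs(t - n / 2) < 10).astype(float)               # original "image"
g = A @ f_orig + 0.01 * rng2.standard_normal(n)               # observed image

mu = 1e-3                                                     # filter parameter
f_rest = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ g)   # restored image
```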
52. Results with the Regularization Filter

(Figure: original images, degraded images, and images restored using SIC.)
53. Precipitation Estimation

Estimating future precipitation from past precipitation and weather radar data. Our method with SIC won the 1st prize in estimation accuracy in the IEICE Precipitation Estimation Contest 2001. (Precipitation and weather radar data from the IEICE Precipitation Estimation Contest 2001.)

Moro & Sugiyama (IEICE General Conf., 2001)

- 1st: TokyoTech, MSE = 0.71
- 2nd: KyuTech, MSE = 0.75
- 3rd: Chiba Univ, MSE = 0.93
- 4th: MSE = 1.18
54. References (Fundamentals of SIC)

- Proposing the concept of SIC: Sugiyama, M. & Ogawa, H. Subspace information criterion for model selection. Neural Computation, vol.13, no.8, pp.1863-1889, 2001.
- Performance evaluation of SIC: Sugiyama, M. & Ogawa, H. Theoretical and experimental evaluation of the subspace information criterion. Machine Learning, vol.48, no.1/2/3, pp.25-50, 2002.
55. References (SIC for Particular Learning Methods)

- SIC for regularization learning: Sugiyama, M. & Ogawa, H. Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, vol.15, no.3, pp.349-361, 2002.
- SIC for sparse regressors: Tsuda, K., Sugiyama, M., & Müller, K.-R. Subspace information criterion for non-quadratic regularizers --- Model selection for sparse regressors. IEEE Transactions on Neural Networks, vol.13, no.1, pp.70-80, 2002.
56. References (Applications of SIC)

- Applying SIC to image restoration: Sugiyama, M., Imaizumi, D., & Ogawa, H. Subspace information criterion for image restoration --- Optimizing parameters in linear filters. IEICE Transactions on Information and Systems, vol.E84-D, no.9, pp.1249-1256, Sep. 2001.
- Sugiyama, M. & Ogawa, H. A unified method for optimizing linear image restoration filters. Signal Processing, vol.82, no.11, pp.1773-1787, 2002.
- Applying SIC to precipitation estimation: Moro, S. & Sugiyama, M. Estimation of precipitation from meteorological radar data. In Proceedings of the 2001 IEICE General Conference, SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29, 2001.
57. References (Extensions of SIC)

- Extending the range of application of SIC: Sugiyama, M. & Müller, K.-R. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, vol.3 (Nov), pp.323-359, 2002.
- Further improving SIC: Sugiyama, M. Improving precision of the subspace information criterion. IEICE Transactions on Fundamentals (to appear).
- Sugiyama, M., Kawanabe, M., & Müller, K.-R. Trading variance reduction with unbiasedness --- The regularized subspace information criterion for robust model selection (submitted).
58. Conclusions

- We formulated the regression problem from a functional analytic point of view.
- Within this framework, we gave a generalization error estimator called the subspace information criterion (SIC).
- The unbiasedness of SIC is guaranteed even with finite samples.
- We did not take the expectation over the training sample points, so SIC may be more data-dependent.
59. Conclusions (cont.)

SIC essentially has two modes:
- For rather restrictive linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error, and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure has to be employed.