April 3, 2003, Max Planck Institute for Biological Cybernetics
Functional Analytic Approach to Model Selection
Masashi Sugiyama ([email protected])
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan


Transcript
Page 1: Functional Analytic Approach to Model Selection

April 3, 2003, Max Planck Institute for Biological Cybernetics

Functional Analytic Approach to Model Selection

Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan

Masashi Sugiyama ([email protected])

Page 2: Regression Problem

From the training examples $\{(x_i, y_i)\}_{i=1}^{n}$, obtain a learned function $\hat{f}$ that is as close to the target $f$ as possible.

$f(x)$: learning target function
$\hat{f}(x)$: learned function
$\{(x_i, y_i)\}_{i=1}^{n}$: training examples, $y_i = f(x_i) + \epsilon_i$
$\epsilon_i$: noise

Page 3: Typical Method of Learning

Linear regression model: $\hat{f}(x) = \sum_{p=1}^{b} \alpha_p \varphi_p(x)$

Ridge regression: $\min_{\{\alpha_p\}} \left[ \sum_{i=1}^{n} \big(\hat{f}(x_i) - y_i\big)^2 + \lambda \sum_{p=1}^{b} \alpha_p^2 \right]$

$\{\alpha_p\}_{p=1}^{b}$: parameters to be learned
$\{\varphi_p(x)\}_{p=1}^{b}$: basis functions
$\lambda$: ridge parameter
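The following Python sketch illustrates this ridge-regression setup; the Gaussian basis, the sinc-like target, and all numbers are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, lam = 50, 10, 0.1                        # samples, basis functions, ridge parameter

x = rng.uniform(-3, 3, n)
y = np.sinc(x) + 0.1 * rng.standard_normal(n)  # y_i = f(x_i) + eps_i (hypothetical target)

centers = np.linspace(-3, 3, b)                # Gaussian basis phi_p(x) = exp(-(x - c_p)^2)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2)   # design matrix Phi[i, p] = phi_p(x_i)

# Ridge estimate: alpha = argmin ||Phi a - y||^2 + lam * ||a||^2
alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(b), Phi.T @ y)

def f_hat(t):
    """Learned function f_hat(t) = sum_p alpha_p phi_p(t)."""
    t = np.atleast_1d(t)
    return np.exp(-(t[:, None] - centers[None, :]) ** 2) @ alpha
```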

Page 4: Model Selection

[Figure: target function and learned function under three models: too simple, appropriate, too complex.]

The choice of the model heavily affects the learned function.
(Here "model" refers to, e.g., the ridge parameter $\lambda$.)

Page 5: Ideal Model Selection

Determine the model such that a certain generalization error $J$, measuring the "badness" of $\hat{f}$, is minimized.

Page 6: Practical Model Selection

However, the generalization error cannot be directly calculated, since it includes the unknown learning target function $f$.

Instead, determine the model such that an estimator $\hat{J}$ of the generalization error is minimized.

We want an accurate estimator $\hat{J}$ (this is not the aim of Bayesian model selection using the evidence).

Page 7: Two Approaches to Estimating the Generalization Error (1)

Try to obtain unbiased estimators:
• C_P (Mallows, 1973)
• Cross-validation
• Akaike Information Criterion (Akaike, 1974), etc.

This approach is interested in typical-case performance.
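As a concrete instance of the cross-validation entry above, here is a minimal leave-one-out sketch reusing `Phi`, `y`, `lam`, and `b` from the ridge example; the closed form $(y_i - \hat{y}_i)/(1 - H_{ii})$ for the leave-one-out residual of a linear smoother is a standard result, everything else is an illustrative assumption.

```python
import numpy as np

# For a linear smoother y_hat = H y (here: ridge regression with fixed lam),
# the leave-one-out residual has the closed form (y_i - y_hat_i) / (1 - H_ii).
H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(b), Phi.T)  # hat matrix
loo_residuals = (y - H @ y) / (1 - np.diag(H))
cv_score = np.mean(loo_residuals ** 2)   # cross-validation estimate of prediction error
```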

Page 8: Two Approaches to Estimating the Generalization Error (2)

Try to obtain probabilistic upper bounds, holding with probability $1 - \delta$:
• VC bound (Vapnik & Chervonenkis, 1974)
• Span bound (Chapelle & Vapnik, 2000)
• Concentration bound (Bousquet & Elisseeff, 2001), etc.

This approach is interested in worst-case performance.

Page 9: Popular Choices of the Generalization Measure

Risk: the expected loss of $\hat{f}$, e.g., the expected squared error.

Kullback-Leibler divergence: $\mathrm{KL}(q \parallel \hat{p}) = \int q(x) \log \frac{q(x)}{\hat{p}(x)} \, dx$

$q$: target density
$\hat{p}$: learned density

Page 10: Concerns in Existing Methods

The approximations used often require a large (infinite) number of training examples for their justification (asymptotic approximation): they do not work with small samples.

The generalization measure is integrated over the input distribution from which the training examples are drawn: they cannot be used for transduction (estimating the error at a point of interest).

Page 11: Our Interests

We are interested in:
• estimating the generalization error with accuracy guaranteed for small (finite) samples,
• estimating the transduction error (the error at a point of interest),
• investigating the role of unlabeled samples (samples without output values).

Page 12: Our Generalization Measure

$J = \|\hat{f} - f\|^2$

We assume $f, \hat{f} \in H$.

$H$: functional Hilbert space
$\|\cdot\|$: norm in the function space

Page 13: Generalization Measure in a Functional Hilbert Space

A functional Hilbert space is specified by:
• a set of functions which span the space, and
• an inner product (and norm).

Given a set of functions, we can design the inner product (and therefore the generalization measure) as desired.

Page 14: Examples of the Norm

Weighted distance in the input domain: $\|f\|^2 = \int |f(x)|^2 \, w(x) \, dx$

Weighted distance in the Fourier domain: $\|f\|^2 = \int |F(\omega)|^2 \, W(\omega) \, d\omega$

Sobolev norm: $\|f\|^2 = \sum_{k} \int |f^{(k)}(x)|^2 \, dx$

$w(x)$, $W(\omega)$: weight functions
$F$: Fourier transform of $f$
$f^{(k)}$: $k$-th derivative of $f$

Page 15: Interesting Features

When the weight function is chosen as the input density, $w(x) = p(x)$, we can use unlabeled samples for estimating $\|\hat{f} - f\|^2$.

For transductive inference (given a test point $x^*$): $w(x) = \delta(x - x^*)$.

For interpolation and extrapolation: choose the desired weight function over the region of interest.

$w(x)$: weight function

Page 16: Goal of My Talk

I suppose that you like the generalization measure defined in the functional Hilbert space.

The goal of my talk is to give a method for estimating the generalization error $\|\hat{f} - f\|^2$.

$\|\cdot\|$: norm in the function space

Page 17: Function Spaces for Learning

For further discussion, we have to specify the class of function spaces. We want the class to be as unrestrictive as possible.

A general function space such as $L_2$ is not suitable for learning problems, because the value of a function at a point is not specified in $L_2$: two functions that differ only at a single point $x_0$ have different values at $x_0$, but they are treated as the same function in $L_2$.

Page 18: Reproducing Kernel Hilbert Space

A function space that is rather general and in which the value of a function at a point is specified is the reproducing kernel Hilbert space (RKHS).

An RKHS $H$ has a reproducing kernel $K(x, x')$:
• For any fixed $x'$, $K(x, x')$ is a function of $x$ in $H$.
• For any function $f$ in $H$ and any $x'$, $\langle f(\cdot), K(\cdot, x') \rangle = f(x')$.

$\langle \cdot, \cdot \rangle$: inner product in the function space
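A minimal numerical check of the reproducing property, under the illustrative assumption of a finite-dimensional RKHS whose basis is orthonormal, so that the inner product is the Euclidean inner product of coefficient vectors:

```python
import numpy as np

# Orthonormal basis phi_p(t); any f = sum_p a_p phi_p is a coefficient vector a.
phi = lambda t: np.array([np.cos(p * t) for p in range(5)])
a = np.array([1.0, -0.5, 0.3, 0.0, 2.0])       # some function f in the space

x0 = 0.7
k_x0 = phi(x0)          # coefficients of K(., x0) = sum_p phi_p(x0) phi_p(.)
print(a @ k_x0)         # <f, K(., x0)> ...
print(a @ phi(x0))      # ... reproduces the point value f(x0): same number
```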

Page 19: Formulation of the Learning Problem

Specified RKHS $H$: fixed.
Noise: mean $0$, variance $\sigma^2$.
Linear estimation: $\hat{f} = X y$, e.g., ridge regression for the linear model.

$X$: linear operator
$\{\varphi_p\}$: basis functions in $H$

Page 20: Sampling Operator

For any RKHS $H$, there exists a linear operator $A$ from $H$ to $\mathbb{R}^n$ such that

$A f = \big(f(x_1), f(x_2), \ldots, f(x_n)\big)^\top$.

Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$.

$\otimes$: Neumann-Schatten product; for vectors, $(u \otimes v)\, h = \langle h, v \rangle\, u$
$e_i$: $i$-th standard basis vector in $\mathbb{R}^n$
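Continuing the orthonormal finite-dimensional sketch above, the sampling operator becomes an ordinary matrix whose $i$-th row holds the coefficients of $K(\cdot, x_i)$; the sample points are hypothetical.

```python
import numpy as np

x_train = np.array([0.1, 0.5, 1.2])            # hypothetical sample points
A = np.stack([phi(xi) for xi in x_train])      # row i = coefficients of K(., x_i)
print(A @ a)                                   # equals (f(x_1), f(x_2), f(x_3))
print(np.array([a @ phi(xi) for xi in x_train]))  # same values, computed pointwise
```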

Page 21: Formulation

[Diagram: the learning target function $f$ in the RKHS is mapped by the sampling operator $A$ (always linear) into the sample value space $\mathbb{R}^n$; noise is added; the learning operator $X$ (assumed linear) maps the samples back to the learned function $\hat{f}$ in the RKHS; the generalization error is measured between $\hat{f}$ and $f$.]

Page 22: Expected Generalization Error

We are interested in typical performance, so we estimate the expected generalization error over the noise:

$\mathbb{E}_\epsilon \|\hat{f} - f\|^2$

We do not take the expectation over the input points, so the estimate is data-dependent! We also do not assume a particular distribution of the input points, which is advantageous in active learning.

$\mathbb{E}_\epsilon$: expectation over the noise

Page 23: Bias/Variance Decomposition

$\mathbb{E}_\epsilon \|\hat{f} - f\|^2 = \underbrace{\|\mathbb{E}_\epsilon \hat{f} - f\|^2}_{\text{bias}} + \underbrace{\mathbb{E}_\epsilon \|\hat{f} - \mathbb{E}_\epsilon \hat{f}\|^2}_{\text{variance}}$

The variance term equals $\sigma^2 \, \mathrm{tr}(X X^*)$ and is computable, so we want to estimate the bias!

$\sigma^2$: noise variance
$\mathbb{E}_\epsilon$: expectation over the noise
$X^*$: adjoint of $X$

Page 24: Tricks for Estimating the Bias

Suppose we have a linear operator $X_u$ that gives an unbiased estimate $\hat{f}_u = X_u y$ of $f$:

$\mathbb{E}_\epsilon \hat{f}_u = f$

We use $\hat{f}_u$ for estimating the bias of $\hat{f}$.

$\mathbb{E}_\epsilon$: expectation over the noise

Sugiyama & Ogawa (Neural Computation, 2001)

Page 25: Unbiased Estimator of the Bias

Bias: $\|\mathbb{E}_\epsilon \hat{f} - f\|^2$
Rough estimate: $\|\hat{f} - \hat{f}_u\|^2$

Since $\mathbb{E}_\epsilon \|\hat{f} - \hat{f}_u\|^2 = \|\mathbb{E}_\epsilon \hat{f} - f\|^2 + \sigma^2 \, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big)$, subtracting the trace term from the rough estimate gives an unbiased estimator of the bias.

Page 26: Subspace Information Criterion (SIC)

$\mathrm{SIC} = \underbrace{\|\hat{f} - \hat{f}_u\|^2 - \sigma^2 \, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big)}_{\text{estimate of bias}} + \underbrace{\sigma^2 \, \mathrm{tr}(X X^*)}_{\text{variance}}$

SIC is an unbiased estimator of the generalization error with finite samples:

$\mathbb{E}_\epsilon \, \mathrm{SIC} = \mathbb{E}_\epsilon \|\hat{f} - f\|^2$

$\mathbb{E}_\epsilon$: expectation over the noise

Page 27: Obtaining the Unbiased Estimate

We need an $X_u$ that gives an unbiased estimate of the learning target $f$.

$X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the entire space $H$. When this is satisfied, $X_u$ is given by $A^\dagger$, the generalized inverse of the sampling operator $A$.

Then we can enjoy all the features! (Unlabeled samples, transductive inference, etc.)

$A^\dagger$: generalized inverse
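Putting the SIC formula of Page 26 together with $X_u = A^\dagger$, here is a hedged Python sketch in the finite-dimensional setting of the earlier examples, where operators are matrices acting on coefficient vectors and the norm is Euclidean; the noise variance is assumed known, and `sic` is a hypothetical helper name.

```python
import numpy as np

def sic(Phi, y, lam, sigma2):
    """SIC for ridge regression with design matrix Phi (assumed full column rank),
    computed in parameter space under a Euclidean generalization measure."""
    b = Phi.shape[1]
    X  = np.linalg.solve(Phi.T @ Phi + lam * np.eye(b), Phi.T)  # learning operator
    Xu = np.linalg.pinv(Phi)      # X_u = A^+ : gives an unbiased parameter estimate
    D = X - Xu
    bias_est = np.sum((X @ y - Xu @ y) ** 2) - sigma2 * np.trace(D @ D.T)
    variance = sigma2 * np.trace(X @ X.T)
    return bias_est + variance
```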

Page 28: Example of Using SIC: Standard Linear Regression

Learning target function: $f(x) = \sum_{p=1}^{b} \alpha_p^* \varphi_p(x)$, where the true parameters $\{\alpha_p^*\}$ are unknown.

Regression model: $\hat{f}(x) = \sum_{p=1}^{b} \hat{\alpha}_p \varphi_p(x)$, where $\{\hat{\alpha}_p\}$ are estimated linearly (e.g., by ridge regression).

Page 29: Example (cont.)

Generalization measure: a weighted norm $\|\hat{f} - f\|^2$ with weight function $w(x)$.

If the design matrix has rank $b$, then the best linear unbiased estimator (BLUE) always exists. In this case, SIC provides an unbiased estimate of the above generalization error.

$b$: number of basis functions
$w(x)$: weight function

Page 30: Applicability of SIC

However, the design matrix can have rank $b$ only if $b \le n$. Therefore, the target function should be included in a rather small model, and the range of application of SIC is rather limited.

$b$: number of basis functions
$n$: number of training examples

Page 31: When the Unbiased Estimate Does Not Exist

$X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the whole space $H$. When this condition is not fulfilled, let us restrict ourselves to finding a learning result function from a subspace $S$, not from the entire RKHS $H$.

Sugiyama & Müller (JMLR, 2002)

Page 32: Essential Generalization Error

$\|\hat{f} - f\|^2 = \underbrace{\|\hat{f} - P f\|^2}_{\text{essential}} + \underbrace{\|P f - f\|^2}_{\text{irrelevant (constant)}}$

where $P$ is the orthogonal projection onto the subspace $S$. Essentially, we are estimating the projection $P f$: in the previous discussion, $f$ is just replaced by $P f$.

Page 33: Unbiased Estimate of the Projection

If a linear operator $X_u$ that gives an unbiased estimate of the projection $P f$ of the learning target is available, then SIC is an unbiased estimator of the essential generalization error.

Such an $X_u$ exists if and only if the subspace $S$ is included in the span of $\{K(\cdot, x_i)\}_{i=1}^{n}$; this holds, e.g., for the kernel regression model $\hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i K(x, x_i)$.

Page 34: Restriction

However, another restriction arises: if the generalization measure is designed as desired, we have to use the kernel function induced by the generalization measure.

Page 35: Restriction (cont.)

On the other hand, if a desired kernel function is used, then we have to use the generalization measure induced by the kernel.

E.g., the generalization measure in a Gaussian RKHS heavily penalizes high-frequency components.

Page 36: Summary of the Usage of SIC

SIC essentially has two modes:

For rather restricted linear regression, SIC has several interesting properties:
• Unlabeled samples can be utilized for estimating the prediction error (expected test error).
• Any weighted error measure can be used, e.g., interpolation, extrapolation, transductive inference.

For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure should be employed.

Page 37: Simulation (1): Setting

$H$: trigonometric polynomial RKHS
Span: (trigonometric polynomial basis)
Generalization measure: (norm in $H$)
Learning target function $f$: a sinc-like function in $H$

Page 38: Simulation (1): Setting (cont.)

Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i = f(x_i) + \epsilon_i$

Ridge regression is used for learning.

Number of training examples: $n$
Noise variance: $\sigma^2$
Number of trials:

Page 39: Simulation (1-a): Using Unlabeled Samples

We estimate the prediction error using 1000 unlabeled samples.
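A sketch of how unlabeled inputs can define the prediction-error metric, extending the hypothetical `sic` helper above: the weighted norm $\frac{1}{m}\sum_j g(\tilde{x}_j)^2$ becomes a quadratic form $c^\top U c$ on coefficient vectors. The names and setup are illustrative assumptions.

```python
import numpy as np

def sic_unlabeled(Phi, y, lam, sigma2, Phi_u):
    """SIC under the norm induced by unlabeled inputs: ||g||^2 = mean_j g(x_tilde_j)^2.
    Phi_u[j, p] = phi_p(x_tilde_j) for the m unlabeled points."""
    U = Phi_u.T @ Phi_u / Phi_u.shape[0]       # metric matrix from unlabeled inputs
    b = Phi.shape[1]
    X  = np.linalg.solve(Phi.T @ Phi + lam * np.eye(b), Phi.T)
    Xu = np.linalg.pinv(Phi)
    d = X @ y - Xu @ y
    D = X - Xu
    bias_est = d @ U @ d - sigma2 * np.trace(U @ D @ D.T)
    variance = sigma2 * np.trace(U @ X @ X.T)
    return bias_est + variance
```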

Page 40: Results: Unlabeled Samples

[Figure: estimated and true errors plotted against the ridge parameter $\lambda$.]

Values can be negative since some constants are ignored.

$\lambda$: ridge parameter

Page 41: Results: Unlabeled Samples (cont.)

[Figure: results plotted against the ridge parameter $\lambda$.]

$\lambda$: ridge parameter

Page 42: Simulation (1-b): Transduction

We estimate the test error at a single test point $x^*$.

Page 43: Results: Transduction

[Figure: results plotted against the ridge parameter $\lambda$.]

$\lambda$: ridge parameter

Page 44: Results: Transduction (cont.)

[Figure: results plotted against the ridge parameter $\lambda$.]

$\lambda$: ridge parameter

Page 45: Simulation (2): Infinite-Dimensional RKHS

$H$: Gaussian RKHS
Learning target function $f$: the sinc function
Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$

We estimate the essential generalization error.

Page 46: Results: Gaussian RKHS

[Figure: results plotted against the ridge parameter $\lambda$.]

$\lambda$: ridge parameter

Page 47: Results: Gaussian RKHS (cont.)

[Figure: results plotted against the ridge parameter $\lambda$.]

$\lambda$: ridge parameter

Page 48: Simulation (3): DELVE Data Sets

$H$: Gaussian RKHS

We choose the ridge parameter by:
• SIC
• Leave-one-out cross-validation
• An empirical Bayesian method (marginal likelihood maximization) (Akaike, 1980)

Performance is compared by the test error. A model-selection loop of this kind is sketched below.
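For concreteness, choosing the ridge parameter by minimizing SIC over a grid might look as follows, reusing the hypothetical `sic` function and the `Phi`, `y` from the earlier ridge sketch; the grid and noise variance are illustrative.

```python
import numpy as np

lams = np.logspace(-3, 2, 30)                   # candidate ridge parameters
scores = [sic(Phi, y, l, sigma2=0.01) for l in lams]
best_lam = lams[int(np.argmin(scores))]         # model selected by SIC
```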

Page 49: Normalized Test Errors

[Table: normalized test errors of the three methods on the DELVE data sets.]

Red: best or comparable (95% t-test)

Page 50: Image Restoration

[Figure: a degraded image is processed by a restoration filter (e.g., a Gaussian filter or a regularization filter) whose parameter is too small, appropriate, or too large.]

We would like to determine the parameter values appropriately.

Sugiyama et al. (IEICE Trans., 2001); Sugiyama & Ogawa (Signal Processing, 2002)

Page 51: Formulation

[Diagram: the original image in a Hilbert space is mapped by the degradation operator to the observed image (with added noise), and the restoration filter maps the observed image to the restored image in the Hilbert space.]

Page 52: Results with the Regularization Filter

[Figure: original images, degraded images, and images restored using SIC.]

Page 53: Precipitation Estimation

Estimating future precipitation from past precipitation and weather radar data. Our method with SIC won first prize in estimation accuracy in the IEICE Precipitation Estimation Contest 2001.

Precipitation and weather radar data from the IEICE Precipitation Estimation Contest 2001.

Moro & Sugiyama (IEICE General Conf., 2001)

1st: TokyoTech, MSE = 0.71
2nd: KyuTech, MSE = 0.75
3rd: Chiba Univ, MSE = 0.93
4th: MSE = 1.18

Page 54: References (Fundamentals of SIC)

Proposing the concept of SIC:
Sugiyama, M. & Ogawa, H. Subspace information criterion for model selection. Neural Computation, vol. 13, no. 8, pp. 1863-1889, 2001.

Performance evaluation of SIC:
Sugiyama, M. & Ogawa, H. Theoretical and experimental evaluation of the subspace information criterion. Machine Learning, vol. 48, no. 1/2/3, pp. 25-50, 2002.

Page 55: References (SIC for Particular Learning Methods)

SIC for regularization learning:
Sugiyama, M. & Ogawa, H. Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, vol. 15, no. 3, pp. 349-361, 2002.

SIC for sparse regressors:
Tsuda, K., Sugiyama, M., & Müller, K.-R. Subspace information criterion for non-quadratic regularizers --- Model selection for sparse regressors. IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 70-80, 2002.

Page 56: References (Applications of SIC)

Applying SIC to image restoration:
Sugiyama, M., Imaizumi, D., & Ogawa, H. Subspace information criterion for image restoration --- Optimizing parameters in linear filters. IEICE Transactions on Information and Systems, vol. E84-D, no. 9, pp. 1249-1256, Sep. 2001.
Sugiyama, M. & Ogawa, H. A unified method for optimizing linear image restoration filters. Signal Processing, vol. 82, no. 11, pp. 1773-1787, 2002.

Applying SIC to precipitation estimation:
Moro, S. & Sugiyama, M. Estimation of precipitation from meteorological radar data. In Proceedings of the 2001 IEICE General Conference, SD-1-10, pp. 264-265, Shiga, Japan, Mar. 26-29, 2001.

Page 57: References (Extensions of SIC)

Extending the range of application of SIC:
Sugiyama, M. & Müller, K.-R. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, vol. 3 (Nov), pp. 323-359, 2002.

Further improving SIC:
Sugiyama, M. Improving precision of the subspace information criterion. IEICE Transactions on Fundamentals (to appear).
Sugiyama, M., Kawanabe, M., & Müller, K.-R. Trading variance reduction with unbiasedness --- The regularized subspace information criterion for robust model selection (submitted).

Page 58: Conclusions

We formulated the regression problem from a functional analytic point of view. Within this framework, we gave a generalization error estimator called the subspace information criterion (SIC). Unbiasedness of SIC is guaranteed even with finite samples. We did not take the expectation over the training sample points, so SIC may be more data-dependent.

Page 59: Conclusions (cont.)

SIC essentially has two modes:

For rather restrictive linear regression, SIC has several interesting properties:
• Unlabeled samples can be utilized for estimating the prediction error.
• Any weighted error measure can be used, e.g., interpolation, extrapolation, transductive inference.

For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure should be employed.