Functional Analytic Approach to Model Selection

Masashi Sugiyama ([email protected])
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan

Talk given at the Max Planck Institute for Biological Cybernetics, April 3, 2003
2. Regression Problem

From the training examples $\{(x_i, y_i)\}_{i=1}^{n}$, obtain a learned function $\hat{f}$ that is as close to the target $f$ as possible.

- $f(x)$: learning target function
- $\hat{f}(x)$: learned function
- $\{(x_i, y_i)\}_{i=1}^{n}$: training examples, $y_i = f(x_i) + \epsilon_i$
- $\epsilon_i$: noise
3. Typical Method of Learning

- Linear regression model: $\hat{f}(x) = \sum_{j=1}^{p} \alpha_j \varphi_j(x)$
- Ridge regression: $\min_{\boldsymbol{\alpha}} \big[ \sum_{i=1}^{n} (\hat{f}(x_i) - y_i)^2 + \lambda \|\boldsymbol{\alpha}\|^2 \big]$
- $\{\alpha_j\}_{j=1}^{p}$: parameters to be learned
- $\{\varphi_j(x)\}_{j=1}^{p}$: basis functions
- $\lambda$: ridge parameter
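As a concrete illustration, here is a minimal ridge-regression sketch in Python. The Gaussian basis functions, their number and width, and the noisy sinc data are my own illustrative assumptions, not choices made in the talk.

```python
import numpy as np

def design_matrix(x, centers, width=0.3):
    """Phi[i, j] = phi_j(x_i): Gaussian basis functions (illustrative choice)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def ridge_fit(Phi, y, lam):
    """Ridge regression: solve (Phi^T Phi + lam I) alpha = Phi^T y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)                       # training inputs
y = np.sinc(x) + 0.1 * rng.standard_normal(50)   # noisy sinc target (assumed)
centers = np.linspace(-3, 3, 15)                 # basis-function centers
Phi = design_matrix(x, centers)
alpha = ridge_fit(Phi, y, lam=0.1)               # learned parameters
```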
4. Model Selection

(Figure: fits of the same data with models that are too simple, appropriate, and too complex; target function vs. learned function.)

The choice of the model heavily affects the learned function. (Here, "model" refers to, e.g., the ridge parameter $\lambda$.)
5. Ideal Model Selection

Determine the model such that a certain generalization error $J$ is minimized.

$J$: badness of $\hat{f}$
6. Practical Model Selection

However, the generalization error cannot be directly calculated, since it includes the unknown learning target function $f$.

Instead, determine the model such that an estimator $\hat{J}$ of the generalization error is minimized.

We want an accurate estimator $\hat{J}$ (this is not true for Bayesian model selection using the evidence).
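To make this recipe concrete, the sketch below selects the ridge parameter by minimizing a generalization error estimator over a grid of candidates. I use leave-one-out cross-validation here (an estimator mentioned later in the talk, not SIC itself), exploiting its closed form for linear smoothers; it reuses design_matrix, ridge_fit, Phi, and y from the sketch above.

```python
def loocv_ridge(Phi, y, lam):
    """Closed-form leave-one-out CV error for the ridge smoother."""
    p = Phi.shape[1]
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # hat matrix
    residuals = (y - H @ y) / (1.0 - np.diag(H))                     # LOO residuals
    return np.mean(residuals ** 2)

lams = np.logspace(-4, 2, 30)                       # candidate models
best_lam = min(lams, key=lambda l: loocv_ridge(Phi, y, l))
```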
7. Two Approaches to Estimating the Generalization Error (1)

Interested in typical-case performance: try to obtain unbiased estimators.

- $C_P$ (Mallows, 1973)
- Cross-validation
- Akaike information criterion (Akaike, 1974), etc.
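For instance, here is a sketch of Mallows' $C_P$ for the ridge smoother above, assuming a known noise variance sigma2 (Phi and y follow the earlier sketches):

```python
def mallows_cp(Phi, y, lam, sigma2):
    """C_P: training error plus the complexity correction 2*sigma2*df/n,
    where df is the trace of the hat matrix (effective degrees of freedom)."""
    n, p = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)
    rss = np.sum((y - H @ y) ** 2)
    return rss / n + 2.0 * sigma2 * np.trace(H) / n
```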
8. Two Approaches to Estimating the Generalization Error (2)

Interested in worst-case performance: try to obtain probabilistic upper bounds that hold with probability $1 - \delta$.

- VC-bound (Vapnik & Chervonenkis, 1974)
- Span bound (Chapelle & Vapnik, 2000)
- Concentration bound (Bousquet & Elisseeff, 2001), etc.
9. Popular Choices of Generalization Measure

- Risk: e.g., the expected squared error over test inputs
- Kullback-Leibler divergence between the target density $p$ and the learned density $\hat{p}$
10. Concerns with Existing Methods

- The approximations used often require a large (in the limit, infinite) number of training examples for their justification (asymptotic approximation), so these methods do not work with small samples.
- The generalization measure is integrated over the input distribution from which the training examples are drawn, so these methods cannot be used for transduction (estimating the error at a point of interest).
11. Our Interests

We are interested in:
- Estimating the generalization error with accuracy guaranteed for small (finite) samples
- Estimating the transduction error (the error at a point of interest)
- Investigating the role of unlabeled samples (samples $x_i$ without output values $y_i$)
12
:Functional Hilbert spaceWe assume
Our Generalization MeasureOur Generalization Measure
:Norm in the function space
13. Generalization Measure in a Functional Hilbert Space

A functional Hilbert space is specified by:
- a set of functions which span the space, and
- an inner product (and the induced norm).

Given a set of functions, we can design the inner product (and therefore the generalization measure) as desired.
14. Examples of the Norm

- Weighted distance in the input domain: $\|f\|^2 = \int |f(x)|^2\, w(x)\, dx$, where $w(x)$ is a weight function
- Weighted distance in the Fourier domain: $\|f\|^2 = \int |F(\omega)|^2\, w(\omega)\, d\omega$, where $F$ is the Fourier transform of $f$ and $w(\omega)$ is a weight function
- Sobolev norm: $\|f\|^2 = \sum_{k=0}^{s} \int |f^{(k)}(x)|^2\, dx$, where $f^{(k)}$ is the $k$-th derivative of $f$
15. Interesting Features

- When the weight function is the input density $p(x)$, the measure becomes the prediction error, and unlabeled samples can be used for estimating it.
- For transductive inference (given a test point), the weight can be concentrated at that point.
- For interpolation or extrapolation, the weight function can be chosen to emphasize the desired region.
16. Goal of My Talk

I suppose that you like the generalization measure defined in the functional Hilbert space,

$J = \|\hat{f} - f\|^2$ ($\|\cdot\|$: norm in the function space).

The goal of my talk is to give a method for estimating this generalization error.
17. Function Spaces for Learning

For further discussion, we have to specify the class of function spaces; we want this class to be as unrestrictive as possible.

A general function space such as $L_2$ is not suitable for learning problems, because the value of a function at a single point is not specified in $L_2$: two functions that differ only at one point $x_0$ have different values there, but they are treated as the same function in $L_2$.
18
A function space that is rather general and a value of a function at a point is specified is the reproducing kernel Hilbert space (RKHS).
RKHS has the reproducing kernelFor any fixed ,
is a function of in For any function in and any ,
Reproducing Kernel Hilbert Space
Reproducing Kernel Hilbert Space
:Inner product in the function space
19
Specified RKHS : Fixed
, : Mean , VarianceLinear estimation
e.g., ridge regression for linear model
Formulation of Learning ProblemFormulation of Learning Problem
:Linear operator
:Basis functions in
20. Sampling Operator

For any RKHS $H$, there exists a linear operator $A$ from $H$ to $\mathbb{R}^n$ such that

$A f = (f(x_1), \ldots, f(x_n))^\top$.

Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$, where

- $\otimes$: Neumann-Schatten product; for $g$ and $h$, $(g \otimes h)\, f = \langle f, h \rangle\, g$
- $e_i$: $i$-th standard basis vector in $\mathbb{R}^n$
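A small numeric sketch of the sampling operator acting on a function in the span of Gaussian kernels (the kernel choice and all numbers are illustrative): for $f = \sum_j c_j K(\cdot, z_j)$, $(Af)_i = \langle f, K(\cdot, x_i) \rangle = f(x_i)$, so $A$ acts through a kernel matrix.

```python
import numpy as np

def gauss_kernel(a, b, width=1.0):
    """Gaussian kernel matrix K(a_i, b_j) (illustrative choice)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(1)
z = rng.uniform(-2, 2, 5)          # expansion points of f
c = rng.standard_normal(5)         # coefficients: f = sum_j c_j K(., z_j)
x_pts = np.linspace(-2, 2, 8)      # sampling points x_1, ..., x_n

Af = gauss_kernel(x_pts, z) @ c    # (A f)_i = f(x_i)
```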
21. Formulation

(Diagram: the learning target function $f$ in the RKHS is mapped by the sampling operator $A$ (always linear) into the sample value space $\mathbb{R}^n$, noise is added, and the learning operator $X$ (assumed linear) maps the noisy sample values back to the learned function $\hat{f}$ in the RKHS.)

Generalization error: $\|\hat{f} - f\|^2$
22. Expected Generalization Error

We are interested in the typical performance, so we estimate the expected generalization error over the noise:

$J = E_\epsilon \|\hat{f} - f\|^2$ ($E_\epsilon$: expectation over the noise)

We do not take the expectation over the input points, so the criterion is data-dependent! We do not assume a particular distribution of the input points, which is advantageous in active learning!
23. Bias/Variance Decomposition

$E_\epsilon \|\hat{f} - f\|^2 = \|E_\epsilon \hat{f} - f\|^2 + E_\epsilon \|\hat{f} - E_\epsilon \hat{f}\|^2$ (bias + variance)

The variance term equals $\sigma^2\, \mathrm{tr}(X X^*)$ and is computable ($\sigma^2$: noise variance, $X^*$: adjoint of $X$, $E_\epsilon$: expectation over the noise). We want to estimate the bias!

(Diagram: bias and variance in the RKHS.)
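A Monte Carlo sketch of this decomposition for the ridge estimator from the earlier sketch, measuring the error with a plain squared distance on a dense grid (an assumption standing in for the RKHS norm):

```python
grid = np.linspace(-3, 3, 200)
Phi_grid = design_matrix(grid, centers)      # from the earlier sketch
f_true = np.sinc(grid)

fits = []
for _ in range(500):                         # repeated noise draws
    y_run = np.sinc(x) + 0.1 * rng.standard_normal(x.size)
    fits.append(Phi_grid @ ridge_fit(Phi, y_run, lam=0.1))
fits = np.array(fits)

bias2 = np.mean((fits.mean(axis=0) - f_true) ** 2)   # squared bias
variance = np.mean(fits.var(axis=0))                 # variance
expected_error = np.mean((fits - f_true) ** 2)       # total error
# expected_error matches bias2 + variance up to Monte Carlo error
```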
24. Tricks for Estimating the Bias

Suppose we have a linear operator $X_u$ that gives an unbiased estimate of $f$:

$E_\epsilon X_u y = f$ ($E_\epsilon$: expectation over the noise)

We use $\hat{f}_u = X_u y$ for estimating the bias of $\hat{f}$.

(Diagram: $\hat{f}$, $\hat{f}_u$, and $f$ in the RKHS.)

Sugiyama & Ogawa (Neural Computation, 2001)
25. Unbiased Estimator of the Bias

- Bias: $\|E_\epsilon \hat{f} - f\|^2$
- Rough estimate: $\|\hat{f} - \hat{f}_u\|^2$

Correcting the rough estimate for its own noise contribution gives an unbiased estimator of the bias:

$\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big)$

(Diagram: the estimate in the RKHS.)
26. Subspace Information Criterion (SIC)

SIC = (estimate of the bias) + (variance):

$\mathrm{SIC} = \|\hat{f} - \hat{f}_u\|^2 - \sigma^2\, \mathrm{tr}\big((X - X_u)(X - X_u)^*\big) + \sigma^2\, \mathrm{tr}(X X^*)$

SIC is an unbiased estimator of the generalization error with finite samples:

$E_\epsilon\, \mathrm{SIC} = E_\epsilon \|\hat{f} - f\|^2$ ($E_\epsilon$: expectation over the noise)
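A hedged sketch of SIC for linear regression, assuming (as a simplification of the talk's RKHS setting) that the generalization measure is the squared Euclidean norm on the parameter vector and that the noise variance sigma2 is known. X is the ridge learning matrix, and Xu, the pseudoinverse of the design matrix, plays the role of the unbiased learning operator; Phi and y are from the earlier sketches.

```python
def sic(Phi, y, lam, sigma2):
    """Subspace information criterion sketch in parameter space."""
    p = Phi.shape[1]
    X = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # ridge: alpha_hat = X y
    Xu = np.linalg.pinv(Phi)                                   # unbiased: alpha_u = Xu y
    D = X - Xu
    bias_estimate = np.sum((D @ y) ** 2) - sigma2 * np.trace(D @ D.T)
    variance = sigma2 * np.trace(X @ X.T)
    return bias_estimate + variance

best_lam_sic = min(np.logspace(-4, 2, 30), key=lambda l: sic(Phi, y, l, 0.01))
```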
27. Obtaining the Unbiased Estimate

We need an operator $X_u$ that gives an unbiased estimate $\hat{f}_u$ of the learning target $f$.

- $X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the entire space $H$.
- When this is satisfied, $X_u$ is given by $A^{\dagger}$, the generalized inverse of the sampling operator $A$.
- Then we can enjoy all the features (unlabeled samples, transductive inference, etc.)!
28. Example of Using SIC: Standard Linear Regression

- Learning target function: $f(x) = \sum_{j=1}^{p} \alpha^*_j\, \varphi_j(x)$, where the parameters $\{\alpha^*_j\}$ are unknown
- Regression model: $\hat{f}(x) = \sum_{j=1}^{p} \hat{\alpha}_j\, \varphi_j(x)$, where $\{\hat{\alpha}_j\}$ are estimated linearly (e.g., by ridge regression)
29. Example (cont.)

- Generalization measure: a weighted norm on the functions, designed through a weight function
- If the design matrix has rank $p$ ($p$: number of basis functions), then the best linear unbiased estimator (BLUE) always exists
- In this case, SIC provides an unbiased estimate of the above generalization error
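A quick check of this rank condition for the earlier sketches (under the white-noise assumption, the ordinary least-squares solution via the pseudoinverse is the BLUE):

```python
n_tr, p_dim = Phi.shape
if np.linalg.matrix_rank(Phi) == p_dim:     # design matrix has full column rank
    alpha_blue = np.linalg.pinv(Phi) @ y    # best linear unbiased estimator
```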
30. Applicability of SIC

However, the design matrix has rank $p$ only if $p \le n$ ($p$: number of basis functions, $n$: number of training examples). Therefore, the target function has to be included in a rather small model, and the range of application of SIC is rather limited.
31. When the Unbiased Estimate Does Not Exist

$X_u$ exists if and only if $\{K(\cdot, x_i)\}_{i=1}^{n}$ span the whole space $H$. When this condition is not fulfilled, let us restrict ourselves to finding a learning result function from a subspace $S$, not from the entire RKHS $H$.

(Diagram: the subspace $S$ inside the RKHS $H$.)

Sugiyama & Müller (JMLR, 2002)
32. Essential Generalization Error

With $P$ the orthogonal projection onto the subspace $S$,

$\|\hat{f} - f\|^2 = \|\hat{f} - Pf\|^2 + \|Pf - f\|^2$ (essential + irrelevant constant)

Essentially, we are estimating the projection $Pf$: the target $f$ is just replaced by $Pf$.

(Diagram: the decomposition in the RKHS.)
33
Such exists if and only if the subspace
is included in the span of .
Unbiased Estimate of Projection
Unbiased Estimate of ProjectionIf a linear operator that gives an
unbiased estimate of the projection of the learning target is available, then SIC is an unbiased estimator of the essential generalization error.
e.g., kernel regression model
34. Restriction

However, another restriction arises: if the generalization measure is designed as desired, we have to use the kernel function induced by that generalization measure.
35. Restriction (cont.)

On the other hand, if a desired kernel function is used, then we have to use the generalization measure induced by that kernel; e.g., the generalization measure in a Gaussian RKHS heavily penalizes high-frequency components.
36. Summary of Usage of SIC

SIC essentially has two modes:
- For rather restricted linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error (expected test error), and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure has to be employed.
37. Simulation (1): Setting

- $H$: trigonometric polynomial RKHS, spanned by $\{1, \sin x, \cos x, \ldots, \sin Nx, \cos Nx\}$
- Generalization measure: the norm of this RKHS
- Learning target function $f$: a sinc-like function in $H$
38
Training examples :
Ridge regression is used for learning
Number of training examples:
Noise variance:
Number of trials:
Simulation (1): Setting (cont.)
Simulation (1): Setting (cont.)
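An illustrative reconstruction of this setup, reusing ridge_fit and rng from the earlier sketches (the order N, sample size, and noise level are my assumptions):

```python
def trig_design(x, N):
    """Design matrix for the trigonometric polynomial model of order N."""
    cols = [np.ones_like(x)]
    for k in range(1, N + 1):
        cols += [np.sin(k * x), np.cos(k * x)]
    return np.stack(cols, axis=1)

x_tr = rng.uniform(-np.pi, np.pi, 100)
y_tr = np.sinc(x_tr) + 0.1 * rng.standard_normal(x_tr.size)  # sinc-like target
alpha_tr = ridge_fit(trig_design(x_tr, N=15), y_tr, lam=0.1)
```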
39. Simulation (1-a): Using Unlabeled Samples

We estimate the prediction error using 1000 unlabeled samples.
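A sketch of the role the unlabeled samples play: the $p(x)$-weighted squared norm $\int |g(x)|^2\, p(x)\, dx$ can be approximated by an empirical average over unlabeled inputs drawn from $p(x)$ (the uniform input distribution below is an assumption):

```python
x_unlabeled = rng.uniform(-np.pi, np.pi, 1000)   # unlabeled inputs from p(x)

def weighted_sq_norm(g):
    """Monte Carlo approximation of the p(x)-weighted squared norm of g."""
    return np.mean(g(x_unlabeled) ** 2)
```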
40. Results: Unlabeled Samples

(Figure: criterion values as a function of the ridge parameter $\lambda$. The values can be negative, since some constants are ignored.)

41. Results: Unlabeled Samples

(Figure: further results as a function of the ridge parameter $\lambda$.)
42. Simulation (1-b): Transduction

We estimate the test error at a single test point.
43. Results: Transduction

(Figure: criterion values as a function of the ridge parameter $\lambda$.)

44. Results: Transduction

(Figure: further results as a function of the ridge parameter $\lambda$.)
45
:Gaussian RKHS
Learning target function : sinc function
Training examples :
We estimate
Simulation (2): Infinite Dimensional
RKHS
Simulation (2): Infinite Dimensional
RKHS
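A sketch of kernel ridge regression in a Gaussian RKHS, matching this setting (the kernel width, data, and noise level are illustrative assumptions; gauss_kernel and rng are from the sampling-operator sketch):

```python
def kernel_ridge(x_train, y_train, lam, width=1.0):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    Kmat = gauss_kernel(x_train, x_train, width)
    alpha = np.linalg.solve(Kmat + lam * np.eye(x_train.size), y_train)
    return lambda x_new: gauss_kernel(np.atleast_1d(x_new), x_train, width) @ alpha

x_tr2 = rng.uniform(-np.pi, np.pi, 50)
y_tr2 = np.sinc(x_tr2) + 0.1 * rng.standard_normal(50)
f_hat = kernel_ridge(x_tr2, y_tr2, lam=0.1)
print(f_hat(0.5))   # prediction at a test point
```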
46. Results: Gaussian RKHS

(Figure: criterion values as a function of the ridge parameter $\lambda$.)

47. Results: Gaussian RKHS

(Figure: further results as a function of the ridge parameter $\lambda$.)
48. Simulation (3): DELVE Data Sets

- $H$: Gaussian RKHS
- We choose the ridge parameter by:
  - SIC
  - Leave-one-out cross-validation
  - An empirical Bayesian method (marginal likelihood maximization) (Akaike, 1980)
- Performance is compared by the test error
49. Normalized Test Errors

(Table: normalized test errors on the DELVE data sets; red marks the best or comparable method at the 95% level of a t-test.)
50. Image Restoration

A degraded image is restored by a restoration filter (e.g., a Gaussian filter or a regularization filter) with a tunable parameter. We would like to determine the parameter values appropriately.

(Figure: restorations with too large, too small, and appropriate parameter values.)

Sugiyama et al. (IEICE Trans., 2001); Sugiyama & Ogawa (Signal Processing, 2002)
51. Formulation

(Diagram: the original image in a Hilbert space is mapped by the degradation operator, noise is added to yield the observed image, and the restoration filter maps the observation back to the restored image in the Hilbert space.)
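A toy 1-D analogue of this setup (all concrete choices here are illustrative, not the talk's filters): observed = A f + noise with a blur matrix A, restored with a Tikhonov-type regularization filter whose parameter mu is the quantity one would tune.

```python
import numpy as np

n = 64
t = np.arange(n)
A = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 2.0 ** 2))  # Gaussian blur
A /= A.sum(axis=1, keepdims=True)                             # normalize rows

rng2 = np.random.default_rng(3)
f_orig = (np.abs(t - n / 2) < 10).astype(float)               # original "image"
g = A @ f_orig + 0.01 * rng2.standard_normal(n)               # observed image

mu = 1e-3                                                     # filter parameter
f_rest = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ g)   # restored image
```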
52. Results with the Regularization Filter

(Figure: original images, degraded images, and images restored using SIC.)
53. Precipitation Estimation

Estimating future precipitation from past precipitation and weather radar data. Our method with SIC won the 1st prize in estimation accuracy in the IEICE Precipitation Estimation Contest 2001. (Precipitation and weather radar data from the IEICE Precipitation Estimation Contest 2001.)

Moro & Sugiyama (IEICE General Conf., 2001)

- 1st: TokyoTech, MSE = 0.71
- 2nd: KyuTech, MSE = 0.75
- 3rd: Chiba Univ, MSE = 0.93
- 4th: MSE = 1.18
54. References (Fundamentals of SIC)

- Proposing the concept of SIC: Sugiyama, M. & Ogawa, H. Subspace information criterion for model selection. Neural Computation, vol.13, no.8, pp.1863-1889, 2001.
- Performance evaluation of SIC: Sugiyama, M. & Ogawa, H. Theoretical and experimental evaluation of the subspace information criterion. Machine Learning, vol.48, no.1/2/3, pp.25-50, 2002.
55. References (SIC for Particular Learning Methods)

- SIC for regularization learning: Sugiyama, M. & Ogawa, H. Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, vol.15, no.3, pp.349-361, 2002.
- SIC for sparse regressors: Tsuda, K., Sugiyama, M., & Müller, K.-R. Subspace information criterion for non-quadratic regularizers --- Model selection for sparse regressors. IEEE Transactions on Neural Networks, vol.13, no.1, pp.70-80, 2002.
56. References (Applications of SIC)

- Applying SIC to image restoration: Sugiyama, M., Imaizumi, D., & Ogawa, H. Subspace information criterion for image restoration --- Optimizing parameters in linear filters. IEICE Transactions on Information and Systems, vol.E84-D, no.9, pp.1249-1256, Sep. 2001.
- Sugiyama, M. & Ogawa, H. A unified method for optimizing linear image restoration filters. Signal Processing, vol.82, no.11, pp.1773-1787, 2002.
- Applying SIC to precipitation estimation: Moro, S. & Sugiyama, M. Estimation of precipitation from meteorological radar data. In Proceedings of the 2001 IEICE General Conference, SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29, 2001.
57. References (Extensions of SIC)

- Extending the range of application of SIC: Sugiyama, M. & Müller, K.-R. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, vol.3 (Nov), pp.323-359, 2002.
- Further improving SIC: Sugiyama, M. Improving precision of the subspace information criterion. IEICE Transactions on Fundamentals (to appear).
- Sugiyama, M., Kawanabe, M., & Müller, K.-R. Trading variance reduction with unbiasedness --- The regularized subspace information criterion for robust model selection (submitted).
58. Conclusions

- We formulated the regression problem from a functional analytic point of view.
- Within this framework, we gave a generalization error estimator called the subspace information criterion (SIC).
- The unbiasedness of SIC is guaranteed even with finite samples.
- We did not take the expectation over the training sample points, so SIC may be more data-dependent.
59. Conclusions (cont.)

SIC essentially has two modes:
- For rather restrictive linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error, and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure has to be employed.