Unsupervised Kernel Regression
for Nonlinear Dimensionality Reduction
Diploma thesis at the Faculty of Technology (Technische Fakultät)
of Universität Bielefeld
February 2003
Roland Memisevic
Supervisors:
Prof. Helge Ritter
Universität Bielefeld
AG Neuroinformatik
Universitätsstr. 25
33615 Bielefeld
Dr. Peter Meinicke
Universität Bielefeld
AG Neuroinformatik
Universitätsstr. 25
33615 Bielefeld
Contents
1. Introduction
   1.1. Dimensionality Reduction
   1.2. Nonlinear Dimensionality Reduction / Overview
   1.3. Conventions and Notation
2. Unsupervised Learning as Generalized Regression
   2.1. Conventional Regression
   2.2. Unsupervised Regression
   2.3. Optimization
   2.4. Projection Models
        2.4.1. Principal Axes
        2.4.2. Principal Curves
        2.4.3. Principal Points
        2.4.4. Local Principal Axes
   2.5. Generative Models
3. Spectral Methods for Dimensionality Reduction
   3.1. Linear Models
        3.1.1. Principal Component Analysis
        3.1.2. Multidimensional Scaling
   3.2. Nonlinear Models
        3.2.1. Locally Linear Embedding
        3.2.2. Isomap
        3.2.3. Kernel PCA
4. Unsupervised Kernel Regression
   4.1. Unsupervised Nonparametric Regression
        4.1.1. The Nadaraya Watson Estimator
        4.1.2. Unsupervised Nonparametric Regression
   4.2. Observable Space Error Minimization
        4.2.1. Optimization
        4.2.2. Experiments
   4.3. Latent Space Error Minimization
        4.3.1. Optimization
        4.3.2. Choosing the Observable Space Kernel Bandwidth
   4.4. Combination of Latent and Observable Space Error Minimization
        4.4.1. Optimization
        4.4.2. Experiments
5. Applications
   5.1. Visualization
   5.2. Pattern detection
   5.3. Pattern production
6. Conclusions
A. Generation of the Toy Datasets
1. Introduction
This thesis investigates the recently proposed method of 'Unsupervised Kernel Regression' (UKR). The theoretical placement of this method as a means to pursue Nonlinear Dimensionality Reduction (NLDR) is analyzed, the technicalities involved with its practical implementation are inspected, and its applicability to real world problems is explored.
The UKR method stems from the area of Machine Learning, which is concerned with the development of algorithms that discover patterns in data. More specifically, relying on the key idea of letting a system learn from examples what is important for a specific task, methods are being developed within this area that help to classify, detect, manipulate and produce patterns in a wide range of domains. These methods are not only of increasing practical interest, as they may be used to cope with the ever growing amount of data available in electronic form today, but also play an important role in the area of Artificial Intelligence, where they act as models of those mechanisms that lie at the heart of our own cognitive capabilities.

The methods developed in the area of Machine Learning can be divided into two broad classes. Those belonging to the first are concerned with generalizing knowledge presented by means of examples to new, unseen data. Inspired by biological learning, where any kind of knowledge residing in the examples needs to be pointed out by a teacher who corrects and thereby adjusts the learning system, the respective area is generally referred to as Supervised Learning (SL). It is contrasted and complemented by Unsupervised Learning (UL), which aims at developing systems that, in the absence of prior knowledge, automatically discover meaningful information hidden in the example
data. These methods thereby serve the goal of accounting for the variability in the data and of providing alternative representations. Besides providing a kind of preprocessing that is often crucial to simplify a subsequent SL task, these systems give rise to many further applications, some of which will be sketched throughout. It is this second class of methods that UKR belongs to.

The two main tasks of UL can be defined as dimensionality reduction and density estimation. The UKR method may be cast in two distinct ways, leading to a variant that pursues dimensionality reduction and a second variant that includes some kind of density estimation. This thesis examines only the first variant; brief references to the closely related second method will be made at the appropriate places.
1.1. Dimensionality Reduction
Generally, methods in dimensionality reduction discover more compact representations of their input data, while at the same time trying to keep the usually resulting information loss at a minimum. These methods thereby minimize the storage space or bandwidth required for saving or transmitting the data, which gives rise to some of their most widespread applications. Additionally, the resulting representations are often expected to capture the meaning inherent in the data more explicitly. This provides the basis for applications such as denoising and visualization, and has also led to an increasing interest in these methods within computational neuroscience, where it is common ground that many higher level cognitive capabilities are not possible without some kind of dimensionality reduction.

The basis for the reasoning adopted in a large body of methods in Machine Learning in general, and virtually exclusively in dimensionality reduction, is to represent the input and output data of these methods as sets of real valued vectors. For some given set of input vectors, dimensionality reduction then amounts to computing an equally sized set of output vectors of lower dimensionality that fits the meaning of the input data set as closely as possible. Since the lower dimensional representations can in many circumstances be thought of as representing the 'real' or 'original' meaning of the data more closely than the (often noisy) input data, they will formally be denoted by the letter x, whereas
the vectors that are the input to the respective algorithm will be written y, although the converse notation can often be found in the literature.

For a given set of N input vectors y_1, \dots, y_N \in \mathbb{R}^d, the problem of dimensionality reduction is then defined as that of finding a corresponding set of output vectors x_1, \dots, x_N \in \mathbb{R}^q, a mapping f: \mathbb{R}^q \to \mathbb{R}^d and a mapping g: \mathbb{R}^d \to \mathbb{R}^q such that, for all i \in \{1, \dots, N\},

    g(y_i) = x_i    (1.1)
    f(x_i) = \hat{y}_i \approx y_i    (1.2)

(see e.g. [CG01]). The function f will also be referred to as 'forward mapping' or 'coding function' and g as 'backward mapping' or 'decoding function.' The output variables are usually referred to as 'scores' or 'features' in the literature. Here, the latter will additionally be denoted 'latent space realizations,' or simply 'latent variables,' with reference to chapter 2. Using common Machine Learning jargon, the process of determining the scores and estimating models for the involved functions from the input data will be referred to as training, and the set of input data as training data. Depending on the application at hand, not all of the three sub-problems of dimensionality reduction necessarily need to be solved. If, for example, the aim of the dimensionality reduction task is to obtain a representation amenable to convenient visualization of a given dataset, only this representation – in other words, only a set of suitable scores – is needed. If the purpose is noise reduction or any kind of pattern detection, models for both the coding and decoding function become necessary in addition. If any kind of pattern production is aimed at, only a model for f is demanded.

In those cases in which a model for f or g is needed, the concept of generalization
the concept of generalization
becomes crucial. The problem of generalization in general refers
the fact that usu-
ally only a finite sized training data set is available for
adapting a model, which shall
afterwards be applied to new data not in this set. The
optimization of the expected per-
formance on the unseen data is the actual objective of the
learning task. As a means to
avoid adapting to noise present in the input data, which
obviates good generalization and
is usually referred to as overfitting, generally the
’flexibility’ of the model is restrained
by some kind of complexity control. The determination of a suitable complexity control, referred to as model selection in the literature, is then conducted along with the training of the actual model and has to make use of the available data in some way. A widely used approach to model selection, which plays an important role in this thesis, is cross validation. It denotes the method of partitioning the available data into training and test sets, using the first for adapting a model of some specific complexity and the second for assessing the resulting performance and adjusting the complexity as required. An iteration of this procedure using different training set/test set partitionings may be used to improve the reliability of the overall outcome. In the special case of the training set comprising N - 1 elements and the test set a single element, giving rise to N train/test iterations, this is referred to as leave-one-out cross validation.
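As a rough illustration (an addition, not part of the original text), the leave-one-out procedure can be sketched in a few lines of Python; the toy data and the ridge-regularized linear model used as the complexity-controlled model are arbitrary, hypothetical choices:

    import numpy as np

    def loo_cv_error(X, y, fit, predict):
        """Leave-one-out cross validation: N train/test splits, each holding out
        a single element; returns the mean squared test error."""
        N = len(y)
        errors = []
        for i in range(N):
            train = np.arange(N) != i          # boolean mask: all but element i
            model = fit(X[train], y[train])    # adapt the model on N-1 elements
            errors.append((predict(model, X[i:i + 1])[0] - y[i]) ** 2)
        return np.mean(errors)

    # Example: choose a complexity parameter (here the ridge penalty of a
    # linear model); data and model are hypothetical.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(30, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

    def make_fit(lam):
        def fit(Xtr, ytr):
            A = np.hstack([Xtr, np.ones((len(Xtr), 1))])
            return np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ ytr)
        return fit

    def predict(w, Xte):
        A = np.hstack([Xte, np.ones((len(Xte), 1))])
        return A @ w

    for lam in [1e-3, 1e-1, 1e1]:
        print(lam, loo_cv_error(X, y, make_fit(lam), predict))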
Generally, the notion of generalization is closely connected to the notion of a test set and the error that a trained model gives rise to on this test set. Since in UL a test set only contains input elements, the term generalization here usually concerns some projection error in the input space. However, following [RS00], in this thesis the term generalization will be used in a somewhat broader sense and will denote the general problem of applying the forward or the backward mapping to new input or output elements.
1.2. Nonlinear Dimensionality Reduction / Overview
The oldest and best understood method for dimensionality reduction is Principal Component Analysis (PCA), which is based on the spectral decomposition of the data covariance matrix, as described in detail below. As a linear model, PCA has some important advantages over many of the nonlinear models discussed in this thesis, in particular with regard to generalization.

Many nonlinear generalizations of PCA have been proposed. A broad class of these nonlinear models, which will be referred to as 'projection models' in the following and which in particular capture some of the generalization properties of PCA, can be represented in a unified way within the 'Generalized Regression Framework.' This framework also
forms the starting point for the derivation of UKR. This general framework for UL will be delineated in chapter 2. The second class of methods in UL – those that include some kind of density estimation as described above – will be referred to as 'generative models' in the following. These models are also captured within this framework. Since the generative part of the generalized regression framework also forms the starting point for a generative variant of UKR (which is, however, not the subject of this thesis), it will also be sketched in chapter 2.

Besides these methods there is a second broad class of methods, proposed within the last years, which are not captured within this general framework and might therefore be conceived of as heuristic methods. Virtually all of these rely on a spectral decomposition of some data proximity matrix, which in particular allows for an efficient computation of latent space realizations of the input data. As these nonlinear spectral methods, in contrast to PCA, do not allow for a straightforward estimation of the involved mappings, they have specific problems with regard to generalization. These kinds of methods will be described in detail in chapter 3.

The UKR method for NLDR can be posed in two ways, giving rise to models belonging to the 'projection' and to the 'spectral' class, respectively. Both will be described in chapter 4. Chapter 4 will also present practical considerations regarding their implementation in detail and illustrate them on several data sets. In addition, issues regarding generalization will be dealt with in some detail, and in particular an approach that combines the generalization capability of the projection models with the efficiency of the spectral models will be proposed.

In chapter 5 some applications of UKR as a method for nonlinear dimensionality reduction will be presented, focusing in particular on the different prospects arising from the latent space realizations, the applicability of f and the applicability of g.

Chapter 6 gives a review and summary of the issues dealt with in this thesis.
1.3. Conventions and Notation
In the following, scalar and vector valued variables will be denoted by lowercase italic letters, e.g. x. Matrices will be denoted by uppercase italic letters, e.g. X. Only real valued vectors and matrices will be used. The set of d-dimensional input vectors y_i and the set of q-dimensional output vectors x_i, i = 1, \dots, N, will be represented by the d \times N matrix Y and by the q \times N matrix X, respectively.

A vector of ones will be denoted 1, a vector of zeros 0. In the cases where these notations are used, the dimensionality of these vectors will always be obvious from the context. The mean of some dataset, represented for example by the matrix Y, is defined as \frac{1}{N} Y 1. The dataset Y will be said to be mean centered if it holds that \frac{1}{N} Y 1 = 0.
2. Unsupervised Learning as
Generalized Regression
The UKR method investigated in this thesis arises naturally as a nonparametric instance of the 'Generalized Regression' framework for Unsupervised Learning [Mei00]. In this chapter this framework is briefly reviewed, an overview of the two kinds of models it gives rise to – Projection Models and Generative Models – is given, and it is shown how some of the known algorithms in Unsupervised Learning are reflected within this framework.
2.1. Conventional Regression
The purpose of regression is to model a functional relationship between random variables that is assumed to be made up of a systematic part and some unpredictable, additive noise. More precisely, let f denote a function that maps an input random vector x \in \mathbb{R}^q onto an output random vector y \in \mathbb{R}^d, and let \epsilon \in \mathbb{R}^d denote some (zero mean) random vector that corresponds to the unpredictable noise. The relationship to be modeled is then assumed to be given by

    y = f(x) + \epsilon .    (2.1)
To this end an approximation f^* of f is chosen from a set of candidate functions \mathcal{F} with respect to the objective of minimizing the expected prediction error:

    f^* = \arg\min_{f \in \mathcal{F}} E\left[ \| y - f(x) \|^2 \right] .    (2.2)

Under appropriate circumstances the function f^* that satisfies this demand can be shown to be given by the conditional expectation [HTF01]:

    f^*(x) = E[\, y \mid x \,] .    (2.3)

It is referred to as the regression function and its range as the regression manifold. In practice, f^* needs to be estimated from finite data sets of the input and output variables. One way this can be achieved is by replacing 2.2 with the empirical prediction error

    \hat{E}(f) = \sum_{i=1}^{N} \| y_i - f(x_i) \|^2    (2.4)

and minimizing this functional with respect to the function parameters. The input variables are in this case practically treated as fixed parameters instead of as random vectors.
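As a rough illustration (an addition, not from the original text), the following Python sketch minimizes the empirical prediction error 2.4 for a function class that is linear in a fixed vector of basis functions; the polynomial basis, the toy data and all names are hypothetical choices:

    import numpy as np

    def phi(x):
        # hypothetical basis: polynomial features of a scalar input
        return np.array([1.0, x, x ** 2, x ** 3])

    def fit_W(xs, ys):
        """Return W minimizing sum_i ||y_i - W phi(x_i)||^2 (a least-squares fit)."""
        Phi = np.stack([phi(x) for x in xs])            # N x m design matrix
        Y = np.asarray(ys)                              # N x d outputs
        W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)     # m x d coefficients
        return W.T                                      # d x m, so y ~ W phi(x)

    rng = np.random.default_rng(1)
    xs = rng.uniform(-1, 1, 50)
    ys = np.stack([np.array([np.sin(2 * x), np.cos(2 * x)])
                   + 0.05 * rng.standard_normal(2) for x in xs])
    W = fit_W(xs, ys)
    print(W @ phi(0.3), np.sin(0.6), np.cos(0.6))       # prediction vs. noise-free values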
Another option is to take into consideration a parameterized model p(y \mid x; \theta) of the conditional probability density function of the output variables, given the input variables, for some parameter vector \theta. Under the assumption that the sample elements are drawn independently and all from the same distribution, the product density specifies the sample probability density, so that the estimation of f can be realized by maximizing it with respect to the function parameters. In practice, equivalently, the negative logarithm

    L(\theta) = - \sum_{i=1}^{N} \ln p(y_i \mid x_i; \theta)    (2.5)

is minimized, which gives rise to a learning principle generally referred to as maximum likelihood.
Yet another option, which will not be used any further in this thesis but is mentioned here for the sake of completeness, is to regard the parameters as random variables themselves and learning as an update of their probability distribution, known as Bayesian learning [HS89].
2.2. Unsupervised Regression
The task of UL can be approached by utilizing a modified version of the regression model. As detailed in [Mei00] the difference then lies in the usage of the input variables. In the supervised case regression amounts to the estimation of a functional relationship utilizing a sample set of the input and their related output variable realizations. In UL the input variable realizations are conceived of as missing and therefore need to be estimated together with the functional relationship. The distinction can be expressed by referring to the input variables in an unsupervised setting as latent variables¹.

¹ The term latent variable has a longstanding history in statistical modeling and is closely related to the way it is used here. The informal definition given here is completely sufficient and self-contained with regard to the way this term will be used in the following.

With regard to the practical learning task, two important differences to the supervised case arise from the use of latent variables, the first of them affecting the definition of the learning problem: since the input variables are not given in advance, one has to decide on a suitable domain for them. For that purpose several distinctions of the type of latent variables have to be taken into account, all leading to different types of regression manifolds. An important distinction is between deterministic and random latent variables, leading to models referred to as projection models and generative models, respectively. Another distinction of the latent variable types is between continuous and discrete ones. Together with the option of choosing a class of candidate functions, where in particular the distinction between linear and nonlinear functions is of interest, the two dimensions along which a classification of the latent variables is possible allow for the formulation of a wide spectrum of models of UL known from the literature. Some of these will be sketched in the subsequent sections.
The second novelty of unsupervised regression as compared to the supervised case
regards the necessity to make use of some kind of learning scheme. Since the more ambitious goal of finding latent variable realizations in addition to parameters defining a suitable functional relationship needs to be tackled here, one has to conceive of a way to accomplish these tasks simultaneously. For deterministic latent variables a generally applicable approach to achieve this twofold objective is the 'Projection-Regression scheme.' It is obtained from an iteration of a 'Projection' step, used to find optimal values for the latent variables given some values for the function parameters, and a 'Regression' step, used to re-estimate the function parameters while the values for the latent variables are kept constant. This optimization scheme can be thought of as a deterministic analog to the well known EM-algorithm, which can be applied in the case of a generative model. Both projection and generative models, the use of their respective learning schemes, and examples of their applications will be described in detail below.
2.3. Optimization
As indicated above, within the Generalized Regression framework learning will in general be achieved by an iteration of a minimization of 2.4 or 2.5 with regard to the function parameters and an update of the latent variable realizations. However, the presence of a usually very large number of parameters in UL, owing to the fact that the latent variables need to be estimated here as well, often causes the respective objective functions to be fraught with local minima. Therefore, unless a closed form solution for the concerned functions exists, the success of the learning task depends crucially on the initialization or – if no indication regarding auspicious areas in parameter space is available – on the use of an optimization strategy that helps to avoid, or at least to diminish, the chance of getting trapped in a local minimum.

The well-known method of Simulated Annealing tries to achieve this by allowing random influences to act on the parameter updates in an iterative optimization process, which reduces the chance of getting stuck in a local minimum. Gradually reducing these random influences, called annealing by analogy to the temperature controlled process of crystal growing, can then cause the probability that the global minimum of the objective function is actually reached to approach one asymptotically, in the limit
of an infinitely slow annealing schedule.
An alternative strategy, which will be applied in this thesis, in particular for the approach described in section 4.2, is the method of Homotopy. This strategy is based on a set of transformations of the original error function into simpler or smoother functions with a smaller number of local minima. Minimization of the original function is then performed by starting with the simplest function present and gradually reducing the degree of smoothing during minimization until the original error function is recovered. Often, a suitable transformation arises automatically from the need to impose a complexity control. Homotopy in this case amounts to gradually releasing the constraints that the complexity control poses. This is in particular the case for the UKR model, as shown later.
2.4. Projection Models
By using deterministic latent variables one obtains the class of models that is of particular concern with respect to dimensionality reduction and therefore of special interest for the methods described in this thesis. The latent variables are in this case treated formally as parameters that need to be estimated along with the function parameters. Since the backward mapping is modeled via some kind of projection, these models are generally referred to as Projection Models. In detail, this means that the score that corresponds to an observable data space element is given by its projection index, which is formally defined as that latent space element that yields a minimal reconstruction error under f. The dependency on a particular model for f is often symbolized by writing g_f for the backward mapping.

In the following the function class will be restricted to contain functions of the form

    f(x) = W \phi(x)    (2.6)

with parameter matrix W and \phi being a vector of basis functions to be specified beforehand. The two aforementioned optimization steps (Projection and Regression step) are
then given by

    x_i = \arg\min_{x} \| y_i - W \phi(x) \|^2 ,    i = 1, \dots, N    (2.7)

and

    W = \arg\min_{W'} \sum_{i=1}^{N} \| y_i - W' \phi(x_i) \|^2 .    (2.8)
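As a rough illustration of how 2.7 and 2.8 can be alternated in practice, the following Python sketch (an addition; the Gaussian-bump basis, the grid search used for the projection step and the toy data are arbitrary assumptions) fits a one-dimensional curve to noisy two-dimensional data:

    import numpy as np

    def phi(x, centers=np.linspace(0.0, 1.0, 10), width=0.15):
        # illustrative basis: Gaussian bumps on the unit interval
        return np.exp(-(x - centers) ** 2 / (2 * width ** 2))

    def projection_step(Y, W, grid=np.linspace(0.0, 1.0, 200)):
        """For each data point pick the latent value minimizing ||y - W phi(x)||^2."""
        F = W @ np.stack([phi(g) for g in grid]).T          # d x |grid| candidate points
        d2 = ((Y[:, :, None] - F[None, :, :]) ** 2).sum(axis=1)
        return grid[np.argmin(d2, axis=1)]                  # one latent value per data point

    def regression_step(Y, X):
        """Re-estimate W by least squares for fixed latent values X."""
        Phi = np.stack([phi(x) for x in X])                 # N x m design matrix
        W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        return W.T                                          # d x m

    rng = np.random.default_rng(2)
    t = rng.uniform(0, 1, 100)
    Y = np.stack([np.cos(3 * t), np.sin(3 * t)], axis=1) + 0.05 * rng.standard_normal((100, 2))

    X = rng.uniform(0, 1, 100)                              # random initialization of the latent values
    W = regression_step(Y, X)
    for _ in range(20):                                     # alternate the two steps
        X = projection_step(Y, W)
        W = regression_step(Y, X)
    print("mean reconstruction error:",
          np.mean(np.sum((Y - np.stack([W @ phi(x) for x in X])) ** 2, axis=1)))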
Note that in terms of the mappings f and g involved in NLDR, optimization using this procedure only directly concerns f, while an optimal g is rather 'plugged in' instead of being adapted. In [Mal98] the importance of this proceeding is pointed out in a comparison of Principal Curves [HS89], which also have this property, and so called Autoassociative Neural Networks (see [Kra91]), which do not. In particular, the presence of so called ambiguity points, which cause the projection index defined as above to be a discontinuous function, generally lets the latter variant, where the backward mapping is a continuous function, fail to correctly approximate the given dataset. The importance of this finding for the UKR method resides in the fact that this method can be posed in two distinct ways. The first straightforwardly gives rise to a (nonparametric) projection model similar to those described in this chapter, in particular with the backward mapping defined as proposed above, while the second variant does not and can empirically be shown to be flawed accordingly. The conclusions to be drawn from this finding will be detailed in chapter 4.

As stated above, several methods for NLDR known from the literature may be formalized within the framework described in this section by varying the latent variable types and the class of candidate functions. In the following, some of the possible decisions on the latent variables and candidate functions, and the algorithms they give rise to, shall be sketched.
2.4.1. Principal Axes
By defining the latent variable domain to be \mathbb{R}^q, i.e. using (deterministic) continuous latent variables, and restricting the function class to linear functions,

    f(x) = W x ,    W \in \mathbb{R}^{d \times q} ,    (2.9)

the resulting learning method is essentially equal to Principal Component Analysis. Although the Projection-Regression scheme could be applied, the special, linear structure in this case gives rise to a closed form solution. Specifically, this is obtained through an eigenvalue decomposition of the input sample covariance matrix and will therefore be described in more detail in chapter 3, where other, in particular recently developed, nonlinear methods with this special property shall also be delineated.
2.4.2. Principal Curves
Restriction of the latent variable domain to a closed interval [a, b] on the real line, with f defined as

    f(x) = W \phi(x) ,    x \in [a, b] ,    (2.10)

gives rise to nonlinear (one-dimensional) principal manifolds, or 'principal curves.' In fact, this way a generalization of the Principal Curves model proposed by [KKLZ00], where principal curves are modeled by polygonal line segments, is achieved. The restriction of the latent variable domain is necessary here, as in this nonlinear case the absence of such a restriction would result in an interpolation of the training data points.
2.4.3. Principal Points
By using discrete latent variables and defining

    f(x) = W \phi(x) ,    x \in \{1, \dots, m\} ,    (2.11)
    \phi_j(x) = \delta_{jx} ,    j = 1, \dots, m ,    (2.12)

one straightforwardly obtains a learning method generally known as Vector Quantization. The columns of W then represent prototype or codebook vectors, the estimation of which using the general optimization scheme equals the well known K-means clustering algorithm.
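To make the correspondence concrete, here is a minimal Python sketch (an addition; the data and the number of prototypes are arbitrary) in which the projection step reduces to picking the nearest prototype and the regression step to averaging the assigned points:

    import numpy as np

    def kmeans(Y, m=3, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        W = Y[rng.choice(len(Y), m, replace=False)]         # initialize prototypes from the data
        for _ in range(iters):
            d2 = ((Y[:, None, :] - W[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)                      # projection step: nearest prototype
            for j in range(m):                              # regression step: update each prototype
                if np.any(assign == j):
                    W[j] = Y[assign == j].mean(axis=0)
        return W, assign

    rng = np.random.default_rng(3)
    Y = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in [(0, 0), (2, 0), (1, 2)]])
    W, assign = kmeans(Y)
    print(np.round(W, 2))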
2.4.4. Local Principal Axes
The use of a mixture of discrete and continuous latent variables, together with a linear dependency on the continuous variables, can be interpreted as a generalization of the vector quantization approach described above. The principal points are in this case replaced by continuous linear manifolds. This way one obtains a method for nonlinear dimensionality reduction that is known as 'Local PCA' in the literature and that can be used to model nonlinear relationships by resorting to the assumption of local linearity.

The absence of a global coordinate system inherent to this approach, however, involves serious shortcomings, as described, for example, in [TdSL00]. In fact, the LLE method, which will be introduced in 3.2.1, originally arose from a series of attempts to provide a coordination of locally linear models in order to overcome these shortcomings (see also [RS00] and [VVK02], e.g.).
2.5. Generative Models
The kinds of models in UL that include some kind of density estimation are referred to as generative models [Mei00]. These models arise automatically from the generalized regression framework by regarding the latent variables as random variables with nontrivial distributions. Since the UKR model can be formulated in a way that also yields a generative variant, these kinds of models shall be briefly sketched here.

If one uses random latent variables, optimization by minimization of 2.4 is no longer possible, as the interpretation of the latent variables as parameters no longer applies. Instead, the maximum likelihood approach (or some kind of Bayesian learning, which
is omitted in this thesis, as stated) needs to be used. Furthermore, the Projection-Regression optimization scheme is no longer applicable. It is replaced by an analogous scheme, known as the EM-algorithm, as mentioned above. A detailed description of this algorithm shall be omitted here, as there exists a large body of literature on this topic (see, e.g., [Bil97] or [Cou96]). In short, the resulting optimization scheme closely resembles the PR-scheme, with the update of the latent variable realizations being replaced by an update of their probability distribution. In the following a few models and some novelties that arise from the use of random latent variables will be sketched.

Assuming spherical Gaussian noise and a linear dependency on Gaussian latent variables, one obtains a generative counterpart of the linear model given in 2.4.1. For a predefined latent space dimensionality q, the generative version yields an equal solution. However, as an extension to the projection model, by making use of the special role the noise variance plays in the generative case, a dimensionality estimating version can be obtained by predefining the noise variance instead.

For non-Gaussian latent variables, an estimation scheme for the well-known Independent Component Analysis ([Hyv99]) can easily be derived, making use of the nontrivial latent variable distribution by incorporating the assumption that the latent variables are statistically independent.

As a generative version of the principal points model (2.4.3), a method for density estimation generally known as 'mixture of Gaussians' arises straightforwardly from a (spherical) Gaussian noise assumption. In addition, probabilistic versions of the clustering algorithm and of the local PCA model become possible, for example.
3. Spectral Methods for
Dimensionality Reduction
One broad class of methods for dimensionality reduction that differs from most of the approaches delineated in the previous chapter is the class of spectral methods. The main difference is that these methods do not deploy any iterative optimization scheme. Instead, they rely on an objective function that has an efficiently computable global optimum. The by far most important and widely used instance is Principal Component Analysis. Recent developments regarding the applicability of spectral methods to nonlinear learning problems, however, have led to a rise in their popularity, too.

The point of contact for practically all these methods is that they rely on an optimality criterion that can be posed as a quadratic form. Therefore, these methods rely on some variant of the Rayleigh-Ritz theorem, which states, in short, that the minimizer (maximizer) of \mathrm{tr}(U^T Q U) with respect to the N \times q matrix U, subject to U^T U = I, is the matrix U = V_q containing the q eigenvectors corresponding to the q smallest (largest) eigenvalues of the symmetric matrix Q (see, e.g., [HJ94] or [Jol86]). In other words, all these methods rely on an eigenvalue decomposition (EVD) of some matrix Q, hence the name 'spectral methods.' A further commonality is that Q is in all cases defined as some kind of data affinity matrix of a given dataset, as recently pointed out by [BH02].

In the following the different spectral approaches to dimensionality reduction will be described and their specific characteristics, in particular in view of the spectral UKR variant to be introduced in 4.3, shall be pointed out, beginning with the linear variants
and PCA. A common drawback of the nonlinear generalizations of PCA is that, although they inherit the efficiency of their linear counterpart, they do not share the same merits in terms of generalization, as will be described in detail later. This entails particular problems with regard to potential applications, as pointed out in chapter 1.
3.1. Linear Models
3.1.1. Principal Component Analysis
Principal Component Analysis (PCA) can be regarded as the oldest and most well-known method for dimensionality reduction¹. It can be derived from the objective of maximizing the variance of the projection of a given d-dimensional dataset onto a q-dimensional subspace. The quadratic form corresponding to this objective is simply the (symmetric and positive definite) sample covariance matrix ([Jol86]). Precisely, assuming Y to be mean centered, let the spectral decomposition of the sample covariance matrix be given by \frac{1}{N} Y Y^T = V \Lambda V^T and let V_q denote the matrix containing the normalized eigenvectors corresponding to the q largest eigenvalues as columns. The matrix X of the latent space vectors that meet the maximal variance objective is then given by X = V_q^T Y. Similarly, generalization to some new observable space element y is performed simply by left-multiplication with V_q^T, while the application of the forward mapping to some new latent space element x is performed by left-multiplication with V_q. In other words, here one has the unique case of a method that gives rise to the direct estimation of the involved functions, as the latent space realizations are in fact obtained subsequently by applying the obtained model for g to the given input dataset.

¹ Although the derivation of PCA draws from much broader (but related) objectives, here it will be treated as a method for dimensionality reduction only.
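As a brief illustration (an addition; the toy data are arbitrary), the following Python sketch computes the quantities described above, with data points stored as the columns of a d x N matrix:

    import numpy as np

    def pca(Y, q):
        Yc = Y - Y.mean(axis=1, keepdims=True)            # mean centering
        C = Yc @ Yc.T / Y.shape[1]                        # d x d sample covariance
        evals, V = np.linalg.eigh(C)                      # eigenvalues in ascending order
        Vq = V[:, ::-1][:, :q]                            # eigenvectors of the q largest eigenvalues
        X = Vq.T @ Yc                                     # q x N latent scores (backward mapping)
        return Vq, X

    rng = np.random.default_rng(4)
    Y = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.1]])   # columns are data points
    Vq, X = pca(Y, q=2)
    Y_rec = Vq @ X + Y.mean(axis=1, keepdims=True)        # forward mapping applied to the scores
    print("mean reconstruction error:", np.mean(np.sum((Y - Y_rec) ** 2, axis=0)))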
3.1.2. Multidimensional Scaling
The method of Multidimensional Scaling (MDS) deviates somewhat from the other methods for dimensionality reduction exposed in this thesis, because it is not used as a method to determine a low dimensional representation of a high dimensional dataset,
but obtains some dissimilarity measure as input instead. In fact, it thereby fails to meet the general definition given in chapter 1. It will nevertheless be described here in short, as it is generally classified as a method for dimensionality reduction in the literature and makes up a crucial step of the Isomap algorithm portrayed below.
In detail, MDS addresses the problem of finding a (usually 'low' dimensional) data set from a set of pairwise dissimilarities, such that the distances between the resulting data points approximate the dissimilarities as closely as possible. Given a dissimilarity matrix² D with an EVD D = V \Lambda V^T, the optimal data set is given by X = \Lambda_q^{1/2} V_q^T. As stated, the abandonment of the need to use a data set from some euclidean space as input gives rise to applications that go beyond generic dimensionality reduction. Often, some kind of subjective similarity judgment is used as a basis to obtain D, which allows for the visualization of 'psychological spaces' [Krz96], for example. If D contains the pairwise euclidean distances of some real dataset, however, the solution is the same as that obtained from PCA.

² Generally, the matrix needs some simple preprocessing, which will not be detailed here.
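As a rough illustration (an addition; the hidden toy configuration is arbitrary), the following Python sketch implements classical MDS, including the double centering of the squared dissimilarities that constitutes the 'simple preprocessing' mentioned in the footnote:

    import numpy as np

    def classical_mds(D, q):
        N = D.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N               # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                       # double-centered squared dissimilarities
        evals, V = np.linalg.eigh(B)
        idx = np.argsort(evals)[::-1][:q]                 # q largest eigenvalues
        return (V[:, idx] * np.sqrt(np.maximum(evals[idx], 0))).T   # q x N configuration

    rng = np.random.default_rng(5)
    P = rng.standard_normal((2, 20))                      # hidden 2-d configuration
    D = np.sqrt(((P[:, :, None] - P[:, None, :]) ** 2).sum(axis=0))  # its Euclidean distances
    X = classical_mds(D, q=2)
    D_rec = np.sqrt(((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0))
    print("max distance error:", np.abs(D - D_rec).max())            # close to zero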
3.2. Nonlinear Models
3.2.1. Locally Linear Embedding
A method that incorporates an EVD in order to accomplish nonlinear dimensionality reduction, and that has attracted a great deal of attention recently, is the method of Locally Linear Embedding (LLE) [RS00]. It is essentially based on geometrical intuitions, as it attempts to determine a lower dimensional embedding of a given dataset that retains local neighborhood relations between datapoints by making use of the following three-step algorithm:

1. for each y_i define N_i, the index set of its k nearest neighbors
2. set w_{ij} = 0 if j \notin N_i and minimize with respect to the remaining w_{ij} the objective

       \sum_{i=1}^{N} \| y_i - \sum_j w_{ij} y_j \|^2    s.t.   \sum_j w_{ij} = 1    (3.1)

3. minimize with respect to the x_i the objective

       \sum_{i=1}^{N} \| x_i - \sum_j w_{ij} x_j \|^2    (3.2)
       = \| X (I - W^T) \|_F^2    (3.3)

   with

       M = (I - W^T)(I - W) .    (3.4)

The purpose of the second step is to discover those weights that give rise to an optimal reconstruction of each observable space datapoint by its neighbors. This step requires solving a constrained least squares fit and has a closed form solution. In the case of k > d some kind of regularization heuristic is necessary, however. In the third step those latent variable realizations are then sought that minimize the average (latent space) error when each point is reconstructed from its neighbors with the same weights as its observable space counterpart. By writing 3.3 as \mathrm{tr}(X M X^T) it is obvious that this step gives rise to a quadratic form in the latent variables. Therefore, by requiring X 1 = 0 and \frac{1}{N} X X^T = I, the solution is given by the eigenvectors belonging to the q (second to) smallest eigenvalues of M. Particularly advantageous here is that M is sparse, allowing for an efficient solution.
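The three steps translate almost directly into code. The following Python sketch (an addition; the neighborhood size, the regularization constant and the spiral toy data are arbitrary choices) stores data points as columns:

    import numpy as np

    def lle(Y, q=1, k=8, reg=1e-3):
        d, N = Y.shape
        W = np.zeros((N, N))
        for i in range(N):
            d2 = ((Y - Y[:, [i]]) ** 2).sum(axis=0)
            nbrs = np.argsort(d2)[1:k + 1]                 # step 1: k nearest neighbors
            Z = Y[:, nbrs] - Y[:, [i]]                     # local difference vectors
            G = Z.T @ Z
            G += reg * np.trace(G) * np.eye(k)             # regularization heuristic (k > d case)
            w = np.linalg.solve(G, np.ones(k))
            W[i, nbrs] = w / w.sum()                       # step 2: weights summing to one
        M = (np.eye(N) - W).T @ (np.eye(N) - W)            # step 3: quadratic form
        evals, V = np.linalg.eigh(M)
        return V[:, 1:q + 1].T                             # skip the constant eigenvector

    rng = np.random.default_rng(6)
    t = np.sort(rng.uniform(0, 3 * np.pi, 400))
    Y = np.stack([t * np.cos(t), t * np.sin(t)]) + 0.01 * rng.standard_normal((2, 400))
    X = lle(Y, q=1, k=10)
    print(X.shape)                                         # (1, 400): one latent coordinate per point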
Overall, in tending to preserve the weights with which a datapoint is reconstructed from its neighbors under the sum-to-one constraint, the authors of the LLE algorithm state that it tends to preserve exactly those properties of each neighborhood that are invariant under rescalings, rotations, and translations. Hence, the algorithm provides
a mapping from observable to latent data space that tends to be linear for each neighborhood. In other words, LLE determines a low dimensional global representation of a manifold embedded in a higher dimensional space by assuming it to be arranged in linear patches. This assumption also defines the weak spot of the LLE algorithm, since a violation of it lets the algorithm fail. This is in particular the case if noise is present in the data.
Generalization
The authors of the LLE algorithm propose two ways of generalizing a trained model to new latent or observable space elements. The first, non-parametric, approach is in principle a straightforward re-application of the main procedures that lie beneath the learning algorithm itself: for a new latent or observable vector, generalization of f or g, respectively, is achieved by (i) identifying the new datapoint's neighbors among the latent variable realizations or the training data, respectively, (ii) determining corresponding reconstruction weights, and (iii) evaluating the function as a linear combination of the found neighbors with their corresponding weights.

The theoretical justification for this kind of generalization lies in the same geometrical intuitions that the LLE algorithm itself is based on. In particular, the assumption of local linearity is crucial for this approach to generalization to work properly, so that the dependency on noise-free data applies here in the same manner as above. The fact that the concept of generalization is most essentially based on the presence of noise, however, thereby represents a serious problem for this approach.
The second, parametric, approach to generalization proposed by the authors is to train a supervised model on the input-output data pairs that are available after the application of LLE. The problems affecting non-parametric generalization can be circumvented this way. The resulting overall algorithm consists of two separate parts – an unsupervised and a supervised part – with two unrelated objectives: optimal reconstruction of the latent data elements from their neighbors for the first, and minimization of the expected prediction error for the second (see 2.1). A final model assessment is therefore based on the prediction error in the space of the observable variables. The problem is that it is not at all clear to what extent complying with the first objective is able to
serve the second. This is in clear contrast to the projection models described in 2.4, which estimate latent variable realizations and a model for the forward mapping at the same time, both with the objective of minimizing the observable space error.
3.2.2. Isomap
A method for NLDR that has attracted similar attention as LLE is the Isomap (isometric feature mapping) algorithm proposed by [TdSL00]. It can be regarded as a heuristic approach, too, but it is based on completely different intuitions than LLE. Isomap similarly emanates from the assumption that the observable data are distributed along a low-dimensional manifold, but abstains from the assumption of linear patches. It seeks to find a low-dimensional embedding that preserves distances between datapoints as measured along the manifold – so called 'geodesic' (locally shortest) distances. The crucial point in this algorithm is therefore to derive these geodesic distances from the datapoints (more precisely, from their euclidean distances) in an efficient way. To achieve this the authors propose the three step algorithm of (i) computing a topology-preserving network representation of the data, (ii) computing the shortest-path distance between any two points, which can efficiently be done by using dynamic programming, and (iii) determining the low-dimensional representation that preserves the computed distances as closely as possible using Multidimensional Scaling (MDS) on these distances.
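As a rough illustration (an addition; the neighborhood size and the noise-free spiral are arbitrary, and scipy's dense shortest-path routine stands in for the dynamic programming step), the three steps can be sketched as follows:

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    def isomap(Y, q=1, k=8):
        d, N = Y.shape
        D = np.sqrt(((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0))
        G = np.full((N, N), np.inf)
        for i in range(N):                                  # (i) k-nearest-neighbor graph
            nbrs = np.argsort(D[i])[1:k + 1]
            G[i, nbrs] = D[i, nbrs]
        DG = shortest_path(G, directed=False)               # (ii) geodesic distances
        J = np.eye(N) - np.ones((N, N)) / N                 # (iii) classical MDS on DG
        B = -0.5 * J @ (DG ** 2) @ J
        evals, V = np.linalg.eigh(B)
        idx = np.argsort(evals)[::-1][:q]
        return (V[:, idx] * np.sqrt(np.maximum(evals[idx], 0))).T

    rng = np.random.default_rng(7)
    t = np.sort(rng.uniform(0, 3 * np.pi, 300))
    Y = np.stack([t * np.cos(t), t * np.sin(t)])             # a noise-free spiral
    X = isomap(Y, q=1, k=6)
    print(abs(np.corrcoef(X[0], t)[0, 1]))                   # recovered coordinate follows the spiral parameter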
With regard to generalization, Isomap shares the shortcomings of LLE, because of its likewise rather heuristic quality. In fact, since, in contrast to LLE, not even a heuristic generalization of the involved mappings can be naturally derived from the intuitions the algorithm is based upon, the authors suggest training a supervised model on the obtained completed dataset – resulting in the same shortcomings as noted above.
3.2.3. Kernel PCA
Another spectral approach to nonlinear dimensionality reduction, which at first glance resembles the UKR variant to be described in 4.3 with regard to its incorporation of both kernel and spectral methods, is Kernel PCA. A closer look reveals important differences, however.
Kernel PCA stems from a line of research on kernel based methods that is based on the idea of incorporating an implicit (usually highly nonlinear) mapping to some higher dimensional feature space by exploiting the finding that dot products in such a feature space can equivalently be computed using the original data space vectors alone with the help of kernel functions [Bur98], which is often referred to as the 'kernel trick.' In order to apply this idea to Unsupervised Learning, the classical PCA approach may be recast in a form that makes use of inner products only. In [SSM99] the derivation of this formulation is given, resulting in an algorithm to compute principal components in a higher dimensional feature space. Technically this amounts to performing the EVD of the matrix of pairwise kernel function evaluations.
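A minimal Python sketch of this computation (an addition; the Gaussian kernel, its width and the two-ring toy data are arbitrary choices) is given below; it returns the projections of the training points onto the leading feature-space principal components:

    import numpy as np

    def kernel_pca(Y, q=2, gamma=1.0):
        d, N = Y.shape
        D2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
        K = np.exp(-gamma * D2)                            # pairwise kernel evaluations
        J = np.eye(N) - np.ones((N, N)) / N
        Kc = J @ K @ J                                     # centering in feature space
        evals, V = np.linalg.eigh(Kc)
        idx = np.argsort(evals)[::-1][:q]
        alpha = V[:, idx] / np.sqrt(np.maximum(evals[idx], 1e-12))   # normalized expansion coefficients
        return (Kc @ alpha).T                              # q x N projections onto feature-space PCs

    rng = np.random.default_rng(8)
    angles = rng.uniform(0, 2 * np.pi, 200)
    radii = rng.choice([1.0, 3.0], 200)                    # two concentric rings
    Y = np.stack([radii * np.cos(angles), radii * np.sin(angles)])
    X = kernel_pca(Y, q=2, gamma=0.5)
    print(X.shape)                                         # (2, 200)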
4. Unsupervised Kernel Regression
This chapter describes the method of Unsupervised Kernel Regression as recently introduced by [Mei03]. In a first step the concept of Nonparametric Regression is introduced. Then two distinct ways of extending this concept to Unsupervised Learning are delineated in the subsequent two sections. In addition, a combination of the objective functions these two variants give rise to, which has proven to be particularly useful with regard to practical considerations, will be proposed in the section that follows.
4.1. Unsupervised Nonparametric Regression
4.1.1. The Nadaraya Watson Estimator
Chapter 2 introduced the purpose of regression as that of modeling a functional relationship between variables by choosing that element from a parameterized set of candidate functions that minimizes the empirical prediction error on a training data set, which is asymptotically equivalent to taking the conditional expectation:

    f^*(x) = E[\, y \mid x \,]    (repeated)    (4.1)

In contrast to 2.1, where the approximation of the regression function has been realized by minimization of the empirical prediction error or maximization of the data log-likelihood, one obtains a nonparametric variant of regression estimation if one aims directly at modeling the conditional expectation. This can be achieved by considering
a non-parametric estimator \hat{p} of the joint probability density function p(x, y) of the involved variables, taking into account that 4.1 can be rewritten to yield

    f^*(x) = \frac{\int y \, p(x, y) \, dy}{\int p(x, y) \, dy} .    (4.2)

By utilizing the multivariate Kernel Density Estimator (KDE) to model the joint density [Sco92],

    \hat{p}(x, y) = \frac{1}{N} \sum_{i=1}^{N} K(x - x_i) \, K(y - y_i)    (4.3)

with K(\cdot) being multivariate kernel functions and \tilde{K}(\cdot) denoting the unnormalized portions thereof, so that with the normalization constants c_X and c_Y

    K(x - x_i) = c_X \tilde{K}(x - x_i) ,    \int K(x) \, dx = 1    (4.4)

and

    K(y - y_i) = c_Y \tilde{K}(y - y_i) ,    \int K(y) \, dy = 1    (4.5)

hold, the estimate of the regression function becomes¹

    \hat{f}(x) = \frac{\sum_{i=1}^{N} K(x - x_i) \, y_i}{\sum_{j=1}^{N} K(x - x_j)}    (4.6)

¹ The symbol K will be overloaded to denote both the observable space and latent space kernel functions, and later on also the matrix of kernel functions. Since the data space kernel functions cancel out in 4.6, here they are only specified with latent space arguments; later on, however, they will be used to accommodate data space arguments analogously.
which is known as the Nadaraya Watson Estimator [Bis96]. In this thesis only the spherical Gaussian kernel

    K(x; h) \propto \exp\!\left( - \frac{\|x\|^2}{2 h^2} \right)    (4.7)

and the Epanechnikov kernel function

    K(x; h) \propto \begin{cases} 1 - \|x\|^2 / h^2 & \text{if } \|x\| \le h \\ 0 & \text{otherwise} \end{cases}    (4.8)

will be deployed, both denoted by K(\cdot). The function used will be indicated at the respective places.

The function parameter h determines the kernel bandwidth and provides a means to adjust the model complexity of the Nadaraya Watson Estimator.
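As a small illustration (an addition; the toy data and the bandwidth are arbitrary), the estimator 4.6 with the Gaussian kernel 4.7 can be written directly as a kernel-weighted average:

    import numpy as np

    def nw_estimate(x, X_train, Y_train, h=0.3):
        """Evaluate the Nadaraya-Watson estimator at latent location(s) x.
        X_train: q x N, Y_train: d x N, x: q x M; returns d x M."""
        d2 = ((x[:, None, :] - X_train[:, :, None]) ** 2).sum(axis=0)   # N x M squared distances
        K = np.exp(-d2 / (2 * h ** 2))
        return Y_train @ (K / K.sum(axis=0, keepdims=True))             # kernel-weighted averages

    rng = np.random.default_rng(9)
    X_train = np.sort(rng.uniform(-2, 2, 100))[None, :]                 # 1 x N latent inputs
    Y_train = np.stack([np.sin(2 * X_train[0]), np.cos(2 * X_train[0])]) \
              + 0.1 * rng.standard_normal((2, 100))
    x_test = np.linspace(-2, 2, 5)[None, :]
    print(np.round(nw_estimate(x_test, X_train, Y_train, h=0.2), 2))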
4.1.2. Unsupervised Nonparametric Regression
As in the parametric case, the transition to unsupervised regression is made by regarding the regressors as latent. In contrast to the parametric case, however, which poses the twofold objective of finding suitable latent variable realizations along with parameters defining a functional relationship, here both objectives are achieved at the same time by merely taking care of finding suitable latent variable realizations, because of the nonparametric nature of the problem. This way the 'double burden' that problems in UL hitherto gave rise to is eliminated, resulting in methods that resemble those from SL, because they depend on the estimation of only one class of parameters.

As pointed out in [Mei03], in the unsupervised case the Nadaraya Watson Estimator may be deployed in two ways. The first is to treat the latent variables in 4.6 as parameters to be estimated. By measuring the observable space error one obtains the objective function dealt with in detail in the next section. The second way is to compute the latent variable realizations by simply applying the Nadaraya Watson Estimator to the observed variables, which are regarded as input in this case. In other words, this variant amounts to computing the nonparametric regression function in the opposite direction.
The objective function for this approach is obtained by measuring the latent space error. In that case, a nontrivial coupling of the resulting latent variable realizations has to be accounted for, a problem that can be solved efficiently by a spectral decomposition as described in chapter 3. This variant will be described in 4.3.
4.2. Observable Space Error Minimization
To obtain a loss function for learning suitable latent variable realizations one might conceive of 4.6 as being parameterized by the latent data matrix X and measure the mean square reconstruction error on the observed variables [Mei03]. The resulting objective function of this UKR variant, denoted oUKR in the following, is given by

    R(X) = \frac{1}{N} \sum_{i=1}^{N} \left\| y_i - \frac{\sum_j K(x_i - x_j) \, y_j}{\sum_k K(x_i - x_k)} \right\|^2    (4.9)
         = \frac{1}{N} \| Y - Y B(X) \|_F^2    (4.10)

with

    (B(X))_{ij} = b_{ij} = \frac{K(x_i - x_j)}{\sum_k K(x_k - x_j)} .

Since the effect of any variation of the kernel bandwidth could equivalently be caused by a change in the average scale of the latent variable realizations, in the following, if not stated otherwise, the kernel bandwidth will be conceived of as being constant (h = 1) and any influence on the model complexity will be accounted for, or effected, by the latent variable norms only.
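As a small numerical illustration (an addition; the toy data are arbitrary), the objective 4.9/4.10 with a Gaussian kernel and h = 1 can be evaluated as follows:

    import numpy as np

    def ukr_B(X):
        d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # pairwise squared latent distances
        K = np.exp(-0.5 * d2)
        return K / K.sum(axis=0, keepdims=True)                   # columns sum to one

    def ukr_error(X, Y):
        B = ukr_B(X)
        R = Y - Y @ B                                             # residuals y_i - f(x_i)
        return (R ** 2).sum() / Y.shape[1]

    rng = np.random.default_rng(10)
    t = rng.uniform(-2, 2, 150)
    Y = np.stack([t, np.sin(2 * t)]) + 0.05 * rng.standard_normal((2, 150))
    print(ukr_error(t[None, :], Y))                      # latent variables set to the generating parameter
    print(ukr_error(rng.standard_normal((1, 150)), Y))   # a random latent configuration does worse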
X – and by virtue of 4.6 at the same time f – will then be estimated by minimizing 4.10. Since a minimization without further restrictions on the objective function would drive the latent variable scales to infinity, with the reconstruction error for the training data set at the same time approaching zero, it is obvious that some kind of complexity control has to be imposed in order for the optimization problem to be well
defined [Mei03]. Practical considerations regarding both minimization and restriction of the model complexity will be delineated in the next section.
The generalization of a trained model to new latent space or observable space elements is straightforward. Since a model for f is learned along with suitable latent variable realizations, the oUKR method closely resembles the projection models described in 2.4. The only difference is that no iterative training procedure is necessary for the regression function itself, because no function parameters need to be estimated. Application of the regression function to new latent space elements is therefore possible simply by plugging these into 4.6. The backward mapping g can be defined analogously to 2.7:

    g(y) = \arg\min_{x} \| y - \hat{f}(x) \|^2 .    (4.11)

This requires solving a nonlinear optimization problem, which should be initialized suitably. A straightforward choice for the initialization is

    x_0 = \arg\min_{x_i,\; i = 1, \dots, N} \| y - \hat{f}(x_i) \|^2 ,    (4.12)

requiring a search over the training data points.
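A small Python sketch of 4.11 and 4.12 (an addition; the 'trained' latent variables, the data and the use of scipy's Nelder-Mead routine are stand-in assumptions) could look as follows:

    import numpy as np
    from scipy.optimize import minimize

    def ukr_f(x, X, Y):
        """Nadaraya-Watson forward mapping f(x) for a single latent point x (h = 1)."""
        d2 = ((X - x[:, None]) ** 2).sum(axis=0)
        K = np.exp(-0.5 * d2)
        return Y @ (K / K.sum())

    def ukr_g(y, X, Y):
        err = lambda x: ((y - ukr_f(x, X, Y)) ** 2).sum()
        x0 = X[:, np.argmin([err(X[:, i]) for i in range(X.shape[1])])]   # search over training points (4.12)
        return minimize(err, x0, method="Nelder-Mead").x                  # refine the initialization (4.11)

    rng = np.random.default_rng(11)
    t = np.linspace(-2, 2, 100)
    X = t[None, :]                                         # stand-in for trained latent variables
    Y = np.stack([t, np.sin(2 * t)])
    y_new = np.array([0.5, np.sin(1.0)])                   # a new observable-space element
    print(ukr_g(y_new, X, Y))                              # latent point whose reconstruction best matches y_new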
From the reasoning laid out so far a generative model is obtained simply by regarding the latent variables as random variables as in 2.5 and maximizing the conditional log-likelihood (see chapter 2), which can be derived straightforwardly (see [Mei03] for details).
4.2.1. Optimization
The objective function is nonlinear with respect to the latent variables and needs to be minimized iteratively. Since the gradient of the objective function can be computed analytically, some gradient based optimization scheme may be used. For the partial derivatives with respect to the latent variables it holds
holds:
�� . � 2�
� � � �� !$# � � !$# � �� � � �� � � � � � � (4.13)
27
-
4. Unsupervised Kernel Regression
with�
denoting the � �
column of�
.
Since there are � latent vectors with � components each, it is
obvious that the timecomplexity for computation of the gradient
amounts to at least � .0� � � ��2 . This is in-deed the complexity
class for computation of the gradient, if one pre-computes � �� �
and � ��������� � � � . Computation of these terms gives rise to
costs of � .0� � � 2 and � . � � � ��2 ,respectively. While this is
clear for the first expression, the second one deserves special
attention. It holds:
� �� � � � � ��� � �
� � �
�� � !$# � � � �� � � � � � . � � � 2� �� !$# � . � � � 2 � � �
�
�� � !$# � � � � � �� !$# � . � � � 2� �� � � � . � � � 2�
� � �� � !$# � . � � � 2 � � . � � � 2 � �
� � !$#
� � . � � � � 2�� � �
�� � . � � �02
�� � � �
� �� �� !$# � . � � � 2
� � !$# � � � � � . � � � 2� � � �
�
� � �� !$# � . � � � 2 � � �
� � !$#
� � . � � � � 2�� � �
�� � . � � �52
�� � �
� � !$# � � � � . � � � 2 �
(4.14)
Therefore
� �� � � �� � �� � � �.�� 2 � � � � � � � � � . � � �52� � � � �
�
with
� � �� � !$# � . � � � 2
28
-
4. Unsupervised Kernel Regression
�� � � �� � !$# � � � � � . � � � 2� � � � � � � � � � . � �
�32� � � � � � � � � !$# � � � � � . � � � � 2� � � �
� � � � �� � !$#
� � . � � � � 2�� � �
� � � �� � !$# � � � � . � � � 2 �
The terms � , , � , � may be pre-computed, which amounts to a
time complexity of� . � � 2 , � . � � � ��2 , � . � � ��2 and � . �
� � 2 , respectively.
Throughout this chapter only the Gaussian kernel function will
be used. In this case
the derivative is given by
� � . � � � 2�
� � � �� � . � � �52
�� � � �
�� � � . � � � 2�. � � � � � � 2 �To further reduce time
complexity for computation of the objective function as well
as its gradient, a sparse structure of the kernel matrix might
be induced by residing to a
kernel function with finite support, such as the Epanechnikov
kernel or the differentiable
’quartic kernel’ [Sco92]. An exploration of this approach,
however, is not within the
scope of this thesis.
Ridge Regression
The above mentioned requirement to restrict the model complexity can be met by adding a regularization term S(X) to the objective function that constrains the scale of the latent variable realizations in some way. This is referred to as 'ridge regression' or 'weight decay' in the literature (see [HS89]). The resulting penalized objective function then
reads

    R_\lambda(X) = R(X) + \lambda S(X) .    (4.15)

The parameter \lambda \ge 0 functions as a means to control the influence of the regularization term and thereby provides a way to control the model complexity. The matrix of suitable latent variable realizations is now determined by minimizing the penalized objective function²:

    \hat{X} = \arg\min_{X} R_\lambda(X) .    (4.16)

A straightforward choice for the regularization term is one that equally restricts the euclidean latent variable norms, which can be achieved by defining

    S(X) = \frac{1}{N} \| X \|_F^2 .    (4.17)

Other regularization terms are conceivable, however. In fact, these might even be desirable in specific circumstances, since they could provide a facility to include some kind of top-down knowledge by imposing constraints upon the latent variables and their relations. This idea will be evaluated in more detail in section 4.2.2. In the following, if not stated otherwise, only the variant given in 4.17 will be used.

² Note that for optimization by gradient descent the gradient given in 4.13 needs to be modified by adding the derivative of the respective penalty term with respect to the latent variables.
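As a rough illustration of the interplay between 4.15, 4.17 and the homotopy strategy of section 2.3 (an addition; the data, the annealing schedule and the use of scipy's L-BFGS-B routine with numerical gradients, rather than the analytic gradient 4.13, are simplifying assumptions):

    import numpy as np
    from scipy.optimize import minimize

    def ukr_objective(Xflat, Y, q, lam):
        """Penalized oUKR objective R(X) + lam * S(X) with Gaussian kernel, h = 1."""
        X = Xflat.reshape(q, -1)
        d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
        K = np.exp(-0.5 * d2)
        B = K / K.sum(axis=0, keepdims=True)
        N = Y.shape[1]
        return ((Y - Y @ B) ** 2).sum() / N + lam * (X ** 2).sum() / N

    rng = np.random.default_rng(12)
    t = rng.uniform(-1, 1, 60)
    Y = np.stack([t, np.sin(3 * t)]) + 0.05 * rng.standard_normal((2, 60))
    q = 1
    X = rng.uniform(-0.5, 0.5, (q, 60))                    # random initialization
    lam = 1.0
    for step in range(15):                                  # homotopy: slowly release the penalty
        res = minimize(ukr_objective, X.ravel(), args=(Y, q, lam),
                       method="L-BFGS-B", options={"maxiter": 50})
        X = res.x.reshape(q, -1)
        lam *= 0.7
    print("final penalized error:", ukr_objective(X.ravel(), Y, q, lam))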
To deal with the presence of local minima, an optimization strategy such as the homotopy scheme described in 2.3 needs to be applied for the minimization of 4.15. Here, this practically amounts to slowly decreasing \lambda during optimization. In order to avoid overfitting, the error that the projection of an independent test set onto the resulting manifold gives rise to might be tracked. As an illustration, the approximations of the two-dimensional 'noisy S-manifold,' embedded in a three-dimensional space, which are depicted in figure 4.1, have been determined this way (see Appendix A for details on the generation of the toy datasets used here and in the following). Here, in particular the importance of a well chosen annealing strategy becomes obvious. While the two panels to the right
visualize the progression of the latent variable realizations obtained from a reasonably chosen annealing strategy, the two panels to the left show the effect of too fast annealing and the result of getting stuck in a local minimum, leading to a merely suboptimal final solution. The outermost illustrations have been rescaled to accommodate visualization of the latent variable realizations, while the innermost panels show the same results on a consistent scale for each of the two progressions, making visible the enlargement of the latent realizations throughout the annealing process. It is obvious that, while the solution to the left after 20 annealing steps spans an area that is approximately five times larger than the solution to the right after 350 steps, the final solution using 'slow' annealing captures the structure of the original data set far better than the left one.
A general drawback of using ridge regression and homotopy, related to the necessity of a suitably 'slow' annealing schedule, is efficiency. Datasets containing a thousand or more elements in a few hundred dimensions have proven problematic to handle and may therefore call for alternative strategies. One option will be presented in the following.
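As a rough sketch of the overall procedure (not taken from the thesis; the use of a plain gradient descent as the inner optimizer, the numerical gradient and all default values are assumptions), the annealing loop can be written as:

def numerical_gradient(f, X, eps=1e-5):
    """Central-difference gradient; adequate for small toy problems."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2.0 * eps)
    return G

def anneal(X0, Y, lam0=1.0, factor=0.9, n_steps=350, inner_iters=20, lr=1e-2):
    """Homotopy: repeatedly re-optimize X while slowly decreasing lambda."""
    X, lam = X0.copy(), lam0
    for _ in range(n_steps):
        for _ in range(inner_iters):
            grad = numerical_gradient(lambda Z: oukr_objective(Z, Y), X)
            X -= lr * (grad + penalty_gradient(X, lam))
        lam *= factor   # release the penalty slowly
        # the error of a projected test set could be tracked here to detect overfitting
    return X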
Constrained Optimization
While adding a penalty term as described above can be interpreted as imposing a soft constraint on the minimization of the objective function, by including a tendency to favor small-norm solutions, it is also possible to use a strictly constrained optimization algorithm instead, by re-defining the optimization problem as

$$\hat{\mathbf{X}} \;=\; \arg\min_{\mathbf{X}}\; E(\mathbf{X}) \qquad (4.18)$$
$$\text{subject to}\quad g(\mathbf{X}) \;\le\; c \qquad (4.19)$$

with $g$ defining some nonlinear constraint. In analogy to 4.17 one might set

$$g(\mathbf{X}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\|\mathbf{x}_i\|^{2}, \qquad (4.20)$$
[Figure 4.1: the dataset (top) and two progressions of latent variable realizations, shown at annealing steps 0, 1, 2, 3, 4 and 20 (fast schedule, left) and at steps 0, 1, 3, 6, 35 and 350 (slow schedule, right), each with additional rescaled views.]
Figure 4.1.: The effects of different annealing strategies. The dataset on the top, consisting of datapoints sampled from the two-dimensional 'S-manifold' distribution with spherical Gaussian noise, has been approximated using (left) 20 annealing steps with a fast geometric decrease of $\lambda$ after each step and (right) 350 annealing steps with a slower decrease. The start value of $\lambda$ was the same in all cases. The latent variables have been initialized randomly from a uniform distribution over the unit square. Note the tendency of the latent space realizations to arrange spherically, provoked by using the Frobenius norm as regularization.
Figure 4.2.: The effects of different optimization constraints, using the two-dimensional 'S-manifold'. The left plot shows the solution obtained from lUKR with a fixed observable space kernel bandwidth (see section 4.3). The solution was used as an initialization for oUKR, using bound constraints as defined in 4.21 in one case (middle plot) and the nonlinear constraint as defined in 4.20 in another (right plot).
for example, in order to restrict the average latent variable norms. Alternatively, simple bound constraints may be applied. To this end, 4.19 may be simplified to yield

$$\mathbf{L} \;\le\; \mathbf{X} \;\le\; \mathbf{U} \qquad (4.21)$$

with $\mathbf{L}$ and $\mathbf{U}$ being matrices of lower and upper bounds, respectively.
Making use of the homotopy strategy here then amounts to gradually releasing the constraints, either by directly decreasing or increasing the entries of $\mathbf{L}$ or $\mathbf{U}$, respectively, or by increasing $c$. However, an important difference to ridge regression is that, if a latent matrix initialization $\mathbf{X}_{init}$ is available, one may abstain from homotopy and derive suitable constraints from the initialization instead, for example by setting $\mathbf{L} = \min(\mathbf{X}_{init})$ and $\mathbf{U} = \max(\mathbf{X}_{init})$, or by setting $c = g(\mathbf{X}_{init})$ with $g$ defined as above. The question that remains, of course, is where to obtain such an initialization possessing the required property of being correctly scaled, so that suitable constraints can be derived in some way. The solutions that the LLE algorithm yields, for example, are not useful here, as they are arbitrarily scaled, as described in chapter 3. Nevertheless, there is a way to obtain such a suitably scaled initialization, as will be described in 4.4.
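One possible realization of such a constrained re-optimization (only a sketch; the thesis does not prescribe a particular solver, so the use of SciPy's L-BFGS-B with elementwise bounds is an assumption, and oukr_objective is reused from above):

from scipy.optimize import minimize

def constrained_refit(X_init, Y):
    """Re-optimize a correctly scaled initialization under bound constraints
    L <= X <= U derived from the initialization itself, cf. (4.21)."""
    lo, hi = X_init.min(), X_init.max()
    shape = X_init.shape

    def objective(x_flat):
        return oukr_objective(x_flat.reshape(shape), Y)

    result = minimize(objective, X_init.ravel(), method="L-BFGS-B",
                      bounds=[(lo, hi)] * X_init.size)
    return result.x.reshape(shape)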
In anticipation of the results given there, an example of the constrained (re-)optimization of such an initialization is depicted in figure 4.2. A two-dimensional approximation of a dataset consisting of 500 elements of the noise-free ($\sigma = 0$) 'S-manifold' was determined using the lUKR variant that will be described in section 4.3. The obviously suboptimal solution shown to the left was then used as an initialization $\mathbf{X}_{init}$ for the oUKR method. Constrained optimization was applied to this initialization in two different ways. In the first case, bound constraints were used, with $\mathbf{L}$ and $\mathbf{U}$ defined as above. In the second, the nonlinear constraint (4.20) was used, with $c$ defined as above. The figure shows that both strategies are able to improve the suboptimal initialization. In addition, the influence that a constraint has on the final solution is visible; this influence is obviously stronger than it was using ridge regression.
’Built in’ Cross Validation
An alternative means of regularization, which abandons the need to estimate any hyperparameters and the associated necessity of embedding the minimization of 4.15 in an enclosing cross validation loop, can be obtained using a cross validation mechanism 'built into' the objective function. This is done by utilizing only a subset of the training data at every function evaluation, excluding in particular the data point to be approximated in the current evaluation step, so that an external validation set becomes unnecessary. Using 'leave-one-out' cross validation, where the only vector excluded is the one to be approximated, gives rise to the modified objective function

$$E_{cv}(\mathbf{X}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl\|\mathbf{y}_i - \mathbf{f}_{-i}(\mathbf{x}_i)\bigr\|^{2} \qquad (4.22)$$
$$\;=\; \frac{1}{N}\sum_{i=1}^{N}\Bigl\|\mathbf{y}_i - \frac{\sum_{j\neq i}\mathbf{y}_j\,K(\mathbf{x}_i-\mathbf{x}_j)}{\sum_{j\neq i}K(\mathbf{x}_i-\mathbf{x}_j)}\Bigr\|^{2} \qquad (4.23)$$
$$\;=\; \frac{1}{N}\,\bigl\|\mathbf{Y}-\mathbf{Y}\tilde{\mathbf{B}}(\mathbf{X})\bigr\|_{F}^{2} \qquad (4.24)$$
with

$$\tilde{b}_{ij} \;=\; \frac{K(\mathbf{x}_j-\mathbf{x}_i)\,(1-\delta_{ij})}{\sum_{k\neq j}K(\mathbf{x}_j-\mathbf{x}_k)}\,.$$

Then, an optimal embedding is given by

$$\hat{\mathbf{X}} \;=\; \arg\min_{\mathbf{X}}\; E_{cv}(\mathbf{X})\,. \qquad (4.25)$$
While the 'leave-one-out' strategy is normally the computationally most expensive cross validation variant, the 'built in' alternative adopted here provides a convenient 'trick': it exploits the coupling of all datapoints, and the associated $O(N^2)$ complexity that each function evaluation of the UKR method, being a kernel based method, gives rise to anyway. Thereby, leave-one-out cross validation becomes possible without any additional computational cost.
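In terms of the kernel matrix this simply means zeroing its diagonal before normalization; a minimal sketch (same assumptions as before, not taken from the thesis):

def oukr_cv_objective(X, Y):
    """Built-in leave-one-out error (4.24): each point is excluded from its
    own reconstruction by zeroing the diagonal of the kernel matrix."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq)
    np.fill_diagonal(K, 0.0)                  # exclude y_i from f(x_i)
    B = K / K.sum(axis=1, keepdims=True)      # leave-one-out weights
    return np.sum((Y - B @ Y) ** 2) / len(Y)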
Applying homotopy is still possible using the built in cross validation variant, since one may still add a penalty term to 4.24 or constrain the solutions accordingly. In fact, one might even prefer the modified objective function over 4.10 for this purpose, in order to forgo the tracking of a test error. Alternatively, if a suitable initialization is available, the built in cross validation mechanism also allows for direct optimization. A well chosen initialization is vital in this case, however, because of the highly nonlinear structure of the objective function. An illustration is depicted in figure 4.3: a random initialization, as shown at the top of (a), does not lead to a satisfactory result, as can be seen at the bottom of (a). A merely suboptimal solution that the LLE algorithm might yield, which can happen in particular in the presence of noise or because of a badly chosen neighborhood size (see 3.2.1), is shown at the top of (b) and provides the grounds for this method to find an appropriate embedding (see bottom of (b)).
[Figure 4.3: panels (a) and (b), each showing an initialization (top) and the corresponding solution (bottom).]
Figure 4.3.: Illustration of the importance of a suitable initialization for the built in leave-one-out cross validation variant. Noise-free two-dimensional 'halfcircle' dataset to the left and one-dimensional UKR approximation to the right. While UKR fails to find a suitable approximation for randomly initialized latent variables, as can be seen in (a), a merely suboptimal LLE solution provides a starting point from which a satisfactory result is obtained, as shown in (b).
4.2.2. Experiments
Visualization of the regression manifold
As stated above, since the oUKR method determines a model for $\mathbf{f}$ along with suitable latent variable realizations, the application of the regression function to new latent space elements is straightforward. This makes it possible to visualize the learned regression manifold (for up to three-dimensional embedding spaces) by sampling the latent space and 'plugging' the obtained latent space elements into the learned model for $\mathbf{f}$. As shown in figure 4.4, this procedure has been used to visualize the effects that a rescaling of the latent data matrix $\mathbf{X}$ has on the resulting regression manifold. It is obvious that for a rescaling factor of 0.0 the regression manifold degenerates to the sample mean. The rightmost plot in the top row shows the result for one of the smaller rescaling factors.
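The sampling step itself amounts to evaluating the Nadaraya-Watson estimate at the grid points; a sketch (again only an illustration with freely chosen names, following the conventions of the earlier sketches):

def sample_manifold(X, Y, grid):
    """Map latent points 'grid' (M x q) through the learned regression
    function f(.; X, Y) into data space."""
    sq = np.sum((grid[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq)
    B = K / K.sum(axis=1, keepdims=True)
    return B @ Y                                   # M points on the manifold

# e.g. for a one-dimensional latent space and a rescaling factor s (cf. Fig. 4.4):
# grid = np.linspace(s * X.min(), s * X.max(), 200)[:, None]
# manifold = sample_manifold(s * X, Y, grid)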
Figure 4.4.: Visualization of the different regression manifolds resulting from rescaling the latent data matrix. For the one-dimensional 'noisy S-manifold' dataset shown in the bottom left corner, a low-dimensional representation was obtained using UKR with homotopy. The plots show, in successive order from left to right and from top to bottom, the visualization of the regression manifold obtained from sampling latent space along a regular grid after rescaling the latent data matrix with factors 0.0, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 8.0, 15.0, 30.0, and three larger factors.
[Figure 4.5: a grid of error progressions, error plotted against the number of iterations.]
Figure 4.5.: The built in cross validation error $E_{cv}$ (depicted in blue) compared to the error on an independent test set $E_{test}$ (depicted in red). The left column shows the error progressions averaged over 50 runs; the middle and right columns show, out of the 50 runs, only those with the largest deviations between the two error progressions according to the $\ell_2$ norm and the $\ell_\infty$ norm, respectively. The first and third rows depict the progressions for the one-dimensional 'S-manifold' without noise and with spherical Gaussian noise, respectively. The second and fourth rows show the respective progressions resulting from a faster annealing schedule. The two bottom rows show the error progressions for the two-dimensional 'S-manifold', again without and with noise, respectively.
[Figure 4.6: four plots of the test error $E_{test}$ against the rescaling factor $s$.]
Figure 4.6.: The effect that a rescaling of the latent data matrix obtained from the built in leave-one-out cross validation variant of oUKR has on the error that the projection of an independent test set gives rise to. The plots, in succession from left to right, correspond to the settings of the four top plots described in the caption of figure 4.5, from top to bottom.
Built In Cross Validation vs. Homotopy
As stated, the built in cross validation error criterion is a promising computational shortcut for solving the problem of model selection. In order to provide empirical evidence that this criterion is of real practical value, an investigation of the distribution of the resulting error ($E_{cv}$) as compared to the average error on an independent test set (containing elements $\mathbf{y}^{test}_1, \dots, \mathbf{y}^{test}_{N_{test}}$),

$$E_{test} \;=\; \frac{1}{N_{test}}\sum_{i=1}^{N_{test}}\bigl\|\mathbf{y}^{test}_i - \mathbf{f}\bigl(\mathbf{g}(\mathbf{y}^{test}_i)\bigr)\bigr\|^{2},$$

where $\mathbf{g}(\cdot)$ denotes the projection onto the manifold as in 4.11, has been undertaken. To this end, the UKR model has been trained on the one-dimensional and on the two-dimensional 'S-manifold' datasets, comprising 50 elements each, in one setting without noise and in another with spherical Gaussian noise. The homotopy scheme has been applied using the penalized objective function (4.15), with the regularization parameter $\lambda$ starting from the same value in all cases and being annealed geometrically, with a moderate factor in one setting and, for the one-dimensional case, with a considerably more aggressive factor in another.
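Computing $E_{test}$ requires projecting each test vector onto the current manifold, i.e. minimizing $\|\mathbf{y}-\mathbf{f}(\mathbf{x})\|^2$ over the latent coordinate. The following sketch (an assumption about the implementation, not taken from the thesis) performs this projection by local optimization started from a few promising training latent points:

from scipy.optimize import minimize

def project(y, X, Y, n_starts=5):
    """Project a data-space vector y onto the manifold: minimize ||y - f(x)||^2
    over the latent coordinate x, restarting from the latent positions of the
    training points closest to y."""
    def recon(x):
        k = np.exp(-0.5 * np.sum((x[None, :] - X) ** 2, axis=1))
        return (k @ Y) / k.sum()

    def err(x):
        return np.sum((y - recon(x)) ** 2)

    starts = X[np.argsort(np.sum((Y - y) ** 2, axis=1))[:n_starts]]
    candidates = [minimize(err, x0).x for x0 in starts]
    best = min(candidates, key=err)
    return best, err(best)

def test_error(Y_test, X, Y):
    """Average projection error of an independent test set, cf. E_test."""
    return np.mean([project(y, X, Y)[1] for y in Y_test])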
The error on an independent test set has been tracked throughout in each setting, and its average over 50 runs for each setting is depicted together with the cross validation error in figure 4.5 (first column). Overall, a strong correlation between these quantities can be noticed in all settings; they even converge as the number of training steps increases. In addition, the largest deviations among the 50 runs between the two error progressions according to the $\ell_2$ norm, as well as to the $\ell_\infty$ norm, are depicted in columns 2 and 3, respectively. They show significant differences only at the outset of the learning process, underlining the reliability of the cross validation criterion.
Another finding that becomes obvious from the plots is that a rise of the test error hardly ever occurs. Even for the settings in which the rather 'radical' annealing factor has been applied, the test error decreases monotonically. This suggests that a change in $\lambda$ leads to a slight deformation of the error surface rather than to a complete restructuring of it, so that the homotopy method causes a local optimizer to track a once-reached local minimum during the overall optimization process.
The built in cross validation error, on the other hand, as can be seen in the plots for the two-dimensional datasets, does happen to rise. The plots in the last row show that this can even be the case while the test error is obviously still falling. From this observation one may draw the conclusion that an application of the built in cross validation criterion will rather give rise to underfitting than to overfitting. In order to collect more evidence for this conclusion, a further experiment has been conducted. The final solutions for the one-dimensional datasets have been used as an initialization for an iterative minimization of $E_{cv}$. Then the effect that a rescaling of the resulting latent data matrix with some rescaling factor $s$ has on the test error has been measured. If training by minimization of $E_{cv}$ leads to an over-regularization, rescaling with a factor $s$ greater than one should result in a decreasing test error. Figure 4.6 shows that this is indeed the case. While downsizing the latent variable scale leads to an increasing test error, enlarging it leads to a decreasing error. The optimal rescaling factor with respect to the test error lies clearly above one in all settings. This indicates that built in cross validation indeed tends to a slight over-regularization. Since the reason to deploy this error criterion is to circumvent overfitting, and since the increase of the test error that this over-regularization gives rise to is small, as can be seen in the plots, it can be concluded
that the built in leave-one-out criterion is indeed a practical
alternative to other, external,
model selection strategies.
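The rescaling experiment itself can be summarized in a few lines (a sketch reusing test_error from above; the set of factors is an assumption):

def rescaling_sweep(X, Y, Y_test, factors=(0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0)):
    """Test error after rescaling the latent data matrix by each factor s."""
    return {s: test_error(Y_test, s * X, Y) for s in factors}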
Modeling Top-down influences: ’Implicit Clustering’
In 4.2.1 the effects that different optimization constraints and regularization terms have on the resulting latent space realization are mentioned, suggesting that these can be used to include some kind of top-down knowledge into this unsupervised method. In fact, the introduction of top-down influences into methods from UL is an important and emerging topic. In particular, research in cognitive science, especially cognitive modeling, is concerned with questions regarding the integration of bottom-up and top-down processing. The importance of this integration is increasingly agreed upon, and its occurrence in human brain functioning is common ground (see [GSSSK00] or [SKK], for example). In the context of Unsupervised Learning, constraining the learning process by introducing some kind of knowledge provides a step in this direction.
Although an evaluation of the potential of the UKR method in this context goes beyond the scope of this thesis, an example of the facility to include top-down knowledge through the use of appropriately tailored optimization constraints shall be given here. Figure 4.7 shows a dataset consisting of two 'half circles'. Fitting a model to such a disrupted manifold is not trivially possible for the known algorithms for dimensionality reduction; [RS00], for example, state that it is an open question how to deal with non-uniformly sampled data. However, including knowledge of the partitioning structure into the learning procedure through the optimization constraints might simplify the task. This has been tried by using constrained optimization (see 4.18), with a nonlinear optimization constraint (4.26) forcing the latent variable values to assemble in two groups. The constraint parameter has been initialized to a small value and increased by a factor of two after every update of the latent variable realizations. The second plot from the right shows the effect of generalizing $\mathbf{g}$ to new observable space vectors from the same distribution, and the rightmost plot the
Figure 4.7.: 'Implicit Clustering' (from left to right): partitioned dataset; latent space realization; projection of a test data set onto the resulting manifold; visualization of the regression manifold obtained from sampling latent space.
effect of generalizing $\mathbf{f}$ to new latent space elements, showing in particular the effect of sampling along the 'gap', which leads to an approximately linear connection of the two adjacent ends of the two 'half-circles'. In forcing the latent vectors to assemble in two groups, the procedure may be interpreted as an 'implicit' latent space clustering.
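The precise form of the constraint is secondary; as a purely illustrative example (not the constraint used in the thesis), a function that becomes small only when every latent point lies close to one of two centres has the desired grouping effect:

def two_group_constraint(X, c1, c2):
    """Illustrative two-group constraint: the mean squared distance of each
    latent point to the nearer of two fixed centres c1, c2."""
    d1 = np.sum((X - c1) ** 2, axis=1)
    d2 = np.sum((X - c2) ** 2, axis=1)
    return np.mean(np.minimum(d1, d2))

# Tightening the upper bound on this function during optimization forces the
# latent variable realizations to assemble around the two centres.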
4.3. Latent Space Error Minimization
By exchanging the roles of the latent and the observable variables, regarding the former as outputs and the latter as regressors, the learning problem can also be defined as finding a suitable latent variable realization for every given data vector. In this case the Nadaraya-Watson estimator is deployed to directly compute the regression of $\mathbf{X}$ on $\mathbf{Y}$ [Mei03]:

$$\mathbf{x} \;=\; \mathbf{g}(\mathbf{y}) \;=\; \frac{\sum_{i=1}^{N}\mathbf{x}_i\,K(\mathbf{y}-\mathbf{y}_i)}{\sum_{j=1}^{N}K(\mathbf{y}-\mathbf{y}_j)}\,. \qquad (4.27)$$

Measuring the error in latent data space in this case leads to the objective function

$$E_{lat}(\mathbf{X}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\Bigl\|\mathbf{x}_i - \frac{\sum_{j}\mathbf{x}_j\,K(\mathbf{y}_i-\mathbf{y}_j)}{\sum_{k}K(\mathbf{y}_i-\mathbf{y}_k)}\Bigr\|^{2} \qquad (4.28)$$
$$\;=\; \frac{1}{N}\,\bigl\|\mathbf{X}\,\mathbf{Q}\bigr\|_{F}^{2} \qquad (4.29)$$

with

$$q_{ij} \;=\; \frac{K(\mathbf{y}_j-\mathbf{y}_i)}{\sum_{k}K(\mathbf{y}_j-\mathbf{y}_k)} \;-\; \delta_{ij}\,, \qquad (4.30)$$

which defines a quadratic form in the latent variables and can therefore be solved by a spectral decomposition, as detailed below. The resulting UKR variant will be denoted lUKR in the following.
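The closed-form solution via spectral decomposition is detailed later; as an illustration of the general recipe only (a sketch under the assumption of an orthonormality constraint on the rows of the latent matrix, not taken from the thesis), the embedding can be read off the bottom eigenvectors of $\mathbf{Q}\mathbf{Q}^{\mathsf T}$:

def lukr_embedding(Y, q, h):
    """Latent-space UKR sketch: build Q from the data-space kernel and take the
    eigenvectors of Q Q^T with the smallest non-trivial eigenvalues as the
    q-dimensional embedding (the constant eigenvector is discarded)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq / h ** 2)             # Gaussian kernel, bandwidth h
    P = K / K.sum(axis=1, keepdims=True)       # row-stochastic smoother
    Q = P.T - np.eye(len(Y))                   # column j reconstructs x_j
    w, V = np.linalg.eigh(Q @ Q.T)             # eigenvalues in ascending order
    return V[:, 1:q + 1].T                     # latent matrix X, shape (q, N)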
Note that, while the UKR variant described in the previous section is based upon the simultaneous, and in fact associated, determination of latent variable realizations and a model for $\mathbf{f}$, here the determination of the latent data matrix is connected to, and therefore leads to, an implicit estimate of the backward mapping $\mathbf{g}$ instead. This has important implications regarding generalization: since no model for $\mathbf{f}$ is learned, generalization of the regression function to new latent space elements is not available. As a consequence, generalization of the backward mapping to new observable space elements by projection as in 4.11 is not possible either; the implicitly learned model for the backward mapping remains the only alternative. However, as indicated in 2.4, this variant can be problematic