Unsupervised Kernel Regression for Nonlinear Dimensionality Reduction

Diploma thesis (Diplomarbeit) at the Faculty of Technology, Bielefeld University

February 2003

Roland Memisevic

Supervisors:

Prof. Helge Ritter
Universität Bielefeld
AG Neuroinformatik
Universitätsstr. 25
33615 Bielefeld

Dr. Peter Meinicke
Universität Bielefeld
AG Neuroinformatik
Universitätsstr. 25
33615 Bielefeld

Contents

    1. Introduction
        1.1. Dimensionality Reduction
        1.2. Nonlinear Dimensionality Reduction / Overview
        1.3. Conventions and Notation

    2. Unsupervised Learning as Generalized Regression
        2.1. Conventional Regression
        2.2. Unsupervised Regression
        2.3. Optimization
        2.4. Projection Models
            2.4.1. Principal Axes
            2.4.2. Principal Curves
            2.4.3. Principal Points
            2.4.4. Local Principal Axes
        2.5. Generative Models

    3. Spectral Methods for Dimensionality Reduction
        3.1. Linear Models
            3.1.1. Principal Component Analysis
            3.1.2. Multidimensional Scaling
        3.2. Nonlinear Models
            3.2.1. Locally Linear Embedding
            3.2.2. Isomap
            3.2.3. Kernel PCA

    4. Unsupervised Kernel Regression
        4.1. Unsupervised Nonparametric Regression
            4.1.1. The Nadaraya Watson Estimator
            4.1.2. Unsupervised Nonparametric Regression
        4.2. Observable Space Error Minimization
            4.2.1. Optimization
            4.2.2. Experiments
        4.3. Latent Space Error Minimization
            4.3.1. Optimization
            4.3.2. Choosing the Observable Space Kernel Bandwidth
        4.4. Combination of Latent and Observable Space Error Minimization
            4.4.1. Optimization
            4.4.2. Experiments

    5. Applications
        5.1. Visualization
        5.2. Pattern detection
        5.3. Pattern production

    6. Conclusions

    A. Generation of the Toy Datasets

1. Introduction

This thesis investigates the recently proposed method of 'Unsupervised Kernel Regression' (UKR). The theoretical placement of this method as a means to pursue Nonlinear Dimensionality Reduction (NLDR) is analyzed, the technicalities involved in its practical implementation are inspected, and its applicability to real world problems is explored.

The UKR method stems from the area of Machine Learning, which is concerned with the development of algorithms that discover patterns in data. More specifically, relying on the key idea of letting a system learn from examples what is important for a specific task, this area develops methods that help to classify, detect, manipulate and produce patterns in a wide range of domains. These methods are not only of increasing practical interest, as they may be used to cope with the ever growing amount of data available in electronic form today, but they also play an important role in the area of Artificial Intelligence, where they act as models of those mechanisms that lie at the heart of our own cognitive capabilities.

The methods developed in the area of Machine Learning can be divided into two broad classes. Those belonging to the first are concerned with generalizing knowledge presented by means of examples to new, unseen data. Inspired by biological learning, where any kind of knowledge residing in the examples needs to be pointed out by a teacher who corrects and thereby adjusts the learning system, the respective area is generally referred to as Supervised Learning (SL). It is contrasted and complemented by Unsupervised Learning (UL), which aims at developing systems that, in the absence of prior knowledge, automatically discover meaningful information hidden in the example data. These methods thereby account for the variability in the data and provide alternative representations. Besides providing a kind of preprocessing that is often crucial to simplify a subsequent SL task, such systems give rise to many further applications, some of which will be sketched throughout. It is this second class of methods that UKR belongs to.

The two main tasks of UL can be defined as dimensionality reduction and density estimation. The UKR method may be cast in two distinct ways, leading to a variant that pursues dimensionality reduction and a second variant that includes a kind of density estimation. This thesis examines only the first variant; brief references to the closely related second method will be made at the appropriate places.

    1.1. Dimensionality Reduction

Generally, methods for dimensionality reduction discover more compact representations of their input data, while at the same time trying to keep the usually resulting information loss at a minimum. These methods thereby minimize the storage space or bandwidth required for saving or transmitting the data, which gives rise to some of their most widespread applications. Additionally, the resulting representations are often expected to capture the meaning inherent in the data more explicitly. This provides the basis for applications such as denoising and visualization, and it has also led to an increasing interest in these methods in computational neuroscience, where it is common ground that many higher level cognitive capabilities are not possible without some kind of dimensionality reduction.

The basis for the reasoning adopted in a large body of methods in Machine Learning in general, and in dimensionality reduction virtually exclusively, is to represent the input and output data of these methods as sets of real valued vectors. For some given set of input vectors, dimensionality reduction then amounts to computing an equally sized set of output vectors of lower dimensionality that fits the meaning of the input data set as closely as possible. Since the lower dimensional representations in many circumstances can be thought of as representing the 'real' or 'original' meaning of the data more closely than the (often noisy) input data, they will formally be denoted by the letter x, whereas the vectors that are the input to the respective algorithm will be written y, although the converse notation can often be found in the literature.

For a given set of N input vectors y_i \in \mathbb{R}^d, i = 1, \dots, N, the problem of dimensionality reduction is then defined as that of finding a corresponding set of output vectors x_i \in \mathbb{R}^q with q < d, a mapping f : \mathbb{R}^q \to \mathbb{R}^d and a mapping g : \mathbb{R}^d \to \mathbb{R}^q such that for all i = 1, \dots, N

    g(y_i) = x_i,   (1.1)

    f(x_i) = \hat{y}_i \approx y_i   (1.2)

(see e.g. [CG01]). The function f will also be referred to as 'forward mapping' or 'coding function' and g as 'backward mapping' or 'decoding function.' The output variables are usually referred to as 'scores' or 'features' in the literature. Here, the latter will additionally be denoted 'latent space realizations,' or simply 'latent variables,' with reference to chapter 2. Using the common Machine Learning jargon, the process of determining the scores and estimating models for the involved functions from the input data will be referred to as training, and the set of input data as training data. Depending on the application at hand, not all of the three sub-problems of dimensionality reduction necessarily need to be solved. If, for example, the aim of the dimensionality reduction task is to obtain a representation amenable to convenient visualization of a given dataset, only this representation, in other words only a set of suitable scores, is needed. If the purpose is noise reduction or any kind of pattern detection, models for both the coding and decoding function become necessary in addition. Only a model for f is demanded if any kind of pattern production is aimed at.

In those cases in which a model for f or g is needed, the concept of generalization becomes crucial. The problem of generalization refers to the fact that usually only a finite sized training data set is available for adapting a model, which shall afterwards be applied to new data not in this set. The optimization of the expected performance on the unseen data is the actual objective of the learning task. As a means to avoid adapting to noise present in the input data, which obviates good generalization and is usually referred to as overfitting, the 'flexibility' of the model is generally restrained by some kind of complexity control. The determination of a suitable complexity control, referred to as model selection in the literature, is then conducted along with the training of the actual model and has to make use of the available data in some way. A widely used approach to model selection, which plays an important role in this thesis, is cross validation. It denotes the method of partitioning the available data into training and test sets, using the first to adapt a model of some specific complexity and the second to assess the resulting performance and adjust the complexity as required. An iteration of this procedure using different training set/test set partitionings may be used to improve the reliability of the overall outcome. In the special case of the training set comprising N - 1 elements and the test set 1 element, giving rise to N train/test iterations, this is referred to as leave-one-out cross validation.

Generally, the notion of generalization is closely connected to the notion of a test set and the error that a trained model gives rise to on this test set. Since in UL a test set only contains input elements, the term generalization here usually concerns some projection error in the input space. However, following [RS00], in this thesis the term generalization will be used in a somewhat broader sense and will denote the general problem of applying the forward or the backward mapping to new input or output elements.
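To make the procedure concrete, the following minimal NumPy sketch (an illustration added here, not material from the thesis; the polynomial degree stands in for a generic complexity parameter and the noisy sine data are assumed) computes the leave-one-out cross validation error for several model complexities:

```python
import numpy as np

def loo_cv_error(X, Y, degree):
    """Leave-one-out CV error for a polynomial model of a given degree
    (the degree plays the role of the complexity parameter)."""
    N = len(X)
    errors = []
    for i in range(N):
        train = np.arange(N) != i                               # hold out example i
        coeffs = np.polyfit(X[train], Y[train], degree)         # fit on the remaining N - 1 points
        errors.append((np.polyval(coeffs, X[i]) - Y[i]) ** 2)   # test on the held-out point
    return np.mean(errors)

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40))
Y = np.sin(X) + 0.1 * rng.standard_normal(40)

# Model selection: pick the complexity with the smallest leave-one-out error.
for degree in (1, 3, 5, 7):
    print(degree, loo_cv_error(X, Y, degree))
```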

1.2. Nonlinear Dimensionality Reduction / Overview

The oldest and best understood method for dimensionality reduction is Principal Component Analysis (PCA), which is based on the spectral decomposition of the data covariance matrix, as described in detail below. As a linear model, PCA has some important advantages over many of the nonlinear models discussed in this thesis, in particular with regard to generalization.

Many nonlinear generalizations of PCA have been proposed. A broad class of these nonlinear models, which will be referred to as 'projection models' in the following and which in particular capture some of the generalization properties of PCA, can be represented in a unified way within the 'Generalized Regression Framework.' This framework also builds the starting point for the derivation of UKR. This general framework for UL shall be delineated in chapter 2. The second class of methods in UL, those that include some kind of density estimation as described above, will be referred to as 'generative models' in the following. These models are also captured within this framework. Since the generative part of the generalized regression framework also builds the starting point for a generative variant of UKR (which is not the subject of this thesis, however), it will also be sketched in chapter 2.

Besides these methods there is a second broad class of methods, proposed within the last years, which are not captured within this general framework and might therefore be conceived of as heuristic methods. Virtually all of these rely on a spectral decomposition of some data proximity matrix, which in particular gives rise to an efficient computation of latent space realizations of the input data. As these nonlinear spectral methods, in contrast to PCA, do not allow for a straightforward estimation of the involved mappings, they have specific problems with regard to generalization. These kinds of methods will be described in detail in chapter 3.

The UKR method for NLDR can be posed in two ways, giving rise to models belonging to the 'projection' and to the 'spectral' class, respectively. Both will be described in chapter 4. Chapter 4 will also present practical considerations regarding their implementation in detail and illustrate them with several data sets. In addition, issues regarding generalization will be dealt with in some detail, and in particular a solution to combine the generalization capability of the projection models with the efficiency of the spectral models will be proposed.

In chapter 5 some applications of UKR as a method to perform nonlinear dimensionality reduction will be presented, focusing in particular on the different prospects arising from the latent space realizations, the applicability of f and the applicability of g.

Chapter 6 gives a review and summary of the issues dealt with in this thesis.

    1.3. Conventions and Notation

In the following, scalar and vector valued variables will be denoted by lowercase italic letters, e.g. x. Matrices will be denoted by uppercase italic letters, e.g. X. Only real valued vectors and matrices will be used. The set of q-dimensional latent vectors x_i and the set of d-dimensional observable vectors y_i, i = 1, \dots, N, will be represented by the q \times N matrix X and by the d \times N matrix Y, respectively.

A vector of ones will be denoted 1, a vector of zeros 0. In the cases where these notations are used, the dimensionality of these vectors will always be obvious from the context. The mean of some dataset represented, for example, by the matrix Y is defined as \frac{1}{N} Y 1. The dataset Y will be said to be mean centered if it holds that Y 1 = 0.

2. Unsupervised Learning as Generalized Regression

The UKR method investigated in this thesis arises naturally as a nonparametric instance of the 'Generalized Regression' framework for Unsupervised Learning [Mei00]. In this chapter this framework is briefly reviewed, an overview of the two kinds of models it gives rise to, Projection Models and Generative Models, is given, and it is shown how some of the known algorithms in Unsupervised Learning are reflected within this framework.

    2.1. Conventional Regression

The purpose of regression is to model a functional relationship between random variables that is assumed to be made up of a systematic part and some unpredictable, additive noise. More precisely, let f denote a function that maps an input random vector x \in \mathbb{R}^q onto an output random vector y \in \mathbb{R}^d, and let \epsilon \in \mathbb{R}^d denote some (zero mean) random vector that corresponds to the unpredictable noise. The relationship to be modeled is then assumed to be given by

    y = f(x) + \epsilon.   (2.1)

To this end an approximation f^* to f is chosen from a set of candidate functions with respect to the objective of minimizing the expected prediction error:

    f^* = \arg\min_{f} E\left[ \| y - f(x) \|^2 \right].   (2.2)

Under appropriate circumstances the function f^* that satisfies this demand can be shown to be given by the conditional expectation [HTF01]:

    f^*(x) = E(y \mid x).   (2.3)

It is referred to as the regression function and its range as the regression manifold. In practice, f^* needs to be estimated from finite data sets for the input and output variables. One way this can be achieved is by replacing 2.2 with the empirical prediction error

    \hat{E}(f) = \sum_{i=1}^{N} \| y_i - f(x_i) \|^2   (2.4)

and minimizing this functional with respect to the function parameters. The input variables in this case are practically treated as fixed parameters instead of as random vectors.

Another option is to take into consideration a parameterized model p(y \mid x; \theta) of the conditional probability density function of the output variables, given the input variables, for some parameter vector \theta. Under the assumption that the sample elements are drawn independently and all from the same distribution, the product density function specifies the sample probability density, so that the estimation of f can be realized by maximizing it with respect to the function parameters. In practice, equivalently, the negative log-likelihood

    L(\theta) = - \sum_{i=1}^{N} \ln p(y_i \mid x_i; \theta)   (2.5)

is minimized, which gives rise to a learning principle generally referred to as maximum likelihood.

Yet another option, which will not be used any further in this thesis but is mentioned here for the sake of completeness, is to regard the parameters as random variables themselves and learning as an update of their probability distribution, known as Bayesian learning [HS89].

    2.2. Unsupervised Regression

The task of UL can be approached by utilizing a modified version of the regression model. As detailed in [Mei00] the difference lies in the usage of the input variables. In the supervised case regression amounts to the estimation of a functional relationship utilizing a sample set for the input and their related output variable realizations. In UL the input variable realizations are conceived of as missing and therefore need to be estimated together with the functional relationship. The distinction can be expressed by referring to the input variables in an unsupervised setting as latent variables¹.

¹The term latent variable has a longstanding history in statistical modeling and is closely related to the way it is used here. The informal definition given here is completely sufficient and self-contained with regard to the way this term will be used in the following.

With regard to the practical learning task two important differences to the supervised case arise from the use of latent variables, the first of them affecting the definition of the learning problem: Since the input variables are not given in advance, one has to decide on a suitable domain for them. For that purpose several distinctions of the type of latent variables have to be taken into account, all leading to different types of regression manifolds. An important distinction is between deterministic and random latent variables, leading to models referred to as projection models and generative models, respectively. Another distinction of the latent variable types is between continuous and discrete ones. Together with the option of choosing a class of candidate functions, where in particular the distinction between linear and nonlinear functions is of interest, the two dimensions along which a classification of the latent variables is possible allow for the formulation of a wide spectrum of models of UL known from the literature. Some of these will be sketched in the subsequent sections.

The second novelty of unsupervised regression as compared to the supervised case regards the necessity to make use of some kind of learning scheme. Since the more ambitious goal of finding latent variable realizations in addition to parameters defining a suitable functional relationship needs to be tackled here, one has to conceive of a way to accomplish these tasks simultaneously. For deterministic latent variables a generally applicable approach to achieve this twofold objective is the 'Projection-Regression-Scheme.' It is obtained from an iteration of a 'Projection' step, used to find optimal values for the latent variables given some values for the function parameters, and a 'Regression' step, used to re-estimate the function parameters while the values for the latent variables are kept constant. This optimization scheme can be thought of as a deterministic analog to the well known EM-algorithm, which can be applied in the case of a generative model. Both projection and generative models, the use of their respective learning schemes, and examples of their applications will be described in detail below.

    2.3. Optimization

As indicated above, within the Generalized Regression framework learning will in general be achieved by an iteration of a minimization of 2.4 or 2.5 with regard to the function parameters and an update of the latent variable realizations. However, the presence of a usually very large number of parameters in UL, owing to the fact that the latent variables need to be estimated here as well, often causes the respective objective functions to be fraught with local minima. Therefore, unless a closed form solution for the concerned functions exists, the success of the learning task depends crucially on the initialization or, if no indication regarding auspicious areas in parameter space is available, on the use of an optimization strategy that helps to avoid or at least to diminish the chance of getting trapped in a local minimum.

The well-known method of Simulated Annealing tries to achieve this by allowing random influences to enter the parameter update in an iterative optimization process, which alleviates the chance of getting stuck in a local minimum. Gradually reducing these random influences, called annealing by analogy to the temperature controlled process of crystal growing, can then cause the probability that the global minimum of the objective function is actually attained to approach one asymptotically, in the limit of an infinitely slow annealing schedule.

An alternative strategy, which will be applied in this thesis, in particular for the approach described in section 4.2 of chapter 4, is the method of Homotopy. This strategy is based on a set of transformations of the original error function into simpler or smoother functions with a smaller number of local minima. Minimization of the original function is then performed by starting with the simplest function present and gradually reducing the degree of smoothing during minimization until the original error function is recovered. Often, a suitable transformation arises automatically from the need to impose a complexity control. Homotopy in this case amounts to gradually releasing the constraints that the complexity control poses. This is in particular the case for the UKR model, as shown later.
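The homotopy idea can be illustrated with a minimal sketch (hypothetical; the smoothed family obtained by adding a quadratic penalty to a multimodal toy function is an assumption standing in for the transformations discussed above): minimization starts on a strongly smoothed function, and the minimizer of each stage initializes the next one while the smoothing is gradually released.

```python
import numpy as np

def objective(x):
    # Toy error function with many local minima (hypothetical stand-in).
    return np.sin(5.0 * x) + 0.1 * x ** 2

def smoothed(x, lam):
    # Homotopy family: for large lam the quadratic term dominates and the function is
    # nearly unimodal; for lam -> 0 the original error function is recovered.
    return objective(x) + lam * x ** 2

def gradient_descent(f, x0, step=1e-2, iters=2000, eps=1e-6):
    x = x0
    for _ in range(iters):
        g = (f(x + eps) - f(x - eps)) / (2 * eps)   # numerical gradient for brevity
        x = x - step * g
    return x

x = 3.0                                        # deliberately poor initialization
for lam in (10.0, 3.0, 1.0, 0.3, 0.1, 0.0):    # schedule: slowly release the smoothing
    x = gradient_descent(lambda t: smoothed(t, lam), x)
    print(f"lam={lam:4.1f}  x={x: .3f}  E={objective(x): .3f}")
```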

    2.4. Projection Models

By using deterministic latent variables one obtains the class of models that is of particular concern with respect to dimensionality reduction and therefore of special interest for the methods described in this thesis. The latent variables are in this case treated formally as parameters that need to be estimated along with the function parameters. Since the backward mapping is modeled via some kind of projection, these models are generally referred to as Projection Models. In detail, this means that the score that corresponds to an observable data space element is given by its projection index, which is formally defined as that latent space element that yields a minimal reconstruction error under f. The dependency on a particular model for f is often symbolized by writing g_f for the backward mapping.

In the following the function class will be restricted to contain functions of the form:

    f(x) = A \phi(x)   (2.6)

with parameter matrix A and \phi being a vector of basis functions to be specified beforehand. The two aforementioned optimization steps (Projection- and Regression-Step) are then given by:

    x_i = \arg\min_{x} \| y_i - A \phi(x) \|^2, \qquad i = 1, \dots, N   (2.7)

and

    A = \arg\min_{A} \sum_{i=1}^{N} \| y_i - A \phi(x_i) \|^2.   (2.8)
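As an illustration of the two steps (a hypothetical sketch, not the thesis' implementation; the polynomial basis, the grid search used to approximate the projection step, and the toy data are assumptions):

```python
import numpy as np

def phi(x):
    # Fixed vector of basis functions (an assumption for this sketch): 1, x, x^2, x^3.
    return np.stack([np.ones_like(x), x, x ** 2, x ** 3])

rng = np.random.default_rng(1)
N = 200
t = rng.uniform(-1, 1, N)
Y = np.stack([t, t ** 2]) + 0.05 * rng.standard_normal((2, N))   # noisy curve in R^2 (d = 2)

grid = np.linspace(-1.5, 1.5, 301)     # candidate latent values for the projection step
X = rng.uniform(-1, 1, N)              # random initialization of the latent variables

for it in range(20):
    # Regression step (2.8): closed-form least squares for A, given the latent values.
    Phi = phi(X)                                          # (4, N)
    A = Y @ Phi.T @ np.linalg.pinv(Phi @ Phi.T)           # (2, 4)
    # Projection step (2.7): for each y_i pick the grid value minimizing ||y_i - A phi(x)||^2.
    F = A @ phi(grid)                                     # (2, 301) points on the model curve
    d2 = ((Y[:, :, None] - F[:, None, :]) ** 2).sum(axis=0)   # (N, 301) squared distances
    X = grid[np.argmin(d2, axis=1)]
    err = ((Y - A @ phi(X)) ** 2).sum() / N

print("mean reconstruction error:", err)
```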

Note that in terms of the mappings f and g involved in NLDR, optimization using this procedure only directly concerns f, while an optimal g is rather 'plugged in' instead of being adapted. In [Mal98] the importance of this proceeding is pointed out in a comparison of Principal Curves [HS89], which have this property, too, and so called Autoassociative Neural Networks (see [Kra91]), which do not. In particular, the presence of so called ambiguity points, which cause the projection index defined as above to be a discontinuous function, generally lets the latter variant, where the backward mapping is a continuous function, fail to correctly approximate the given dataset. The importance of this finding for the UKR method resides in the fact that this method can be posed in two distinct ways. The first straightforwardly gives rise to a (nonparametric) projection model similar to those described in this chapter, in particular with the backward mapping defined as proposed above, while the second variant does not and can empirically be shown to be flawed accordingly. The conclusions to be drawn from this finding will be detailed in chapter 4.

As stated above, several methods for NLDR known from the literature may be formalized within the framework described in this section by varying the latent variable types and the class of candidate functions. In the following, some of the possible decisions on the latent variables and candidate functions and the algorithms they give rise to shall be sketched.


    2.4.1. Principal Axes

By defining the latent variable domain to be \mathbb{R}^q, i.e. using (deterministic) continuous latent variables, and restricting the function class to linear functions:

    f(x) = A x, \qquad A \in \mathbb{R}^{d \times q}, \; x \in \mathbb{R}^q,   (2.9)

the resulting learning method is essentially equal to Principal Component Analysis. Although the Projection-Regression-Scheme could be applied, the special, linear structure in this case gives rise to a closed form solution. Specifically, this is obtained through an eigenvalue decomposition of the input sample covariance matrix and therefore will be described in more detail in chapter 3, where other, in particular recently developed, nonlinear methods with this special property shall also be delineated.

    2.4.2. Principal Curves

Restriction of the latent variable domain to a closed interval [a, b] on the real line, with f defined as

    f(x) = A \phi(x), \qquad x \in [a, b] \subset \mathbb{R},   (2.10)

gives rise to nonlinear (one-dimensional) principal manifolds, or 'principal curves.' In fact, this way a generalization of the Principal Curves model proposed by [KKLZ00], where the curves are modeled by polygonal line segments, is achieved. The restriction of the latent variable domain is necessary here, as in this nonlinear case the absence of such a restriction would result in an interpolation of the training data points.

    2.4.3. Principal Points

By using discrete latent variables and defining

    f(x) = A \phi(x), \qquad x \in \{1, \dots, n\},   (2.11)

    \phi(x) = e_x,   (2.12)

where e_x denotes the x-th canonical unit vector, one straightforwardly obtains a learning method generally known as Vector Quantization. The columns of A then represent prototype or codebook vectors, the estimation of which using the general optimization scheme equals the well known K-means clustering algorithm.

    2.4.4. Local Principal Axes

The use of a mixture of discrete and continuous latent variables together with a linear dependency on the continuous variables can be interpreted as a generalization of the vector quantization approach described above. The principal points are in this case replaced by continuous linear manifolds. This way one obtains a method for nonlinear dimensionality reduction that is known as 'Local PCA' in the literature and that can be used to model nonlinear relationships by resorting to the assumption of local linearity.

The absence of a global coordinate system inherent to this approach, however, involves serious shortcomings, as described, for example, in [TdSL00]. In fact, the LLE method, which will be introduced in 3.2.1, originally arose from a series of attempts to provide a coordination of locally linear models in order to overcome these shortcomings (see also [RS00] and [VVK02], e.g.).

    2.5. Generative Models

The kinds of models in UL that include some kind of density estimation are referred to as generative models [Mei00]. These models arise automatically from the generalized regression framework by regarding the latent variables as random variables with nontrivial distributions. Since the UKR model can also be formulated in a way that yields a generative variant, these kinds of models shall be sketched here in short.

If one uses random latent variables, optimization by minimization of 2.4 is no longer possible, as the interpretation of the latent variables as parameters no longer applies. Instead, the maximum likelihood approach (or some kind of Bayesian learning, which is omitted in this thesis, as stated) needs to be used. Furthermore, the Projection-Regression optimization scheme is no longer applicable. It is replaced by an analogous scheme, known as the EM-algorithm, as mentioned above. The detailed description of this algorithm shall be omitted here, as there exists a large body of literature on this topic (see, e.g., [Bil97] or [Cou96]). In short, the resulting optimization scheme closely resembles the PR-scheme, with the update of the latent variable realizations being replaced by an update of their probability distribution. In the following a few models and some novelties that arise from the use of random latent variables will be sketched.

Assuming spherical Gaussian noise and a linear dependency on Gaussian latent variables one obtains a generative counterpart of the linear model given in 2.4.1. For a predefined latent space dimensionality q, the generative version yields an equal solution. However, as an extension to the projection model, by making use of the special role the noise variance plays in the generative case, a dimensionality estimating version can be obtained by predefining the noise variance.

For non-Gaussian latent variables an estimation scheme for the well-known Independent Component Analysis ([Hyv99]) can easily be derived, making use of the nontrivial latent variable distribution by incorporating the assumption that the latent variables are statistically independent.

As a generative version of the principal points model (2.4.3), a method for density estimation generally known as 'mixture of Gaussians' arises straightforwardly from a (spherical) Gaussian noise assumption. In addition, probabilistic versions of the clustering algorithm and of the local PCA model become possible, for example.

3. Spectral Methods for Dimensionality Reduction

One broad class of methods for dimensionality reduction that differs from most of the approaches delineated in the previous chapter is the class of spectral methods. The main difference is that these methods do not deploy any iterative optimization scheme. Instead, they rely on an objective function that has an efficiently computable global optimum. The by far most important and widely used instance is Principal Component Analysis. Recent developments regarding the applicability of spectral methods to nonlinear learning problems, however, have led to a rise in their popularity, too.

The point of contact for practically all these methods is that they rely on an optimality criterion that can be posed as a quadratic form. Therefore, these methods rely on some variant of the Rayleigh-Ritz theorem, which states, in short, that for a symmetric N \times N matrix Q the minimizer (maximizer) of tr(U^\top Q U) with respect to the N \times q matrix U, subject to U^\top U = I, is the matrix U containing as columns the q eigenvectors corresponding to the q smallest (largest) eigenvalues of Q (see, e.g., [HJ94] or [Jol86]). In other words, all these methods rely on an eigenvalue decomposition (EVD) of some matrix Q, hence the name 'spectral methods.' A further commonality is that Q is defined as some kind of data affinity matrix of a given dataset in all cases, as recently pointed out by [BH02].

In the following the different spectral approaches to dimensionality reduction will be described and their specific characteristics, in particular in view of the spectral UKR variant to be introduced in 4.3, shall be pointed out, beginning with the linear variants and PCA. A common drawback of the nonlinear generalizations of PCA is that, although these inherit the efficiency of their linear counterpart, they do not share the same merits in terms of generalization, as will be described in detail later. This entails particular problems with regard to potential applications, as pointed out in chapter 1.
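The statement is easy to check numerically; the following small sketch (a hypothetical illustration) compares tr(U^T Q U) for the eigenvector solutions with the value obtained for a random orthonormal U:

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 50, 3
A = rng.standard_normal((N, N))
Q = A @ A.T                                   # symmetric test matrix

evals, evecs = np.linalg.eigh(Q)              # eigenvalues in ascending order
U_min = evecs[:, :q]                          # eigenvectors of the q smallest eigenvalues
U_max = evecs[:, -q:]                         # eigenvectors of the q largest eigenvalues
U_rand, _ = np.linalg.qr(rng.standard_normal((N, q)))   # random orthonormal columns

tr = lambda U: np.trace(U.T @ Q @ U)
print(tr(U_min), "<=", tr(U_rand), "<=", tr(U_max))     # minimum <= random <= maximum
```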

    3.1. Linear Models

    3.1.1. Principal Component Analysis

Principal Component Analysis (PCA) can be regarded as the oldest and most well-known method for dimensionality reduction¹. It can be derived from the objective of maximizing the variance of the projection of a given d-dimensional dataset onto a q-dimensional subspace. The quadratic form that corresponds to this objective is simply the (symmetric and positive definite) sample covariance matrix ([Jol86]). Precisely, assuming Y to be mean centered, let the spectral decomposition of Y Y^\top be

    Y Y^\top = V \Lambda V^\top

and let V_q denote the matrix containing the normalized eigenvectors corresponding to the q largest eigenvalues as columns. The matrix X of the latent space vectors that meet the maximal variance objective is then given by X = V_q^\top Y. Similarly, generalization to some new observable space element y is performed simply by left-multiplication with V_q^\top, while the application of the forward mapping to some new latent space element is performed by left-multiplication with V_q. In other words, here one has the unique case of a method that gives rise to the direct estimation of the involved functions, as the latent space realizations are in fact obtained by subsequently applying the obtained model for g to the given input dataset.

¹Although the derivation of PCA draws from much broader (but related) objectives, here it will be treated as a method for dimensionality reduction only.
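In code the whole procedure takes only a few lines; the following sketch (a hypothetical illustration that follows the convention of storing Y as a d x N matrix) computes the latent realizations and shows how new elements are handled:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, N = 5, 2, 300
Y = rng.standard_normal((d, N))              # toy data, one column per observation
Y = Y - Y.mean(axis=1, keepdims=True)        # mean centering (Y 1 = 0)

evals, V = np.linalg.eigh(Y @ Y.T)           # spectral decomposition of Y Y^T
Vq = V[:, np.argsort(evals)[::-1][:q]]       # eigenvectors of the q largest eigenvalues

X = Vq.T @ Y                                 # latent space realizations (q x N)
y_new = rng.standard_normal((d, 1))
x_new = Vq.T @ y_new                         # backward mapping g: left-multiply by Vq^T
y_rec = Vq @ x_new                           # forward mapping f: left-multiply by Vq
print(X.shape, x_new.ravel(), y_rec.ravel())
```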

    3.1.2. Multidimensional Scaling

The method of Multidimensional Scaling (MDS) deviates somewhat from the other methods for dimensionality reduction exposed in this thesis, because it is not used as a method to determine a low dimensional representation of a high dimensional dataset, but obtains some dissimilarity measure as input instead. In fact, it thereby fails to meet the general definition given in chapter 1. It will nevertheless be described here in short, as it is generally classified as a method for dimensionality reduction in the literature and makes up a crucial step of the Isomap algorithm portrayed below.

In detail, MDS addresses the problem of finding a (usually 'low' dimensional) data set from a set of pairwise dissimilarities, such that the distances between the resulting data points approximate the dissimilarities as closely as possible. Given a dissimilarity matrix² D with an EVD D = V \Lambda V^\top, the optimal data set is given by X = \Lambda_q^{1/2} V_q^\top, where \Lambda_q and V_q contain the q largest eigenvalues and the corresponding eigenvectors. As stated, the abandonment of the need to use a data set from some euclidean space as input gives rise to applications that go beyond generic dimensionality reduction. Often, some kind of subjective similarity judgment is used as a basis to obtain D, which allows for the visualization of 'psychological spaces' [Krz96], for example. If D contains the pairwise euclidean distances of some real dataset, however, the solution is the same as that from PCA.

²Generally, the matrix needs some simple preprocessing, which will not be detailed here.
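A minimal sketch of this computation (hypothetical; the 'simple preprocessing' of the footnote is assumed here to be the usual double centering of the squared dissimilarities):

```python
import numpy as np

def classical_mds(D, q):
    """Classical MDS. D is an (N, N) matrix of pairwise dissimilarities."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    evals, V = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:q]            # q largest eigenvalues
    L = np.sqrt(np.maximum(evals[idx], 0.0))     # clip tiny negative eigenvalues
    return (V[:, idx] * L).T                     # q x N latent configuration

# For euclidean distances of a real dataset the result coincides with PCA (up to rotation/sign).
rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 100))
D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
print(classical_mds(D, 2).shape)
```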

    3.2. Nonlinear Models

    3.2.1. Locally Linear Embedding

A method that incorporates an EVD in order to accomplish nonlinear dimensionality reduction and that has attracted a great deal of attention recently is the method of Locally Linear Embedding (LLE) [RS00]. It is essentially based on geometrical intuitions, as it attempts to determine a lower dimensional embedding of a given dataset that retains local neighborhood relations between datapoints by making use of the following three-step algorithm:

1. for each y_i define N_i, the index set of its k nearest neighbors

2. set W_{ij} = 0 if j \notin N_i and minimize with respect to the remaining W_{ij} the objective:

    E(W) = \sum_{i=1}^{N} \| y_i - \sum_{j} W_{ij} y_j \|^2 \quad \text{s.t.} \quad \sum_{j} W_{ij} = 1   (3.1)

3. minimize with respect to the x_i the objective:

    \Phi(X) = \sum_{i=1}^{N} \| x_i - \sum_{j} W_{ij} x_j \|^2   (3.2)

    = tr(X M X^\top)   (3.3)

with

    M = (I - W)^\top (I - W).   (3.4)

The purpose of the second step is to discover those weights that give rise to an optimal reconstruction of each observable space datapoint from its neighbors. This step requires solving a constrained least squares fit and has a closed form solution. In the case of k > d some kind of regularization heuristic is necessary, however. In the third step those latent variable realizations are then sought that minimize the average (latent space) error when reconstructed from their neighbors with the same weights as their observable space counterparts. Writing 3.2 as the quadratic form 3.3 makes it obvious that, by requiring X 1 = 0 and (1/N) X X^\top = I, the solution is given by the eigenvectors belonging to the q (second to) smallest eigenvalues of M = (I - W)^\top (I - W). Particularly advantageous here is that M is sparse, allowing for an efficient solution.

Overall, in tending to preserve the weights with which a datapoint is reconstructed from its neighbors under the sum-to-one constraint, the authors of the LLE algorithm state that it tends to preserve exactly those properties of each neighborhood that are invariant under rescalings, rotations, and translations. Hence, the algorithm provides a mapping from observable to latent data space that tends to be linear for each neighborhood. In other words, LLE determines a low dimensional global representation of a manifold embedded in a higher dimensional space by assuming it to be arranged in linear patches. This assumption also defines the tender spot of the LLE algorithm, since a violation of it lets the algorithm fail. This is in particular the case if noise is present in the data.
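The three steps translate almost directly into code; the following sketch is a hypothetical, unoptimized implementation (dense matrices, a simple regularization constant for the case k > d) rather than the authors' reference code:

```python
import numpy as np

def lle(Y, k, q, reg=1e-3):
    """Locally Linear Embedding. Y: (d, N) data matrix, k neighbors, q latent dimensions."""
    d, N = Y.shape
    D2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)       # pairwise squared distances
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[1:k + 1]                         # step 1: k nearest neighbors
        Z = Y[:, nbrs] - Y[:, [i]]                                 # neighbors centered on y_i
        C = Z.T @ Z
        C += np.eye(k) * reg * np.trace(C)                        # regularization for k > d
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                                   # step 2: sum-to-one weights
    M = (np.eye(N) - W).T @ (np.eye(N) - W)                        # step 3: quadratic form 3.4
    evals, V = np.linalg.eigh(M)
    return V[:, 1:q + 1].T * np.sqrt(N)                            # drop the constant eigenvector

# Toy example: a noisy curve embedded in R^3.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 4 * np.pi, 400))
Y = np.stack([np.cos(t), np.sin(t), 0.2 * t]) + 0.01 * rng.standard_normal((3, 400))
print(lle(Y, k=10, q=1).shape)
```

A sparse eigensolver would exploit the sparsity of M mentioned above; the dense eigendecomposition is used here only to keep the sketch short.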

    Generalization

The authors of the LLE algorithm propose two ways of generalizing a trained model to new latent or observable space elements. The first, non-parametric, approach is in principle a straightforward re-application of the main procedures that lie beneath the learning algorithm itself: for a new latent or observable vector, generalization of f or g, respectively, is achieved by (i) identifying the new datapoint's neighbors among the latent variable or training data set, respectively, (ii) determining corresponding reconstruction weights, and (iii) evaluating the function as a linear combination of the found neighbors with their corresponding weights.

The theoretical justification for this kind of generalization lies in the same geometrical intuitions that the LLE algorithm itself is based on. In particular, the assumption of local linearity is crucial for this approach to generalization to work properly, so that the dependency on noise-free data applies here in the same manner as above. The fact that the concept of generalization is most essentially based on the presence of noise, however, thereby represents a serious problem for this approach.

The second, parametric, approach to generalization that is proposed by the authors is to train a supervised model on the input-output data pairs that are available after the application of LLE. The problems affecting non-parametric generalization can be circumvented this way. The resulting overall algorithm consists of two separate parts, an unsupervised and a supervised part, with two unrelated objectives: optimal reconstruction of the latent data elements from their neighbors for the first, and minimization of the expected prediction error for the second (see 2.1). A final model assessment is therefore based on the prediction error in the space of the observable variables. The problem is that it is not at all clear to what extent complying with the first objective serves the second. This is in clear contrast to the projection models described in 2.4, which estimate latent variable realizations and a model for the forward mapping at the same time, both with the objective of minimizing the observable space error.

    3.2.2. Isomap

A method for NLDR that attracted similar attention to LLE is the Isomap (isometric feature mapping) algorithm proposed by [TdSL00]. It can be regarded as a heuristic approach, too, but it is based on completely different intuitions than LLE. Isomap similarly emanates from the observable data being distributed along a low-dimensional manifold, but abstains from the assumption of linear patches. It seeks to find a low-dimensional embedding that preserves distances between datapoints as measured along the manifold, so called 'geodesic' (locally shortest) distances. The crucial point in this algorithm is therefore to derive these geodesic distances from the datapoints (more precisely, their euclidean distances) in an efficient way. To achieve this the authors propose the three step algorithm of (i) computing a topology-preserving network representation of the data, (ii) computing the shortest-path distance between any two points, which can efficiently be done using dynamic programming, and (iii) determining the low-dimensional representation that preserves the computed distances as closely as possible, using Multidimensional Scaling (MDS) on these distances.

With regard to generalization, Isomap shares the shortcomings of LLE, because of its likewise rather heuristic quality. In fact, since, in contrast to LLE, not even a heuristic generalization of the involved mappings can be naturally derived from the intuitions the algorithm is based upon, the authors suggest training a supervised model on the obtained completed dataset, resulting in the same shortcomings as noted above.
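A compact sketch of the three steps (hypothetical illustration; the k-nearest-neighbor graph stands in for the topology-preserving network, the shortest paths are computed with a plain Floyd-Warshall recursion, i.e. dynamic programming, and classical MDS is applied as in 3.1.2):

```python
import numpy as np

def isomap(Y, k, q):
    """Isomap sketch. Y: (d, N) data matrix, k neighbors, q latent dimensions."""
    d, N = Y.shape
    D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)   # euclidean distances
    # (i) neighborhood graph: keep edges to the k nearest neighbors only
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]                                  # keep the graph symmetric
    # (ii) geodesic distances via Floyd-Warshall (O(N^3), assumes a connected graph)
    for m in range(N):
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    # (iii) classical MDS on the geodesic distances
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G ** 2) @ J
    evals, V = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:q]
    return (V[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))).T

# Toy example: points along a curved one-dimensional manifold in R^3.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 300))
Y = np.stack([np.cos(t), np.sin(t), 0.1 * t])
print(isomap(Y, k=8, q=1).shape)
```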

    3.2.3. Kernel PCA

Another spectral approach to nonlinear dimensionality reduction, which at first glance resembles the UKR variant to be described in 4.3 with regard to its incorporation of both kernel and spectral methods, is Kernel PCA. A closer look reveals important differences, however.

Kernel PCA stems from a line of research on kernel based methods that is based on the idea of incorporating an implicit (usually highly nonlinear) mapping to some higher dimensional feature space by exploiting the finding that dot products in such a feature space can equivalently be computed using the original data space vectors alone with the help of kernel functions [Bur98], which is often referred to as the 'kernel trick.' In order to apply this idea to Unsupervised Learning, the classical PCA approach may be recast in a form that makes use of inner products only. In [SSM99] the derivation of this formulation is given, resulting in an algorithm to compute principal components in a higher dimensional feature space. Technically this amounts to performing the EVD of the matrix of pairwise evaluated kernel functions.
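A minimal sketch of this computation (hypothetical illustration with a Gaussian kernel; the centering of the kernel matrix in feature space and the usual normalization of the expansion coefficients are included):

```python
import numpy as np

def kernel_pca(Y, q, gamma=1.0):
    """Kernel PCA sketch. Y: (d, N) data, q components, Gaussian kernel exp(-gamma ||.||^2)."""
    N = Y.shape[1]
    D2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-gamma * D2)                          # matrix of pairwise kernel evaluations
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                   # centering in feature space
    evals, A = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:q]
    A = A[:, idx] / np.sqrt(np.maximum(evals[idx], 1e-12))   # normalize expansion coefficients
    return (Kc @ A).T                                # (q, N) feature space projections

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 200))
print(kernel_pca(Y, q=2).shape)
```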

4. Unsupervised Kernel Regression

This chapter describes the method of Unsupervised Kernel Regression as recently introduced by [Mei03]. In a first step the concept of Nonparametric Regression is introduced. Two distinct ways of extending this concept to Unsupervised Learning will then be delineated in the subsequent two sections. In addition, a combination of the objective functions these two variants give rise to, which has proven to be particularly useful with regard to practical considerations, will be proposed in the section that follows.

    4.1. Unsupervised Nonparametric Regression

    4.1.1. The Nadaraya Watson Estimator

Chapter 2 introduced the purpose of regression as that of modeling a functional relationship between variables by choosing that element from a parameterized set of candidate functions that minimizes the empirical prediction error on a training data set, which is asymptotically equivalent to taking the conditional expectation:

    f(x) = E(y \mid x) \quad \text{(repeated)}.   (4.1)

In contrast to 2.1, where the approximation of the regression function has been realized by minimization of the empirical prediction error or maximization of the data log-likelihood, one obtains a nonparametric variant of regression estimation if one aims directly at modeling the conditional expectation. This can be achieved by considering a non-parametric estimator \hat{p} of the joint probability density function p(x, y) of the involved variables, taking into account that 4.1 can be rewritten as:

    f(x) = \frac{\int y \, p(x, y) \, dy}{\int p(x, y) \, dy}.   (4.2)

By utilizing the multivariate Kernel Density Estimator (KDE) to model the joint density [Sco92]:

    \hat{p}(x, y) = \frac{1}{N} \sum_{i=1}^{N} K_X(x - x_i) \, K_Y(y - y_i)   (4.3)

with K_X(\cdot) and K_Y(\cdot) being multivariate kernel functions and \tilde{K}_X(\cdot), \tilde{K}_Y(\cdot) denoting the unnormalized portions thereof, so that with the normalization constants c_X and c_Y

    K_X(x) = c_X^{-1} \tilde{K}_X(x), \qquad \int K_X(x) \, dx = 1   (4.4)

and

    K_Y(y) = c_Y^{-1} \tilde{K}_Y(y), \qquad \int K_Y(y) \, dy = 1   (4.5)

hold, the estimate of the regression function becomes¹

    f(x) = \frac{\sum_{i=1}^{N} K(x - x_i) \, y_i}{\sum_{j=1}^{N} K(x - x_j)}   (4.6)

which is known as the Nadaraya Watson Estimator [Bis96]. In this thesis only the spherical Gaussian

    K(x - x_i) = \exp\left( - \frac{1}{2 h^2} \| x - x_i \|^2 \right)   (4.7)

and the Epanechnikov kernel function

    K(x - x_i) = \begin{cases} 1 - \frac{1}{h^2} \| x - x_i \|^2 & \text{if } \| x - x_i \|^2 < h^2 \\ 0 & \text{otherwise} \end{cases}   (4.8)

will be deployed, both denoted by K(\cdot). The function used will be indicated at the respective places.

The function parameter h determines the kernel bandwidth and provides a means to adjust the model complexity for the Nadaraya Watson Estimator.

¹The symbol K will be overloaded to denote both the observable space and latent space kernel functions, and later on also the matrix of kernel functions. Since the data space kernel functions cancel out in 4.6, here they are only specified with latent space arguments; later on, however, they will be used to accommodate data space arguments analogously.
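As an illustration (not code from the thesis), the estimator 4.6 with the Gaussian kernel 4.7 can be written in a few lines; the one-dimensional latent space and two-dimensional output of the toy example are assumptions made only for concreteness:

```python
import numpy as np

def nadaraya_watson(Xq, Xtr, Ytr, h=1.0):
    """Evaluate f at the columns of Xq. Xtr: (q, N) inputs, Ytr: (d, N) outputs."""
    D2 = ((Xq[:, None, :] - Xtr[:, :, None]) ** 2).sum(axis=0)   # (N, M) squared distances
    K = np.exp(-0.5 * D2 / h ** 2)                               # Gaussian kernel 4.7
    return (Ytr @ K) / K.sum(axis=0)                             # weighted average of the y_i

# Toy example: one-dimensional regressor, two-dimensional output.
rng = np.random.default_rng(0)
Xtr = np.sort(rng.uniform(-3, 3, 100))[None, :]                  # (1, N)
Ytr = np.stack([np.sin(Xtr[0]), np.cos(Xtr[0])]) + 0.1 * rng.standard_normal((2, 100))
Xq = np.linspace(-3, 3, 7)[None, :]                              # query points (1, M)
print(nadaraya_watson(Xq, Xtr, Ytr, h=0.3))
```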

    4.1.2. Unsupervised Nonparametric Regression

As in the parametric case, the transition to unsupervised regression is made by regarding the regressors as latent. In contrast to the parametric case, however, which poses the twofold objective of finding suitable latent variable realizations along with parameters defining a functional relationship, here both objectives are achieved at the same time by merely taking care of finding suitable latent variable realizations, because of the nonparametric nature of the problem. This way the 'double burden' that problems in UL hitherto gave rise to is eliminated, resulting in methods that resemble those from SL, because they depend on the estimation of only one class of parameters.

As pointed out in [Mei03], in the unsupervised case the Nadaraya Watson Estimator may be deployed in two ways. The first is to treat the latent variables in 4.6 as parameters to be estimated. By measuring the observable space error one obtains the objective function dealt with in detail in the next section. The second way is to compute the latent variable realizations by simply applying the Nadaraya Watson Estimator to the observed variables, which are regarded as input in this case. In other words, this variant amounts to computing the nonparametric regression function in the opposite direction. The objective function for this approach is obtained by measuring the latent space error. In that case, a nontrivial coupling of the resulting latent variable realizations has to be accounted for, a problem that can be solved efficiently by a spectral decomposition as described in chapter 3. This variant will be described in 4.3.

    4.2. Observable Space Error Minimization

To obtain a loss function for learning suitable latent variable realizations one might conceive of 4.6 as being parameterized by the latent data matrix X and measure the mean square reconstruction error on the observed variables [Mei03]. The resulting objective function of this UKR variant, denoted oUKR in the following, is given by:

    R(X) = \frac{1}{N} \sum_{i=1}^{N} \left\| y_i - \frac{\sum_{j=1}^{N} K(x_i - x_j) \, y_j}{\sum_{k=1}^{N} K(x_i - x_k)} \right\|^2   (4.9)

    = \frac{1}{N} \| Y - Y P \|_F^2   (4.10)

with

    P_{ji} = \frac{K(x_i - x_j)}{\sum_{k=1}^{N} K(x_i - x_k)}, \qquad i, j = 1, \dots, N.

Since the effect of any variation of the kernel bandwidth could equivalently be caused by a change in the average scale of the latent variable realizations, in the following, if not stated otherwise, the kernel bandwidth will be conceived of as being constant (h = 1) and any influence on the model complexity will be accounted for or effected by the latent variable norms only.

X, and by virtue of 4.6 at the same time f, will then be estimated by minimizing 4.10. Since a minimization without further restrictions on the objective function would drive the latent variable scales to infinity, however, with the reconstruction error for the training data set at the same time approaching zero, it is obvious that some kind of complexity control has to be imposed in order for the optimization problem to be well defined [Mei03]. Practical considerations regarding both minimization and restriction of the model complexity will be delineated in the next section.
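For concreteness, a direct transcription of 4.9 and 4.10 into NumPy (a hypothetical sketch, not the thesis' implementation); the matrix P of normalized kernel weights is formed explicitly, which costs O(N^2):

```python
import numpy as np

def ukr_weights(X, h=1.0):
    """Matrix P from 4.10: column i holds the normalized kernel weights for x_i."""
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # (N, N) latent distances
    K = np.exp(-0.5 * D2 / h ** 2)                            # Gaussian kernel, h = 1 by default
    return K / K.sum(axis=0, keepdims=True)                   # P[j, i] = K(x_i - x_j) / sum_k K(x_i - x_k)

def ukr_objective(X, Y, h=1.0):
    """Observable space reconstruction error R(X) of 4.9/4.10."""
    P = ukr_weights(X, h)
    return np.sum((Y - Y @ P) ** 2) / Y.shape[1]

# Toy example: random latent initialization for a noisy curve in R^2.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(-1, 1, 80))
Y = np.stack([t, t ** 2]) + 0.05 * rng.standard_normal((2, 80))
X = 0.1 * rng.standard_normal((1, 80))
print(ukr_objective(X, Y))
```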

The generalization of a trained model to new latent space or observable space elements is straightforward. Since a model for f is learned along with suitable latent variable realizations, the oUKR method closely resembles the projection models described in 2.4. The only difference is that no iterative training procedure is necessary here, because no parameters need to be estimated. Application of the regression function to new latent space elements is therefore possible simply by plugging these into 4.6. And the backward mapping g can be defined analogously to 2.7:

    g(y) = \arg\min_{x} \| y - f(x) \|^2.   (4.11)

This requires solving a nonlinear optimization problem, which should be initialized suitably. A straightforward choice as an initialization is

    x^{(0)} = \arg\min_{x \in \{x_1, \dots, x_N\}} \| y - f(x) \|^2,   (4.12)

requiring a search over the number of training data points.

From the reasoning laid out so far a generative model is obtained simply by regarding the latent variables as random variables as in 2.5 and maximizing the conditional log-likelihood (see chapter 2), which can be derived straightforwardly (see [Mei03] for details).
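A sketch of the backward mapping (hypothetical illustration): the latent code of a new observation y is initialized by the search 4.12 over the training latent points and then refined by a few gradient steps on the error of 4.11, using a finite-difference gradient purely for brevity:

```python
import numpy as np

def nw(xq, X, Y, h=1.0):
    """f(xq) via 4.6 for a single latent point xq of shape (q,)."""
    w = np.exp(-0.5 * ((X - xq[:, None]) ** 2).sum(axis=0) / h ** 2)
    return Y @ (w / w.sum())

def backward_map(y, X, Y, h=1.0, steps=100, lr=0.05, eps=1e-5):
    """g(y) of 4.11, initialized by the search 4.12 over the training latent points."""
    F = np.stack([nw(X[:, i], X, Y, h) for i in range(X.shape[1])], axis=1)
    x = X[:, np.argmin(((F - y[:, None]) ** 2).sum(axis=0))].copy()   # initialization 4.12
    err = lambda z: np.sum((y - nw(z, X, Y, h)) ** 2)
    for _ in range(steps):                                            # refinement of 4.11
        g = np.array([(err(x + eps * e) - err(x - eps * e)) / (2 * eps) for e in np.eye(len(x))])
        x = x - lr * g                                                # step size is an assumption
    return x

# Toy setup (assumed data): latent points on a line, observations on a noisy parabola.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-2, 2, 60))[None, :]
Y = np.stack([X[0], X[0] ** 2]) + 0.05 * rng.standard_normal((2, 60))
print(backward_map(np.array([0.5, 0.3]), X, Y))
```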

    4.2.1. Optimization

    The objective function is nonlinear with respect to the latent variables and needs to be

    minimized iteratively. Since it is possible for the gradient of the objective function to be

    computed analytically, some gradient based optimization scheme may be used. For the

    partial derivatives with respect to the latent variables it holds:

    �� . � 2�

    � � � �� !$# � � !$# � �� � � �� � � � � � � (4.13)

    27

  • 4. Unsupervised Kernel Regression

    with�

    denoting the � �

    column of�

    .

    Since there are � latent vectors with � components each, it is obvious that the timecomplexity for computation of the gradient amounts to at least � .0� � � ��2 . This is in-deed the complexity class for computation of the gradient, if one pre-computes � �� � and � ��������� � � � . Computation of these terms gives rise to costs of � .0� � � 2 and � . � � � ��2 ,respectively. While this is clear for the first expression, the second one deserves special

    attention. It holds:

\frac{\partial b_{ij}}{\partial \mathbf{x}_l}
\;=\; \frac{\partial}{\partial \mathbf{x}_l}\,
\frac{K(\mathbf{x}_i - \mathbf{x}_j)}{\sum_{k=1}^{N} K(\mathbf{x}_k - \mathbf{x}_j)}
\;=\; \frac{\dfrac{\partial K(\mathbf{x}_i - \mathbf{x}_j)}{\partial \mathbf{x}_l}\,\sum_{k=1}^{N} K(\mathbf{x}_k - \mathbf{x}_j)
\;-\; K(\mathbf{x}_i - \mathbf{x}_j)\,\sum_{k=1}^{N}\dfrac{\partial K(\mathbf{x}_k - \mathbf{x}_j)}{\partial \mathbf{x}_l}}
{\Big(\sum_{k=1}^{N} K(\mathbf{x}_k - \mathbf{x}_j)\Big)^2} \qquad (4.14)

Therefore

\frac{\partial b_{ij}}{\partial \mathbf{x}_l}
\;=\; \frac{1}{c_j}\left(\frac{\partial K(\mathbf{x}_i - \mathbf{x}_j)}{\partial \mathbf{x}_l}
\;-\; b_{ij}\,\sum_{k=1}^{N}\frac{\partial K(\mathbf{x}_k - \mathbf{x}_j)}{\partial \mathbf{x}_l}\right)

with

c_j \;=\; \sum_{k=1}^{N} K(\mathbf{x}_k - \mathbf{x}_j), \qquad
\mathbf{d}_{jl} \;=\; \sum_{k=1}^{N} \frac{\partial K(\mathbf{x}_k - \mathbf{x}_j)}{\partial \mathbf{x}_l}.

The terms c_j, b_{ij}, \partial K(\mathbf{x}_i - \mathbf{x}_j)/\partial \mathbf{x}_l and \mathbf{d}_{jl} may be pre-computed, which amounts to a time complexity of O(N^2 q), O(N^2), O(N^2 q) and O(N^2 q), respectively.

    Throughout this chapter only the Gaussian kernel function will be used. In this case

    the derivative is given by

\frac{\partial K(\mathbf{x}_i - \mathbf{x}_j)}{\partial \mathbf{x}_l}
\;=\; K(\mathbf{x}_i - \mathbf{x}_j)\,(\mathbf{x}_j - \mathbf{x}_i)\,(\delta_{li} - \delta_{lj}).

To further reduce time complexity for computation of the objective function as well

as its gradient, a sparse structure of the kernel matrix might be induced by resorting to a

    kernel function with finite support, such as the Epanechnikov kernel or the differentiable

    ’quartic kernel’ [Sco92]. An exploration of this approach, however, is not within the

    scope of this thesis.
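For reference, one way to organize the computation of the analytic gradient for the Gaussian kernel is sketched below. The matrix rearrangement of 4.13/4.14 is done for this sketch and is not claimed to be the organization used in the thesis; ukr_objective from the earlier sketch is used only for the finite-difference check:

    def ukr_gradient(X, Y):
        # Analytic gradient of R(X) (eqs. 4.13/4.14) for the Gaussian kernel, in matrix form.
        q, N = X.shape
        sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
        K = np.exp(-0.5 * sq)                       # K[i, j] = K(x_i - x_j)
        c = K.sum(axis=1)                           # c[i] = sum_k K(x_i - x_k)
        P = K / c[:, None]                          # P[i, j]: weight of y_j in f(x_i; X)
        E = Y - Y @ P.T                             # columns: reconstruction errors e_i
        G = Y.T @ E                                 # G[j, i] = y_j^T e_i
        diag = np.einsum('ij,ji->i', P, G)          # sum_m P[i, m] G[m, i]
        W = (G.T - diag[:, None]) / c[:, None]
        A = W * K
        S = A + A.T
        return (2.0 / N) * (X * S.sum(axis=1) - X @ S)

    # finite-difference check of a single component against the earlier objective sketch
    rng = np.random.default_rng(1)
    Yc, Xc = rng.standard_normal((3, 10)), rng.standard_normal((2, 10))
    Xp = Xc.copy(); Xp[0, 0] += 1e-6
    print(ukr_gradient(Xc, Yc)[0, 0],
          (ukr_objective(Xp, Yc) - ukr_objective(Xc, Yc)) / 1e-6)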

    Ridge Regression

    The above mentioned requirement to restrict the model complexity can be met by adding

a regularization term S(X) to the objective function that constrains the scale of latent variable realizations in some way. This is referred to as 'ridge regression' or 'weight

    decay’ in the literature (see [HS89]). The resulting penalized objective function then


    reads

R_\lambda(\mathbf{X}) \;=\; R(\mathbf{X}) \;+\; \lambda\, S(\mathbf{X}). \qquad (4.15)

The parameter λ ≥ 0 functions as a means to control the influence of the regularization term and thereby provides a way to control the model complexity. The matrix of suitable latent variable realizations is now determined by minimizing the penalized objective function²:

\hat{\mathbf{X}} \;=\; \arg\min_{\mathbf{X}} \; R_\lambda(\mathbf{X}). \qquad (4.16)

A straightforward choice for the regularization term is one that equally restricts the Euclidean latent variable norms, which can be achieved by defining

S(\mathbf{X}) \;=\; \|\mathbf{X}\|_F^2 \;=\; \sum_{i=1}^{N} \|\mathbf{x}_i\|^2. \qquad (4.17)

² Note that for optimization by gradient descent the gradient given in 4.13 needs to be modified by adding the derivative of the respective penalty term with respect to the latent variables.

    Other regularization terms are conceivable, however. In fact, these might even be

    desirable in specific circumstances, since they could provide a facility to include some

    kind of top-down knowledge by imposing constraints upon the latent variables and their

relations. This possibility will be evaluated in more detail in section 4.2.2. In the following,

if not stated otherwise, only the variant given in 4.17 will be used.

    To deal with the presence of local minima, an optimization strategy such as the ho-

    motopy scheme described in 2.3 needs to be applied for minimization of 4.15. Here,

this practically amounts to slowly decreasing λ

    during optimization. In order to avoid

    overfitting, the error that the projection of an independent test set onto the resulting

    manifold gives rise to might be tracked. As an illustration, the approximations of the

    two-dimensional ’noisy S-manifold,’ embedded in a three-dimensional space, which are

    depicted in 4.1, have been determined this way (see Appendix A for details on the gen-

eration of the toy datasets used here and in the following). Here, in particular the importance

of a well chosen annealing strategy becomes obvious. While the two panels to the right

visualize the progression of the latent variable realizations obtained from a reasonably

    chosen annealing strategy, the two panels to the left show the effect of too fast anneal-

ing and the result of getting stuck in a local minimum, leading to a merely suboptimal

final solution. The outermost illustrations have been rescaled to accommodate visualiza-

    tion of the latent variable realizations, while the innermost panels show the same results

    in a consistent scale for each of the two progressions, making visible the enlargement

    of the latent realizations throughout the annealing progress. It is obvious that, while

    the solution to the left after 20 annealing steps spans an area that is approximately five

    times larger than the solution to the right after 350 steps, the final solution using ’slow’

    annealing clearly captures the structure of the original data set far better than the left

    one.

    A general drawback of using ridge regression and homotopy that is related to the

    necessity of a suitably ’slow’ annealing schedule is efficiency. Datasets containing a

thousand or more elements in a few hundred dimensions have proven to be problematic

to handle and therefore might call for alternative strategies. One option will be presented

in the following.
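To illustrate how ridge regularization and homotopy interact in practice, the following sketch runs plain gradient descent on the penalized objective 4.15 while geometrically annealing λ. All numerical settings (start value, annealing factor, step size, iteration counts) are arbitrary illustrative choices, not the settings used for the experiments reported here; ukr_gradient is the sketch given above:

    def ukr_fit_homotopy(Y, q=2, lam0=1.0, anneal=0.95,
                         n_outer=200, n_inner=10, lr=0.05, seed=0):
        # Gradient descent on R(X) + lam * ||X||_F^2, slowly decreasing lam (homotopy).
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, (q, Y.shape[1]))     # random init over the unit (hyper)square
        lam = lam0
        for _ in range(n_outer):
            for _ in range(n_inner):
                X -= lr * (ukr_gradient(X, Y) + 2.0 * lam * X)   # penalty gradient: 2*lam*X
            lam *= anneal                              # release the regularization gradually
        return X

In practice, as described above, the error of projecting an independent test set onto the current manifold can be tracked across the outer iterations in order to decide when to stop.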

    Constrained Optimization

    While adding a penalty term as described above can be interpreted as imposing a soft

    constraint on minimization of the objective function by including the tendency to fa-

    vor small norm solutions, it is also possible to use a strictly constrained optimization

    algorithm instead by re-defining the optimization problem as

\hat{\mathbf{X}} \;=\; \arg\min_{\mathbf{X}} \; R(\mathbf{X}) \qquad (4.18)

subject to

g(\mathbf{X}) \;\le\; \eta \qquad (4.19)

with g defining some nonlinear constraint. In analogy to 4.17 one might set

g(\mathbf{X}) \;=\; \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i\|^2 \qquad (4.20)


Figure 4.1.: The effects of different annealing strategies. The dataset shown at the top, sampled from the two-dimensional 'S-manifold' distribution with spherical Gaussian noise, has been approximated using (left) 20 annealing steps, with λ being declined geometrically after each step by a comparatively small factor (fast annealing), and (right) 350 annealing steps with a factor closer to one (slow annealing); the start value of λ was the same in both cases. The latent variables have been initialized randomly from a uniform distribution over the unit square. The small panels show the latent variable realizations after annealing steps 0, 1, 2, 3, 4 and 20 (left progression) and after steps 0, 1, 3, 6, 35 and 350 (right progression). Note the tendency of the latent space realizations to arrange spherically, provoked by using the Frobenius norm as regularization.


Figure 4.2.: The effects of different optimization constraints using the two-dimensional 'S-manifold'. The left plot shows the solution obtained from lUKR (see section 4.3). This solution was used as an initialization for oUKR using bound constraints as defined in 4.21 in one case (middle plot) and the nonlinear constraint as defined in 4.20 in another (right plot).

for example, in order to restrict the average latent variable norms. Alternatively, simple

bound constraints may be applied. To this end, 4.19 may be simplified to yield

\mathbf{L} \;\le\; \mathbf{X} \;\le\; \mathbf{U} \qquad (4.21)

with \mathbf{L} and \mathbf{U} being matrices of lower and upper bounds, respectively (the inequalities holding element-wise).

Making use of the homotopy strategy here then amounts to gradually releasing the

constraints by directly decreasing or increasing the entries of L or U, respectively, or by

increasing η. However, an important difference to ridge regression is that, if a latent matrix initialization

X_init is available, one may abstain from homotopy and derive suitable constraints from the initialization instead, for example by setting L = min(X_init) and U = max(X_init) (taken componentwise) or by setting η = g(X_init) = (1/N) ||X_init||_F^2, with g defined as above. The question

that remains, of course, is where to obtain such an initialization possessing the required

    property of being correctly scaled so that suitable constraints can be derived in some

    way. The solutions that the LLE algorithm yields, for example, are not useful here as

    they are arbitrarily scaled as described in 3. Nevertheless, there is a way to obtain such

    a suitably scaled initialization, as will be described in 4.4.
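A minimal sketch of the bound-constrained variant 4.18/4.21, assuming a suitably scaled initialization X_init is available (for instance from lUKR, section 4.3): scipy's L-BFGS-B solver is used here only as one convenient box-constrained optimizer, the bounds are derived from X_init as one of the options mentioned above, and the earlier ukr_objective and ukr_gradient sketches are reused.

    from scipy.optimize import minimize

    def ukr_fit_bounded(Y, X_init):
        # Minimize R(X) subject to element-wise bounds derived from the initialization (eq. 4.21).
        q, N = X_init.shape
        lo = X_init.min(axis=1)                     # per-latent-dimension lower bounds
        hi = X_init.max(axis=1)                     # per-latent-dimension upper bounds
        bounds = [(lo[a], hi[a]) for a in range(q) for _ in range(N)]
        fun = lambda x: ukr_objective(x.reshape(q, N), Y)
        jac = lambda x: ukr_gradient(x.reshape(q, N), Y).ravel()
        res = minimize(fun, X_init.ravel(), jac=jac, method="L-BFGS-B", bounds=bounds)
        return res.x.reshape(q, N)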


    In anticipation of the results given there, an example of the constrained (re-)optimi-

    zation of such an initialization is depicted in figure 4.2. A two-dimensional approxima-

tion of a dataset consisting of 500 elements of the noise-free (σ = 0) 'S-manifold' was determined using the lUKR variant that will be described in section 4.3. The obviously

sub-optimal solution shown to the left was then used as an initialization X_init for the

oUKR method. Constrained optimization was used in two different ways on this initial-

ization. In the first case, bound constraints were used with

L and U defined as above. In

the second, a nonlinear constraint (4.20) was used with η defined as above. The figure shows that both strategies are able to improve the sub-optimal initialization. In addi-

    tion, the influence that a constraint has on the final solution is visible. This influence is

    obviously stronger than it has been using ridge regression.

    ’Built in’ Cross Validation

    An alternative means of regularization that abandons the need to estimate any hyper-

    parameters and the associated necessity to embed the minimization of 4.15 in an em-

    bracing cross validation loop can be obtained using a cross validation mechanism ’built

    into’ the objective function. This is done by utilizing only a subset of the training data

    at every function evaluation, excluding in particular the data point to be approximated

in the current evaluation step, so that an external validation set becomes unnecessary. Us-

    ing ’leave-one-out’ cross validation, where the only vector excluded is the one to be

    approximated, gives rise to the modified objective function:

R_{cv}(\mathbf{X}) \;=\; \frac{1}{N}\sum_{i=1}^{N} \big\| \mathbf{y}_i - \mathbf{f}_{-i}(\mathbf{x}_i;\mathbf{X}) \big\|^2 \qquad (4.22)

\;=\; \frac{1}{N}\sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j \ne i} \mathbf{y}_j \, \frac{K(\mathbf{x}_i - \mathbf{x}_j)}{\sum_{k \ne i} K(\mathbf{x}_i - \mathbf{x}_k)} \Big\|^2 \qquad (4.23)

\;=\; \frac{1}{N} \big\| \mathbf{Y} - \mathbf{Y}\,\tilde{\mathbf{B}}(\mathbf{X}) \big\|_F^2 \qquad (4.24)

with

\big(\tilde{\mathbf{B}}(\mathbf{X})\big)_{ij} \;=\; \tilde{b}_{ij} \;=\; \frac{(1 - \delta_{ij})\, K(\mathbf{x}_i - \mathbf{x}_j)}{\sum_{k \ne j} K(\mathbf{x}_k - \mathbf{x}_j)}, \qquad i,j = 1,\ldots,N.

Then, an optimal embedding is given by

\hat{\mathbf{X}} \;=\; \arg\min_{\mathbf{X}} \; R_{cv}(\mathbf{X}). \qquad (4.25)

While the 'leave-one-out' strategy is normally problematic, because it is the

computationally most expensive cross validation variant, the 'built in' alternative adopted

here provides a convenient 'trick' to exploit the coupling of all datapoints

and the related O(N^2) complexity that each function evaluation of the UKR method, being a kernel based method, gives rise to anyway. Thereby, leave-one-out cross validation

    becomes possible without any additional computational cost.
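Computationally, the 'built in' criterion 4.24 differs from 4.10 only in that the diagonal of the kernel matrix is excluded before normalization, so that a data point never contributes to its own reconstruction. A sketch, mirroring the earlier ukr_objective and again purely illustrative:

    def ukr_objective_loo(X, Y):
        # Leave-one-out reconstruction error (eqs. 4.22-4.24): zero the kernel diagonal
        # before normalization, so that y_i never contributes to its own reconstruction.
        sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
        K = np.exp(-0.5 * sq)
        np.fill_diagonal(K, 0.0)
        B = K / K.sum(axis=0, keepdims=True)
        E = Y - Y @ B
        return np.sum(E ** 2) / Y.shape[1]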

    Applying homotopy is still possible using the built in cross validation variant, since

    one may still add a penalty term to 4.24 or constrain the solutions accordingly. In fact,

one might even prefer the modified objective function over 4.10 for this purpose in order

    to forgo the tracking of some test error. But alternatively, if a suitable initialization is

    available, the built in cross validation mechanism also allows for some direct optimiza-

    tion. A well chosen initialization is vital in this case, however, because of the highly

    nonlinear structure of the objective function. An illustration is depicted in figure 4.3: A

random initialization, as depicted at the top in (a), does not lead to a satisfactory result

as can be seen at the bottom in (a). A merely suboptimal solution that the LLE algorithm

    might yield, which can happen in particular in the presence of noise or because of a

    badly chosen neighborhood size (see 3.2.1), is seen at the top of (b) and provides the

    grounds for this method to find an appropriate embedding (see bottom of (b)).

    35

  • 4. Unsupervised Kernel Regression

Figure 4.3.: Illustration of the importance of a suitable initialization for the built in leave-one-out cross validation variant. The rows of the figure show the initialization (top) and the resulting solution (bottom); noise-free two-dimensional 'half-circle' dataset to the left and one-dimensional UKR approximation to the right. While UKR fails to find a suitable approximation for randomly initialized latent variables, as can be seen in (a), a merely suboptimal LLE solution provides a starting point from which a satisfactory result is obtained, as shown in (b).

    4.2.2. Experiments

    Visualization of the regression manifold

As stated above, since the oUKR method determines a model for f along with suitable latent variable realizations, the application of the regression function to new latent

space elements is straightforward. This makes it possible to visualize the learned

regression manifold (for up to three-dimensional embedding spaces) by sampling the

latent space and 'plugging' the obtained latent space elements into the learned model

for f. As shown in figure 4.4 this procedure has been used to visualize the effects that a rescaling of the latent data matrix

X has on the resulting regression manifold. It is

obvious that for a rescaling factor of zero the regression manifold degenerates to the sample mean; the results for increasing rescaling factors are shown in the subsequent plots, from left to right and from top to bottom.
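A concrete sketch of this visualization procedure for a two-dimensional latent space; the grid resolution and ranges are arbitrary choices, and ukr_f is the sketch given earlier:

    def ukr_manifold_grid(X, Y, resolution=30):
        # Sample latent space on a regular grid spanning the learned realizations and map
        # every grid point through the learned regression function f (eq. 4.6).
        g1 = np.linspace(X[0].min(), X[0].max(), resolution)
        g2 = np.linspace(X[1].min(), X[1].max(), resolution)
        grid = np.array([[a, b] for a in g1 for b in g2]).T          # (2, resolution**2)
        return np.column_stack([ukr_f(grid[:, i], X, Y) for i in range(grid.shape[1])])

Plotting the returned points (a d x resolution^2 matrix) for a three-dimensional data space yields surfaces like those in figure 4.4; rescaling X before the call reproduces the rescaling effect discussed above.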


Figure 4.4.: Visualization of the different regression manifolds resulting from rescaling the latent data matrix. For the one-dimensional 'noisy S-manifold' dataset shown in the bottom left corner a low-dimensional representation was obtained using UKR with homotopy. The plots show, in successive order from left to right and from top to bottom, the visualization of the regression manifold obtained from sampling latent space along a regular grid after rescaling the latent data matrix with factors 0.0, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 8.0, 15.0, 30.0, and three further, larger factors.


Figure 4.5.: The built in cross validation error R_cv (depicted in blue) compared to the error E_test on an independent test set (depicted in red), plotted against the number of iterations on a logarithmic error scale. The left column shows the error progressions averaged over 50 runs; the middle and right columns show, out of the 50 runs, only those with the largest deviations between the two error progressions according to the L2 norm and to the L∞ norm, respectively. In the first and third row the progressions for the one-dimensional 'noisy S-manifold' without and with spherical Gaussian noise, respectively, are depicted. The second and fourth row show the respective progressions resulting from a faster annealing schedule. The two bottom rows show the error progressions for the two-dimensional 'S-manifold', again without and with noise, respectively.


Figure 4.6.: The effect that a rescaling of the latent data matrix obtained from using the built in leave-one-out cross validation variant of oUKR has on the error E_test that projection of an independent test set gives rise to, plotted against the rescaling factor s on a logarithmic error scale. The four plots, in succession from left to right, correspond to the four topmost settings described in the caption of figure 4.5 (from top to bottom).

    Built In Cross Validation vs. Homotopy

    As stated, the built in cross validation error criterion is a promising computational short-

    cut for solving the problem of model selection. In order to provide empirical evidence

    that this criterion is of real practical value, an investigation of the distribution of the

resulting error (R_cv) as compared to the average error on an independent test set (containing elements y_1^{test}, ..., y_{N_{test}}^{test}),

E_{test} \;=\; \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big\| \mathbf{y}_i^{test} - \mathbf{f}\big(g(\mathbf{y}_i^{test})\big) \big\|^2,

    has been undertaken. To this end, the UKR model has been trained on the one-dimen-

    sional and on the two-dimensional ’S-manifold’ datasets, comprising 50 elements each,

in one setting without noise and in another with spherical Gaussian noise. The homotopy scheme has been applied using the penalized objective function (4.15)

with regularization parameter λ

starting from the same value in all cases and being annealed with a moderate factor in one setting and, for the one-dimensional case, with a more 'radical' (smaller) factor in another.


The error on a test set has been tracked throughout, and its average over 50 runs for each setting is depicted together with the

    cross validation error in 4.5 (first column). Overall, a strong correlation between these

    quantities can be noticed in all settings. They even converge as the number of training

steps increases. In addition, the largest deviations along the 50 runs between the two error progressions according to the L2 norm, as well as to the L∞ norm, are depicted

in columns 2 and 3, respectively. They show significant differences only at the outset

    of the learning process, revealing the actual reliability of the cross validation criterion.

    Another finding that becomes obvious from the plots is that a rise of the test error

    hardly ever occurs. Even for the settings in which the rather ’radical’ annealing factor

has been applied, the test error decreases monotonically. This gives rise to the assumption that a change in

λ leads to a slight deformation of the error surface rather than

    to a complete restructuring of it, so that the homotopy method causes a local optimizer

    to track a once reached local minimum during the overall optimization process.

    The built in cross validation error, on the other hand, as can be seen in the plots

    for the two-dimensional datasets, does happen to rise. The plots in the last row show

    that this can even be the case when the test error is obviously still falling. From this

    observation one may draw the conclusion that an application of the built in cross valida-

tion criterion will give rise to underfitting rather than to overfitting. In order to collect

    more evidence for this conclusion, a further experiment has been conducted. The final

    solutions for the one-dimensional datasets have been used as an initialization for an it-

erative minimization of R_cv. Then the effect that a rescaling of the resulting latent data matrix with some rescaling factor s has on the test error has been measured. If training

by minimization of R_cv leads to an over-regularization, rescaling with a factor s greater than one should result in a decreasing test error. Figure 4.6 shows that this is

    the case, indeed. While downsizing the latent variable scale leads to an increasing test

    error, enlarging leads to a decreasing error. The optimal rescaling factor with respect to

the test error is greater than one in all settings. This indicates that built in cross validation indeed tends to a slight over-regularization. Since the reason to deploy this error

criterion is to circumvent overfitting, and since the increase of the test error that this

over-regularization gives rise to is small, as can be seen in the plots, it can be concluded


    that the built in leave-one-out criterion is indeed a practical alternative to other, external,

    model selection strategies.
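The rescaling experiment just described can be sketched as follows; this is an illustrative evaluation loop with an arbitrary grid of factors, reusing the ukr_project and ukr_f sketches given earlier:

    def test_error_under_rescaling(X, Y, Y_test, factors=(0.5, 1.0, 2.0, 4.0, 8.0)):
        # For each rescaling factor s: scale the learned latent matrix, project the test set
        # onto the resulting manifold and record the average reconstruction error E_test.
        errors = []
        for s in factors:
            Xs = s * X
            per_point = [np.sum((Y_test[:, i]
                                 - ukr_f(ukr_project(Y_test[:, i], Xs, Y), Xs, Y)) ** 2)
                         for i in range(Y_test.shape[1])]
            errors.append(np.mean(per_point))
        return errors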

    Modeling Top-down influences: ’Implicit Clustering’

    In 4.2.1 the effects that different optimization constraints and regularization terms have

on the resulting latent space realization are mentioned, suggesting the use of these in order

    to include some kind of top-down knowledge into this unsupervised method. In fact, the

    introduction of top-down influences into methods from UL is an important and emerging

    topic. In particular research in cognitive science, especially cognitive modeling, is con-

    cerned with questions regarding the integration of bottom-up and top-down processing.

    The importance of this is increasingly agreed upon and the occurrence of this in human

brain functioning is common ground (see [GSSSK00] or [SKK], for example). In the

    context of Unsupervised Learning, constraining of the learning process by introduction

    of some kind of knowledge provides a step towards this direction.

    Although an evaluation of the potential of the UKR method in this context goes be-

    yond the scope of this thesis, an example of the facility to include top-down knowledge

    through the use of appropriately tailored optimization constraints shall be given here:

    Figure 4.7 shows a dataset consisting of two ’half circles.’ Fitting a model to such a

    disrupted manifold is not trivially possible for the known algorithms for dimensionality

reduction. [RS00], e.g., state that it is an open question how to deal with non-uniformly

    sampled data. However, including knowledge of the partitioning structure through the

optimization constraints in the learning procedure might simplify the task. This has

been tried by using constrained optimization (see 4.18), with a nonlinear optimization

constraint

g(\mathbf{X}) \;\le\; \eta \qquad (4.26)

with g chosen such that it forces the latent variable values to assemble in two groups. The parameter η has been initialized to a small value and has been increased by a factor of two after every update of the latent variable realizations. The second plot from the right shows the effect of generalizing g to new observable space vectors from the same distribution, and the rightmost plot the


    Figure 4.7.: ’Implicit Clustering’ (From left to right): Partitioned dataset; latent spacerealization; projection of a test data set onto the resulting manifold; visual-ization of the regression manifold obtained from sampling latent space.

effect of generalizing f to new latent space elements, showing in particular the effect of sampling along the 'gap,' leading to an approximately linear connection of two adjacent

    ends of the two ’half-circles.’ In forcing the latent vectors to assemble in two groups the

    procedure may be interpreted as an ’implicit’ latent space clustering.

    4.3. Latent Space Error Minimization

By reversing the roles of the latent and the observable variables and regarding the former

as outputs and the latter as regressors, the learning problem can also be defined as

finding a suitable latent variable realization for every given data vector, by deploying the

Nadaraya Watson estimator in this case in order to directly compute the regression of x

onto y [Mei03]:

\mathbf{x} \;=\; g(\mathbf{y}) \;=\; \sum_{i=1}^{N} \mathbf{x}_i \, \frac{K(\mathbf{y} - \mathbf{y}_i)}{\sum_{j=1}^{N} K(\mathbf{y} - \mathbf{y}_j)}. \qquad (4.27)

    Measuring the error in latent data space in this case leads to the objective function

R_{lat}(\mathbf{X}) \;=\; \frac{1}{N} \sum_{i=1}^{N} \Big\| \mathbf{x}_i - \sum_{j=1}^{N} \mathbf{x}_j \, \frac{K(\mathbf{y}_i - \mathbf{y}_j)}{\sum_{k=1}^{N} K(\mathbf{y}_i - \mathbf{y}_k)} \Big\|^2 \qquad (4.28)

\;=\; \frac{1}{N} \big\| \mathbf{X} - \mathbf{X}\,\mathbf{C} \big\|_F^2 \qquad (4.29)

with

\big(\mathbf{C}\big)_{ij} \;=\; \frac{K(\mathbf{y}_i - \mathbf{y}_j)}{\sum_{k=1}^{N} K(\mathbf{y}_k - \mathbf{y}_j)}, \qquad i,j = 1,\ldots,N, \qquad (4.30)

which defines a quadratic form in the latent variables and can therefore be solved by a

    spectral decomposition, as detailed below. The resulting UKR variant will be denoted

    lUKR in the following.
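The spectral solution itself is only detailed further below. Purely as an illustration, one standard way to solve such a quadratic problem is sketched here, assuming a unit-covariance normalization of the latent variables to exclude the trivial solution X = 0 (the same device used in LLE); whether this matches the normalization adopted in the thesis is not implied, and the kernel bandwidth h on data space is treated as a free parameter:

    def lukr_embed(Y, q=2, h=1.0):
        # lUKR sketch: C is computed once from the observed data Y (eq. 4.30); minimizing
        # ||X - X C||_F^2 under a unit-covariance constraint amounts to an eigendecomposition
        # of M = (I - C)(I - C)^T, analogously to LLE.
        N = Y.shape[1]
        sq = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
        K = np.exp(-0.5 * sq / h ** 2)
        C = K / K.sum(axis=0, keepdims=True)         # columns sum to one
        M = (np.eye(N) - C) @ (np.eye(N) - C).T
        evals, evecs = np.linalg.eigh(M)             # eigenvalues in ascending order
        # drop the constant eigenvector belonging to the (near-)zero eigenvalue
        return np.sqrt(N) * evecs[:, 1:q + 1].T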

Note that while the UKR variant described in the previous section is based upon the

simultaneous – and in fact associated – determination of latent variable realizations and

a model for f, here the determination of the latent data matrix is connected to, and therefore leads to, an implicit estimate of the backward mapping g instead. This has important implications regarding generalization: since no model for f is learned, generalization of the regression function to new latent space elements is not available. As

    a consequence, generalization of the backward mapping to new observable space ele-

    ments by projection as in 4.11 is not possible, either. The implicitly learned model for

    the backward mapping therefore remains the only alternative. However, as indicated in

    2.4, this variant can be problemati