Journal of Machine Learning Research VV (YYYY) PP-PP    Submitted 4/09; Revised 12/09; Published MM/YY

Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

Jarkko Venna        JARKKO.VENNA@TKK.FI
Jaakko Peltonen     JAAKKO.PELTONEN@TKK.FI
Kristian Nybo       KRISTIAN.NYBO@TKK.FI
Helena Aidos        HELENA.AIDOS@TKK.FI
Samuel Kaski        SAMUEL.KASKI@TKK.FI

Aalto University School of Science and Technology
Department of Information and Computer Science
P.O. Box 15400, FI-00076 Aalto, Finland

Editor: Yoshua Bengio

Abstract

Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user needs to give the relative cost of missing similar points vs. retrieving dissimilar points, after which the total cost can be measured. We then introduce a new method NeRV (neighbor retrieval visualizer) which produces an optimal visualization by minimizing the cost. We further derive a variant for supervised visualization; class information is taken rigorously into account when computing the similarity relationships. We show empirically that the unsupervised version outperforms existing unsupervised dimensionality reduction methods in the visualization task, and the supervised version outperforms existing supervised methods.

Keywords: information retrieval, manifold learning, multidimensional scaling, nonlinear dimensionality reduction, visualization

1. Introduction

Visualization of high-dimensional data sets is one of the traditional applications of nonlinear dimensionality reduction methods. In high-dimensional data, such as experimental data where each dimension corresponds to a different measured variable, dependencies between different dimensions often restrict the data points to a manifold whose dimensionality is much lower than the dimensionality of the data space. Many methods are designed for manifold learning, that is, to find and unfold the lower-dimensional manifold. There has been a research boom in manifold learning since 2000, and there now exist many methods that are known to unfold at least certain kinds of manifolds successfully. Some of the successful methods include isomap (Tenenbaum et al., 2000), locally linear embedding (LLE; Roweis and Saul, 2000), Laplacian eigenmap (LE; Belkin and Niyogi, 2002a), and maximum variance unfolding (MVU; Weinberger and Saul, 2006).

© YYYY Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos and Samuel Kaski.
It has turned out that the manifold learning methods are not necessarily good for information
visualization. Several methods had severe difficulties when the output dimensionality was fixed
to two for visualization purposes (Venna and Kaski, 2007a). This is natural since they have been
designed to find a manifold, not to compress it into a lower dimensionality.
In this paper we discuss the specific visualization task of projecting the data to points on a two-
dimensional display. Note that this task is different from manifold learning whenever the inherent
dimensionality of the manifold is higher than two and the manifold cannot be represented perfectly
in two dimensions. As the representation is necessarily imperfect, defining and using a measure
of goodness of the representation is crucial. However, in spite of the large amount of research
into methods for extracting manifolds, there has been very little discussion on what a good two-
dimensional representation should be like and how the goodness should be measured. In a recent
survey of 69 papers on dimensionality reduction from years 2000–2006 (Venna, 2007) it was found
that 28 (≈ 40%) of the papers only presented visualizations of toy or real data sets as a proof of
quality. Most of the more quantitative approaches were based on one of two strategies. The first is
to measure preservation of all pairwise distances or the order of all pairwise distances. Examples of
this approach include the multidimensional scaling (MDS)-type cost functions like Sammon’s cost
and Stress, methods that relate the distances in the input space to the output space, and various cor-
relation measures that assess the preservation of all pairwise distances. The other common quality
assurance strategy is to classify the data in the low-dimensional space and report the classification
performance.
The problem with using the above approaches to measure visualization performance is that their
connection to visualization is unclear and indirect at best. Unless the purpose of the visualization
is to help with a classification task, it is not obvious what the classification accuracy of a projection
reveals about its goodness as a visualization. Preservation of pairwise distances, the other widely
adopted principle, is a well-defined goal; it is a reasonable goal if the analyst wishes to use the
visualization to assess distances between selected pairs of data points, but we argue that this is not
the typical way an analyst would use a visualization, at least in the early stages of analysis when
no hypothesis about the data has yet been formed. Most approaches, including ours, are at heart
based on pairwise distances, but we take into account the context of each pairwise distance, yielding
a more natural way of evaluating visualization performance; the resulting method has a natural and
rigorous interpretation which we discuss below and in the following sections.
In this paper we make rigorous the specific information visualization task of projecting a high-
dimensional data set onto a two-dimensional plane for visualizing similarity relationships. This task
has a very natural mapping into an information retrieval task as will be discussed in Section 2. The
conceptualization as information retrieval explicitly reveals the necessary tradeoff between preci-
sion and recall, of making true similarities visible and avoiding false similarities. The tradeoff can
be quantified exactly once costs have been assigned to each of the two error types, and once the total
cost has been defined, it can be optimized as will be discussed in Section 3. We then show that the
resulting method, called NeRV for neighbor retrieval visualizer, can be further extended to super-
vised visualization, and that both the unsupervised and supervised methods empirically outperform
their alternatives. NeRV includes the previous method called stochastic neighbor embedding (SNE;
Hinton and Roweis, 2002) as a special case where the tradeoff is set so that only recall is maximized;
thus we give a new information retrieval interpretation to SNE.
This paper extends our earlier conference paper (Venna and Kaski, 2007b) which introduced
the ideas in a preliminary form with preliminary experiments. The current paper gives the full
justification and comprehensive experiments, and also introduces the supervised version of NeRV.
2. Visualization as Information Retrieval
In this section we define formally the specific visualization task; this is a novel formalization of
visualization as an information retrieval task. We first give the definition for a simplified setup in
Section 2.1, and then generalize it in Section 2.2.
2.1 Similarity Visualization with Binary Neighborhood Relationships
In the following we first define the specific visualization task and a cost function for it; we then
show that the cost function is related to the traditional information retrieval measures precision and
recall.
2.1.1 TASK DEFINITION: SIMILARITY VISUALIZATION
Let {x_i}_{i=1}^N be a set of input data samples, and let each sample i have an input neighborhood P_i,
consisting of samples that are close to i. Typically, Pi might consist of all input samples (other than
i itself) that fall within some radius of i, or alternatively Pi might consist of a fixed number of input
samples most similar to i. In either case, let ri be the size of the set Pi.
The goal of similarity visualization is to produce low-dimensional output coordinates {y_i}_{i=1}^N
for the input data, usable in visual information retrieval. Given any sample i as a query, in visual
information retrieval samples are retrieved based on the visualization; the retrieved result is a set
Qi of samples that are close to yi in the visualization; we call Qi the output neighborhood. The Qi
typically consists of all input samples j (other than i itself) whose visualization coordinates y j are
within some radius of yi in the visualization, or alternatively Qi might consist of a fixed number
of input samples whose output coordinates are nearest to yi. In either case, let ki be the number
of points in the set Qi. The number of points in Qi may be different from the number of points in
Pi; for example, if many points have been placed close to yi in the visualization, then retrieving all
points within a certain radius of yi might yield too many retrieved points, compared to how many
are neighbors in the input space. Figure 1 illustrates the setup.
The remaining question is what is a good visualization, that is, what is the cost function. Denote
the number of samples that are in both Qi and Pi by NTP,i (true positives), samples that are in Qi but
not in Pi by NFP,i (false positives), and samples that are in Pi but not Qi by NMISS,i (misses). Assume
the user has assigned a cost C_FP for each false positive and C_MISS for each miss. The total cost
E_i for query i then is

E_i = N_FP,i C_FP + N_MISS,i C_MISS . (1)
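As an illustrative sketch (the set-valued helper below is ours, not part of the paper's notation), the cost of Equation (1) for a single query can be computed directly from the two neighborhoods:

```python
def query_cost(P_i, Q_i, C_FP=1.0, C_MISS=1.0):
    """Cost of Equation (1) for one query i: P_i is the input neighborhood,
    Q_i the retrieved output neighborhood (both as collections of indices)."""
    P_i, Q_i = set(P_i), set(Q_i)
    n_fp = len(Q_i - P_i)     # retrieved but not relevant (false positives)
    n_miss = len(P_i - Q_i)   # relevant but not retrieved (misses)
    return n_fp * C_FP + n_miss * C_MISS
```

For example, `query_cost({1, 2, 3}, {2, 3, 4})` counts one false positive (sample 4) and one miss (sample 1).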
2.1.2 RELATIONSHIP TO PRECISION AND RECALL
The cost function of similarity visualization (1) bears a close relationship to the traditional measures
of information retrieval, precision and recall. If we allow C_MISS to be a function of the total number
of relevant points, more specifically C_MISS(r_i) = C'_MISS / r_i, and take the cost per retrieved point by
Figure 1: Diagram of the types of errors in visualization. For a query point i, the figure shows the input-space neighborhood P_i of x_i and the output-space (visualization) neighborhood Q_i of y_i; points retrieved in Q_i that are not in P_i are false positives, and points of P_i that fall outside Q_i are misses.
dividing by k_i, the total cost becomes

E(k_i, r_i) = (1/k_i) E_i = (1/k_i) (N_FP,i C_FP + N_MISS,i C_MISS(r_i))
            = C_FP (N_FP,i / k_i) + (C'_MISS / k_i) (N_MISS,i / r_i)
            = C_FP (1 − precision(i)) + (C'_MISS / k_i) (1 − recall(i)) .

The traditional definition of precision for a single query is

precision(i) = N_TP,i / k_i = 1 − N_FP,i / k_i ,

and recall is

recall(i) = N_TP,i / r_i = 1 − N_MISS,i / r_i .
Hence, fixing the costs CFP and CMISS and minimizing (1) corresponds to maximizing a specific
weighted combination of precision and recall.
Finally, to assess the performance of the full visualization, the cost needs to be averaged over all
samples (queries), which yields the mean precision and recall of the visualization.
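A minimal sketch of these averaged measures (assuming vector data, Euclidean distances, and nearest-neighbor retrieval; the function name is ours):

```python
import numpy as np

def mean_precision_recall(X, Y, r, k):
    """Mean precision and recall over all queries: the relevant set P_i is
    the r nearest input-space neighbors of x_i, and the retrieved set Q_i is
    the k nearest points to y_i on the display. Varying k traces out a
    precision-recall curve."""
    n = len(X)
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    np.fill_diagonal(dX, np.inf)   # a query never retrieves itself
    np.fill_diagonal(dY, np.inf)
    precisions, recalls = [], []
    for i in range(n):
        P = set(np.argsort(dX[i])[:r])   # relevant input neighbors
        Q = set(np.argsort(dY[i])[:k])   # retrieved output neighbors
        tp = len(P & Q)
        precisions.append(tp / k)
        recalls.append(tp / r)
    return float(np.mean(precisions)), float(np.mean(recalls))
```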
2.1.3 DISCUSSION
Given a high-dimensional data set, it is generally not possible to show all the similarity relation-
ships within the data on a low-dimensional display; therefore, all linear or nonlinear dimensionality
reduction methods need to make a tradeoff about which kinds of similarity relationships they aim
to show on the display. Equation (1) fixes the tradeoff given the costs of the two kinds of errors.
Figure 2 illustrates this tradeoff (computed with methods introduced in Section 3) with a toy ex-
ample where a three-dimensional sphere surface is visualized in two dimensions. If we take some
query point in the visualization and retrieve a set of points close-by in the visualization, in display
A such retrieval yields few false positives but many misses, whereas in display B the retrieval yields
few misses but many false positives. The tradeoff can also be seen in the (mean) precision-recall
curves for the two visualizations, where the number of retrieved points is varied to yield the curve.
Visualization A reaches higher values of precision, but the precision drops well before high recall
is reached. Visualization B has lower precision at the left end of the curve, but precision does not
drop as much even when high recall is reached.
Note that in order to quantify the tradeoff, both precision and recall need to be used. This
requires a rich enough retrieval model, in the sense that the number of retrieved points can be
different from the number of relevant points, so that precision and recall get different values. It is
well-known in information retrieval that if the numbers of relevant and retrieved items (here points)
are equal, precision and recall become equal. The recent “local continuity” criterion (Equation
9 in Chen and Buja, 2009) is simply precision/recall under this constraint; we thus give a novel
information retrieval interpretation of it as a side result. Such a criterion is useful but it gives
only a limited view of the quality of visualizations, because it corresponds to a limited retrieval
model and cannot fully quantify the precision-recall tradeoff. In this paper we will use fixed-radius
neighborhoods (defined more precisely in Section 2.2) in the visualizations, which naturally yields
differing numbers of retrieved and relevant points.
The simple visualization setup presented in this section is a novel formulation of visualization
and useful as a clearly defined starting point. However, for practical use it has a shortcoming: the
overly simple binary fixed-size neighborhoods do not take into account grades of relevance: the
cost function does not penalize violating the original similarity ordering of neighbor samples, and
it penalizes all neighborhood violations with the same cost. Next we will introduce
a more practical visualization setup.
2.2 Similarity Visualization with Continuous Neighborhood Relationships
We generalize the simple binary neighborhood case by defining probabilistic neighborhoods both in
the (i) input and (ii) output spaces, and (iii) replacing the binary precision and recall measures with
probabilistic ones. It will finally be shown that for binary neighborhoods, interpreted as a constant
high probability of being a neighbor within the neighborhood set and a constant low probability
elsewhere, the measures reduce to the standard precision and recall.
2.2.1 PROBABILISTIC MODEL OF RETRIEVAL
We start by defining the neighborhood in the output space, and do that by defining a probability
distribution over the neighbor points. Such a distribution is interpretable as a model about how the
user does the retrieval given the visualization display.
Given the location of the query point on the display, yi, suppose that the user selects one point
at a time for inspection. Denote by q j|i the probability that the user chooses y j. If we can define
such probabilities, they will define a probabilistic model of retrieval for the neighbors of yi.
The form of q_{j|i} can be defined by a few axiomatic choices and a few arbitrary ones. Since the
q_{j|i} are a probability distribution over j for each i, they must be nonnegative and sum to one over
j; therefore we can represent them as q_{j|i} = exp(−f_{i,j}) / Σ_{k≠i} exp(−f_{i,k}), where f_{i,j} ∈ R. The f_{i,j}
should be an increasing function of the distance (dissimilarity) between y_i and y_j; we further assume
that f_{i,j} depends only on y_i and y_j and not on the other points y_k. It remains to choose the form of
Figure 2: Demonstration of the tradeoff between false positives and misses. Top left: A three-
dimensional data set sampled from the surface of a sphere; only the front hemisphere
is shown for clarity. The glyph shapes (size, elongation, and angle) show the three-
dimensional coordinates of each point; the colors in the online version show the same
information. Bottom: Two embeddings of the data set. In the embedding A, the sphere
has been cut open and folded out. This embedding eliminates false positives, but there
are some misses because points on different sides of the tear end up far away from each
other. In contrast, the embedding B minimizes the number of misses by simply squashing
the sphere flat; this results in a large number of false positives because points on opposite
sides of the sphere are mapped close to each other. Top right: mean precision-mean recall
curves with input neighborhood size r= 75, as a function of the output neighborhood size
k, for the two projections. The embedding A has better precision (yielding higher values
at the left end of the curve) whereas the embedding B has better recall (yielding higher
values at the right end of the curve).
f_{i,j}. In general there should not be any reason to favor any particular neighbor point, and hence the
form should not depend on j. It could depend on i, however; we assume it has the simple quadratic
form f_{i,j} = ‖y_i − y_j‖² / σ_i², where ‖y_i − y_j‖ is the Euclidean distance and the positive multiplier
1/σ_i² allows the function to grow at an individual rate for each i. This yields the definition

q_{j|i} = exp(−‖y_i − y_j‖² / σ_i²) / Σ_{k≠i} exp(−‖y_i − y_k‖² / σ_i²) . (2)
2.2.2 PROBABILISTIC MODEL OF RELEVANCE
We extend the simple binary neighborhoods of input data samples to probabilistic neighborhoods
as follows. Suppose that if the user was choosing the neighbors of a query point i in the original
data space, she would choose point j with probability p j|i. The p j|i define a probabilistic model of
relevance for the original data, and are equivalent to a neighborhood around i: the higher the chance
of choosing this neighbor, the larger its relevance to i.
We define the probability p_{j|i} analogously to q_{j|i}, as

p_{j|i} = exp(−d(x_i, x_j)² / σ_i²) / Σ_{k≠i} exp(−d(x_i, x_k)² / σ_i²) , (3)
where d(·, ·) is a suitable difference measure in the original data, and xi refers to the point in the
original data that is represented by yi in the visualization. Some data sets may provide the values of
d(·, ·) directly; otherwise the analyst can choose a difference measure suitable for the data feature
vectors. Later in this paper we will use both the simple Euclidean distance and a more complicated
distance measure that incorporates additional information about the data.
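Both neighborhood distributions (2) and (3) are softmax functions of squared distances and can be computed for all points at once; a minimal sketch (function name ours; D2 holds precomputed squared distances or squared dissimilarities d(·,·)²):

```python
import numpy as np

def neighbor_probs(D2, sigma):
    """Row i holds the neighborhood distribution of point i, as in
    Equations (2) and (3): exp(-D2[i, j] / sigma[i]**2), normalized
    over j != i."""
    W = np.exp(-D2 / (sigma[:, None] ** 2))
    np.fill_diagonal(W, 0.0)   # a point is never its own neighbor
    return W / W.sum(axis=1, keepdims=True)
```

Passing input-space squared dissimilarities gives the p_{j|i}; passing squared distances between the display coordinates y_i gives the q_{j|i}.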
Given known values of d(·, ·), the above definition of the neighborhood p j|i can be motivated by
the same arguments as q j|i. That is, the given form of p j|i is a good choice if no other information
about the original neighborhoods is available. Other choices are possible too; in particular, if the
data directly includes neighbor probabilities, they can simply be used as the p j|i. Likewise, if more
accurate models of user behavior are available, they can be plugged in place of q j|i. The forms of
p j|i and q j|i need not be the same.
For each point i, the scaling parameter σi controls how quickly the probabilities p j|i fall off
with distance. These parameters could be fixed by prior knowledge, but without such knowledge it
is reasonable to set the σi by specifying how much flexibility there should be about the choice of
neighbors. That is, we set σ_i to the value that makes the entropy of the p_{·|i} distribution equal to log k,
where k is a rough upper limit for the number of relevant neighbors, set by the user. We use the
same relative scale σi both in the input and output spaces (Equations 2 and 3).
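The entropy condition fixes each σ_i by a one-dimensional search; a sketch under our assumptions (bisection, natural logarithm, names ours):

```python
import numpy as np

def calibrate_sigma(d2_row, target_entropy, n_iter=60):
    """Bisect sigma_i so that the entropy of p_{.|i} matches target_entropy
    (= log k for an effective number of neighbors k). d2_row holds the
    squared distances from point i to the other points."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d2_row / sigma ** 2)
        p /= p.sum()
        H = -np.sum(p * np.log(np.maximum(p, 1e-12)))
        if H > target_entropy:   # distribution too flat: shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```

Entropy grows monotonically with σ_i (from 0 toward the logarithm of the number of candidates), so bisection converges whenever log k is attainable.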
2.2.3 COST FUNCTIONS
The remaining task is to measure how well the retrieval done in the output space, given the visual-
ization, matches the true relevances defined in the input space. Both were above defined in terms of
distributions, and a natural candidate for the measure is the Kullback-Leibler divergence, defined as
D(p_i, q_i) = Σ_{j≠i} p_{j|i} log ( p_{j|i} / q_{j|i} ) ,
where pi and qi are the neighbor distributions for a particular point i, in the input space and in the
visualization respectively. For the particular probability distributions defined above the Kullback-
Leibler divergence turns out to be intimately related to precision and recall. Specifically, for any
query i, the Kullback-Leibler divergence D(pi,qi) is a generalization of recall, and D(qi, pi) is
a generalization of precision; for simple “binary” neighborhood definitions, the Kullback-Leibler
divergences and the precision-recall measures become equivalent. The proof is in Appendix A.
We call D(qi, pi) smoothed precision and D(pi,qi) smoothed recall. To evaluate a complete
visualization rather than a single query, we define aggregate measures in the standard fashion: mean
smoothed precision is defined as E_i[D(q_i, p_i)] and mean smoothed recall as E_i[D(p_i, q_i)], where E
denotes expectation and the means are taken over queries (data points i).
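Given the two probability matrices, the aggregate measures are averages of row-wise Kullback-Leibler divergences; a sketch (names ours, with a small epsilon for numerical safety):

```python
import numpy as np

def mean_smoothed_losses(P, Q, eps=1e-12):
    """Mean smoothed precision loss E_i[D(q_i, p_i)] and mean smoothed
    recall loss E_i[D(p_i, q_i)], for row-stochastic neighborhood matrices
    P (input space) and Q (visualization)."""
    P = np.maximum(P, eps)
    Q = np.maximum(Q, eps)
    precision_loss = float(np.mean(np.sum(Q * np.log(Q / P), axis=1)))
    recall_loss = float(np.mean(np.sum(P * np.log(P / Q), axis=1)))
    return precision_loss, recall_loss
```

Both quantities are losses: lower is better, and both vanish exactly when the retrieval distributions match the relevance distributions.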
Mean smoothed precision and recall are analogous to mean precision and recall in that we
cannot in general reach the optimum of both simultaneously. We return to Figure 2 which illustrates
the tradeoff for nonlinear projections of a three-dimensional sphere surface. The subfigure A was
created by maximizing mean smoothed precision; the sphere has been cut open and folded out,
which minimizes the number of false positives but also incurs some misses because some points
located on opposite edges of the point cloud were originally close to each other on the sphere. The
subfigure B was created by maximizing mean smoothed recall; the sphere is squashed flat, which
minimizes the number of misses, as all the points that were close to each other in the original data
are close to each other in the visualization. However, there are then a large number of false positives
because opposite sides of the sphere have been mapped on top of each other, so that many points
that appear close to each other in the visualization are actually originally far away from each other.
2.2.4 EASIER-TO-INTERPRET ALTERNATIVE GOODNESS MEASURES
Mean smoothed precision and recall are rigorous and well-motivated measures of visualization per-
formance, but they have one practical shortcoming for human analysts: the errors have no upper
bound, and the scale will tend to depend on the data set. The measures are very useful for compar-
ing several visualizations of the same data, and will turn out to be useful as optimization criteria,
but we would additionally like to have measures where the plain numbers are easily interpretable.
We address this by introducing mean rank-based smoothed precision and recall: simply replace the
distances in the definitions of p j|i and q j|i with ranks, so that the probability for the nearest neighbor
uses a distance of 1, the probability for the second nearest neighbor a distance of 2, and so on. This
imposes an upper bound on the error because the worst case scenario is that the ranks in the data set
are reversed in the visualization. Dividing the errors by their upper bounds gives us measures that lie
in the interval [0,1] regardless of the data and are thus much easier to interpret. The downside is that
substituting ranks for distances makes the measures disregard much of the neighborhood structure in
the data, so we suggest using mean rank-based smoothed precision and recall as easier-to-interpret,
but less discriminating complements to, rather than replacements of, mean smoothed precision and
recall.
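The rank substitution itself is simple; a sketch (name ours) that converts a distance matrix to neighbor ranks, which can then be plugged into the definitions of p_{j|i} and q_{j|i} in place of the distances:

```python
import numpy as np

def rank_matrix(D):
    """Replace each row of a distance matrix by neighbor ranks: the nearest
    neighbor of point i gets rank 1, the second nearest rank 2, and so on
    (the diagonal is excluded and left at 0)."""
    n = D.shape[0]
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)        # exclude the point itself
    order = np.argsort(D, axis=1)      # neighbors from nearest to farthest
    R = np.zeros_like(D)
    rows = np.arange(n)[:, None]
    R[rows, order[:, :n - 1]] = np.arange(1, n)[None, :]
    return R
```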
3. Neighborhood Retrieval Visualizer (NeRV)
In Section 2 we defined similarity visualization as an information retrieval task. The quality of
a visualization can be measured by the two loss functions, mean smoothed precision and recall.
These measures generalize the straightforward precision and recall measures to non-binary neigh-
borhoods. They have the further advantage of being continuous and differentiable functions of the
output visualization coordinates. It is then easy to use the measures as optimization criteria for a
visualization method. We now introduce a visualization algorithm that optimizes visual information
retrieval performance. We call the algorithm the neighborhood retrieval visualizer (NeRV).
As demonstrated in Figure 2, precision and recall cannot in general be maximized simultaneously,
and the user has to choose which loss function (mean smoothed precision or recall) is more
important, by assigning a cost for misses and a cost for false positives. Once these costs have been
assigned, the visualization task is simply to minimize the total cost. In practice the relative cost of
false positives to misses is given as a parameter λ. The NeRV cost function then becomes

E_NeRV = λ E_i[D(p_i, q_i)] + (1 − λ) E_i[D(q_i, p_i)]
       ∝ λ Σ_i Σ_{j≠i} p_{j|i} log ( p_{j|i} / q_{j|i} ) + (1 − λ) Σ_i Σ_{j≠i} q_{j|i} log ( q_{j|i} / p_{j|i} ) , (4)

where, for example, setting λ to 0.1 indicates that the user considers an error in precision
(1 − 0.1)/0.1 = 9 times as expensive as a similar error in recall.
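A sketch of the cost (4) as a function of the output coordinates, under our assumptions (Euclidean output distances, a given row-stochastic input neighborhood matrix P, and per-point scales σ; names ours):

```python
import numpy as np

def nerv_cost(Y, P, lam, sigma, eps=1e-12):
    """NeRV cost of Equation (4) for output coordinates Y (n x 2), input
    neighborhood matrix P, tradeoff lam, and scales sigma. Setting lam = 1
    recovers the SNE cost function."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (sigma[:, None] ** 2))
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum(axis=1, keepdims=True)       # q_{j|i}, Equation (2)
    Pc, Qc = np.maximum(P, eps), np.maximum(Q, eps)
    recall_term = np.sum(Pc * np.log(Pc / Qc))      # penalizes misses
    precision_term = np.sum(Qc * np.log(Qc / Pc))   # penalizes false positives
    return lam * recall_term + (1.0 - lam) * precision_term
```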
To optimize the cost function (4) with respect to the output coordinates yi of each data point,
we use a standard conjugate gradient algorithm. The computational complexity of each iteration
is O(dn²), where n is the number of data points and d the dimension of the projection. (In our
earlier conference paper a coarse approximate algorithm was required for speed; this turned out
to be unnecessary, and the O(dn²) complexity does not require any approximation.) Note that if
a pairwise distance matrix in the input space is not directly provided as data, it can as usual be
computed from input features; this is a one-time computation done at the start of the algorithm and
takes O(Dn²) time, where D is the input dimensionality.
In general, NeRV optimizes a user-defined cost which forms a tradeoff between mean smoothed
precision and mean smoothed recall. If we set λ = 1 in Equation (4), we obtain the cost function of
stochastic neighbor embedding (SNE; see Hinton and Roweis, 2002). Hence we get as a side result
a new interpretation of SNE as a method that maximizes mean smoothed recall.
3.0.5 PRACTICAL ADVICE ON OPTIMIZATION
After computing the distance matrix from the input data, we scale the input distances so that the
average distance is equal to 1. We use a random projection onto the unit square as a starting point
for the algorithm. Even this simple choice has turned out to give better results than alternatives; a
more intelligent initialization, such as projecting the data using principal component analysis, can
of course also be used.
To speed up convergence and avoid local minima, we apply a further initialization step: we run
ten rounds of conjugate gradient (two conjugate gradient steps per round), and after each round
decrease the neighborhood scaling parameters σi used in Equations (2) and (3). Initially, we set the
σ_i to half the diameter of the input data. We decrease them linearly so that the final value makes
the entropy of the p_{·|i} distribution equal to log k, for an effective number of neighbors k, which is
the choice recommended in Section 2.2. This initialization step has the same complexity O(dn²) per iteration
as the rest of the algorithm. After this initialization phase we perform twenty standard conjugate
gradient steps.
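The schedule above can be sketched end to end. The following is an illustrative toy implementation, not the authors' code: it uses SciPy's conjugate-gradient optimizer with numerical gradients, and substitutes a fixed final scale for the entropy-based choice of σ_i.

```python
import numpy as np
from scipy.optimize import minimize

def _probs(D2, sigma):
    W = np.exp(-D2 / sigma[:, None] ** 2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def _cost(y_flat, P, lam, sigma, n, eps=1e-12):
    Y = y_flat.reshape(n, 2)
    D2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    Q = np.maximum(_probs(D2, sigma), eps)
    Pc = np.maximum(P, eps)
    return (lam * np.sum(Pc * np.log(Pc / Q))
            + (1.0 - lam) * np.sum(Q * np.log(Q / Pc)))

def nerv_embed(X, lam=0.5, final_sigma=0.5, seed=0):
    n = X.shape[0]
    D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    D2 /= np.mean(np.sqrt(D2[D2 > 0])) ** 2         # scale mean distance to 1
    Y = np.random.default_rng(seed).random((n, 2))  # random init on the unit square
    # ten annealing rounds: sigma shrinks from half the data diameter
    # toward the final scale, two conjugate-gradient steps per round
    for s in np.linspace(np.sqrt(D2.max()) / 2, final_sigma, 10):
        sigma = np.full(n, s)
        P = _probs(D2, sigma)
        res = minimize(_cost, Y.ravel(), args=(P, lam, sigma, n),
                       method='CG', options={'maxiter': 2})
        Y = res.x.reshape(n, 2)
    # final phase: twenty standard conjugate-gradient steps
    res = minimize(_cost, Y.ravel(), args=(P, lam, sigma, n),
                   method='CG', options={'maxiter': 20})
    return res.x.reshape(n, 2)
```

An analytic gradient (as in the authors' implementation) would replace the finite-difference gradients used here and is essential for data sets of realistic size.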
4. Using NeRV for Unsupervised Visualization
It is easy to apply NeRV for unsupervised dimensionality reduction. As in any unsupervised anal-
ysis, the analyst first chooses a suitable unsupervised similarity or distance measure for the input
data; for vector-valued input data this can be the standard Euclidean distance (which we will use
here), or it can be some other measure suggested by domain knowledge. Once the analyst has spec-
ified the relative importance of precision and recall by choosing a value for λ, the NeRV algorithm
computes the embedding based on the distances it is given.
In this section we will make extensive experiments comparing the performance of NeRV with
other dimensionality reduction methods on unsupervised visualization of several data sets, including
both benchmark data sets and real-life bioinformatics data sets. In the following subsections, we
describe the comparison methods and data sets, briefly discuss the experimental methodology, and
present the results.
4.1 Comparison Methods for Unsupervised Visualization
For the task of unsupervised visualization we compare the performance of NeRV with the follow-
ing unsupervised nonlinear dimensionality reduction methods: principal component analysis (PCA;
Hotelling, 1933), metric multidimensional scaling (MDS; see Borg and Groenen, 1997), locally lin-
ear embedding (LLE; Roweis and Saul, 2000), Laplacian eigenmap (LE; Belkin and Niyogi, 2002a),
Hessian-based locally linear embedding (HLLE; Donoho and Grimes, 2003), isomap (Tenenbaum
et al., 2000), curvilinear component analysis (CCA; Demartines and Herault, 1997), curvilinear dis-
tance analysis (CDA; Lee et al., 2004), maximum variance unfolding (MVU; Weinberger and Saul,
2006), landmark maximum variance unfolding (LMVU; Weinberger et al., 2005), and our previous
method local MDS (LMDS; Venna and Kaski, 2006).
Principal component analysis (PCA; Hotelling, 1933) finds linear projections that maximally
preserve the variance in the data. More technically, the projection directions can be found by solving
for the eigenvalues and eigenvectors of the covariance matrix Cx of the input data points. The
eigenvectors corresponding to the two or three largest eigenvalues are collected into a matrix A, and
the data points xi can then be visualized by projecting them with yi = Axi, where yi is the obtained
low-dimensional representation of xi. PCA is very closely related to linear multidimensional scaling
(linear MDS, also called classical scaling; Torgerson, 1952; Gower, 1966), which tries to find low-
dimensional coordinates preserving squared distances. It can be shown (Gower, 1966) that when
the dimensionality of the sought solutions is the same and the distance measure is Euclidean, the
projection of the original data to the PCA subspace equals the configuration of points found by
linear MDS. This implies that PCA tries to preserve the squared distances between data points, and
that linear MDS finds a solution that is a linear projection of the original data.
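A minimal PCA sketch along the lines just described (function name ours):

```python
import numpy as np

def pca_project(X, d=2):
    """Project data onto the d eigenvectors of the covariance matrix with
    the largest eigenvalues, i.e. the directions of maximal variance."""
    Xc = X - X.mean(axis=0)              # center the data
    C = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    A = evecs[:, ::-1][:, :d]            # columns: the top-d eigenvectors
    return Xc @ A                        # y_i = A^T x_i for each point
```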
Traditional multidimensional scaling (MDS; see Borg and Groenen, 1997) exists in several dif-
ferent variants, but they all have a common goal: to find a configuration of output coordinates that
preserves the pairwise distance matrix of the input data. For the comparison experiments we chose
metric MDS which is the simplest nonlinear MDS method; its cost function (Kruskal, 1964), called
the raw stress, is
E = Σ_{i,j} ( d(x_i, x_j) − d(y_i, y_j) )² , (5)
where d(x_i, x_j) is the distance of points x_i and x_j in the input space and d(y_i, y_j) is the distance of
their corresponding representations (locations) y_i and y_j in the output space. This cost function is
minimized with respect to the representations y_i.
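The raw stress is easy to state in code; a sketch (name ours) given an input distance matrix and a candidate configuration:

```python
import numpy as np

def raw_stress(DX, Y):
    """Raw stress of Equation (5): sum over pairs (i, j) of the squared
    difference between the input distance DX[i, j] and the distance of the
    representations y_i and y_j."""
    DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return float(np.sum((DX - DY) ** 2))
```

Metric MDS minimizes this quantity over the configuration Y, for example by gradient descent.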
Isomap (Tenenbaum et al., 2000) is an interesting variant of MDS, which again finds a config-
uration of output coordinates matching a given distance matrix. The difference is that Isomap does
not compute pairwise input-space distances as simple Euclidean distances but as geodesic distances
along the manifold of the data (technically, along a graph formed by connecting all k-nearest neigh-
bors). Given these geodesic distances the output coordinates are found by standard linear MDS.
When output coordinates are found for such input distances, the manifold structure in the original
data becomes unfolded; it has been shown (Bernstein et al., 2000) that this algorithm is asymptot-
ically able to recover certain types of manifolds. We used the isomap implementation available at
http://isomap.stanford.edu in the experiments.
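The geodesic-distance step can be sketched as follows, assuming NumPy: connect each point to its k nearest Euclidean neighbors and run Floyd-Warshall over the resulting graph (the Stanford implementation referenced above does this more efficiently):

```python
import numpy as np

def geodesic_distances(X, k=5):
    """Approximate geodesic distances: symmetric Euclidean k-NN graph
    followed by all-pairs shortest paths (Floyd-Warshall)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # k nearest neighbors of i
            G[i, j] = G[j, i] = D[i, j]       # symmetrize the graph
    for m in range(n):                        # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G
```

On points sampled along a curve, the geodesic distance between the endpoints approximates the arc length rather than the shorter straight-line chord.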
Curvilinear component analysis (CCA; Demartines and Herault, 1997) is a variant of MDS that
tries to preserve only distances between points that are near each other in the visualization. This is
achieved by weighting each term in the MDS cost function (5) by a coefficient that depends on the
corresponding pairwise distance in the visualization. In the implementation we use, the coefficient
is simply a step function that equals 1 if the distance is below a predetermined threshold and 0 if it
is larger.
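With the step-function coefficient, the weighted cost can be written as a small variation of the raw stress (5). A sketch assuming NumPy; the full CCA algorithm also specifies a particular online optimization scheme not shown here:

```python
import numpy as np

def cca_stress(X, Y, threshold):
    """MDS raw stress with each pairwise term weighted by a step
    function of the corresponding output-space distance."""
    Dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    Dy = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    F = (Dy < threshold).astype(float)    # 1 below the threshold, else 0
    return np.sum(F * (Dx - Dy) ** 2)
```

Setting the threshold to infinity recovers the plain raw stress; small thresholds make the cost ignore point pairs that are far apart on the display.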
Curvilinear distance analysis (CDA; Lee et al., 2000, 2004) is an extension of CCA. The idea is
to replace the Euclidean distances in the original space with geodesic distances in the same manner
as in the isomap algorithm. Otherwise the algorithm stays the same.
Local MDS (LMDS; Venna and Kaski, 2006) is our earlier method, an extension of CCA that
focuses on local proximities with a tunable cost function tradeoff. It can be seen as a first step in the
development of the ideas of NeRV.
The locally linear embedding (LLE; Roweis and Saul, 2000) algorithm is based on the assump-
tion that the data manifold is smooth enough and is sampled densely enough, such that each data
point lies close to a locally linear subspace on the manifold. LLE makes a locally linear approx-
imation of the whole data manifold: LLE first estimates a local coordinate system for each data
point, by calculating linear coefficients that reconstruct the data point as well as possible from its k
nearest neighbors. To unfold the manifold, LLE finds low-dimensional coordinates that preserve the
previously estimated local coordinate systems as well as possible. Technically, LLE first minimizes
the reconstruction error E(W) = ∑i ‖xi − ∑j Wi,j xj‖² with respect to the coefficients Wi,j, under the
constraints that Wi,j = 0 if i and j are not neighbors, and ∑j Wi,j = 1. Given the weights, the low-
dimensional configuration of points is next found by minimizing E(Y) = ∑i ‖yi − ∑j Wi,j yj‖² with
respect to the low-dimensional representation yi of each data point.
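The first step, solving the reconstruction weights for one data point, can be sketched as follows (assuming NumPy; the small regularization term is our addition for numerical stability, as the local Gram matrix can be singular):

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Solve for weights w that reconstruct x from its k nearest
    neighbors, under the constraint sum(w) = 1."""
    Z = neighbors - x                        # shift the point to the origin
    G = Z @ Z.T                              # local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(G))  # regularize for stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                       # enforce the sum-to-one constraint
```

When the point lies at the centroid of symmetrically placed neighbors, the recovered weights are uniform and reconstruct it exactly.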
The Laplacian eigenmap (LE; see Belkin and Niyogi, 2002a) uses a graph embedding approach.
An undirected k-nearest-neighbor graph is formed, where each data point is a vertex. Points i and j
are connected by an edge with weight Wi, j = 1 if j is among the k nearest neighbors of i, otherwise
the edge weight is set to zero; this simple weighting method has been found to work well in practice
(Belkin and Niyogi, 2002b). To find a low-dimensional embedding of the graph, the algorithm tries
to put points that are connected in the graph as close to each other as possible and does not care what
happens to the other points. Technically, it minimizes (1/2) ∑i,j ‖yi − yj‖² Wi,j = yᵀLy with respect to
the low-dimensional point locations yi, where L = D − W is the graph Laplacian and D is a diagonal
matrix with elements Dii = ∑j Wi,j. However, this cost function has an undesirable trivial solution:
putting all points in the same position would minimize the cost. This can be avoided by adding suitable
constraints. In practice the low-dimensional configuration is found by solving the generalized
eigenvalue problem Ly = λDy (Belkin and Niyogi, 2002a). The smallest eigenvalue corresponds
to the trivial solution, but the eigenvectors corresponding to the next smallest eigenvalues give the
Laplacian eigenmap solution.
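A compact sketch of the whole procedure, assuming NumPy and a connected neighborhood graph; the generalized problem Ly = λDy is turned into a standard symmetric one through the substitution v = D^{1/2}y:

```python
import numpy as np

def laplacian_eigenmap(X, k=5, d=2):
    """Symmetric 0/1 k-NN graph, graph Laplacian L = D - W, then the
    eigenvectors after the trivial constant solution give the embedding."""
    n = len(X)
    Dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(Dist[i])[1:k + 1]:
            W[i, j] = W[j, i] = 1.0              # symmetrized 0/1 weights
    Dg = W.sum(axis=1)
    L = np.diag(Dg) - W                          # graph Laplacian
    Dinv_sqrt = np.diag(1.0 / np.sqrt(Dg))       # whitening by D^{-1/2}
    vals, vecs = np.linalg.eigh(Dinv_sqrt @ L @ Dinv_sqrt)
    return Dinv_sqrt @ vecs[:, 1:d + 1]          # skip the trivial solution
```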
The Laplacian eigenmap algorithm reduces to solving a generalized eigenvalue problem because
the cost function that is minimized is a quadratic form involving the Laplacian matrix L. The
Hessian-based locally linear embedding (HLLE; Donoho and Grimes, 2003) algorithm is similar,
but the Laplacian L is replaced by the Hessian H.
The maximum variance unfolding algorithm (MVU; Weinberger and Saul, 2006) expresses di-
mensionality reduction as a semidefinite programming problem. One way of unfolding a folded flag
is to pull its four corners apart, but not so hard as to tear the flag. MVU applies this idea to projecting
a manifold: the projection maximizes variance (pulling apart) while preserving distances between
neighbors (no tears). The constraint of local distance preservation can be expressed in terms of the
Gram matrix K of the mapping. Maximizing the variance of the mapping is equivalent to maxi-
mizing the trace of K under a set of constraints, which, it turns out, can be done using semidefinite
programming.
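The variance-trace equivalence is a one-line identity that is easy to check numerically (a sketch assuming NumPy): for a centered mapping Y, trace(K) = trace(YYᵀ) = ∑i ‖yi‖², which is n times the variance of the mapping.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(50, 2))
Y -= Y.mean(axis=0)            # center the mapping
K = Y @ Y.T                    # Gram matrix of the mapping
# trace(K) equals the summed squared norms of the mapped points
print(np.isclose(np.trace(K), np.sum(Y ** 2)))  # True
```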
A notable disadvantage of MVU is the time required to solve a semidefinite program for n× n
matrices when the number of data points n is large. Landmark MVU (LMVU; Weinberger et al.,
2005) addresses this issue by significantly reducing the size of the semidefinite programming prob-
lem. Like LLE, LMVU assumes that the data manifold is sufficiently smooth and densely sampled
that it is locally approximately linear. Instead of embedding all the data points directly as MVU
does, LMVU randomly chooses m≪ n inputs as so-called landmarks. Because of the local linear-
ity assumption, the other data points can be approximately reconstructed from the landmarks using
a linear transformation. It follows that the Gram matrix K can be approximated using the m×m
submatrix of inner products between landmarks. Hence we only need to optimize over m×m matri-
ces, a much smaller semidefinite program. Other recent approaches for speeding up MVU include
matrix factorization based on a graph Laplacian (Weinberger et al., 2007).
In addition to the above comparison methods, other recent work on dimensionality reduction
includes minimum volume embedding (MVE; Shaw and Jebara, 2007), which is similar to MVU.
Whereas MVU maximizes the whole trace of the Gram matrix (the sum of all eigenvalues), MVE
maximizes the sum of the first few eigenvalues and minimizes the sum of the rest, in order to preserve
as much of the eigenspectrum energy as possible in the few dimensions that remain after dimensionality
reduction. In practice, a variational upper bound of the resulting criterion is optimized.
Very recently, van der Maaten et al. (2009) compared a number of unsupervised methods in terms
of classification accuracy and our earlier trustworthiness and continuity criteria.
4.2 Data Sets for Unsupervised Visualization
We used two synthetic benchmark data sets and four real-life data sets for our experiments.
The plain s-curve data set is an artificial set sampled from an S-shaped two-dimensional surface
embedded in three-dimensional space. An almost perfect two-dimensional representation should be
possible for a nonlinear dimensionality reduction method, so this data set works as a sanity check.
The noisy s-curve data set is otherwise identical to the plain s-curve data set, but significant
spherical normally distributed noise has been added to each data point. The result is a cloud of
points where the original S-shape is difficult to discern by visual inspection.
The faces data set consists of ten different face images of 40 different people, for a total of 400
images. For a given subject, the images vary in terms of lighting and facial expressions. The size of
each image is 64×64 pixels, with 256 grey levels per pixel. The data set is available for download
at http://www.cs.toronto.edu/˜roweis/data.html.
The mouse gene expression data set is a collection of gene expression profiles from different
mouse tissues (Su et al., 2002). Expression of over 13,000 mouse genes had been measured in
45 tissues. We used an extremely simple filtering method, similar to that originally used by Su
et al. (2002), to select the genes for visualization. Of the mouse genes clearly expressed (average
difference in Affymetrix chips, AD > 200) in at least one of the 45 tissues (dimensions), a random
sample of 1600 genes (points) was selected. After this the variance in each tissue was normalized
to unity.
The gene expression compendium data set is a large collection of human gene expression arrays
(http://dags.stanford.edu/cancer; Segal et al., 2004). Since the current implementations of the
methods do not tolerate missing data, we removed samples with missing values altogether: first
we removed genes that were missing from more than 300 arrays, and then we removed the arrays
for which values were still missing. This resulted in a data set containing 1278 points and 1339
dimensions.
The sea-water temperature time series data set (Liitiainen and Lendasse, 2007) is a time series
of weekly temperature measurements of sea water over several years. Each data point is a time
window of 52 weeks, which is shifted one week forward for the next data point. Altogether there
are 823 data points and 52 dimensions.
4.3 Methodology for the Unsupervised Experiments
The performance of NeRV was compared with the 11 unsupervised dimensionality reduction methods
described in Section 4.1, namely principal component analysis (PCA), metric multidimensional