Lecture 8: Multidimensional scaling
Advanced Applied Multivariate Analysis
STAT 2221, Spring 2015

Sungkyu Jung, Department of Statistics, University of Pittsburgh
Xingye Qiao, Department of Mathematical Sciences, Binghamton University, State University of New York
E-mail: [email protected]
Perception of Color in human vision
MDS reproduces the well-known two-dimensional color circle.
Distance, dissimilarity and similarity
Distance, dissimilarity and similarity (or proximity) are defined for any pair of objects in any space. In mathematics, a distance function (one that gives a distance between two objects) is also called a metric, satisfying
1. d(x, y) ≥ 0,
2. d(x, y) = 0 if and only if x = y,
3. d(x, y) = d(y, x),
4. d(x, z) ≤ d(x, y) + d(y, z).
Given a set of dissimilarities, one can ask whether these values are distances and, moreover, whether they can even be interpreted as Euclidean distances.
Euclidean and non-Euclidean distance
Given a dissimilarity (distance) matrix D = (d_ij), MDS seeks to find x_1, . . . , x_n ∈ R^p (called a configuration) so that

d_ij ≈ ‖x_i − x_j‖_2, as close as possible.

Oftentimes, for some large p, there exists a configuration x_1, . . . , x_n with an exact/perfect distance match d_ij ≡ ‖x_i − x_j‖_2. In such a case the distance d involved is called a Euclidean distance. There are, however, cases where the dissimilarity is a distance, but there exists no configuration in any R^p with a perfect match:

d_ij ≠ ‖x_i − x_j‖_2, for some i, j.

Such a distance is called a non-Euclidean distance.
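Whether a given dissimilarity matrix is Euclidean can be checked numerically: a symmetric D with zero diagonal is a Euclidean distance matrix exactly when the doubly centered matrix B = −(1/2) C D₂ C′, constructed later in these slides, has no negative eigenvalues. A minimal sketch in Python/NumPy (the helper name is_euclidean is ours, not from the slides):

import numpy as np

def is_euclidean(D, tol=1e-8):
    # True if some configuration x_1, ..., x_n in R^p reproduces D exactly.
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * C @ (D ** 2) @ C               # doubly centered squared distances
    return np.linalg.eigvalsh(B).min() >= -tol   # Euclidean iff B is positive semi-definite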
non-Euclidean distance
The radian distance function on a circle is a metric. It cannot be embedded in R^1 (in other words, one cannot find x_1, . . . , x_4 ∈ R to match the distances), nor in any R^p (not shown here).

Nevertheless, MDS seeks to find an optimal configuration x_i that gives d_ij ≈ ‖x_i − x_j‖_2 as close as possible.
The next section would be . . .
1. Introduction to MDS
2. Classical MDS
3. Metric and non-Metric MDS
classical Multidimensional Scaling (cMDS) – theory
Suppose for now we have a Euclidean distance matrix D = (d_ij). The objective of classical Multidimensional Scaling (cMDS) is to find X = [x_1, . . . , x_n] so that ‖x_i − x_j‖ = d_ij. Such a solution is not unique, because if X is a solution, then x*_i := x_i + c, c ∈ R^q, also satisfies

‖x*_i − x*_j‖ = ‖(x_i + c) − (x_j + c)‖ = ‖x_i − x_j‖ = d_ij.

Any location c can be used, but the assumption of a centered configuration, i.e.,

Σ_{i=1}^n x_i = 0,    (1)

serves well for the purpose of dimension reduction.
In short, cMDS finds the centered configuration x_1, . . . , x_n ∈ R^q, for some q ≤ n − 1, so that their pairwise distances are the same as the corresponding distances in D.
We may find the n × n Gram matrix B = X′X, rather than X. The Gram matrix is the matrix of inner products; denote the ij-th element of B by b_ij. We have

d²_ij = b_ii + b_jj − 2 b_ij,    (2)

from the fact that d²_ij = ‖x_i − x_j‖² = x′_i x_i + x′_j x_j − 2 x′_i x_j.

Remember, we seek to solve for the b_ij from the d_ij (see the next few slides).
The centering constraint (1) leads to

Σ_{i=1}^n b_ij = Σ_{i=1}^n x′_i x_j = Σ_{i=1}^n Σ_{k=1}^q x_ik x_jk = Σ_{k=1}^q x_jk Σ_{i=1}^n x_ik = 0,

for j = 1, . . . , n. Hence, the sum of each row or column of B is 0.

With the notation T = trace(B) = Σ_{i=1}^n b_ii, we have

Σ_{i=1}^n d²_ij = T + n b_jj,   Σ_{j=1}^n d²_ij = T + n b_ii,   Σ_{j=1}^n Σ_{i=1}^n d²_ij = 2nT.    (3)
Combining (2) and (3), the solution is unique:

b_ij = −(1/2)(d²_ij − d̄²_·j − d̄²_i· + d̄²_··),

where d̄²_·j is the average of {d²_ij, i = 1, . . . , n} for each j, d̄²_i· is the average of {d²_ij, j = 1, . . . , n} for each i, and d̄²_·· is the average of {d²_ij, i, j = 1, . . . , n}; or, equivalently,

B = −(1/2) C D₂ C′,

where D₂ = (d²_ij) and C is the centering matrix (C = I_n − (1/n) J_n, with J_n the matrix of ones).

A solution X is then given by the eigendecomposition of B (= X′X). That is, for B = VΛV′,

X = Λ^{1/2} V′.    (4)
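For illustration, a minimal sketch of the whole recipe in Python/NumPy, following (1)–(4); the function name cmds and the return convention (points as rows) are our own choices, not from the slides:

import numpy as np

def cmds(D, p):
    # Classical MDS: an n x p configuration from an n x n distance matrix D.
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n   # centering matrix C
    B = -0.5 * C @ (D ** 2) @ C           # B = -(1/2) C D2 C'
    lam, V = np.linalg.eigh(B)            # eigendecomposition B = V Lam V'
    order = np.argsort(lam)[::-1]         # largest eigenvalues first
    lam, V = lam[order], V[:, order]
    lam_p = np.clip(lam[:p], 0.0, None)   # drop negative eigenvalues (non-Euclidean case, below)
    return V[:, :p] * np.sqrt(lam_p)      # rows are x_1, ..., x_n; this is X_(p)' of the next slide

The clipping step anticipates the non-Euclidean examples below, where some eigenvalues of B are negative and the corresponding coordinates must be discarded.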
The rows of X lie in the space spanned by the rows of V′, i.e., by the eigenvectors of B.

Consider PCA based on {x_i} (centered) through the singular value decomposition. We have X = UΘV′, and the PC scores are Z = U′X = ΘV′, also in the space spanned by the rows of V′. It turns out that U = I_q and Θ = Λ^{1/2}.

The first coordinate of X has the largest variation (recall the interpretation of X as PCA scores above).

If we wish to reduce the dimension to p ≤ q, then the first p rows of X, written X_(p), best preserve the distances d_ij among all linear dimension reductions of X:

X_(p) = Λ_p^{1/2} V′_p,

where Λ_p is the first p × p submatrix of Λ and V_p is the first p columns of V.
To see that the first p coordinates of x_i indeed best preserve the distances, note that the distance between x_i and x_j ∈ R^q satisfies

d²_ij = ‖x_i − x_j‖² = ‖x_i^(1:p) − x_j^(1:p)‖² + ‖x_i^(*) − x_j^(*)‖²,

where x_i^(1:p) is the subvector of x_i that we keep and x_i^(*) is the part we throw away. It is easy to see that since the variation of x_i^(*) is small, the value of ‖x_i^(*) − x_j^(*)‖² is small too (on average).
cMDS remarks
cMDS gives configurations X_(p) in R^p for any dimension 1 ≤ p ≤ q.
The configuration is centered.
Coordinates are given by the principal scores, ordered from largest to smallest variation.
Dimension reduction from X = X_(q) to X_(p) (p < q) is the same as PCA (cutting some PC scores out).
Leads to an exact solution if the dissimilarity is based on Euclidean distances.
Can also be used for non-Euclidean distances, in fact, for any dissimilarities.
cMDS examples
Consider two worked examples:
1. with Euclidean geometry (tetrahedron – unit edge length)
2. with circular geometry
And the airline distances example (Izenman 13.1.1)
cMDS examples: tetrahedron
The pairwise distance matrix for the tetrahedron (with unit distances) is

D =
| 0 1 1 1 |
| 1 0 1 1 |
| 1 1 0 1 |
| 1 1 1 0 | ,

leading to the Gram matrix B (4 × 4) with eigenvalues (.5, .5, .5, 0). Using dimension p = 3, we have perfectly retrieved the tetrahedron.
cMDS examples: circular distances
For the circular (radian) distance matrix, the same procedure leads to the Gram matrix B (4 × 4) with eigenvalues

diag(Λ) = (5.6117, −1.2039, −0.0000, 2.2234).

In retrieving the coordinate matrix X, we cannot take a square root of Λ, since it gives complex numbers.
Remedy: keep only the positive eigenvalues and the corresponding coordinates. In this case, take coordinates 1 and 4. This is the price we pay to represent non-Euclidean geometry by Euclidean geometry.
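As a quick numeric check of the tetrahedron example (a sketch assuming NumPy; the eigenvalues are those quoted on the slide):

import numpy as np

D = np.ones((4, 4)) - np.eye(4)          # unit-edge tetrahedron distances
n = D.shape[0]
C = np.eye(n) - np.ones((n, n)) / n      # centering matrix
B = -0.5 * C @ (D ** 2) @ C              # Gram matrix
print(np.round(np.linalg.eigvalsh(B), 4))   # [0. 0.5 0.5 0.5], i.e. (.5, .5, .5, 0)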
cMDS examples: circular distances
Using dimension p = 2 (we cannot use p > 2), the configuration X_(2) is shown in the figure (not reproduced here). Compare the original distance matrix D with the approximated distance matrix D̂ = (‖x_i − x_j‖_2).
cMDS examples: Airline distances (Izenman 13.1.1)
Take the first 3 largest eigenvalues (by inspection of the scree plot). The resulting configuration figures are not reproduced here.
The next section would be . . .
1. Introduction to MDS
2. Classical MDS
3. Metric and non-Metric MDS
Distance Scaling

Classical MDS seeks to find an optimal configuration x_i that gives d_ij ≈ d̂_ij = ‖x_i − x_j‖_2 as close as possible.

Distance scaling relaxes d̂_ij ≈ d_ij from cMDS by allowing

d̂_ij ≈ f(d_ij), for some monotone function f.

Called metric MDS if the dissimilarities d_ij are quantitative.
Called non-metric MDS if the dissimilarities d_ij are qualitative (e.g. ordinal).
Unlike cMDS, distance scaling is an optimization process minimizing a stress function, and is solved by iterative algorithms.
The (usual) metric MDS

Given a (low) dimension p and a monotone function f, metric MDS seeks to find an optimal configuration X ⊂ R^p that gives f(d_ij) ≈ d̂_ij = ‖x_i − x_j‖_2 as close as possible.

The function f can be taken to be a parametric monotone function, such as f(d_ij) = α + βd_ij.

'As close as possible' is now explicitly stated by the square loss

stress = L(d̂_ij) = [ (1 / Σ_{ℓ<k} d̂²_ℓk) Σ_{i<j} (d̂_ij − f(d_ij))² ]^{1/2},

and metric MDS minimizes L(d̂_ij) over all d̂_ij and α, β.

The usual metric MDS is the special case f(d_ij) = d_ij; its solution (from optimization) = that of classical MDS.
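As a sketch of the loss itself (our own NumPy illustration; the names metric_stress and Dhat are not from the slides), with the parametric choice f(d) = α + βd:

import numpy as np

def metric_stress(Dhat, D, alpha, beta):
    # Square-loss stress for metric MDS; Dhat holds the configuration distances.
    iu = np.triu_indices(D.shape[0], k=1)   # each pair (i < j) once
    dh, d = Dhat[iu], D[iu]
    f = alpha + beta * d                    # monotone transform of the dissimilarities
    return np.sqrt(((dh - f) ** 2).sum() / (dh ** 2).sum())

An iterative algorithm then alternates between updating the configuration (hence Dhat) and the parameters α, β.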
Sammon mapping
Sammon mapping is a generalization of the usual metric MDS.
Sammon's stress (to be minimized) is

Sammon's stress(d̂_ij) = (1 / Σ_{ℓ<k} d_ℓk) Σ_{i<j} (d_ij − d̂_ij)² / d_ij.

This weighting system normalizes the squared errors in pairwise distances by using the distance in the original space. As a result, Sammon mapping preserves the small d_ij better, giving them a greater degree of importance in the fitting procedure than larger values of d_ij. Useful in identifying clusters.

The optimal solution is found by numerical computation (with the initial value supplied by cMDS).
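One way to carry out that numerical minimization is with a general-purpose optimizer; a hedged sketch assuming SciPy (the function names are ours, and this is not Sammon's original gradient scheme):

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def sammon_stress(xflat, d, p):
    # Sammon's stress for a flattened n x p configuration; d = original distances (i < j), all > 0.
    dh = pdist(xflat.reshape(-1, p))        # configuration distances, same (i < j) ordering
    return ((d - dh) ** 2 / d).sum() / d.sum()

def sammon(D, p, X0):
    # Minimize Sammon's stress, starting from the cMDS configuration X0 (n x p).
    d = D[np.triu_indices(D.shape[0], k=1)]
    res = minimize(sammon_stress, X0.ravel(), args=(d, p), method="L-BFGS-B")
    return res.x.reshape(-1, p)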
cMDS vs. Sammon Mapping
Izenman Figure 13.9 (lower panel)
Results of cMDS and Sammon mapping for p = 2: Sammon mapping better preserves the inter-point distances for smaller dissimilarities, while proportionally squeezing the inter-point distances for larger dissimilarities.

There is no ground truth here.
Non-metric MDS
In many applications of MDS, dissimilarities are known only by their rank order, and the spacing between successively ranked dissimilarities is of no interest or is unavailable.

Non-metric MDS

Given a (low) dimension p, non-metric MDS seeks to find an optimal configuration X ⊂ R^p that gives f(d_ij) ≈ d̂_ij = ‖x_i − x_j‖_2 as close as possible.

Unlike metric MDS, here f is much more general and is only implicitly defined.

f(d_ij) = d*_ij are called disparities, which only preserve the order of the d_ij, i.e.,

d_ij < d_kℓ  ⇔  f(d_ij) ≤ f(d_kℓ)  ⇔  d*_ij ≤ d*_kℓ.    (5)
Kruskal’s non-metric MDS
Kruskal's non-metric MDS minimizes the stress-1

stress-1(d̂_ij, d*_ij) = [ (1 / Σ_{k<ℓ} d̂²_kℓ) Σ_{i<j} (d̂_ij − d*_ij)² ]^{1/2}.

Note that the original dissimilarities are only used in checking (5). In fact, only the order d_ij < d_kℓ < · · · among the dissimilarities is needed.
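The mechanics of one update can be sketched as follows: given the current configuration distances, the disparities d*_ij are the best monotone fit (in the rank order of the d_ij) to those distances, which is exactly an isotonic regression. A sketch assuming scikit-learn (the helper name stress1_step is ours):

import numpy as np
from sklearn.isotonic import IsotonicRegression

def stress1_step(D, Dhat):
    # Disparities and stress-1 for the current configuration distances Dhat.
    iu = np.triu_indices(D.shape[0], k=1)
    d, dh = D[iu], Dhat[iu]
    dstar = IsotonicRegression(increasing=True).fit_transform(d, dh)   # monotone in d, per (5)
    stress1 = np.sqrt(((dh - dstar) ** 2).sum() / (dh ** 2).sum())
    return dstar, stress1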
The function f works as if it were a regression curve: the approximated dissimilarities d̂_ij play the role of the observed y, the disparities d*_ij the predicted y, and the order of the dissimilarities the explanatory variable x. (Izenman's notation for these quantities differs.)
Example: Letter recognition
Wolford and Hollingsworth (1974) were interested in the confusions made when a person attempts to identify letters of the alphabet viewed for some milliseconds only. A confusion matrix was constructed that shows the frequency with which each stimulus letter was mistakenly called something else. A section of this matrix is shown in the table below (not reproduced here).
Is this a dissimilarity matrix?
Example: Letter recognition
How to deduce dissimilarities from a similarity matrix? From similarities δ_ij, choose a maximum similarity c ≥ max δ_ij, so that

d_ij = c − δ_ij if i ≠ j,  and  d_ij = 0 if i = j.

The larger d_ij is, the less similar i and j are.

Which method is more appropriate? Because we have deduced dissimilarities from similarities, the absolute dissimilarities d_ij depend on the personally chosen value of c. This is the case where non-metric MDS makes the most sense. However, we will also see that the metric scalings (cMDS and Sammon mapping) do the job as well.

How many dimensions? By inspection of the eigenvalues from the cMDS solution.
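The conversion itself is one line of code; a sketch assuming NumPy, with S holding the similarities δ_ij (the helper name sim_to_dissim is ours):

import numpy as np

def sim_to_dissim(S, c=None):
    # Dissimilarities d_ij = c - delta_ij off the diagonal, d_ii = 0.
    if c is None:
        c = S.max() + 1          # e.g. c = 21 = max similarity + 1, as on the next slide
    D = c - S
    np.fill_diagonal(D, 0.0)
    return D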
Letter recognition
First choose c = 21 (= max δ_ij + 1). Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress-1).
Letter recognition
First choose c = 21 (= max δ_ij + 1). Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress-1).
Letter recognition:
Do you see any clusters?
With c = 21 (= max δ_ij + 1), the eigenvalues of the Gram matrix B in the calculation of cMDS are:
508.5707, 236.0530, 124.8229, 56.0627, 39.7347, −0.0000, −35.5449, −97.1992
The choice of p = 2 or p = 3 seems reasonable.
Letter recognition
Second choice of c = 210 (= max δ_ij + 190). Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress-1).
Letter recognition
Second choice of c = 210 (= max δ_ij + 190). Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress-1).
Letter recognition:
With c = 210, the eigenvalues of the Gram matrix B in the calculation of cMDS are:
(× 10⁴) 2.7210, 2.2978, 2.1084, 1.9623, 1.9133, 1.7696, 1.6842, 0.0000
May need more than 3 dimensions.
Letter recognition: Summary
The structure of the data is appropriate for non-metric MDS.

Kruskal's non-metric scaling:
1. Appropriate for non-metric dissimilarities (the goal is to preserve order).
2. Optimization: susceptible to local minima (leading to different configurations).
3. Time-consuming.

cMDS is fast and overall good.
Sammon mapping fails when c = 210.
Letter recognition: Summary
Clusters (C, G), (D, Q), (H, M, N, W) are confirmed by a cluster analysis for either choice of c, using agglomerative hierarchical clustering with average linkage.