Statistics 202: Data Mining
Distances and similarities
Based in part on slides from the textbook and slides of Susan Holmes
© Jonathan Taylor
October 3, 2012
Similarities
Start with $X$, which we assume is centered and standardized.
The PCA loadings were given by the eigenvectors of the correlation matrix, which is a measure of similarity.
The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot.
Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot.
Olympic data
Similarities
The PCA loadings were given by the eigenvectors of the correlation matrix, which is a measure of similarity.
The visualization of the cases based on PCA scores was determined by the eigenvectors of $X X^T$, an $n \times n$ matrix.
To see this, remember that
$$X = U \Delta V^T$$
so
$$X X^T = U \Delta^2 U^T$$
$$X^T X = V \Delta^2 V^T$$
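The two identities above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random toy data, the seed, and all variable names are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: n = 6 cases, p = 3 features.
raw = rng.normal(size=(6, 3))

# Center and standardize each column, as assumed for X.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Thin SVD: X = U diag(delta) V^T.
U, delta, Vt = np.linalg.svd(X, full_matrices=False)

case_similarity = X @ X.T      # n x n, similarity between cases
feature_similarity = X.T @ X   # p x p, similarity between features

# Rebuild both matrices from the SVD factors.
gram_from_svd = U @ np.diag(delta**2) @ U.T
features_from_svd = Vt.T @ np.diag(delta**2) @ Vt

# PCA scores are U Delta, which equals X V.
scores = U * delta
```

The first two columns of `scores` form the $n \times 2$ matrix used for the scatter plot of cases.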
Similarities
The matrix $X X^T$ is a measure of similarity between cases.
The matrix $X^T X$ is a measure of similarity between features.
Structure in the two similarity matrices yields insight into the set of cases, or the set of features . . .
Distances
Distances are inversely related to similarities.
If A and B are similar, then d(A, B) should be small, i.e. they should be near.
If A and B are distant, then they should not be similar.
For a data matrix, there is a natural distance between cases:
$$d(X_i, X_k)^2 = \|X_i - X_k\|_2^2 = \sum_{j=1}^p (X_{ij} - X_{kj})^2 = (X X^T)_{ii} - 2 (X X^T)_{ik} + (X X^T)_{kk}$$
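As a small illustration, squared distances between cases can be read off the matrix $X X^T$ directly. This is a sketch assuming NumPy; the function name and the three toy rows are hypothetical.

```python
import numpy as np

def squared_distances_from_gram(G):
    """Return the matrix with entries G_ii - 2 G_ik + G_kk."""
    diag = np.diag(G)
    return diag[:, None] - 2.0 * G + diag[None, :]

# Hypothetical data matrix: 3 cases (rows), 2 features (columns).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [1.0, 1.0]])

G = X @ X.T                          # similarity between cases
D2 = squared_distances_from_gram(G)  # squared Euclidean distances
```

Here `D2[0, 1]` is the squared distance between the first two rows, which is $3^2 + 4^2 = 25$.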
Distances
This suggests a natural transformation between a similarity matrix $S$ and a distance matrix $D$:
$$D_{ik} = (S_{ii} - 2 S_{ik} + S_{kk})^{1/2}$$
The reverse transformation is not so obvious. Some suggestions from your book:
$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}$$
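These transformations are short enough to write out. The sketch below uses only the standard library; the function names are hypothetical, and clipping small negative values before the square root is a choice of this sketch to absorb rounding error.

```python
import math

def distance_from_similarity(S):
    """D_ik = sqrt(S_ii - 2 S_ik + S_kk) for a square similarity matrix S (list of lists)."""
    n = len(S)
    return [
        # max(..., 0.0) guards against tiny negative values caused by rounding.
        [math.sqrt(max(S[i][i] - 2.0 * S[i][k] + S[k][k], 0.0)) for k in range(n)]
        for i in range(n)
    ]

# Three candidate maps from a distance d to a similarity.
def similarity_negation(d):
    return -d

def similarity_exponential(d):
    return math.exp(-d)

def similarity_reciprocal(d):
    return 1.0 / (1.0 + d)
```

All three maps are decreasing in the distance, so larger distances give smaller similarities; the last two also map a distance of 0 to a similarity of 1.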
Distances
A distance (or a metric) on a set $S$ is a function $d : S \times S \to [0, +\infty)$ that satisfies
$d(x, x) = 0$; $d(x, y) = 0 \iff x = y$
$d(x, y) = d(y, x)$
$d(x, y) \le d(x, z) + d(z, y)$
If the other properties hold but $d(x, y) = 0$ for some $x \ne y$, then $d$ is only a pseudo-metric.
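The axioms can be tested by brute force on a finite sample of points. This is only a sketch with hypothetical names: a finite check can refute an axiom, but it cannot prove that the axiom holds on an infinite set.

```python
from itertools import product

def check_metric(d, points, tol=1e-12):
    """Check each metric axiom for d on a finite collection of points."""
    pts = list(points)
    return {
        "zero_on_diagonal": all(abs(d(x, x)) <= tol for x in pts),
        "separates_points": all(
            d(x, y) > tol for x, y in product(pts, pts) if x != y
        ),
        "symmetric": all(
            abs(d(x, y) - d(y, x)) <= tol for x, y in product(pts, pts)
        ),
        "triangle": all(
            d(x, y) <= d(x, z) + d(z, y) + tol
            for x, y, z in product(pts, pts, pts)
        ),
    }

def absolute_difference(x, y):
    return abs(x - y)

def first_coordinate_distance(x, y):
    # Ignores the second coordinate, so distinct points can be at distance 0.
    return abs(x[0] - y[0])
```

On the points `(0, 0)`, `(0, 1)`, `(2, 5)`, `first_coordinate_distance` passes every check except `separates_points`, which is the pseudo-metric case.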
Similarities
A similarity on a set $S$ is a function $s : S \times S \to \mathbb{R}$ that should satisfy
$s(x, x) \ge s(x, y)$ for all $x \ne y$
$s(x, y) = s(y, x)$
By adding a constant, we can often assume that $s(x, y) \ge 0$.
Examples: nominal data
The simplest example for nominal data is just the discrete metric
$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$
The corresponding similarity would be
$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} \; = 1 - d(x, y)$$
Examples: ordinal data
If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers, e.g. $\{0, 1, \dots, m - 1\}$.
If $|S| = m$ then a natural distance is
$$d(x, y) = \frac{|x - y|}{m - 1} \le 1$$
The corresponding similarity would be
$$s(x, y) = 1 - d(x, y)$$
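The nominal and ordinal distances can be sketched in a few lines. The attribute values below are hypothetical, and the ordinal levels are assumed to be identified with the ranks 0, 1, ..., m - 1.

```python
def discrete_distance(x, y):
    """Discrete metric for nominal values: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1

def ordinal_distance(x, y, levels):
    """|rank(x) - rank(y)| / (m - 1), where levels lists the m >= 2 values, lowest first."""
    rank = {level: position for position, level in enumerate(levels)}
    return abs(rank[x] - rank[y]) / (len(levels) - 1)

def similarity(distance):
    """Similarity corresponding to a distance that lies in [0, 1]."""
    return 1 - distance

SIZES = ["small", "medium", "large"]  # hypothetical ordinal attribute
```

With these levels, `ordinal_distance("small", "large", SIZES)` attains the maximum value 1, and adjacent levels are at distance 0.5.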
Examples: vectors of continuous data
If $S = \mathbb{R}^k$ there are lots of distances determined by norms.
The Minkowski $p$ or $\ell_p$ norm, for $p \ge 1$:
$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^k |x_i - y_i|^p \right)^{1/p}$$
Examples:
$p = 2$: the usual Euclidean distance, $\ell_2$
$p = 1$: $d(x, y) = \sum_{i=1}^k |x_i - y_i|$, the "taxicab distance", $\ell_1$
$p = \infty$: $d(x, y) = \max_{1 \le i \le k} |x_i - y_i|$, the sup norm, $\ell_\infty$
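A minimal sketch of the $\ell_p$ distance for $p \ge 1$, including the limiting case $p = \infty$, using only the standard library. The function name is hypothetical.

```python
import math

def minkowski_distance(x, y, p):
    """l_p distance between equal-length sequences, for p >= 1 or p = math.inf."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("this sketch only covers p >= 1")
    gaps = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(gaps)  # sup norm: largest coordinate-wise gap
    return sum(g ** p for g in gaps) ** (1.0 / p)
```

For the points (0, 0) and (3, 4) the taxicab, Euclidean, and sup-norm distances are 7, 5, and 4 respectively.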
Examples: vectors of continuous data
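If $0 < p < 1$ this is not a norm, and $\|x - y\|_p$ no longer satisfies the triangle inequality; the sum $\sum_{i=1}^k |x_i - y_i|^p$, without the $1/p$ power, still defines a metric.
The preceding transformations can be used to construct similarities.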
Examples: binary vectors
If $S = \{0, 1\}^k \subset \mathbb{R}^k$ the vectors can be thought of as vectors of bits.
The $\ell_1$ distance $\|x - y\|_1$ counts the number of mismatched bits.
This is known as the Hamming distance.
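For bit vectors the two descriptions give the same number, as this small sketch shows. The function names are hypothetical.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a != b for a, b in zip(x, y))

def l1_distance(x, y):
    """Sum of absolute coordinate-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

For `x = [1, 0, 1, 1, 0]` and `y = [1, 1, 0, 1, 0]` both functions return 2, since the vectors differ in exactly two positions.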
Example: Mahalanobis distance
Given $\Sigma_{k \times k}$ that is positive definite, we define the Mahalanobis distance on $\mathbb{R}^k$ by
$$d_\Sigma(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}$$
This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes.
If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^\dagger$, its pseudo-inverse. This yields a pseudo-metric because it fails the test
$$d(x, y) = 0 \iff x = y.$$
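A minimal sketch of the Mahalanobis distance, assuming NumPy. Using `np.linalg.pinv` for both cases is a choice of this sketch: the pseudo-inverse coincides with the ordinary inverse when $\Sigma$ is positive definite. The function name and example matrices are hypothetical.

```python
import numpy as np

def mahalanobis_distance(x, y, sigma):
    """sqrt((x - y)^T Sigma^+ (x - y)), with Sigma^+ the pseudo-inverse of Sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sigma_pinv = np.linalg.pinv(np.asarray(sigma, dtype=float))
    quadratic_form = float(diff @ sigma_pinv @ diff)
    # Clip tiny negative values caused by rounding before taking the root.
    return float(np.sqrt(max(quadratic_form, 0.0)))

stretched = np.diag([4.0, 1.0])   # positive definite
singular = np.diag([1.0, 0.0])    # only non-negative definite
```

With `singular`, the distinct points (0, 1) and (0, 0) are at distance 0, which illustrates why only a pseudo-metric is obtained in that case.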
Example: similarities for binary vectors
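Given binary vectors $x, y \in \{0, 1\}^k$ we can summarize their agreement in a $2 \times 2$ table
$$\begin{array}{c|cc|c}
 & x = 0 & x = 1 & \text{total}_y \\ \hline
y = 0 & f_{00} & f_{01} & f_{0\cdot} \\
y = 1 & f_{10} & f_{11} & f_{1\cdot} \\ \hline
\text{total}_x & f_{\cdot 0} & f_{\cdot 1} & k
\end{array}$$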
Example: similarities for binary vectors
We define the simple matching coefficient (SMC) similarity by
$$SMC(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}$$
The Jaccard coefficient ignores entries where $x_i = y_i = 0$
$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
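Both coefficients follow directly from the four counts. In this sketch the key `(a, b)` counts positions with $y_i = a$ and $x_i = b$; that index order is an assumption, and neither coefficient changes if it is swapped. Raising an error for two all-zero vectors is also a choice of this sketch.

```python
def contingency_counts(x, y):
    """Return a dict f with f[(a, b)] = number of positions where y_i = a and x_i = b."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for xi, yi in zip(x, y):
        counts[(yi, xi)] += 1
    return counts

def simple_matching_coefficient(x, y):
    f = contingency_counts(x, y)
    return (f[(0, 0)] + f[(1, 1)]) / len(x)

def jaccard_coefficient(x, y):
    f = contingency_counts(x, y)
    denominator = f[(0, 1)] + f[(1, 0)] + f[(1, 1)]
    if denominator == 0:
        raise ValueError("Jaccard is undefined when both vectors are all zeros")
    return f[(1, 1)] / denominator
```

For sparse vectors the two can disagree sharply: two vectors that share many zeros but no ones have a high SMC and a Jaccard coefficient of 0.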
Example: cosine similarity & correlation
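Given vectors $x, y \in \mathbb{R}^k$ the cosine similarity is defined as
$$-1 \le \cos(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} \le 1.$$
The correlation between two vectors is defined as
$$-1 \le \mathrm{cor}(x, y) = \frac{\langle x - \bar{x} \cdot \mathbf{1}, \; y - \bar{y} \cdot \mathbf{1} \rangle}{\|x - \bar{x} \cdot \mathbf{1}\|_2 \, \|y - \bar{y} \cdot \mathbf{1}\|_2} \le 1.$$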
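Correlation is the cosine similarity of the centered vectors, which the following standard-library sketch makes explicit. The function names are hypothetical, and the vectors are assumed to be nonzero (and non-constant for the correlation).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """<x, y> / (||x||_2 ||y||_2); assumes neither vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def center(x):
    """Subtract the mean of x from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def correlation(x, y):
    """Cosine similarity after centering; assumes neither vector is constant."""
    return cosine_similarity(center(x), center(y))
```

Shifting a vector by a constant leaves the correlation unchanged but generally changes the cosine similarity: `[1, 2, 3]` and `[11, 12, 13]` have correlation 1 but cosine similarity below 1.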
Example: correlation
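An alternative, perhaps more familiar definition:
$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y}$$
$$S_{xy} = \frac{1}{k - 1} \sum_{i=1}^k (x_i - \bar{x})(y_i - \bar{y})$$
$$S_x^2 = \frac{1}{k - 1} \sum_{i=1}^k (x_i - \bar{x})^2$$
$$S_y^2 = \frac{1}{k - 1} \sum_{i=1}^k (y_i - \bar{y})^2$$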
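To see that this agrees with the inner-product definition of correlation, note that the factors of $1/(k-1)$ cancel. A short worked derivation, using only the symbols defined above:

```latex
\begin{aligned}
\frac{S_{xy}}{S_x S_y}
&= \frac{\frac{1}{k-1} \sum_{i=1}^k (x_i - \bar{x})(y_i - \bar{y})}
        {\sqrt{\frac{1}{k-1} \sum_{i=1}^k (x_i - \bar{x})^2}\,
         \sqrt{\frac{1}{k-1} \sum_{i=1}^k (y_i - \bar{y})^2}} \\
&= \frac{\sum_{i=1}^k (x_i - \bar{x})(y_i - \bar{y})}
        {\sqrt{\sum_{i=1}^k (x_i - \bar{x})^2}\,
         \sqrt{\sum_{i=1}^k (y_i - \bar{y})^2}} \\
&= \frac{\langle x - \bar{x} \cdot \mathbf{1},\; y - \bar{y} \cdot \mathbf{1} \rangle}
        {\|x - \bar{x} \cdot \mathbf{1}\|_2 \, \|y - \bar{y} \cdot \mathbf{1}\|_2}.
\end{aligned}
```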
Correlation & PCA
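The matrix $\frac{1}{n-1} X^T X$ was actually the matrix of pair-wise correlations of the features. Why? How?
$$\frac{1}{n - 1} (X^T X)_{ij} = \frac{1}{n - 1} \left( D^{-1/2} X^T H X D^{-1/2} \right)_{ij}$$
On the right-hand side, $X$ is read as the data matrix before centering and standardizing, and $H = I - \frac{1}{n} \mathbf{1} \mathbf{1}^T$ is the centering matrix.
The diagonal entries of $D$ are the sample variances of each feature.
The inner matrix multiplication computes the pair-wise dot-products of the columns of $HX$:
$$(X^T H X)_{ij} = \sum_{k=1}^n (X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j)$$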
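The identity can be checked numerically against NumPy's own correlation routine. This is a sketch on hypothetical random data; it assumes $H$ is the centering matrix $I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$ and that the variances in $D$ use the $n - 1$ denominator.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical raw data: 8 cases, 3 features with different scales and means.
raw = rng.normal(size=(8, 3)) * np.array([1.0, 5.0, 0.2]) + np.array([0.0, 10.0, -3.0])
n = raw.shape[0]

H = np.eye(n) - np.ones((n, n)) / n        # centering matrix (assumed form)
variances = raw.var(axis=0, ddof=1)        # diagonal entries of D
D_inv_sqrt = np.diag(1.0 / np.sqrt(variances))

# Right-hand side: D^{-1/2} raw^T H raw D^{-1/2} / (n - 1).
correlation_via_H = D_inv_sqrt @ raw.T @ H @ raw @ D_inv_sqrt / (n - 1)

# Left-hand side: X^T X / (n - 1) with X centered and standardized.
X = (H @ raw) @ D_inv_sqrt
correlation_via_X = X.T @ X / (n - 1)
```

Both matrices agree with `np.corrcoef(raw, rowvar=False)` up to rounding, and both have ones on the diagonal.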
[Figures: High positive correlation; High negative correlation; No correlation; Small positive; Small negative]
Combining similarities
In a given data set, each case may have many attributes or features.
Example: see the health data set for HW 1.
To compute similarities of cases, we must pool similarities across features.
In a data set with $M$ different features, we write $x_i = (x_{i1}, \dots, x_{iM})$, with each $x_{im} \in S_m$.
Combining similarities
Given similarities $s_m$ on each $S_m$ we can define an overall similarity between cases $x_i$ and $x_j$ by
$$s(x_i, x_j) = \sum_{m=1}^M w_m s_m(x_{im}, x_{jm})$$
with optional weights $w_m$ for each feature.
Your book modifies this to deal with "asymmetric attributes", i.e. attributes for which Jaccard similarity might be used.
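The weighted sum can be sketched for cases with mixed attribute types. Everything named here is hypothetical, including the three per-feature similarities and the range scaling used for the numeric feature; the sketch does not implement the book's modification for asymmetric attributes.

```python
def combined_similarity(case_a, case_b, similarities, weights=None):
    """Weighted sum over features m of w_m * s_m(a_m, b_m); weights default to 1."""
    if weights is None:
        weights = [1.0] * len(similarities)
    if not (len(case_a) == len(case_b) == len(similarities) == len(weights)):
        raise ValueError("cases, similarities, and weights must have equal length")
    return sum(
        w * s(a, b)
        for w, s, a, b in zip(weights, similarities, case_a, case_b)
    )

def nominal_similarity(a, b):
    return 1.0 if a == b else 0.0

def make_ordinal_similarity(levels):
    rank = {level: position for position, level in enumerate(levels)}
    def ordinal_similarity(a, b):
        return 1.0 - abs(rank[a] - rank[b]) / (len(levels) - 1)
    return ordinal_similarity

def make_interval_similarity(low, high):
    # Scaling by the range (high - low) is an assumption of this sketch.
    def interval_similarity(a, b):
        return 1.0 - abs(a - b) / (high - low)
    return interval_similarity
```

For example, with one nominal, one ordinal (`low`, `medium`, `high`), and one numeric feature on the range 0 to 100, the cases `("F", "low", 30.0)` and `("F", "high", 80.0)` have unweighted similarity 1 + 0 + 0.5 = 1.5.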