Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Maschinelles Lernen II: PCA
Christoph Sawade / Niels Landwehr / Blaine Nelson / Tobias Scheffer
Overview
Principal Component Analysis (PCA)
Optimization problem
Kernel-PCA
Adaptation for high-dimensional data
Fisher Linear Discriminant Analysis
Directions of Maximum Covariance
Canonical Correlation Analysis
Part I: Principal Component Analysis
PCA Motivation
Data compression
Preprocessing (Feature Selection / Noisy Features)
Data visualization
PCA Example
Representation of digits as an $m \times m$ pixel matrix
The actual number of degrees of freedom is significantly smaller, because many features
  are meaningless or
  are composites of several pixels
Goal: reduce to a $d$-dimensional subspace
PCA Example
Representation of faces as an $m \times m$ pixel matrix
The actual number of degrees of freedom is significantly smaller, because many features
  are meaningless or
  are composites of several pixels
Goal: reduce to a $d$-dimensional subspace
PCA Projection
A projection is an idempotent linear transformation.
[Figure: data points $\mathbf{x}_i$ projected onto a direction $\mathbf{u}_1$, giving $y_1(\mathbf{x}) = \mathbf{u}_1^{\mathsf T}\mathbf{x}$]
Center point: $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$
Covariance: $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf T}$
PCA Projection
A projection is an idempotent linear transformation.
Let $\mathbf{u}_1 \in \mathbb{R}^m$ with $\mathbf{u}_1^{\mathsf T}\mathbf{u}_1 = 1$.
$y_1(\mathbf{x}) = \mathbf{u}_1^{\mathsf T}\mathbf{x}$ constitutes a projection onto a one-dimensional subspace.
For the projected data it follows that:
Center (mean): $\overline{y_1(\mathbf{x})} = \mathbf{u}_1^{\mathsf T}\bar{\mathbf{x}}$
Variance: $\frac{1}{n}\sum_{i=1}^{n}\big(\mathbf{u}_1^{\mathsf T}\mathbf{x}_i - \mathbf{u}_1^{\mathsf T}\bar{\mathbf{x}}\big)^2 = \mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}\,\mathbf{u}_1$
PCA Directions of Maximum Variance
Find the direction $\mathbf{u}_1$ that maximizes the projected variance.
Consider a random variable $\mathbf{x} \sim P_X$ (assume zero mean).
The variance of the projection onto a (normalized) $\mathbf{u}_1$ is
$$\mathrm{E}\big[(\mathrm{proj}_{\mathbf{u}_1}\mathbf{x})^2\big] = \mathrm{E}\big[\mathbf{u}_1^{\mathsf T}\mathbf{x}\,\mathbf{x}^{\mathsf T}\mathbf{u}_1\big] = \mathbf{u}_1^{\mathsf T}\,\underbrace{\mathrm{E}\big[\mathbf{x}\mathbf{x}^{\mathsf T}\big]}_{\boldsymbol{\Sigma}_{xx}}\,\mathbf{u}_1$$
The empirical covariance matrix (of centered data) is $\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\mathbf{X}\mathbf{X}^{\mathsf T}$.
How can we find the direction $\mathbf{u}_1$ that maximizes $\mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1$?
How can we kernelize it?
PCA Optimization Problem
Goal: the variance of the projected data, $\mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1$, should not be lost.
Maximize $\mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1$ w.r.t. $\mathbf{u}_1$, such that $\mathbf{u}_1^{\mathsf T}\mathbf{u}_1 = 1$.
Lagrangian: $\mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 + \lambda_1\big(1 - \mathbf{u}_1^{\mathsf T}\mathbf{u}_1\big)$
Taking its derivative and setting it to 0: $\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$
… the solution $\mathbf{u}_1$ must be an eigenvector of $\hat{\boldsymbol{\Sigma}}_{xx}$
… the variance is the corresponding eigenvalue
This reduces PCA to determining the largest eigenvalue.
The eigenvector with the largest eigenvalue is the 1st principal component.
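Spelled out, the stationarity condition comes from differentiating the Lagrangian with respect to $\mathbf{u}_1$ (using the symmetry of $\hat{\boldsymbol{\Sigma}}_{xx}$):
$$\frac{\partial}{\partial \mathbf{u}_1}\Big[\mathbf{u}_1^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 + \lambda_1\big(1 - \mathbf{u}_1^{\mathsf T}\mathbf{u}_1\big)\Big] = 2\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 - 2\lambda_1\mathbf{u}_1 = \mathbf{0} \;\Longleftrightarrow\; \hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$$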
PCA
Projection of $\mathbf{x}$ onto the eigenspace:
$$y_1(\mathbf{x}) = \mathbf{u}_1^{\mathsf T}\mathbf{x} \quad\rightarrow\quad \mathbf{y}(\mathbf{x}) = \mathbf{U}^{\mathsf T}\mathbf{x} \quad\text{with}\quad \mathbf{U} = \begin{pmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_m \end{pmatrix}$$
The eigenvector with the largest eigenvalue is the 1st principal component.
The remaining principal components are orthogonal directions which maximize the residual variance.
$d$ principal components: the eigenvectors of the $d$ largest eigenvalues.
PCA Reverse Projection
Observation: the $\mathbf{u}_j$ form a basis for $\mathbb{R}^m$, and the $y_j(\mathbf{x})$ are the coordinates of $\mathbf{x}$ in that basis.
Data $\mathbf{x}_i$ can thus be reconstructed in that basis:
$$\mathbf{x}_i = \sum_{j=1}^{m}\big(\mathbf{x}_i^{\mathsf T}\mathbf{u}_j\big)\,\mathbf{u}_j \qquad\text{or}\qquad \mathbf{X} = \mathbf{U}\mathbf{U}^{\mathsf T}\mathbf{X}$$
If the data lies (mostly) in a $d$-dimensional principal subspace, we can also reconstruct it there:
$$\tilde{\mathbf{x}}_i = \sum_{j=1}^{d}\big(\mathbf{x}_i^{\mathsf T}\mathbf{u}_j\big)\,\mathbf{u}_j \qquad\text{or}\qquad \tilde{\mathbf{X}} = \mathbf{U}_{1:d}\mathbf{U}_{1:d}^{\mathsf T}\mathbf{X}$$
where $\mathbf{U}_{1:d}$ is the matrix of the first $d$ eigenvectors.
PCA Algorithm
PCA finds the dataset's principal components, which maximize the projected variance.
Algorithm (a NumPy sketch follows below):
1. Compute the data's mean: $\boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$
2. Compute the data's covariance: $\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\mathsf T}$
3. Find the principal axes: $\mathbf{U}, \mathbf{V} = \mathrm{eig}\big(\hat{\boldsymbol{\Sigma}}_{xx}\big)$
4. Project the data onto the first $d$ eigenvectors: $\tilde{\mathbf{x}}_i \leftarrow \mathbf{U}_{1:d}^{\mathsf T}(\mathbf{x}_i - \boldsymbol{\mu})$
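A minimal NumPy sketch of these four steps (assuming the examples are stored as rows of X; the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def pca(X, d):
    """Minimal PCA following the four steps above (a sketch).

    X: data matrix of shape (n, m), one example per row (assumed layout;
    the slides use column vectors x_i). d: number of components to keep."""
    # 1. Mean
    mu = X.mean(axis=0)
    Xc = X - mu
    # 2. Covariance (1/n) * sum_i (x_i - mu)(x_i - mu)^T
    Sigma = Xc.T @ Xc / X.shape[0]
    # 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    U_d = eigvecs[:, order[:d]]        # first d principal axes
    # 4. Project the centered data onto the principal axes
    return Xc @ U_d, mu, U_d

# Usage: Z holds the projected points; the reverse projection from the
# previous slide reconstructs the data in the principal subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z, mu, U_d = pca(X, d=2)
X_rec = Z @ U_d.T + mu
```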
PCA Example
Part II: Kernel PCA
PCA Motivation: high-dimensional data
Computing $d$ eigenvectors for $m$-dimensional data is $O(dm^2)$
Not computable for large $m$
Idea: the data points span a linear subspace of at most $\min(m, n)$ dimensions
Let $\bar{\mathbf{x}} = \mathbf{0}$; then, with the data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ and $\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\mathbf{X}\mathbf{X}^{\mathsf T}$,
$$\hat{\boldsymbol{\Sigma}}_{xx}\mathbf{u}_1 = \lambda_1\mathbf{u}_1 \;\Longleftrightarrow\; \mathbf{X}\mathbf{v}_1 = n\lambda_1\mathbf{u}_1, \quad \mathbf{X}^{\mathsf T}\mathbf{X}\,\mathbf{v}_1 = n\lambda_1\mathbf{v}_1 \qquad\text{with}\quad \mathbf{v}_1 = \mathbf{X}^{\mathsf T}\mathbf{u}_1$$
($\mathbf{X}^{\mathsf T}\mathbf{X}$ is the kernel matrix $\mathbf{K}_{xx}$)
Computation is $O(dn^2)$ instead of $O(dm^2)$.
Both problems have the same $n-1$ eigen-solutions (except for eigenvalues 0): $\mathbf{u}_i = \frac{1}{n\lambda_i}\mathbf{X}\mathbf{v}_i$
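A small numerical sketch of this duality, assuming the data are stored column-wise in X and already centered (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 50                           # many dimensions, few examples
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)          # center so the mean point is 0

# Dual problem on the n x n Gram (kernel) matrix: X^T X v = n*lambda*v
eigvals, V = np.linalg.eigh(X.T @ X)
v1, lam1 = V[:, -1], eigvals[-1] / n        # top eigenpair; lambda_1 of (1/n) X X^T

# Recover the primal eigenvector u_1 = X v_1 / (n*lambda_1), then normalize
u1 = X @ v1 / (n * lam1)
u1 /= np.linalg.norm(u1)
# Check: (1/n) X X^T u1 equals lambda_1 * u1 (up to numerical error)
assert np.allclose(X @ (X.T @ u1) / n, lam1 * u1)
```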
Kernel-PCA
Requirement: the data interact only through inner products
PCA can only capture linear subspaces
More complex features can capture non-linearity
We want to use PCA in high-dimensional feature spaces
Kernel PCA Ring Data Example
Kernel PCA Ring Data Example
PCA fails to capture the data's two-ring structure: the rings are not separated in the first two components.
Kernel-PCA Recap: Kernels
Linear Classifiers:
Often adequate, but not always.
Idea: Data implicitly mapped to
another space, in which they are
linearly classifiable
Feature mapping:
$\mathbf{x} \mapsto \phi(\mathbf{x})$
Associated kernel:
$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\mathsf T}\phi(\mathbf{x}_j)$ (a concrete RBF instance is sketched below)
Kernel = Inner Product =
Similarity of Examples.
[Figure: two classes that are not linearly separable in the input space become linearly separable after the mapping $\phi$]
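As a concrete instance of such a kernel, here is a small NumPy sketch of an RBF kernel (the kind used in the ring-data example later); the function name and the bandwidth parameter gamma are illustrative choices:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel matrix K_ij = exp(-gamma * ||x_i - z_j||^2).

    X: (n, m) and Z: (l, m), one example per row (assumed layout)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```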
Kernel-PCA
For $\sum_{i=1}^{n}\phi(\mathbf{x}_i) = \mathbf{0}$, the eigenvector problem is equivalently transformed:
$$\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i \;\Longleftrightarrow\; \mathbf{K}\boldsymbol{\alpha}_i = n\lambda_i\boldsymbol{\alpha}_i$$
Projection: $y_i(\mathbf{x}) = \phi(\mathbf{x})^{\mathsf T}\mathbf{u}_i = \sum_{j=1}^{n}\alpha_{i,j}\,\kappa(\mathbf{x}, \mathbf{x}_j)$
Alternative derivation via the Mercer map…
Kernel-PCA Algorithm
Kernel-PCA finds the dataset's principal components in an implicitly defined feature space.
Algorithm (a NumPy sketch follows below):
1. Compute the kernel matrix $\mathbf{K}$: $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$
2. Center the kernel matrix:
$$\tilde{\mathbf{K}} = \mathbf{K} - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\mathsf T}\mathbf{K} - \tfrac{1}{n}\mathbf{K}\mathbf{1}\mathbf{1}^{\mathsf T} + \tfrac{\mathbf{1}^{\mathsf T}\mathbf{K}\mathbf{1}}{n^2}\,\mathbf{1}\mathbf{1}^{\mathsf T}$$
3. Find its eigenvectors: $\mathbf{U}, \mathbf{V} = \mathrm{eig}\big(\tilde{\mathbf{K}}\big)$
4. Find the dual vectors: $\boldsymbol{\alpha}_k = \lambda_k^{-1/2}\,\mathbf{u}_k$
5. Project the data onto the subspace:
$$\tilde{\mathbf{x}}_j \leftarrow \Big(\sum_{i=1}^{n}\alpha_{k,i}\,\tilde{K}_{ij}\Big)_{k=1}^{d} = \big(\boldsymbol{\alpha}_k^{\mathsf T}\tilde{\mathbf{K}}_{*,j}\big)_{k=1}^{d}$$
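A minimal NumPy sketch of steps 1–5, taking an already computed kernel matrix as input (e.g. from the rbf_kernel sketch above); names are illustrative, and positive eigenvalues are assumed in step 4:

```python
import numpy as np

def kernel_pca(K, d):
    """Kernel-PCA following the algorithm above (a sketch).

    K: (n, n) kernel matrix with K_ij = kappa(x_i, x_j); d: target dimension.
    Returns the projected training points (n, d) and the dual vectors alpha."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    # 2. Center the kernel matrix
    Kc = K - one @ K - K @ one + one @ K @ one
    # 3. Eigen-decomposition (eigh returns ascending eigenvalues)
    eigvals, U = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:d]
    lam, U_d = eigvals[order], U[:, order]
    # 4. Dual vectors alpha_k = lambda_k^(-1/2) * u_k (assumes lambda_k > 0)
    alphas = U_d / np.sqrt(lam)
    # 5. Projections: row j holds alpha_k^T Kc[:, j] for k = 1..d
    return Kc @ alphas, alphas
```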
Kernel PCA Ring Data Example
Kernel PCA with an RBF kernel does capture the data's structure, and the resulting projections separate the two rings.
Kernel-PCA Mercer Map
Observation: any symmetric matrix $\mathbf{K}$ can be decomposed (eigenvalue decomposition) as
$$\mathbf{K} = \mathbf{U}\mathbf{V}\mathbf{U}^{-1}, \qquad \mathbf{U} = \begin{pmatrix}\mathbf{u}_1 & \cdots & \mathbf{u}_m\end{pmatrix}, \quad \mathbf{V} = \begin{pmatrix}\lambda_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \lambda_m\end{pmatrix}$$
If $\mathbf{K}$ is positive semi-definite, then $\lambda_i \geq 0$ for all $i$.
If the eigenvectors are normalized ($\mathbf{u}_i^{\mathsf T}\mathbf{u}_i = 1$), then $\mathbf{U}^{\mathsf T} = \mathbf{U}^{-1}$.
Kernel-PCA Mercer Map
Using the eigenvalue decomposition (diagonal matrix $\mathbf{V}_{ii} = \lambda_i$, $\mathbf{U}^{\mathsf T} = \mathbf{U}^{-1}$):
$$\mathbf{K} = \mathbf{U}\mathbf{V}\mathbf{U}^{\mathsf T} = \mathbf{U}\mathbf{V}^{1/2}\,\mathbf{V}^{1/2}\mathbf{U}^{\mathsf T} = \mathbf{U}\mathbf{V}^{1/2}\big(\mathbf{U}\mathbf{V}^{1/2}\big)^{\mathsf T} = \boldsymbol{\Phi}(\mathbf{X})^{\mathsf T}\boldsymbol{\Phi}(\mathbf{X}) \quad\text{with}\quad \boldsymbol{\Phi}(\mathbf{X}) := \begin{pmatrix}\phi(\mathbf{x}_1) & \cdots & \phi(\mathbf{x}_n)\end{pmatrix}$$
The explicit feature mapping is given by
$$\mathbf{K}_{x,\mathrm{new}} = \boldsymbol{\Phi}(\mathbf{X})^{\mathsf T}\boldsymbol{\Phi}(\mathbf{X}_{\mathrm{new}}) = \mathbf{U}\mathbf{V}^{1/2}\,\boldsymbol{\Phi}(\mathbf{X}_{\mathrm{new}})$$
$$\boldsymbol{\Phi}(\mathbf{X}_{\mathrm{new}}) = \big(\mathbf{U}\mathbf{V}^{1/2}\big)^{-1}\mathbf{K}_{x,\mathrm{new}} = \mathbf{V}^{-1/2}\mathbf{U}^{\mathsf T}\mathbf{K}_{x,\mathrm{new}}$$
Kernel-PCA Mercer Map
The explicit feature mapping is given by
$$\boldsymbol{\Phi}(\mathbf{X}_{\mathrm{new}}) = \mathbf{V}^{-1/2}\mathbf{U}^{\mathsf T}\mathbf{K}_{x,\mathrm{new}}$$
Observation: reduction to $d$ principal components is equivalent to
$$\boldsymbol{\Phi}_{\mathrm{red}}(\mathbf{X}_{\mathrm{new}}) = \tilde{\mathbf{V}}^{-1/2}\mathbf{U}^{\mathsf T}\mathbf{K}_{x,\mathrm{new}}, \quad\text{where}\quad \tilde{\mathbf{V}} = \mathrm{diag}(\lambda_1, \ldots, \lambda_d, 0, \ldots)$$
(the inverse square root is taken on the non-zero diagonal entries)
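A short NumPy sketch of this empirical (Mercer) feature map, assuming a positive-definite training kernel matrix; the function name and arguments are illustrative:

```python
import numpy as np

def mercer_map(K_train, K_train_new, d=None):
    """Phi(X_new) = V^(-1/2) U^T K_{x,new}, optionally reduced to d components.

    K_train: (n, n) kernel matrix of the training data;
    K_train_new: (n, l) kernel values between training and new points."""
    eigvals, U = np.linalg.eigh(K_train)
    order = np.argsort(eigvals)[::-1]
    if d is not None:
        order = order[:d]                    # keep the d leading components
    lam, U = eigvals[order], U[:, order]
    # Rows of the result are feature dimensions, columns are the new examples
    return (U / np.sqrt(lam)).T @ K_train_new
```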
Part III: Fisher Discriminant Analysis
Fisher-Discriminant Analysis (FDA)
The subspace induced by PCA maximally captures
variance from all data
Not the correct criterion for classification…
[Figure: two-class data in the original space and its projection onto the PCA subspace $\mathbf{X}^{\mathsf T}\mathbf{u}_{PCA}$, where $\boldsymbol{\Sigma}\mathbf{u}_{PCA} = \lambda_{PCA}\mathbf{u}_{PCA}$]
Fisher-Discriminant Analysis (FDA)
Optimization criterion of PCA:
Maximize the data's variance in the subspace:
$$\max_{\mathbf{u}}\ \mathbf{u}^{\mathsf T}\boldsymbol{\Sigma}\mathbf{u}, \quad\text{where } \mathbf{u}^{\mathsf T}\mathbf{u} = 1$$
Optimization criterion of FDA:
Maximize the between-class variance and minimize the within-class variance within the subspace:
$$\max_{\mathbf{u}}\ \frac{\mathbf{u}^{\mathsf T}\boldsymbol{\Sigma}_b\mathbf{u}}{\mathbf{u}^{\mathsf T}\boldsymbol{\Sigma}_w\mathbf{u}}, \quad\text{where}\quad \boldsymbol{\Sigma}_w = \boldsymbol{\Sigma}_{+1} + \boldsymbol{\Sigma}_{-1}\ \ (\text{variance per class}), \qquad \boldsymbol{\Sigma}_b = (\bar{\mathbf{x}}_{+1} - \bar{\mathbf{x}}_{-1})(\bar{\mathbf{x}}_{+1} - \bar{\mathbf{x}}_{-1})^{\mathsf T}$$
Already introduced as a classifier in ML 1 (a sketch follows below).
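For the two-class case the well-known closed-form maximizer of this ratio is $\mathbf{u} \propto \boldsymbol{\Sigma}_w^{-1}(\bar{\mathbf{x}}_{+1} - \bar{\mathbf{x}}_{-1})$; a small NumPy sketch (the row-wise data layout and all names are assumptions):

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Two-class Fisher direction maximizing u^T Sb u / u^T Sw u (a sketch).

    X_pos, X_neg: arrays of shape (n_+, m) and (n_-, m), one example per row."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter Sigma_w = Sigma_{+1} + Sigma_{-1}
    Sw = (np.cov(X_pos, rowvar=False, bias=True)
          + np.cov(X_neg, rowvar=False, bias=True))
    # Closed form: u is proportional to Sw^{-1} (mu_+ - mu_-)
    u = np.linalg.solve(Sw, mu_pos - mu_neg)
    return u / np.linalg.norm(u)
```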
Fisher-Discriminant Analysis (FDA)
Optimization criterion of FDA for $k$ classes:
Maximize the between-class variance and minimize the within-class variance within the subspace:
$$\max_{\mathbf{u}}\ \frac{\mathbf{u}^{\mathsf T}\boldsymbol{\Sigma}_b\mathbf{u}}{\mathbf{u}^{\mathsf T}\boldsymbol{\Sigma}_w\mathbf{u}}, \quad\text{where}\quad \boldsymbol{\Sigma}_w = \boldsymbol{\Sigma}_1 + \cdots + \boldsymbol{\Sigma}_k, \qquad \boldsymbol{\Sigma}_b = \sum_{i=1}^{k} n_i\,(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})^{\mathsf T}$$
($n_i$ = number of samples in class $i$)
This leads to the generalized eigenvalue problem $\boldsymbol{\Sigma}_b\mathbf{u} = \lambda\boldsymbol{\Sigma}_w\mathbf{u}$, which has $k-1$ different solutions (a sketch follows below).
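A sketch of the multi-class case via SciPy's generalized symmetric eigensolver, assuming row-wise data, integer class labels, and a positive-definite $\boldsymbol{\Sigma}_w$ (names illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def fda_directions(X, y, d):
    """Multi-class FDA: solve Sigma_b u = lambda * Sigma_w u (a sketch).

    X: (n, m) data with one example per row; y: (n,) class labels;
    d: number of directions to keep (at most k - 1 are informative)."""
    mu = X.mean(axis=0)
    m = X.shape[1]
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sw += np.cov(Xc, rowvar=False, bias=True)   # per-class covariance
        Sb += Xc.shape[0] * (diff @ diff.T)         # n_i (mu_i - mu)(mu_i - mu)^T
    # Generalized symmetric eigenproblem; eigh requires Sw positive definite
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:d]]
```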
Fisher-Discriminant Analysis (FDA)
The subspace induced by PCA maximally captures
variance from all data
Not the correct criterion for classification…
[Figure: two-class data in the original space, its projection onto the PCA subspace ($\mathbf{X}^{\mathsf T}\mathbf{u}_{PCA}$ with $\boldsymbol{\Sigma}\mathbf{u}_{PCA} = \lambda_{PCA}\mathbf{u}_{PCA}$), and its projection onto the Fisher subspace ($\mathbf{X}^{\mathsf T}\mathbf{u}_{FDA}$ with $\boldsymbol{\Sigma}_b\mathbf{u}_{FDA} = \lambda\,\boldsymbol{\Sigma}_w\mathbf{u}_{FDA}$)]
Part IV: Maximum Covariance Analysis
Maximum Covariance Analysis (MCA) Motivation: Explanatory Directions
Consider paired data $(\mathbf{x}, \mathbf{y})$: input & output, $(\mathbf{x}, \mathbf{y}) \sim P(X, Y)$
Find covariance directions $\mathbf{u}_X \in X$ & $\mathbf{u}_Y \in Y$ s.t. changes along $\mathbf{u}_X$ correspond to changes along $\mathbf{u}_Y$.
Assuming mean-centered data, the covariance of its projections onto (normalized) $\mathbf{u}_X$ & $\mathbf{u}_Y$ is again
$$\mathrm{E}\big[(\mathbf{u}_X^{\mathsf T}\mathbf{x})(\mathbf{u}_Y^{\mathsf T}\mathbf{y})\big] = \mathbf{u}_X^{\mathsf T}\boldsymbol{\Sigma}_{xy}\mathbf{u}_Y$$
Empirical covariance matrix (of centered data): $\hat{\boldsymbol{\Sigma}}_{xy} = \frac{1}{n}\mathbf{X}\mathbf{Y}^{\mathsf T}$
How can we maximize $\mathbf{u}_X^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xy}\mathbf{u}_Y$ for a non-square $\hat{\boldsymbol{\Sigma}}_{xy}$?
How can we kernelize it in the space $X$?
Maximum Covariance Analysis (MCA) Solution: Singular Value Decomposition
MCA can be cast as finding a pair of directions that maximize the covariance:
$$\max_{\|\mathbf{u}_X\| = \|\mathbf{u}_Y\| = 1}\ \tfrac{1}{n}\,\mathbf{u}_X^{\mathsf T}\mathbf{X}\mathbf{Y}^{\mathsf T}\mathbf{u}_Y \;\equiv\; \max_{\|\mathbf{u}_X\| = 1}\ \tfrac{1}{n^2}\,\mathbf{u}_X^{\mathsf T}\big(\mathbf{X}\mathbf{Y}^{\mathsf T}\mathbf{Y}\mathbf{X}^{\mathsf T}\big)\mathbf{u}_X \;\equiv\; \max_{\|\mathbf{u}_Y\| = 1}\ \tfrac{1}{n^2}\,\mathbf{u}_Y^{\mathsf T}\big(\mathbf{Y}\mathbf{X}^{\mathsf T}\mathbf{X}\mathbf{Y}^{\mathsf T}\big)\mathbf{u}_Y$$
Solve for $\mathbf{u}_X$ & $\mathbf{u}_Y$ as eigenvalue problems.
The solution is a Singular Value Decomposition (SVD).
It produces triplets $(\sigma_i, \mathbf{u}_{X,i}, \mathbf{u}_{Y,i})$: a singular value & the corresponding direction pair; $\sigma_i$ is the covariance captured.
In general we get the SVD $\hat{\boldsymbol{\Sigma}}_{xy} = \mathbf{U}_X\mathbf{V}\mathbf{U}_Y^{\mathsf T}$ (a sketch follows below).
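A minimal NumPy sketch of MCA via the SVD of the empirical cross-covariance; the column-wise, mean-centered data layout and the names are assumptions:

```python
import numpy as np

def mca(X, Y, d):
    """Maximum Covariance Analysis via the SVD of Sigma_xy = (1/n) X Y^T (a sketch).

    X: (m_x, n) and Y: (m_y, n), column-wise and mean-centered paired data."""
    n = X.shape[1]
    Sigma_xy = X @ Y.T / n
    U_x, s, U_yT = np.linalg.svd(Sigma_xy, full_matrices=False)
    # d leading direction pairs and the covariance each pair captures
    return U_x[:, :d], U_yT[:d, :].T, s[:d]
```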
Kernelized MCA
MCA can also be kernelized by projecting $\mathbf{x} \mapsto \phi(\mathbf{x})$.
Recall that the eigen-analysis of $\hat{\boldsymbol{\Sigma}}_{xy}\hat{\boldsymbol{\Sigma}}_{xy}^{\mathsf T}$ yields $\mathbf{U}_X$ and that of $\hat{\boldsymbol{\Sigma}}_{xy}^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xy}$ yields $\mathbf{U}_Y$ of the SVD of $\hat{\boldsymbol{\Sigma}}_{xy}$… in fact
$$\hat{\boldsymbol{\Sigma}}_{xy}^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{xy} = \tfrac{1}{n^2}\,\mathbf{Y}\mathbf{K}_{xx}\mathbf{Y}^{\mathsf T}$$
The relationship between $\mathbf{u}_{X,i}$ & $\mathbf{u}_{Y,i}$ is then
$$\mathbf{u}_{X,i} = \tfrac{1}{\sigma_i}\,\hat{\boldsymbol{\Sigma}}_{xy}\mathbf{u}_{Y,i}$$
This gives projections onto $\mathbf{u}_{X,i}$ when $X$ is kernelized:
$$\mathrm{proj}_{\mathbf{u}_{X,k}}\phi(\mathbf{x}) = \sum_{j=1}^{n}\alpha_{k,j}\,\kappa(\mathbf{x}_j, \mathbf{x}), \qquad \boldsymbol{\alpha}_k = \tfrac{1}{n\sigma_k}\,\mathbf{Y}^{\mathsf T}\mathbf{u}_{Y,k}$$
Part V: Canonical Correlation Analysis (CCA)
Canonical Correlation Analysis (CCA) Motivation
We have two different representations of the same data $\mathbf{x}$: $\mathbf{x}_a \leftarrow \psi_a(\mathbf{x})$ & $\mathbf{x}_b \leftarrow \psi_b(\mathbf{x})$
Find directions $\mathbf{u}_a \in X_a$ & $\mathbf{u}_b \in X_b$ s.t. changes along $\mathbf{u}_a$ correspond to changes along $\mathbf{u}_b$: correlated directions
Examples of related datasets:
Climate data: spatial measurements of different quantities (pressure/temperature) may be correlated due to a single underlying phenomenon (El Niño)
Multilingual text: parallel texts written in two different languages that represent the same ideas
Canonical Correlation Analysis (CCA) Example
Climate prediction: researchers have used CCA techniques to find correlations between sea level pressure & sea surface temperature.
Canonical Correlation Analysis (CCA) Motivation
We have two different representations of the same data $\mathbf{x}$: $\mathbf{x}_a \leftarrow \psi_a(\mathbf{x})$ & $\mathbf{x}_b \leftarrow \psi_b(\mathbf{x})$
Find directions $\mathbf{u}_a \in X_a$ & $\mathbf{u}_b \in X_b$ s.t. changes along $\mathbf{u}_a$ correspond to changes along $\mathbf{u}_b$: correlated directions
For mean-centered data, the correlation of its projections onto (normalized) $\mathbf{u}_a$ & $\mathbf{u}_b$ is
$$\rho_{ab} = \frac{\mathrm{E}\big[(\mathbf{u}_a^{\mathsf T}\mathbf{x}_a)(\mathbf{u}_b^{\mathsf T}\mathbf{x}_b)\big]}{\sqrt{\mathrm{E}\big[(\mathbf{u}_a^{\mathsf T}\mathbf{x}_a)(\mathbf{u}_a^{\mathsf T}\mathbf{x}_a)\big]\cdot \mathrm{E}\big[(\mathbf{u}_b^{\mathsf T}\mathbf{x}_b)(\mathbf{u}_b^{\mathsf T}\mathbf{x}_b)\big]}}$$
Canonical Correlation Analysis (CCA) Motivation
We have two different representations of the same data $\mathbf{x}$: $\mathbf{x}_a \leftarrow \psi_a(\mathbf{x})$ & $\mathbf{x}_b \leftarrow \psi_b(\mathbf{x})$
Find directions $\mathbf{u}_a \in X_a$ & $\mathbf{u}_b \in X_b$ s.t. changes along $\mathbf{u}_a$ correspond to changes along $\mathbf{u}_b$: correlated directions
For mean-centered data, the correlation of its projections onto (normalized) $\mathbf{u}_a$ & $\mathbf{u}_b$ is
$$\rho_{ab} = \frac{\mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{ab}\mathbf{u}_b}{\sqrt{\mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{aa}\mathbf{u}_a \cdot \mathbf{u}_b^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{bb}\mathbf{u}_b}}$$
How can we find directions that maximize $\rho_{ab}$?
How can we kernelize it in the spaces $X_a$ & $X_b$?
Canonical Correlation Analysis (CCA) Equivalent Optimization Problem
CCA is equivalent to finding a pair of directions, each with unit projected variance, that maximize the covariance:
$$\max_{\mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{aa}\mathbf{u}_a = \mathbf{u}_b^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{bb}\mathbf{u}_b = 1}\ \mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{ab}\mathbf{u}_b$$
The CCA program has the Lagrangian
$$\mathcal{L} = \mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{ab}\mathbf{u}_b - \frac{\lambda_a}{2}\big(\mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{aa}\mathbf{u}_a - 1\big) - \frac{\lambda_b}{2}\big(\mathbf{u}_b^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{bb}\mathbf{u}_b - 1\big)$$
Canonical Correlation Analysis (CCA) Equivalent Optimization Problem
The CCA program has the Lagrangian
$$\mathcal{L} = \mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{ab}\mathbf{u}_b - \frac{\lambda_a}{2}\big(\mathbf{u}_a^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{aa}\mathbf{u}_a - 1\big) - \frac{\lambda_b}{2}\big(\mathbf{u}_b^{\mathsf T}\hat{\boldsymbol{\Sigma}}_{bb}\mathbf{u}_b - 1\big)$$
Setting the derivatives of $\mathcal{L}$ to 0 gives the conditions
$$\hat{\boldsymbol{\Sigma}}_{ab}\mathbf{u}_b - \lambda_a\hat{\boldsymbol{\Sigma}}_{aa}\mathbf{u}_a = \mathbf{0}, \qquad \hat{\boldsymbol{\Sigma}}_{ba}\mathbf{u}_a - \lambda_b\hat{\boldsymbol{\Sigma}}_{bb}\mathbf{u}_b = \mathbf{0}$$
One can show that $\lambda_a = \lambda_b$, and we thus have
$$\underbrace{\begin{pmatrix}\mathbf{0} & \hat{\boldsymbol{\Sigma}}_{ab}\\ \hat{\boldsymbol{\Sigma}}_{ba} & \mathbf{0}\end{pmatrix}}_{\mathbf{A}}\begin{pmatrix}\mathbf{u}_a\\ \mathbf{u}_b\end{pmatrix} = \lambda\,\underbrace{\begin{pmatrix}\hat{\boldsymbol{\Sigma}}_{aa} & \mathbf{0}\\ \mathbf{0} & \hat{\boldsymbol{\Sigma}}_{bb}\end{pmatrix}}_{\mathbf{B}}\begin{pmatrix}\mathbf{u}_a\\ \mathbf{u}_b\end{pmatrix}, \qquad\text{i.e.}\quad \mathbf{A}\mathbf{w} = \lambda\mathbf{B}\mathbf{w}$$
This is a generalized eigenvalue problem.
Canonical Correlation Analysis (CCA) Solving as a Generalized Eigenvalue Problem
Generalized eigenvalue problem: $\mathbf{A}\mathbf{w} = \lambda\mathbf{B}\mathbf{w}$
Since $\mathbf{B} \succ 0$, we could simply invert it:
$$\mathbf{B}^{-1}\mathbf{A}\mathbf{w} = \lambda\mathbf{w}$$
This is now an ordinary eigenvalue problem… are we done?
Nope… $\mathbf{B}^{-1}\mathbf{A}$ is not symmetric.
Canonical Correlation Analysis (CCA) Solving as a Generalized Eigenvalue Problem
Generalized eigenvalue problem: $\mathbf{A}\mathbf{w} = \lambda\mathbf{B}\mathbf{w}$
If $\mathbf{B} \succ 0$, it can be decomposed as $\mathbf{B} = \mathbf{B}^{1/2}\mathbf{B}^{1/2}$, and by letting $\mathbf{w} = \mathbf{B}^{-1/2}\mathbf{v}$ we obtain the symmetric eigenvalue problem
$$\mathbf{B}^{-1/2}\mathbf{A}\mathbf{B}^{-1/2}\,\mathbf{v} = \lambda\mathbf{v}$$
Solutions $(\lambda_i, \mathbf{v}_i)$ give solutions to the original problem: $\begin{pmatrix}\mathbf{u}_{a,i}\\ \mathbf{u}_{b,i}\end{pmatrix} = \mathbf{w}_i = \mathbf{B}^{-1/2}\mathbf{v}_i$
Each eigenvalue is a correlation coefficient: $\lambda_i \in [-1, 1]$
The directions $\mathbf{w}_i$ are not generally orthogonal; they are only conjugate in the space defined by $\mathbf{B}$ (a sketch follows below).
However, there are $2n$ solutions; why?
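A sketch of CCA as this generalized eigenvalue problem, using SciPy's symmetric-definite solver; the column-wise centered data layout, the names, and positive-definite within-view covariances (no regularization) are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def cca(Xa, Xb, d):
    """CCA via the generalized eigenproblem A w = lambda B w from the slides.

    Xa: (m_a, n), Xb: (m_b, n): column-wise, mean-centered views of the same
    n examples. Returns d direction pairs and their correlations."""
    n = Xa.shape[1]
    Saa, Sbb, Sab = Xa @ Xa.T / n, Xb @ Xb.T / n, Xa @ Xb.T / n
    ma, mb = Saa.shape[0], Sbb.shape[0]
    A = np.block([[np.zeros((ma, ma)), Sab],
                  [Sab.T, np.zeros((mb, mb))]])
    B = np.block([[Saa, np.zeros((ma, mb))],
                  [np.zeros((mb, ma)), Sbb]])
    eigvals, W = eigh(A, B)                   # generalized symmetric solver
    order = np.argsort(eigvals)[::-1][:d]     # largest correlations first
    return W[:ma, order], W[ma:, order], eigvals[order]
```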
Canonical Correlation Analysis (CCA) Kernelizing CCA
CCA is kernelized by taking $\mathbf{u}_a = \mathbf{X}_a\boldsymbol{\alpha}_a$ & $\mathbf{u}_b = \mathbf{X}_b\boldsymbol{\alpha}_b$.
For both representations, build the kernel matrices $\mathbf{K}_a$ & $\mathbf{K}_b$.
We can then replace $\hat{\boldsymbol{\Sigma}}_{aa} = \mathbf{K}_a\mathbf{K}_a$, $\hat{\boldsymbol{\Sigma}}_{bb} = \mathbf{K}_b\mathbf{K}_b$, & $\hat{\boldsymbol{\Sigma}}_{ab} = \mathbf{K}_a\mathbf{K}_b$ in the CCA conditions to arrive at
$$\mathbf{K}_a\mathbf{K}_b\boldsymbol{\alpha}_b - \lambda\,\mathbf{K}_a\mathbf{K}_a\boldsymbol{\alpha}_a = \mathbf{0}, \qquad \mathbf{K}_b\mathbf{K}_a\boldsymbol{\alpha}_a - \lambda\,\mathbf{K}_b\mathbf{K}_b\boldsymbol{\alpha}_b = \mathbf{0}$$
Problem: in high-dimensional spaces where $m \gg n$, one can always find perfect correlations, an example of the curse of dimensionality.
What can we do?
Answer: regularize the directions $\mathbf{u}_a$ and $\mathbf{u}_b$.
The solution is beyond the scope of this lecture, but it is also obtained as a generalized eigenvalue problem.
Summary
Goal: reduction / compression of data into essential
components
Maximization of variance leads to an eigenvalue
problem for principal component analysis (PCA)
Applicable to high-dimensional data and non-linear
components (kernel PCA)
Class-dependent variance minimization leads to
Fisher discriminant analysis (FDA)
Covariance maximization also leads to a singular value decomposition (MCA)
Maximizing the correlation between two different representations leads to canonical correlation analysis (CCA)