Recognition 3
Introduction to Computer Vision, CSE 152, Winter 2013
Lecture 19
• HW4 due on Friday
• Note the Fisherfaces paper on the web page.
Three Levels of Recognition
• Category recognition – near the top of the tree (e.g., vehicles); lots of within-class variability
• Fine-grain recognition – within a category (e.g., species of birds); moderate within-class variation
• Instance recognition (e.g., person identification) – within-class variation is mostly shape articulation, bending, etc.
Linear Subspaces & Linear Projection
• A d-pixel image x∈Rd can be projected to a low-dimensional feature space y∈Rk by
y = Wx where W is a k by d matrix.
• Each training image is projected to the subspace
• Recognition is performed in Rk using, for example, nearest neighbor.
• How do we choose a good W?
Example: Projecting from R3 to R2
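For instance, here is a tiny Matlab illustration of such a projection (this particular W is made up for illustration, not a learned subspace):

W = [1 0 0;
     0 1 0];       % a 2 x 3 projection matrix: keeps the first two coordinates
x = [3; 4; 5];     % a point in R^3
y = W * x;         % y = [3; 4], its projection in R^2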
Eigenfaces: Principal Component Analysis (PCA)
PCA Example
[Figure: a 2D point cloud with mean µ; v1 is the first principal component, the direction of maximum variance, and v2 is the second.]
Eigenfaces Modeling
1. Given a collection of n training images xi, represent each one as a d-dimensional column vector.
2. Compute the mean image and covariance matrix.
3. Compute the k Eigenvectors of the covariance matrix corresponding to the k largest Eigenvalues, and form the matrix WT = [v1, v2, …, vk] (or obtain them directly using the SVD!!). Note that the Eigenvectors are themselves images.
4. Project the training images to the k-dimensional Eigenspace: yi = Wxi.

Recognition
1. Given a test image x, project the vectorized image to the Eigenspace by y = Wx.
2. Classify y against the projected training images (e.g., nearest neighbor).
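A minimal Matlab sketch of these steps (X, xtest, and k are assumed variable names; note the images are mean-centered before projection, a detail the steps above leave implicit):

% X: d x n matrix whose columns are the vectorized training images.
mu = mean(X, 2);                 % mean image (d x 1)
A  = X - mu;                     % mean-centered data
[U, ~, ~] = svd(A, 'econ');      % columns of U: Eigenvectors of the covariance
W  = U(:, 1:k)';                 % k x d projection matrix
Y  = W * A;                      % projected training images (k x n)

% Recognition: project the test image, then nearest neighbor.
y = W * (xtest - mu);
[~, idx] = min(sum((Y - y).^2, 1));   % index of the closest training image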
Why is W a good projection?
• The linear subspace spanned by W maximizes the variance (i.e., the spread) of the projected data.
• W spans a subspace that is the best approximation to the data in a least squares sense; i.e., W is the subspace that minimizes the sum of the squared distances from each datapoint to the subspace.
Eigenfaces: Training Images
[Turk & Pentland '91]
Eigenfaces
[Figure: the mean image and the basis images (Eigenfaces).]
An important footnote: We don't really implement PCA by constructing a covariance matrix!
Why?
1. How big is Σ? It is d by d, where d is the number of pixels in an image!!
2. You only need the first k Eigenvectors.
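Concretely, the fix is to run the SVD on the mean-centered data matrix itself and never form Σ (a sketch; A, n, and k as above):

% A: d x n mean-centered data matrix, with n << d.
% Forming the covariance (1/n)*A*A' would be d x d -- far too big.
[U, S, ~] = svd(A, 'econ');      % 'econ': only the first n columns of U
Wt = U(:, 1:k);                  % the k leading Eigenvectors, as columns
lambda = diag(S).^2 / n;         % the corresponding covariance Eigenvalues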
Singular Value Decomposition
• Any m by n matrix A may be factored such that A = UΣVT
  [m x n] = [m x m][m x n][n x n]
• U: m by m orthogonal matrix – the columns of U are the Eigenvectors of AAT
• V: n by n orthogonal matrix – the columns of V are the Eigenvectors of ATA
• Σ: m by n, diagonal with non-negative entries (σ1, σ2, …, σs), s = min(m, n), called the singular values. The SVD algorithm produces them sorted: σ1 ≥ σ2 ≥ … ≥ σs
• Important property – the singular values are the square roots of the Eigenvalues of both AAT and ATA, and the columns of U and V are the corresponding Eigenvectors!!
SVD Properties
• In Matlab, [u s v] = svd(A), and you can verify that A = u*s*v'.
• r = Rank(A) = # of non-zero singular values.
• U, V give orthonormal bases for the subspaces of A:
  – 1st r columns of U: column space of A
  – Last m − r columns of U: left nullspace of A
  – 1st r columns of V: row space of A
  – Last n − r columns of V: nullspace of A
• For any d ≤ r, the first d columns of U provide the best d-dimensional basis for the columns of A in a least squares sense.
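These properties can be checked numerically in Matlab (a sketch; the rank tolerance 1e-10 is an arbitrary choice):

A = randn(5, 3) * randn(3, 4);            % a 5 x 4 matrix of rank 3
[u, s, v] = svd(A);
norm(A - u*s*v')                          % ~0: the factorization is exact
r = nnz(diag(s) > 1e-10)                  % rank = # of non-zero singular values
% Singular values are the square roots of the Eigenvalues of A'*A
% (abs() guards against tiny negative round-off):
sqrt(sort(abs(eig(A'*A)), 'descend'))     % matches diag(s)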
Distance to Linear Subspace
• A d-pixel image x∈Rd can be projected to a low-dimensional feature space y∈Rk by y = Wx.
• From y∈Rk, the reconstruction of the point in Rd is WTy = WTWx.
• The error of the reconstruction, i.e., the distance from x to the subspace spanned by W, is ||x − WTWx||.
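In Matlab, assuming the rows of W are orthonormal (which holds when W is built from Eigenvectors, since then WWT = I):

y    = W * x;              % project x into the subspace
xhat = W' * y;             % reconstruct the closest point in the subspace
err  = norm(x - xhat);     % distance from x to the subspace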
Fisherfaces: Class specific linear projection P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, PAMI, July 1997, pp. 711--720.
• A d-pixel image x∈Rd can be projected to a low-dimensional feature space y∈Rk by y = Wx, where W is a k by d matrix.
• Recognition is performed using nearest neighbor in Rk.
• How do we choose a good W?
• Eigenfaces (PCA) – maximizes the scatter (spread) of the projected data.
• Fisher's linear discriminant – maximizes a different criterion; we'll see this shortly.
• Note: Let Σ be a scatter matrix.
  – The determinant |Σ| of Σ is an indication of the spread of the data (it is the product of the Eigenvalues of Σ).
  – Let W be a projection matrix; then the scatter of the projected data is WTΣW, and a measure of its spread is |WTΣW|.
PCA & Fisher's Linear Discriminant
• Let χi be the set of images of class i, where
  – c is the number of classes
  – µi is the mean of class χi
  – |χi| is the number of samples in χi
• Between-class scatter: SB = Σi=1..c |χi| (µi − µ)(µi − µ)T
• Within-class scatter: SW = Σi=1..c Σxk∈χi (xk − µi)(xk − µi)T
• Total scatter: ST = Σi=1..c Σxk∈χi (xk − µ)(xk − µ)T = SB + SW
[Figure: two classes χ1 and χ2 with class means µ1, µ2 and overall mean µ.]
If the data points xk are projected by yk = WTxk and the scatter of the xk is S, then the scatter of the projected points yk is WTSW.
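A direct Matlab construction of these scatter matrices (a sketch; X, labels, and c are assumed names, and in pixel space d is usually too large to form these explicitly, which the Fisherfaces slide below works around):

% X: d x N matrix of training images, labels: 1 x N class labels.
d  = size(X, 1);
mu = mean(X, 2);                       % overall mean
SB = zeros(d);  SW = zeros(d);
for i = 1:c
    Xi  = X(:, labels == i);           % samples of class i
    mui = mean(Xi, 2);                 % class mean
    SB  = SB + size(Xi, 2) * ((mui - mu) * (mui - mu)');
    SW  = SW + (Xi - mui) * (Xi - mui)';
end
ST = SB + SW;                          % total scatter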
PCA & Fisher's Linear Discriminant
• PCA (Eigenfaces): maximizes the projected total scatter
  WPCA = argmaxW |WT ST W|
• Fisher's Linear Discriminant: maximizes the ratio of projected between-class scatter to projected within-class scatter
  Wfld = argmaxW |WT SB W| / |WT SW W|
[Figure: two classes χ1 and χ2 with the PCA and FLD projection directions; FLD separates the classes, PCA need not.]
Computing the Fisher Projection Matrix
• The columns wi of Wfld are the generalized Eigenvectors satisfying SB wi = λi SW wi
• The wi are orthonormal
• There are at most c − 1 non-zero generalized Eigenvalues, so m ≤ c − 1
• Can be computed with eig in Matlab
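For example (a sketch, assuming SB, SW, and the number of classes c are already in the workspace):

% Generalized Eigenvalue problem: SB * w = lambda * SW * w.
[Vg, D] = eig(SB, SW);                     % generalized Eigenvectors/values
[~, order] = sort(diag(D), 'descend');     % sort by Eigenvalue
Wfld = Vg(:, order(1:c-1));                % keep at most c-1 directions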
Fisherfaces
WoptT = WfldT WPCAT, where
  WPCA = argmaxW |WT ST W|
  Wfld = argmaxW |WT WPCAT SB WPCA W| / |WT WPCAT SW WPCA W|
• Since SW has rank at most N − c, first project the training set to the subspace spanned by the first N − c principal components of the training set.
• Then apply FLD to the N − c dimensional subspace, yielding a c − 1 dimensional feature space: Rd → RN−c → Rc−1 (see the sketch below).
• Fisher's Linear Discriminant projects away the within-class variation (lighting, expressions) found in the training set.
• Fisher's Linear Discriminant preserves the separability of the classes.
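Putting the pieces together, a Matlab sketch of the two-stage pipeline (X, labels, N, and c are assumed names; SB and SW are built from the PCA-projected data Z exactly as in the scatter-matrix sketch above):

% Fisherfaces: R^d -> R^(N-c) via PCA, then -> R^(c-1) via FLD.
A = X - mean(X, 2);                 % mean-centered training data (d x N)
[U, ~, ~] = svd(A, 'econ');
Wpca = U(:, 1:N-c);                 % first N-c principal components
Z = Wpca' * A;                      % training data in R^(N-c)
% ... build SB and SW from Z and the class labels, as before ...
[Vg, D] = eig(SB, SW);
[~, order] = sort(diag(D), 'descend');
Wfld = Vg(:, order(1:c-1));
Wopt = Wpca * Wfld;                 % d x (c-1) combined projection
Y = Wopt' * A;                      % final c-1 dimensional features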
PCA vs. FLD
Harvard Face Database
• 10 individuals
• 66 images per person, with lighting directions at 15°, 30°, 45°, and 60°
• Train on 6 images at 15°
• Within-class variability: camera position, illumination, internal parameters
Appearance manifold approach (Nayar et al. '96)
For every object:
1. Sample the set of viewing conditions
2. Crop and scale the images to a standard size
3. Use each image as a feature vector
• Apply PCA over all the images and keep the dominant PCs
• The set of views for one object is represented as a manifold in the projected space
• Recognition: what is the nearest manifold for a given test image? (See the sketch below.)
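A sketch of nearest-manifold recognition, approximating each object's manifold by its densely sampled projected views (all variable names here are assumptions):

% Ymanifolds{j}: k x nViews projected views of object j
% (built with the PCA projection W and mean mu from training).
y = W * (xtest - mu);                        % project the test image
best = inf; bestObj = 0;
for j = 1:numObjects
    d2 = sum((Ymanifolds{j} - y).^2, 1);     % distance to each sampled view
    if min(d2) < best
        best = min(d2); bestObj = j;         % nearest manifold so far
    end
end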
Parameterized Eigenspace
Recognition
Bag-of-features models
[Figure: an object image and its corresponding bag of visual 'words'.]
Slides from Svetlana Lazebnik, who borrowed from others
Origin 1: Texture recognition
• Texture is characterized by the repetition of basic elements, or textons
• For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters