Face detection
Slides from: Svetlana Lazebnik, Silvio Savarese, Fei-Fei Li, Derek Hoiem; also from Mor Yakobovits, Roni Karlikar, and David Cohen
Face detection and recognition
Detection Recognition “Sally”
Applications of Face Detection
• Auto-focus in cameras
• Security systems (recognize faces of certain people)
• Human-computer interface
• Marketing systems
• Much more..
Humans also have a tendency to see face patterns even where none really exist.
Faces everywhere
http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html
Difficulties of Face Detection
Building a model for faces is not a simple task:
faces are complex and vary widely from one another.
Faces in images are also affected by the environment.
Face shapes, facial features, skin tone variations…
Difficulties of Face Detection
Face appearance under variation in lighting can change drastically
Difficulties of Face Detection
Scaling and angles
Obstruction of facial features
Difficulties of Face Detection
Facial expressions
Funny Nikon ads
"The Nikon S60 detects up to 12 faces."
Consumer application: Apple iPhoto
• Things iPhoto thinks are faces
Approaches to face Detection
Skin Detection - approaches
Approaches to face Detection
Template Matching flowchart:
Start: skin segmentation.
For each image block: compute the cross-correlation between the block and all scaled average faces.
If max(xcorr) > threshold, a face is found:
Face Loc = Pos(max corr), Face Size = Size(avg face);
blank out the face using the Face Loc & Size just found.
Otherwise: move to the next block.
Repeat until all blocks are done, then stop.
(Inputs: face candidates, average faces.)
Approaches to face Detection
Template Matching
Template Based approaches – Deformable Templates
Approaches to face Detection
Deformable templates: Yuille, Cohen, Hallinan (1989)
Eigenfaces
M.A. Turk and A.P. Pentland:
Eigenfaces for Recognition. Journal of Cognitive
Neuroscience, 3 (1):71--86, 1991.
The Viola/Jones Face Detector
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
Basic idea:
1. Treat pixels as a vector x
2. Recognize a face by its nearest neighbor among the training faces y_1 … y_n:
k* = argmin_k || x − y_k ||
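The nearest-neighbor rule above can be sketched in a few lines of NumPy. This is a toy illustration: the gallery of flattened face vectors and the probe vector are made-up data, not real images.

```python
import numpy as np

def nearest_neighbor_face(x, gallery):
    """Return k* = argmin_k ||x - y_k|| over stored face vectors y_k."""
    # gallery: (n, d) matrix whose rows are flattened training faces y_1..y_n
    dists = np.linalg.norm(gallery - x, axis=1)
    return int(np.argmin(dists))

# toy example: 3 "faces" of 4 pixels each
gallery = np.array([[0., 0., 0., 0.],
                    [10., 10., 10., 10.],
                    [5., 0., 5., 0.]])
print(nearest_neighbor_face(np.array([9., 11., 10., 10.]), gallery))  # -> 1
```

As the next slides explain, this is impractical on raw pixels; eigenfaces first project into a much lower-dimensional face space.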
Eigenfaces
The space of all face images
• Face images as vectors are extremely high-dimensional
– 100x100 image = 10,000 dimensions
– Slow and lots of storage
• But very few 10,000-dimensional vectors are valid face images
• We want to effectively model the subspace of face images
An image is a point in a high-dimensional space
• An N x M image is a point in R^{NM}
• We can define a vector for every image in this space
Face Space
x
• Eigenface idea: construct a low-dimensional linear
subspace that best explains the variation in the set
of face images
Face Space
• Detect new face by measuring distance to Face-Space
Principal Component Analysis (PCA)
Given data, find the best linear representation in a lower dimension.
Find a set of k components (vectors) e_1, e_2, e_3, …
such that a linear combination gives a good approximation for every data point:
p_i ≈ Σ_{j=1}^{k} w_ij e_j
Find e_j and w_ij that minimize:
Σ_i [ p_i − Σ_{j=1}^{k} w_ij e_j ]²
Data set p_i. Pre: k = 0.
Find e_0 that minimizes:
Σ_i [ p_i − e_0 ]²
The minimizer is e_0 = mean(p_i).
Principal Component Analysis (PCA)
e_0; mean-zero data set: p̃_i = p_i − e_0
Principal Component Analysis (PCA)
k = 1
Find e_1 that minimizes:
Σ_i [ p̃_i − Σ_{j=1}^{1} w_ij e_j ]²
Equivalently, e_1 minimizes Σ_i (p_i − e_0 − w_i1 e_1)²
p_i ≈ e_0 + w_i1 e_1,  with  w_i1 = <p̃_i, e_1>
Principal Component Analysis (PCA)
k = 2
Find e_2 that minimizes:
Σ_i [ p̃_i − Σ_{j=1}^{2} w_ij e_j ]²
Equivalently, e_2 minimizes Σ_i (p_i − e_0 − w_i1 e_1 − w_i2 e_2)²
p_i ≈ e_0 + w_i1 e_1 + w_i2 e_2
Principal Component Analysis (PCA)
Find a set of n directions (vectors) e_1, e_2, e_3, … with maximum variance of the points:
var(e) = Σ_i || p̃_i^T e ||² = Σ_i || (p_i − p̄)^T e ||² = Σ_i e^T (p_i − p̄)(p_i − p̄)^T e = e^T C e
where A = [p̃_1 p̃_2 … p̃_n] and C = A A^T is the covariance matrix.
Principal Component Analysis (PCA)
Find a set of n directions (vectors) e_1, e_2, e_3, … with maximum variance of the points:
max_e e^T C e  subject to  ||e||² = 1
The solution satisfies C e = λ e: choose e as the eigenvector with the largest eigenvalue λ.
Create the covariance matrix and diagonalize it using the Singular Value Decomposition (SVD):
C = U D V^T
where D is a diagonal matrix of eigenvalues and U, V are matrices of eigenvectors; here U = V = [e_1 e_2 e_3 …].
Principal Component Analysis (PCA)
Given p_i, i = 1…n, find a set of k directions (vectors) e_1, e_2, e_3, … (in decreasing order of eigenvalues):
C = Σ_i (p_i − p̄)(p_i − p̄)^T
Choose the first k eigenvectors.
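The recipe above is easy to implement numerically; a minimal sketch using NumPy's SVD, with toy random data standing in for flattened face images:

```python
import numpy as np

def eigenfaces(P, k):
    """PCA on flattened images.
    P: (n, d) data matrix, one image per row.
    Returns the mean and the top-k principal components (rows)."""
    mu = P.mean(axis=0)
    A = P - mu                         # mean-centred data, rows are p_i - mean
    # SVD of the centred data: the rows of Vt are the eigenvectors of A^T A,
    # ordered by decreasing variance (singular value)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return mu, Vt[:k]                  # e_1, e_2, ... as rows

# toy data: 5 "images" of 3 pixels each
rng = np.random.default_rng(0)
P = rng.normal(size=(5, 3))
mu, E = eigenfaces(P, 2)
print(E.shape)  # (2, 3); rows are orthonormal
```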
Training images
Eigenfaces
Face images must be aligned (centered and of the same size)
Eigenfaces
Treat pixels as a vector
[247, 249, 251, 249, 249, …]^T = p̃_i
Compute the principal components e_1, e_2, e_3, …
Eigenfaces
Stack the image vectors into a matrix:
[p̃_1 p̃_2 p̃_3 …] = P̃
C = P̃ P̃^T
SVD: C = U D V^T
Eigenfaces example
Choose top eigenvectors: e_1, …, e_k
Mean: µ
Eigenfaces
Eigenfaces Projection
Top 3 eigenvectors: e_1, e_2, e_3; mean: µ
p_i ≈ µ + w_i1 e_1 + w_i2 e_2 + w_i3 e_3
w_ij = <p_i, e_j> = p_i^T e_j
Example reconstruction: face ≈ µ + 0.195·e_1 + 0.046·e_2 + 0.032·e_3
Eigenfaces Projection
[Plot: similarity to the original image as a function of k, the number of eigenfaces]
How many eigenfaces?
Choosing the Dimension K - Example
Choosing the Dimension K
[Plot: eigenvalues λ_i, i = 1 … NM, in decreasing order]
• How many eigenfaces to use?
• Look at the decay of the eigenvalues
– the eigenvalue tells you the amount of variance “in the direction” of that eigenface
– ignore eigenfaces with low variance
Eigenfaces Projection
• We can perform the projection on a new image.
new → projected into face-space
If new ≈ projected, then new is a face.
Eigenfaces Projection
[Face-space spanned by e_1, e_2]
p ≈ Σ_{j=1}^{k} w_j e_j
d = || p − Σ_{j=1}^{k} w_j e_j ||
d < Thresh ⇒ p is a face
d > Thresh ⇒ p is not a face
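The face / non-face decision via distance to face-space can be sketched as follows (a toy illustration with 3-pixel "images"; E holds orthonormal eigenfaces as rows):

```python
import numpy as np

def face_space_distance(p, mu, E):
    """Distance d from p to the subspace spanned by the rows of E (orthonormal)."""
    w = E @ (p - mu)                   # weights w_j = <p - mu, e_j>
    recon = mu + E.T @ w               # projection back into face-space
    return float(np.linalg.norm(p - recon))

# toy face-space: the plane spanned by the first two axes of R^3
mu = np.zeros(3)
E = np.array([[1., 0., 0.],
              [0., 1., 0.]])
print(face_space_distance(np.array([3., 4., 0.]), mu, E))  # 0.0  (lies in face-space)
print(face_space_distance(np.array([0., 0., 5.]), mu, E))  # 5.0  (far from face-space)
```

Comparing d against a threshold then gives the face / non-face decision of the slide.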
http://demonstrations.wolfram.com/FaceRecognitionUsingTheEigenfaceAlgorithm/
Eigenfaces Results
Reconstruction using the eigenfaces
Eigenfaces Issues - 1
Problem: the projection of a new (non-face) image can be near face-space but NOT near any face.
Possible solutions:
• Evaluate the distance to the mean face in face-space,
• Evaluate the distance to the nearest face in face-space,
• Evaluate the size of the weights Σ w_ij.
Problem: the dimension of C is N² x N², where N² is the number of pixels in each image.
This matrix is often too large to be practical.
Eigenfaces Issues -2
C = A A^T. Typically M << N².
Solution: consider the matrix L = A^T A. The dimension of L is M x M, where M is the number of images in the training set.
• If v_i are the eigenvectors of L = A^T A (with eigenvalues λ_i): A^T A v_i = λ_i v_i
• Multiply by A:  A A^T A v_i = A λ_i v_i  ⇒  (A A^T)(A v_i) = λ_i (A v_i)
• Since C = A A^T, the eigenfaces of C are e_i = A v_i (need to normalize): C e_i = λ_i e_i
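The A^T A trick is easy to verify numerically; a sketch with toy data (M = 4 "images" of N² = 100 pixels, already mean-centred):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N2 = 4, 100
A = rng.normal(size=(N2, M))      # columns are mean-centred image vectors

L = A.T @ A                       # small M x M matrix instead of N2 x N2
lam, V = np.linalg.eigh(L)        # L v_i = lambda_i v_i
E = A @ V                         # e_i = A v_i are eigenvectors of C = A A^T
E /= np.linalg.norm(E, axis=0)    # normalize each column

C = A @ A.T                       # the big covariance, built here only to check
print(np.allclose(C @ E, E * lam))  # True: C e_i = lambda_i e_i for every i
```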
• PCA assumes that the data follows a Gaussian
distribution (mean µ, covariance matrix Σ)
The shape of this dataset is not well described by its principal components
Eigenfaces Limitations - I
Eigenfaces Limitations - II
Global appearance method: not robust to
misalignment, background variation
Eigenfaces Limitations - III
• Performance decreases quickly with changes to face size
− Multi-scale eigenspaces.
− Scale input image to multiple sizes.
• Performance decreases with changes to face orientation (but not as fast as with scale changes)
− Plane rotations are easier to handle.
− Out-of-plane rotations are more difficult to handle.
Eigenfaces Limitations - IV
The direction of maximum variance is not
always good for classification
e1
f1
non-faces
faces
e1
f1
Fisherfaces (FLD)
A more discriminative subspace: FLD
• Fisher Linear Discriminants → “Fisher Faces”
• PCA preserves maximum variance
• FLD preserves discrimination
– Find projection that maximizes scatter between classes
and minimizes scatter within classes
Reference: Eigenfaces vs. Fisherfaces, Belheumer et al., PAMI 1997
The Viola/Jones Face Detector
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
Challenges of face detection
• Sliding window detector must evaluate tens of thousands of location/scale combinations
• Faces are rare: 0–10 per image
• For computational efficiency, we should try to spend as little time as possible on the non-face windows
• A megapixel image has ~106 pixels and a comparable number of candidate face locations
• To avoid having a false positive in every image, our false positive rate has to be less than 10-6
• First real-time face detector
• Training is slow, but detection is very fast
• Key ideas
• Integral images for fast feature evaluation
• Boosting for feature selection
• Attentional cascade for fast rejection of non-face windows
The Viola/Jones Face Detector
Image Features
• All faces share some common features:
• The eyes region is darker than the upper-cheeks.
• The nose bridge region is brighter than the eyes.
• Features must be simple (a value) and efficient to compute
• How many features are needed to indicate the existence of a face?
Image Features
“Rectangle filters”
(Haar-like features)
Value =
∑ (pixels in white area) – ∑ (pixels in black area)
= correlation with a mask having +1 in pixels of white areas and −1 in pixels of black areas
Example
Value = 0.001 Value = 10
Rectangle Features (Haar Features)
• Some features correspond to common facial features. Examples:
Basic features vary in scale and orientation.
For a 24x24 detection region, there are 162,336 possible features (over all sizes and positions), all based on the 5 base feature types.
Rectangle Features (Haar Features)
Challenges
1) Feature Computation – as fast as possible
2) Feature Selection – too many features, need to select the most informative ones
3) Real-timeliness – focus mainly on potentially positive image areas (potentially faces)
Fast !!!
• The integral image computes a value at each pixel (x,y) that is the sum of the pixel values above and to the left of (x,y), inclusive
• This can quickly be computed in one pass through the image
(x,y)
Integral Image
i(x,y) is the image; ii(x,y) is its integral image.
s(x,y) is a cumulative row sum: the sum of i(x',y) for x' ≤ x.
Formal definition:
ii(x,y) = Σ_{x'≤x, y'≤y} i(x',y')
Recursive definition (with s(−1,y) = 0 and ii(x,−1) = 0):
s(x,y) = s(x−1,y) + i(x,y)
ii(x,y) = ii(x,y−1) + s(x,y)
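With NumPy the one-pass construction collapses to two cumulative sums; a sketch, using the top two rows of the worked example image from these slides:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img over all rows <= r and columns <= c."""
    # cumulative sum down the rows, then across the columns
    return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

img = np.array([[205, 100,  90,  0],
                [200,  30, 105, 80]])
print(integral_image(img))
# [[205 305 395 395]
#  [405 535 730 810]]
```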
Computing sum within a rectangle
Compute the sum in rectangle D.
a, b, c, d are the corners; A, B, C, D denote the sums of the original image over the four regions, so that:
ii(a) = A
ii(b) = A + B
ii(c) = A + C
ii(d) = A + B + C + D
The sum of original image values within D can therefore be computed as:
sum(D) = ii(d) + ii(a) − ii(b) − ii(c)
Only 3 additions/subtractions are required for any size of rectangle!
Integral image - example
Slides from David Cohen
Original image:
205 100  90   0
200  30 105  80
205 100  90   0
200  30 105  80
Constructing the integral image entry by entry, using s(x,y) = s(x−1,y) + i(x,y) and ii(x,y) = ii(x,y−1) + s(x,y):
s(0,0) = s(−1,0) + i(0,0) = 0 + 205 = 205;   ii(0,0) = ii(0,−1) + s(0,0) = 0 + 205 = 205
s(1,0) = s(0,0) + i(1,0) = 205 + 100 = 305;  ii(1,0) = ii(1,−1) + s(1,0) = 0 + 305 = 305
s(2,0) = s(1,0) + i(2,0) = 305 + 90 = 395;   ii(2,0) = ii(2,−1) + s(2,0) = 0 + 395 = 395
s(3,0) = s(2,0) + i(3,0) = 395 + 0 = 395;    ii(3,0) = ii(3,−1) + s(3,0) = 0 + 395 = 395
s(0,1) = s(−1,1) + i(0,1) = 0 + 200 = 200;   ii(0,1) = ii(0,0) + s(0,1) = 205 + 200 = 405
s(1,1) = s(0,1) + i(1,1) = 200 + 30 = 230;   ii(1,1) = ii(1,0) + s(1,1) = 305 + 230 = 535
s(2,1) = s(1,1) + i(2,1) = 230 + 105 = 335;  ii(2,1) = ii(2,0) + s(2,1) = 395 + 335 = 730
s(3,1) = s(2,1) + i(3,1) = 335 + 80 = 415;   ii(3,1) = ii(3,0) + s(3,1) = 395 + 415 = 810
Continuing over the remaining rows gives the full Integral Image:
205  305  395  395
405  535  730  810
610  840 1125 1205
810 1070 1460 1620
ii(x,y) in the integral image is the sum of all the pixels above and to the left in the original image.
Assume we want to calculate the sum of the pixels in the red area (the bottom-right 2x2 block). With corner regions A, B, C, D as before:
D = ii(d) + ii(a) − ii(b) − ii(c) = 1620 + 535 − 810 − 1070 = 275
Feature Evaluation Using Integral Image
With integral-image values A–F at the six corners of two stacked rectangles (black above white):
Black square = D − B − C + A
White square = F − D − E + C
White − Black = −A + B + 2C − 2D − E + F
so ∑(pixels in white area) − ∑(pixels in black area) needs only corner weights of −1 and +2.
Result: rapid feature evaluation!
Two-, three- and four-rectangle features can be computed with 6, 8 and 9 array accesses respectively.
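Evaluating a two-rectangle feature then takes a handful of integral-image lookups; a sketch (the side-by-side feature layout and the test image are made up for illustration):

```python
import numpy as np

def haar_two_rect(ii, r0, c0, h, w):
    """Two-rectangle Haar feature: white (left half) minus black (right half),
    each h x w, side by side with top-left corner at (r0, c0)."""
    def rs(a, b, c, d):  # inclusive rectangle sum with a virtual zero border
        s = int(ii[c, d])
        if a > 0: s -= int(ii[a - 1, d])
        if b > 0: s -= int(ii[c, b - 1])
        if a > 0 and b > 0: s += int(ii[a - 1, b - 1])
        return s
    white = rs(r0, c0, r0 + h - 1, c0 + w - 1)
    black = rs(r0, c0 + w, r0 + h - 1, c0 + 2 * w - 1)
    return white - black

# bright left half, dark right half: the feature responds strongly
img = np.zeros((4, 4), dtype=np.int64)
img[:, :2] = 255
ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
print(haar_two_rect(ii, 0, 0, 4, 2))  # 4*2*255 = 2040
```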
Challenges
1) Feature Computation – as fast as possible
2) Feature Selection – too many features, need
to select the most informative ones
3) Real-timeliness – focus mainly on potentially
positive image areas (potentially faces)
Feature Selection
• The problem: too many features
– In a 24x24 sub-window there are ~160,000 possible features
– Impractical to evaluate all of the features in every candidate sub-window
• The solution: select the most informative features
AdaBoost Algorithm
• Introduced by Yoav Freund & Robert E. Schapire in 1995
• It is a machine-learning algorithm
• Stands for Adaptive Boost
• AdaBoost is an algorithm for constructing a “strong” classifier as a linear combination of simple “weak” classifiers
• Weak classifier: performs only slightly better than random guessing.
The weak classifiers
Strong classifier (applied to an image window): C(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …, each weak classifier h_t weighted by α_t.
A weak classifier h_j(x) consists of a feature f_j (the value of a rectangle feature), a threshold θ_j, and a parity p_j indicating the direction of the inequality sign:
h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
x is a 24-by-24 sub-window of an image.
The Strong classifier
Ensemble classification function = linear combination of weak classifiers:
C(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ ½ Σ_{t=1}^{T} α_t, and 0 otherwise
where α_t are learned weights.
Adaboost procedure
• Given a training set (x_1, y_1), …, (x_n, y_n)
• y_i ∈ {−1, +1} is the label (correct/incorrect) of each x_i ∈ X
• All examples are initialized with the same weight: w_i = 1/n
• For t = 1, …, T:
• Construct all weak classifiers h_t : X → {−1, +1}
• Choose the weak classifier with minimum weighted error ε_t:
ε_t = Σ_{i=1}^{n} w_i |h_t(x_i) − y_i|
• Update the weights: increase the weight of points that were classified incorrectly by the current weak classifier.
Adaboost procedure
• Compute the final classifier as a linear combination of all weak learners (the weight of each learner is directly proportional to its accuracy):
C(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ ½ Σ_{t=1}^{T} α_t, and 0 otherwise
• The α_t are a function of the point weights w_i
• The α_t are proportional to how “reliable” a weak classifier is.
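A minimal AdaBoost with one-feature threshold stumps illustrates the procedure. This is a toy sketch, not the full Viola-Jones trainer: labels are in {−1, +1} as above, and the data is a made-up 1-D problem.

```python
import numpy as np

def train_adaboost(X, y, T):
    """Toy AdaBoost over threshold stumps h(x) = +1 if p*x_j < p*theta else -1."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                        # uniform initial weights
    stumps = []
    for _ in range(T):
        best = None
        for j in range(d):                         # try every feature,
            for theta in np.unique(X[:, j]):       # threshold, and parity
                for p in (+1, -1):
                    pred = np.where(p * X[:, j] < p * theta, 1, -1)
                    err = w[pred != y].sum()       # weighted error
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, pred)
        err, j, theta, p, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight ~ reliability
        w *= np.exp(-alpha * y * pred)             # boost misclassified points
        w /= w.sum()
        stumps.append((alpha, j, theta, p))
    return stumps

def predict(stumps, X):
    s = sum(a * np.where(p * X[:, j] < p * t, 1, -1) for a, j, t, p in stumps)
    return np.where(s >= 0, 1, -1)

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([-1, -1, 1, 1])
clf = train_adaboost(X, y, 3)
print(predict(clf, X))  # [-1 -1  1  1]
```

In Viola-Jones the "feature" j ranges over the ~160,000 rectangle features, which is exactly why the exhaustive stump search doubles as feature selection.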
Boosting Example
First classifier
First 2 classifiers
First 3 classifiers
Final Classifier learned by Boosting
Boosting for face detection
First two features selected by boosting:
This feature combination can yield 100% detection rate and 50% false positive rate
Boosting vs. SVM
• Advantages of boosting
• Integrates classifier training with feature selection
• Complexity of training is linear instead of quadratic in the number of training examples
• Flexibility in the choice of weak learners, boosting scheme
• Testing is fast
• Easy to implement
• Disadvantages
• Needs many training examples
• Training is slow
• Often doesn’t work as well as SVM (especially for many-class problems)
• A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084
Not good enough!
Receiver operating characteristic (ROC) curve
Boosting for face detection
Challenges
1) Feature Computation – as fast as possible
2) Feature Selection – too many features, need to select the most informative ones
3) Real-timeliness – focus mainly on potentially positive image areas (potentially faces)
Real-timeliness
• On average only 0.01% of all sub-windows in an image are positives (faces)
• Yet we spend equal time on negative and positive windows
Attentional cascade
• Start with simple classifiers which reject many of the negative sub-windows but detect (almost) all positive sub-windows
• Positive response from the first classifier triggers a second (more complex) classifier, and so on
• A negative outcome at any point leads to the immediate rejection of the sub-window
IMAGE SUB-WINDOW → Classifier 1 (T) → Classifier 2 (T) → Classifier 3 (T) → … → FACE
An F (fail) outcome at any classifier immediately rejects the sub-window as NON-FACE.
Cascade classifiers with gradually increased complexity
• Chain classifiers that are progressively more complex and have lower false positive rates:
• Each layer will be a “strong” classifier obtained using AdaBoost
The trade-off between false positives and false negatives is determined by each stage's threshold:
[Receiver operating characteristic: % detection vs. % false positives]
Attentional cascade
• The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages
• A detection rate of 0.9 and a false positive rate on the order of 10-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.9910 ≈ 0.9) and a false positive rate of about 0.30 (0.310 ≈ 6×10-6)
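The stage arithmetic above is easy to check:

```python
# Per-stage rates compound multiplicatively across a 10-stage cascade.
stages = 10
det_per_stage = 0.99
fp_per_stage = 0.30

overall_det = det_per_stage ** stages   # ~0.904
overall_fp = fp_per_stage ** stages     # ~5.9e-6

print(round(overall_det, 3), overall_fp)
```

This is why each stage can afford a very loose false positive rate (30%) as long as it almost never misses a face.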
Cascade - Comparison
Training phase:
Training set (sub-windows) → Integral representation → Feature computation → AdaBoost feature selection → Cascade trainer
Viola & Jones Algorithm - Visualization
Testing phase (classifier cascade framework):
Strong Classifier 1 (cascade stage 1) → Strong Classifier 2 (cascade stage 2) → … → Strong Classifier N (cascade stage N)
A window that passes every stage is reported as FACE IDENTIFIED; a window rejected by any stage is NOT A FACE !!!
• Finding the optimal cascade is not practical.
• Viola & Jones' goal: 95% TP rate, 10-6 FP rate
They suggested an algorithm that:
• does not guarantee optimality, but
• is able to generate a cascade that meets their goal
Training the cascade
– How many layers? (strong classifiers)
– How many features in each layer?
– Threshold of each strong classifier?
• Set target detection and false positive rates for each stage
• Keep adding features to the current stage until its target rates have been met
• Need to lower AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)
• Test on a validation set
• If the overall false positive rate is not low enough, then add another stage
• Use false positives from current stage as the negative training examples for the next stage
Training the cascade
Viola & Jones System
• Tested on the MIT+CMU test set
• Training time: “weeks” on 466 MHz Sun workstation
• 38 layers, total of 6061 features
• 1st classifier layer: 2 features, 50% FP rate, 99.9% TP rate
• 2nd classifier layer: 10 features, 20% FP rate, 99.9% TP rate
• next 2 layers: 25 features each; next 3 layers: 50 features each
• Average of 10 features evaluated per window on test set
• A 384x288 image on a PC (dated 2001) took about 0.067 seconds
• 15 times faster than previous detector (Rowley et al., 1998)
Output of Face Detector on Test Images
Profile Detection
Profile Features
Summary: Viola/Jones detector
• Rectangle features
• Integral images for fast computation
• Boosting for feature selection
• Attentional cascade for fast rejection of negative windows