Viola Jones Simplified
By Eric Gregori
Introduction
Viola Jones refers to a paper written by Paul Viola and
Michael Jones describing a method of machine vision based fast
object detection. This method revolutionized the field of face
detection. Using this method, face detection could be
implemented in embedded devices and detect faces within a
practical amount of time.
In this paper they describe an algorithm that uses a
modified version of the AdaBoost machine learning algorithm to
train a cascade of weak classifiers ( Haar features ). Haar
features ( along with a unique concept, the integral image ) are
used as the weak classifiers. The weak classifiers are combined
using the AdaBoost algorithm to create a strong classifier. The
strong classifiers are combined to create a cascade. The
cascade provides the mechanism to achieve a high detection rate
at a low CPU cycle cost.
“This paper describes a machine learning approach for visual
object detection which is capable of processing images
extremely rapidly and achieving high detection rates. This
work is distinguished by three key contributions. The first
is the introduction of a new image representation called the
“Integral Image” which allows the features used by our detector
to be computed very quickly. The second is a learning
algorithm, based on AdaBoost, which selects a small number
of critical visual features from a larger set and yields
extremely efficient classifiers[6]. The third contribution is
a method for combining increasingly more complex classifiers
in a “cascade” which allows background regions of the
image to be quickly discarded while spending more computation
on promising object-like regions.”
Excerpt from Rapid Object Detection using a Boosted Cascade of Simple Features
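The integral image mentioned in the excerpt can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `integral_image` and `rect_sum` and the tiny 3x3 image are assumptions made for the example. Each entry of the table holds the sum of all pixels above and to the left, so the sum of any rectangle needs only four lookups.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table for a 2-D list of pixels."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus this row so far.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


# Hypothetical 3x3 grayscale image.
img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
total = rect_sum(ii, 0, 0, 3, 3)        # whole image: 45
lower_right = rect_sum(ii, 1, 1, 2, 2)  # 5 + 6 + 8 + 9 = 28
```

The table is built once per image; after that, every rectangle sum costs the same four lookups regardless of the rectangle's size.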
Haar Features
Haar features are one of the mechanisms used by the Viola
Jones algorithm. Images are made up of many pixels. A 250x250
image contains 62500 pixels. Processing images on a pixel by
pixel basis is a very CPU cycle intensive process. In addition,
individual pixel data contains no information about the pixels
around it. Pixel data is absolute, as opposed to relative. A
side effect of the absolute nature of pixel data is sensitivity
to lighting. Since the pixel data is directly affected by the
lighting of the image, large variances can occur in the pixel
data due only to changes in lighting.
Haar features solve both problems related to pixel data (
CPU cycles required and relativity of data ). Haar features do
not encode individual pixel information, they encode relative
pixel information. Haar features provide relative information on
multiple pixels.
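The difference between absolute and relative data can be shown with a small sketch. The pixel lists and the helper name `haar_value` below are made up for illustration. Adding the same brightness offset to every pixel changes each raw pixel value, but leaves the difference between two equally sized areas unchanged. Other lighting changes, such as a change in contrast, are not cancelled this way.

```python
def haar_value(green_pixels, red_pixels):
    """Relative measure: sum under the green area minus sum under the red area."""
    return sum(green_pixels) - sum(red_pixels)


# Hypothetical grayscale pixels under two equally sized areas.
green = [100, 110, 105, 95]
red = [40, 50, 45, 35]

# The same scene with every pixel 50 levels brighter.
brighter_green = [p + 50 for p in green]
brighter_red = [p + 50 for p in red]

value_original = haar_value(green, red)                    # 410 - 170 = 240
value_brighter = haar_value(brighter_green, brighter_red)  # 610 - 370 = 240
```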
Haar features are based on Haar wavelets as proposed by:
S. Mallat. A theory for multiresolution signal decomposition:
The wavelet representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):674-93, July 1989.
Haar features were originally used in the paper:
A General Framework for Object Detection
Constantine P. Papageorgiou, Michael Oren, Tomaso Poggio
Center for Biological and Computational Learning
Artificial Intelligence Laboratory
MIT
Cambridge, MA 02139
{cpapa, oren, tp}@ai.mit.edu
A Haar feature is used to encode both the relative data between
pixels, and the position of that data. A Haar feature consists
of multiple adjacent areas that are subtracted from each other.
Viola and Jones suggested Haar features containing 2, 3, and 4
areas.
Haar Features ( +/- Polarities )
Value = Pixels under Green – pixels under red
The value of a Haar feature is calculated by taking the sum of
the pixels under the green square and subtracting the sum of the
pixels under the red square.
$$\text{value} = \sum \text{pixels}_{\text{green}} - \sum \text{pixels}_{\text{red}}$$
By encoding the difference between two adjoining areas in an
image, the Haar feature is effectively detecting edges. The
further the value is from zero, the harder or more distinct the
edge. A value of zero indicates the two areas are equal, thus
the pixels under the two areas have equal average intensities (
no edge ). It should be noted that although this
process can be done on color images, for the Viola Jones
algorithm this process is done on grayscale images. In most
cases individual pixel values are from 0 to 255, with 0 being
black, and 255 being white.
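The edge-detecting behaviour described above can be sketched with a 2 area feature over a tiny grayscale image. The helper names ( `region_sum`, `two_area_haar` ), the rectangle layout ( x, y, width, height ) and the 4x4 images are assumptions made for this example. Sums are computed directly here for clarity rather than through an integral image.

```python
def region_sum(img, x, y, w, h):
    """Sum of grayscale pixels in the rectangle with top-left corner (x, y)."""
    return sum(img[r][c] for r in range(y, y + h) for c in range(x, x + w))


def two_area_haar(img, green, red):
    """Sum of pixels under the green area minus sum under the red area."""
    return region_sum(img, *green) - region_sum(img, *red)


# Hypothetical 4x4 images: one with a vertical edge, one with no edge.
edge = [[200, 200, 50, 50] for _ in range(4)]
flat = [[120] * 4 for _ in range(4)]

green_rect = (0, 0, 2, 4)  # left half
red_rect = (2, 0, 2, 4)    # right half

edge_value = two_area_haar(edge, green_rect, red_rect)  # 1600 - 400 = 1200
flat_value = two_area_haar(flat, green_rect, red_rect)  # 960 - 960 = 0
```

Swapping the green and red rectangles gives the same magnitude with the opposite sign, which is the polarity shown in the figure.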
Average of pixels     Average of pixels
under Green area      under Red area        Haar Value
125                   250                   -125
125                   225                   -100
125                   200                   -75
125                   175                   -50
125                   150                   -25
125                   125                   0
125                   100                   25
125                   75                    50
125                   50                    75
125                   25                    100
125                   0                     125

Average of pixels     Average of pixels
under Green area      under Red area        Haar Value
250                   250                   0
250                   225                   25
250                   200                   50
250                   175                   75
250                   150                   100
250                   125                   125
250                   100                   150
250                   75                    175
250                   50                    200
250                   25                    225
250                   0                     250
The values calculated using Haar features require one additional
step before being used for object detection. The values must be
converted to true or false results. This is done using
thresholding.
Thresholding
Thresholding is the process of converting an analog value
into a true or false result. In this case, the analog value is
the output of the Haar feature:
$$\text{value} = \sum \text{pixels}_{\text{green}} - \sum \text{pixels}_{\text{red}}$$
To convert the analog value into a true / false statement the
analog value is compared to a threshold. If the value is >= the
threshold, the statement is true; if not, it is false.
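In the notation of the Viola Jones paper:
$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$
Where: $h_j(x)$ - Weak classifier ( basically 1 Haar feature )
$p_j$ - Parity
$f_j(x)$ - Haar feature
$\theta_j$ - Threshold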
As the equation above illustrates, the output of the weak
classifier is either true or false. The threshold determines
when the function transitions the output state. The parity
determines the direction of the inequality sign. This will be
demonstrated in more detail later. The threshold and parity must
be set correctly to get the full benefit of the feature.
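A weak classifier of this kind can be sketched as a single comparison. The function name `weak_classifier` and the sample numbers are hypothetical. The sign convention is also an assumption: it follows the ">= threshold" test used in this text, with a parity of -1 flipping the comparison, rather than the strict inequality printed in the Viola Jones paper.

```python
def weak_classifier(feature_value, threshold, parity=1):
    """Return 1 (true) or 0 (false) for one Haar feature value.

    parity = +1 tests value >= threshold.
    parity = -1 flips the inequality and tests value <= threshold.
    """
    return 1 if parity * feature_value >= parity * threshold else 0


# Hypothetical values, with a threshold of 1000.
above = weak_classifier(1500, 1000)               # 1500 >= 1000 -> 1
below = weak_classifier(400, 1000)                # 400 < 1000   -> 0
flipped = weak_classifier(400, 1000, parity=-1)   # 400 <= 1000  -> 1
```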
Setting the threshold and parity is not clearly defined in
the Viola Jones paper: “For each feature, the weak learner
determines the optimal threshold classification function, such
that the minimum number of examples are misclassified” ( Viola
Jones ). This statement is attributed to anonymous in the paper.
Many methods have been proposed to calculate the threshold:
minimum, average, standard deviation, and average deviation.
The following diagram illustrates the results of those methods.
The data is based on one feature type, in a single known
position. The feature position and type were based on data
from the Viola Jones paper.
Example
As the above text and illustrations show, Viola Jones found that
a 3 area Haar feature across the bridge of the nose provided a
better than average probability of detecting a face.
The eyes are darker than the bridge of the nose.
$$\text{value} = 2\sum \text{pixels}_{\text{green}} - \sum \text{pixels}_{\text{red}}$$
The value increases if the eye area gets darker, or the bridge
of the nose gets brighter. The feature shown above was placed
over the eyes of 313 face images. The values were calculated for
each image and graphed.
Figure 1 - Haar Feature sums over 313 face and 313 non-face images
Each blue point in the blue curve above represents a Haar
feature value ( like the one above ), when placed over the eyes
and nose of one of 313 different faces. The green line
represents the average of those values. The average value is
positive, and well above zero. This indicates that the Haar
feature is over a portion of the image that matches the
feature's characteristics ( in this case, light in the middle
and dark on the sides ).
The red line indicates the exact same feature being placed
in the upper left corner of the image, as shown above. This
represents a non-face image, or random noise. A feature is
weighted on how well it distinguishes between random noise (
non-faces ) and a portion of the face ( eyes/nose in this case ).
The center of the 3 area Haar feature is multiplied by 2, to
balance the 2 negative sides. So the formula for the 3 area Haar
is slightly different than the other Haar types:
$$\text{value} = 2\sum \text{pixels}_{\text{green}} - \sum \text{pixels}_{\text{red}}$$
A 3 area Haar feature over a random noise image results, on
average, in a value of about 0. As you can see from the
graph above, the average over 313 images is about 0.
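The 3 area formula can be sketched as follows. The helper names, the assumption of three equally sized side-by-side areas, and the tiny 2x6 images are made up for illustration. The centre ( green ) area is counted twice so that it balances the two side ( red ) areas, which is why a uniform patch gives exactly 0.

```python
def region_sum(img, x, y, w, h):
    """Sum of grayscale pixels in the rectangle with top-left corner (x, y)."""
    return sum(img[r][c] for r in range(y, y + h) for c in range(x, x + w))


def three_area_haar(img, x, y, area_w, area_h):
    """2 * centre area minus the two side areas, all of equal size."""
    left = region_sum(img, x, y, area_w, area_h)
    center = region_sum(img, x + area_w, y, area_w, area_h)
    right = region_sum(img, x + 2 * area_w, y, area_w, area_h)
    return 2 * center - (left + right)


# Hypothetical 2x6 patches: dark "eyes" on both sides of a bright "nose
# bridge", and a uniform patch with no structure.
eyes_and_nose = [[40, 40, 200, 200, 40, 40] for _ in range(2)]
uniform = [[90] * 6 for _ in range(2)]

face_like = three_area_haar(eyes_and_nose, 0, 0, 2, 2)  # 2*800 - 320 = 1280
no_structure = three_area_haar(uniform, 0, 0, 2, 2)     # 2*360 - 720 = 0
```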
[Figure 1 plot: Haar feature value ( vertical axis, -60000 to 50000 ) versus image number ( horizontal axis ). Series: Face, non-Face, Face Avg, Non-Face Avg.]
Parity
The parity variable is used to adjust the value so that it
is above 0. An inverted feature
$$\text{value} = \sum \text{pixels}_{\text{red}} - 2\sum \text{pixels}_{\text{green}}$$
would present a value that is of the same magnitude with a
different sign. The parity is used to convert the value of an
inverted feature into a positive number, bringing it above the
zero line.
The implementation described in this paper used a slightly
different approach. Instead of using a parity variable, 2 sets
of Haar features were used: inverted and non-inverted. During
the threshold learning process, Haar features were discarded if
their average value over the 313 images was negative. In
summary, both an inverted and a non-inverted Haar feature were
placed in the same location on the image. The Haar feature with
a negative average value over all images was discarded. This was
simpler from an implementation point of view.
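The inverted / non-inverted selection described above can be sketched as follows. The function name `choose_polarity` and the sample values are hypothetical, and keeping the non-inverted feature when the average is exactly 0 is an assumption, since the text does not cover that case.

```python
from statistics import mean


def choose_polarity(values):
    """Keep whichever polarity has a non-negative average over the faces.

    `values` are the non-inverted feature values, one per training face.
    Returns the name of the kept feature and its values.
    """
    if mean(values) >= 0:
        return "non-inverted", list(values)
    # The inverted feature has the same magnitudes with the opposite sign.
    return "inverted", [-v for v in values]


# Hypothetical feature values over four face images (average is -800).
name, kept = choose_polarity([-1200, -800, 300, -1500])
# name == "inverted", kept == [1200, 800, -300, 1500]
```

After this step every kept feature has a non-negative average, so the same ( value >= threshold ) test can be used for all of them.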
Haar feature thresholding
Figure 2 - Various statistical values from data in figure 1
The graph above shows various methods of calculating a
threshold. All methods calculate a threshold based on measuring
the Haar feature values at the same image position, in all 313
images.
The goal is to calculate a single threshold for all 313 images
that maximizes the number of faces detected, and minimizes the
number of non-face detections ( false positives ).
[Figure 2 plot: Haar feature value ( vertical axis, -60000 to 50000 ) versus image number ( horizontal axis ). Series: 24|24, Fdev, Favg, 0|0, Nfavg, Nfdev, Fstd, Nstd.]
Mean Threshold
The mean threshold is the average of the value for all
images containing a face. This is done by summing the Haar
feature values obtained in the same position over the face on
all 313 images, and dividing by the number of images. This is
represented by the green line ( Favg ) in the graph above.
$$\text{Threshold method \#1} = \frac{1}{N}\sum_{i=1}^{N} f_i$$
Where: N = Number of images
i = single image
$f_i$ = single Haar feature value, in the same position on all images.
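The mean threshold can be sketched in a few lines. The function name `mean_threshold` and the five sample values are hypothetical; the real data set in this paper has 313 values.

```python
def mean_threshold(face_values):
    """Average Haar feature value over all face images (Favg)."""
    return sum(face_values) / len(face_values)


# Hypothetical feature values measured at the same position on 5 faces.
face_values = [12000, 25000, 9000, 30000, 14000]

threshold = mean_threshold(face_values)                 # 90000 / 5 = 18000.0
detected = [v for v in face_values if v >= threshold]   # [25000, 30000]
```

Even in this tiny sample only 2 of the 5 faces sit at or above their own average.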
As stated above, the parity determines the direction of the
inequality sign. In this example the parity is set such that the
result is a 1 if the Haar feature value is >= threshold.
With parity set accordingly, everything above or on the green
line ( Favg ) in the graph above will register a true output
from the weak classifier equation. Images that result in a Haar
value >= the green line ( Favg ) are classified as faces. If we
apply the same threshold to the non-face images ( noise ) we can
get an idea of how well the weak classifier works.
Figure 3 - Haar feature over 313 face images
The green shaded box represents values over Favg. This Haar
feature in this position will categorize values greater than or
equal to 17516 as faces. 17516 is the average value for this
Haar feature in this position across all 313 images. Using Favg
for threshold only detects 160 out of the 313 faces ( about
51% ). This makes sense since the threshold is the average value
over all 313 images, so roughly half of the values fall below it.
If the same Haar feature is moved to a position in the image
where it is known there is no face, the following data is
generated.
[Figure 3 plot: Haar feature value ( vertical axis, -15000 to 35000 ) versus image number ( horizontal axis ). Series: Faces 24|24, Faces Fdev, Faces Favg, Faces Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same
feature in the same position. When the Haar feature is
thresholded it produces a value of either 1 or 0. At this
point it becomes a weak classifier.
Figure 4 - Haar feature over 313 non-face images
The above graph shows the results of placing the Haar feature
over the same portion of background in each image. Since the
LFW images use random backgrounds, this results in generally
random data being measured by the feature in this position.
This is backed up by the average being close to 0. Notice some
background image is incorrectly classified as a face. This is
determined by the number of points greater than or equal to the
green line ( Favg for all images, Haar feature over face ).
There were false positives in 11 of the 313 images.
Mean results:
Face          Nonface
160/313       11/313
51%           3.5%
Faces were correctly identified in 160 images, and background
was incorrectly identified as a face in 11 images. Only 1
background position per image was tested.
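Counting detections and false positives for a given threshold can be sketched as below. The function name `evaluate_threshold` and both value lists are hypothetical stand-ins for the 313 face and 313 background measurements.

```python
def evaluate_threshold(face_values, nonface_values, threshold):
    """Count faces detected and background patches misclassified as faces."""
    detected = sum(1 for v in face_values if v >= threshold)
    false_positives = sum(1 for v in nonface_values if v >= threshold)
    return detected, false_positives


# Hypothetical feature values: same feature over faces and over background.
face_values = [21000, 9000, 30000, 16000, 4000, 18000]
nonface_values = [-3000, 500, 19000, -7000, 2000, -100]

# Mean threshold (Favg) computed from the face values only.
threshold = sum(face_values) / len(face_values)  # 98000 / 6, about 16333.3

detected, false_positives = evaluate_threshold(
    face_values, nonface_values, threshold
)
# detected == 3 (21000, 30000, 18000), false_positives == 1 (19000)
```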
[Figure 4 plot: Haar feature value ( vertical axis, -15000 to 35000 ) versus image number ( horizontal axis ). Series: Non-Faces 0|0, Fdev, Favg, Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same feature
in the same position. Notice there are many peaks over
the threshold ( green line ). These peaks represent false
positives: background that the weak classifier mistakenly
classifies as a face.
Problem with using mean threshold
As expected, using the mean value for threshold resulted in
about half of the face images being classified correctly as
faces. This should not be a surprise. The number of false
positives was low ( 11/313 ), but so was the number of faces
classified correctly ( 160/313 ). The data derived from using
the mean value for threshold suggests that the mean value may be
more appropriate as a ceiling for the threshold.
On the other side of the spectrum, the floor for the
threshold would be the minimum of the values across the images.
This would guarantee that all the training images would be
correctly classified as faces. The tradeoff is a high number of
false positives.
As mentioned earlier in this paper, this implementation
uses 2 separate Haar features of different polarities. This
allows the test to always be the same ( value >= threshold ). To
achieve this goal, a Haar feature that is primarily creating
negative values is disposed of ( its Haar feature of opposite
polarity would create primarily positive results and will be
kept ). As a result, minimum values that are negative are
ignored. 0 is the lowest value the threshold can be.
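The minimum threshold with its floor of 0 can be sketched as follows. The function names and the sample values are hypothetical. Because negative minimums are clamped to 0, a face whose feature value is negative is still missed.

```python
def min_threshold(face_values):
    """Minimum feature value over the faces, never lower than 0."""
    return max(0, min(face_values))


def count_at_or_above(values, threshold):
    """Number of values that pass the ( value >= threshold ) test."""
    return sum(1 for v in values if v >= threshold)


# Hypothetical feature values over faces and over background.
face_values = [21000, -500, 9000, 300, 16000]
nonface_values = [-3000, 500, 19000, -7000, 0]

threshold = min_threshold(face_values)                          # min is -500 -> 0
faces_detected = count_at_or_above(face_values, threshold)      # 4 of 5
false_positives = count_at_or_above(nonface_values, threshold)  # 3 of 5
```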
Figure 5 - Threshold = Min ( 0 ) Haar on faces
[Figure 5 plot: Haar feature value ( vertical axis, -15000 to 35000 ) versus image number ( horizontal axis ). Series: Faces 24|24, Fdev, Favg, Fstd.]
Notice that many more faces are detected using the min, or 0,
as the threshold. The specific Haar feature, at the specific
position over the faces in the images, described by the above
graph yields 302/313 faces detected correctly. Classifiers are
ranked not only by the number of faces correctly classified, but
also by how well they filter noise by NOT classifying noise as
faces.
Figure 6 - Threshold = Min ( 0 ) Haar on background ( noise )
Using the minimum ( from the faces graph, shown in blue ) as a
threshold, the above graph illustrates values from the same Haar
feature over a portion of the background ( representing noise ).
There are MANY false positives in the above graph when the
threshold is 0. The exact number is 171 false positives.
Min results:
Face          Nonface
301/313       171/313
96%           55%
Although the minimum method detects more faces than the mean
threshold, it also creates significantly more false positives.
If the mean threshold represents the ceiling, the minimum
threshold represents the floor.
[Figure 6 plot: Haar feature value ( vertical axis, -15000 to 35000 ) versus image number ( horizontal axis ). Series: Non-Faces 0|0, Fdev, Favg, Fstd.]
The ideal threshold is between the Min and the Average
Figure 7 - Positives - False Positives
The ideal threshold is windowed by the minimum at the
bottom, and the average at the top. In the graph above, the blue
line represents the number of faces correctly detected. The red
line indicates the number of faces incorrectly identified in
random non-face images ( noise ), that is, the false
positives. The ideal threshold would detect 100% of the faces,
with no false positives. The green line represents the
difference between the blue and red lines ( positives - false
positives ). The point where the green line peaks provides the
best face detection with the least number of false positives.
This is the closest to ideal we can get.
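Locating the peak of the green line can be sketched as a sweep over candidate thresholds between the floor ( minimum, clamped to 0 ) and the ceiling ( mean ). The function name `best_threshold`, the number of steps and the sample values are assumptions made for this example, not the exact procedure used to produce the graph.

```python
def best_threshold(face_values, nonface_values, steps=100):
    """Return (threshold, score) maximizing positives minus false positives."""
    low = max(0, min(face_values))               # floor: clamped minimum
    high = sum(face_values) / len(face_values)   # ceiling: mean ( Favg )

    def score(threshold):
        positives = sum(1 for v in face_values if v >= threshold)
        false_positives = sum(1 for v in nonface_values if v >= threshold)
        return positives - false_positives

    # Evenly spaced candidate thresholds from the floor to the ceiling.
    candidates = [low + (high - low) * k / steps for k in range(steps + 1)]
    best = max(candidates, key=score)            # first candidate at the peak
    return best, score(best)


# Hypothetical feature values over faces and over background.
face_values = [2000, 8000, 12000, 20000, 25000]
nonface_values = [-4000, 1000, 3000, 6000, -2000]

threshold, best_score = best_threshold(face_values, nonface_values)
# The peak score is 4: four faces detected with no false positives.
```

When several candidates share the peak score, `max` returns the lowest of them, which favours detecting more faces.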
The blue arrow is pointing to the peak of the green line.
This is our desired threshold. To find this peak, a simple max
operation was applied to the difference between the blue and red