Cognitive Vision Model Gauging Interest in Advertising Hoardings

Saad Choudri

MSc Cognitive Systems
Session 2005/2006

The candidate confirms that the work submitted is his own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student) ___________________________________
For the late Vice Admiral HMS Choudri (H.Pk; MBE), my grandfather, friend and mentor.
Acknowledgments At the outset, I would like to acknowledge the support of my parents Rishad and Samar Choudri
who encouraged and sponsored my MSc Cognitive Systems. My other family members and friends
who supported me considerably include Sana Lutfullah, Rayika, Zayd and Khizar Diwan to whom I
am thankful.
Prof. David Hogg (Supervisor) and Dr. Andy Bulpit (Assessor), with their unmatched experience in
Computer Vision, fuelled the "from idea to project" concept and helped me refine the conception of
this project.
Prof. David Hogg supported this project from the very start and most of all encouraged a self-driven
approach which in itself had immense advantages including novelty. Dr. Hannah Dee (stand-in
Supervisor) played a vital role in his absence to steer the report toward its current state and make me
think “Japanese garden” vs. “English garden”.
I am very grateful to Eric Atwell for giving me an edge for the road ahead and to Katja Markert,
Prof. Tony Cohn, Prof. Ken Brodlie, Martyn Clark and Tony Jenkins for various roles they played.
I would also like to thank all members of [email protected] including Savio Pirondini,
Graham Hardman and Pritpal Rehal for none other than their support.
Special thanks to Simon Baker and Ralph Gross at Carnegie Mellon University for making available
the Pose Illumination and Expression (PIE) database of 41368 images.
Thanks to Lee Kitching and Khurram Ahmed for participating in a few evaluation videos. Mention
must also be made of Arnold, the silverfish in my room who, if he could, would gladly have eaten
this report.
Last but not least, I would like to thank administration members of the School of Computing,
especially Mrs. Irene Rudling, Yasmeen, Judy and Teresa for, among other assistance, helping me
locate Prof. Hogg for my million plus questions.
Abstract
In order to gauge pedestrian interest in advertising hoardings, a gaze or head pose estimation system
is required. Proposed here is a novel "spirit-level" approach to head pose estimation where heads
may be as small as 20 pixels high and lack detail. This adds to the few existing models that deal
with such head sizes. The heads are found using a Viola-Jones model for face detection. A binary
feature vector, drawn horizontally from approximately the centre of the eyes and the tip of the nose,
consists of skin pixels as a bubble against non-skin pixels. The feature vector is classified by a
support vector machine classifier, previously trained and containing a number of support vectors for
three generalised head poses: left, right and centre. Designed specifically to gauge interest in
advertisement hoardings, the model is complemented with a regional interest gauge to measure
interest in specific objects. A hoarding is divided into nine regions and five discretised face kernel
templates are used to classify the region of interest, in a combination of two or more classifications.
The "spirit-level" system performs on par with other existing systems, at 89% accuracy.
The image set selection from the PIE database was made on the grounds of what the face and eye
detection component detected. In order to test that component, a test set was required in which the
background was uniform and the angle sweep was at shorter intervals. The PIE database does not fit
this requirement, so a floor plan as shown in Figure 3.4 was devised. A single subject was
photographed in up to 70 different face poses. The subject positioned his face from position 1 to 5
with the face tilt at 0° (i.e. straight on), 24° (upward) and then at -24° (downward). Camera A was
positioned approximately 12° above the subject's eye line, and camera B roughly 12° below it. In
one round with both cameras and all three face tilts, 30 images were taken with the subject not
looking at the camera.
Figure 3.4 Each marking on the wall, from 1 to 9, is approximately 11.25° apart. There are two camera
positions above and below the subject, A (12° above) and B (12° below).
The same was repeated with the subject's eyes fixed on the lens and the face poses as before, so that
the images could be added to the PIE database if need be. With a few extra free-look sweeps
(Appendix E), approximately 70 images were acquired. To get from position 5 to 9, a horizontal
image transformation, inspired by [46], was applied to all images of face poses 1 to 4. A total of
126 images were therefore obtained.
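A minimal sketch of such a transformation, reading it as a left-right mirror about the vertical axis (the project itself was developed in Matlab; the Python below and the file names are purely illustrative):

    from PIL import Image, ImageOps

    def mirror_pose(in_path, out_path):
        # Flip horizontally: a pose on one side of the sweep becomes the
        # corresponding pose on the other side of centre position 5.
        ImageOps.mirror(Image.open(in_path)).save(out_path)

    for k in range(1, 5):                         # poses 1-4 map to positions 9-6
        mirror_pose("pose_%d.jpg" % k, "pose_%d.jpg" % (10 - k))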
3.3 Component Analysis and Testing

MPT includes a face detector (mpiSearch), an eye detector (eyefinder), a blink detector and a colour
tracker. The component tested here is the face detector, for the reasons already discussed.
Informal experiments to verify what the literature review stated about mpiSearch showed that,
contrary to [26], this version of the MPT does have trouble with 16x12 pixel faces. However, faces
20 pixels high can easily be detected by mpiSearch, so it could still be used for this project.
Informally, it was also found that while the eye detector of the MPT, "eyefinder", is slow, it works in
real time within areas dictated by the mpiSearch face coordinates.
MpiSearch detected the generalised subset of images shown in Figure 3.5. Only 29 of the 126
images, with head poses between wall positions 3 and 7 (Figure 3.4), were detected. This result was
obtained using the AdaBoost algorithm built into mpiSearch, without which fewer faces were
detected. The face tilts detected were between 5° and -5° of camera level. However, informal
experiments with the "eyefinder" suggested that its face finding capabilities were better than
mpiSearch's, though it is much slower, as discussed in Section 2.2.3. Attempts to find out from the
Machine Perception Laboratory exactly why this difference exists were unsuccessful. Further
experiments also showed that mpiSearch focuses on the Viola-Jones approach [28] for frontal faces
rather than on the other rotations and angles discussed in [27].
Even though only frontal faces were being detected, mpiSearch was used to begin the project, since
the limited face-to-torso and eye movement flexibility that allows a subject to gaze at an advertising
hoarding usually means a near frontal-face view. This was learnt from studying how people look at
hoardings while walking by, and the observation was backed by a video analysis; Appendix E has
video clips that provide evidence of this. Its limitation to faces at least 20 pixels high was also
incorporated into the assumption that the number of cameras, from 1 to N, would be chosen
according to the size of the area to be covered and the distance from the ground. As this was the
most important component in discovering a solution to the problem of DoG, it had to be tested in
advance. Functions used later, such as the SVM and a blob finding algorithm, were instead tested
during development and accepted simply on the basis that they worked and gave the desired results.
Figure 3.5 A subset of the images against a cluttered background showing angles detected by mpiSearch.
3.4 Testing and Training Data
The previous section suggests that poses C05, C27 and C29 from Figure 3.1 be selected for the
subject looking left, centre ("straight-on") and right respectively. Another deciding factor for this set
was that images of pose C05 have a skin-coloured background: for a skin based approach, this
obstacle can be an opportunity to make the algorithm invariant to background colour. The images
were selected on the basis of variation in the subjects' physical anthropology, appearance,
expression and illumination. The training data of 73 images, with the breakdown shown in Figure
F.1 in Appendix F, was selected to allow variation and avoid over-fitting. Some images have
repeated subjects with and without spectacles. This data set was used for the feature based and skin
based prototypes described in Chapters 4 and 5 respectively. The test data of 133 images was
selected to include unseen poses not present in the training data. The "looking" images have a few
repeated subjects, to test whether classifiers are biased towards seen data. The "looking" and
"not-looking" test sets are shown as Figure F.2 and Figure F.3 in Appendix F.
Cross-referencing face pose with objects in the billboard or hoarding also required a dataset, since
the dataset for the minimum requirements could only fall under the categories of "looking" and "not
looking" rather than, for example, "top-left corner". For this purpose a training and test set was
selected of poses C05 (left), C27 (centre), C29 (right), C09 (down) and C07 (up). Between 10 and
15 subjects of each pose were selected for training and 5 of each for testing. This dataset was
carefully selected to maximise variation in subject appearance and physical anthropology. It is
shown as Figure F.4 and Figure F.5 in Appendix F.
Chapter 4: Experimental Feature Based Model

This chapter describes the feature based prototype and the versions of it developed to satisfy the minimum requirements.
4.1 Plan and Architecture
The literature reviewed in Section 2.2 offers some interesting and novel techniques for approaching
this problem. Some cannot be applied here, while others give insight into possibilities. Two main
approaches can be chalked out: a feature based approach and a skin based approach. The first step
taken was to adopt a feature based approach. This was the simplest way to begin and allowed a
detailed study of the human face to be conducted, letting ideas emerge. It also avoided problems of
segmentation, given that, as Robertson et al. claim in [25], skin cannot be represented in any colour
space. Figure 4.1 illustrates this first prototype, which caters to the minimum requirements. Section
4.1.1 describes stages 1 and 2 of the diagram, and Section 4.1.2 describes stage 3 and how the
various machine learning techniques chosen in Section 2.3 were experimented with. They are
evaluated in Section 4.2.
Figure 4.1 Outline of the feature based prototype. All the versions share this architecture. Stage 1 involved the
expansion of face coordinates found by mpiSearch and image processing for eyefinder. The DoG component
first extracted 13 features in stage 2 and a number of classifiers were used in stage 3.
[Figure 4.1 diagram: INPUT → MPT:mpiSearch (expanded face region, stage 1) → MPT:eyefinder (eye coordinates) → feature classification (stages 2-3) → OUTPUT, "Looking? Yes/No".]
4.1.1 Integrating MPT and Feature Extraction
Figure 4.2 The 13 picture coordinates with respect to the x and y axes of the face box. A1 = centre of the eye
plane on x (a.k.a. NCX) and y (NTY). A2 = mouth or upper lip on y (NBY). B3 = subject's right eye (given by
eyefinder) on x (REX) and y (REY). B4 = subject's left eye on x (LEX) and y (LEY). C is the face box drawn
from coordinates returned by mpiSearch. E6 (RECY) and E5 (RECX) are the intersections of the right eye
ordinates, and D8 (LECX) and D9 (LECY) of the left eye ordinates, with the contour ordinates on x and y.
F9 (FCY) and F10 (FCX) are approximate ordinates of the centre of the face.
As explained in Section 2.2, a picture based coordinate system is used rather than a "world" based
system. The first step taken was to try to adapt Gee and Cipolla's work in [20] by using the eye
points returned by eyefinder as a starting point to identify the tilt of the face. This was stopped in its
initial stages, and a new approach was required. The line running vertically down the face in
Figures 4.2 and 4.3 represents this implementation, and Figure 4.3 shows why it could not be taken
further. In the image on the extreme right, the subject's right eye is detected much further down
than it actually is, which changes the angle of the constructed tilt line. This happened often, and
because of the eyefinder's inconsistency the line sometimes showed a completely opposite tilt.
Figure 4.3 The images show how face tilts produce an angle against a possible normal drawn
parallel to the y-axis of the face box.
It was not possible to use eyeball based systems, but it was possible to use the location of the centre
of the eye axis on the x axis. Figure 4.2 illustrates the thirteen coordinates that were introduced over
a period of time; these are the building blocks for this prototype and its versions. This seemed an
intuitive step to take and brought several issues to light. One of these was that the face detector
always centred the face bounding box on the eyes. Because of this, NCX, for example, was often the
same regardless of where the person was looking. To cater to this, a novel medoid contouring
approach was applied. As shown in Figure 4.2 (right), this resulted in six features. By taking the
intersection of the eyes and the contour on the axes of the face bounding box, it was possible to
determine where in the now enlarged bounding box the face was. LECX (read as
Left-Eye-Contour-X axis), LECY, RECX and RECY were obtained in this manner. Observation
also showed that the face has a larger number of distinct isosurfaces than the rest of the image, and
would therefore have more isolines or contours. Taking a median of the ordinates of these contours
gives an approximate centre for the face; FCY and FCX are obtained in this way. A centroid, i.e.
using the mean, would be susceptible to outliers. The rest of the features are self-explanatory from
Figure 4.2.
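A minimal sketch of this contour-median idea, with scikit-image's find_contours standing in for the contouring actually used, and the number of contour levels an illustrative assumption:

    import numpy as np
    from skimage import measure

    def face_centre(grey_face, n_levels=8):
        # Collect isoline points over several grey levels; the face region,
        # having more contours, dominates, so the median of the ordinates
        # (robust to outliers, unlike a mean centroid) approximates the
        # face centre (FCX, FCY).
        lo, hi = float(grey_face.min()), float(grey_face.max())
        points = []
        for level in np.linspace(lo, hi, n_levels + 2)[1:-1]:
            points.extend(measure.find_contours(grey_face, level))
        pts = np.vstack(points)            # rows of (y, x) contour coordinates
        fcy, fcx = np.median(pts, axis=0)
        return fcx, fcy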
In this prototype, to speed up processing, mpiSearch defined the region in which eyefinder searched.
This was settled on after various attempts to speed up processing, and it worked well until major
problems were identified; Figure 4.4 illustrates them. To get around these problems the face box
was expanded with a growth procedure.
Figure 4.4 Extreme right: eyefinder result within the original face box. Centre right: eyefinder within grown
face box parameters (grown box not shown). Centre left: best possible face growth of 1 to 20 pixels on either
side, depending on the image dimensions. Extreme left: face box and eyes drawn only from the eyefinder,
showing that "centre left" is a near approximation and that the eyefinder's face is a closer fit than that found by
mpiSearch (extreme right). Source [49]
The growth procedure was checked on a number of images and then implemented. It grew the face
box 20 pixels in each direction, or until the image dimensions were reached, and this catered to the
problem to a large extent. Another problem the growth procedure addressed was that of not
detecting eyes at all. Occasionally the eyes were not detected because the eyefinder typically finds a
face first and then the eyes: given a region that is already a tightly cropped face, it cannot find a face
within a face, and so it detects nothing, or sometimes only one eye.
4.1.2 Classification
Training and testing were done in different stages. During training, all the parameters were extracted
from the training set and used to build a regression tree, shown in Figure 4.5. The results were
interesting, and a functional model was created that caters to the minimum requirements. An SVM
was then used with the same parameters in place of the regression tree; the evaluation in Section 4.2
describes the results. Each classification method was tried with the parameters in combination, and
single parameters were also used to classify all the images on their own.
Figure 4.5 Regression tree of the top performing tree based system with 63% accuracy. 121=Yes, 110=No.
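As an illustration of this stage, a sketch of the training step with scikit-learn stand-ins for the Matlab classifiers actually used; the file names, and the use of a decision tree classifier in place of the regression tree, are assumptions:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # One row of the 13 picture coordinates (NBY ... FCY) per training
    # image; labels are 1 for "looking" and 0 for "not looking".
    X = np.load("train_features.npy")           # hypothetical file, shape (n, 13)
    y = np.load("train_labels.npy")

    svm = SVC(kernel="linear").fit(X, y)
    tree = DecisionTreeClassifier().fit(X, y)   # stand-in for the regression tree

    X_test = np.load("test_features.npy")       # hypothetical file
    print(svm.predict(X_test))
    print(tree.predict(X_test))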
Since no available Normal distribution function matched the current need, one was implemented,
but only after checking whether it would be suitable to employ, having already encountered the
problem of the face box being centred on the eyes. For this check a non-parametric density model,
the histogram, was used (a subset is shown in Appendix G, Figures G.1 to G.6). This was inadequate
because histograms do not provide a smooth estimate [32], so the mean and standard deviation of
each parameter were plotted instead (Figure 4.6). As shown, the ranges suggested that a Normal
(Gaussian) distribution could be fitted, and so it was. The training values of each parameter were
used to compute the mean and standard deviation for the Z-score formula for the "not looking" and
"looking" curves, giving 26 Gaussians in total. A Z-score was computed for each of the 13 features
and looked up in an "areas under the standard normal probability distribution" table implemented
from [36]. A new vector was created that took a 1 if the feature probability for "yes" was higher, or
a 0 if "no" was higher; the number 2 was assigned for equal probabilities. If there were more 0's in
this vector than 1's the face was classified as "not looking", otherwise as "looking". This offered a
means to further tweak those parameters that caused problems: since they were represented by 0's
and 1's, those which said 1 when they should have said 0 were removed.
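A sketch of this Gaussian voting scheme, with SciPy densities compared directly rather than looked up in a printed Z-table (which gives the same comparison):

    import numpy as np
    from scipy.stats import norm

    def fit_gaussians(X_yes, X_no):
        # One (mean, std) pair per feature and per class:
        # 13 features x 2 classes = 26 Gaussians in total.
        return (X_yes.mean(0), X_yes.std(0)), (X_no.mean(0), X_no.std(0))

    def classify(x, g_yes, g_no):
        p_yes = norm.pdf(x, g_yes[0], g_yes[1])  # per-feature density, "looking"
        p_no = norm.pdf(x, g_no[0], g_no[1])     # ... and "not looking"
        votes = np.where(p_yes > p_no, 1, np.where(p_no > p_yes, 0, 2))
        # More 0's than 1's means "not looking"; ties vote 2 and carry no weight.
        return "not looking" if (votes == 0).sum() > (votes == 1).sum() else "looking"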
Figure 4.6 Parameters plotted as ranges from μ to ±1σ. Green = looking, red = not looking. [Plotted entries: NCX-FCX, NCY-FCY, Combined, and the 13 single features NBY, NTY, NCX, LEX, LEY, REX, REY, LECX, LECY, RECX, RECY, FCX, FCY; the numbered key is not recoverable.]

Figure 4.8 All features using the regression tree. [Chart values not recoverable.]

Figure 4.9 All features using the SVM. Per-feature results (%):

Feature    NBY  NTY  NCX  LEX  LEY  REX  REY  LECX  LECY  RECX  RECY  FCX  FCY
Yes         61   48   48   61   64   63   60    53    55    77    22   69   59
Accuracy    59   55   43   61   58   60   55    53    51    64    44   63   59
No          56   63   40   62   50   58   56    54    46    48    71   56   58

Figure 4.10 All features using Gaussians. Per-feature results (%):

Feature    NBY  NTY  NCX  LEX  LEY  REX  REY  LECX  LECY  RECX  RECY  FCX  FCY
Yes         30   33   48   28   62   56   39    39    31    34    66   36   51
Accuracy    36   46   43   30   50   44   47    47    45    45    56   43   53
No          46   66   40   40   39   46   56    60    68    60    44   57   61
During the course of the experiments it was found that the regression tree misclassified instances it
had already seen. This is a good sign, showing that the tree summarises information rather than
reconstructing it. It is evident that the SVM and the regression tree are the top performers. However,
all the SVM approaches had a lower bias, and from Figure 4.6 one of the top performing SVM
versions was only 1% lower than the top performing regression tree version. The top performing
single features were also obtained with the SVM: RECX (Right-Eye-Contour-X axis) at 64%
accuracy and FCX (Face-Centre (from contour)-X axis) at 63% accuracy. All this indicates that the
SVM is an excellent classifier to use here. It also provides a bias reading after training and can thus
be tweaked to bring that reading as close to zero as possible. The regression tree would require
pruning, which is a tedious and delicate process, and the Gaussian based versions suffer from equal
probabilities besides poor performance.

Therefore, after 15 different versions, 5 per classifier, the SVM using all the least-biased features
above the baseline was chosen as the top system from the feature based prototype, to serve as an
interest level gauge for any resolution of face that mpiSearch and eyefinder can detect.
Chapter 5: Spirit-Level and Face Kernels

Following is a description of the final model's prototype and its various versions. These were developed to move beyond the minimum requirements catered to by the SVM feature based approach (Chapter 4) and to incorporate the gaze-to-object cross-referencing enhancement.
5.1 Plan and Architecture
[Figure 5.1 diagram: INPUT → MPT:eyefinder (face and eyes) → corrected coordinates (stage 1) → DoG "spirit-level" classification L/R/S (stage 2) → regional interest detector (stage 3) → OUTPUT.]
Figure 5.1 Outline of the final model. Stage 1 involved the correction of face and eye coordinates using
several ratios and subsampling. The “spirit-level” model is presented in stage 2 which goes further than the
feature based model in classification and accuracy. Stage 3 is the regional interest detector.
The final model or system can be broken into three main stages, as shown in Figure 5.1: the face
and eye detection package; the main DoG component, which classifies "not looking" specifically as
either "left" or "right"; and the regional interest detector, which operates when the subject's gaze is
"straight-on" (or "centre"). Although the feature based prototype did work with "left", "right" and
"straight-on" input and output, it was not specifically designed to do so. The final version of stage 2
is described in Section 5.2. Its initial version, and the reasons for the extensive image segmentation
techniques described in Section 5.1.2, are covered in Section 5.1.1. Stage 3, the regional interest
detecting further enhancement, is described in Section 5.3. It was not feasible to develop the facial
expression and gesture recognition system, since the heads were very small and the project schedule
could not allow it. Section 5.4 contains an evaluation of the various versions of the two main
components, together with a comparison between the best "spirit-level" version and the SVM
feature based approach described in Chapter 4. In Section 5.5 the components are put together and
the final model's interface is explained.
Before discussing the new approach it should be noted that mpiSearch was replaced by eyefinder,
because mpiSearch had trouble detecting faces while eyefinder detected many more. Using the
eyefinder within mpiSearch, as discussed in Chapter 4, was prone to precision problems, and, not
discussed earlier, mpiSearch caused segmentation violations more frequently than eyefinder. The
former issue greatly affected overall performance, as this approach relies on a pin-point accurate
segment of the face. The latter issue usually accompanied memory mismanagement. Though the
memory management problem lay with the development environment (for reasons unexplained by
MathWorks and the Machine Perception Laboratory), the slower eyefinder reduced its frequency.
Since the main DoG algorithm was under development and the face detector was only a means to
test it, the speed decrease was not thought to be a major problem. The only problem the switch
introduced concerned the comparison between the two main prototypes: the source dataset remained
the same, but the subsets were now different. They were subsets because neither mpiSearch nor
eyefinder detected all the subjects, and eyefinder detected more. With confusion matrices in place,
however, problems of comparison were minimised, and since the only major difference was the
detection of a few more faces, the comparison was not affected.
Figure 5.2 A subset of possible pedestrian appearances (a to i) that this algorithm should cater to, with the
face segments shown below each subject.
As discussed in Section 2.2, Robertson et al.'s model in [25] could not be applied, as it does not give
precise gaze estimation. A model that moved beyond the one proposed in Chapter 4, with improved
performance and estimation quality, was also sought. With this new prototype it was possible to
determine, more precisely than with the feature based model, where the subject was looking. It
therefore became possible to tell whether the hoarding should be in a different position, whether
people are looking at it, and whether people are paying more attention to another hoarding and, if
so, why.
While observing various images of people and video footage, it was realised that there is one section
or segment of the face that is usually visible, shown in Figure 5.2. As shown, regardless of whether
the person has a beard or whether the face is covered, skin is visible in this segment. Segment "f", of
the lady wearing a burqa (i.e., full veil), shows an abundance of cloth. As discussed later, skin pixels
were assumed to be those clustered pixels (i.e., encoded with a cluster number) that occur most
between the mouth plane and the eye plane; a detailed explanation of these planes may be found in
[20]. Therefore, if an eye detector found the eyes, the section could be drawn and the cloth would
act in place of the skin in that region. This would, however, require a face and eye detector superior
to eyefinder and mpiSearch, one that does not need a visible face and works from the eyes as well.
Nevertheless, all the prototypes and versions developed, especially the final proposed model, are
ready to work with any face and eye detector package, as shown in Chapter 6 where robustness is
evaluated. The following sections describe the various steps that led up to the final "spirit-level"
approach.
5.1.1 Initial Version
The initial version took a larger section, centred on the tip of the nose, referred to as the third
quarter from the top of the face coordinates returned by the eyefinder. Its results, however,
motivated further refinements and the resulting novel algorithm. Work first began with RGB images
that were converted to greyscale and then standardised using the median of the pixels, since the
median is affected less by outliers [43, 38]. K-means was also used with a medoid approach [38] for
the same reason. Regional clustering was done to obtain a mean skin cluster value and a mean
background cluster value over the training images. Besides searching for a solution to segment the
face from the background, this was also done to verify Robertson et al.'s claim in [25] that skin
cannot be represented as, or in, any colour space. Results for non-regional clustering are shown in
Figure 5.4, with a 25 x 18 pixel face. The face was subsampled, taking every second pixel, to
remove extra information, including detail in the background similar to that belonging to the skin
region. Subsampling and regional base clustering were intuitive steps, and since they gave good
results they were accepted. Figure 5.5 shows segments obtained from an image that was not
subsampled and from the same image subsampled. (This figure is the result of the final model, but it
still illustrates the advantages of subsampling.)
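A sketch of the subsampling and median-centred clustering described above; the k-medians loop below stands in for the Matlab medoid variant of [38], and the input file is hypothetical:

    import numpy as np

    def kmedians(values, k=2, iters=20, seed=0):
        # K-means-style clustering with median centres which, like the
        # median standardisation above, is less affected by outliers.
        rng = np.random.default_rng(seed)
        centres = rng.choice(values, size=k, replace=False)
        for _ in range(iters):
            labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
            centres = np.array([np.median(values[labels == j]) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        return labels, centres

    grey = np.load("face_grey.npy").astype(float)   # hypothetical standardised image
    sub = grey[::2, ::2]                            # subsample: keep every second pixel
    labels, centres = kmedians(sub.ravel(), k=2)
    skin = (labels == np.argmax(centres)).reshape(sub.shape)  # lightest cluster as skin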
Figure 5.4 Left: K-means with 3 clusters, lightest coloured region is skin. Right: original image.
Figure 5.5 Subsampled face of 30 pixels high and original with segments shown.
As opposed to Figure 5.4, Figure 5.6 shows the result of applying K-means in a regional way, i.e.
separately to the area between the eye plane and the mouth plane.
Figure 5.6 Left: K-means (medoid) with 2 clusters, red indicating skin. Right: original image.
While this image shows a good segmentation, clustering of this sort was not resource efficient, at
least not in the RGB colour space. Figure 5.7 shows the similarity between the skin coloured
background and the face in a standardised greyscale image converted from an RGB image.
Figure 5.7 Top: Similarity between face and background toward left of image. Bottom: poor skin and non-
skin segmentation.
The result in Figure 5.6 encouraged a new stance on colour segmentation. Linear colour spaces
include CMYK, CIE XYZ and RGB. RGB is the most common among these and was created for
practical reasons, using single wavelength primaries. But with all linear colour spaces, the
coordinates of a colour do not always encode properties that are important in most applications, and
individual coordinates do not capture human intuitions about the topology of colours [9]. In other
words, the RGB colour space does not represent colours in terms of hue, saturation and value
(HSV) as non-linear colour spaces do. It was also observed that standardisation did not give the
desired results using either the mean or the median; this matters because brightness in an RGB
image varies with scale out from the origin. If the image were in HSV format, the "value"
component could simply be removed, leaving saturation and hue to help segment the various objects
in the scene.
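A sketch of this hue-and-saturation idea, assuming scikit-image for the colour conversion (Matlab's rgb2hsv was the equivalent available to the project; the input file name is illustrative):

    import numpy as np
    from skimage import color, io

    rgb = io.imread("frame.png")        # hypothetical input frame
    hsv = color.rgb2hsv(rgb)            # channels: hue, saturation, value
    hue, sat = hsv[..., 0], hsv[..., 1]
    # Drop the "value" (brightness) channel and work with hue and
    # saturation only, so that illumination matters less.
    features = np.stack([hue, sat], axis=-1).reshape(-1, 2)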
Figure 5.8 Bottom left: Normal distribution resultant. Bottom right: Nearest match using Euclidean distance.
Simply removing "value" and using hue and saturation did not work well either. Until then,
Euclidean distance was being used as the encoding scheme, with a mean value for the skin and
non-skin prototype clusters as the encoding criterion. To try an alternative to plain Euclidean
distance, Euclidean distance was first used to decide which pixels were skin and which were
non-skin, and standard deviations were calculated from these. The standard deviation and mean for
skin and non-skin pixels were then used to compare the areas under the skin and non-skin
Gaussians, to find where a pixel had the higher probability of belonging. Gaussian smoothing was
also applied before segmentation in an attempt to improve the cluster assignment; the difference is
shown in Figure 5.8. An isotropic low-pass Gaussian smoothing was applied to blur images, remove
detail and remove noise [9]. For each pixel's neighbourhood it outputs an average weighted towards
the central pixels, as opposed to, for example, the mean filter's uniformly weighted average. It thus
provides gentler smoothing and preserves edges [50, 51].
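A sketch of this encoding step, with SciPy supplying both the smoothing and the Normal densities; the skin and non-skin statistics are assumed to come from the earlier Euclidean-distance labelling:

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from scipy.stats import norm

    def encode_pixels(grey, skin_stats, nonskin_stats, sigma=1.0):
        # Isotropic low-pass Gaussian smoothing before segmentation, to
        # remove detail and noise while preserving edges.
        smooth = gaussian_filter(grey.astype(float), sigma=sigma)
        mu_s, sd_s = skin_stats        # mean/std of Euclidean-labelled skin pixels
        mu_n, sd_n = nonskin_stats
        # Each pixel joins the class under whose Gaussian it is more probable.
        return norm.pdf(smooth, mu_s, sd_s) > norm.pdf(smooth, mu_n, sd_n)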
There was a slight difference between the results of the Euclidean distance measure and the
Mahalanobis distance variation (Normal distribution). The latter took longer to compute and was
prone to equal probabilities; at the same time, however, it provided a closer fit. Figure 5.9 shows the
difference in the cross section.
Figure 5.9 Centre image: The Gaussian pixel encoding scheme draws nearer to the actual face than the
Euclidean distance encoding scheme (bottom) does.
The model was then extended to construct binary feature vectors such as those used by Dance et al.
in [52]; this is step 12 of the outline presented in Section 5.2. This was done by dividing the face
segment into a histogram of 12 bins. The skin pixels in each bin were counted and, if they exceeded
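A sketch of this binning step; the comparison threshold is an assumption, as its value is lost where the text breaks off above:

    import numpy as np

    def binary_feature_vector(skin_row, n_bins=12, frac=0.5):
        # Split the horizontal face segment into 12 bins and set a bin to
        # 1 when its skin-pixel count exceeds a threshold, here an assumed
        # fraction of the bin size, giving the "bubble" of skin against
        # non-skin described in the abstract.
        bins = np.array_split(np.asarray(skin_row), n_bins)
        return np.array([int(b.sum() > frac * b.size) for b in bins])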
Appendix A: Reflection

A reflection on experience that should be beneficial to other students in the future.

While it is not possible to cover all the experience gained, some aspects should be useful for others to learn from. With any problem, the simplest solutions should be exhausted first. The image segmentation techniques that were tried tended to be overly complicated at times, when the actual solution settled on was extremely simple to implement. For example, if simple experiments with RGB and HSV images had been exhausted earlier, the further enhancement of expression recognition could have been developed, or more time could have been spent evaluating more frames. Possible additions to a saturation based technique include using the mean-shift tracker; without a PCA model, those two techniques should give fruitful results in symbiosis.

The development environment is a very important aspect of any computer science problem. Matlab provided rapid prototyping capabilities but also has poor memory management. It is essential to keep such factors in mind, since this project suffered considerably as a result: it took 2 weeks to successfully process film that was expected to take a few days. In the end, 1.5 hours of video was reduced to clips of 1 minute each, which took at least half a day to compute. The alternative was to use the OpenCV libraries and code in C++, but experience suggested that Matlab was simpler. The lesson here is that if high quality frames are to be processed, more than 1GB of RAM is required: after every 80 frames (1024x768 pixels) of a 1 minute AVI clip (approx. 100MB), the system would crash with 1GB of RAM. The task was therefore divided among 4 computer systems with 2.8 GHz Intel Pentium 4 processors, each with at least 50 GB of hard disk space for virtual memory. The higher resolution was essential to be able to detect smaller heads. The number of computers was limited to 4 because the Matlab release being used had licence limitations on its image processing toolbox. This is another factor that should be kept in mind while developing systems.

Since this was an idea converted into a project, certain aspects were both pros and cons. There was no set specification to use as a milestone for the development process, and no existing system to beat. This meant possible "feature-creep", though at the same time the absence of an industry representative prevented "feature-creep" and allowed development in a very research-oriented style. The aspect governing the project's success was a clearly defined, precise set of minimum requirements from the very start, and specific evaluation criteria to measure success with. While still on the subject of "idea to project", it is important to define an achievable schedule with large buffers in place; since this is mainly a self-driven project, such buffers are required. For example, in July the development finished as planned, but the evaluation used up the buffer due to the problems discussed. Lastly, for the report, it is important to document and evaluate everything during development and to keep code so that results can be reproduced.
Appendix B: Objectives and Deliverables Form

Objectives and deliverables form recording the initial agreement of minimum requirements, deliverables and other aspects of the project.

AIM AND REQUIREMENTS FORM COPIED TO SUPERVISOR AND STUDENT
----------------------------------------------------------
Name of the student: Saad CHOUDRI
Email address: scs5sc
Degree programme: MCGS - MSc in Cognitive Systems
Number of credits: 60
Supervisor: dch
Company: (none)

The aim is: Development of Computer Vision concepts and skills through experience.

The project outline is:
Background reading: Reading and understanding face and gesture recognition to understand models such as the Viola and Jones model.
Methodology: Prototype development
Product: Software implementation
Evaluation of product: Performance evaluation against ground truth.

The minimum requirements are:
1. Integrate off the shelf face and eye detection software.
2. Estimate viewing direction from a face and eye detection package augmented with novel algorithms and/or approaches through literature research.
3. Devise a measure of interest from viewing angle and duration of viewing.
4. Evaluate the system documented in a report.
5. Enhancement optionals: 1) incorporate facial expression into the interest measure; 2) cross reference objects in the display to gauge level of interest in an object.

The hardware and software resources are:
1. Video cameras from the Vision Lab.
2. Extra disk space as needed on school computers, starting with about 100 MB.
3. Continuous Matlab availability and support on school computers.

The foundation modules are:
1. COMP5430M
2. COMP5420M

The project title is: Cognitive Model gauging interests in advertising hoardings.
Appendix C: Marking Scheme and Header Sheet

The marking scheme, header sheet and comments from the interim report are given below.
Appendix D: Gantt Chart and Project Management

Reflection on the project management and updates are given below.

[Table 1: Gantt chart from February to September, with rows for Background Reading, Concurrent activities, Final Evaluation, Poster, Report and free slots for delays, and weeks numbered 1-4 within each month. Key: completion of 3 stages and 1 week off; project submission.]
Table 1 Original schedule giving an overview of the planned assigned weeks. For example, the whole of
June was to take up the concurrent activities and background reading, but only the first 3 weeks of August
were to be used for report writing. The key describes irregularities.
[Table 2: updated Gantt chart from February to August, as Table 1, with Concurrent activities marked done and Final Evaluation ongoing. Key: completion of 3 stages and 1 week off; project submission; 18th July progress meeting.]

Table 2 Schedule as on 18th July.
The final evaluation stage threw the project off track, and processing continued until the 10th of
August due to a lack of resources such as RAM. This delayed all the other activities, but the 1st and
2nd drafts of the entire report were done by the 31st of August, so the buffer zones were very useful
and well planned.

A total of 280 hours of solid productive work were required for the evaluation troubleshooting,
report writing and poster making during August, and the entire project was ready for submission 5
days before the deadline, on the 1st of September. Prior to this, at least 400 hours had been spent
since February, giving a total of 640 productive hours for this project.
Appendix E: CD

All deliverables other than this report (images, videos, etc., and source code) are on the CD enclosed with the printed report handed in on the 6th of September 2006.
The following points explain how to use this CD or Appendix E (E for electronic). A readme.txt file
per section is present explaining the contents. The sections are:
• The folder titled “Deliverables” includes
o Evaluation videos filmed (edited, and as many as could fit on the CD)
o Processed clips (named in accordance with the evaluation form)
o Extracted frames that were evaluated manually
o Software code with workspace and usage instructions (demonstration arrangeable)
• The folder titled “Dataset” contains
o All the training images, testing images, and the extra 70 images taken for:
   - The main DoG component
   - The further enhancement
   - MPT testing
• The folder titled “Miscellaneous” contains
o Histograms etc. to understand the parameters of the 13 feature approach
o Background, skin and hair pictures
Appendix F: Dataset

This appendix contains thumbnails of the training and test images.
Figure F.1 Training images: 34 centre, 19 left and 20 right facing images. Subjects with and without
glasses have been repeated in some areas. Other images have also been included to avoid over-fitting.
Figure F.2 68 “looking” images selected on the basis of appearance variation. Some images have been
repeated to see how classifiers behave with previously seen images as well as unseen.
Figure F.3 65 “not looking” test images. Carefully selected images with many unseen poses, varying
appearance, illumination, background and expression, and that are likely to be misclassified.
Figure F.4 Training images for cross referencing objects with pose. 5 different poses used as described.
Figure F.5 Test images for cross referencing objects with pose. Carefully selected subjects with varying
appearance.
Appendix G: 13 Feature Analysis Samples

Histograms and range analysis for the Normal distribution. L = left, R = right, S = straight or centre, Y = looking, N = not looking.
Figure G.1 FCX
Figure G.2 FCY
Figure G.3 REY
This is a subset of the 13 histograms for L, R and S that were plotted; the rest may be found in
Appendix E. Following are the same histograms, grouped as yes and no rather than left, right and straight.
Figure G.4 FCX
Figure G.5 FCY
Figure G.6 REY
The rest of the set of histograms are in Appendix E.
Appendix H: Image Segmentation Samples

Screenshots of the various implemented image segmentation techniques on 76 training images.
Figure H.1 Face and hair extraction using K-means and Euclidean distance with 3 clusters for the background
and 2 for the subject.
Figure H.2 Results of encoding using 5 Gaussians. Equal probabilities are the biggest problem.
Figure H.3 6 Gaussians, 3 for face and 3 for background and used for joint probability.
Figure H.4 The foreground and background pixels individually representing 2 Gaussians.
Figure H.5 An adaptive background approach using a series of Gaussians.
Figure H.6 After 3 iterations of the adaptive accumulative Gaussian approach from Figure H.5.
Figure H.7 Texture segmented images
Figure H.8 Ellipse on texture without adjustments
Figure H.9 Ellipse fit on RGB images
Figure H.10 Ellipse on texture with adjustments. The ellipse has been drawn on the original images but with
the coordinates of the textured image.
Figure H.11 Textured image in column 7 and row D of Figure 5.11, with its edges detected.
Figure H.12 Edge detection on RGB image with bias frame
Figure H.13 Training image subset in saturation method with ellipses to show the colour discrimination.
Figure H.14 Faces segmented using saturation and K-means.
Appendix I: Ground Truth Evaluation Form

Completed evaluation form for 111 frames, as described in Chapter 6. The empty cells are equivalent to zeros and were not required to be filled.