Stereovision based vehicle classification using support vector machines
by Pascal Paysan
Submitted to the University of Applied Sciences Fachhochschule Esslingen, Hochschule für Technik, Fachbereich Informationstechnik, Softwaretechnik, in partial fulfillment of the requirements for the degree of Diplom-Ingenieur - Softwaretechnik, February 2004
Accomplished at the Massachusetts Institute of Technology, Center for Biological and Computational Learning, for DaimlerChrysler
Author: Pascal Paysan, February 28, 2004
Certified by: Jürgen Koch, Dr. rer. nat., Thesis Supervisor
Accepted by: Tomaso Poggio, Ph.D., Supervisor at CBCL
Accepted by: Stefan Gehrig, Dr., Supervisor at DaimlerChrysler
Abstract
The thesis studies the detection of oncoming vehicles in traffic scenes by using depth information. The image sequences in our experiments are captured by a pair of stereo cameras which are mounted in a test vehicle. The main difficulty is to build a system that runs in real time on a standard PC and performs accurate detection of vehicles even under unfavorable illumination and weather conditions. Robust object detection is a key function in vision guided, autonomous driving. The most relevant classes of objects for this type of application are vehicles, pedestrians, and traffic signs. This work focuses on the recognition of oncoming vehicles. Although stereo vision by itself is not reliable enough to perform accurate vehicle detection, it is useful to quickly generate object hypotheses which can then be verified by accurate pattern recognition techniques. The pattern recognition algorithm consists of an SVM trained on wavelet coefficients of histogram equalized frontal views of vehicles, similar to the technique described in [21]. Experiments show a detection rate of 63 %; the processing time for a 640x480 frame is about 300 ms. The work contains detailed statistics about the detection rate and the computing time. A novel way to combine stereo vision and SVM classifiers is introduced.
Thesis Supervisor: Jürgen Koch
Title: Dr. rer. nat.
Acknowledgments
First I want to thank Carsten Knoppel, who arranged the first contact with the CBCL. Thanks to Uwe Franke, Stefan Gehrig and DaimlerChrysler for technical and financial support. I want to thank Jerry Jun Yokono and Bernd Heisele for useful advice, and Stanley Bileschi for supporting me with image and classification libraries. I want to thank Professor Poggio for supervising my thesis at MIT, and all the CBCL members for sharing their knowledge and for the friendly welcome. Thanks to Professor Jürgen Koch for supervising me at my home university in Esslingen. Last but not least, I want to thank my family and friends for supporting me.
Chapter 1
Introduction
This project studies the detection of oncoming vehicles in traffic scenes using depth information. The image sequences in our experiments are captured by a pair of stereo cameras mounted in a test vehicle. These images are preprocessed by a standard stereo system to obtain 3D information for extracted interest points. The stereo system is outside the scope of this work, although we give a short introduction to it. The 3D data is used to estimate rough positions of bounding boxes around close objects. These bounding boxes are used both to extract example images for training the SVM and in the detection stage. The gray level patch bordered by the bounding box is scaled to the preferred detection size using linear interpolation. For the final classification the patch is histogram equalized and transformed using different wavelet transformations to obtain the SVM input vector. Finally, a program was developed to compare the classification results and to optimize the parameters of the SVM with Gaussian kernels using grid search.
1.1 Motivation
Robust object detection is a key technique for understanding the environment and a step towards the intelligent vehicle. As driver assistance systems such as the Distronic automatic distance cruise control become more popular as a means to improve safety and comfort, the need for new technologies with improved detection is growing. Optical systems provide an opportunity to solve the task of environment interpretation. A major theme in this area is the detection of oncoming vehicles, for which both detection time and robustness are fundamental requirements. This makes it necessary to use more accurate classifiers to verify the results of fast classifiers. Statistical learning combined with stereo vision provides a technology to solve this task.
1.2 The Problem
The main difficulty of the vehicle detection task is the variety of shapes and colors of vehicles. Unfavorable illumination and weather conditions are another challenge. Reflections also affect the reliability of local descriptors. Furthermore, vehicles can occur in images with many degrees of freedom in scale and translation. In our case the detection time on a standard PC is important too, because the future goal is to implement a real-time system which supports the driver.
Chapter 2
Related work
Much work related to vehicle detection has been done. In [20] Constantine P. Papageorgiou and Tomaso Poggio introduced an approach which is very similar to our SVM classification and showed good results. Unlike our system, they used no 3D information from a stereo vision system, so they had to apply the SVM classifier at different scales and over the whole image. The computational cost of our method should therefore be lower. In [1] an interesting approach based on an advanced feature selection method was proposed. Their system builds a "vocabulary" of information-rich features, and the classifier is then trained on the feature vectors from this "vocabulary." They report 8 seconds as the average time to test a 200x150 pixel image, which is still too slow for our purpose. Another advantage of using stereo vision is that it provides the position of the detected object relative to the vehicle in which the cameras are mounted. This information can be used for other tasks, for example to calculate the relative movement. In [19] a stereo-based method is shown. In contrast to our system, they use symmetry to refine the bounding boxes found and to measure their correctness. This approach might not work as well in downtown scenarios, where symmetries can also occur, e.g. on houses or on the guardrails of bridges.
Chapter 3
Background
3.1 UTA
Urban Traffic Assistant (UTA) is a current research project at DaimlerChrysler exploring the possibilities of using machine vision and image understanding to increase safety and comfort in downtown scenarios [10].
The most important perception tasks that have to be solved for this project are:
• The leading vehicle must be detected and its distance, speed and acceleration must be estimated in longitudinal and lateral directions.
• The course of the lane must be extracted even if it is not given by well painted markings.
• Small traffic signs and traffic lights have to be detected and recognized in a highly colored environment.
• Additional traffic participants such as oncoming vehicles, bicyclists or pedestrians must be detected and classified.
• Stationary obstacles that limit the available free space, e.g. parked cars, must be detected.
This work is part of the UTA project and shows a new approach that combines stereo measurement with SVM classifiers to obtain position estimates and classification results at the same time.
3.2 Standard Stereo
A computational theory of human stereo vision has been proposed by T. Poggio [18]. We use depth data from a standard stereo system. Standard stereo is a technique in which the images are projected onto a single plane in such a way that the epipolar lines coincide with the image rows. The advantage of this method is that corresponding features can be searched for efficiently. The cameras are calibrated by the method introduced by Bouguet [2]. During the calibration process the internal as well as the external parameters of the cameras are calculated. These parameters are used in the rectification process to correct lens-dependent distortion and to project the images. On one of these images an interest operator, such as the Harris corner detector [12], is applied to extract interest points. Afterwards, for each interest point, a correlating feature is matched in the other image using a technique such as the sum of squared differences (SSD). From the correlating features the disparity is calculated and then used to determine the 3D point. In our case we compute the points using the matrix $KK^{-1}$

\[
KK^{-1} =
\begin{pmatrix}
\frac{\mathit{width}}{n_u} & 0 & -\frac{\mathit{width}}{2} \\
0 & \frac{\mathit{height}}{n_v} & -\frac{\mathit{height}}{2} \\
0 & 0 & f
\end{pmatrix}
\tag{3.1}
\]

where $\mathit{width}$, $\mathit{height}$ is the chip size in m, $f$ is the focal length in m, and $n_u$, $n_v$ is the image size in pixels.
The 3D point is then calculated by solving the equation

\[
\vec{C}_0 + \lambda_0 KK^{-1}
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
= \vec{C}_1 + \lambda_1 KK^{-1}
\begin{pmatrix} u \\ v - d \\ 1 \end{pmatrix}
\tag{3.2}
\]

where $\vec{C}_0$ and $\vec{C}_1$ are the focal points of the cameras relative to the road surface, $u$, $v$ are the pixel coordinates of the feature, and $d$ is the disparity of the matched feature (Fig. 3-1).
Figure 3-1: Schematic illustration of standard stereo geometry, where C0 and C1 are the focal points and P is the intersection in 3D space of the rays through the correlating features. The camera coordinate system (X, Y, Z) is placed in the middle of the baseline, translated by -height.
A more detailed description can be found at [22, 9].
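The two equations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sensor values used in the usage note are hypothetical, and because measured rays rarely intersect exactly, the sketch solves (3.2) in the least-squares sense and returns the midpoint of the closest points on the two rays, which is one common choice and not necessarily what the thesis implementation does.

```python
def kk_inverse(width, height, f, nu, nv):
    """Inverse camera matrix of Eq. (3.1): pixel (u, v, 1) -> ray direction."""
    return [
        [width / nu, 0.0, -width / 2.0],
        [0.0, height / nv, -height / 2.0],
        [0.0, 0.0, f],
    ]


def pixel_ray(kk_inv, u, v):
    """Direction of the viewing ray through pixel (u, v)."""
    pixel = (u, v, 1.0)
    return [sum(a * b for a, b in zip(row, pixel)) for row in kk_inv]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate(c0, r0, c1, r1):
    """Least-squares solution of c0 + l0*r0 = c1 + l1*r1 (Eq. 3.2).

    Assumption: the rays may not intersect exactly, so the midpoint of
    the closest points on the two rays is returned.
    """
    d = [b - a for a, b in zip(c0, c1)]  # c1 - c0
    # Normal equations of min |l0*r0 - l1*r1 - d|^2 in (l0, l1)
    a11, a12, a22 = dot(r0, r0), -dot(r0, r1), dot(r1, r1)
    b1, b2 = dot(r0, d), -dot(r1, d)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel, depth is undefined")
    l0 = (b1 * a22 - a12 * b2) / det
    l1 = (a11 * b2 - a12 * b1) / det
    p0 = [c + l0 * r for c, r in zip(c0, r0)]
    p1 = [c + l1 * r for c, r in zip(c1, r1)]
    return [(a + b) / 2.0 for a, b in zip(p0, p1)]
```

In use, `r0` would be `pixel_ray(kk_inv, u, v)` and `r1` would be `pixel_ray(kk_inv, u, v - d)`, with `c0` and `c1` the two focal points.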
3.3 Histogram equalization
Histogram equalization (HE) changes the distribution of the gray values over the image. It uses the histogram to determine how often each gray value occurs and then remaps the gray values such that their distribution fits a desired function. As shown in [6], HE increases the performance of face and people classification, a task comparable to ours. HE is used for image compression because it is a form of vector quantization [11]. If we assume that HE leads to a more compressed representation of the image, then we can conclude that its effect is roughly the same as using a larger training set. HE also provides better illumination invariance, which is another reason why it improves the performance.
Figure 3-2: Example of histogram equalization. (a) Source image. (b) Histogram equalized image.
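The remapping described in this section can be illustrated with a minimal sketch. It assumes 8-bit gray values stored as nested lists and a uniform target distribution; the function name and the lookup-table formula are illustrative choices, not the thesis implementation.

```python
def equalize_histogram(patch, levels=256):
    """Remap gray values so their cumulative distribution is roughly linear.

    `patch` is a list of rows of integer gray values in [0, levels - 1].
    """
    pixels = [p for row in patch for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative histogram: number of pixels with value <= g
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant patch: nothing to spread out
        return [list(row) for row in patch]

    # Lookup table mapping each old gray value to its new value
    lut = [max(0, round((c - cdf_min) / (n - cdf_min) * (levels - 1)))
           for c in cdf]
    return [[lut[p] for p in row] for row in patch]
```

Applied to a low-contrast patch such as `[[50, 50], [51, 52]]`, the darkest value maps to 0 and the brightest to 255, so the full gray range is used.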
3.4 Wavelets
Wavelets are a mathematical tool to hierarchically decompose functions [17, 26].
Wavelets represent a signal in different resolutions. As shown in many publications
[21, 6, 4] wavelets increase the performance of classifiers. In the case of 2D wavelets
for images, the transformation is applied on columns in the first step and then on
the resulting rows. This approach, called non-standard decomposition, is repeated
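recursively until the transformation is complete. The Haar wavelet is defined by the scaling function $\phi$ (3.3) and the wavelet function $\psi$ (3.4).

\[
\phi(x) =
\begin{cases}
1 & \text{for } 0 \le x < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.3}
\]

\[
\psi(x) =
\begin{cases}
1 & \text{for } 0 \le x < 1/2 \\
-1 & \text{for } 1/2 \le x < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.4}
\]

The sets of scaled and translated functions for the different resolutions are then defined as

\[
\phi_i^j(x) = 2^{j/2} \phi(2^j x - i), \quad i = 0, \ldots, 2^j - 1 \tag{3.5}
\]

\[
\psi_i^j(x) = 2^{j/2} \psi(2^j x - i), \quad i = 0, \ldots, 2^j - 1 \tag{3.6}
\]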
where j is the scale and i is the translation. In our case the 2D Wavelet function
is shown in Fig. 3-5. We can also think of the 2D Wavelet as a multiple-edge
filter in horizontal, vertical and diagonal directions. The filter is applied in different
resolutions.
Figure 3-3: Scaling function.
Figure 3-4: Wavelet function.
Figure 3-5: 2D Scaling and Wavelet functions.
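As an illustration of the non-standard decomposition described in this section, the following sketch applies one Haar step to the columns, then to the rows, and recurses on the quadrant of averages. It assumes a square image whose side is a power of two and uses the orthonormal 1/√2 scaling; the function names are hypothetical and the thesis implementation may order or scale the coefficients differently.

```python
import math


def haar_step(values):
    """One level of the 1D Haar transform: averages first, then details."""
    s = math.sqrt(2.0)
    half = len(values) // 2
    averages = [(values[2 * i] + values[2 * i + 1]) / s for i in range(half)]
    details = [(values[2 * i] - values[2 * i + 1]) / s for i in range(half)]
    return averages + details


def haar_2d_nonstandard(image):
    """Non-standard 2D Haar decomposition (square, power-of-two side)."""
    out = [[float(v) for v in row] for row in image]
    size = len(out)
    while size > 1:
        # Step 1: transform each column of the current top-left block
        for c in range(size):
            column = haar_step([out[r][c] for r in range(size)])
            for r in range(size):
                out[r][c] = column[r]
        # Step 2: transform each resulting row of the block
        for r in range(size):
            out[r][:size] = haar_step(out[r][:size])
        # Recurse on the quadrant that holds the averages
        size //= 2
    return out
```

With this layout the coefficient at position `[0][0]` carries the overall average of the patch, and all other entries are edge responses at different resolutions.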
3.5 Support vector machine classification
Support vector machines (SVMs) are a well-founded technique in statistical learning theory [27, 3, 16, 7, 13]. The SVM is a trainable machine which predicts the output for a given input. In the supervised learning process, labelled examples are presented, and the task is to find a function which describes the relation between the input examples and the output. For a binary SVM, the function that predicts the output is

\[
f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right) \tag{3.7}
\]

where $x_i$, $i = 1, \ldots, m$ are the selected training examples, called support vectors, $x$ is the input vector, $K(x_i, x)$, called the kernel, is a symmetric positive definite function, $y_i \in \{1, -1\}$ is the label of the vector $x_i$, and $\alpha_i$ is a weight for the support vector determined in the training process. $b$ is the bias of the hyperplane. There are several different types of kernels:

\[
\text{Linear:} \quad K(x_i, x) = x_i \cdot x \tag{3.8}
\]

\[
\text{Polynomial:} \quad K(x_i, x) = (x_i \cdot x + 1)^d \tag{3.9}
\]

\[
\text{Gaussian:} \quad K(x_i, x) = e^{-\|x_i - x\|^2 / 2\sigma^2} \tag{3.10}
\]

where $d$ is the degree of the polynomial kernel and $\sigma$ is the width (standard deviation) of the Gaussian, so that $\sigma^2$ is its variance. The polynomial and Gaussian kernels are nonlinear, which is important if there is no linear relation between the labels and the input: these kernels can solve non-linear problems.
Figure 3-6: Examples for different classifiers (linear, polynomial, Gaussian). Visualized with Gui for LibSvm [5].
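The decision function (3.7) and the three kernels (3.8) to (3.10) translate almost directly into code. The sketch below is a plain-Python illustration with hypothetical function names; it does not use or imitate the SVMFu API, and it treats a score of exactly zero as the positive class by convention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(xi, x):
    """Eq. (3.8)."""
    return dot(xi, x)


def polynomial_kernel(xi, x, d=2):
    """Eq. (3.9) with degree d."""
    return (dot(xi, x) + 1.0) ** d


def gaussian_kernel(xi, x, sigma=1.0):
    """Eq. (3.10) with width sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decide(x, support_vectors, labels, alphas, b, kernel):
    """Eq. (3.7): sign of the weighted kernel sum plus bias."""
    score = sum(alpha * y * kernel(xi, x)
                for alpha, y, xi in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1
```

For example, with support vectors `[[0, 0], [2, 2]]`, labels `[-1, 1]` and equal weights, a Gaussian kernel assigns a test point to the class of the nearer support vector.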
3.5.1 SVM training
The training process of an SVM classifier is derived from regularization theory

\[
\min_{f \in H} \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda \|f\|_K^2 \tag{3.11}
\]

where $f \in H$ is of the form

\[
f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x_i, x) + b \tag{3.12}
\]

and $V$ is the loss function, which measures the goodness of the predicted output $f(x_i)$ with respect to the given label $y_i$. There are several different kinds of loss functions. For SVM classification, the following loss function is used:

\[
V(y, f(x)) = (1 - y f(x))_+ \tag{3.13}
\]

where $(t)_+ = t$ if $t > 0$, and zero otherwise.
In most of the SVM literature, these equations are re-parameterized. Instead of the regularization parameter $\lambda$, regularization is controlled via a parameter $C$, defined by the relationship

\[
C = \frac{1}{2 \lambda m} \tag{3.14}
\]

The parameter $C$ lets the user choose an extra cost for errors. A higher $C$ corresponds to assigning a higher penalty to errors [3]. Using this definition, the regularization problem becomes

\[
\min_{f \in H} C \sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_K^2 \tag{3.15}
\]

This leads to a primal or a dual problem, both of which are convex quadratic programs. A detailed description can be found in [23, 14, 7]. An algorithm which can handle large training sets is also described in [23].
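As a short worked step (added here for illustration, not taken from the cited references), substituting (3.14) shows that the objective of (3.15) is the objective of (3.11) multiplied by the positive constant $1/(2\lambda)$, so both problems have the same minimizer:

```latex
\begin{aligned}
\frac{1}{2\lambda}\left[\frac{1}{m}\sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda \|f\|_K^2\right]
&= \frac{1}{2\lambda m}\sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{1}{2}\|f\|_K^2 \\
&= C\sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{1}{2}\|f\|_K^2
\end{aligned}
```

A small $\lambda$ (weak regularization) therefore corresponds to a large $C$ (high penalty on errors), and vice versa.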
3.5.2 Advantages of SVM based classifiers
There are several advantages of SVMs. The most important is that during the training process only a few vectors of the training set are selected to become support vectors. This reduces the computational cost and provides better generalization. Another advantage is that there are no local minima in the quadratic program, so the solution found is always the optimum for the given training set. Finally, unlike neural networks, the solution does not depend on start conditions.
3.5.3 The kernel parameters C and σ
Choosing the right parameter C, and in the case of Gaussian kernels additionally σ, can be challenging. Geometrically, C controls the width of the margin between class and non-class. If C is too big, the margin becomes very small and the training time becomes extremely long. On the other hand, if C is too small, there will be no unbounded support vectors and the term b cannot be determined [23]. Much the same is true for σ: if it is too small, the generalization becomes very poor and every vector is used as a support vector; if σ is too big, the kernel will not show good results either. We propose grid search (4.3.5) as a straightforward method to find suitable values for C and σ.
Chapter 4
Procedure
4.1 System
The system consists of three programs. The main program is called StereoVisualisation. It can be started in three modes: the calibration mode is used to set up the world coordinate system (4.2.1), the label mode (4.3.1) is used to generate training and test sets, and the detection mode uses the SVM to classify the bounding boxes found and to display the oncoming vehicles. In addition, we developed a program called ImageTransform, which generates training and test files in SVMFu format. It was separated from the main system so that several different transformations can be used to generate training and test files from one image database. Furthermore, a program called SVMFuTool was developed to automate the training and testing of the SVM kernels. It generates shell scripts to dispatch SVMFu training processes with different parameters to the available CPUs and machines. The tool also applies the trained kernels to the test set and saves GnuPlot files and tables to evaluate the results. A list of the libraries used in this project can be found in A.4; some of the libraries had to be adapted for the system.
Figure 4-1: Schematic illustration of the detection system.
The system shown in Fig. 4-1 contains three subsystems (1...3 in the figure) and uses the modules shown. The following list describes the modules of the system.
- 1 Stereo system to compute depth data.
- 1.1 Feature matching on epipolar lines to approximate disparities.
- 1.2 3D point calculation.
- 2 Stereo based vehicle detection and SVM training system.
- 2.1 Bounding box detection searches the depth data for plausible occurrences of vehicles. Depending on the mode of the application, we pass the bounding box to the feature extraction or display it for labeling.
- 2.2 In the detection mode we extract the features using the transformation described in 4.3.2.
- 2.3 We save the gray value patches in two directories on the hard drive, depending on their label.
- 2.4 SVM classification of the feature vector (4.3).
- 3 Separate programs to generate the training set, train SVM kernels, and test SVM kernels.
- 3.1 Same feature extraction as in the main system (2.2), but in a separate program that reads gray value patches from the hard drive.
- 3.2 Feature vectors are saved in SVMFu format as a training or test file.
- 3.3 We use SVMFu [24] to train the Gaussian kernels.
- 3.4 Program to test the kernels and generate graphics for evaluation. The graphics are saved in GnuPlot format.
4.1.1 Sequence loader
The class SequenceLoader is the main class for the detection process. The class
contains all necessary class instances for the whole system, except the user interface.
The collaboration is shown in Fig. 4-2.
Figure 4-2: Collaboration diagram for the SequenceLoader class. This class connects all important components. Members shown: mTransform (ZImageTransform), mDepthMap (DepthMap), mCam0 and mCam1 (BouguetCamera), mSeqItems (List<SeqItem>), mC0Texture and mC1Texture (ZImageTexture), mTimeStatistic (TimeStatistic).
The following classes are used by the SequenceLoader:
Table 4.1: Classes used by the SequenceLoader

Class              Description
ZImageTransform    Transformation and classification of the gray value patches (4.3.2).
DepthMap           3D data loader and bounding box detection.
BouguetCamera      Calibration data loader and projection matrices (4.2.1).
List<SeqItem>      List of the image pairs and the associated 3D data files from the input directory.
ZImageTexture      Image loader for images from both cameras, with display functions for OpenGL.
TimeStatistic      Time measurement.
4.2 Bounding box detection
4.2.1 Area of interest
The volume of interest is a user-defined region in 3D space in which objects can be detected. This region in front of the vehicle is aligned with the world coordinate system, which is parallel to the road. It is defined by two sets of parameters.
• A Cartesian projection matrix W (4.1) to rotate and translate the 3D points. This permits us to place the interest box parallel to the street. The system provides a calibration mode to set up the projection matrix.
• A clipping volume to define the boundaries of the detection. This box specifies the area of interest in front of the vehicle. Each point outside this box is ignored in further detection steps. The constants top, left, right, bottom, near and far are read from a configuration file and shown as an orange box in the calibration mode.
Figure 4-3: Area of interest in 3D space with right lane mark.
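The matrix $W$ is used to project the point $\vec{p}$ to $\vec{p}'$ in the world coordinate system, which is placed in front of the vehicle, as in (4.2).

\[
W =
\begin{pmatrix}
r_{00} & r_{01} & r_{02} & t_{03} \\
r_{10} & r_{11} & r_{12} & t_{13} \\
r_{20} & r_{21} & r_{22} & t_{23} \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{4.1}
\]

\[
\vec{p}' = W \vec{p} \tag{4.2}
\]

Each projected point $\vec{p}'$ is discarded if it does not lie in the interest box, that is, unless

\[
\mathit{left} > p'_x > \mathit{right}, \qquad
\mathit{top} > p'_y > \mathit{bottom}, \qquad
\mathit{near} > p'_z > \mathit{far}
\]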
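The transformation (4.2) followed by the clipping test can be sketched as follows. The function names, the dictionary layout of the box and the numeric values in the usage note are assumptions made for illustration; in the system described here the constants come from a configuration file.

```python
def to_world(W, p):
    """Apply the 4x4 matrix W of Eq. (4.1) to the 3D point p (Eq. 4.2)."""
    h = (p[0], p[1], p[2], 1.0)  # homogeneous coordinates
    return tuple(sum(W[r][c] * h[c] for c in range(4)) for r in range(3))


def in_interest_box(p, box):
    """True if left > x > right, top > y > bottom and near > z > far."""
    x, y, z = p
    return (box["left"] > x > box["right"]
            and box["top"] > y > box["bottom"]
            and box["near"] > z > box["far"])


def clip_points(points, W, box):
    """Transform all points with W and keep those inside the interest box."""
    world = (to_world(W, p) for p in points)
    return [p for p in world if in_interest_box(p, box)]
```

A hypothetical box such as `{"left": 4, "right": -4, "top": 3, "bottom": 0, "near": -2, "far": -40}` follows the sign convention of the inequalities above, where `left` is the larger x value and `near` the larger z value.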
4.2.2 Bounding box search
For bounding box detection we project the points onto the surface given by the interest volume. We quantize the x and z coordinates of the points and use these values as indices into an array. For every (x, z) cell the average of the 3D points falling into it is calculated and stored in the array (Figure 4-4 (a)). Possible vehicle fronts are detected by shifting a search window over the array. Within the window the outermost left and right average points are searched and tested against several geometrical constraints (Figure 4-4 (b)). The width $(p'_{lx} - p'_{rx})$ between the outer left and right average points must lie in a given interval. The approximated height is the difference between the maximal and minimal heights ($p'_y$ values) of the points within the window. We also provide the option of using constant values for the bottom and the top, which gives more reliable results. Furthermore, we use the angles α and β to limit the rotation of the box relative to the world coordinate system. The corner points of the box then consist of the x, z coordinates of the left or right point and the minimum or maximum $p'_y$ value. The detected boxes are stored in a list $BL = \{b_0, b_1, .., b_i, .., b_{n-1}\}$ for further processing.
Figure 4-4: (a) Schematic illustration of bounding box detection on the projected surface. Gray values represent the count of 3D points. (b) Geometrical constraints for validation. Width and height as well as α and β must be in a given range. The height is measured using the minimum and maximum y values of the 3D points within the search window.
Figure 4-5: Detected bounding boxes on a truck in 3D space.
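The first two ingredients of the search, the quantized (x, z) array of average points and the width constraint, might look like the sketch below. The cell size, the function names and the use of a dictionary instead of a fixed array are assumptions; the sliding search window and the angle constraints are left out.

```python
import math
from collections import defaultdict


def average_grid(points, cell_size):
    """Average the 3D points that fall into each quantized (x, z) cell."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for x, y, z in points:
        key = (math.floor(x / cell_size), math.floor(z / cell_size))
        acc = sums[key]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    return {key: (sx / n, sy / n, sz / n)
            for key, (sx, sy, sz, n) in sums.items()}


def width_ok(left_point, right_point, min_width, max_width):
    """Width constraint: p'_lx - p'_rx must lie in the given interval."""
    return min_width <= left_point[0] - right_point[0] <= max_width
```

Each dictionary entry corresponds to one array cell of Figure 4-4 (a); a search window would then scan neighbouring cells for left and right points that pass `width_ok`.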
4.2.3 Bounding box projection
To extract the gray value patches contained by the bounding boxes we have to project
the coordinates from the world coordinate system to the image plane of the chosen
camera. To achieve this, we need the matrix KK describing the internal parameters
of the camera. The matrix KK describes the projection of 3D points on the image
surface following the pinhole camera model. We use the calibration results from the
Camera Calibration Toolbox introduced by Jean-Yves Bouguet [2]. We extend the
matrix KK to KKw by including the translation of the cameras relative to the road
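surface. The image coordinates are then calculated as follows

\[
KK_w =
\begin{pmatrix}
\frac{f n_u}{\mathit{width}} & 0 & -\frac{n_u}{2} & t \frac{f n_u}{\mathit{width}} \\
0 & \frac{f n_v}{\mathit{height}} & -\frac{n_v}{2} & -h \frac{f n_v}{\mathit{height}} \\
0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0
\end{pmatrix}
\tag{4.3}
\]

where $\mathit{width}$, $\mathit{height}$ [m] represent the chip size of the camera in meters, $f$ [m] is the focal length of the camera lens, and $t$ [m] is the length of the base line. This parameter must be set either to $t$ for the right or $-t$ for the left camera. $h$ [m] is the height over the surface.

\[
\vec{p}_i = KK_w \cdot W^{-1} \vec{p}' \tag{4.4}
\]

where $\vec{p}'$ is an edge point of the bounding box and $W$ is the world projection matrix (4.1). The pixel coordinate on the image $|\vec{p}_i|$ is then:

\[
|\vec{p}_i| =
\begin{pmatrix}
p_{ix} / p_{iw} \\
p_{iy} / p_{iw}
\end{pmatrix}
\tag{4.5}
\]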
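The projection chain (4.3) to (4.5) can be sketched in plain Python. The function names are hypothetical, the inverse of W is expected to be supplied by the caller, and the sensor values in the usage note are made up for illustration.

```python
def kk_w(width, height, f, nu, nv, t, h):
    """Extended camera matrix of Eq. (4.3)."""
    sx = f * nu / width
    sy = f * nv / height
    return [
        [sx, 0.0, -nu / 2.0, t * sx],
        [0.0, sy, -nv / 2.0, -h * sy],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, -1.0, 0.0],
    ]


def mat_vec4(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def project(kkw, w_inv, p_world):
    """Eq. (4.4) and (4.5): world point -> pixel coordinates."""
    homogeneous = [p_world[0], p_world[1], p_world[2], 1.0]
    px, py, _, pw = mat_vec4(kkw, mat_vec4(w_inv, homogeneous))
    if pw == 0.0:
        raise ValueError("point lies in the camera plane")
    return (px / pw, py / pw)
```

With `t = 0`, `h = 0` and an identity `w_inv`, a point straight ahead on the optical axis, such as `(0, 0, -10)`, lands on the image centre `(nu / 2, nv / 2)`.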
4.2.4 Overlap removing
In order to remove overlap between the boxes found, we sort the list $BL = \{b_0, b_1, .., b_i, .., b_{n-1}\}$ containing all boxes by the mid point of the box on the image plane. Each box $b_i \in \{b_{n-1}, .., b_0\}$ is compared with the boxes $\{b_{i-1}, .., b_0\}$ to find the biggest overlap ratio. If the overlap ratio is bigger than a certain threshold, we merge the two boxes and delete $b_i$. For the label mode we choose a high threshold to provide many but not all boxes. This makes it more intuitive to choose them with the mouse. In the detection mode a low threshold merges all detections into one per vehicle.
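A compact sketch of this merging loop is given below. Several details are assumptions, since the text does not fix them: boxes are axis-aligned image rectangles `(x0, y0, x1, y1)`, the overlap ratio is the intersection area divided by the area of the smaller box, merging takes the enclosing rectangle, and sorting uses the mid point's x coordinate first.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_ratio(a, b):
    """Assumed definition: intersection area / area of the smaller box."""
    ix = min(a[2], b[2]) - max(a[0], b[0])
    iy = min(a[3], b[3]) - max(a[1], b[1])
    if ix <= 0 or iy <= 0:
        return 0.0
    smaller = min(area(a), area(b))
    return ix * iy / smaller if smaller > 0 else 0.0


def merge(a, b):
    """Assumed merge rule: smallest rectangle enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def remove_overlap(boxes, threshold):
    """Walk from b_{n-1} down to b_1, merging b_i into its best partner."""
    boxes = sorted(boxes, key=lambda b: ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2))
    i = len(boxes) - 1
    while i > 0:
        best = max(range(i), key=lambda j: overlap_ratio(boxes[i], boxes[j]))
        if overlap_ratio(boxes[i], boxes[best]) > threshold:
            boxes[best] = merge(boxes[i], boxes[best])
            del boxes[i]
        i -= 1
    return boxes
```

A high threshold leaves most boxes untouched, as wanted in the label mode, while a low threshold collapses nearby detections into one box.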
4.3 Classification
We use an SVM classifier to make the final decision whether the bounding box contains a vehicle or not. The classifier is trained with SVMFu [24].
4.3.1 Sample extraction and labeling
We provide an easy way to extract training and test data sets from the image sequences. Our system makes it possible to label bounding boxes; their content can then be saved. A separate program transforms the saved images into an SVMFu training or test set file. With this approach we can try several different settings for the transformation, as well as different numbers of dimensions. To make the data sets grow quickly, we provide a method to extract slightly shifted and scaled patches from one bounding box. In addition, this should help to achieve invariance to these transformations. To extract negative samples we implemented a random patch extraction, which can be used if no vehicles occur in the frame.
4.3.2 Wavelet transformation
In our system we use wavelet coefficients as the feature vectors of vehicles. As shown in [21], wavelets provide a compact representation of image features. Further information about wavelets can be found in [26]. To avoid numerical difficulties in the training stage we apply histogram equalization to the gray value patch before we transform it. The average illumination of the patch does not affect the classification result, so we discard it from the wavelet response. To improve classification speed, we also test the performance using only coefficients above a certain threshold. In addition, we perform experiments with absolute wavelet coefficients, as in [21]. With this method the filter response of the wavelet is treated the same way for dark regions on a bright background as for bright regions on a dark background. The whole transformation chain is shown in Fig. 4-6.
Extraction → HE → DWT → T → ABS → SVM
Figure 4-6: The transformation steps from the gray value patch to the input vector of the SVM. All transformations are optional.
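The last steps of the chain (discarding the average, T and ABS) can be sketched as a small post-processing function on a 2D array of wavelet coefficients. The sketch assumes that the coefficient at position `[0][0]` holds the patch average and that coefficients below the threshold are set to zero; both are illustrative assumptions, as is the function name.

```python
def to_feature_vector(coefficients, threshold=None, use_abs=False):
    """Turn a 2D coefficient array into an SVM input vector.

    Assumptions: coefficients[0][0] is the patch average and is dropped;
    coefficients with magnitude below `threshold` are zeroed.
    """
    flat = [c for row in coefficients for c in row]
    flat = flat[1:]  # discard the average illumination
    if threshold is not None:
        flat = [c if abs(c) >= threshold else 0.0 for c in flat]  # step T
    if use_abs:
        flat = [abs(c) for c in flat]  # step ABS
    return flat
```

Because every step is optional, calling the function with its defaults only flattens the array and removes the average.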
4.3.3 SVM kernel
The Gaussian kernel (4.7) maps the samples nonlinearly into a high-dimensional space, so it can handle a decision function which is not a linear function of the data and labels [3]. Since we assume that this case applies, we decided to use a Gaussian kernel in our system. The decision function (4.6) is the same as (3.7) in section 3.5.

\[
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right) \tag{4.6}
\]

\[
K(x_i, x) = e^{-\|x_i - x\|^2 / 2\sigma^2} \tag{4.7}
\]

where $x_i$ is one of the support vectors and $x$ is the test sample.
4.3.4 Kernel performance evaluation
In order to analyze the classification results of the SVM kernels we use Receiver Operating Characteristic (ROC) curves to visualize the relation between true and false positives. The axes of an ROC curve are the true positive rate (the number of true positives divided by the total number of positives in the test set) and the false positive rate (the number of false positives divided by the total number of negatives). The Area Under the Curve (AUC) gives us a scalar measure of the performance. The AUC value is also used to evaluate the results in the grid search (4.3.5).
Figure 4-7: ROC curve example with important features.
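A minimal sketch of how an ROC curve and its AUC can be computed from classifier scores is shown below. The function names are hypothetical; labels are assumed to be +1 or -1, both classes must be present, and tied scores are not given special treatment.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (false positive rate, true positive rate).

    The decision threshold is lowered step by step through the sorted scores.
    """
    positives = sum(1 for label in labels if label > 0)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label > 0:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        total += (x1 - x0) * (y0 + y1) / 2.0
    return total
```

A classifier that ranks every positive above every negative reaches an AUC of 1.0, while random scores give a value around 0.5.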
4.3.5 Kernel parameter optimization
In order to optimize the performance of the kernels we choose the grid search method shown in [16]. Grid search is more or less a brute force approach which is computationally expensive, but the training processes can easily be parallelized. Using this technique we train several kernels with different values of (C, σ), where C is the cost (weight) of each vector from the training set and σ is the parameter of the Gaussian kernel (4.7). As recommended in [16], we try exponentially growing sequences of (C, σ). For each kernel the area under the curve is computed on a given test set. The results are plotted as a surface with contours (Fig. 4-8). The results can be improved by refining the search around a good region, in our example $C = 2^5 ... 2^{10}$, $\sigma = 2^0 ... 2^{10}$.
Figure 4-8: Plot shows the computed area under the curve (AUC) for Gaussian kernels over (C, σ), with axes lg(C) and lg(sigma). The kernels were trained with wavelet coefficients from histogram equalized gray value samples.
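The search loop itself is short. In the sketch below, `evaluate` is a hypothetical callable that stands in for "train a Gaussian kernel with (C, σ) and return its AUC on the test set"; it is not part of SVMFu or of the tools described here, and the exponent ranges are only examples.

```python
def grid_search(evaluate, c_exponents, sigma_exponents):
    """Try C = 2**ec and sigma = 2**es for all exponent pairs.

    `evaluate(C, sigma)` must return a score such as the AUC.
    Returns (best_score, best_c_exponent, best_sigma_exponent).
    """
    best = None
    for ec in c_exponents:
        for es in sigma_exponents:
            score = evaluate(2.0 ** ec, 2.0 ** es)
            if best is None or score > best[0]:
                best = (score, ec, es)
    return best
```

A coarse pass such as `grid_search(evaluate, range(-10, 10), range(-10, 10))` can be followed by a finer pass around the best exponents. Each call to `evaluate` is independent of the others, which is why the training runs parallelize easily.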
Chapter 5
Results
5.1 Bounding box detection
The bounding box detection provides possible occurrences of vehicles in the world coordinate system (Fig. 5-1). For the labeling process it proved useful to apply the overlap removal (4.2.4) before the user manually chooses the class. As a further feature, the current kernel can be used to pre-classify the boxes, which makes it easy to find false positives and false negatives and include them in the training set. In addition we provide a bootstrap method to add false positives from frames without vehicles. In our experience this makes it easy to improve the performance of the classifier. For vehicle detection we first classify all predicted boxes and then merge only the positives, which is slower than merging them first but gives better results.
Figure 5-1: Sample screen shots for detected bounding boxes. In the lower left corner, the content of the marked bounding box (dark blue) is shown.
The bounding box detection shows poor results if the target vehicle is in the far
distance or rotated in reference to the world coordinate system. If we only have depth
information on one side of the vehicle, the detection fails. Under normal illumination
conditions, we usually have several slightly shifted and scaled boxes on the object.
5.2 Classification results
As described in 4.3, we tried several input vector modifications such as histogram
equalization and compressed coefficients for the training process. For the optimization
of the Gaussian kernel we used the grid search method.
5.2.1 Training set
For the training of the SVM we use labeled gray value patches (Fig. 5-2). The images are taken under natural weather and illumination conditions. The rough bounding box estimation evidently causes slightly scaled and shifted views of vehicles.
Table 5.1: Different training and test sets used to evaluate the performance.

Type       Positives   Negatives   Total
Training   311         2284        2595
Training   2853        3186        6039
Test       212         2216        2428
Figure 5-2: A view of samples from our training set. (a) Images from the vehicle class. (b) Typical non-vehicles.
5.2.2 Grid search
The results of the grid search were plotted as a contour of the AUC values. Fig. 5-3
shows some of the calculated results.
[Contour plots (a) to (d): AUC values over lg(C) on the horizontal axis and lg(sigma) on the vertical axis.]
Figure 5-3: Example of contour line plot for grid search with Gaussian kernels. All kernels are trained with histogram equalized Haar wavelet coefficients.
Table 5.2: Several trained kernels for different ranges of (C, σ). All kernels are trained on Haar wavelet coefficients of histogram normalized samples and tested with the same test set.

Figure   C            σ            Best AUC                      Type     Training p/n  Test p/n
5-3 (a)  2^-10..2^9   2^-10..2^9   0.942 (C = 2^7, σ = 2^3)      dense    2853/3186     212/2216
5-3 (b)  2^5..2^14    2^2..2^7     0.94  (C = 2^5, σ = 2^3.5)    dense    2853/3186     212/2216
5-3 (c)  2^-5..2^15   2^-14..2^3   0.949 (C = 2^3, σ = 2^2.2)    sparseN  311/2284      212/2216
5-3 (d)  2^-5..2^15   2^3..2^8     0.935 (C = 2^5, σ = 2^3)      sparseN  311/2284      212/2216
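The search over (C, σ) can be sketched as an exhaustive loop over a grid of powers of two. This is a minimal illustration, not our implementation: `evaluate_auc` is a hypothetical callback that is assumed to train a Gaussian kernel SVM with the given parameters and return the AUC on the test set, and the unit step in the exponent is an assumption as well.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def log2_grid(lo: float, hi: float, step: float = 1.0) -> List[float]:
    """Return 2**e for the exponents e = lo, lo + step, ..., hi (inclusive)."""
    n = int(round((hi - lo) / step))
    return [2.0 ** (lo + i * step) for i in range(n + 1)]


def grid_search(
    evaluate_auc: Callable[[float, float], float],
    c_values: Iterable[float],
    sigma_values: Iterable[float],
) -> Tuple[float, float, float]:
    """Return (best_auc, C, sigma) over all (C, sigma) pairs of the grid."""
    best = (-1.0, 0.0, 0.0)
    for c, sigma in product(c_values, sigma_values):
        # Hypothetical callback: train with (C, sigma), then score the test set.
        auc = evaluate_auc(c, sigma)
        if auc > best[0]:
            best = (auc, c, sigma)
    return best
```

For the ranges of Fig. 5-3 (a), the call would be `grid_search(evaluate_auc, log2_grid(-10, 9), log2_grid(-10, 9))`. A second pass with a smaller `step` around the best pair gives a finer grid.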
5.2.3 Classification performance
To analyze the classification results of the SVM kernels, we use Receiver Operating
Characteristic (ROC) curves, which plot the true positive rate against the false
positive rate. The Area Under the Curve (AUC) again gives us a scalar measure of
the performance. The AUC value is also used to evaluate the results of the grid
search.
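As an illustration of the measure, the AUC can be computed directly from the classifier responses without drawing the curve. The sketch below uses the rank formulation, in which the AUC equals the fraction of (positive, negative) pairs that are ordered correctly. The function name `auc` and the label convention (positive labels greater than 0) are assumptions for this example.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of samples, which is still cheap for a test set with 212 positives and 2216 negatives.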
Table 5.3: Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images. SVs is the number of support vectors and "Training p/n" is the number of positives and negatives in the training set. The last column shows the time needed to classify the test set. A complete listing of all good kernels can be found in Tables A.2 and A.3.
The experiments show that, in general, many support vectors are needed to
represent the class, which tells us that the classes in the training set are hard to
separate. We applied a threshold to obtain a compressed representation of the image
patch, so that we can use a sparse SVM. All experiments with the sparse method
used a threshold of 0.1 on the coefficients. The sparse method is fast, but the results
are less dependable. We ran only a few experiments with other types of wavelets,
since the test with the Spline wavelet shows no major improvement in the results.
The best results were obtained with absolute Haar wavelet coefficients of histogram
normalized gray value samples. This holds for the dense SVM as well as the sparse
one.
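The compression by thresholding can be sketched as follows. We assume here that a coefficient is kept when its magnitude exceeds the threshold; the index-to-value dictionary and the function names `to_sparse` and `squared_distance` are hypothetical choices for this example and do not describe the data structures of our classifier.

```python
from typing import Dict, Sequence


def to_sparse(coefficients: Sequence[float], threshold: float = 0.1) -> Dict[int, float]:
    """Keep only coefficients whose magnitude exceeds the threshold.

    The result maps coefficient index -> value, so kernel evaluations only
    need to touch the retained entries.
    """
    return {i: c for i, c in enumerate(coefficients) if abs(c) > threshold}


def squared_distance(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)
```

The squared distance is the only quantity a Gaussian kernel needs from its two arguments, so the cost of one kernel evaluation drops with the number of retained coefficients.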
1 No histogram equalization
2 Spline 3 7 wavelet
3 Absolute values of wavelet coefficients
4 False positives added as negatives to the training set using bootstrap method
The following ROC curves (Figs. 5-4 to 5-6) show the performance of Gaussian kernels.
Figure 5-4: ROC curve for best Gaussian kernels from the first training set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients.
[Plot: ROC curve for sparse Gaussian kernels (threshold 0.1). Horizontal axis: False Positive Rate; vertical axis: True Positive Rate; both from 0 to 1.]
Figure 5-5: ROC curve for best Gaussian kernels, with sparse kernels using values above threshold 0.1. The Gaussian kernels were trained on the first set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients.
Figure 5-6: ROC curves for best Gaussian kernels from the last training set containing 6424 samples. Absolute values of Haar wavelet coefficients from histogram equalized gray values. (a) Dense kernel using bootstrap to add false positives to the training set; (b) dense kernel; (c) sparse kernel threshold 0.1.
5.3 System detection rate
To measure the overall detection rate, we count the vehicles in the frames for
which depth information is available. We also try to count the cars that are located
in the interest area but for which no depth information is available. To judge
whether a vehicle is located in the interest area, we project the non-clipped 3D
points onto the image of the left camera, as shown in Figure 5-7.
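The projection can be sketched with an ideal pinhole model. The sketch assumes that the 3D points are already expressed in the frame of the left camera (x to the right, y downward, z forward) and that lens distortion is negligible. The focal length `focal_px` and the principal point (`cx`, `cy`) are hypothetical calibration parameters, not values from our setup.

```python
from typing import Iterable, List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_point(point: Point3D, focal_px: float, cx: float, cy: float) -> Optional[Pixel]:
    """Project a camera-frame 3D point (x right, y down, z forward) to pixels."""
    x, y, z = point
    if z <= 0.0:  # behind the camera: not visible
        return None
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def project_points(
    points: Iterable[Point3D],
    focal_px: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Pixel]:
    """Project all points and keep those that fall inside the image."""
    pixels = []
    for p in points:
        uv = project_point(p, focal_px, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            pixels.append(uv)
    return pixels
```

Drawing the returned pixels over the left image gives an overlay of the kind shown in the figure.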
Figure 5-7: Image of the left camera with corresponding depth data. This visualization makes it possible to count vehicles which occur in the interest area.
We ran two experiments, with distances of 50 and 100 meters. As expected, the
detection rate is better for the closer distance.
Table 5.4: Results for whole system detection rate, with interest area 100 meters in front of the vehicle.
Our system displays each bounding box that is classified as a vehicle as a green
box in the image of the left camera (Fig. 5-8). In addition, the approximate relative
position of the detected vehicle is displayed as a horizontal bar on the right side of
the window. The vertical bar on the right side represents the response of the SVM;
the higher the bar, the more reliable the decision.
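The response shown by the vertical bar is the value of the SVM decision function before the sign is taken. The following sketch assumes the common Gaussian kernel parametrization K(x, y) = exp(-||x - y||^2 / (2 σ^2)); the names `gaussian_kernel` and `svm_response` are hypothetical, and `coefficients` stands for the products of the multipliers and labels of the support vectors.

```python
import math
from typing import Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Gaussian kernel, assuming the form exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def svm_response(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    coefficients: Sequence[float],
    bias: float,
    sigma: float,
) -> float:
    """f(x) = sum_i coef_i * K(sv_i, x) + bias, with coef_i = alpha_i * y_i.

    The sign of f(x) is the predicted class; the magnitude can be read as a
    measure of how far x lies from the decision boundary.
    """
    total = sum(c * gaussian_kernel(sv, x, sigma) for sv, c in zip(support_vectors, coefficients))
    return total + bias
```

The loop also makes the cost visible: one kernel evaluation per support vector, so the classification time grows linearly with the number of support vectors.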
1 Rate relative to the counted vehicles.
Figure 5-8: Sample screen shots of the system in detection mode. The relative position of the detected vehicle is shown on the right side. The reliability of the classification is displayed as a vertical bar on the right side. The right column shows typical errors of the system, from top to bottom: false positive, back side of a vehicle classified as positive, and missed vehicle.
Finally, we present some time measurements and statistics of the detection and
classification in our system (Table 5.6). Once the system is integrated in the test
vehicle, the times for loading the data from the hard drive no longer apply.
Table 5.6: Statistics of the system.

Kernel: equalized Haar Abs 32x32   count   time [ms]
Frames                             2634    -
Time to load images                -       28.83
Time to load 3D data               -       81.71
Points per frame                   1602    -
Boxes per frame                    2       -
Time to find boxes                 -       21.14
Time to classify boxes 1           -       61.90
Time to remeasure boxes 1          -       2.84
Overall time per frame             -       302.00

1 Time per bounding box
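The entries of Table 5.6 can be combined into a rough per-frame budget. The short calculation below only adds up the measured values; it assumes that the overall time per frame includes the two loading steps and that "Boxes per frame" is a rounded average, so the sums are estimates.

```python
# Values copied from Table 5.6 (milliseconds).
LOAD_IMAGES_MS = 28.83
LOAD_3D_MS = 81.71
FIND_BOXES_MS = 21.14
CLASSIFY_PER_BOX_MS = 61.90
REMEASURE_PER_BOX_MS = 2.84
BOXES_PER_FRAME = 2
OVERALL_MS = 302.00

# Hard drive access, which no longer applies inside the test vehicle.
loading_ms = LOAD_IMAGES_MS + LOAD_3D_MS
# Classification and remeasurement are given per bounding box.
per_box_ms = CLASSIFY_PER_BOX_MS + REMEASURE_PER_BOX_MS
# Sum of all itemized steps for an average frame.
itemized_ms = loading_ms + FIND_BOXES_MS + BOXES_PER_FRAME * per_box_ms
# Estimated frame time without loading from the hard drive.
without_loading_ms = OVERALL_MS - loading_ms

print(f"loading: {loading_ms:.2f} ms, itemized: {itemized_ms:.2f} ms")
print(f"without loading: {without_loading_ms:.2f} ms per frame")
```

Under these assumptions, loading accounts for about 110.54 ms and the itemized steps for about 261.16 ms of the 302.00 ms per frame. Without loading, roughly 191.46 ms per frame would remain, which corresponds to about 5 frames per second.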
5.4 Discussion
This project shows that stereo vision provides a fast way to predict probable
vehicle locations in street scenes. This opens up the opportunity to use more complex
classifiers. However, the tests show that our bounding box detection is not as
dependable as expected. This weak point affects the detection rate of the whole
system. There are several ways to improve the detection:
• Apply an advanced technique for compensation of dynamic changes in camera
pitch and height during driving [10].
• Use lane detection to make a better approximation of possible vehicle
occurrence.
• Track good boxes over time. We can use a Kalman filter to predict bounding
boxes between frames.
• Use clustering methods to improve the bounding box accuracy.
• Improve stereo vision accuracy by using filters to reduce noise and outliers.
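The tracking suggestion in the list above can be made concrete with a small sketch of a Kalman filter for one box coordinate, for example the lateral position of the box center. It assumes a constant-velocity motion model, a position-only measurement and simplified noise parameters `q` and `r`. The class `Track1D` is hypothetical and is not part of the system described in this thesis.

```python
from dataclasses import dataclass


@dataclass
class Track1D:
    """Constant-velocity Kalman filter for one bounding box coordinate."""

    pos: float
    vel: float = 0.0
    # Entries of the 2x2 state covariance matrix P.
    p00: float = 1.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0

    def predict(self, dt: float, q: float = 0.01) -> None:
        """Propagate the state by dt: x = F x, P = F P F^T + q I."""
        self.pos += self.vel * dt
        a = self.p00 + dt * self.p10
        b = self.p01 + dt * self.p11
        self.p00 = a + dt * b + q
        self.p01 = b
        self.p10 = self.p10 + dt * self.p11
        self.p11 = self.p11 + q

    def update(self, z: float, r: float = 1.0) -> None:
        """Correct the state with a position measurement z of variance r."""
        s = self.p00 + r                      # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        y = z - self.pos                      # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        p00, p01 = self.p00, self.p01
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p10 -= k1 * p00
        self.p11 -= k1 * p01
```

A tracker would call `predict` once per frame and `update` whenever a detected box is associated with the track. In frames without a detection, the predicted position alone can serve as the bounding box estimate.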
We have also shown how the supervised learning process can be supported by a stereo
system. Grid search is a straightforward way to optimize the training parameters of
the Gaussian kernels, even if the training set is hard to separate. With grid search
and several transformations of the SVM input vector, we were able to raise the Area
Under the Curve (AUC) to 99.6%. Furthermore, the system provides functions to
increase the accuracy of the SVM classifier incrementally. This makes it much easier
to obtain large training sets from an image sequence.
Chapter 6
Conclusion
In conclusion, we built a system that detects oncoming vehicles, using depth
information from stereo vision image pairs to predict where such vehicles may occur.
Consequently, we can determine where and at which scale the vehicle is located in
the images. Although the prediction is not dependable on its own, it makes it
possible to apply a relatively slow global SVM classifier to a smaller subregion of
the image. This reduces the computational cost compared to shifting a classifier over
the whole translation and scale space. Furthermore, we compared several kinds of
SVM input vector transformations, such as histogram equalization and Haar
wavelets, regarding their ability to improve the performance of the classifier. Since
the Gaussian SVM kernel showed the best results in early tests, we performed all
experiments with this kernel. In this thesis we presented a novel way to combine
stereo vision with SVM classification. In addition, we provided a short introduction
to statistical learning and stereo vision.
6.1 Future work
There are many ways to improve the system detection rate in the future. A major
improvement would be a flexible region of interest in front of the vehicle. One aim
would be the integration of a dynamic pitch and height correction such as in [10].
We also consider using lane information to steer the bounding box detection.
Furthermore, we will study the possibility of using an SVM classifier on the x-z
projection of the 3D points for the bounding box detection. To improve the
detection rate of the SVM based part, a component based SVM classifier [15] could
be used. We propose experiments with the triangular kernel [8], which has
characteristics similar to the Gaussian kernel but no parameter σ, which could make
the grid search unnecessary. We suggest taking a closer look at different kinds of
wavelets, in particular masked wavelets [25]. To increase the system speed, we
recommend studying feature selection techniques and PCA [15] to obtain a sparse
representation of the gray value patches.
Appendix A
Tables
Table A.1: List of Terms and Acronyms

Abs   Absolute value
AUC   Area Under the Curve (for ROC)
DWT   Discrete Wavelet Transformation
HE    Histogram Equalization
p/n   positive count / negative count (in tables)
PCA   Principal Component Analysis
RBF   Radial Basis Function (SVM kernel)
ROC   Receiver Operating Characteristic
SV    Support Vector
SVM   Support Vector Machine
T     Threshold
Table A.2: Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.
Table A.3: Continued listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.
2 Spline 3 7 wavelet
3 Absolute values of wavelet coefficients
4 False positives added as negatives to the training set using bootstrap method
Table A.4: Table of libraries used in the project.

Name                  Type                 Internet address
CG                    NVIDIA Cg Toolkit    http://developer.nvidia.com/page/cg_main.html
nv_math               math library         http://cvs1.nvidia.com/LIBS/
ZImage                Image class library  -
ZClassifierSVMachine  SVM classifier       -
IniFile               Ini file class       http://inifile.sourceforge.net/
NeHeGL                OpenGL Window class  http://nehe.gamedev.net/
Wvlt                  Wavelet library      http://www.cs.ubc.ca/nest/imager/contributions/bobl/wvlt/top.html
Appendix B
Declaration
I hereby declare that I have written this diploma thesis independently. Only the
sources and aids explicitly named in the thesis were used. I have marked as such all
ideas that were taken over verbatim or in substance.
Place, date Signature
Bibliography
[1] Shivani Agarwal and Dan Roth. Learning a sparse representation for object
detection. In Proceedings of the 7th European Conference on Computer Vision,
volume 4, pages 113–130, 2002.
[2] J. Bouguet. Camera calibration toolbox for matlab. Technical report, Intel Corp.,
2001. Available at http://www.vision.caltech.edu/bouguetj/calib_doc/.
[3] Christopher J. C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[4] Andrew K. Chan and Cheng Peng. Wavelets for Sensing Technologies. Artech
House, October 2003.
[5] Chih-Chung Chang and Chih-Jen Lin. Libsvm. Technical report, Computer
Science and Information Engineering, National Taiwan University, Taipei 106,
Taiwan, 2003. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[6] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations
for object detection using kernel classifiers, 2000.
[7] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Statistical learn-
ing theory: A primer. International Journal of Computer Vision, 38(1):9–13,
2000.
[8] François Fleuret and Hichem Sahbi. Scale-invariance of support vector machines
based on the triangular kernel. Technical report, IMEDIA Research Group, INRIA,
Domaine de Voluceau, 78150 Le Chesnay, France, 2002.
[9] David A. Forsyth and Jean Ponce. Computer Vision A Modern Approach. Pren-
tice Hall, 2003.
[10] Uwe Franke, Dariu Gavrila, Steffen Görzig, Frank Lindner, Frank Paetzold, and
Christian Wöhler. Autonomous driving goes downtown. IEEE Intelligent Systems,
13(6):40–48, 1998.
[11] A. Gersho and R.M. Gray. Vector quantization and signal compression. Kluwer
Academic, 1991.
[12] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings
Alvey Vision Conference, pages 147–151, University of Manchester, 1988.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, 2001.
[14] B. Heisele, A. Verri, and T. Poggio. Learning and vision machines. Proceedings
of the IEEE, 90:1164–1177, 2002.
[15] Bernd Heisele, Thomas Serre, S. Prentice, and Tomaso Poggio. Hierarchical
classification and feature reduction for fast face detection with support vector
machines. Pattern Recognition, 36:2007–2017, 2003.
[16] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to
support vector classification. Technical report, Department of Computer Science
and Information Engineering, National Taiwan University, Taipei 106, Taiwan,
2003.
[17] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet