
Stereovision based vehicle classification using support vector machines

by Pascal Paysan

Submitted to the University of Applied Sciences, Fachhochschule Esslingen, Hochschule für Technik, Fachbereich Informationstechnik, Softwaretechnik, in partial fulfillment of the requirements for the degree of

Diplom-Ingenieur - Softwaretechnik

February 2004

accomplished at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, Center for Biological and Computational Learning, for DaimlerChrysler

Author: Pascal Paysan, February 28, 2004

Certified by: Jurgen Koch, Dr. rer. nat., Thesis Supervisor

Accepted by: Tomaso Poggio, Ph.D., Supervisor at CBCL

Accepted by: Stefan Gehrig, Dr., Supervisor at DaimlerChrysler


Stereovision based vehicle classification using support vector machines

by Pascal Paysan

Submitted to the University of Applied Sciences Fachhochschule Esslingen on February 28, 2004, in partial fulfillment of the requirements for the degree of Diplom-Ingenieur - Softwaretechnik

Abstract

The thesis studies the detection of oncoming vehicles in traffic scenes by using depth information. The image sequences in our experiments are captured by a pair of stereo cameras which are mounted in a test vehicle. The main difficulty is to build a system that runs in real time on a standard PC and performs accurate detection of vehicles even under unfavorable illumination and weather conditions. Robust object detection is a key function in vision guided, autonomous driving. The most relevant classes of objects for this type of application are vehicles, pedestrians, and traffic signs. This work focuses on the recognition of oncoming vehicles. Although stereo vision by itself is not reliable enough to perform accurate vehicle detection, it is useful to quickly generate object hypotheses which can then be verified by accurate pattern recognition techniques. The pattern recognition algorithm consists of an SVM trained on wavelet coefficients of histogram equalized frontal views of vehicles, similar to the technique described in [21]. Experiments show a detection rate of 63 %; the processing time for a 640x480 frame is about 300 ms. The work contains detailed statistics about the detection rate and the computing time. A novel way to combine stereo vision and SVM classifiers is introduced.

Thesis Supervisor: Jurgen Koch
Title: Dr. rer. nat.


Acknowledgments

First I want to thank Carsten Knoppel, who arranged the first contact with the CBCL. Thanks to Uwe Franke, Stefan Gehrig and DaimlerChrysler for technical and financial support. I want to thank Jerry Jun Yokono and Bernd Heisele for useful advice, and Stanley Bileschi for supporting me with image and classification libraries. I want to thank Professor Poggio for supervising my thesis at MIT and all the CBCL members for sharing their knowledge and for the friendly welcome. Thanks to Professor Jurgen Koch for supervising me at my home university in Esslingen. Last but not least I want to thank my family and friends for supporting me.


Contents

1 Introduction
    1.1 Motivation
    1.2 The Problem
2 Related work
3 Background
    3.1 UTA
    3.2 Standard Stereo
    3.3 Histogram equalization
    3.4 Wavelets
    3.5 Support vector machine classification
        3.5.1 SVM training
        3.5.2 Advantages of SVM based classifiers
        3.5.3 The kernel parameters C and σ
4 Procedure
    4.1 System
        4.1.1 Sequence loader
    4.2 Bounding box detection
        4.2.1 Area of interest
        4.2.2 Bounding box search
        4.2.3 Bounding box projection
        4.2.4 Overlap removing
    4.3 Classification
        4.3.1 Sample extraction and labeling
        4.3.2 Wavelet transformation
        4.3.3 SVM kernel
        4.3.4 Kernel performance evaluation
        4.3.5 Kernel parameter optimization
5 Results
    5.1 Bounding box detection
    5.2 Classification results
        5.2.1 Training set
        5.2.2 Grid search
        5.2.3 Classification performance
    5.3 System detection rate
    5.4 Discussion
6 Conclusion
    6.1 Future work
A Tables
B Erklärung (Agreement)


List of Figures

3-1 Schematic illustration of standard stereo geometry, where C0 and C1 are the focal points and P is the intersection in 3D space of rays through the correlating features. The camera coordinate system (X,Y,Z) is placed in the middle of the baseline, translated by -height.

3-2 Example of histogram equalization. (a) Source image. (b) Histogram equalized image.

3-3 Scaling function.

3-4 Wavelet function.

3-5 2D scaling and wavelet functions.

3-6 Examples for different classifiers. Visualized with the GUI for LibSVM [5].

4-1 Schematic illustration of the detection system.

4-2 Collaboration diagram for the SequenceLoader class. This class connects all important components.

4-3 Area of interest in 3D space with right lane mark.

4-4 (a) Schematic illustration of bounding box detection on the projected surface. Gray values represent the count of 3D points. (b) Geometrical constraints for validation. Width and height as well as α and β must be in a given range. The height is measured using the minimum and maximum y values of the 3D points within the search window.

4-5 Detected bounding boxes on a truck in 3D space.

4-6 The transformation steps from the gray value patch to the input vector of the SVM. All transformations are optional.

4-7 ROC curve example with important features.

4-8 Plot of the computed area under the curve for Gaussian kernels over (C, σ). The kernels were trained with wavelet coefficients from histogram equalized gray value samples.

5-1 Sample screen shots for detected bounding boxes. In the lower left corner, the content of the marked bounding box (dark blue) is shown.

5-2 A view of samples from our training set. (a) Images from the vehicle class. (b) Typical non-vehicles.

5-3 Example of a contour line plot for grid search with Gaussian kernels. All kernels are trained with histogram equalized Haar wavelet coefficients.

5-4 ROC curve for the best Gaussian kernels from the first training set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients.

5-5 ROC curve for the best Gaussian kernels, with sparse kernels using values above threshold 0.1. The Gaussian kernels were trained on the first set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients.

5-6 ROC curves for the best Gaussian kernels from the last training set containing 6424 samples. Absolute values of Haar wavelet coefficients from histogram equalized gray values. (a) Dense kernel using bootstrap to add false positives to the training set; (b) dense kernel; (c) sparse kernel, threshold 0.1.

5-7 Image of the left camera with corresponding depth data. This visualization makes it possible to count vehicles which occur in the interest area.

5-8 Sample screen shots of the system in the detection mode. On the right side, the relative position of the detected vehicle is shown. The readability of the classification is displayed as a vertical bar on the right side. In the right column we show the typical errors of the system. The errors from top to bottom: false positive, backside of vehicle as positive, and missed vehicle.


List of Tables

4.1 Classes used by the SequenceLoader.

5.1 Different training and test sets used to evaluate the performance.

5.2 Several trained kernels for different ranges of (C, σ). All kernels are trained on Haar wavelet coefficients of histogram normalized samples and tested with the same test set.

5.3 Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images. SVs is the number of support vectors and "Training p/n" is the number of positives and negatives in the training set. The last column shows the time needed to classify the test set. A complete listing of all good kernels can be found in A.2 and A.3.

5.4 Results for whole system detection rate, with interest area 100 meters in front of the vehicle.

5.5 Results for whole system detection rate, with interest area 50 meters in front of the vehicle.

5.6 Statistics of the system.

A.1 List of Terms and Acronyms.

A.2 Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.

A.3 Sequel listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.

A.4 Table of libraries used in the project.


Chapter 1

Introduction

This project studies the detection of oncoming vehicles in traffic scenes using depth information. The image sequences in our experiments are captured by a pair of stereo cameras mounted in a test vehicle. These images are preprocessed by a standard stereo system to obtain 3D information for extracted interest points. The stereo system itself is outside the scope of this work, although we give a short introduction to it. The 3D data is used to find rough positions of bounding boxes around close objects. These bounding boxes are used to extract example images for training the SVM as well as in the detection stage. The gray level patch bordered by the bounding box is scaled to the preferred detection size using linear interpolation. For the final classification the patch is histogram equalized and transformed using different wavelet transformations to obtain the SVM input vector. Finally, a program was developed to compare the classification results and to optimize the parameters of the SVM using grid search with Gaussian kernels.

1.1 Motivation

Robust object detection is a key technique for understanding the environment and a step towards the intelligent vehicle. As driver assistance systems such as the Distronic automatic distance cruise control have become more popular for improving safety and comfort, the need for new technologies with improved detection is growing. Optical systems provide an opportunity to solve the task of environment interpretation. A major theme in this area is the detection of oncoming vehicles, for which both detection time and robustness are fundamental. This makes it necessary to use more accurate classifiers to verify the results of fast classifiers. Statistical learning combined with stereo vision provides a technology to solve this task.

1.2 The Problem

The main difficulty of the vehicle detection task is the variety of shapes and colors of vehicles. Unfavorable illumination and weather conditions are another challenge. Reflections also affect the reliability of local descriptors. Furthermore, vehicles can occur in images with many degrees of freedom in scale and translation. In our case the detection time on a standard PC is important too, because the future goal is to implement a real-time system which supports the driver.


Chapter 2

Related work

Much work related to vehicle detection has been done. In [20] Constantine P. Papageorgiou and Tomaso Poggio introduced an approach which is very similar to our SVM classification. The approach showed good results. Unlike our system, they used no 3D information from a stereo vision system, so they had to apply the SVM classifier at different scales and on the whole image. The computational cost of our method should therefore be lower. In [1] an interesting approach based on an advanced feature selection method was proposed. Their system builds a "vocabulary" of information-rich features. The classifier is then trained on the feature vectors from this "vocabulary". They mention 8 seconds as the average time to test a 200x150 pixel image, which is still too slow for our purpose. Another advantage of using stereo vision is that it provides the position of the detected object relative to the vehicle in which the cameras are mounted. This information is usable for other tasks, for example to calculate the relative movement. In [19] a stereo based method is shown. In contrast to our system, they use symmetry to refine and measure the correctness of the bounding boxes found. This approach might not work as well in downtown scenarios, where symmetries can also occur on other objects, e.g. on houses or guardrails of bridges.


Chapter 3

Background

3.1 UTA

Urban Traffic Assistant (UTA) is a current research project at DaimlerChrysler exploring the possibilities of using machine vision and image understanding to increase safety and comfort in downtown scenarios [10].

The most important perception tasks that have to be solved for this project are:

• The leading vehicle must be detected and its distance, speed and acceleration

must be estimated in longitudinal and lateral directions.

• The course of the lane must be extracted even if it is not given by well painted

markings.

• Small traffic signs and traffic lights have to be detected and recognized in a

highly colored environment.

• Additional traffic participants such as oncoming vehicles, bicyclists or

pedestrians must be detected and classified.

• Stationary obstacles that limit the available free space, e.g. parked cars, must

be detected.

This work is part of the UTA project and shows a new approach that combines stereo measurement with SVM classifiers to extract position estimation and classification results at the same time.

3.2 Standard Stereo

A computational theory of human stereo vision has been proposed by T. Poggio [18]. We use depth data from a standard stereo system. Standard stereo is a technique in which the images are projected onto a single plane in such a way that the epipolar lines coincide with the image rows. The advantage of this method is that the correlating features can be searched efficiently along a row. The cameras are calibrated by the method introduced by Bouguet [2]. During the calibration process the internal as well as the external parameters of the cameras are calculated. These parameters are used in the rectification process to correct lens-dependent distortion and to project the images. On one of these images an interest operator, such as the Harris corner detector [12], is applied to extract interest points. Afterwards, for each interest point, a correlating feature is matched in the other image using a technique such as the sum of squared differences (SSD). From the correlating features the disparity is calculated and then used to determine the 3D point. In our case we compute the points using the matrix $KK^{-1}$

$$
KK^{-1} =
\begin{pmatrix}
\frac{width}{n_u} & 0 & -\frac{width}{2} \\
0 & \frac{height}{n_v} & -\frac{height}{2} \\
0 & 0 & f
\end{pmatrix}
\tag{3.1}
$$

where $width$, $height$ is the chip size in m, $f$ is the focal length in m, and $n_u$, $n_v$ is the image size in pixels.

The 3D point is then calculated by solving the equation

$$
\vec{C}_0 + \lambda_0 \, KK^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
= \vec{C}_1 + \lambda_1 \, KK^{-1} \begin{pmatrix} u \\ v - d \\ 1 \end{pmatrix}
\tag{3.2}
$$

where $\vec{C}_0$ and $\vec{C}_1$ are the focal points of the cameras relative to the road surface, $u$, $v$ are the pixel coordinates of the feature, and $d$ is the disparity of the matched feature (Fig. 3-1).

Figure 3-1: Schematic illustration of standard stereo geometry, where C0 and C1 are the focal points and P is the intersection in 3D space of rays through the correlating features. The camera coordinate system (X,Y,Z) is placed in the middle of the baseline, translated by -height.

A more detailed description can be found in [22, 9].
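Equation (3.2) can be illustrated with a short Python sketch. It is not the stereo system used in this work: it assumes that, because measured rays rarely intersect exactly, the point is taken as the midpoint of the shortest segment between the two rays. The function names `pixel_ray`, `dot` and `triangulate` are hypothetical. In the setting of (3.2), the first ray would be built from (u, v) and the second from (u, v - d).

```python
def pixel_ray(u, v, width, height, f, nu, nv):
    """Viewing-ray direction KK^-1 (u, v, 1)^T, following Eq. (3.1)."""
    return (width / nu * u - width / 2.0,
            height / nv * v - height / 2.0,
            f)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate(c0, r0, c1, r1):
    """Midpoint of the shortest segment between c0 + l0*r0 and c1 + l1*r1.

    Assumption: least-squares solution for (l0, l1); Eq. (3.2) states the
    exact intersection, which noisy measurements rarely provide.
    """
    w = [p - q for p, q in zip(c0, c1)]
    a, b, c = dot(r0, r0), dot(r0, r1), dot(r1, r1)
    d, e = dot(r0, w), dot(r1, w)
    den = a * c - b * b          # squared length of the cross product r0 x r1
    if den <= 1e-12 * a * c:     # relative test, independent of ray scale
        raise ValueError("rays are (almost) parallel, depth is undefined")
    l0 = (b * e - c * d) / den
    l1 = (a * e - b * d) / den
    p0 = [p + l0 * r for p, r in zip(c0, r0)]
    p1 = [p + l1 * r for p, r in zip(c1, r1)]
    return tuple((x + y) / 2.0 for x, y in zip(p0, p1))
```

For two rays that really pass through the same point, the function returns that point; for slightly inconsistent rays it returns a compromise between them.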

3.3 Histogram equalization

Histogram equalization (HE) changes the distribution of the gray values over the image. It uses the histogram to determine how often each gray value occurs and then maps the gray values such that their distribution fits the desired function. As shown in [6], HE increases the performance of face and people classification, which is comparable to our task. HE is used for image compression because it is a form of vector quantization [11]. If we assume that HE leads to a more compressed representation of the image, then its effect is roughly the same as that of using a larger training set. HE also provides better illumination invariance, which is another reason why it improves the performance.

Figure 3-2: Example of histogram equalization. (a) Source image. (b) Histogram equalized image.
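The mapping can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the system: it assumes that the desired distribution is uniform, that the image is given as a flat list of integer gray values, and the function name `equalize_histogram` is hypothetical.

```python
def equalize_histogram(pixels, levels=256):
    """Map gray values so that their cumulative distribution becomes
    approximately linear (uniform target distribution, an assumption)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative histogram.
    cdf = []
    total = 0
    for count in hist:
        total += count
        cdf.append(total)

    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: nothing to spread out.
        return list(pixels)

    # Lookup table; levels below the darkest occupied one are clamped to 0.
    lut = [max(0, round((c - cdf_min) / (n - cdf_min) * (levels - 1)))
           for c in cdf]
    return [lut[p] for p in pixels]
```

The darkest occupied gray value is mapped to 0 and the brightest to `levels - 1`, so a low-contrast patch uses the full gray value range afterwards.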

3.4 Wavelets

Wavelets are a mathematical tool to hierarchically decompose functions [17, 26]. Wavelets represent a signal at different resolutions. As shown in many publications [21, 6, 4], wavelets increase the performance of classifiers. In the case of 2D wavelets for images, the transformation is applied to the columns in the first step and then to the resulting rows. This approach, called non-standard decomposition, is repeated recursively until the transformation is complete. The Haar wavelet is defined by the scaling function $\phi$ (3.3) and the wavelet function $\psi$ (3.4).

$$
\phi(x) =
\begin{cases}
1 & \text{for } 0 \le x < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.3}
$$

$$
\psi(x) =
\begin{cases}
1 & \text{for } 0 \le x < 1/2 \\
-1 & \text{for } 1/2 \le x < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.4}
$$

The sets of scaled and translated functions for the different resolutions are then defined as

$$
\phi_i^j(x) = 2^{j/2} \, \phi(2^j x - i), \quad i = 0, \ldots, 2^j - 1
\tag{3.5}
$$

$$
\psi_i^j(x) = 2^{j/2} \, \psi(2^j x - i), \quad i = 0, \ldots, 2^j - 1
\tag{3.6}
$$

where $j$ is the scale and $i$ is the translation. In our case the 2D wavelet function is shown in Fig. 3-5. We can also think of the 2D wavelet as a multiple-edge filter in horizontal, vertical and diagonal directions. The filter is applied at different resolutions.

Figure 3-3: Scaling function.

Figure 3-4: Wavelet function.

Figure 3-5: 2D Scaling and Wavelet functions.
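The recursive column and row scheme described above can be sketched in Python as follows. This is an illustration only, not the transformation code of the system: it assumes a square image whose side length is a power of two and uses the orthonormal scaling by 1/√2; the function names `haar_step` and `haar_2d` are hypothetical.

```python
import math


def haar_step(values):
    """One Haar level: scaled pairwise sums, then scaled pairwise differences."""
    s = math.sqrt(2.0)
    half = len(values) // 2
    averages = [(values[2 * i] + values[2 * i + 1]) / s for i in range(half)]
    details = [(values[2 * i] - values[2 * i + 1]) / s for i in range(half)]
    return averages + details


def haar_2d(image):
    """Non-standard 2D Haar decomposition of a square, power-of-two image."""
    out = [[float(v) for v in row] for row in image]
    size = len(out)
    while size > 1:
        # First step: transform the columns of the current low-pass block.
        for c in range(size):
            column = haar_step([out[r][c] for r in range(size)])
            for r in range(size):
                out[r][c] = column[r]
        # Second step: transform the resulting rows.
        for r in range(size):
            out[r][:size] = haar_step(out[r][:size])
        # Recurse on the upper-left quadrant, which holds the averages.
        size //= 2
    return out
```

For a constant image all detail coefficients vanish and only the upper-left coefficient is non-zero, which matches the interpretation of the wavelet as an edge filter.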

3.5 Support vector machine classification

Support vector machines (SVMs) are a well-founded technique in statistical learning theory [27, 3, 16, 7, 13]. The SVM is a trainable machine which predicts the output for a given input. In the supervised learning process, labelled examples are presented, and the task is to find a function which describes the relation between the input examples and the output. In the case of binary SVMs, the function that predicts the output is

$$
f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right)
\tag{3.7}
$$

where $x_i$, $i = 1, \ldots, m$ are the selected training examples called support vectors, $x$ is the input vector, $K(x_i, x)$, called the kernel, is a symmetric positive definite function, $y_i \in \{1, -1\}$ is the label of the support vector, $\alpha_i$ is a weight for the support vector determined in the training process, and $b$ is the bias of the hyperplane. There are several different types of kernels:

$$
\text{Linear:} \quad K(x_i, x) = x_i \cdot x
\tag{3.8}
$$

$$
\text{Polynomial:} \quad K(x_i, x) = (x_i \cdot x + 1)^d
\tag{3.9}
$$

$$
\text{Gaussian:} \quad K(x_i, x) = e^{-\|x_i - x\|^2 / 2\sigma^2}
\tag{3.10}
$$

where $d$ is the degree of the polynomial kernel and $\sigma$ the width (standard deviation) of the Gaussian. The polynomial and Gaussian kernels are non-linear, which is important if there is no linear relation between the labels and the input. These kernels can solve non-linear problems.

Figure 3-6: Examples for different classifiers (panels: Linear, Polynomial, Gaussian). Visualized with the GUI for LibSVM [5].
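The decision function (3.7) with the Gaussian kernel (3.10) can be written down directly. The following Python sketch is for illustration and is not the SVMFu code used in this work; the function names and the toy support vectors in the usage note are made up.

```python
import math


def gaussian_kernel(xi, x, sigma):
    """Gaussian kernel of Eq. (3.10)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def svm_decide(x, support_vectors, labels, alphas, b, sigma):
    """Sign of the kernel expansion in Eq. (3.7); returns +1 or -1."""
    s = b
    for xi, yi, ai in zip(support_vectors, labels, alphas):
        s += ai * yi * gaussian_kernel(xi, x, sigma)
    return 1 if s >= 0 else -1
```

With one positive support vector at (0, 0) and one negative at (2, 2), equal weights and b = 0, an input near the origin is classified as +1 and an input near (2, 2) as -1. The cost of one classification grows linearly with the number of support vectors, which is why a small set of support vectors matters for the detection time.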

3.5.1 SVM training

The training process of an SVM classifier is derived from regularization theory:

$$
\min_{f \in H} \; \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda \|f\|_K^2
\tag{3.11}
$$

where $f \in H$ is of the kind

$$
f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x_i, x) + b
\tag{3.12}
$$

and $V$ is the loss function, which measures the goodness of the predicted output $f(x_i)$ with respect to the given label $y_i$. There are several different kinds of loss functions. For SVM classification the following loss function is used:

$$
V(y, f(x)) = (1 - y f(x))_+
\tag{3.13}
$$

where $(t)_+ = t$ if $t > 0$, and zero otherwise.

In most of the SVM literature, these equations are re-parameterized. Instead of the regularization parameter $\lambda$, regularization is controlled via a parameter $C$, defined using the relationship

$$
C = \frac{1}{2 \lambda m}
\tag{3.14}
$$

The parameter $C$ gives the user the possibility to choose an extra cost for errors. A higher $C$ corresponds to assigning a higher penalty to errors [3]. Using this definition, the regularization problem becomes

$$
\min_{f \in H} \; C \sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_K^2
\tag{3.15}
$$

This leads to a primal or a dual problem, which are both convex quadratic programs. A detailed description can be found in [23, 14, 7]. An algorithm which can handle large training sets is also described in [23].
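As a small numerical check of the re-parameterization, the following Python sketch evaluates both objectives for given predictions. It only illustrates that (3.15) equals (3.11) multiplied by 1/(2λ) when C is chosen according to (3.14); it is not a training algorithm, and the function names are hypothetical.

```python
def hinge_loss(y, fx):
    """Loss of Eq. (3.13): (1 - y f(x))_+."""
    return max(0.0, 1.0 - y * fx)


def objective_lambda(labels, outputs, norm_sq, lam):
    """Objective of Eq. (3.11) for fixed outputs f(x_i) and squared norm ||f||^2."""
    m = len(labels)
    return sum(hinge_loss(y, fx) for y, fx in zip(labels, outputs)) / m + lam * norm_sq


def objective_c(labels, outputs, norm_sq, c):
    """Objective of Eq. (3.15) for the same quantities."""
    return c * sum(hinge_loss(y, fx) for y, fx in zip(labels, outputs)) + 0.5 * norm_sq


def c_from_lambda(lam, m):
    """Relationship of Eq. (3.14)."""
    return 1.0 / (2.0 * lam * m)
```

Because the two objectives differ only by the positive factor 1/(2λ), they have the same minimizer, so choosing C is equivalent to choosing λ.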

3.5.2 Advantages of SVM based classifiers

There are several advantages of SVMs. The most important advantage is that during the training process, only a few vectors out of the training set are selected to become support vectors. This reduces the computational cost and provides better generalization. Another advantage is that there are no local minima in the quadratic program, so the solution found is always the optimum for the given training set. Finally, unlike with neural networks, the solution does not depend on the start conditions.

3.5.3 The kernel parameters C and σ

Choosing the right parameter C, and in the case of Gaussian kernels additionally σ, can be a challenge. Geometrically, C controls the width of the margin between class and non-class. If C is too big, the margin becomes very small and the training time becomes extremely long. On the other hand, if C is too small, there will be no unbounded support vectors and the term b is not determinable [23]. Almost the same is true for σ: if it is too small, the generalization becomes very poor and every vector is used as a support vector; if σ is too big, the kernel will not show good results either. We propose grid search as a straightforward method (4.3.5) to find suitable values for C and σ.
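The idea of a grid search over (C, σ) can be sketched as follows. This Python fragment is a generic illustration under assumptions, not the SVMFuTool described later: the logarithmic spacing of the grid is an assumption, and `score` is a hypothetical callback that would train a kernel with the given parameters and return a quality measure for it on a test set.

```python
def log_grid(lo_exp, hi_exp, base=2.0):
    """Values base**lo_exp, ..., base**hi_exp (logarithmic spacing, an assumption)."""
    return [base ** e for e in range(lo_exp, hi_exp + 1)]


def grid_search(c_values, sigma_values, score):
    """Evaluate score(c, sigma) on every grid node and return the best
    (score, c, sigma) triple. `score` is a hypothetical stand-in for
    training a kernel and measuring its performance."""
    best = None
    for c in c_values:
        for sigma in sigma_values:
            value = score(c, sigma)
            if best is None or value > best[0]:
                best = (value, c, sigma)
    return best
```

Every grid node is independent of the others, so the evaluations can be distributed over several CPUs or machines.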


Chapter 4

Procedure

4.1 System

The system consists of three programs. The main program is called StereoVisualisation. It can be started in three modes: the calibration mode is used to set up the world coordinate system (4.2.1), the label mode (4.3.1) is used to generate training and test sets, and the detection mode uses the SVM to classify the bounding boxes found and displays the oncoming vehicles. In addition we developed a program called ImageTransform, which generates training and test files in SVMFu format. This program was separated from the main system so that several different transformations can be used to generate training and test files from one image database. Furthermore, a program called SVMFuTool was developed to automate the training and testing of the SVM kernels. The SVMFuTool has several features: it generates shell scripts to dispatch SVMFu training processes with different parameters to the available CPUs and machines. The tool also applies the trained kernels to the test set and saves GnuPlot files and tables to evaluate the results. A list of the libraries used in this project can be found in A.4; some of the libraries had to be adapted for the system.

Figure 4-1: Schematic illustration of the detection system.

The system shown in Fig. 4-1 contains three subsystems (1 to 3 in the figure) and uses the modules shown. In the following list we describe the modules of the system.

- 1 Stereo system to compute depth data.

- 1.1 Feature matching on epipolar lines to approximate disparities.

- 1.2 3D point calculation.

- 2 Stereo based vehicle detection and SVM training system.

- 2.1 Bounding box detection searches the depth data for possible occurrences of vehicles. Depending on the mode of the application, we pass the bounding box to the feature extraction or display it for the labeling.

- 2.2 In the detection mode we extract the features using the transformation described in 4.3.2.

- 2.3 In the label mode we save the gray value patches in two directories on the hard drive, depending on their label.

- 2.4 SVM classification of the feature vector (4.3).

- 3 Separate programs to generate the training set, train SVM kernels, and test SVM kernels.

- 3.1 Same feature extraction as in the main system (2.2), but in a separate program that reads the gray value patches from the hard drive.

- 3.2 Feature vectors are saved in SVMFu format as a training or test file.

- 3.3 We use SVMFu [24] to train the Gaussian kernels.

- 3.4 Program to test the kernels and generate graphics for evaluation. The graphics are saved in GnuPlot format.

4.1.1 Sequence loader

The class SequenceLoader is the main class for the detection process. The class

contains all necessary class instances for the whole system, except the user interface.

The collaboration is shown in Fig. 4-2.

[Diagram: SequenceLoader holds the members mTransform (ZImageTransform), mDepthMap (DepthMap), mCam0 and mCam1 (BouguetCamera), mSeqItems (List<SeqItem>), mC0Texture and mC1Texture (ZImageTexture), and mTimeStatistic (TimeStatistic).]

Figure 4-2: Collaboration diagram for the SequenceLoader class. This class connects all important components.

The following classes are used by the SequenceLoader:

Table 4.1: Classes used by the SequenceLoader

Class             Description
ZImageTransform   Transformation and classification of the gray value patches (4.3.2).
DepthMap          3D data loader and bounding box detection.
BouguetCamera     Calibration data loader and projection matrices (4.2.1).
List<SeqItem>     List of the image pairs and the corresponding 3D data files from the input directory.
ZImageTexture     Image loader for images from both cameras, with display functions for OpenGL.
TimeStatistic     Time measurement.


4.2 Bounding box detection

4.2.1 Area of interest

The volume of interest is a user-defined area in 3D space in which objects can be detected. This area in front of the vehicle is rotated to the world coordinate system, which is parallel to the road. It is defined by two sets of parameters.

• A Cartesian projection matrix W (4.1) to rotate and translate the 3D points.

This permits us to place the interest box parallel to the street. The system

provides a calibration mode to set up the projection matrix.

• A clipping volume to define the boundaries of the detection. This box specifies

the area of interest in front of the vehicle. Each point out of range is ignored

in further detection steps. The constants top, left, right, bottom, near and far

are read from a configuration file and shown as an orange box in the calibration

mode.

Figure 4-3: Area of interest in 3D space with right lane mark.

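The matrix W (4.1) is used to project the point $\vec{p}$ to $\vec{p}'$ in the world coordinate system, which is placed in front of the vehicle, as in (4.2).

$$
W =
\begin{pmatrix}
r_{00} & r_{01} & r_{02} & t_{03} \\
r_{10} & r_{11} & r_{12} & t_{13} \\
r_{20} & r_{21} & r_{22} & t_{23} \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{4.1}
$$

$$
\vec{p}' = W \vec{p}
\tag{4.2}
$$

Each projected point $\vec{p}'$ is discarded if it does not lie inside the interest box, i.e. if it violates one of the conditions

$$
left > p'_x > right, \qquad
top > p'_y > bottom, \qquad
near > p'_z > far
$$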
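The clipping step can be sketched in Python as follows. The sketch is an illustration, not the code of the DepthMap class: the representation of W as a nested list, the dictionary holding the six constants, and the function names are assumptions. The inequalities are taken as stated above, so `left` is numerically larger than `right` and `near` is numerically larger than `far`.

```python
def to_world(W, p):
    """Apply the 4x4 matrix W of Eq. (4.1) to the 3D point p, as in Eq. (4.2)."""
    h = (p[0], p[1], p[2], 1.0)
    return tuple(sum(W[r][c] * h[c] for c in range(4)) for r in range(3))


def in_interest_box(p, box):
    """True if the world point p satisfies the three conditions of the interest box."""
    x, y, z = p
    return (box["left"] > x > box["right"]
            and box["top"] > y > box["bottom"]
            and box["near"] > z > box["far"])


def clip_points(points, W, box):
    """Transform all points to world coordinates and keep those inside the box."""
    world = (to_world(W, p) for p in points)
    return [p for p in world if in_interest_box(p, box)]
```

Points outside the box are dropped before the bounding box search, which keeps the later steps cheap.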

4.2.2 Bounding box search

For bounding box detection we project the points onto the surface given by the interest volume. We quantize the x and z coordinates of the points and use these values as indices into an array. For every (x, z) cell an average of the 3D points is calculated and stored in the array (Figure 4-4 (a)). Possible vehicle fronts are detected by shifting a search window over the array. Within the window, the outermost left and right average points are searched and tested against several geometrical constraints (Figure 4-4 (b)). The width ($p'_{lx} - p'_{rx}$) between the outer left and right average points must lie in a given interval. The approximated height is the difference between the maximal and minimal heights ($p'_y$ values) of the points within the window. We also provide the possibility to use constant values for the bottom and the top, which gives more reliable results. Furthermore, we use the angles α and β to limit the rotation of the box relative to the world coordinate system. The edge points then consist of the x, z coordinates of the left or right point and the minimum or maximum $p'_y$ value. The detected boxes are stored in a list $BL = \{b_0, b_1, \ldots, b_i, \ldots, b_{n-1}\}$ for the following processing.
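The quantization and averaging step described above might look as follows. This is a minimal sketch, not the implementation used in the thesis: the cell size, the function names `average_grid` and `width_ok`, and the use of NumPy are assumptions, and the search window and the angle constraints α and β are left out.

```python
import numpy as np

def average_grid(points, x_range, z_range, cell):
    """Quantize x and z into cell indices and average the 3D points per cell.

    Hypothetical helper; returns the per-cell mean point (NaN for empty
    cells) and the per-cell point count.
    """
    points = np.asarray(points, dtype=float)
    nx = int(np.ceil((x_range[1] - x_range[0]) / cell))
    nz = int(np.ceil((z_range[1] - z_range[0]) / cell))
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iz = np.floor((points[:, 2] - z_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)

    sums = np.zeros((nx, nz, 3))
    counts = np.zeros((nx, nz), dtype=int)
    np.add.at(sums, (ix[valid], iz[valid]), points[valid])
    np.add.at(counts, (ix[valid], iz[valid]), 1)

    means = np.full((nx, nz, 3), np.nan)
    occupied = counts > 0
    means[occupied] = sums[occupied] / counts[occupied][:, None]
    return means, counts

def width_ok(left_point, right_point, min_width, max_width):
    """Width constraint on the outer left and right average points.

    The absolute value is used so the check does not depend on the
    orientation of the x axis (an assumption of this sketch).
    """
    width = abs(left_point[0] - right_point[0])
    return min_width <= width <= max_width
```

The `counts` array corresponds to the gray values of Figure 4-4 (a), and a search window would be shifted over `means` to pick the outer left and right average points.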

Figure 4-4: (a) Schematic illustration of bounding box detection on the projected surface. Gray values represent the count of 3D points. (b) Geometrical constraints for validation, applied to the left and right average points. Width and height as well as α and β must be in a given range. The height is measured using the minimum and maximum y values of the 3D points within the search window.

Figure 4-5: Detected bounding boxes on a truck in 3D space.

4.2.3 Bounding box projection

To extract the gray value patches contained by the bounding boxes we have to project

the coordinates from the world coordinate system to the image plane of the chosen

camera. To achieve this, we need the matrix KK describing the internal parameters

of the camera. The matrix KK describes the projection of 3D points on the image

surface following the pinhole camera model. We use the calibration results from the

Camera Calibration Toolbox introduced by Jean-Yves Bouguet [2]. We extend the

matrix KK to KKw by including the translation of the cameras relative to the road

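surface. The image coordinates are then calculated as follows:

\[
KK_w = \begin{pmatrix}
\frac{f\, n_u}{\mathit{width}} & 0 & -\frac{n_u}{2} & t\, \frac{f\, n_u}{\mathit{width}} \\
0 & \frac{f\, n_v}{\mathit{height}} & -\frac{n_v}{2} & -h\, \frac{f\, n_v}{\mathit{height}} \\
0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0
\end{pmatrix}
\tag{4.3}
\]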

where width and height [m] are the chip dimensions of the camera in meters, f [m] is the focal length of the camera lens, and t [m] is the length of the baseline. This parameter must be set to t for the right camera or −t for the left camera. h [m] is the height of the camera above the road surface.

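\[
\vec{p}_i = KK_w \cdot W^{-1} \vec{p}' \tag{4.4}
\]

where $\vec{p}'$ is an edge point of the bounding box and W is the world projection matrix (4.1). The pixel coordinate on the image $|\vec{p}_i|$ is then:

\[
|\vec{p}_i| = \begin{pmatrix} p_{ix} / p_{iw} \\ p_{iy} / p_{iw} \end{pmatrix} \tag{4.5}
\]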
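Equations (4.3) to (4.5) can be traced with a short Python sketch. It rests on assumptions that are not stated in the text: n_u and n_v are read as the image width and height in pixels, the function names are hypothetical, and the numeric values in the usage note are made up for illustration.

```python
import numpy as np

def build_KKw(f, chip_width, chip_height, n_u, n_v, t, h):
    """Assemble the extended camera matrix KK_w of (4.3).

    Assumption: n_u and n_v are the image size in pixels.
    """
    a = f * n_u / chip_width    # focal length in horizontal pixels
    b = f * n_v / chip_height   # focal length in vertical pixels
    return np.array([
        [a,   0.0, -n_u / 2.0,  t * a],
        [0.0, b,   -n_v / 2.0, -h * b],
        [0.0, 0.0,  0.0,        0.0],
        [0.0, 0.0, -1.0,        0.0],
    ])

def project_to_pixel(p_world, W, KKw):
    """Map a 3D edge point p' to pixel coordinates using (4.4) and (4.5)."""
    p = np.append(np.asarray(p_world, dtype=float), 1.0)  # homogeneous point
    p_i = KKw @ np.linalg.inv(W) @ p                       # (4.4)
    return p_i[:2] / p_i[3]                                # (4.5)
```

With the fourth row (0, 0, −1, 0), the homogeneous coordinate becomes −z, so a point with negative z lies in front of the camera. For example, with an identity W, a point at x = −t and y = h is mapped to the image center (n_u / 2, n_v / 2).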

4.2.4 Overlap removing

In order to remove overlap of the boxes found, we sort the list $BL = \{b_0, b_1, \ldots, b_i, \ldots, b_{n-1}\}$ containing all boxes by the mid point of the box on the image plane. Each box $b_i \in \{b_{n-1}, \ldots, b_0\}$ is compared with the boxes $\{b_{i-1}, \ldots, b_0\}$ to find the biggest overlap ratio. If the overlap ratio is bigger than a certain threshold, we merge the two boxes and delete $b_i$.

For the label mode we choose a high threshold to provide many but not all boxes.

This makes it more intuitive to choose them with the mouse. In the detection mode

a low threshold merges all detections to one per vehicle.
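The merging loop can be sketched as follows. Several details are assumptions, because the text does not define them: boxes are axis-aligned rectangles (x0, y0, x1, y1) on the image plane, the overlap ratio is the intersection area divided by the area of the smaller box, merging takes the enclosing rectangle, and the sort key is the (x, y) mid point.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller box area (assumed definition)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (iw * ih) / min(area_a, area_b)

def merge(a, b):
    """Enclosing rectangle of two boxes (assumed merge rule)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def remove_overlap(boxes, threshold):
    """Sort by mid point, then merge b_i into its best-overlapping predecessor."""
    boxes = sorted(boxes, key=lambda b: ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0))
    for i in range(len(boxes) - 1, 0, -1):
        ratios = [overlap_ratio(boxes[i], boxes[j]) for j in range(i)]
        j = max(range(i), key=lambda k: ratios[k])  # biggest overlap ratio
        if ratios[j] > threshold:
            boxes[j] = merge(boxes[j], boxes[i])
            del boxes[i]
    return boxes
```

A high threshold leaves most boxes untouched, which corresponds to the label mode, while a low threshold collapses the detections on one vehicle into a single box, as in the detection mode.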

4.3 Classification

We use an SVM classifier to make the final decision whether the bounding box contains

a vehicle or not. The classifier is trained with SvmFu [24].

4.3.1 Sample extraction and labeling

We provide an easy way to extract training and test data sets from the image sequences. Our system provides the possibility to label bounding boxes; the content can then be saved. We transform the images with another program into an SvmFu training or test set file. With this approach we can try several different settings for the transformation, as well as different numbers of dimensions. To make the data sets grow quickly, we provide a method to extract slightly shifted and scaled patches from one bounding box. In addition, this should help to achieve invariance to these transformations. To extract negative samples we implemented a random patch extraction, which can be used if no vehicles occur in the frame.
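The extraction of slightly shifted and scaled patches could be sketched like this. The shift and scale values, the function name `jittered_boxes`, and the box format (x0, y0, x1, y1) in pixels are assumptions chosen for illustration; the text does not state the values that were used.

```python
import itertools

def jittered_boxes(box, shifts=(-2, 0, 2), scales=(0.9, 1.0, 1.1)):
    """Return shifted and scaled copies of one labeled bounding box.

    Hypothetical helper: every combination of horizontal shift, vertical
    shift, and scale around the box center yields one new box.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    out = []
    for dx, dy, s in itertools.product(shifts, shifts, scales):
        hw, hh = s * w / 2.0, s * h / 2.0
        out.append((cx + dx - hw, cy + dy - hh, cx + dx + hw, cy + dy + hh))
    return out
```

With the default values above, one labeled box yields 27 boxes, including the original one. Each box would then be cropped from the image and resampled to the patch size.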

4.3.2 Wavelet transformation

In our system we use wavelet coefficients to represent the feature vectors of vehicles.

As shown in [21], wavelets provide a compact representation of image features. Further information about wavelets can be found in [26]. To avoid numerical difficulties in the training stage we apply a histogram equalization on the gray value patch before we transform it. The average illumination of the patch does not affect the classification result, so we discard it in the wavelet response. To improve classification speed,

we also test the performance on coefficients above a certain threshold. In addition, we

perform experiments with absolute wavelet coefficients, as in [21]. With this method

the filter response of the wavelet is treated the same way for dark regions on a bright

background as vice versa. The whole transformation chain is shown in Fig. 4-6.

Figure 4-6: The transformation steps from the gray value patch to the input vector of the SVM: extraction, histogram equalization (HE), discrete wavelet transformation (DWT), threshold (T), absolute value (ABS), SVM. All transformations are optional.
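The chain HE, DWT, T, ABS can be sketched in Python for a square 8-bit patch whose side length is a power of two. This is an illustration under assumptions, not the code of the thesis, which uses an external wavelet library: the averaging normalization of the Haar transform, the threshold test on the absolute value, and all function names are choices made here.

```python
import numpy as np

def equalize(patch):
    """Histogram equalization of a uint8 patch; returns values in (0, 1]."""
    hist = np.bincount(patch.ravel(), minlength=256)
    cdf = hist.cumsum() / patch.size
    return cdf[patch]

def haar2d(patch):
    """Pyramid Haar transform using pairwise averages and differences."""
    a = np.asarray(patch, dtype=float).copy()
    n = a.shape[0]
    while n > 1:
        block = a[:n, :n]
        rows = np.hstack([(block[:, 0::2] + block[:, 1::2]) / 2.0,
                          (block[:, 0::2] - block[:, 1::2]) / 2.0])
        a[:n, :n] = np.vstack([(rows[0::2, :] + rows[1::2, :]) / 2.0,
                               (rows[0::2, :] - rows[1::2, :]) / 2.0])
        n //= 2
    return a  # a[0, 0] holds the average of the whole patch

def feature_vector(patch, threshold=None, absolute=False):
    """HE -> DWT -> optional threshold -> optional absolute value."""
    coeffs = haar2d(equalize(patch)).ravel()[1:]  # drop the patch average
    if threshold is not None:
        coeffs = np.where(np.abs(coeffs) > threshold, coeffs, 0.0)
    if absolute:
        coeffs = np.abs(coeffs)
    return coeffs
```

For a 32x32 patch this yields 1024 − 1 = 1023 values, which matches the dimension count listed for the dense kernels. In the sparse case, only the coefficients that survive the threshold would be stored.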

4.3.3 SVM kernel

The Gaussian kernel (4.7) maps the samples nonlinearly into a high-dimensional space, so it can handle a decision function which is not a linear function of the data and labels [3]. Since we assume that this is the case here, we decided to use a Gaussian kernel in our system. The decision function (4.6) is the same as (3.7) in section 3.5.

\[
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right) \tag{4.6}
\]

\[
K(x_i, x) = e^{-\|x_i - x\|^2 / 2\sigma^2} \tag{4.7}
\]

where $x_i$ is one of the support vectors and $x$ the test sample.
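Written out in Python, (4.6) and (4.7) amount to the following sketch. The function names and the argument layout are assumptions; in the real system the coefficients α_i, the labels y_i, and the offset b come from the trained SVM, and the value sgn(0) is mapped to +1 here by convention.

```python
import numpy as np

def gaussian_kernel(xi, x, sigma):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2)), see (4.7)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(x, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def decide(x, support_vectors, alphas, labels, b, sigma):
    """Decision function (4.6): sign of the kernel expansion plus offset b."""
    s = sum(a * y * gaussian_kernel(sv, x, sigma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1  # sgn(0) is mapped to +1 in this sketch
```

The cost of one decision grows linearly with the number of support vectors and with the dimension of the input vector.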

4.3.4 Kernel performance evaluation

In order to analyze the classification results of the SVM kernels we choose Receiver Operating Characteristic (ROC) curves to visualize the relation between true and false positives. The axes of an ROC curve are the true positive rate, which is the number of true positives divided by the total number of positives in the test set, and the false positive rate, which is the number of false positives divided by the total number of negatives. The Area Under the Curve (AUC) gives us a scalar measurement of the performance. The AUC value is also used to evaluate the results of the grid search.
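An ROC curve and its AUC can be computed from the real-valued SVM outputs as in the following sketch. It assumes labels of +1 and −1 and distinct scores (ties would need extra care), and it integrates with the trapezoidal rule; the function names are not taken from the thesis.

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep the decision threshold from high to low and collect the rates."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels == 1)   # true positives per threshold
    fp = np.cumsum(sorted_labels != 1)   # false positives per threshold
    tpr = np.concatenate([[0.0], tp / tp[-1]])
    fpr = np.concatenate([[0.0], fp / fp[-1]])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal integration of the true positive rate over the false positive rate."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

An AUC of 1 means that every positive sample scores higher than every negative sample, while 0.5 corresponds to a random ranking.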

Figure 4-7: ROC curve example with important features. The plot shows the true positive rate (true positives/positives) over the false positive rate (false positives/negatives). The legend entry lists the kernel name, the number of support vectors, the number of dimensions, and the area under the curve (here: k_c2.297397_s4.924578, SVs: 735, Dim: 1023, sig: 4.925, AUC: 0.973). Annotations mark the region where all positives are detected but with many false positives, and the region with no false positives.

4.3.5 Kernel parameter optimization

In order to optimize the performance of the kernels we choose the grid search method shown in [16]. The grid search is more or less a brute force approach which is computationally expensive, but the training processes can easily be parallelized. Using this technique we train several kernels with different values of (C, σ), where C is the cost (weight) of each vector from the training set and σ the parameter of the Gaussian kernel (4.7). As recommended in [16], we try exponentially growing sequences of (C, σ). For each kernel the area under the curve is computed with a certain test set. The results are plotted as a surface with contours (Fig. 4-8). The results can be improved by refining the search around a good area, in our example $C = 2^5 \ldots 2^{10}$, $\sigma = 2^0 \ldots 2^{10}$.
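The grid search itself reduces to a loop over exponents. In the sketch below, the training of a kernel and the computation of its AUC on the test set are hidden behind a caller-supplied function `evaluate(C, sigma)`, which is an assumption of this example; the function name `grid_search` and the return format are hypothetical as well.

```python
import itertools

def grid_search(evaluate, log2_C_values, log2_sigma_values):
    """Evaluate every (C, sigma) = (2^lc, 2^ls) pair and return the best one.

    `evaluate(C, sigma)` is assumed to train a kernel and return its AUC.
    """
    best = None
    results = {}
    for lc, ls in itertools.product(log2_C_values, log2_sigma_values):
        auc = evaluate(2.0 ** lc, 2.0 ** ls)
        results[(lc, ls)] = auc
        if best is None or auc > best[0]:
            best = (auc, lc, ls)
    return best, results
```

Because the evaluations are independent of each other, the loop body can be distributed over several machines. A refinement step would call the function again with a finer exponent range around the best pair.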

Figure 4-8: The plot shows the computed area under the curve (AUC) for Gaussian kernels over (lg(C), lg(σ)). The kernels were trained with wavelet coefficients from histogram equalized gray value samples.

Chapter 5

Results

5.1 Bounding box detection

From the bounding box detection we receive possible occurrences of vehicles in the

world coordinate system (Fig. 5-1). For the labeling process it proved useful to apply the overlap removing (4.2.4) before the user manually chooses the class. As a further feature, we provide the possibility to use the current kernel to pre-classify the boxes, which makes it easy to find false positives and false negatives and

include them in the training set. In addition we provide a bootstrap method to add

false positives from frames without vehicles. Our experience is that this makes it easy

to improve the performance of the classifier. For vehicle detection we first classify

all predicted boxes and then merge only the positives, which is slower than merging

them first, but gives better results.
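The bootstrap step mentioned above can be pictured as follows. This is a schematic sketch only: the function name, the `classify` callable (assumed to return a positive value for the vehicle class), and the list-based training set are assumptions, and the retraining of the kernel after the negatives have been added is not shown.

```python
def bootstrap_negatives(patches, classify, negatives):
    """Add patches from vehicle-free frames that the current kernel accepts.

    Every positive response is a false positive by construction, because
    the patches come from frames without vehicles.
    """
    added = 0
    for patch in patches:
        if classify(patch) > 0:
            negatives.append(patch)
            added += 1
    return added
```

Repeating this step with a retrained kernel gradually concentrates the negative set on the patterns that the classifier confuses with vehicles.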

Figure 5-1: Sample screen shots for detected bounding boxes. In the lower left corner, the content of the marked bounding box (dark blue) is shown.

The bounding box detection shows poor results if the target vehicle is far away or rotated relative to the world coordinate system. If we only have depth

information on one side of the vehicle, the detection fails. Under normal illumination

conditions, we usually have several slightly shifted and scaled boxes on the object.

5.2 Classification results

As described in 4.3, we tried several input vector modifications such as histogram

equalization and compressed coefficients for the training process. For the optimization

of the Gaussian kernel we used the grid search method.

5.2.1 Training set

For the training of the SVM we use labeled gray value patches (Fig. 5-2). The images

are taken under natural weather and illumination conditions. The rough bounding box estimation evidently causes slightly scaled and shifted views of vehicles.

Table 5.1: Different training and test sets used to evaluate the performance.

Type       Positives   Negatives   Total
Training   311         2284        2595
Training   2853        3186        6039
Test       212         2216        2428

Figure 5-2: A view of samples from our training set. (a) Images from the vehicle class. (b) Typical non-vehicles.

5.2.2 Grid search

The results of the grid search were plotted as a contour of the AUC values. Fig. 5-3

shows some of the calculated results.

Figure 5-3: Example of contour line plots (a) to (d) for grid search with Gaussian kernels. The axes are lg(C) and lg(sigma); the contour lines show the AUC values. All kernels are trained with histogram equalized Haar wavelet coefficients.

Table 5.2: Several trained kernels for different ranges of (C, σ). All kernels are trained on Haar wavelet coefficients of histogram normalized samples and tested with the same test set.

Figure    C              σ              Best AUC                       Type      Training p/n   Test p/n
5-3 (a)   2^−10..2^9     2^−10..2^9     0.942 (C = 2^7, σ = 2^3)       dense     2853/3186      212/2216
5-3 (b)   2^5..2^14      2^2..2^7       0.94 (C = 2^5, σ = 2^3.5)      dense     2853/3186      212/2216
5-3 (c)   2^−5..2^15     2^−14..2^3     0.949 (C = 2^3, σ = 2^2.2)     sparseN   311/2284       212/2216
5-3 (d)   2^−5..2^15     2^3..2^8       0.935 (C = 2^5, σ = 2^3)       sparseN   311/2284       212/2216

5.2.3 Classification performance

In order to analyze the classification results of the SVM kernels, we choose Receiver

Operating Characteristic (ROC) curves to visualize relations between the true and

false positives. Once again the Area Under the Curve (AUC) enables us to get a

scalar measurement for the performance. The AUC value is also used to evaluate the

results in the grid.

Table 5.3: Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images. SVs is the number of support vectors and "Training p/n" is the number of positives and negatives in the training set. The last column shows the time needed to classify the test set. A complete listing of all good kernels can be found in A.2 and A.3.

C        σ        AUC     Type        SVs    Training p/n   Dimensions     Time [ms]
2^5.00   2^3.00   0.936   dense¹      565    311/2284       1023 (32x32)   34
2^0.00   2^2.00   0.937   dense       1212   311/2284       511 (32x16)    37
2^7.00   2^3.00   0.971   dense       665    311/2284       1023 (32x32)   40
2^3.00   2^2.20   0.949   sparseN     577    311/2284       12 (32x32)     5
2^5.00   2^3.00   0.935   sparseN     436    311/2284       12 (32x32)     4
2^1.80   2^8.00   0.832   sparseN     490    311/2284       13 (32x32)     4
2^1.00   2^2.20   0.947   dense²      910    311/2284       1023 (32x32)   58
2^7.00   2^3.00   0.942   dense       896    2853/3186      1023 (32x32)   53
2^4.00   2^2.00   0.941   sparseN     1407   2853/3186      192 (32x32)    52
2^1.00   2^3.20   0.94    sparseN     908    2853/3186      1143 (64x32)   59
2^0.00   2^2.00   0.992   sparseN³    1191   2853/3186      361 (32x32)    41
2^2.00   2^2.00   0.993   dense³      763    2853/3186      1023 (32x32)   47
2^6.00   2^4.00   0.996   dense³ ⁴    974    2853/3571      1023 (32x32)   62

The experiments show that, in general, many support vectors are needed to represent the class. This tells us that the classes in the training set are hard to separate. We used a threshold to get a compressed representation of the image patch, so that we can use a sparse SVM. In all experiments with the sparse method we kept the coefficients above 0.1 or below −0.1. The sparse method is fast, but the results are less dependable. We decided to do only a few experiments with different types of wavelets, since the test with the spline wavelet showed no major improvement in the results. The best results were obtained with absolute Haar wavelet coefficients of histogram normalized gray value samples. This is true for the dense SVM as well as for the sparse one.

¹ No histogram equalization
² Spline 3 7 wavelet
³ Absolute values of wavelet coefficients
⁴ False positives added as negatives to the training set using bootstrap method

The following ROC curves (Fig. 5-4 ... 5-6) show the performance of Gaussian

kernels selected from the previous experiments.

Figure 5-4: ROC curve (true positive rate over false positive rate) for best Gaussian kernels from the first training set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients. Dense Gaussian kernels shown: k_c32.000000_s8.000000 (SVs: 565, Dim: 1023, sig: 8.000, AUC: 0.936) and k_c128.000000_s8.000000 (SVs: 665, Dim: 1023, sig: 8.000, AUC: 0.971).

Figure 5-5: ROC curve (true positive rate over false positive rate) for best Gaussian kernels, with sparse kernels using values above threshold 0.1. The Gaussian kernels were trained on the first set containing 2428 samples, which were histogram equalized and transformed to Haar wavelet coefficients. Kernels shown: k_c8.000000_s4.594792 (SVs: 577, Dim: 12, sig: 4.595, AUC: 0.949) and k_c32.000000_s8.000000 (SVs: 565, Dim: 1023, sig: 8.000, AUC: 0.936).

Figure 5-6: ROC curves (true positive rate over false positive rate) for best Gaussian kernels from the last training set containing 6424 samples. Absolute values of Haar wavelet coefficients from histogram equalized gray values. (a) Dense kernel using bootstrap to add false positives to the training set: k_c64.000000_s16.000000 (SVs: 974, Dim: 1023, sig: 16.000, AUC: 0.996); (b) dense kernel: k_c4.000000_s4.000000 (SVs: 763, Dim: 1023, sig: 4.000, AUC: 0.993); (c) sparse kernel with threshold 0.1: k_c1.000000_s4.000000 (SVs: 1191, Dim: 361, sig: 4.000, AUC: 0.992).

5.3 System detection rate

For the measurement of the overall detection rate we suggest counting the vehicles in

the frames for which depth information is available. We further try to count the cars

which are located in the interest area but for which no information is available. To

make it possible to judge whether a vehicle is located in the interest area or not, we

decide to project the non-clipped 3D points on the image of the left camera as shown

in Figure 5-7.

Figure 5-7: Image of the left camera with corresponding depth data. This visualization makes it possible to count vehicles which occur in the interest area.

We have done two experiments for a distance of 50 and 100 meters. The detection

rate for the closer distance is better, as expected.

Table 5.4: Results for whole system detection rate, with interest area 100 meters in front of the vehicle.

Kernel: equalized Haar Abs 32x32   Count   Rate¹ [%]
Frames                             398     -
Vehicles                           173     -
Bounding boxes                     1827    -
True positives                     75      43.3
False positives                    11      6.35
Vehicle back                       20      8.65

Table 5.5: Results for whole system detection rate, with interest area 50 meters in front of the vehicle.

Kernel: equalized Haar Abs 32x32   Count   Rate¹ [%]
Frames                             398     -
Vehicles                           129     -
Bounding boxes                     2724    -
True positives                     82      63.5
False positives                    20      15.5
Vehicle back                       27      20.9

Our system displays the bounding boxes that are classified as vehicles as green boxes in the image of the left camera (Fig. 5-8). In addition, the approximated relative position of the detected vehicle is displayed as a horizontal bar on the right side of the window. The vertical bar on the right side represents the response of the SVM; the higher the bar, the more reliable the decision.

¹ Rate relative to the counted vehicles.

Figure 5-8: Sample screen shots of the system in the detection mode. On the right side, the relative position of the detected vehicle is shown. The reliability of the classification is displayed as a vertical bar on the right side. In the right column we show the typical errors of the system. The errors from top to bottom: false positive, backside of vehicle as positive, and missed vehicle.

Finally, we present some time measurements and statistics of the detection and classification in our system (Table 5.6). If the system is integrated in the test vehicle, the times for loading the data from the hard drive no longer apply.

Table 5.6: Statistics of the system.

Kernel: equalized Haar Abs 32x32   Count   Time [ms]
Frames                             2634    -
Time to load images                -       28.83
Time to load 3D data               -       81.71
Points per frame                   1602    -
Boxes per frame                    2       -
Time to find boxes                 -       21.14
Time to classify boxes¹            -       61.90
Time to remeasure boxes¹           -       2.84
Overall time per frame             -       302.00

¹ Time per bounding box

5.4 Discussion

This project shows that stereo vision provides a technique to make fast predictions of possible vehicle occurrences in street scenes. This opens the opportunity of using more complex classifiers. However, the tests show that the dependability of our bounding box detection is not as good as expected. This weak point affects the detection rate of the whole system. There are several possibilities to improve the detection:

• Apply an advanced technique for compensation of dynamic changes in camera

pitch and height during driving [10].

• Use lane detection to make a better approximation of possible vehicle occurrence.

• Track good boxes over time. We can use a Kalman filter to predict bounding

boxes between frames.

• Use clustering methods to improve the bounding box accuracy.

• Improve stereo vision accuracy by using filters to reduce noise and outliers.

We have also shown how the supervised learning process can be supported by a stereo system. The grid search method is a straightforward way to optimize the training parameters of the Gaussian kernels, even if the classes in the training set are hard to separate. With the grid search method and several transformations of the SVM input vector, we were able to increase the Area Under the Curve (AUC) up to 99.6%. Furthermore, the system provides functions to increase the accuracy of the SVM classifier incrementally. This makes it much easier to obtain large training sets from an image sequence.

Chapter 6

Conclusion

In conclusion, we built a system that detects oncoming vehicles by using depth information from stereo vision image pairs to predict their possible occurrence. Consequently, we can determine where and at which scale the vehicle is located in the images. Although the accuracy of the prediction is not dependable on its own, it makes it possible to apply a relatively slow global SVM classifier to a smaller subregion of the image. This reduces the computational cost compared to shifting a classifier over the whole translation and scale space. Furthermore, we compared several kinds of SVM input vector transformations, such as histogram equalization and Haar wavelets, regarding their ability to improve the performance of the classifier. Since the Gaussian SVM kernel showed the best results in early tests, we decided to perform all experiments with this kernel. In this thesis we presented a novel way to combine stereo vision with SVM classification. In addition, we provided a short introduction to statistical learning and stereo vision.

6.1 Future work

There are many ways to improve the system detection rate in the future. A major improvement would be a flexible region of interest in front of the vehicle. One aim of further study would be the integration of a dynamic pitch and height correction such as in [10]. We also consider using lane information to steer the bounding box detection. Furthermore, we will study the possibility of using an SVM classifier on the x-z projection of the 3D points for the bounding box detection. In order to improve the detection rate of the SVM based part, a component based SVM classifier [15] could be used. We propose experiments with the triangular kernel [8], which has characteristics similar to the Gaussian kernel but no parameter σ, which could make the grid search method unnecessary. We suggest taking a closer look at different kinds of wavelets, in particular masked wavelets [25]. In order to increase the system speed, we recommend studying feature selection techniques and PCA [15] to obtain a sparse representation of the gray value patches.

Appendix A

Tables

Table A.1: List of Terms and Acronyms

Abs    Absolute value
AUC    Area Under the Curve (for ROC)
DWT    Discrete Wavelet Transformation
HE     Histogram Equalization
p/n    positive count / negative count (in tables)
PCA    Principal Component Analysis
RBF    Radial Basis Function (SVM kernel)
ROC    Receiver Operating Characteristic
SV     Support Vector
SVM    Support Vector Machine
T      Threshold

Table A.2: Listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.

C         σ        AUC     Type      SVs    Training p/n   Dimensions     Time [ms]
2^5.00    2^3.00   0.936   dense¹    565    311/2284       1023 (32x32)   34
2^0.00    2^2.00   0.937   dense     1212   311/2284       511 (32x16)    37
2^7.00    2^3.00   0.971   dense     665    311/2284       1023 (32x32)   40
2^11.00   2^3.00   0.971   dense     665    311/2284       1023 (32x32)   40
2^5.00    2^3.00   0.971   dense     665    311/2284       1023 (32x32)   40
2^15.00   2^3.00   0.971   dense     665    311/2284       1023 (32x32)   40
2^9.00    2^3.00   0.971   dense     665    311/2284       1023 (32x32)   40
2^13.00   2^3.00   0.971   dense     665    311/2284       1023 (32x32)   43
2^1.20    2^2.30   0.973   dense     735    311/2284       1023 (32x32)   46
2^3.00    2^2.20   0.949   sparseN   577    311/2284       12 (32x32)     5
2^5.00    2^3.00   0.935   sparseN   436    311/2284       12 (32x32)     4
2^1.80    2^8.00   0.832   sparseN   490    311/2284       13 (32x32)     4

¹ No histogram equalization

Table A.3: Sequel listing of kernels with their performance. The results are calculated using a test set with 212 vehicles and 2216 negative examples. Most of the SVMs are trained with Haar wavelet coefficients from histogram equalized images.

C        σ        AUC     Type        SVs    Training p/n   Dimensions     Time [ms]
2^1.00   2^2.20   0.947   dense²      910    311/2284       1023 (32x32)   58
2^1.00   2^2.50   0.917   dense       1869   311/2284       1023 (32x32)   110
2^7.00   2^3.00   0.942   dense       896    2853/3186      1023 (32x32)   53
2^8.00   2^3.00   0.942   dense       896    2853/3186      1023 (32x32)   58
2^9.00   2^3.00   0.942   dense       896    2853/3186      1023 (32x32)   53
2^6.00   2^3.00   0.942   dense       896    2853/3186      1023 (32x32)   53
2^5.00   2^3.50   0.94    dense       672    2853/3186      1023 (32x32)   44
2^4.00   2^2.00   0.941   sparseN     1407   2853/3186      192 (32x32)    52
2^2.00   2^2.00   0.941   sparseN     1407   2853/3186      192 (32x32)    52
2^1.00   2^3.20   0.94    sparseN     908    2853/3186      1143 (64x32)   59
2^0.00   2^2.00   0.992   sparseN³    1191   2853/3186      361 (32x32)    41
2^2.00   2^2.00   0.993   dense³      763    2853/3186      1023 (32x32)   47
2^6.00   2^4.00   0.996   dense³ ⁴    974    2853/3571      1023 (32x32)   62

² Spline 3 7 wavelet
³ Absolute values of wavelet coefficients
⁴ False positives added as negatives to the training set using bootstrap method

Table A.4: Table of libraries used in the project.

Name                   Type                  Internet address
CG                     NVIDIA Cg Toolkit     http://developer.nvidia.com/page/cg_main.html
nv_math                math library          http://cvs1.nvidia.com/LIBS/
ZImage                 Image class library   -
ZClassifierSVMachine   SVM classifier        -
IniFile                Ini file class        http://inifile.sourceforge.net/
NeHeGL                 OpenGL Window class   http://nehe.gamedev.net/
Wvlt                   Wavelet library       http://www.cs.ubc.ca/nest/imager/contributions/bobl/wvlt/top.html

Appendix B

Declaration

I hereby declare that I have prepared this diploma thesis independently. Only the sources and aids explicitly named in the thesis were used. I have marked as such any ideas adopted verbatim or in substance.

Place, date                                   Signature

Bibliography

[1] Shivani Agarwal and Dan Roth. Learning a sparse representation for object detection. In Proceedings of the 7th European Conference on Computer Vision, volume 4, pages 113–130, 2002.

[2] J. Bouguet. Camera calibration toolbox for Matlab. Technical report, Intel Corp., 2001. Available at http://www.vision.caltech.edu/bouguetj/calib_doc/.

[3] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[4] Andrew K. Chan and Cheng Peng. Wavelets for Sensing Technologies. Artech House, October 2003.

[5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, 2003. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[6] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers, 2000.

[7] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Statistical learning theory: A primer. International Journal of Computer Vision, 38(1):9–13, 2000.

[8] Francois Fleuret and Hichem Sahbi. Scale-invariance of support vector machines based on the triangular kernel. Technical report, IMEDIA Research Group, INRIA, Domaine de Voluceau, 78150 Le Chesnay, France, 2002.

[9] David A. Forsyth and Jean Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.

[10] Uwe Franke, Dariu Gavrila, Steffen Gorzig, Frank Lindner, Frank Paetzold, and Christian Wohler. Autonomous driving goes downtown. IEEE Intelligent Systems, 13(6):40–48, 1998.

[11] A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer Academic, 1991.

[12] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings Alvey Vision Conference, pages 147–151, University of Manchester, 1988.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[14] B. Heisele, A. Verri, and T. Poggio. Learning and vision machines. Proceedings of the IEEE, 90:1164–1177, 2002.

[15] Bernd Heisele, Thomas Serre, S. Prentice, and Tomaso Poggio. Hierarchical classification and feature reduction for fast face detection with support vector. In Pattern Recognition, volume 36, pages 2007–2017, 2003.

[16] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, 2003.

[17] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell., 11(7):674–693, 1989.

[18] D. Marr and T. Poggio. A computational theory of human stereo vision. In Royal Society London, volume B 204, pages 301–328, 1979.

[19] M. Bertozzi, A. Broggi, A. Fascioli, and S. Nichele. Stereo vision-based vehicle detection. IEEE Intelligent Vehicles Symposium, October 3-5, 2000.

[20] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system, 1998.

[21] C. P. Papageorgiou and T. Poggio. A trainable object detection system: Car detection in static images. Technical Report AI-Memo-1673, CBCL-180, Massachusetts Institute of Technology, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, October 1999.

[22] Marc Pollefeys. Visual 3D modeling from images. Technical report, University of North Carolina - Chapel Hill, USA, 2003. Available at http://www.cs.unc.edu/~marc/tutorial/.

[23] Ryan M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, Massachusetts Institute of Technology, August 2002.

[24] Ryan Rifkin. SvmFu. Technical report, Massachusetts Institute of Technology, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, 2000. Available at http://five-percent-nation.mit.edu/SvmFu/.

[25] Patrice Y. Simard and Henrique S. Malvar. A wavelet coder for masked images. In Data Compression Conference, pages 93–102, 2001.

[26] Eric J. Stollnitz, Tony D. DeRose, and David H. Salesin. Wavelets for computer graphics: A primer, part 1. IEEE Computer Graphics and Applications, 15(3):76–84, 1995.

[27] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995.
