Robust Multi-Resolution Pedestrian Detection in Traffic
Scenes
Junjie Yan Xucong Zhang Zhen Lei Shengcai Liao Stan Z. Li∗
Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China
{jjyan,xczhang,zlei,scliao,szli}@nlpr.ia.ac.cn
Abstract
The serious performance decline with decreasing resolution is the major bottleneck for current pedestrian detection techniques [14, 23]. In this paper, we take pedestrian detection in different resolutions as different but related problems, and propose a Multi-Task model to jointly consider their commonness and differences. The model contains resolution aware transformations to map pedestrians in different resolutions to a common space, where a shared detector is constructed to distinguish pedestrians from background. For model learning, we present a coordinate descent procedure to learn the resolution aware transformations and the deformable part model (DPM) based detector iteratively. In traffic scenes, there are many false positives located around vehicles; therefore, we further build a context model to suppress them according to the pedestrian-vehicle relationship. The context model can be learned automatically even when the vehicle annotations are not available. Our method reduces the mean miss rate to 60% for pedestrians taller than 30 pixels on the Caltech Pedestrian Benchmark, which noticeably outperforms the previous state-of-the-art (71%).
1. Introduction
Pedestrian detection has been a hot research topic in computer vision for decades, for its importance in real applications such as driving assistance and video surveillance. In recent years, especially due to the popularity of gradient features, the pedestrian detection field has achieved impressive progress in both effectiveness [6, 31, 43, 41, 19, 33] and efficiency [25, 11, 18, 4, 10]. The leading detectors can achieve satisfactory performance on high resolution benchmarks (e.g. INRIA [6]); however, they encounter difficulties on low resolution pedestrians (e.g. 30-80 pixels tall, Fig. 1) [14, 23]. Unfortunately, low resolution pedestrians are often very important in real applications. For example, driver assistance systems need to detect
∗Stan Z. Li is the corresponding author.
Figure 1. Examples of multi-resolution pedestrian detection results of our method on the Caltech Pedestrian Benchmark [14].
the low resolution pedestrians to provide enough time for reaction.
Traditional pedestrian detectors usually follow the scale invariant assumption: a scale invariant feature based detector trained at a fixed resolution can be generalized to all resolutions, by resizing the detector [40, 4], the image [6, 19], or both of them [11]. However, the finite sampling frequency of the sensor results in much information loss for low resolution pedestrians. The scale invariant assumption does not hold in the low resolution case, which leads to a disastrous drop of detection performance with decreasing resolution. For example, the best detector achieves a 21% mean miss rate for pedestrians taller than 80 pixels on the Caltech Pedestrian Benchmark [14], while this increases to 73% for pedestrians 30-80 pixels high.
Our philosophy is that the relationship among different resolutions should be explored for robust multi-resolution pedestrian detection. For example, the low resolution samples contain a lot of noise that may mislead the detector in the training phase, and the information contained in high resolution samples can help to regularize it. We argue that for pedestrians in different resolutions, the differences exist in the features of local patches (e.g. the gradient histogram feature of a cell in HOG), while the global spatial structure stays the same (e.g. the part configuration). To this end, we propose resolution aware transformations to map the local features from different resolutions to a common subspace, where the differences of local features are reduced, and the detector is learned on the mapped features of samples from different resolutions, thus the structural commonness is preserved. Particularly, we extend the popular deformable part model (DPM) [19] to a multi-task DPM (MT-DPM), which aims to find an optimal combination of the DPM detector and the resolution aware transformations. We prove that when the resolution aware transformations are fixed, the multi-task problem can be transformed into a Latent-SVM optimization problem, and when the DPM detector in the mapped space is fixed, the problem equals a standard SVM problem. We divide the complex non-convex problem into these two sub-problems, and optimize them alternately.

2013 IEEE Conference on Computer Vision and Pattern Recognition. 1063-6919/13 $26.00 © 2013 IEEE. DOI 10.1109/CVPR.2013.390
In addition, we propose a new context model to improve detection performance in traffic scenes. We observe that quite a large number of detections (33.19% for MT-DPM in our experiments) are located around vehicles. Vehicle localization is much easier than pedestrian localization, which motivates us to employ the pedestrian-vehicle relationship as an additional cue to judge whether a detection is a false or true positive. We build an energy model to jointly encode the pedestrian-vehicle and geometry contexts, and infer the labels of detections by maximizing the energy function over the whole image. Since vehicle annotations are often not available in pedestrian benchmarks, we further present a method to learn the context model from ground truth pedestrian annotations and noisy vehicle detections.
We conduct experiments on the challenging Caltech Pedestrian Benchmark [14], and achieve significant improvement over previous state-of-the-art methods on all the 9 sub-experiments advised in [14]. For pedestrians taller than 30 pixels, our MT-DPM reduces the mean miss rate by 8%, and our context model further reduces it by 3%, over the previous state-of-the-art performance.
The rest of the paper is organized as follows: Section 2 reviews the related work. The multi-task DPM detector and the pedestrian-vehicle context model are discussed in Section 3 and Section 4, respectively. Section 5 shows the experiments, and finally Section 6 concludes the paper.
2. Related work

There is a long history of research on pedestrian detection. Most modern detectors are based on statistical learning and sliding-window scanning, popularized by [32] and [40]. Large improvements came from robust features, such as [6, 12, 25, 3]. Some papers fused HOG with other features [43, 7, 45, 41] to improve performance. Others focused on special problems in pedestrian detection, including occlusion handling [46, 43, 38, 2], speed [25, 11, 18, 4, 10], and detector transfer to new scenes [42, 27]. We refer the reader to [21, 14] for detailed surveys on pedestrian detection.
Resolution related problems have attracted attention in recent evaluations. [16] found that pedestrian detection performance depends on the resolution of the training samples. [14] pointed out that pedestrian detection performance drops with decreasing resolution. [23] observed a similar phenomenon in the general object detection task. However, very few works have been proposed to tackle this problem. The most related work is [33], which utilized root and part filters for high resolution pedestrians, while using only the rigid root filter for low resolution pedestrians. [4] proposed to use a single model per detection scale, but that paper focused on speedup.
Our pedestrian detector is built on the popular DPM (deformable part model) [19], which combines a rigid root filter and deformable part filters for detection. The DPM only performs well for high resolution objects, while our MT-DPM generalizes it to the low resolution case. The coordinate descent procedure in learning is motivated by the steerable part model [35, 34], which trained shared part bases to accelerate detection. Note that [34] learned shared filter bases, while our model learns a shared classifier, which results in a quite different formulation. [26] also proposed a multi-task model to handle dataset bias. The multi-task idea in this paper is motivated by works on face recognition across different domains, such as [28, 5].
Context has been used in pedestrian detection. [24, 33] captured the geometry constraint under the assumption that the camera is aligned with the ground plane. [9] took the appearance of nearby regions as the context. [8, 36, 29] captured the pair-wise spatial relationship in multi-class object detection. To the best of our knowledge, this is the first work to capture the pedestrian-vehicle relationship to improve pedestrian detection in traffic scenes.
3. Multi-Task Deformable Part Model

There are two intuitive strategies to handle multi-resolution detection. One is to combine samples from different resolutions to train a single detector (Fig. 2(a)), and the other is to train independent detectors for different resolutions (Fig. 2(b)). However, neither strategy is perfect. The first considers the commonness between different resolutions, while their differences are ignored. Samples from different domains would increase the complexity of the decision boundary, which is probably beyond the ability of a single linear detector. On the contrary, the multi-resolution model treats pedestrian detection in different resolutions as independent problems, and the relationship among them is missed. The unreliable features of low resolution pedestrians can mislead the learned detector and make it difficult to generalize to novel test samples.
In this part, we present a multi-resolution detection method by considering the relationship of samples from different resolutions, including their commonness and differences, which are captured simultaneously by a multi-task strategy. Considering the differences between resolutions, we use resolution aware transformations to map features from different resolutions to a common subspace, in which they have similar distributions. A shared detector is trained in the resolution-invariant subspace with samples from all resolutions, to capture the structural commonness. It is easy to see that the first two strategies are special cases of the multi-task strategy.

Figure 2. Different strategies for multi-resolution pedestrian detection: (a) single resolution detector, (b) multi-resolution detector, (c) multi-task detector (high/low resolution transformations followed by a shared detector).

Figure 3. Demonstration of the resolution aware transformations (cell features are mapped to a resolution invariant feature matrix).
Particularly, we extend the idea to the popular DPM detector [19] and propose a multi-task form of DPM. Here we consider a partition into two resolutions (low resolution: 30-80 pixels tall, and high resolution: taller than 80 pixels, as advised in [14]). Note that extending the strategy to other local feature based linear detectors and to more resolution partitions is straightforward.
3.1. Resolution Aware Detection Model
To simplify the notation, we introduce a matrix based representation for DPM. Given the image $I$ and the collection of $m$ part locations $L = (l_0, l_1, \cdots, l_m)$, the HOG feature $\phi_a(I, l_i)$ of the $i$-th part is an $n_h \times n_w \times n_f$ dimensional tensor, where $n_h, n_w$ are the height and width of the HOG cells for the part, and $n_f$ is the dimension of the gradient histogram feature vector for a cell. We reshape $\phi_a(I, l_i)$ into a matrix $\Phi_a(I, l_i)$, where every column represents the features from one cell. The $\Phi_a(I, l_i)$ are further concatenated into a large matrix $\Phi_a(I, L) = [\Phi_a(I, l_0), \Phi_a(I, l_1), \cdots, \Phi_a(I, l_m)]$. The column number of $\Phi_a(I, L)$ is denoted as $n_c$, which is the total number of cells in the parts and root. A demonstration of the procedure is shown in Fig. 3. The appearance filters in the detector are concatenated into an $n_f \times n_c$ matrix $W_a$ in the same way. The spatial features of the different parts are concatenated into a vector $\phi_s(I, L)$, and the spatial prior parameter is denoted as $w_s$. With these notations, the detection model of DPM [19] can be written as:

$$\mathrm{score}(I, L) = \operatorname{Tr}(W_a^T \Phi_a(I, L)) + w_s^T \phi_s(I, L), \quad (1)$$
where $\operatorname{Tr}(\cdot)$ is the trace operation, defined as the summation of the elements on the main diagonal of a matrix. Given the root location $l_0$, all the part locations are latent variables, and the final score is $\max_{L^*} \mathrm{score}(I, L^*)$, where $L^*$ is the best possible part configuration when the root location is fixed to $l_0$. The problem can be solved effectively by dynamic programming [19]. Mixtures can be used to increase the flexibility, but we ignore them for simplicity of notation; adding mixtures to the formulations is straightforward.
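As a concrete illustration of the matrix notation in Eq. 1, the appearance score is simply the trace of $W_a^T \Phi_a$, which equals the elementwise dot product of the two matrices. The following is a minimal NumPy sketch; the function and variable names are ours, not the authors' implementation:

```python
import numpy as np

def dpm_score(W_a, Phi_a, w_s, phi_s):
    """Matrix form of the DPM score in Eq. 1 (illustrative sketch).

    W_a   : (n_f, n_c) concatenated appearance filters
    Phi_a : (n_f, n_c) concatenated per-cell HOG features Phi_a(I, L)
    w_s   : (d,)       spatial prior parameters
    phi_s : (d,)       concatenated spatial (deformation) features
    """
    return np.trace(W_a.T @ Phi_a) + w_s @ phi_s

# Tr(W^T Phi) is just the elementwise dot product of W and Phi,
# which is why the trace notation yields a compact linear model:
rng = np.random.default_rng(0)
W, Phi = rng.normal(size=(31, 40)), rng.normal(size=(31, 40))
assert np.isclose(np.trace(W.T @ Phi), np.sum(W * Phi))
```

Because the score is linear in $(W_a, w_s)$, the whole model can be flattened into a single weight vector for an SVM solver, which is what the learning procedure below exploits.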
In DPM, a pedestrian consists of parts, and every part consists of HOG cells. When the pedestrian resolution changes, the structure of the parts and the spatial relationship of the HOG cells stay the same. The only difference between resolutions lies in the feature vector of every cell, so the resolution aware transformations $P_L$ and $P_H$ are defined on it. $P_L$ and $P_H$ are of dimension $n_d \times n_f$, and they map the low and high resolution samples from the original $n_f$ dimensional feature space to the $n_d$ dimensional subspace. The features from different resolutions are mapped into the common subspace, so that they can share the same detector. We still denote the learned appearance parameters in the mapped resolution invariant subspace as $W_a$, which is an $n_d \times n_c$ matrix, of the same size as $P_H \Phi_a(I, L)$. The score of a collection of part locations $L$ in the MT-DPM is defined as:

$$\mathrm{score}(I, L) = \begin{cases} \operatorname{Tr}(W_a^T P_H \Phi_a(I, L)) + w_s^T \phi_s(I, L), & \text{high resolution} \\ \operatorname{Tr}(W_a^T P_L \Phi_a(I, L)) + w_s^T \phi_s(I, L), & \text{low resolution.} \end{cases} \quad (2)$$
The model defined above provides the flexibility to describe pedestrians of different resolutions, but it also brings challenges, since $W_a$, $w_s$, $P_H$, $P_L$ are all unknown. In the following part, we present the objective function of the multi-task model for learning, and show the optimization method.
3.2. Multi-Task Learning

The objective function is motivated by the original single task DPM. Its matrix form can be written as:

$$\arg\min_{W_a, w_s} \frac{1}{2}\|W_a\|_F^2 + \frac{1}{2} w_s^T w_s + C \sum_n^N \max[0,\, 1 - y_n(\operatorname{Tr}(W_a^T \Phi_a(I_n, L_n^*)) + w_s^T \phi_s(L_n^*))], \quad (3)$$

where $\|\cdot\|_F$ is the Frobenius norm, with $\|W_a\|_F^2 = \operatorname{Tr}(W_a W_a^T)$; $y_n$ is $1$ if $I_n(L_n)$ is a pedestrian, and $-1$ for background. The first two terms regularize the detector parameters, and the last term is the hinge loss in DPM detection. $L_n^*$ is the optimized part configuration that maximizes the detection score of $I_n$. In the learning phase, the part locations are taken as latent variables, and the problem can be optimized by Latent-SVM [19].
For multi-task learning, the relationship between the different tasks should be considered. In analogy to the original DPM, MT-DPM is formulated as:

$$\arg\min_{W_a, w_s, P_H, P_L} \frac{1}{2} w_s^T w_s + f_{I_H}(W_a, w_s, P_H) + f_{I_L}(W_a, w_s, P_L), \quad (4)$$

where $I_H$ and $I_L$ denote the high and low resolution training sets, including both pedestrians and background. Since the spatial term $w_s$ is directly applied to the data from different resolutions, it can be regularized independently. $f_{I_H}$ and $f_{I_L}$ account for the detection loss and regularize the parameters $P_H$, $P_L$ and $W_a$. $f_{I_H}$ and $f_{I_L}$ have the same form; here we take $f_{I_H}$ as an example. It can be written as:

$$f_{I_H}(W_a, w_s, P_H) = \frac{1}{2}\|P_H^T W_a\|_F^2 + C \sum_n^{N_H} \max[0,\, 1 - y_n(\operatorname{Tr}(W_a^T P_H \Phi_a(I_n^H, L_n^*)) + w_s^T \phi_s(L_n^*))], \quad (5)$$

where the regularization term $P_H^T W_a$ is an $n_f \times n_c$ matrix, of the same dimension as the original feature matrix. Since $P_H$ and $W_a$ are applied to the original appearance feature integrally when calculating the appearance score $\operatorname{Tr}((P_H^T W_a)^T \Phi_a(I, L))$, we take them as an ensemble and regularize them together. The second term is the detection loss for resolution aware detection, corresponding to the detection model in Eq. 2. The parameters $W_a$ and $w_s$ are shared between $f_{I_H}$ and $f_{I_L}$. Note that more resolution partitions can be handled naturally in Eq. 4.
In Eq. 4, we need to find an optimal combination of $W_a$, $w_s$, $P_H$, and $P_L$. However, Eq. 4 is not convex when all of them are free. Fortunately, we show that given the two transformations, the problem can be transformed into a standard DPM problem, and given the DPM detector, it can be transformed into a standard SVM problem. We conduct a coordinate descent procedure to optimize the two sub-problems iteratively.
3.2.1 Optimize $W_a$ and $w_s$

When $P_H$ and $P_L$ are fixed, we can map the features to the common space, on which the DPM detector can be learned. We denote $P_H P_H^T + P_L P_L^T$ as $A$, and $A^{\frac{1}{2}} W_a$ as $\tilde W_a$. For high resolution samples we denote $A^{-\frac{1}{2}} P_H \Phi_a(I_n, L_n^*)$ as $\tilde\Phi_a(I_n, L_n^*)$, and for low resolution samples we denote $A^{-\frac{1}{2}} P_L \Phi_a(I_n, L_n^*)$ as $\tilde\Phi_a(I_n, L_n^*)$. Eq. 4 can be reformulated as:

$$\arg\min_{\tilde W_a, w_s} \frac{1}{2}\|\tilde W_a\|_F^2 + \frac{1}{2} w_s^T w_s + C \sum_n^{N_H + N_L} \max[0,\, 1 - y_n(\operatorname{Tr}(\tilde W_a^T \tilde\Phi_a(I_n, L_n^*)) + w_s^T \phi_s(L_n^*))], \quad (6)$$

which has the same form as the optimization problem in Eq. 3, so the Latent-SVM solver can be used here. Once the solution to Eq. 6 is obtained, $W_a$ is computed as $(P_H P_H^T + P_L P_L^T)^{-\frac{1}{2}} \tilde W_a$.
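The change of variables above can be verified numerically: multiplying the weights by $A^{\frac{1}{2}}$ and the mapped features by $A^{-\frac{1}{2}}$ leaves the score of Eq. 2 unchanged, while turning the coupled regularizer of Eq. 5 into a plain Frobenius norm. A NumPy sketch (our own names and random data; $A$ is positive definite for such random transformations, as required by the inverse square root):

```python
import numpy as np

def mat_power(A, p):
    # power of a symmetric positive definite matrix via eigendecomposition
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals ** p) @ vecs.T

rng = np.random.default_rng(0)
n_d, n_f, n_c = 16, 31, 40
P_H, P_L = rng.normal(size=(n_d, n_f)), rng.normal(size=(n_d, n_f))
W_a = rng.normal(size=(n_d, n_c))
Phi = rng.normal(size=(n_f, n_c))            # a high resolution sample

A = P_H @ P_H.T + P_L @ P_L.T                # A = P_H P_H^T + P_L P_L^T
W_tilde = mat_power(A, 0.5) @ W_a            # A^{1/2} W_a
Phi_tilde = mat_power(A, -0.5) @ P_H @ Phi   # A^{-1/2} P_H Phi_a

# the transformation preserves the appearance score of Eq. 2 ...
assert np.isclose(np.trace(W_tilde.T @ Phi_tilde),
                  np.trace(W_a.T @ P_H @ Phi))
# ... and collapses the two regularizers of Eq. 4 into ||W_tilde||_F^2
assert np.isclose(np.linalg.norm(W_tilde, 'fro') ** 2,
                  np.linalg.norm(P_H.T @ W_a, 'fro') ** 2
                  + np.linalg.norm(P_L.T @ W_a, 'fro') ** 2)
```

This is exactly why Eq. 6 has the shape of Eq. 3 and can be handed to the Latent-SVM solver unchanged.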
3.2.2 Optimize $P_H$ and $P_L$

When $W_a$ and $w_s$ are fixed, $P_H$ and $P_L$ are independent, so the optimization problem can be divided into two subproblems: $\arg\min_{P_H} f_{I_H}(W_a, w_s, P_H)$ and $\arg\min_{P_L} f_{I_L}(W_a, w_s, P_L)$. Since they have the same form, here we only give the details for optimizing $P_H$.

Given $W_a$ and $w_s$, we first infer the part locations $L_n^*$ of every training sample by finding the part configuration that maximizes Eq. 2. Denoting $W_a W_a^T$ as $A$, $A^{\frac{1}{2}} P_H$ as $\tilde P_H$, and $A^{-\frac{1}{2}} W_a \Phi_a(I_n^H, L_n^*)^T$ as $\tilde\Phi_a(I_n^H, L_n^*)$, the problem of Eq. 4 equals:

$$\arg\min_{\tilde P_H} \frac{1}{2}\|\tilde P_H\|_F^2 + C \sum_n^{N_H} \max[0,\, 1 - y_n(\operatorname{Tr}(\tilde P_H^T \tilde\Phi_a(I_n^H, L_n^*)) + w_s^T \phi_s(L_n^*))]. \quad (7)$$

The only difference between Eq. 7 and a standard SVM is the additional term $w_s^T \phi_s(L_n^*)$. Since $w_s^T \phi_s(L_n^*)$ is a constant in the optimization, it can be taken as an additional dimension of $\mathrm{Vec}(\tilde\Phi_a(I_n^H, L_n^*))$. In this way, Eq. 7 can be solved by a standard SVM solver. After we get $\tilde P_H$, $P_H$ can then be computed as $(W_a W_a^T)^{-\frac{1}{2}} \tilde P_H$.
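The vectorization trick at the end of this subsection can be made concrete: flattening $\tilde\Phi_a$ and appending the constant spatial score as one extra feature dimension reproduces the full score of Eq. 7 as a single inner product. A sketch with our own names and random data (the appended weight is fixed to 1 by construction here, whereas a real SVM solver would regularize the corresponding entry):

```python
import numpy as np

rng = np.random.default_rng(1)
n_d, n_f, n_c = 16, 31, 40
W_a = rng.normal(size=(n_d, n_c))
A = W_a @ W_a.T                              # A = W_a W_a^T (pos. definite here)
vals, vecs = np.linalg.eigh(A)
A_half = (vecs * np.sqrt(vals)) @ vecs.T     # A^{1/2}
A_neg_half = (vecs / np.sqrt(vals)) @ vecs.T # A^{-1/2}

Phi = rng.normal(size=(n_f, n_c))            # Phi_a(I_n^H, L_n*)
spatial_score = 0.37                         # constant w_s^T phi_s(L_n*)

# per-sample SVM feature: vec(A^{-1/2} W_a Phi^T) with one extra
# dimension holding the fixed spatial score
Phi_tilde = A_neg_half @ W_a @ Phi.T         # (n_d, n_f)
x_n = np.append(Phi_tilde.ravel(), spatial_score)

# matching weight vector: vec(P_H_tilde) with a 1 appended, so the
# inner product equals Tr(P_H_tilde^T Phi_tilde) + w_s^T phi_s
P_H = rng.normal(size=(n_d, n_f))
P_H_tilde = A_half @ P_H
w = np.append(P_H_tilde.ravel(), 1.0)

assert np.isclose(w @ x_n,
                  np.trace(P_H_tilde.T @ Phi_tilde) + spatial_score)
```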
3.2.3 Training Details

To start the loop of the coordinate descent procedure, one needs to provide initial values for either $\{W_a, w_s\}$ or $\{P_H, P_L\}$. In our implementation, we calculate the PCA of HOG features from randomly generated high and low resolution patches, and use the first $n_d$ eigenvectors as the initial values of $P_H$ and $P_L$, respectively. We use the HOG features in [19] and abandon the last truncation term, thus $n_f = 31$ in our experiments. The dimension $n_d$ determines how much information is kept for sharing. We examine the effect of $n_d$ in the experiments. The solvers for the problems in Eq. 6 and Eq. 7 are based on [22]. The maximum number of coordinate descent loops is set to 8. The bin size in HOG is set to 8 for the high resolution model, and 4 for the low resolution model. The root filter contains $8 \times 4$ HOG cells for both the low and high resolution detection models.
4. Pedestrian-Vehicle Context in Traffic Scenes

A lot of detections are located around vehicles in traffic scenes (33.19% for our MT-DPM detector on the Caltech Benchmark), as shown in Fig. 4. It is possible to use the pedestrian-vehicle relationship to infer whether a detection is a true or false positive. For example, if we know the locations of the vehicles in Fig. 4, the detections above a vehicle and the detection at the wheel position of a vehicle can be safely removed. Fortunately, vehicles are much easier to localize than pedestrians, which has been proved in previous work (e.g. Pascal VOC [17], KITTI [20]). Since it is difficult to capture the complex relationship by handcrafted rules, we build a context model and learn it automatically from data.
We split the spatial relationship between pedestrians and vehicles into five types: "Above", "Next-to", "Below", "Overlap" and "Far". We denote the pedestrian-vehicle context feature as $g(p, v)$. If a pedestrian detection $p$ and a vehicle detection1 $v$ have one of the first four relationships, the context features at the corresponding dimensions are defined as $(\sigma(s), \Delta c_x, \Delta c_y, \Delta h, 1)$, and the other dimensions remain 0. If the pedestrian detection and vehicle detection are too far apart, or there is no vehicle, all the dimensions of the pedestrian-vehicle feature are 0. Here $\Delta c_x = |c_{v_x} - c_{p_x}|$, $\Delta c_y = c_{v_y} - c_{p_y}$, and $\Delta h = h_v / h_p$, where $(c_{v_x}, c_{v_y})$ and $(c_{p_x}, c_{p_y})$ are the center coordinates of the vehicle detection $v$ and the pedestrian detection $p$, respectively. $\sigma(s) = 1/(1 + \exp(-2s))$ is used to normalize the detection score to $[0, 1]$. For left-right symmetry, the absolute value is taken for $\Delta c_x$. Moreover, as pointed out in [33], there is also a relationship between the coordinate and the scale of pedestrians under the assumption that the camera is aligned with the ground plane. We further define this geometry context feature for a pedestrian detection $p$ as $g(p) = (\sigma(s), c_y, h, c_y^2, h^2)$, where $s$, $c_y$, $h$ are the detection score, y-center and height of the detection respectively, and $c_y$ and $h$ are normalized by the height of the image.
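The two feature maps above can be sketched in a few lines. This is illustrative only: the dict-based detection format is ours, the relation type is assumed to come from a separate spatial test, and we assume the score $s$ in $g(p, v)$ is the vehicle's (the pedestrian's own score already appears in $g(p)$):

```python
import math

RELATIONS = ["above", "next_to", "below", "overlap"]  # "far" -> all zeros

def sigma(s):
    # normalizes a detection score to [0, 1]
    return 1.0 / (1.0 + math.exp(-2.0 * s))

def geometry_feature(p, img_h):
    # g(p) = (sigma(s), c_y, h, c_y^2, h^2), c_y and h normalized by image height
    cy, h = p["cy"] / img_h, p["h"] / img_h
    return [sigma(p["score"]), cy, h, cy * cy, h * h]

def context_feature(p, v, relation):
    # g(p, v): one 5-dim slot per relation type; only the slot of the
    # detected relation is filled, all others stay 0 ("far": all zeros)
    feat = [0.0] * (5 * len(RELATIONS))
    if relation in RELATIONS:
        dcx = abs(v["cx"] - p["cx"])      # absolute value: left-right symmetry
        dcy = v["cy"] - p["cy"]
        dh = v["h"] / p["h"]
        k = 5 * RELATIONS.index(relation)
        feat[k:k + 5] = [sigma(v["score"]), dcx, dcy, dh, 1.0]
    return feat

p = {"cx": 100.0, "cy": 240.0, "h": 60.0, "score": 0.0}
v = {"cx": 130.0, "cy": 270.0, "h": 90.0, "score": 2.0}
assert geometry_feature(p, 480)[0] == 0.5              # sigma(0) = 0.5
assert context_feature(p, v, "far") == [0.0] * 20      # "Far": zero vector
```

Keeping a dedicated slot per relation type lets a single linear weight vector $w_v$ learn a different response for "Above" than for "Below", which is exactly what the wheel-position and roof-position false positives require.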
To fully encode the context, we define the model on the whole image. The context score is the summation of the context scores of all pedestrian detections, and the context score of a pedestrian is further divided into its geometry and pedestrian-vehicle scores. Suppose there are $n$ pedestrian detections $P = \{p_1, p_2, \cdots, p_n\}$ and $m$ vehicle detections $V = \{v_1, v_2, \cdots, v_m\}$ in an image; the context score of the image is defined as:

$$S(P, V) = \sum_{i=1}^{n} \Big( w_p^T g(p_i) + \sum_{j=1}^{m} w_v^T g(p_i, v_j) \Big), \quad (8)$$
1 We use a DPM based vehicle detector trained on Pascal VOC 2012 [17] in our experiments.
Figure 4. Examples of the original detections and the detections optimized by the context model.
where $w_p$ and $w_v$ are the parameters of the geometry context and the pedestrian-vehicle context, which ensure the true detection $(P, V)$ has a larger context score than any other detection hypothesis.
Given the original pedestrian and vehicle detections $P$ and $V$, whether each detection is a false or true positive is decided by maximizing the context score:

$$\arg\max_{t_{p_i}, t_{v_j}} \sum_{i=1}^{n} \Big( t_{p_i} w_p^T g(p_i) + t_{p_i} \sum_{j=1}^{m} t_{v_j} w_v^T g(p_i, v_j) \Big), \quad (9)$$

where $t_{p_i}$ and $t_{v_j}$ are binary variables; 0 means a false positive and 1 means a true positive. Eq. 9 is an integer programming problem, but it becomes trivial when the labels of $V$ are fixed, since it then amounts to maximizing over every pedestrian independently. In typical traffic scenes, the number of vehicles is limited. For example, in the Caltech Pedestrian Benchmark there are no more than 8 vehicles in an image, so the problem can be solved by no more than $2^8$ trivial sub-problems, which is very efficient in real applications.
By linearity, Eq. 9 is equal to:

$$\arg\max_{t_{p_i}, t_{v_j}} [w_p, w_v] \Big[ \sum_{i=1}^{n} t_{p_i} g(p_i),\ \sum_{i=1}^{n} t_{p_i} \sum_{j=1}^{m} t_{v_j} g(p_i, v_j) \Big]^T. \quad (10)$$

Eq. 10 provides a natural way for max-margin learning. We use $w_c$ to denote $[w_p, w_v]$. Given the ground truth hypotheses of vehicles and pedestrians, a standard structural SVM [39] can be used here to discriminatively learn $w_c$ by solving the following problem:

$$\min_{w_c, \xi_k} \frac{1}{2}\|w_c\|_2^2 + \lambda \sum_k^K \xi_k \quad (11)$$
$$\text{s.t.}\ \forall P', \forall V':\ S(P_k, V_k) - S(P'_k, V'_k) \ge L(P_k, P'_k) - \xi_k,$$

where $P'_k$ and $V'_k$ are arbitrary pedestrian and vehicle hypotheses in the $k$-th image, and $P_k$ and $V_k$ are the ground truth. $L(P_k, P'_k)$ is the Hamming loss between the pedestrian detection hypothesis $P'_k$ and the ground truth $P_k$. The difficulty in pedestrian based applications is that only the pedestrian ground truth $P_k$ is available in public pedestrian databases, and the vehicle annotation $V_k$ is unknown. To address the problem, we use the noisy vehicle detection result as the initial estimation of $V_k$, and jointly learn the context model and infer whether each vehicle detection is a true or false positive, by optimizing the following problem:

$$\min_{w_c, \xi_k} \frac{1}{2}\|w_c\|_2^2 + \lambda \sum_k^K \xi_k \quad (12)$$
$$\text{s.t.}\ \forall P', \forall V':\ \max_{\hat V_k \subseteq V_k} S(P_k, \hat V_k) - S(P'_k, V'_k) \ge L(P_k, P'_k) - \xi_k,$$

where $\hat V_k$ is a subset of $V_k$, which reflects the current inference of the vehicle detections obtained by maximizing the overall context score. Eq. 12 can be solved by optimizing the model parameters $w_c$ and the vehicle labels $\hat V_k$ iteratively. In the learning phase, the initial $P'_k$ is the pedestrian detection result of MT-DPM.
5. Experiments

Experiments are conducted on the Caltech Pedestrian Benchmark [14]2. Following the experimental protocol, set00-set05 are used for training and set06-set10 are used for testing. We use the ROC or the mean miss rate3 to compare methods, as advised in [14]. For more details of the benchmark, please refer to [14]. There are various sub-experiments on the benchmark to compare detectors under different conditions. Due to space limitations, we only report the most relevant, and leave the results of the other sub-experiments to the supplemental material. We emphasize that our method significantly outperforms all the 17 methods evaluated in [14] on the 9 sub-experiments.
In the following experiments, we examine the influence of the subspace dimension in MT-DPM, then compare it with other strategies for low resolution detection. The contribution of the context model is also validated at different FPPI. Finally, we compare the performance with other state-of-the-art detectors.
5.1. The Subspace Dimension in MT-DPM

The dimension of the mapped common subspace in MT-DPM reflects the tradeoff between commonness and differences among the resolutions. A high dimensional subspace can capture more differences, but may lose the generality. We examine the parameter between 8 and 18 with an interval of 2, and measure the performance on pedestrians taller than 30 pixels. We report the mean miss rate, as shown in Fig. 5. The MT-DPM achieves the lowest miss
2 http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/
3 We use the mean miss rate defined in P. Dollár's toolbox, which is the mean of the miss rates at 0.0100, 0.0178, 0.0316, 0.0562, 0.1000, 0.1778, 0.3162, 0.5623 and 1.0000 false positives per image.
Figure 5. Influence of the subspace dimension in MT-DPM (mean miss rate: 0.684 at $n_d$=8, 0.658 at $n_d$=10, 0.657 at $n_d$=12, 0.642 at $n_d$=14, 0.631 at $n_d$=16, 0.640 at $n_d$=18).
Figure 6. Results of different methods in multi-resolution pedestrian detection.
Figure 7. Contributions of the context cues in multi-resolution pedestrian detection (pedestrians taller than 30 pixels; miss rate of the original detection vs. the context model: 0.7718 vs. 0.7551 at FPPI=0.01, 0.6305 vs. 0.6087 at FPPI=0.1, and 0.4926 vs. 0.4603 at FPPI=1).
rate when the dimension is set to 16, and it tends to be stable between 14 and 18. In the following experiments, we fix it to 16.
5.2. Comparisons with Other Detection Strategies

We compare the proposed MT-DPM with other strategies for multi-resolution pedestrian detection. All the detectors are based on DPM and applied to the original images except where specially mentioned. The compared methods include: (1) DPM trained on the high resolution pedestrians; (2) DPM trained on the high resolution pedestrians and tested by resizing the images 1.5, 2.0 and 2.5 times, respectively; (3) DPM trained on the low resolution pedestrians; (4) DPM trained on both the high and low resolution pedestrian data (Fig. 2(a)); (5) multi-resolution DPMs trained on high resolution and low resolution independently, with their detection results fused (Fig. 2(b)).

ROCs for pedestrians taller than 30 pixels are reported
Figure 8. Quantitative results of MT-DPM, MT-DPM+Context and other methods on the Caltech Pedestrian Benchmark (mean miss rates per method):
(a) Multi-resolution (taller than 30 pixels): 99% VJ, 96% Shapelet, 93% PoseInv, 90% LatSvm-V1, 87% HikSvm, 86% FtrMine, 86% HOG, 84% HogLbp, 82% MultiFtr, 82% LatSvm-V2, 78% Pls, 78% MultiFtr+CSS, 76% FeatSynth, 75% FPDW, 74% ChnFtrs, 74% MultiFtr+Motion, 71% MultiResC, 63% MT-DPM, 60% MT-DPM+Context.
(b) Low resolution (30-80 pixels high): 99% VJ, 97% Shapelet, 93% LatSvm-V1, 93% PoseInv, 93% HogLbp, 89% HikSvm, 87% HOG, 87% FtrMine, 86% LatSvm-V2, 84% MultiFtr, 82% MultiFtr+CSS, 82% Pls, 80% MultiFtr+Motion, 78% FPDW, 78% FeatSynth, 77% ChnFtrs, 73% MultiResC, 67% MT-DPM, 64% MT-DPM+Context.
(c) Reasonable (taller than 50 pixels): 95% VJ, 91% Shapelet, 86% PoseInv, 80% LatSvm-V1, 74% FtrMine, 73% HikSvm, 68% HOG, 68% MultiFtr, 68% HogLbp, 63% LatSvm-V2, 62% Pls, 61% MultiFtr+CSS, 60% FeatSynth, 57% FPDW, 56% ChnFtrs, 51% MultiFtr+Motion, 48% MultiResC, 41% MT-DPM, 38% MT-DPM+Context.
in Fig. 6. The high resolution model cannot detect the low resolution pedestrians directly, but some of the low resolution pedestrians can be detected by resizing the images. However, the number of false positives also increases, which may hurt the performance (see HighResModel-Image1.5X, HighResModel-Image2.0X and HighResModel-Image2.5X in Fig. 6). The low resolution DPM outperforms the high resolution DPM, since there are more low resolution pedestrians than high resolution ones. Combining low and high resolution always helps, but the improvement depends on the strategy. Fusing low and high resolution data to train a single detector is better than training two independent detectors. By exploring the relationship of samples from different resolutions, our MT-DPM outperforms all the other methods.
5.3. Improvements of Context Model

We apply the context model to the detections of MT-DPM, and optimize every image independently. The miss rates at 0.01, 0.1 and 1 FPPI for pedestrians taller than 30 pixels are shown in Fig. 7. The context model reduces the miss rate from 63.05% to 60.87% at 0.1 FPPI. The improvement of the context is more remarkable when more false positives are allowed; for example, there is a 3.2% reduction of the miss rate at 1 FPPI.
5.4. Comparisons with State-of-the-art Methods

In this part, we compare the proposed method with other state-of-the-art methods evaluated in [14], including: Viola-Jones [40], Shapelet [44], LatSvm-V1, LatSvm-V2 [19], PoseInv [30], HogLbp [43], HikSvm [31], HOG [6], FtrMine [13], MultiFtr [44], MultiFtr+CSS [44], Pls [37], MultiFtr+Motion [44], FPDW [11], FeatSynth [1], ChnFtrs [12], MultiResC [33]. The results of the proposed methods are denoted as MT-DPM and MT-DPM+Context. Due to space limitations, we only show the results for multi-resolution pedestrians (Fig. 8(a), taller than 30 pixels), low resolution (Fig. 8(b), 30-80 pixels high), and reasonable
(Fig. 8(c), taller than 50 pixels)4. Our MT-DPM significantly outperforms the previous state-of-the-art, with at least a 6% margin in mean miss rate on all three experiments. The proposed context model further improves the performance by about 3%. Because the ROC of [9] is not available, its performance is not shown here; but as reported in [9], it achieved a 48% mean miss rate in the reasonable condition, while our method reduces it to 41%. The most related method is MultiResC [33], where a multi-resolution model is also used. Our method outperforms it by an 11% margin for multi-resolution detection, which demonstrates the advantage of the proposed method.
5.5. Implementation Details
The learned MT-DPM detector can benefit from many existing speed-up
methods for DPM. In our implementation, we modified the code of the
FFT-based implementation [15] for fast convolution computation. The
time for processing one frame is less than 1 s on a standard PC,
including high resolution and low resolution pedestrian detection,
vehicle detection, and the context model. Further speed-up can be
achieved by parallel computing or by pruning the search space with
temporal information.
6. Conclusion
In this paper, we propose a Multi-Task DPM detector to jointly encode
the commonness and differences between pedestrians at different
resolutions, achieving robust performance for multi-resolution
pedestrian detection. The pedestrian-vehicle relationship is modeled
to infer true or false positives in traffic scenes, and we show how to
learn it automatically from the data. Experiments on the challenging
Caltech Pedestrian Benchmark show a significant improvement over
state-of-the-art performance. Our future work is to explore
spatio-temporal information and extend the proposed models to the
general object detection task.
4 Results of other sub-experiments are in the supplemental material.
Figure 9. Qualitative results of the proposed method on the Caltech
Pedestrian Benchmark (the threshold corresponds to 0.1 FPPI).
Acknowledgement
We thank the anonymous reviewers for their valuable feedback. This
work was supported by Chinese National Natural Science Foundation
Projects #61070146, #61105023, #61103156, #61105037, #61203267,
National IoT R&D Project #2150510, National Science and Technology
Support Program Project #2013BAK02B01, Chinese Academy of Sciences
Project No. KGZD-EW-102-2, European Union FP7 Project #257289
(TABULA RASA), and AuthenMetric R&D Funds.
References
[1] A. Bar-Hillel, D. Levi, E. Krupka, and C. Goldberg. Part-based feature synthesis for human detection. In ECCV, 2010.
[2] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. TPAMI, 2012.
[3] C. Beleznai and H. Bischof. Fast human detection in crowded scenes by contour integration and local shape estimation. In CVPR, 2009.
[4] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
[5] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution face images. TPAMI, 2012.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
[8] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 2011.
[9] Y. Ding and J. Xiao. Contextual boost for pedestrian detection. In CVPR, 2012.
[10] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV, 2012.
[11] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[12] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
[13] P. Dollár, Z. Tu, H. Tao, and S. Belongie. Feature mining for image classification. In CVPR, 2007.
[14] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[15] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors. In ECCV, 2012.
[16] M. Enzweiler and D. Gavrila. Monocular pedestrian detection: Survey and experiments. TPAMI, 2009.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC 2012 results.
[18] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
[19] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[20] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[21] D. Geronimo, A. Lopez, A. Sappa, and T. Graf. Survey of pedestrian detection for advanced driver assistance systems. TPAMI, 2010.
[22] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[23] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.
[24] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.
[25] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, 2010.
[26] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In ECCV, 2012.
[27] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.
[28] Z. Lei and S. Z. Li. Coupled spectral regression for matching heterogeneous faces. In CVPR, 2009.
[29] C. Li, D. Parikh, and T. Chen. Automatic discovery of groups of objects for scene understanding. In CVPR, 2012.
[30] Z. Lin and L. Davis. A pose-invariant descriptor for human detection and segmentation. In ECCV, 2008.
[31] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[32] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 2000.
[33] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, 2010.
[34] H. Pirsiavash and D. Ramanan. Steerable part models. In CVPR, 2012.
[35] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear classifiers for visual recognition. In NIPS, 2009.
[36] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
[37] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis. Human detection using partial least squares analysis. In ICCV, 2009.
[38] S. Tang, M. Andriluka, and B. Schiele. Detection and tracking of occluded people. In BMVC, 2012.
[39] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2006.
[40] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 2005.
[41] S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection. In CVPR, 2010.
[42] M. Wang, W. Li, and X. Wang. Transferring a generic pedestrian detector towards specific scenes. In CVPR, 2012.
[43] X. Wang, T. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
[44] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In DAGM, 2008.
[45] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In CVPR, 2009.
[46] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multi-pedestrian detection in crowded scenes: A global view. In CVPR, 2012.