openaccess.thecvf.com/content_ICCV_2019/papers/Cheng... · 2019-10-23
Learning Spatial Awareness to Improve Crowd Counting
Zhi-Qi Cheng1,2∗, Jun-Xiu Li1,3∗, Qi Dai3, Xiao Wu1, Alexander G. Hauptmann2
1Southwest Jiaotong University, 2Carnegie Mellon University, 3Microsoft Research
{zhiqic, alex}@cs.cmu.edu, {lijunxiu@my, wuxiaohk@home}.swjtu.edu.cn, [email protected]
Abstract

The aim of crowd counting is to estimate the number
of people in images by leveraging the annotation of cen-
ter positions for pedestrians’ heads. Promising progress
have been made with the prevalence of deep Convolutional
Neural Networks. Existing methods widely employ the Eu-
clidean distance (i.e., L2 loss) to optimize the model, which,
however, has two main drawbacks: (1) the loss has diffi-
culty in learning the spatial awareness (i.e., the position of
head) since it struggles to retain the high-frequency vari-
ation in the density map, and (2) the loss is highly sensi-
tive to various noises in crowd counting, such as the zero-
mean noise, head size changes, and occlusions. Although
the Maximum Excess over SubArrays (MESA) loss has been
previously proposed by [16] to address the above issues by
finding the rectangular subregion whose predicted density
map has the maximum difference from the ground truth, it
cannot be solved by gradient descent, thus can hardly be
integrated into the deep learning framework. In this pa-
per, we present a novel architecture called SPatial Aware-
ness Network (SPANet) to incorporate spatial context for
crowd counting. The Maximum Excess over Pixels (MEP)
loss is proposed to achieve this by finding the pixel-level
subregion with high discrepancy to the ground truth. To
this end, we devise a weakly supervised learning scheme to
generate such region with a multi-branch architecture. The
proposed framework can be integrated into existing deep
crowd counting methods and is end-to-end trainable. Ex-
tensive experiments on four challenging benchmarks show
that our method can significantly improve the performance
of baselines. More remarkably, our approach outperforms
the state-of-the-art methods on all benchmark datasets.
1. Introduction
The problem of crowd counting is described in [16]. Dif-
ferent from visual object detection, it is impossible to pro-
vide bounding boxes for all pedestrians due to the extremely
dense crowds. On the other hand, when only the total crowd
counts of the images are provided, the training process becomes
notably difficult since the spatial awareness is completely
ignored. Therefore, to preserve as many spatial constraints as
possible while reducing annotation cost, the previous work [16]
began providing only the center points of heads and using a
Gaussian distribution to generate ground-truth density maps.
This annotation scheme has been widely adopted by subsequent
studies.

∗ indicates equal contribution. This work was done when Zhi-Qi Cheng
and Jun-Xiu Li were visiting Microsoft Research. Xiao Wu is the
corresponding author.

Figure 1: The L2 loss function has difficulty in learning the spatial
awareness and is sensitive to various noises in crowd counting, which
leads to underestimation in high-density regions (the first row of each
example) and overestimation in low-density regions (the second row of
each example). Note that the corresponding improvements of our method
are shown in Figure 5.
Existing crowd counting approaches mainly focus on im-
proving the scale invariance of feature representation, in-
cluding the multi-column networks [13, 38, 39, 42, 52, 6],
scale aggregation modules [3, 47], and scale-invariant net-
works [9, 17, 20, 39, 45]. Although the architectures of these
methods differ, the L2 loss function is employed by
most of them. As a result, the spatial awareness in crowd
images is largely ignored, though more scale information is
embedded into their features.
We have examined three state-of-the-art approaches
(i.e., MCNN [52], CSRNet [17], and SANet [3]) on
four crowd counting datasets (i.e., ShanghaiTech [52],
UCF CC 50 [11], WorldExpo’10 [48], and UCSD [4]).
Two examples are shown in Figure 1. Similar to [3, 19, 20],
we observe that dense-crowd regions are usually underesti-
mated, while sparse-crowd regions are overestimated. Such
phenomenon is due to two main factors. First, the pixel-
wise L2 loss struggles to retain the high-frequency variation
in the density map: minimizing L2 loss encourages finding
pixel-wise averages of plausible solutions which are typi-
cally overly-smooth and thus have poor spatial awareness
[15]. Second, L2 loss is highly sensitive to typical noises in
crowd counting, including the zero-mean noise, head size
changes, and head occlusions. A simple statistical analysis
shows that the co-occurrence of zero-mean noise and
overestimation could reach 96% (6,776 out of 7,044 testing
images). We further find that almost all estimated density
maps inaccurately predict the head positions or sizes when
occlusion occurs, which could result in underestimation in
high-density areas. Moreover, the generated ground truth
density could also be imprecise due to the annotation error
and the fixed variance in the Gaussian kernel. It is noted that the
corresponding improvements of our method are illustrated
in Figure 5.
To fully utilize the spatial awareness, previous work [16]
proposes a loss named Maximum Excess over SubArrays
(MESA) to handle the above problems. Generally speak-
ing, MESA loss attempts to find the rectangular subregion
whose predicted density map has the maximum difference
from the ground truth. It directly optimizes the counts of
this subregion instead of the pixel-level density. Since the
set of subregions could include the full image, MESA loss
is an upper bound for the count estimation of the entire im-
age. Besides, this loss is only sensitive to the spatial lay-
out of pedestrians and is robust to various noises. How-
ever, the complexity of MESA loss function is extremely
high. [16] utilizes Cutting-Plane optimization to obtain an
approximate solution. Since this method cannot be solved
by the conventional gradient descent, MESA loss has not
been employed in any existing CNN-based approach.
Motivated by the MESA loss, in this paper we present
a novel deep architecture called SPatial Awareness Net-
work (SPANet) to retain the high-frequency spatial varia-
tions of density. Instead of finding the mismatched rect-
angular subregion as in MESA, the Maximum Excess over
Pixels (MEP) loss is proposed to optimize the pixel-level
subregion which has high discrepancy to the ground truth
density map. To obtain such pixel-level subregion, the
weakly-supervised ranking information [23] is exploited to
generate a mask indicating the pixels with high discrepan-
cies. We further devise a multi-branch architecture to lever-
age the full image for discrepancy detection by imitating the
salient region detection [33, 50, 54], where patches with
increasing areas are used for ranking. The proposed frame-
work could be easily integrated into existing CNN-based
methods and is end-to-end trainable.
The main contribution of this work is the proposed Spa-
tial Awareness Network and Maximum Excess over Pixels
loss for addressing the issue of crowd counting. The solu-
tion also provides elegant views on what kind of spatial
context should be exploited and how to effectively utilize
such spatial awareness in crowd images, which are prob-
lems not yet fully understood in the literature.
2. Related Work
2.1. Detection-based Methods
The methods in this category use object detectors to lo-
cate people in images. Given the individual localization of
each person, crowd counting becomes trivial. There are two
directions in this line, i.e., detection on 1) whole pedestri-
ans [2, 7, 53] and 2) parts of pedestrians [8, 12, 18, 43].
Typically, local features [7, 18] are first extracted and then
are exploited to train various detectors (e.g., SVM [18] and
AdaBoost [41]). Though spatial information is well learned
in these methods, they are not applicable in challenging sit-
uations, such as highly dense and congested crowds.
2.2. Regression-based Methods
Different from detection-based methods, regression-
based approaches avoid the hard detection problem and es-
timate crowd counts from image features. Earlier methods
[4, 5, 11, 28] usually predict the counts directly from the
features, which will lead to poor performance as the spa-
tial awareness is completely ignored. Later methods try to
estimate the density map for counting [16, 26, 29], where
the crowd count is obtained by integrating all pixel values
over the density map. Though learning the density map
somewhat provides the spatial information, their models
still have difficulties in preserving the high-frequency vari-
ation in the density map.
2.3. CNN-based Methods
Deep CNN based crowd counting methods have shown
very strong performance improvements over the shallow
learning counterparts. Existing methods mainly focus on
coping with the large variation in pedestrian scales, where
many multi-column networks are extensively studied. A
dual-column network is proposed by [1] to combine shallow
and deep layers for estimating the count. Inspired by this
work, a famous three-column network MCNN is proposed
by [52], which employs different filters on separate columns
to obtain features with various scales. Many works have im-
proved MCNN [13, 38, 39, 42] to further enhance the scale
adaptation. Sam et al. [32] introduce a switching structure,
which uses a classifier to assign input image patches to ap-
propriate columns. Recently, Liu et al. [19] propose a multi-
column network to simultaneously estimate crowd density
by detection and regression based models. Ranjan et al. [27]
utilize a two-column network to iteratively train their model
with images of different resolutions.
There are a lot of other attempts to further improve the
scale invariance, including 1) study on the fusion of vari-
ous scale information [22, 40, 45, 46], 2) study on multi-
blob based scale aggregation networks [3, 47], 3) design of
scale-invariant convolutional or pooling layers [9, 17, 20,
39, 45], and 4) study on the automated scale adaptive net-
works [30, 31, 49]. Typically, Li et al. [17] propose CSRNet
that exploits dilated convolutional layers to enlarge recep-
tive fields for boosting performance. Cao et al. [3] propose
SANet to aggregate multi-scale features for more accurate
crowd count. These two approaches have achieved state-of-
the-art performance. Additionally, there also exist studies
devoted to utilization of perspective maps [35], geometric
constraints [21, 51], and region-of-interest (ROI) [20] to im-
prove the counting accuracy.
The aforementioned methods utilize the Euclidean dis-
tance, i.e., the L2 loss, to optimize the model. Although these
methods can obtain scale-invariant features, their perfor-
mances are still unsatisfactory since the spatial awareness
is largely ignored. Note that, SANet [3] also tries to solve
the problem of L2 loss and adds local pattern consistency
(Lc loss) in the training phase. However, we find that Lc
still cannot learn the spatial context well. In our experi-
ment, when integrating our MEP loss (Lmep) into SANet,
we achieve significant performance improvement. Our pro-
posed MEP loss could fully utilize the spatial awareness,
which is a key factor for the task of crowd counting.
3. Our Method
In this section, we first review the problem of crowd
counting and two loss functions (i.e., MESA loss and L2
loss). Then we present the proposed SPANet and MEP loss
in details. It is worth noting that our method can be directly
applied to all CNN-based crowd counting networks.
3.1. Problem Formulation
Recent approaches formulate the crowd counting task as
a density regression problem [3, 16, 52]. Given N images
I = {I_1, I_2, ..., I_N} as the training set, each image I_i is
annotated with a total of c_i center points of pedestrians’
heads, P^{gt}_i = {P_1, P_2, ..., P_{c_i}}. Typically, the ground-truth
density map D^{gt}_i at each pixel p in image I_i is defined as

    \forall p \in I_i, \quad D^{gt}_i(p) = \sum_{P \in P^{gt}_i} \mathcal{N}(p; \mu = P, \sigma^2),    (1)

where \mathcal{N} is a Gaussian distribution. The number of people
c_i in image I_i equals the sum of density values over all pixels,
\sum_{p \in I_i} D^{gt}_i(p) = c_i. With these training data, the aim of
the crowd counting task is to learn a predicted density map
D^{pr} that approaches the ground-truth density map D^{gt}.
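The ground-truth density construction of Eq. (1) can be sketched in a few lines of numpy. The per-head normalization and the fixed sigma = 4 below are illustrative assumptions (the paper itself notes that a fixed Gaussian variance can introduce annotation noise), not the authors' exact implementation:

```python
import numpy as np

def gaussian_density_map(shape, head_points, sigma=4.0):
    """Build a ground-truth density map: one normalized 2-D Gaussian per
    annotated head center, so the map integrates to the head count."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros((H, W), dtype=np.float64)
    for (py, px) in head_points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()  # normalize so each head contributes exactly one count
    return D

heads = [(20, 30), (40, 50), (25, 25)]
D_gt = gaussian_density_map((64, 64), heads)
print(round(D_gt.sum(), 6))  # ≈ 3.0: one unit of density per annotated head
```

Summing the map recovers the crowd count, which is exactly the property the later losses operate on.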
MESA loss. To make use of the spatial awareness in
annotations (i.e., center head positions Pgt), the previous
work [16] has proposed the Maximum Excess over SubAr-
rays (MESA) loss Lmesa as follows,
    L_{mesa}(D^{pr}, D^{gt}) = \frac{1}{N} \sum_{i=1}^{N} \max_{B \in \mathcal{B}} \left| \sum_{p \in B} D^{pr}_i(p) - \sum_{p \in B} D^{gt}_i(p) \right|,    (2)

where \mathcal{B} is the set of all potential rectangular subregions
in the image.

Figure 2: Computation process of MESA loss. It is required to traverse
all possible subregions and calculate the differences between their
predicted density maps and the ground truth. Then the subregion with the
maximum difference is selected for optimization.

As illustrated in Figure 2, MESA loss tries to
find the box subregion whose predicted density map has the
maximum difference from the ground truth. It can be treated
as an upper bound for the count estimation of the entire im-
age, as B could include the full image. Besides, this loss
is directly related to the counting objective instead of the
pixel-level density, and is only sensitive to the spatial lay-
out of pedestrians. In the 1D case, Kolmogorov-Smirnov
distance [24] can be seen as a special case of Lmesa.
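The search over rectangles in Eq. (2) can be made concrete with a brute-force sketch (our illustration, not the Cutting-Plane solver of [16]). A 2-D prefix sum makes each rectangle's count difference O(1), but O(H²W²) rectangles still have to be enumerated:

```python
import numpy as np

def mesa_loss_bruteforce(D_pr, D_gt):
    """Exhaustive MESA (Eq. 2) for a single image: the maximum absolute
    count difference over all axis-aligned rectangles."""
    diff = D_pr - D_gt
    H, W = diff.shape
    # Integral image (2-D prefix sums) with a zero border.
    P = np.zeros((H + 1, W + 1))
    P[1:, 1:] = diff.cumsum(axis=0).cumsum(axis=1)
    best = 0.0
    for y0 in range(H):
        for y1 in range(y0 + 1, H + 1):
            for x0 in range(W):
                for x1 in range(x0 + 1, W + 1):
                    s = P[y1, x1] - P[y0, x1] - P[y1, x0] + P[y0, x0]
                    best = max(best, abs(s))
    return best

D_gt = np.zeros((6, 6)); D_gt[2, 2] = 1.0  # ground truth: one person
D_pr = np.zeros((6, 6))                    # prediction: empty
print(mesa_loss_bruteforce(D_pr, D_gt))    # 1.0: some rectangle misses a full count
```

Even on a tiny 6x6 map there are hundreds of rectangles; at density-map resolution the enumeration is intractable, which motivates both the Cutting-Plane approximation in [16] and the pixel-level reformulation below.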
Despite the above merits, it is difficult to optimize the
MESA loss due to the hard process of finding such subre-
gion. One has to traverse all potential subregions to achieve
this, which is obviously an impossible task in practical ap-
plication. To solve it, previous approach [16] converts the
optimization of MESA loss to a convex quadratic program
problem with limited constraints and utilizes Cutting-Plane
optimization to obtain an approximate solution. However,
since this method cannot be solved by the traditional gradi-
ent descent, MESA loss has not been exploited in any exist-
ing CNN-based crowd counting methods.
L2 loss. To facilitate the computation in deep frame-
works, existing CNN-based methods [17, 27, 52] all di-
rectly use L2 loss to minimize the difference between the
estimated and ground truth density maps,
    L_2(D^{pr}, D^{gt}) = \frac{1}{2N} \sum_{i=1}^{N} \sum_{p \in D^{pr}_i} \left\| D^{pr}_i(p) - D^{gt}_i(p) \right\|_2^2.    (3)
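For contrast, the pixel-wise objective of Eq. (3) is a straightforward batch average; a minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def l2_loss(batch_pr, batch_gt):
    """Pixel-wise L2 loss (Eq. 3) over a batch of density maps."""
    N = len(batch_pr)
    return sum(((dp - dg) ** 2).sum() for dp, dg in zip(batch_pr, batch_gt)) / (2 * N)

pr = [np.array([[0.0, 1.0]])]
gt = [np.array([[1.0, 1.0]])]
print(l2_loss(pr, gt))  # 0.5
```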
However, as discussed in Sec. 1, we reveal that L2 loss
can hardly retain the high-frequency variation in the density
map, leading to the poor spatial awareness. Furthermore, it
is also highly sensitive to typical noises in crowd counting,
including the zero-mean noise, head size changes, and head
occlusions. For example, existing methods always overes-
timate the density value in low-density areas and underesti-
mate it in high-density regions.
3.2. Spatial Awareness Network
The proposed Spatial Awareness Network (SPANet)
aims to leverage the spatial context for accurately predict-
ing the density values. Instead of searching the mismatched
Figure 3: The framework of our proposed SPatial Awareness Network (SPANet). The input images are first fed into the backbone network to extract
feature representations and output the estimated density maps D^{pr}. A K-branch architecture is devised. In each branch k, the network is optimized with
the ranking objective by sampling two patches (one a sub-patch of the other) and outputs a new density map D^{pr}_k. The two density maps are then utilized to
produce the subregion S_k, which has high discrepancy to the ground truth. The density values within the generated S_k are erased in the next branch to facilitate
the subsequent optimization. In the end, K subregions from the K branches are fused to form the final pixel-level subregion S, which is exploited to calculate the
Maximum Excess over Pixels (MEP) loss.
rectangular subregion as in MESA loss, which is the main
obstacle for optimization, we try to find the pixel-level sub-
region S which has high discrepancy to the ground truth
density map. Since no annotation of such a region exists,
this problem is unsupervised and remains significantly
difficult to solve. Inspired by the recent weakly-
supervised method [23], we exploit an obvious ranking re-
lation to achieve this, i.e., one patch of a crowded scene im-
age is guaranteed to contain the same number or fewer per-
sons than the original image. By sampling a pair of patches
(where one is the sub-patch of the other), the network is op-
timized with the ranking objective and outputs a new den-
sity map, which is in turn utilized to produce the subregion
with high discrepancy, together with the previous one. We
further devise a multi-branch architecture to leverage the
full image by sampling multiple pairs of patches. Note that
the whole SPANet could be end-to-end trained.
Figure 3 illustrates the framework of our proposed
SPANet. Input images I are first fed into the backbone
network to generate the predicted density maps Dpr. The
desired pixel-level subregion generation, i.e., Sk, is con-
ducted by branch k using a pair of patches sampled from
density maps Dpr. To leverage the full image for dis-
crepancy detection, a multi-branch architecture with K branches is devised to produce multiple subregions by imi-
tating salient region detection [50, 54]. Finally, K sub-
regions (S1, S2, ..., SK) are combined to produce the final
S, which is then exploited to compute our proposed Maxi-
mum Excess over Pixels (MEP) loss. We will present these
three sub-modules in details below.
Pixel-level Subregion Generation. The subregion S in-
dicates the area with high density discrepancy to the ground
truth. Unfortunately, directly subtracting the predicted Dpr
from the ground truth Dgt would make the problem go
round in circles – the bias is usually large enough to prevent
it from providing an accurate region. Consequently, we turn to
find the region with high changes along with the network
training. It is natural that one can pick two density maps of
the same image from different iterations. However, the ob-
tained area only reflects the region that is already “revised”,
which still seriously suffers from the poor spatial perception
of the original L2 loss. To this end, we exploit the weakly
supervised ranking clues to produce the subregion. Instead
of considering the pixel-level density, the ranking clue is
directly related to the comparison of crowd counts.
In each branch k, two parallel image patches are first
sampled. As the feature maps of deep convolutional layers
already contain rich location information, we treat the sam-
pling process as the mask pooling operation on the density
map. The strategy of selecting patches will be described
later. Without loss of generality, suppose the two masks
M^1_k and M^2_k are 2-dimensional binary matrices (1 indicates
the patch area), and M^1_k is the sub-patch of M^2_k. The
crowd counts C(M^1_k) and C(M^2_k) under the masks M^1_k and
M^2_k are obtained by integrating the values of the density
map over each mask, which can be implemented as mask
pooling as follows,

    C(M^1_k) = \sum_{p \in D^{pr}_k} (D^{pr}_k \odot M^1_k)(p), \quad C(M^2_k) = \sum_{p \in D^{pr}_k} (D^{pr}_k \odot M^2_k)(p),    (4)
where ⊙ is the element-wise product, and p indicates the
pixel on the density map D^{pr}_k. It is worth noting that we uti-
lize the same predicted density map D^{pr}_k when calculating
the counts for two masks, rather than generating individual
maps at two consecutive iterations. The reason is that the
density map D^{pr}_k is not restricted to be positive, so pool-
ing on the pair of patches could also provide the ranking
information. We have conducted an experiment showing
that the two schemes have similar results. Besides, directly
pooling on the same map is more efficient than the other.
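The mask pooling of Eq. (4) reduces to an element-wise product followed by a sum; a minimal sketch with hypothetical toy values:

```python
import numpy as np

def mask_count(D, M):
    """Crowd count under a binary mask M: the mask pooling of Eq. (4)."""
    return float((D * M).sum())

D  = np.array([[0.5, 1.0], [0.0, 1.5]])  # predicted density map
M2 = np.ones_like(D)                      # larger patch (here, the full map)
M1 = np.array([[1.0, 0.0], [0.0, 0.0]])  # sub-patch of M2
print(mask_count(D, M1), mask_count(D, M2))  # 0.5 3.0
```

Note that the sub-patch count can never exceed the enclosing-patch count when the density is non-negative; the ranking loss below enforces this ordering even when the predicted density is not.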
With the assumption that M^1_k is the sub-patch of M^2_k,
the explicit constraint is that the number of people in M^1_k
is fewer than that in M^2_k. Therefore, we employ a pairwise
ranking hinge loss Lr to model such relationship, which is
formulated as
    L_r(C(M^1_k), C(M^2_k)) = \max(0, C(M^1_k) - C(M^2_k) + \xi),    (5)
where ξ is a margin value that is set to the upper bound of
the difference in the ground truth. The gradient of Lr loss
is calculated as
    \nabla_\theta L_r = \begin{cases} 0, & \text{if } C(M^1_k) - C(M^2_k) + \xi \le 0, \\ \nabla_\theta C(M^1_k) - \nabla_\theta C(M^2_k), & \text{otherwise}. \end{cases}    (6)
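Equations (5) and (6) amount to a standard pairwise hinge with a gated subgradient. A numpy sketch, under the assumption that the count gradients g1 and g2 with respect to the parameters are already available from backpropagation:

```python
import numpy as np

def ranking_hinge(c1, c2, xi=0.0):
    """Pairwise ranking hinge loss (Eq. 5): penalized when the sub-patch
    count c1 exceeds the enclosing-patch count c2 by more than margin xi."""
    return max(0.0, c1 - c2 + xi)

def ranking_hinge_grad(c1, c2, g1, g2, xi=0.0):
    """Gated subgradient (Eq. 6): zero when the ranking constraint holds,
    otherwise the difference of the two count gradients."""
    if c1 - c2 + xi <= 0:
        return np.zeros_like(g1)
    return g1 - g2

print(ranking_hinge(2.0, 5.0))  # 0.0: constraint satisfied, no gradient
print(ranking_hinge(5.0, 2.0))  # 3.0: constraint violated
```

In an autograd framework the gate of Eq. (6) falls out of the max in Eq. (5) automatically; the explicit form above only makes the gating visible.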
Once the network parameters θ are updated with L_r by
back-propagation, the renewed density map \tilde{D}^{pr}_k estimated
by the network is computed by

    \tilde{D}^{pr}_k = \mathrm{Conv}(I, \theta),    (7)

where I is the input image, and Conv(·) refers to a forward
pass of the network. Given the updated density map \tilde{D}^{pr}_k
and the old one D^{pr}_k, the desired subregion S_k is obtained
by thresholding the difference \nabla D^{pr}_k = |\tilde{D}^{pr}_k - D^{pr}_k|
between them. To make it differentiable, we utilize a
sigmoid thresholding function, and S_k is given by

    S_k = \frac{1}{1 + \exp(-\delta(\nabla D^{pr}_k - \Sigma))},    (8)

where Σ is a threshold matrix with all elements being σ, and
δ is a parameter ensuring that S_k(p) is approximately 1
when \nabla D^{pr}_k(p) > σ, and approximately 0 otherwise.
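The sigmoid thresholding of Eq. (8) can be sketched directly; the values of sigma and delta below are hypothetical, chosen only to make the gate steep:

```python
import numpy as np

def soft_threshold_mask(delta_D, sigma=0.1, delta=50.0):
    """Differentiable threshold (Eq. 8): approximately 1 where the density
    change exceeds sigma, approximately 0 elsewhere."""
    return 1.0 / (1.0 + np.exp(-delta * (delta_D - sigma)))

dD = np.array([0.0, 0.05, 0.2, 0.5])  # per-pixel density changes
print(np.round(soft_threshold_mask(dD), 3))
```

Unlike a hard threshold, this gate lets gradients flow through pixels near the boundary, which is what makes S_k usable inside end-to-end training.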
Multi-branch Architecture. Note that in the above sec-
tion, only a pair of patches is sampled for generating the
subregion. In principle, we hope that the full density map
could be leveraged to provide more information. Instead of
only sampling a small-large pair of patches, which may in-
volve large bias error due to the large difference between
two patches, we adopt a multi-branch architecture as shown
in Figure 3. The bottom right corners of all patches are lo-
cated at the same position, i.e., the bottom right corner of
the density map. The area of patch is gradually enlarged
along with the branches, until it reaches the size of full den-
sity map. Such design guarantees both the small bias error
in each branch and the full utilization of training images.
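The nested patch layout can be sketched as K binary masks that share the bottom-right corner and grow until they cover the full map; the linear growth schedule below is an assumption for illustration, and the paper's exact schedule may differ:

```python
import numpy as np

def nested_masks(H, W, K):
    """K binary masks anchored at the bottom-right corner, with areas
    growing until the last mask covers the full H x W density map."""
    masks = []
    for k in range(1, K + 1):
        h, w = int(round(H * k / K)), int(round(W * k / K))
        M = np.zeros((H, W))
        M[H - h:, W - w:] = 1.0  # patch occupies the bottom-right h x w block
        masks.append(M)
    return masks

Ms = nested_masks(8, 8, 4)
print([int(M.sum()) for M in Ms])  # [4, 16, 36, 64]: each mask contains the previous one
```

Consecutive masks then serve directly as the (M^1_k, M^2_k) pairs for the ranking loss, since each is a sub-patch of the next.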
To eliminate the influence of the detected subregion Sk
for better optimization in latter branches, we imitate the
salient region detection [50] to erase the density values
within S_k in the next branch, which is formulated as

    D^{pr}_{k+1} \leftarrow D^{pr}_{k+1} \odot (1 - S_k),    (9)

where 1 is the matrix with all elements being 1, and ⊙ is
the element-wise product.
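The erasing step of Eq. (9) is a single element-wise product; a toy sketch with hypothetical values:

```python
import numpy as np

# Eq. (9): zero out the already-detected subregion S_k so the next
# branch focuses its discrepancy search on the remaining area.
D_next = np.array([[1.0, 2.0], [3.0, 4.0]])  # next branch's density map
S_k    = np.array([[1.0, 0.0], [0.0, 0.0]])  # detected high-discrepancy mask
D_next = D_next * (1.0 - S_k)
print(D_next[0, 0], D_next[1, 1])  # 0.0 4.0: only the masked pixel is erased
```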
Maximum Excess over Pixels (MEP) loss. In the end,
K subregions (S_1, S_2, ..., S_K) are generated by the K
branches. The final desired pixel-level subregion S is com-
puted by simply combining them together as

    S = \sum_{k=1}^{K} \{S_k\},    (10)

where \sum indicates merging pixels with values close to 1
in all subregion masks {S_k}, rather than the direct summa-
tion; in practice, we take the maximum value at each pixel
position over all masks. The final output S is the mask that
indicates the pixels which should be optimized. Based on
that, our proposed MEP loss is then given by

    L_{mep}(D^{pr}, D^{gt}) = \frac{1}{N} \sum_{i=1}^{N} \left| \sum_{p \in S} D^{pr}_i(p) - \sum_{p \in S} D^{gt}_i(p) \right|.    (11)
3.3. Model Learning
Our SPANet could be easily integrated into existing
crowd counting methods, which is equivalent to adding a
pooling layer with different masks on the final convolu-
tional layer. It is trained by sequentially optimizing the K ranking losses, the MEP loss, and the original loss of exist-
ing methods. When calculating the original loss, the mask
pooling layer is removed. The overall training objective is
formulated as
    L_{global} = \sum_{k=1}^{K} L_r + L_{mep} + L_{vanilla},    (12)
where Lvanilla refers to the original loss of existing ap-
proach. In most cases, Lvanilla is the L2 loss. More details
of the ground truth generation and data augmentation are
described in supplementary material.
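The overall objective of Eq. (12) simply sums the per-branch ranking losses with the MEP loss and the baseline's own loss; a trivial sketch (the numeric values are placeholders):

```python
def global_loss(ranking_losses, l_mep, l_vanilla):
    """Overall training objective (Eq. 12): K ranking losses plus the MEP
    loss plus the baseline's original loss (typically L2)."""
    return sum(ranking_losses) + l_mep + l_vanilla

print(global_loss([0.1, 0.0, 0.3], 0.5, 1.2))  # 2.1
```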
4. Experiment
4.1. Experiment Settings
Networks. We evaluate our method by combining it
with three networks, i.e., MCNN [52], CSRNet [17], and
SANet [3]. The implementations of MCNN1 and CSR-
Net2 are from others, while SANet is implemented by
us. In general, there are four main differences between
them: (1) Different network sizes. Specifically, MCNN,
SANet, and CSRNet correspond to small, medium,
and large crowd counting networks. (2) Different architec-
tures. MCNN and SANet are multi-column/multi-blob net-
works, while CSRNet is a single-column network. In ad-
dition, SANet uses the Instance Normalization (IN) layer
and the deconvolutional layer, while CSRNet utilizes the
dilated convolutional layer. (3) Different size of density
Figure 6: Learning Curves. Mean absolute error (MAE) on training and validation sets, vs. the number of training epochs of MCNN [52], CSRNet [17]
and SANet [3] on ShanghaiTech Part A dataset [52].
Table 4: Density map quality comparison. Values on the left of ‘|’ are from original baselines, while values on the right of ‘|’ are results when integrating