Automatic Vehicle Classification using Center Strengthened Convolutional Neural Network
Kuan-Chung Wang, Yoga Dwi Pranata, and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taiwan
Abstract—Vehicle classification is a major component of smart road management and traffic management systems, and the choice of algorithm has a significant impact on classification performance. In this paper, we propose a deep neural network, named center strengthened convolutional neural network (CS-CNN), that enhances features from the central part of the image and accepts input of non-fixed size. The hallmark of the proposed architecture is center enhancement, which extracts additional features from the central region of the image by ROI pooling. In addition, CS-CNN is based on the VGG network architecture and is combined with an ROI pooling layer to obtain more detailed feature maps. We compare the proposed method with other typical deep learning architectures, namely VGG-s and VGG-Verydeep-16. In the experiments, CS-CNN achieves more than 97% accuracy on vehicle classification using only a small amount of training data from the Caltech-256 dataset.
Keywords- Deep learning, Convolutional Neural
Network, ROI pooling, Vehicle classification
I. INTRODUCTION
Motorists today often pay little attention to traffic signs and take shortcuts to reach their destinations quickly. Although every road user knows the traffic rules, many ignore them, which can unintentionally harm both the driver and others. As traffic demand has grown rapidly in recent years and transportation systems have become increasingly intelligent, applying Intelligent Transportation System (ITS) technology has become one of the fundamental measures for using the existing transportation infrastructure reasonably and scientifically. Vehicle detection and classification technology is an important component of ITS, providing it with initial and necessary traffic information.
To date, a number of approaches to vehicle classification have been proposed [1]–[5]. Most of them can be categorized as either sensor-based or visual-based. The sensor-based approach requires dedicated sensors to be installed in the road network. This seems easy to implement, but factors such as high cost, limited flexibility, and sensitivity to weather conditions must be considered. Generally, using sensors installed in the road network (e.g., magnetic sensors [1], piezoelectric sensors [2], traffic inductive sensors [3], infrared transceivers, or other sensor devices), these methods obtain relevant physical parameters of the vehicle, such as the width, height, and number of tiers, and then use that information to classify the vehicle type directly.
The visual-based approach classifies a vehicle from its visual appearance. Its advantages are low cost and high classification accuracy. The visual information about the vehicle is represented in a form that the computer can process, and the vehicle type is then obtained by running a classification algorithm. Gupte et al. [4] proposed vision-based vehicle classification using segmentation and blob tracking. Lai et al. [5] classified vehicles by fitting a rectangle in the image and estimating the dimensions of the vehicle. Ng and Tay [9] proposed a two-step vehicle classification consisting of inter-class vehicle classification and intra-class vehicle classification.
Some algorithms used in the visual-based approach have limitations. The Canny algorithm cannot recognize vehicles at night or in the dark and has a long response time, while the Roberts and Prewitt algorithms perform very poorly at recognizing moving objects.
In recent years, deep learning methods have had many successes in classification tasks such as speech and image recognition. In particular, methods based on convolutional neural networks achieve state-of-the-art results on many image classification tasks. Since the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), many representative architectures have emerged [6]–[8]. Because it learns from data, deep learning also makes a problem easier to solve than designing an algorithm by hand. For these reasons, our proposal builds on deep learning, and we develop CS-CNN.
In this paper, we propose an end-to-end convolutional neural network, based on VGG net, to classify vehicles. We use the VGG network architecture as the pre-trained model, combine it with ROI pooling from the SPP network [12], and develop the center strengthened net. Before the fully connected layers, we split the feature maps into two streams: the first extracts features from the whole image, and the second extracts features from the center of the image (centroid). Given that resizing images to a fixed size leads to object deformation, our input consists of two-dimensional RGB images of non-fixed size. We choose vehicle classes from the Caltech-256 dataset as training and testing data. The results are compared with those of other visual-based classification methods.
Proceedings of APSIPA Annual Summit and Conference 2017 12 - 15 December 2017, Malaysia
978-1-5386-1542-3@2017 APSIPA APSIPA ASC 2017
Figure 1. CS-CNN object classification architecture. Our system (1) takes a non-fixed-size input image, (2) computes the VGG feature representation, (3) computes full and center ROI features, and (4) classifies using the two combined ROI features.
The rest of this paper is organized as follows. Section II describes the methods. Section III presents the experiments and discussion. Section IV concludes our work on classifying the six vehicle classes from the Caltech-256 dataset.
II. METHODS
Recently, deep learning has become a popular method for many image processing tasks, and some works have reported strong results when applying deep learning to vehicle classification [9, 10].
Automatic vehicle classification is important for identifying vehicle types quickly and accurately in intelligent transportation systems. Its purpose is to help the system determine the type of vehicle in an input image. The proposed method uses a convolutional neural network to classify the vehicle type. We use ROI pooling from the spatial pyramid pooling network to obtain the regions of interest from the input images. Each step of the process is explained in detail in the following subsections.
A. The Proposed Architecture
Our CS-CNN is illustrated in Figure 1. In this work, we propose a robust CNN architecture to obtain strong classification results. Deep learning methods from the ILSVRC competition generally take a fixed-size input obtained by cropping or resizing. Because this can badly deform the objects in the image, our method instead resizes each input image while preserving its aspect ratio. Extreme sizes, however, degrade the network: a small image yields feature maps too coarse for classification, while a large image can exhaust memory. We therefore resize the short side to 400 pixels and limit the other side to at most 800 pixels. In light of the excellent performance of VGGNet [8] in the ILSVRC challenge, we choose it as the base model of our architecture.
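The resizing rule can be illustrated with a short sketch. The function name and the handling of the conflict case are assumptions, not taken from the paper: when scaling the short side to 400 would push the long side beyond 800, this sketch scales by the long side instead, so the short side ends up below 400.

```python
def target_size(width, height, short_side=400, max_long_side=800):
    """Return (new_width, new_height) with the aspect ratio preserved.

    Hypothetical helper: the paper states the 400/800 limits but not how a
    conflict between them is resolved.
    """
    scale = short_side / min(width, height)
    # Assumption: if the long side would exceed the cap, the cap wins.
    if max(width, height) * scale > max_long_side:
        scale = max_long_side / max(width, height)
    return round(width * scale), round(height * scale)


print(target_size(640, 480))   # short side becomes 400
print(target_size(1200, 300))  # long side is capped at 800
```

Only the target size is computed here; any image library can then perform the actual interpolation.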
B. Center Strengthened RoI pooling
The center strengthened ROI is illustrated in Figure 2. After computing the feature maps through multiple convolution and pooling layers, we use ROI pooling from the SPP network [12] to obtain fixed-size feature maps. In addition to one ROI over the entire feature maps, a second stream uses an ROI that focuses on the central region of the same feature maps. We follow the ROI idea of SPPnet [12] and add it to our architecture. Unlike a normal pooling operator, ROI pooling performs dynamic max pooling over a × b output bins and produces a fixed-size output. In our work, we choose 7×7 as the ROI size. Besides producing a fixed-size map that connects easily to the FC layers, ROI pooling is fast and simple to compute. These characteristics make it popular for difficult tasks such as object bounding box detection. Unlike SPPnet, we use a single ROI size of 7×7 instead of combining multiple small sizes such as 4×4, 2×2, and 1×1, which may lose too much information.
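A minimal NumPy sketch of dynamic max pooling over a fixed grid of output bins is given below. The bin-edge rule (floor for the start, ceiling for the end) and the function names are assumptions chosen for illustration; the paper does not specify how bin boundaries are rounded.

```python
import numpy as np


def bin_edges(size, bins):
    """Split range(size) into `bins` windows that together cover every index."""
    edges = []
    for i in range(bins):
        start = (i * size) // bins            # floor(i * size / bins)
        end = -(-((i + 1) * size) // bins)    # ceil((i + 1) * size / bins)
        edges.append((start, end))
    return edges


def roi_max_pool(features, out_h=7, out_w=7):
    """Max-pool a (C, H, W) feature map into a fixed (C, out_h, out_w) output."""
    channels, height, width = features.shape
    out = np.empty((channels, out_h, out_w), dtype=features.dtype)
    for i, (r0, r1) in enumerate(bin_edges(height, out_h)):
        for j, (c0, c1) in enumerate(bin_edges(width, out_w)):
            out[:, i, j] = features[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out


# Feature maps with different spatial sizes yield the same output shape.
rng = np.random.default_rng(0)
small = roi_max_pool(rng.standard_normal((512, 25, 25)))
large = roi_max_pool(rng.standard_normal((512, 25, 50)))
print(small.shape, large.shape)
```

Because the window size adapts to the input, the layer that follows always receives a 7×7 map per channel regardless of the image size.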
Figure 2. Center Enhanced RoI pooling
As shown in Figure 2, our proposal introduces the center enhanced (CE) ROI. After the last convolution of VGG-16, we crop the central region of the feature maps, because we observe that most objects carry their meaningful information in the central part of the image. The center enhanced crop ranges in width from $\frac{1}{8}W$ to $\frac{7}{8}W$ and in height from $\frac{1}{8}H$ to $\frac{7}{8}H$, where W and H are the width and height of the input image.
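The crop window can be sketched as follows. Two points are assumptions: the 1/8 and 7/8 fractions are applied to the feature-map dimensions (which correspond to the same fractions of the input image), and fractional boundaries are rounded down.

```python
import numpy as np


def center_crop(features):
    """Crop a (C, H, W) feature map to the 1/8..7/8 window on each axis."""
    _, height, width = features.shape
    y0, y1 = height // 8, (7 * height) // 8
    x0, x1 = width // 8, (7 * width) // 8
    # Guard (assumption): keep at least one cell on very small maps.
    y1, x1 = max(y1, y0 + 1), max(x1, x0 + 1)
    return features[:, y0:y1, x0:x1]


features = np.zeros((512, 16, 32))
print(center_crop(features).shape)
```

The cropped region keeps three quarters of each spatial dimension and is then passed to ROI pooling like the full feature map.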
The two resulting features are concatenated and flattened into a one-dimensional vector, which becomes the input of the FC layers. In the last stage, we use three FC layers followed by softmax to perform multi-class classification (Fire truck, Motorbikes, School bus, Segway, Bike, and Car).
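The fusion step can be sketched with random data. Several details here are assumptions for illustration only: the two ROI outputs are concatenated along the channel axis, each has 512 channels as in the last VGG-16 convolution, and a single random linear layer stands in for the three FC layers.

```python
import numpy as np

CLASSES = ["Fire truck", "Motorbikes", "School bus", "Segway", "Bike", "Car"]


def softmax(logits):
    """Numerically stable softmax over a 1-D array of class scores."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def fuse(full_roi, center_roi):
    """Concatenate two (C, 7, 7) ROI features and flatten them to one vector."""
    return np.concatenate([full_roi, center_roi], axis=0).reshape(-1)


rng = np.random.default_rng(0)
full_roi = rng.standard_normal((512, 7, 7))     # ROI over the whole feature map
center_roi = rng.standard_normal((512, 7, 7))   # ROI over the central crop
vector = fuse(full_roi, center_roi)             # length 2 * 512 * 7 * 7

# Placeholder classifier (assumption): one random linear layer, not trained.
weights = rng.standard_normal((len(CLASSES), vector.size)) * 0.01
probs = softmax(weights @ vector)
print(vector.shape, probs.shape)
```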
C. Testing Step
The testing process applies the weights and biases obtained from training. It has two steps: the first runs the trained model on the test images, and the second calculates the classification accuracy. The process differs little from training; the difference is that no backpropagation follows the feedforward pass. The outputs of this process are therefore the classification accuracy, the images that failed to be classified, the number of such images, and the network outputs produced by the feedforward pass.
The feedforward pass with the trained weights and biases generates the output layer. The output of the final fully connected layer is compared with the label to determine which images were classified successfully and which failed.
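The bookkeeping of the testing step can be sketched as below. The function name and the toy scores are hypothetical; the scores stand in for the outputs of a forward pass through the trained network, and accuracy is computed as the share of correctly classified test images.

```python
import numpy as np


def evaluate(scores, labels):
    """Return (accuracy in percent, indices of misclassified images).

    scores: (N, K) class scores from a forward pass (no backpropagation).
    labels: (N,) true class indices.
    """
    labels = np.asarray(labels)
    predictions = np.asarray(scores).argmax(axis=1)
    failed = np.flatnonzero(predictions != labels)
    accuracy = 100.0 * (len(labels) - len(failed)) / len(labels)
    return accuracy, failed.tolist()


# Toy example with four images and two classes.
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = [0, 1, 1, 1]
accuracy, failed = evaluate(scores, labels)
print(accuracy, failed)
```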
III. EXPERIMENT AND DISCUSSION
A. Experiment on Caltech-256
Our CNN training procedure follows that of [8] for learning on ILSVRC-2012, using gradient descent with momentum. Our experimental settings are: momentum 0.9; weight decay $1 \times 10^{-4}$; initial learning rate $5 \times 10^{-4}$, which is decreased to $5 \times 10^{-5}$ after 20 epochs. The training batch size is 10. Some modified layers are initialized from a normal distribution with zero mean and a standard deviation of $1 \times 10^{-2}$.
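These settings can be collected in a small sketch. The hyperparameter values come from the text; the exact form of the momentum update, the zero-based epoch counting, and the function names are assumptions.

```python
import numpy as np

MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4
BATCH_SIZE = 10


def learning_rate(epoch):
    """Step schedule; epochs are counted from 0 (assumption)."""
    return 5e-4 if epoch < 20 else 5e-5


def sgd_step(weights, grad, velocity, epoch):
    """One momentum update with L2 weight decay (a common formulation)."""
    lr = learning_rate(epoch)
    velocity = MOMENTUM * velocity - lr * (grad + WEIGHT_DECAY * weights)
    return weights + velocity, velocity


def init_modified_layer(shape, rng):
    """Draw initial weights from N(0, 0.01^2) for a newly added layer."""
    return rng.normal(loc=0.0, scale=1e-2, size=shape)


rng = np.random.default_rng(0)
w = init_modified_layer((4, 3), rng)
v = np.zeros_like(w)
w, v = sgd_step(w, grad=np.ones_like(w), velocity=v, epoch=0)
print(learning_rate(0), learning_rate(20))
```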
We evaluate CS-CNN on six classes of the Caltech-256 dataset and compare its performance with other methods [8][13][14]. Our task focuses on six classes, namely bicycle, school bus, car, motorbike, segway, and fire truck, with 1,422 images in total. The classes contain different numbers of images, and the images differ in size. In this work, we use three different numbers of training images in the experiments (10, 20, and 30 per class). Testing runs at about 13 images per second on one GTX 1080.
Figure 3. Examples of vehicle subset in Caltech-256 dataset.
B. Performance
The proposed network is trained for 50 epochs on the Caltech-256 dataset. We use the weights from the last training epoch for testing and classification because the training curves show convergence. Figures 4 and 5 show the objective and the accuracy at each epoch of training on the Caltech-256 dataset.
Figure 4. Objective on training data over 50 epochs. Left: 10 images per class; middle: 20 images per class; right: 30 images per class.
Figure 5. Accuracy on training data over 50 epochs. Left: 10 images per class; middle: 20 images per class; right: 30 images per class.
The classification accuracy of each architecture is calculated with the following equation:

$$y = \frac{E}{T} \times 100\% \tag{1}$$

where $y$ is the accuracy, $E$ is the number of test images classified correctly, and $T$ is the total number of test images. The classification results of each architecture on the Caltech-256 dataset are shown in Table I. The proposed architecture achieves the highest accuracy in every setting: 93.9% with 10 training images per class, 96.93% with 20 training images per class, and 97.75% with 30 training images per class.
TABLE I. CLASSIFICATION RESULT
Method             10 images    20 images    30 images
VGG-s              88.74%       92.70%       96.38%
VGG-Verydeep-16    91.18%       94.55%       90.18%
CS-CNN             93.75%       96.93%       97.75%
Beyond the above experiments, we also show that deep learning is a better feature extraction method by comparing it with the non-deep-learning method CE-SPM [14], tested on the same dataset. As shown in Figure 6, the deep learning methods achieve recognition rates more than 10 percentage points higher than CE-SPM.
Figure 6. Testing results of our proposed method with different numbers of training images, compared with VGG-s [13], VGG-Verydeep-16 [8], and CE-SPM [14].
IV. CONCLUSIONS
In this work, we have presented CS-CNN, an end-to-end deep convolutional neural network architecture combined with a center strengthened method. The center strengthened method enhances the features of the central region of the feature maps of the last convolutional layer. In addition, by incorporating ROI pooling, CS-CNN can take data of non-fixed size as network input, which avoids object deformation in the images. In the experiments, our method achieves better results than VGG-s, VGG-Verydeep-16, and CE-SPM. We also obtain over 97% testing accuracy with only a small amount of training data on the Caltech-256 dataset.
REFERENCES
[1] Y. He, Y. Du, and L. Sun, "Vehicle Classification Method Based on Single-Point Magnetic Sensor," International Conference on Traffic and Transportation Studies, Changsha (2012).
[2] S. A. Rajab, A. S. Othman, and H. H. Refai, "Novel vehicle and motorcycle classification using single element piezoelectric sensor," in Proceedings of IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 496-501 (2012).
[3] J. J. Lamas-Seco, P. M. Castro, A. Dapena, and F. J. Vazquez-Araujo, "Vehicle Classification Using the Discrete Fourier Transform with Traffic Inductive Sensors," Sensors (2015).
[4] S. Gupte, O. Masoud, and N. P. Papanikolopoulos, "Vision-Based Vehicle Classification," IEEE Intelligent Transportation Systems Conference Proceedings, Dearborn (MI), USA (2000).
[5] A. H. S. Lai, G. S. K. Fung, and N. H. C. Yung, "Vehicle Type Classification from Visual-Based Dimension Estimation," IEEE Intelligent Transportation Systems Conference Proceedings, Oakland (CA), USA (2001).
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems (2012).
[7] C. Szegedy et al., "Going deeper with convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
[8] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR (2015).
[9] J. Y. Ng and Y. H. Tay, "Image-based Vehicle Classification System," 11th Asia-Pacific ITS Forum & Exhibition (2011).
[10] A. Dehghan, S. Z. Masood, G. Shu, and E. G. Ortiz, "View Independent Vehicle, Make, Model, and Color Recognition Using Convolutional Neural Network," arXiv (2017).
[11] Y. Gao and H. J. Lee, "Local Tiled Deep Networks for Recognition of Vehicle Make and Model," Sensors, 16, 226; doi:10.3390/s16020226 (2016).
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," arXiv:1406.4729v4 (2015).
[13] K. Chatfield et al., "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531 (2014).
[14] A. Santoso et al., "Kernel Sparse Representation Classifier with Center Enhanced SPM for Vehicle Classification," Computer Software and Applications Conference (COMPSAC), 2015 IEEE 39th Annual, Vol. 2, IEEE (2015).
[Figure 6 chart: recognition rate (%), from 0% to 100%, versus number of training images (60, 120, 180); series: CE-SPM + K-SRC, CE-SPM + Libsvm, VGG-s, VGG-Verydeep-16, CS-CNN]