Article

An Improved Tiny YOLOv3 for Face and Facial Key Parts Detection of Cattle

Yaojun Geng 1,†,*, Peijie Dong 1,†, Nan Zhao 1 and Yue Lu 1

1 Current address: College of Information Engineering, Northwest A&F University, 712100 Yangling, China; [email protected]

* Correspondence: [email protected]; Tel.: (86)15829637039
† These authors contributed equally to this work.

Version May 9, 2019 submitted to Appl. Sci.

Abstract: Various techniques for identifying individual cattle by the face or nose have already been developed, and they generally require a frontal view of the face or nose when images are gathered. Nevertheless, these approaches call for manual intervention, since cattle do not consciously face the camera to assist with data acquisition. To reduce manual work and better serve the identification of individual cattle, a method is needed to detect the face and facial key parts (eyes and nose) of cattle quickly and accurately. This paper proposes a fast and accurate detector for the face and facial key parts of cattle, taking Tiny YOLOv3, a fast object detector with a good speed/accuracy trade-off, as the baseline. It includes a new backbone network that uses dilated convolutions in the shallow layers, and a multi-level feature fusion method, which introduces more contextual information into Tiny YOLOv3 and greatly improves the detection of small and multi-scale targets. In the fusion operation, compared to Tiny YOLOv3, Mini YOLOv3 adds a detection layer in the shallow layers. Since nodes in different layers have different receptive fields, a layer with a small receptive field is used to predict small targets. By using upsampling layers and skip connections, more semantic information is injected into the dense feature maps, which in turn helps to predict small targets. The experimental results show that Mini YOLOv3 achieves a 13.39% higher mAP than the Tiny YOLOv3 baseline, with a 32-point improvement on small object detection in particular. The average inference speed is 9 ms per image, which meets the requirement of real-time detection. Moreover, compared with two-stage methods such as Faster R-CNN and FPN, Mini YOLOv3 achieves higher performance in both speed and accuracy. A new dataset named NWAFUC is also constructed.

Keywords: object detection, real-time detection, small object, multi-scale

1. Introduction

Cattle identification plays a non-trivial role in animal breeding, production, and the distribution of animal races. Traditional cattle identification methods use ear tattoos, embedded microchips, and ear tags [1,2] to identify individual cattle. Although these methods are effective, they are invasive and leave flaws on the body of the cow [3]. At the same time, these methods are laborious and expensive for the researcher. Visual animal biometrics has recently emerged as a field of study that develops effective methods for the identification of individuals [4]. Such methods can provide an efficient solution for accurate identification of an animal without invading its body. Biometric characteristics, such as muzzle point images [5], facial biometric features, retinal images [6], and iris images [7], are applied for cattle identification. These techniques consider cattle face identification under specific constraints that limit variability, such as a frontal view, homogeneous



background, mug-shot images, and so on. However, such limitations need to be relaxed in practice because the posture of a cattle face changes at any time. To identify individual cattle on a farm using visual biometrics such as the face or nose, the prerequisite is to detect the key features in the image. Therefore, developing an efficient and accurate method for detecting the cattle face and its key features is necessary for cattle identification.

With the development of machine learning, deep learning technology has been widely used in agriculture [8] and animal husbandry [9]. It has also become feasible to detect the head and face of a cow using state-of-the-art object detection algorithms. Existing deep learning based detectors can be roughly categorized into two main types of pipelines: two-stage approaches and one-stage approaches. Two-stage approaches include a region proposal network to extract Regions of Interest (RoIs), followed by a classification network to generate predictions and regress the RoIs. Despite their high accuracy, two-stage models such as Faster R-CNN and FPN [10] struggle to achieve real-time speed due to the expensive training process and the inefficiency of region proposals [11]. One-stage approaches remove the region proposal step and directly classify and regress anchor boxes in a dense manner on the feature maps. YOLO [12–14], RetinaNet [15], and SSD [16] are typical one-stage methods, which have relatively fast inference speed but slightly poorer accuracy than two-stage methods.

Because of their large amount of computation and high model complexity, two-stage object detection methods struggle to achieve real-time detection. Hence, we take Tiny YOLOv3 as the baseline and aim to greatly improve the detection accuracy on small and multi-scale targets while affecting the inference speed as little as possible.

In this paper, we focus on developing an accurate and fast detection method for the cattle face and facial key parts, including the eyes and nose, in cattle farm scenes. Firstly, we collect a new dataset, named NWAFU Cattle (NWAFUC), for training and evaluating the neural network. The characteristics of our dataset mainly concern the following four aspects: (i) cattle faces, eyes, and noses are annotated; (ii) it can be used jointly for face recognition and face detection benchmarking; (iii) it covers full pose variation; (iv) it covers multi-scale variation of subjects. Secondly, we propose an improved Tiny YOLOv3 method, named Mini YOLOv3, for detecting the face, eyes, and nose of cattle, which reaches a higher mean average precision and achieves real-time detection. To improve the precision on small and multi-scale targets, dilated convolutions are used to extract features in the shallow layers of the backbone network. Moreover, the model concatenates upsampled feature maps with shallow layers, so the performance of detecting multi-scale objects is substantially improved.

Section 2 details the progress of small-target and multi-scale target detection in recent years. Section 3 explains the data acquisition and a new data augmentation method for simulating the impact of natural environment changes on image quality. Section 4 describes the baseline model Tiny YOLOv3 and the new network architecture designed specifically for small and multi-scale targets. Section 5 analyzes and evaluates Mini YOLOv3: Sections 5.1 and 5.2 introduce the experimental setup and evaluation metrics, respectively, and Section 5.3 makes a comprehensive evaluation of Mini YOLOv3 from both qualitative and quantitative aspects.

2. Related Work

In the past few years, traditional machine learning algorithms have been widely used for automatic livestock identification and behavior analysis on the farm, which increases productivity and reduces the incidence of disease. For example, P. Bruyère et al. [17] proposed an approach for estrus detection in dairy cattle with the “Cameta-Icons” method; when combined with the direct visual method, the detection rate was 88.6%, significantly higher than the detection rate of the direct visual method alone. In [18], Santosh Kumar et al. presented a face recognition algorithm for cattle using Speeded Up Robust Features (SURF) and Local Binary Patterns (LBP) extracted from different Gaussian pyramid levels. The presented algorithm achieved an accuracy of 92.5% on a dataset of cattle facial images. Further, Tillett et al. [19] introduced a model based


on image processing to track pigs’ movements in surveillance video. Measuring activity and locating specific parts of pigs can also be realized by this technique. In a similar direction, Simona M.C. Porto et al. [20] proposed a computer vision-based system (CVBS) based on the Viola-Jones algorithm for automatically detecting the lying behavior of dairy cattle in free-stall barns. The system provided accurate detection under various environmental conditions, identifying the lying behavior of dairy cattle with an accuracy of about 92%.

Currently, using deep learning approaches to identify livestock is a feasible way to improve the efficiency of livestock management. Santosh Kumar et al. [21] showed that a deep learning based approach can recognize cattle from muzzle point image patterns, which consist of rich and dense texture features and also play a non-trivial role in the tracking and monitoring of individual cattle; their recognition system achieved an identification accuracy of 98.99%. In the same field, Mayanda Mega Santoni et al. [22] developed a model called Gray Level Co-occurrence Matrix Convolutional Neural Networks (GLCM-CNN) to recognize cattle race automatically, a feasible way to manage cattle livestock. Even without a segmentation process for the classification of cattle race, the identification system obtained high accuracy. In a similar direction, Mark F. Hansen et al. [23] used a non-invasive imaging system based on a Convolutional Neural Network (CNN) model to recognize pigs from their faces, an approach that is completely applicable to an on-farm system. The experimental results showed that the proposed model achieved an accuracy of 96.7% on images of 10 pigs.

3. Description of Dataset

3.1. Image Data Acquisition

As part of this work, we created a new dataset named NWAFUC to train and evaluate our detection algorithms. The dataset contains images of cattle captured in natural scenes at the National Beef Cattle Improvement Center’s experimental farm in Yangling, China. There are a total of 1442 images in the dataset, including 2872 heads, 3818 eyes, and 2549 noses. The image data were collected under various conditions with a high degree of variability in illumination (shown in Figure 2), blur, occlusion, and low resolution. Moreover, pose variation (shown in Figure 1) due to body dynamics and head movement, different target sizes, and small-scale targets in the distant view (shown in Figure 3) pose a great challenge to our research. Compared to datasets that generally have a prominent foreground object and a homogeneous background, our dataset is closely related to practical applications and better simulates natural scenes.

Figure 1. Special posture
Figure 2. Low illumination

3.2. Data Augmentation

In natural scenes, cattle appearance can change according to occlusion and cattle are densely distributed in space, so the objects in the dataset are variable and hierarchical, which is a great challenge for object detection. Moreover, since lighting and weather conditions vary greatly during the


Figure 3. Small-scale targets in the distant view and large-scale targets in the close-up

day, whether the neural network can correctly detect images under different conditions depends on the diversity and integrity of the dataset. In order to expand the dataset and increase its diversity and integrity, we adopted 12 image processing methods for data augmentation.

Firstly, we generate images from the original image by adjusting the brightness, contrast, and sharpness values. In addition, the collected images are pre-processed with 180° rotation, mirror symmetry, Gaussian blur, Gaussian filtering, salt-and-pepper noise, and PCA jittering.

The dataset was augmented as shown in Figure 4 and Figure 5.
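To illustrate these operations, the following is a minimal sketch of the photometric and geometric transforms listed above using Pillow and NumPy. The enhancement factors and the noise ratio are illustrative assumptions, not the exact values used in this paper, and Gaussian filtering and PCA jittering are omitted for brevity.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def augment(img: Image.Image) -> dict:
    """Photometric and geometric variants as described above; Gaussian
    filtering and PCA jittering are omitted for brevity. Factors are
    illustrative, not the paper's exact values."""
    out = {
        "low_brightness":  ImageEnhance.Brightness(img).enhance(0.5),
        "high_brightness": ImageEnhance.Brightness(img).enhance(1.5),
        "low_contrast":    ImageEnhance.Contrast(img).enhance(0.5),
        "high_contrast":   ImageEnhance.Contrast(img).enhance(1.5),
        "low_sharpness":   ImageEnhance.Sharpness(img).enhance(0.0),
        "high_sharpness":  ImageEnhance.Sharpness(img).enhance(2.0),
        "rot180":          img.rotate(180),
        "mirror":          ImageOps.mirror(img),
        "gauss_blur":      img.filter(ImageFilter.GaussianBlur(radius=2)),
    }
    # Salt-and-pepper noise: set a small random fraction of pixels to 0 or 255.
    arr = np.asarray(img).copy()
    mask = np.random.rand(*arr.shape[:2])
    arr[mask < 0.01] = 0
    arr[mask > 0.99] = 255
    out["salt_pepper"] = Image.fromarray(arr)
    return out
```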

Figure 4. Image augmentation methods: (a) low brightness, (b) low contrast, (c) low sharpness, (d) high brightness, (e) high contrast and (f) high sharpness.

Table 1 compares the number of images in the augmented dataset with that of the original dataset. The 1442 images are expanded 12-fold using the data augmentation methods. The distribution of each target is also provided; the sample size of each category is roughly balanced.


Figure 5. Image augmentation methods: (a) 180° rotation, (b) mirror symmetry, (c) Gauss blur, (d) Gauss filtering, (e) salt and pepper noise and (f) PCA jittering.

Table 1. The number of images generated by data augmentation methods.

Dataset name           Head    Eye     Nose    Total
Before augmentation    2872    3818    2549    1442
After augmentation     34464   45816   30588   17304

3.3. Image Annotation and Dataset Production

Manual annotation was applied after the images were collected. Bounding boxes were labelled and categories were assigned manually using LabelImg [24]. Targets are not labelled if more than 80% of their area is occluded, and targets with insufficient or unclear pixel area are also not labelled, which helps prevent model overfitting. The Pascal Visual Object Classes (VOC) [25] challenge is a benchmark in visual object category recognition and detection, which provides a standard dataset of images and annotations, and standard evaluation procedures. Following this de facto standard for object detection, images in the training set were converted to the Pascal VOC format.
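For illustration, the following is a minimal sketch of writing one image’s labels in Pascal VOC XML; the filename and box values are hypothetical, and only the core fields of the VOC annotation layout are shown.

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Write one image's labels as Pascal VOC XML.
    `boxes` is a list of (class_name, xmin, ymin, xmax, ymax) in pixels."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name  # "head", "eye" or "nose"
        bnd = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(val)
    return ET.ElementTree(root)

# Hypothetical example: one head box in a 1920x1080 image.
voc_annotation("cattle_0001.jpg", 1920, 1080,
               [("head", 412, 230, 980, 845)]).write("cattle_0001.xml")
```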

4. Approach

4.1. Tiny YOLOv3

YOLOv3 [14] evolved from YOLO [12] and YOLOv2 [13]. The YOLO series was developed as a one-stage process combining detection and classification. The fast architecture of YOLO achieves 45 FPS, and a smaller version, Tiny YOLOv2, achieves up to 244 FPS.

Unlike two-stage algorithms, YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities, achieving end-to-end detection. As shown in Figure 6, the neural network divides each input image into S × S (S = 7) grids. If the center of a target’s ground truth falls in a grid cell, that cell is responsible for detecting the target. Each grid cell predicts B bounding boxes and their confidence scores. Confidence refers to the probability that a target exists in each bounding box and is defined as:

\[ \text{Confidence} = \Pr(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}} \tag{1} \]

\(\Pr(\text{Object}) = 1\) means the grid cell is responsible for detecting the target, and otherwise \(\Pr(\text{Object}) = 0\). \(\text{IoU}_{\text{pred}}^{\text{truth}}\) is the intersection over union between the predicted box and the ground truth.


Figure 6. YOLO divides the image into an S × S grid and, for each grid cell, predicts B bounding boxes, confidence scores for those boxes, and C class probabilities, yielding the final detections.

If multiple bounding boxes detect the same target, YOLO uses the non-maximum suppression (NMS) algorithm to select the best bounding box.
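A minimal sketch of this grid assignment and of Equation (1) follows, assuming pixel coordinates for the ground-truth center; the helper names are illustrative.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Grid cell (row, col) whose region contains the ground-truth
    center (cx, cy), given the image size in pixels."""
    row = min(S - 1, int(cy / img_h * S))
    col = min(S - 1, int(cx / img_w * S))
    return row, col

def confidence(pr_object, iou_pred_truth):
    """Equation (1): Pr(Object) x IoU(pred, truth), where Pr(Object)
    is 1 for the responsible cell and 0 otherwise."""
    return pr_object * iou_pred_truth
```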

Although YOLO is fast, its accuracy is poor. To improve accuracy, YOLOv2 and YOLOv3 were proposed. Among them, YOLOv3 uses multi-scale training, batch normalization, direct location prediction, and a deeper backbone network (Darknet-53). These methods greatly improve the accuracy on small targets. At the same time, Tiny YOLOv3 was also proposed.

Tiny YOLOv3 [14] is a simplified version of YOLOv3 with fewer convolutional layers, while keeping the same loss function and optimization strategy as YOLOv3. Compared to YOLOv3, Tiny YOLOv3 has fewer parameters, lower computational complexity, and much faster inference. The parameters of Tiny YOLOv3 are depicted in the left half of Figure 8. Tiny YOLOv3 alternates 9 convolutional layers with 6 max-pooling layers to form the feedforward network, which leads to the loss of useful information in the process of transmission. The 416 × 416 input image is downsampled by a factor of 32 to 13 × 13. In addition, Tiny YOLOv3 detects targets at two different scales (13 × 13 and 26 × 26), as shown in Figure 8. At the second scale (26 × 26), feature fusion is carried out before detection: the feature map obtained by upsampling the 13 × 13 layer is combined by concatenation, and the final prediction is then obtained using non-maximum suppression.

4.2. Mini YOLOv3

To develop a more accurate network, we chose Tiny YOLOv3, a model with high inference speed but poor performance in detecting small and multi-scale targets, as our baseline. Facing the problems of multi-scale detection and dense object detection, this paper proposes Mini YOLOv3, which improves the backbone network, the feature fusion, and the detection part, as shown in Figure 8.

4.2.1. Backbone Network

Object detection not only needs to recognize the category of object instances but also to spatially locate their positions [26]. A large downsampling factor brings a large valid receptive field, which is good for image classification but compromises the object localization ability [27]. In this case, the major improvements are in three aspects. First, convolution is adopted for downsampling instead


of max pooling. In this way, the computation of Mini YOLOv3 increases, but global information is retained, which is conducive to detecting small targets. Second, dilated convolution [28] offers a good alternative that realizes a larger receptive field at the same computation and memory cost, while preserving the dimensions of the data at the output layer and maintaining the ordering of the data. Hence, we use dilated convolutions instead of traditional convolutions in the shallow layers, which expands the receptive field of the model with limited computing resources. Third, considering the trade-off between computational cost and accuracy, we increased the network input size from 416 × 416 to 608 × 608, which improves the spatial resolution of the feature maps and significantly improves accuracy on small targets.
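The following PyTorch sketch shows the shallow-layer pattern that can be read off Figures 7 and 8: a stride-1 dilated 3×3 convolution (d = 2) followed by a stride-2 convolution that replaces max pooling. The filter counts follow Figure 8; the normalization and activation details are assumptions.

```python
import torch.nn as nn

def dilated_stage(c_in, c_out):
    """One shallow stage of Mini YOLOv3's backbone (per Figure 8):
    a stride-1 dilated 3x3 conv, then a stride-2 3x3 conv that halves
    the resolution in place of max pooling."""
    return nn.Sequential(
        # dilation=2 widens the receptive field at unchanged cost;
        # padding=2 keeps the spatial size.
        nn.Conv2d(c_in, c_out, 3, stride=1, padding=2, dilation=2),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
        # learned downsampling instead of 2x2 max pooling
        nn.Conv2d(c_out, c_out, 3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

# 608x608x3 -> 304x304x16 -> 152x152x32 -> 76x76x64, as in Figure 8.
shallow = nn.Sequential(dilated_stage(3, 16),
                        dilated_stage(16, 32),
                        dilated_stage(32, 64))
```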

4.2.2. Detection Part

Figure 7. Mini YOLOv3 network structure diagram. Dilated convolutions (d = 2) are applied in the shallow layers, and detections are produced at three scales (Scale 1, Scale 2 and Scale 3).

In order to improve the multi-scale prediction ability of the model, we choose three different scales (Scale 1, Scale 2 and Scale 3) for object detection, as shown in Figure 7. Compared to Tiny YOLOv3, Mini YOLOv3 adds a detection layer in the shallow layers to fuse the features of Scale 2 and Scale 3. Besides, to improve fidelity on small targets, a residual connection [29] is applied to the upsampled features at Scale 3, which gives the detector access to finer-grained feature maps; a sketch of this fusion pattern is given below.
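A minimal sketch of one such fusion step follows, assuming the usual upsample-and-concatenate pattern; the channel counts, 1×1 reduction, and output size are illustrative, and the Scale 3 residual shortcut is indicated only in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """One fusion step: reduce and upsample the deeper map, concatenate
    it with a shallower route layer, then predict at that scale.
    Mini YOLOv3 additionally adds a residual (element-wise) shortcut to
    the upsampled features at Scale 3; concatenation alone is shown."""
    def __init__(self, c_deep, c_shallow, n_out):
        super().__init__()
        self.reduce = nn.Conv2d(c_deep, c_deep // 2, 1)  # 1x1 before upsampling
        self.predict = nn.Conv2d(c_deep // 2 + c_shallow, n_out, 1)

    def forward(self, deep, shallow):
        up = F.interpolate(self.reduce(deep), scale_factor=2, mode="nearest")
        fused = torch.cat([up, shallow], dim=1)          # skip connection
        return self.predict(fused)

# e.g. fusing a 19x19x1024 map with the 38x38x256 route layer (Figure 7);
# n_out is illustrative (N_boxes x (N_classes + 5) per the Figure 8 caption).
head = FusionHead(c_deep=1024, c_shallow=256, n_out=24)
out = head(torch.randn(1, 1024, 19, 19), torch.randn(1, 256, 38, 38))
```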

To attain better localization performance and higher precision and recall, we adopt the soft non-maximum suppression (Soft-NMS) algorithm as an alternative to traditional NMS to obtain the final detections in the detection pipeline. It can be easily implemented without any additional hyper-parameters, and its computational complexity is the same as that of NMS. Using Soft-NMS instead of NMS at testing time further improves the performance of Mini YOLOv3 on small targets.
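The following is a sketch of the linear variant of Soft-NMS; the thresholds are illustrative defaults, not the values used in this paper.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, as in Equation (2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear Soft-NMS: rather than discarding a box whose IoU with the
    current best exceeds the threshold, decay its score by (1 - IoU).
    Returns the indices of the kept boxes in order of selection."""
    scores = np.array(scores, dtype=float)
    idxs = list(range(len(boxes)))
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            ov = iou(boxes[best], boxes[i])
            if ov > iou_thresh:
                scores[i] *= 1.0 - ov  # decay instead of suppress
        idxs = [i for i in idxs if scores[i] >= score_thresh]
    return keep
```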

We run k-means clustering with an IoU distance metric [13] to determine our bounding box priors, which ensures a sufficiently high IoU with the ground-truth objects. In this way, Mini YOLOv3 can provide suitable anchor boxes for predicting targets at the different scales.
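A sketch of this anchor clustering with the 1 − IoU distance of [13] follows, where boxes are represented by their width and height and aligned at the origin; k = 9, the iteration count, and the NumPy representation are assumptions.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between (w, h) pairs, with all boxes aligned at the origin."""
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0])
             * np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    union = wh[:, 0:1] * wh[:, 1:2] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with distance d = 1 - IoU [13]."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)  # min d == max IoU
        centroids = np.array([wh[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids[np.argsort(centroids.prod(axis=1))]   # sorted by area
```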

5. Experiments and Discussion

5.1. Experiment setup and Platform

The Mini YOLOv3 model was trained and tested on a single NVIDIA Tesla P100 server. We trained with gradient descent, used leaky ReLU activations, and chose three scales for prediction (19×19, 38×38, 76×76). The initialization parameters of the model are shown in Table 2.


Tiny YOLOv3 (input 416 × 416):

Type   Filters  Size/Stride  Output
Conv   16       3×3/1        416×416
Max             2×2/2        208×208
Conv   32       3×3/1        208×208
Max             2×2/2        104×104
Conv   64       3×3/1        104×104
Max             2×2/2        52×52
Conv   128      3×3/1        52×52
Max             2×2/2        26×26
Conv   256      3×3/1        26×26
Max             2×2/2        13×13
Conv   512      3×3/1        13×13
Max             2×2/1        13×13
Conv   1024     3×3/1        13×13
Conv   256      1×1/1        13×13
Conv   512      3×3/1        13×13
Conv   N        1×1/1        13×13

Mini YOLOv3 (input 608 × 608):

Type          Filters  Size/Stride  Output
Dilated-Conv  16       3×3/1        608×608
Conv          16       3×3/2        304×304
Dilated-Conv  32       3×3/1        304×304
Conv          32       3×3/2        152×152
Dilated-Conv  64       3×3/1        152×152
Conv          64       3×3/2        76×76
Conv          128      3×3/1        76×76
Conv          128      3×3/2        38×38
Conv          256      3×3/1        38×38
Conv          256      3×3/2        19×19
Conv          512      3×3/1        19×19
Conv          512      1×1/1        19×19
Conv          1024     3×3/1        19×19
Conv          256      1×1/1        19×19
Conv          512      3×3/1        19×19
Conv          N        1×1/1        19×19

Figure 8. Network parameters of Tiny YOLOv3 (final detections via non-maximum suppression) and Mini YOLOv3 (final detections via soft non-maximum suppression). The final layer provides predictions of bounding boxes and classes, and has size N_f = N_boxes × (N_classes + 5), where N_boxes is the number of boxes per grid (9 by default), and N_classes is the number of object classes.

In the experiment, we set the weight decay coefficient to the empirical value of 0.0005. The activation function is leaky ReLU and the IoU threshold is 0.65. To improve accuracy on small targets, we increase the network resolution from 416 × 416 to 608 × 608. Considering the size of the dataset and the expressive ability of the neural network, the batch size was set to 32 to avoid overfitting. Additionally, we used batch normalization in each hidden layer to reduce shifts in the feature distribution between the training and test data. To ensure the convergence and performance of Mini YOLOv3, we trained it for 72,000 steps and finally chose the weights at 50,000 steps, which gave the best performance. Parameters such as momentum, initial learning rate, and weight decay regularization follow the original YOLOv3 settings.

The learning rate decreased to 0.0001 after 40,000 steps and to 0.00001 after 45,000 steps. 80% of the cattle images were used for training and the rest for testing. We trained the network from scratch, without fine-tuning from ImageNet pre-trained models.

Table 2. Initialization parameters of Mini YOLOv3 network

Parameter name         Value
Input size             608×608
Batch size             32
Momentum               0.9
Initial learning rate  0.001
Decay                  0.0005

5.2. Evaluation

In this paper, a series of experiments is carried out on the test images to verify the performance of these models. Following the Pascal VOC evaluation protocol, we selected several evaluation metrics for object detection. The metrics used to evaluate the effectiveness of the network model are as follows:


5.2.1. IoU

IoU is the overlap ratio between the predicted bounding box generated by the detection model and the ground-truth bounding box, which reflects the detection accuracy of the model from the perspective of a simple geometric metric.

\[ \text{IoU} = \frac{\text{DetectionResult} \cap \text{GroundTruth}}{\text{DetectionResult} \cup \text{GroundTruth}} \tag{2} \]

Figure 9. Intersection and Union

5.2.2. AP and mAP

The accuracy for a category can be obtained by calculating the ratio of correctly predicted instances of that class to the total number of instances in the images. Average Precision (AP) measures the detection accuracy of the detector for a single category, while Mean Average Precision (mAP) evaluates a multi-target detector across all categories. The mAP, a value in the interval [0, 1], is the average of the APs of all categories. It is an important indicator commonly used in object detection to measure detection accuracy: the higher the mAP, the better the recognition accuracy.

\[ \text{Precision} = \frac{N(\text{True Positives})}{N(\text{Total Objects})} \tag{3} \]

\[ \text{AveragePrecision} = \frac{\sum \text{Precision}}{N(\text{Total Images})} \tag{4} \]

\[ \text{MeanAveragePrecision} = \frac{\sum \text{AveragePrecision}}{N(\text{Classes})} \tag{5} \]

5.2.3. Precision, Recall and F1 Score

Depending on the combination of true class and predicted result, all samples can be divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Recall (R) is the proportion of positive samples in the test dataset that are correctly predicted as positive. Precision (P) is the ratio of correct predictions among all predictions. The P-R curve is formed by plotting the precision and recall obtained at different thresholds in the same coordinate system; the area enclosed by the P-R curve is the average precision.

\[ R = \frac{TP}{TP + FN} \tag{6} \]

\[ P = \frac{TP}{TP + FP} \tag{7} \]

Generally speaking, higher precision comes with lower recall; the two trend in opposite directions. Therefore, the F1 score can be used to harmonize the two.


\[ F1 = \frac{2 \times P \times R}{P + R} \tag{8} \]

5.2.4. FPS

FPS is a measure of the number of frames processed per second, which is mainly used to evaluatethe speed and efficiency of video detection.

5.3. Detailed Performance Analysis

In this section, we compare the performance of Mini YOLOv3 with one-stage approaches (Tiny YOLOv2 and Tiny YOLOv3) and two-stage approaches (Faster R-CNN and FPN).

Firstly, we compare Mini YOLOv3 with the two benchmarks from the perspective of qualitative analysis. Secondly, from the perspective of quantitative analysis, Mini YOLOv3 is compared with one-stage methods (Tiny YOLOv2, Tiny YOLOv3) and two-stage methods (Faster R-CNN, FPN).

5.3.1. Qualitative Evaluation

In different scenes, the performance of Mini YOLOv3 is better than that of the other two models (Tiny YOLOv2 and Tiny YOLOv3). Figure 10 shows example detections of the three models in different scenes. Based on these results, the Tiny YOLOv2 model is basically insensitive to small targets and very easily misses targets against the background. Tiny YOLOv3 performs better than Tiny YOLOv2, but is prone to errors when predicting targets against the background, and its detection of small targets is not stable enough. In cases of occlusion, different brightness levels, and special postures, Mini YOLOv3 achieved good detection performance. Further, the bounding boxes obtained by the Mini YOLOv3 model fit the ground-truth boxes better, and the ground-truth box is matched with a prior box of higher IoU, so the detection is more accurate than with the other two models. As seen in the third row of Figure 10, Mini YOLOv3 is also better at detecting multi-scale targets. To sum up, Mini YOLOv3 is a highly robust model and performs better on small and multi-scale targets.

5.3.2. Quantitative Evaluation

We make a quantitative analysis by comparing Mini YOLOv3 with one-stage methods and two-stage methods.

One-stage methods

Mini YOLOv3 is compared quantitatively with the one-stage methods Tiny YOLOv2 and Tiny YOLOv3. Figure 11 shows the P-R curves of Mini YOLOv3 for the three categories on the test set. It can be seen from the P-R curves that the proposed model performs better on the detection of the head of cattle but less well on the eye and nose. The eyes and nose are small targets relative to the head, and fewer anchors are positively matched with small targets during training. Thus, we choose three scales (19×19, 38×38, 76×76) to detect large, medium, and small targets respectively, as shown in Figure 8.

Moreover, this paper evaluates Mini YOLOv3 on several metrics for a comprehensive comparison with Tiny YOLOv2 and Tiny YOLOv3. The AP, IoU, FPS, recall, and F1 scores of the models are presented in Table 3 and Table 4.

According to the results in Table 3 and Table 4, Mini YOLOv3 obtains good performance. Mini YOLOv3 reaches 95.38% head AP and 86.85% nose AP, considerably better than both Tiny YOLOv2 and Tiny YOLOv3. The AP on small targets (eye) is 30-50 points higher than that of Tiny YOLOv2 and Tiny YOLOv3. Compared to the 68.89% mAP of Tiny YOLOv3 and the 54.02% mAP of Tiny YOLOv2, Mini YOLOv3 greatly improves the detection accuracy of multi-scale targets, achieving 88.28%.


Figure 10. Sample detection results of the three models (Tiny YOLOv2, Tiny YOLOv3, Mini YOLOv3) under illumination changes, occlusion, multi-scale targets, small targets, and special postures.

Figure 11. PR curve of Mini YOLOv3


Table 3. Comparison of various method results

Detection method   Head AP(%)  Eye AP(%)  Nose AP(%)  mAP(%)
Tiny YOLOv2        84.13       36.10      41.80       54.02
Tiny YOLOv3        88.88       50.21      67.57       68.89
Mini YOLOv3        95.38       82.62      86.85       88.28

At the same time, the IoU of Mini YOLOv3 is much higher than that of the other two models, which shows that Mini YOLOv3 predicts object positions more accurately. For detection speed, Mini YOLOv3 reaches 111 FPS; although not as fast as the other two models, it still meets the requirements of real-time detection. The recall of Mini YOLOv3 reaches 90.32%, which means most of the targets can be detected. The F1 score of Mini YOLOv3 is 76%, higher than that of the other two models. Overall, the accuracy and confidence provided by Mini YOLOv3 are significantly higher than those of the other two models, reflecting its superiority.

Table 4. Comparison with one-stage methods

Detection method   IoU(%)  Speed(fps)  Recall(%)  F1 score(%)
Tiny YOLOv2        33.01   357.00      48.00      48
Tiny YOLOv3        55.55   294.20      73.00      74
Mini YOLOv3        67.23   111.00      90.32      76

Two-stage methods

Mini YOLOv3 is compared quantitatively with the two-stage approaches Faster R-CNN and FPN. Table 5 shows the performance of Faster R-CNN with VGG16, Faster R-CNN with ResNet101, FPN with ResNet101, and Mini YOLOv3. From these results, we can see that the inference speed of all the two-stage methods is 0-2 FPS, far from the requirement of real-time detection. In addition, because of its multi-scale detection, the FPN model detects small targets more accurately than Faster R-CNN. In general, Mini YOLOv3 not only outpaced Faster R-CNN and FPN in both speed and accuracy, but also had the best performance in detecting small targets.

Table 5. Comparison with two-stage methods

Detection method       Head AP(%)  Eye AP(%)  Nose AP(%)  mAP(%)  FPS
FPN(res101)            86.60       61.40      63.10       70.40   0.87
Faster R-CNN(vgg16)    89.65       59.54      71.75       73.64   0.84
Faster R-CNN(res101)   90.36       61.29      70.96       74.20   1.18
Mini YOLOv3            95.38       82.62      86.85       88.28   111.00

Summary

Figure 12 shows the comparative performance of our method and the other detection methods. From the results, we can see that the Faster R-CNN and FPN methods do not meet the requirement of real-time detection, though their accuracy is higher than that of Tiny YOLOv2 and Tiny YOLOv3. Tiny YOLOv2 and Tiny YOLOv3 have an absolute advantage in speed but poor accuracy. The proposed method greatly improves detection accuracy at the cost of a slight reduction in speed.


Figure 12. A comparison of the speed (FPS) and accuracy (mean average precision, %) of our method with those of the other one- and two-stage methods (Tiny YOLOv2, Tiny YOLOv3, Faster R-CNN with VGG16 and ResNet101, and FPN with ResNet101). The dashed red line indicates whether real-time detection is achieved.

6. Conclusion

This paper has proposed an accurate and fast approach, named Mini YOLOv3, for detecting the face and facial key parts (eyes and noses) of cattle, which mainly focuses on improving the ability to detect small and multi-scale targets. On the one hand, we designed a backbone network specifically for cattle key part detection that uses dilated convolutions in the shallow layers, which expands the receptive field and captures more contextual information. On the other hand, we designed a multi-level feature fusion method: by using upsampling layers and skip connections, we inject more semantic information into the dense feature maps, which in turn helps to predict small targets. The experimental results show that Mini YOLOv3 performs better than the baseline (Tiny YOLOv3) and is superior to two-stage methods such as Faster R-CNN and FPN in both speed and accuracy. The inference speed of Mini YOLOv3 is not as fast as that of Tiny YOLOv2 and Tiny YOLOv3, a trade-off between speed and accuracy. In addition, due to its speed and accuracy, Mini YOLOv3 can better serve the individual identification of cattle. For future work, we hope to use new techniques to improve the speed of real-time detection and also plan to increase detection accuracy using more powerful strategies.

Author Contributions: Y.G. contributed significantly to proposing the main idea, manuscript preparation and revision, and providing the research project; P.D. contributed significantly to conducting the experiments and to manuscript preparation and revision; N.Z. contributed to the investigation, data acquisition, and manuscript preparation; Y.L. assisted in gathering the experimental data and contributed effective experimental labeling.

Funding: This research was funded by the Jilin Province Science and Technology Development Foundation of China under grant number 20180201013GX.

Acknowledgments: The authors appreciate Long Yang and Mengbo You for the academic guidance during thepreparation of this manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this paper:


NWAFUC  Northwest Agriculture and Forestry University Cattle
CNN  Convolutional Neural Network
SURF  Speeded Up Robust Features
HOG  Histogram of Oriented Gradients
NMS  Non-Maximum Suppression
RoIs  Regions of Interest
YOLO  You Only Look Once
LBP  Local Binary Patterns
CVBS  Computer Vision-Based System
GLCM-CNN  Gray Level Co-occurrence Matrix Convolutional Neural Networks
VOC  Visual Object Classes
IoU  Intersection over Union
AP  Average Precision
mAP  Mean Average Precision
TP  True Positive
FP  False Positive
TN  True Negative
FN  False Negative
FPS  Frames Per Second
PR  Precision and Recall
FPN  Feature Pyramid Networks

References

1. Beadles, M.; Miller, J.; Shelley, B.; Ingenhuett, D. Comparison of the efficacy of ear tags, leg bands, and tail tags for control of the horn fly on range cattle. Southwestern Entomologist 1979.
2. Hayes, N.J.; Shaw, R.J. Multiple purpose animal ear tag system, 1986. US Patent 4,612,877.
3. Kumar, S.; Singh, S.K. Visual animal biometrics: survey. IET Biometrics 2016, 6, 139–156.
4. Kumar, S.; Singh, S.K.; Singh, R.; Singh, A.K. Animal Biometrics - Techniques and Applications; Springer, 2017; pp. 1–243.
5. Kumar, S.; Singh, S.K.; Singh, R.S.; Singh, A.K.; Tiwari, S. Real-time recognition of cattle using animal biometrics. Journal of Real-Time Image Processing 2017, 13, 505–526.
6. Rusk, C.P.; Blomeke, C.R.; Balschweid, M.A.; Elliot, S.; Baker, D. An evaluation of retinal imaging technology for 4-H beef and sheep identification. Journal of Extension 2006, 44, 1–33.
7. Lu, Y.; He, X.; Wen, Y.; Wang, P.S.P. A new cow identification system based on iris analysis and recognition. IJBM 2014, 6, 18–32.
8. Kamilaris, A.; Prenafeta-Boldu, F.X. Deep learning in agriculture: A survey. CoRR 2018, abs/1807.11809.
9. Okafor, E.; Pawara, P.; Karaaba, F.; Surinta, O.; Codreanu, V.; Schomaker, L.; Wiering, M. Comparative study between deep learning and bag of visual words for wild-animal recognition. SSCI. IEEE, 2016, pp. 1–8.
10. Lin, T.Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. CVPR. IEEE Computer Society, 2017, pp. 936–944.
11. Huang, R.; Pedoeem, J.; Chen, C. YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers. 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 2503–2510.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLO9000: better, faster, stronger. Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
14. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 2018.
15. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. CoRR 2015, abs/1512.02325.
17. Bruyère, P.; Hétreau, T.; Ponsart, C.; Gatien, J.; Buff, S.; Disenhaus, C.; Giroud, O.; Guérin, P. Can video cameras replace visual estrus detection in dairy cows? Theriogenology 2012, 77, 525–530. doi:10.1016/j.theriogenology.2011.08.027.
18. Kumar, S.; Tiwari, S.; Singh, S.K. Face recognition for cattle. 2015 Third International Conference on Image Information Processing (ICIIP), 2015, pp. 65–72. doi:10.1109/ICIIP.2015.7414742.
19. Tillett, R.; Onyango, C.; Marchant, J. Using model-based image processing to track animal movements. Computers and Electronics in Agriculture 1997, 17, 249–261. doi:10.1016/S0168-1699(96)01308-7.
20. Porto, S.M.; Arcidiacono, C.; Anguzza, U.; Cascone, G. A computer vision-based system for the automatic detection of lying behaviour of dairy cows in free-stall barns. Biosystems Engineering 2013, 115, 184–194. doi:10.1016/j.biosystemseng.2013.03.002.
21. Kumar, S.; Pandey, A.; Satwik, K.S.R.; Kumar, S.; Singh, S.K.; Singh, A.K.; Mohan, A. Deep learning framework for recognition of cattle using muzzle point image pattern. Measurement 2018, 116, 1–17. doi:10.1016/j.measurement.2017.10.064.
22. Santoni, M.M.; Sensuse, D.I.; Arymurthy, A.M.; Fanany, M.I. Cattle Race Classification Using Gray Level Co-occurrence Matrix Convolutional Neural Networks. Procedia Computer Science 2015, 59, 493–502. doi:10.1016/j.procs.2015.07.525.
23. Hansen, M.F.; Smith, M.L.; Smith, L.N.; Salter, M.G.; Baxter, E.M.; Farish, M.; Grieve, B. Towards on-farm pig face recognition using convolutional neural networks. Computers in Industry 2018, 98, 145–152. doi:10.1016/j.compind.2018.02.016.
24. Tzutalin. LabelImg. Free Software: MIT License, 2015.
25. Everingham, M.; Gool, L.J.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 2010, 88, 303–338.
26. Agarwal, S.; Terrail, J.O.D.; Jurie, F. Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks. arXiv preprint arXiv:1809.03193 2018.
27. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. DetNet: A backbone network for object detection. arXiv preprint arXiv:1804.06215 2018.
28. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. International Conference on Learning Representations, 2016.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

© 2019 by the authors. Submitted to Appl. Sci. for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).