
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2018

Object Detection Using Convolutional Neural Network Trained on Synthetic Images

Margareta Vi


Master of Science Thesis in Electrical Engineering

Object Detection Using Convolutional Neural Network Trained on Synthetic Images

Margareta Vi

LiTH-ISY-EX--18/5180--SE

Supervisor: Mikael Persson, ISY, Linköpings universitet
Alexander Poole, Company

Examiner: Michael Felsberg, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Margareta Vi


Abstract

Training data is the bottleneck for training convolutional neural networks: a larger dataset gives better accuracy but also needs longer training time. It is shown that fine-tuning neural networks on synthetically rendered images increases the mean average precision. This method was applied to two different datasets with five distinct objects in each. The first dataset consisted of random objects with different geometric shapes. The second dataset contained objects used to assemble IKEA furniture. The neural network with the best performance, trained on 5400 images, achieved a mean average precision of 0.81 on a test set sampled from a video sequence. The impact of dataset size, batch size, number of training epochs, and network architecture was analyzed. Using synthetic images to train CNNs is a promising path for object detection where large amounts of annotated image data are hard to come by.



Acknowledgments

I would like to thank my supervisor at my company, Alexander Poole, for always being helpful and contributing interesting ideas. I would also like to thank my supervisor at the university, Mikael Persson, for helping me with the report, and my examiner Michael Felsberg.

Additionally, I would like to thank IKEA for providing the CAD models. Lastly, I would like to thank my family and boyfriend for supporting me through all the hard times.

Linköping, November 2018
Margareta Vi



Contents

Notation

1 Introduction
1.1 Neural network/convolutional neural network in brief
1.2 Problem Formulation
1.2.1 Limitation
1.3 Thesis Outline

2 Related work
2.1 Using synthetic data
2.2 Finetuning
2.3 Object classification
2.4 Object detection
2.5 Summary: Related Work

3 Method and Experiments
3.1 Generating the Datasets
3.1.1 Rendering Images
3.1.2 Video Recording
3.1.3 Creation of Ground Truth Data
3.2 Dataset Distribution and Network Pairings
3.3 Evaluation Metrics
3.3.1 Losses
3.3.2 PASCAL Mean Average Precision
3.4 Parameters to tune
3.5 Experiments

4 Results
4.1 Testing Different Network Configuration
4.1.1 Faster R-CNN and Inception
4.1.2 SSD and Inception
4.1.3 SSD and MobileNet
4.1.4 Summary: Single-Shot Multibox Detector
4.1.5 Summary: Different Network Architecture And Batch Sizes
4.2 Epochs Versus Dataset Size
4.3 Testing on real images
4.4 Automatic and Manual Annotations

5 Discussion
5.1 Networks
5.1.1 Single-Shot Multibox Detector
5.1.2 Faster R-CNN and Inception
5.2 Epochs versus Batches
5.3 Testing On Real Images, Video Sequence
5.4 Annotation: Manual vs Automatic

6 Conclusions And Future Work
6.1 Conclusions
6.2 Future Work

A Datasets

Bibliography


Notation

Abbreviations

Abbreviation  Description
CAD           Computer Aided Design
ILSVRC        ImageNet Large Scale Visual Recognition Challenge
CNN           Convolutional Neural Network
SVM           Support Vector Machine
mAP           Mean Average Precision
R-CNN         Regional Convolutional Neural Network
SSD           Single Shot Multibox Detector
IoU           Intersection over Union


1 Introduction

Everyone has at least once assembled furniture from IKEA. The furniture is relatively cheap and comes in flat packages; the key thing is that you need to build it yourself with the help of a booklet. Assembly starts by laying out all the pieces in front of you. The large pieces are easy to recognize, but the screws and plugs might cause problems. These items are small and look alike, which makes them difficult to distinguish from each other. Therefore, when building IKEA furniture, one could say the hardest part is to find the correct piece to use.

We humans solve this problem by first doing a coarse filtering, localizing objects with the same form as the one we seek. The next step is a fine search through the remaining items, with a specific image of the component in mind. The problem boils down to having a scene full of items in which we want to localize a specific object, i.e. object detection. Object detection is finding where an object is and what type of object it is.

Manual feature matching is costly, therefore it is desirable to computerize the task of finding such features. Two common techniques within object detection are handcrafted features combined with machine learning approaches such as the Support Vector Machine, and artificial neural networks. Handcrafted features are image properties derived using different algorithms. These features include, among others, SIFT (Scale Invariant Feature Transform), SURF (Speeded Up Robust Features), and BRIEF (Binary Robust Independent Elementary Features). An artificial neural network, on the other hand, is a complicated model inspired by how the human brain is structured. The difference between handcrafted features and artificial neural networks is that the neural network tries to learn patterns.
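As an illustration of the handcrafted-feature approach, the following is a minimal sketch (not from the thesis) of feature matching with OpenCV; the file names are hypothetical, and ORB, a BRIEF-based detector/descriptor, stands in for SIFT/SURF, which require the contrib modules in OpenCV 3.4:

```python
# Minimal sketch of handcrafted-feature matching (hypothetical file names).
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# ORB descriptors are binary, so match with Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches found")
```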


Neural networks have great potential to solve problems which involve detection of patterns or trends. Scientists have created neural networks which can solve tasks such as digit or word recognition, image classification, face recognition, and object detection, to name a few. Examples of neural networks which solve such tasks are Watson and AlphaGo. Watson played and won against Jeopardy champions [12], and AlphaGo was the first computer program which won against a Go world champion [11].

1.1 Neural network/convolutional neural network in brief

In a human brain, neurons are connected to each other via synapses, while in an artificial neural network neurons are functions and synapses are weights. The model is shown in Figure 1.0.

Figure 1.0: A biological neuron and its mathematical representation. Image acquired from [22].


From here on, artificial neural networks will be referred to as neural networks (NNs). A neural network consists of different layers: the input layer, the hidden layers, and the output layer. In this thesis the input layer contains images, and the output layer is the result of the task the NN is trying to solve, i.e. object detection. The hidden part consists of many different layers. In each layer different mathematical operations occur, such as pooling, normalization, and convolution.

Figure 1.1: A neural network, consisting of an input layer, a hidden layer, and an output layer.

A specific type of neural network which focuses on object detection is the convolutional neural network (CNN). A CNN is a neural network which uses the convolution operation in at least one of its layers [15].
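As a hedged illustration (not an architecture used in the thesis), a tiny CNN with the layer types mentioned above could be written with tf.keras as:

```python
# Minimal illustrative CNN: convolution, pooling, and normalization layers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(300, 300, 3)),   # convolution
    tf.keras.layers.MaxPooling2D(2),                     # pooling
    tf.keras.layers.BatchNormalization(),                # normalization
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),      # e.g. five object classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```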

One of the big disadvantages of NNs is that they need a large amount of training data to reach adequate performance. Therefore, getting access to data is the bottleneck for neural networks. On the internet, many 3D models of different objects are available for free in various formats, such as Computer Aided Design (CAD). From a CAD model it is possible to generate thousands of different synthetic images by alternating the background and adding texture to the objects.

It is possible to decrease the amount of training data needed by using a method called finetuning. This method is described further in chapter 2.

Many different types of neural networks exist, where the difference lies in the combination of hidden layers. In this thesis, the networks used are Faster R-CNN, Inception, Single-Shot Multibox Detector, and MobileNet, all described in chapter 2.


1.2 Problem Formulation

This thesis will investigate whether neural networks can be fine-tuned with synthetic images for the task of object detection on a video sequence. To optimize development, the networks will first be tested on images before being tested on a video sequence.

1.2.1 Limitation

To reduce training time, fine-tuning will be used. There will also be limitations on what types of objects the networks will be able to detect.

There will be two different datasets: dataset A and dataset B. Dataset A consists of objects such as screws and plugs provided by IKEA. Background and texture combinations in dataset A were realistic, since the purpose was to test whether the network could differentiate objects in the real world. Dataset B is a video sequence of the real objects in dataset A. Datasets A and B are shown in Appendix A.

A computer with an Intel Core i7-7700 and an NVIDIA GTX 1080 Ti was used. A HoverCam web camera was used to capture the video sequence. No new neural network architecture will be created. Instead, the Tensorflow Object Detection API (version 1.7) [18] will be used, together with OpenCV (version 3.4.1) [8], Python (3.5), and Blender [7].

1.3 Thesis Outline

In chapter 2 the related work is presented. The method used is described in chapter 3, and the experiments are presented in section 3.5. The results are shown in chapter 4 and discussed in chapter 5. The conclusions and future work of this thesis are presented in chapter 6.


2 Related work

Four topics are addressed in this chapter: neural networks trained on synthetic data, the concept of finetuning, object classification, and object detection using convolutional neural networks. The chapter ends with a summary describing a solution to the problems.

2.1 Using synthetic data

The time-consuming parts of NNs are the training and the gathering of training data. For detection and classification, the training data consist of two parts: the images and the corresponding annotations for each image. In this thesis, annotation means the creation of ground truth data, i.e. the bounding box for each object in the images. To get access to lots of training data, one can use synthetic data, since it can be generated automatically. Synthetic images, in this thesis, are images generated by sampling CAD models unless otherwise stated. By generating data automatically, the ground truth is always accessible.

Annotating training data is a problem for scientists since it takes a long time and good accuracy is needed. Richter et al. were creative with annotating their data: using the video game engine from Grand Theft Auto, they could get access to both scenes with realistic appearances and labels at pixel level [32].

By using these realistic images, they showed that the work needed for annotation could be notably reduced. By combining the semantic segmentation dataset with real-world images, the accuracy increased even more.

Successful attempts have been made to train NNs using synthetic data to solve classification problems. In this case, success means having the best result for a specific type of benchmark. The neural network created by Jaderberg et al. was trained for scene text classification [20], to classify whole words. The training images were computer generated with different fonts, shadows, and colors. Distortion and noise were added to the rendered images to simulate the real world. It outperformed previous state-of-the-art methods for scene word classification in the benchmarks ICDAR 2003, Street View Text, and IIIT5k-dataset. ICDAR 2003 is a competition in robust reading [25]. The amount of training data used was between 4 million and 9 million images, depending on the benchmark used.

Jaderberg et al. also created another neural network for text spotting, meaning detection and recognition of words. They created an end-to-end system for text spotting [21]. For the word detection part they used a region proposal based mechanism, and a CNN for the word recognition task. Their dataset was created in the same way as in [20]. The dataset contained 9 million images of 32x100 pixels. They used 900000 images for testing, the same amount for validation, and the rest for training. For the task of text recognition, their method had the best accuracy compared to the previous state-of-the-art methods. Jaderberg et al. also had good performance in the text spotting task, outperforming the previous state-of-the-art method [21].

Georgakis et al. trained their network with a combination of real and synthetic images [13]. The synthetic images were augmented real images: objects at different scales and positions had been superimposed onto them. The task for the network was object detection in a cluttered indoor environment.

Another work which trains a convolutional neural network with synthetic images is [30]. The network's task was to predict a bounding box and the object class category for each object of interest in RGB images captured inside a refrigerator. Training the neural network with 4000 synthetic images, the network scored a mean Average Precision (mAP) of 24% on a test set. By adding 400 real images, the mAP increased by 12%. In this paper, IoU (Intersection over Union) was used for evaluating the bounding box predictions; see section 3.3.1 for a description of IoU.

2.2 Finetuning

The concept of finetuning refers to the approach of reusing training weights. These weights come from another neural network that has been trained for another task, and they are used to initialize the training [28]. For example, a neural network trained to classify cats can be fine-tuned to classify dogs. This method has resulted in state-of-the-art performance for several tasks. Examples of such tasks are object detection [33], [26], [27], tracking [36], segmentation [4], and human pose estimation [9]. With finetuning, the training time can also be reduced [6].
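In this thesis, fine-tuning is performed through the Tensorflow Object Detection API's pre-trained models. As a generic, hedged sketch of the idea (not the thesis's setup), one can freeze an ImageNet-pretrained feature extractor and train only a new classification head:

```python
# Hedged sketch of fine-tuning: reuse pretrained weights, retrain the head.
import tensorflow as tf

base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3))
base.trainable = False  # keep the pretrained feature extractor fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. five new classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```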

2.3 Object classification

Object classification is identification of the object class in an image. Automatic classification, where no human is involved in the classification step, can be done using machine learning. Support Vector Machines (SVMs) are methods used for classification. The Support Vector Machine was first invented for the binary classification problem [10]. An SVM tries to find a function which can separate the input data into categories, by mapping the input data non-linearly to a high-dimensional vector space. In, for example, [14], [17], and [29], SVMs were used for the task of classifying land cover images.

More recent progress in object classification has been achieved by neural networks. Two state-of-the-art object classification networks are ResNet [5] and Inception net [34].

ResNet is a deep residual network, hence the name, and consists of 152 layers. Due to its large depth, it managed to achieve a 3.6% error rate (top-5 error) in the 2015 edition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and thus won the classification task of ILSVRC 2015 [23]. A human has an error rate between 5-10%, meaning ResNet outperforms humans on this task [5].

The other neural network, Inception net, is a network that consists of inception modules. An inception module is a block of multiple parallel convolutional and max-pooling layers with different kernel sizes. The inception module makes the Inception net different from traditional networks, which stack up convolutional and max-pooling layers [34]. It won the classification and detection tasks of ILSVRC in 2014 [23].

Neural networks are computationally heavy, requiring capable hardware to do the calculations. However, MobileNet is a neural network developed for mobile vision applications. Instead of both filtering and combining the output signal in one go, MobileNet divides this step into two layers, one for filtering and one for combining. This two-layer separation greatly reduces the computation and model size [16].


2.4 Object detection

Object detection includes object classification, since object detection is about finding both the object's location and its category. The object's location is most often represented as a bounding box, as shown in Figure 2.1 and Figure 2.2.

Figure 2.1: Input image. Figure 2.2: Result.

Two of the recent state-of-the-art methods for object detection are Faster R-CNN and Single-Shot Multibox Detector (SSD).

Faster R-CNN, a faster region-based convolutional network, consists of two modules. One module is a deep, fully convolutional network, a Region Proposal Network (RPN). An RPN takes an image as input and outputs a set of rectangular regions. Each rectangle has a score indicating whether the region is an object or background. The second module is a Fast R-CNN detector, which applies object detection to the regions proposed by the RPN. Faster R-CNN achieved state-of-the-art accuracy on the PASCAL VOC 2007 dataset [31].

The Single-Shot Multibox Detector is a feed-forward CNN. It produces a collection of bounding boxes of fixed size, and the probability of the presence of each object class in each box. To get the final detections, it has a non-maximum suppression step. SSD achieved an increase in both accuracy and speed compared to Faster R-CNN when tested on the PASCAL VOC 2007 dataset [24].

The networks in section 2.4 used PASCAL VOC 2007 as a benchmark. Their datasets were divided as follows: 50% for training/validation and 50% for testing, i.e. images the network had not seen before, with a total of 9963 images [3]. Since they had roughly double the number of images, 9963 versus 4830, this thesis used 10% of the dataset for testing and the rest for training to compensate. The remaining 90% was divided between training and validation: 70% for training and 30% for validation.
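A minimal sketch of this split (an assumed helper, not the thesis's code):

```python
# Split file paths into 10% test, then 70/30 train/validation of the rest.
import random

def split_dataset(paths, seed=0):
    shuffled = paths[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(0.10 * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_train = int(0.70 * len(rest))
    return rest[:n_train], rest[n_train:], test  # train, validation, test
```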


2.5 Summary: Related Work

Getting access to a large dataset is a limiting factor for neural networks; it takes time and it is costly. This thesis will investigate how well NNs perform after finetuning with synthetic images.

Hyper-parameters connected to the images are the batch size (and the image size) and the number of batches/epochs to run the training. The effect of these parameters was the focus of this thesis, and therefore pre-trained networks were used. The support vector machine is an old technique and newer methods with better accuracy have surfaced; thus, only deep learning will be used. The main interest was to compare the Single-Shot Multibox Detector against Faster R-CNN. Combining the object detection networks with the object classification networks is also an interesting aspect, since all of these networks are state-of-the-art methods. Section 3.2 states all combinations this thesis will use. The residual network was not used due to computer limitations.

Inspired by [32], a comparison of the time needed to do manual and automatic annotation of a dataset was made. It is also investigated how the network performs on automatically versus manually annotated datasets. The procedure is described in subsection 3.1.3.


3 Method and Experiments

First presented in this chapter is the rendering of synthetic data, followed by the combinations of neural networks and the evaluation method. Figure 3.1 shows a flow chart of the workflow.

Figure 3.1: Flow chart of the workflow, combining CAD models, object textures, and background images into synthetic images (training, validation, and test data) that are fed to the neural network, with a web camera providing additional input.


3.1 Generating the Datasets

This section describes the creation of the two datasets, A and B. Dataset A consists of synthetic images of five different objects: attachment, shelf plug, dowel, expandable plug, and screw. This dataset was used to train the neural networks. Dataset B consists of images sampled from a video sequence containing the physical objects, and was used to evaluate the networks. Examples from the two datasets are shown in Appendix A.

3.1.1 Rendering Images

All computer-generated data were created from CAD models. The models were either provided by the furniture company IKEA or found on the website GrabCad [2]. The images were rendered with the open-source 3D creation suite Blender [7]. To generate a large variety of data, different backgrounds, object textures, object rotations, and camera locations were used. The background images were taken from the website Pexels [1].
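A hedged sketch of such a randomized rendering loop with Blender's Python API (object and file names are hypothetical; background and texture randomization are omitted for brevity):

```python
# Render an object from randomized poses and camera positions with Blender.
import random
import bpy

obj = bpy.data.objects["screw"]      # hypothetical CAD object name
cam = bpy.data.objects["Camera"]

for i in range(100):
    # Random object rotation (radians) and camera location.
    obj.rotation_euler = [random.uniform(0.0, 6.283) for _ in range(3)]
    cam.location = (random.uniform(-1.0, 1.0),
                    random.uniform(-1.0, 1.0),
                    random.uniform(0.5, 2.0))
    bpy.context.scene.render.filepath = "/tmp/render_%04d.png" % i
    bpy.ops.render.render(write_still=True)
```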

3.1.2 Video Recording

To create dataset B, physical instances of the objects in dataset A were acquired. While recording with a web camera, the objects were introduced into the scene one by one.

3.1.3 Creation of Ground Truth Data

Ground truth data was created using either the open-source program LabelImg [35] or Blender. LabelImg lets the user draw a bounding box around each object and save the data as an .xml (Extensible Markup Language) file, which can then be converted to other formats. The same information was produced when rendering the synthetic images with Blender. In this thesis, annotation means creating ground truth data.
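LabelImg stores its annotations in PASCAL VOC style .xml files; a minimal sketch (not the thesis's converter) of reading the boxes back:

```python
# Read class names and bounding boxes from a LabelImg (PASCAL VOC) .xml file.
import xml.etree.ElementTree as ET

def read_boxes(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes
```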

3.2 Dataset Distribution and Network Pairings

Different network architectures were compared and evaluated against each other. The evaluation method is described in section 3.3.

The network pairings used are stated below:

• Faster R-CNN + Inception

• SSD + Inception

• SSD + MobileNet

The networks were chosen based on the literature study in chapter 2 and the availability of pre-trained models. A pre-trained model means that the network has already been trained on another dataset. Since no pre-trained model exists for Faster R-CNN + MobileNet, this pairing was not used.


3.3 Evaluation Metrics

To evaluate the networks, two different losses were used: the classification loss and the localization loss. They were calculated for the three different stages: training, validation, and testing. These losses were provided by the Tensorflow Object Detection API. PASCAL mAP was also used on the validation and testing datasets.

3.3.1 Losses

This thesis used the same losses as the networks stated in section 3.2, due to the use of fine-tuning. The loss for categorizing a detected object, object vs. background, is the binary classification loss and is described by a sigmoid function, shown in (3.1). The localization loss is the loss of the bounding box regression and is represented by a smooth L1 loss, the Huber loss, see (3.2).

\[ L_c(x) = \frac{1}{1 + e^{-x}} \tag{3.1} \]

\[ L_R(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3.2} \]

The lower the losses, the better the network performs.
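A minimal sketch of the two losses above (illustrative only, not the API's implementation):

```python
import math

def classification_loss(x):
    # Sigmoid of the network output, eq. (3.1).
    return 1.0 / (1.0 + math.exp(-x))

def localization_loss(x):
    # Smooth L1 / Huber loss on the box regression error, eq. (3.2).
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5
```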

Intersection Over Union

Intersection over Union (IoU) is a measure of the overlap of two bounding boxes, A and B; in this case, the overlap between the ground truth and the network's output. The IoU is the quotient of the intersection area and the union area [19]. In both [30] and [21], IoU was used as an evaluation metric. Due to the simplicity of interpreting IoU, this metric is used for evaluation within this thesis.

Figure 3.2: Area of intersection. Figure 3.3: Area of union.
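A minimal sketch (not the thesis's code) of computing IoU for two axis-aligned boxes given as (xmin, ymin, xmax, ymax):

```python
def iou(a, b):
    # Intersection rectangle dimensions (zero if the boxes do not overlap).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```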


3.3.2 PASCAL Mean Average Precision

To describe PASCAL mAP we need five terms:

• True Positive (TP)

• False Positive (FP)

• False Negative (FN)

• Precision

• Recall

A true positive is a correct detection. A false negative is a missed detection. False positives occur for multiple detections of the same object: all detections other than the first correct one are false.

Recall is defined as the proportion of all positive examples that are detected with an IoU equal to or greater than a certain value, in this case 0.5 [3].

Precision is the proportion of all detections that are true positives [3]. PASCAL mAP is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1] [3], see (3.3).

\[ AP = \frac{1}{11} \sum_{Recall_i \in \{0, 0.1, \ldots, 1\}} Precision(Recall_i) \tag{3.3} \]

The higher the mAP value is, the better the network performs.
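A minimal sketch of the eleven-point average in (3.3), using the interpolated precision (the best precision at recall greater than or equal to each level), as PASCAL evaluates it:

```python
def pascal_ap(precisions, recalls):
    """Eleven-point AP from a precision-recall curve (parallel lists)."""
    total = 0.0
    for level in [i / 10.0 for i in range(11)]:
        # Interpolated precision: best precision at recall >= level.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11.0
```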

3.4 Parameters to tune

When training a neural network, several parameters can be tuned to give better performance. The ones evaluated in this thesis are:

• Batch size: number of images in one batch.

• Number of epochs: number of times all of the training data has gone throughthe network.

• Total numbers of images used in training


3.5 Experiments

As stated in chapter 3, the parameters tuned were the batch size, the number of epochs, and the total number of images used in training. Three main experiments were executed: experiment 1, experiment 2, and experiment 3. 100000 batches were used for all runs: training, validation, and testing.

Experiment 1 only used synthetic data and consists of the following sub-experiments:

1. Testing different network configurations

2. Batch size vs epochs

3. Largest image size manageable

Sub-experiment 1 was done using the setup of sub-experiment 2. Table 3.1 specifies how the batch size and image size were varied in sub-experiment 2.

In sub-experiment 3, due to the large images, the batch size needed to be small. An image size of 600x1040 with a batch size of 1 was used due to hardware limitations. This experiment was done with the network architecture that had the best performance when testing on dataset B, which would later be shown to be Faster R-CNN + Inception net. Sub-experiments 2 and 3 used the network that had the best performance in experiment 1.

Table 3.1: Experiment 1: testing different batch sizes and epochs

Test        Baseline   #1        #2        #3
Batch Size  1          24        35        1
Image Size  300x300    300x300   240x240   600x1040
Epochs      100000     4166      2857      100000
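The epoch counts in Table 3.1 follow from dividing the fixed budget of 100000 batches by the batch size; a sketch of the arithmetic (not thesis code):

```python
# Reproduce the "Epochs" row of Table 3.1 from the 100000-batch budget.
for batch_size in (1, 24, 35):
    print(batch_size, 100000 // batch_size)  # 1 -> 100000, 24 -> 4166, 35 -> 2857
```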

In experiment 2 five different networks were tested, which are listed below:

• Faster R-CNN + Inception: 10 percent

• Faster R-CNN + Inception: 50 percent

• Faster R-CNN + Inception: 100 percent

• SSD + Inception

• SSD + MobileNet.

These networks were trained on dataset A and then validated on dataset B. Table 3.2 shows how many images were used to train the different Faster R-CNN + Inception networks. There are three different versions of Faster R-CNN + Inception because this pairing had the best mAP when testing on dataset B, see Figure 4.44.


Table 3.2: Experiment 2: testing different dataset sizes. The percentage is in terms of the total number of images in the dataset.

Test              #1          #2           #3
Number of images  540 (10%)   2686 (50%)   5392 (100%)
Batch size        24          24           24

For experiment 2 the only interesting evaluation metric is the mean average precision, since no parameters are tuned; thus only the mAP is plotted.

Experiment 3 compared automatic annotation with manual annotation. In this experiment SSD + MobileNet was used due to its short training time, shown in Figure 4.40. The validation was done by comparing the IoU between the automatically generated and the manually created ground truth, and by comparing the classification and localization losses.


4 Results

In this chapter, the results of the different network configurations stated in section 3.5 are presented. The chapter also includes the evaluation of manual versus automatic annotation.

In all figures where the mean average precision is plotted for the whole dataset, only results from the validation and the testing are shown. This is because the Tensorflow API only calculates the mAP for the validation and testing sets.

4.1 Testing Different Network Configuration

In this section, the results of the different network architectures are presented. The classification and localization losses, as well as the mean average precision, are plotted for each subset: training, validation, and testing.

4.1.1 Faster R-CNN and Inception

Here the results of the different configurations with Faster R-CNN + Inception net are presented. The dataset used is dataset B.

Baseline

Time needed for training: 3.3 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.1 - Figure 4.3.


Figure 4.1: Classification loss, Faster R-CNN + Inception, batch size 1

Figure 4.2: Localization loss, Faster R-CNN + Inception, batch size 1

Figure 4.3: Mean average precision, Faster R-CNN + Inception, batch size 1


Configuration 1, #1

Time needed for training: 22 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.4 - Figure 4.6.

Figure 4.4: Classification loss, Faster R-CNN + Inception, batch size 24

Figure 4.5: Localization loss, Faster R-CNN + Inception, batch size 24

Figure 4.6: Mean average precision, Faster R-CNN + Inception, batch size 24


Configuration 1, #2

Time needed for training: 31.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.7 - Figure 4.9.

Figure 4.7: Classification loss, Faster R-CNN + Inception, batch size 35

Figure 4.8: Localization loss, Faster R-CNN + Inception, batch size 35

Figure 4.9: Mean average precision, Faster R-CNN + Inception, batch size 35


Batch size 1, image size 600x1040

Time needed for training: 5.3 hours
Image size: 600x1040
Batch size: 1
Results are shown in Figure 4.10 - Figure 4.12.

Figure 4.10: Classification loss, Faster R-CNN + Inception, batch size of 1

Figure 4.11: Localization loss, Faster R-CNN + Inception, batch size of 1

Figure 4.12: Mean average precision (0.5 IoU), Faster R-CNN + Inception, batch size of 1


Summary: Faster R-CNN and Inception

The results on the testing dataset with all three batch sizes, image size 300x300, are plotted together in Figure 4.13 to Figure 4.15.

Figure 4.13: Classification loss, Faster R-CNN + Inception

Figure 4.14: Localization loss, Faster R-CNN + Inception

Figure 4.15: Mean average precision, Faster R-CNN + Inception


4.1.2 SSD and Inception

In the following sections, results from the different SSD + Inception runs are presented.

Baseline

Time needed for training: 2.5 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.16 - Figure 4.18.

Figure 4.16: Classification loss, SSD + Inception, batch size of 1

Figure 4.17: Localization loss, SSD + Inception, batch size of 1

Figure 4.18: Mean average precision, SSD + Inception, batch size of 1


Figure 4.17 has some missing values for the validation run; those values were NaN and are therefore not plotted.


Configuration 1, #1

Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.19 - Figure 4.21.

Figure 4.19: Classification loss, SSD + Inception, batch size of 24

Figure 4.20: Localization loss, SSD + Inception, batch size of 24

Figure 4.21: Mean average precision, SSD + Inception, batch size of 24


Configuration 1, #2

Time needed for training: 8.4 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.22 - Figure 4.24.

Figure 4.22: Classification loss, SSD + Inception, batch size of 35

Figure 4.23: Localization loss, SSD + Inception, batch size of 35

Figure 4.24: Mean average precision, SSD + Inception, batch size of 35


4.1.3 SSD and MobileNet

In this section, the results when using SSD together with MobileNet are presented.

Baseline

Time needed for training: 1.7 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.25 - Figure 4.27.

Figure 4.25: Classification loss, SSD + MobileNet, batch size of 1

Figure 4.26: Localization loss, SSD + MobileNet, batch size of 1

Figure 4.27: Mean average precision, SSD + MobileNet, batch size of 1


Configuration 1, #1

Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.28 - Figure 4.30.

Figure 4.28: Classification loss, SSD + MobileNet, batch size of 24

Figure 4.29: Localization loss, SSD + MobileNet, batch size of 24

Figure 4.30: Mean average precision, SSD + MobileNet, batch size of 24


Configuration 1, #2

Time needed for training: 19.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.31 - Figure 4.33.

Figure 4.31: Classification loss, SSD + MobileNet, batch size of 35

Figure 4.32: Localization loss, SSD + MobileNet, batch size of 35

Figure 4.33: Mean average precision, SSD + MobileNet, batch size of 35


4.1.4 Summary: Single-Shot Multibox Detector

In this section, all results for the Single-Shot Multibox Detector are combined and plotted in Figure 4.34 to Figure 4.36. The network with the best performance is the SSD with a batch size of 24, irrespective of whether MobileNet or Inception net is used. SSD + Inception with a batch size of 35 performed almost as well as with a batch size of 24 in the loss categories, and its mean average precision converged after the same number of epochs as with batch size 24. Using a batch size of 1 gave high losses and a low mAP.

Figure 4.34: Classification loss, SSD

Figure 4.35: Localization loss, SSD

Figure 4.36: Mean average precision, SSD


4.1.5 Summary: Different Network Architecture And Batch Sizes

In Figure 4.37 and Figure 4.38, the classification loss and the localization loss are plotted for all the different network architectures. The results are from the testing dataset. Training times are summarized in Figure 4.40.

Figure 4.37: Classification loss of testing dataset

Figure 4.38: Localization loss of testing dataset

Figure 4.39: Mean average precision of testing dataset

Figure 4.40: Time to train (h):
Faster R-CNN + Inception, batch size 1, 300x300: 3.3
Faster R-CNN + Inception, batch size 1, 600x1040: 5.5
Faster R-CNN + Inception, batch size 24: 22
Faster R-CNN + Inception, batch size 35: 31.2
SSD + Inception, batch size 1: 2.5
SSD + Inception, batch size 24: 14
SSD + Inception, batch size 35: 18.4
SSD + MobileNet, batch size 1: 1.7
SSD + MobileNet, batch size 24: 14
SSD + MobileNet, batch size 35: 19.5


4.2 Epochs Versus Dataset Size

In this section, the results of varying the training dataset size are shown in Figure 4.41 to Figure 4.43. It is shown in subsection 4.1.5 that Faster R-CNN + Inception had the best performance in all three categories. Therefore this network configuration, with a batch size of 24, was chosen. The training ran for 400 epochs, and the dataset used was the testing data from dataset B.

Figure 4.41: Classification loss, varying dataset size

Figure 4.42: Localization loss, varying dataset size

Figure 4.43: Mean average precision, varying dataset size


4.3 Testing on real images

The interesting metric in this case is the mean average precision, which is shown in Figure 4.44.

Figure 4.44: Mean average precision on dataset C (evaluation on real data):
Faster R-CNN: 10 percent: 0.72
Faster R-CNN: 50 percent: 0.78
Faster R-CNN: 100 percent: 0.81
SSD Inception: 0.73
SSD MobileNet: 0.15


4.4 Automatic and Manual Annotations

The time needed to generate the dataset via manual and automatic annotation is shown in Table 4.1.

Table 4.1: Time needed to create the dataset via manual and automatic annotation

                  Manual   Automatic
Number of images  3561     7806
Time              6 h      38 min

Figure 4.45 shows the histogram of the IoU between the manually and automatically created ground truth data.

Figure 4.45: Intersection over union, manual vs automatic annotation ground truth

A comparison of manually and automatically annotated data with SSD + MobileNet was done, and the results are shown in Figure 4.46 to Figure 4.51. A batch size of 24 with 3561 images was used.


[Line plot: classification loss over 100,000 batches, manual vs. automatic annotation, training dataset.]

Figure 4.46: Classification loss comparison, training dataset


[Line plot: classification loss over 100,000 batches, manual vs. automatic annotation, validation dataset.]

Figure 4.47: Classification loss comparison, validation dataset


[Line plot: classification loss over 100,000 batches, manual vs. automatic annotation, testing dataset.]

Figure 4.48: Classification loss comparison, testing dataset


[Line plot: localization loss over 100,000 batches, manual vs. automatic annotation, training dataset.]

Figure 4.49: Localization loss comparison, training dataset


[Line plot: localization loss over 100,000 batches, manual vs. automatic annotation, validation dataset.]

Figure 4.50: Localization loss comparison, validation dataset


[Line plot: localization loss over 100,000 batches, manual vs. automatic annotation, testing dataset.]

Figure 4.51: Localization loss comparison, testing dataset


5 Discussion

In this chapter, the results from chapter 4 are analyzed.

5.1 Networks

In the following sections, the results for each network configuration are discussed: first the Single-Shot Multibox Detector, followed by Faster R-CNN.

5.1.1 Single-Shot Multibox Detector

It is shown in Figure 4.34 and Figure 4.35 that increasing the batch size to a moderate size gives better results for SSD networks; however, a very large batch size yields a poorer outcome. A batch size of 35 gave worse results for both SSD + Inception and SSD + MobileNet in the loss category. For the mean average precision, see Figure 4.36, a batch size of 35 gave the same result as a batch size of 24 for the SSD + Inception network: it converged to 1. The training loss was unstable at each run for all network architectures and batch sizes, which could be a sign of overfitting, but a high learning rate or regularization could also cause this pattern.

5.1.2 Faster R-CNN and Inception

A batch size of 1 had an mAP converging to roughly 0.58, a batch size of 35 converged to 0.6, and a batch size of 24 converged to approximately 0.95. One reason why a batch size of 1 performed worst could be that the network had not yet learned enough features. The network suffered from overfitting when using a batch size of 35. It is also shown in Figure 4.39 that using an image size of 660x1040 with a batch size of 1 is as good as using a batch size of 24 with an image size of 300x300.

5.2 Epochs Versus Dataset Size

It is shown in section 4.2 that using the smallest dataset, 10% of the total number of images, gave the worst classification and localization losses. In addition, it required longer training time for the mean average precision to converge. Using 50% or 100% of the dataset gave approximately the same classification and localization losses, and the mean average precision converged fastest when using 50% of the dataset.

5.3 Testing On Real Images, Video Sequence

The mAP decreased for all the networks when comparing the metrics between synthetic data and real images. On synthetic data the networks' mAP converges to 1 (see Figure 4.39), while Figure 4.44 shows that the best result when testing the networks after 100,000 batches on real data is 0.81. The decrease in performance might be explained by the scale of the objects in each image. An example of a frame can be seen in Figure 5.1, where the distance to the camera is larger than in the training images, which can be seen in Appendix A.

Figure 5.1: Frame from video sequence

Another reason for the worse performance could be the sharpness of the images. Even though some of the training images had blur added to them as a pre-processing step, the networks had trouble with objects being out of focus.
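For illustration, blur augmentation of this kind can be implemented with OpenCV; the file name, kernel size, and sigma below are assumptions, not the values used in this work:

    import cv2

    img = cv2.imread("render.png")                 # hypothetical rendered training image
    blurred = cv2.GaussianBlur(img, (5, 5), 1.5)   # 5x5 Gaussian kernel, sigma = 1.5

A fixed Gaussian blur only roughly mimics real camera defocus, which varies with depth and motion, so some mismatch with out-of-focus video frames is to be expected.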

It is shown in Figure 4.44 that all the networks performed approximately the same, except for SSD + MobileNet.

5.4 Annotation: Manual vs Automatic

As seen in Table 4.1, the time necessary to manually annotate the images was much longer: annotating 3561 images took 6 h, while the automatic method produced 7806 annotated images in 38 minutes, roughly a twentyfold difference in rate. The manual annotation was limited to 3561 images due to this time constraint.

It is shown in Figure 4.46 to Figure 4.51 that the two losses are higher for the manually annotated data in every case, implying that the automatically generated ground truth yields higher accuracy. The reason for this can be seen in Figure 4.45: human error.


6 Conclusions and Future Work

In this chapter, the conclusions and future work are presented.

6.1 Conclusions

Several conclusions can be drawn from this thesis. A neural network for object detection can be fine-tuned using synthetic data to detect other objects. Of the three network architectures used, Faster R-CNN + Inception had the best accuracy, while also taking the longest time to train.

The results further show that longer training time does not necessarily give the best result; what mattered was the size of the dataset and the batch size. The larger the dataset, the higher the accuracy, yet too large a batch size results in overfitting.

A large dataset requires a lot of labeling. If automatically generated ground truth can both increase accuracy and reduce the amount of manual labor, then large datasets would no longer be a problem. Less manual labor also decreases the chance of human error.

6.2 Future Work

Easy access to ground truth data is achievable by generating synthetic data automatically. Instead of saving the bounding box of an object, an object mask could be used. An object mask means that only the pixels belonging to the object are marked. The reason one would want to save an object mask instead of a bounding box is that a bounding box contains background noise, while an object mask contains only the interesting pixels, i.e. the object. It would be interesting to verify whether this improves the accuracy further.
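To make the mask idea concrete, a minimal sketch, assuming the rendering pipeline can export the object mask as a binary NumPy array; note that the bounding box can still be recovered from the mask, so no information is lost:

    import numpy as np

    def bbox_from_mask(mask):
        # Tight box (x1, y1, x2, y2) around the nonzero pixels of a binary mask.
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None  # object not visible in this render
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())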

Also, in this thesis the networks were fine-tuned. An interesting next step would be to train a neural network from scratch using only computer-generated images, in order to verify that synthetic data is suitable for learning from scratch as well.


Appendix


A Datasets

Two different datasets were used throughout; dataset B was only used for testing purposes.

Dataset A

Dataset A consists of five different objects and contains 5392 images. The objects can be seen in Figure A.6 to Figure A.10.

Dataset B

Dataset B is a video recorded by a web camera of the real objects in real life.

Figure A.1: handle
Figure A.2: car
Figure A.3: eStop


Figure A.4: cable protector

Figure A.5: turn knob

Figure A.6: Attachment
Figure A.7: Shelf plug
Figure A.8: Dowel
Figure A.9: Expandable plug

Figure A.10: Screw



[36] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully con-volutional networks. 2015 IEEE International Conference on ComputerVision (ICCV), Computer Vision (ICCV), 2015 IEEE International Confer-ence on, Computer Vision, IEEE International Conference on, page 3119,2015. ISSN 978-1-4673-8391-2. URL https://ieeexplore.ieee.org/document/7410714. Cited on page 6.