MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF MIDDLE EAST TECHNICAL UNIVERSITY
BY
IBRAHIM BATUHAN AKKAYA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF MASTER OF SCIENCE IN
ELECTRICAL AND ELECTRONICS ENGINEERING
SEPTEMBER 2016
Approval of the thesis:
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
submitted by IBRAHIM BATUHAN AKKAYA in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Electronics Engineering Department, Middle East Technical University by,
Prof. Dr. Gülbin Dural Ünver
Dean, Graduate School of Natural and Applied Sciences

Prof. Dr. Tolga Çiloğlu
Head of Department, Electrical and Electronics Engineering

Prof. Dr. Uğur Halıcı
Supervisor, Electrical and Electronics Engineering Department, METU
Examining Committee Members:
Prof. Dr. Gözde Bozdağı Akar
Electrical and Electronics Engineering Department, METU

Prof. Dr. Uğur Halıcı
Electrical and Electronics Engineering Department, METU

Assoc. Prof. Dr. İlkay Ulusoy
Electrical and Electronics Engineering Department, METU

Assoc. Prof. Dr. Emine Eren Koçak
Inst. of Neurological Sci. and Psychiatry, Hacettepe Uni.

Assist. Prof. Dr. Elif Vural
Electrical and Electronics Engineering Department, METU
Date: 09.09.2016
I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.
Name, Last Name: IBRAHIM BATUHAN AKKAYA
Signature :
ABSTRACT
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
AKKAYA, İbrahim Batuhan
M.S., Department of Electrical and Electronics Engineering
Supervisor: Prof. Dr. Uğur Halıcı
September 2016, 97 pages
Laboratory mice are frequently used in biomedical studies. The facial expressions of mice provide important data about various issues. For this reason, real-time tracking of the mouse face provides output both to the researcher and to software that operates directly on the face image. Since the body and face of a mouse are the same color and mice move fast, tracking the face of a mouse is a challenging task. In recent years, methods that use artificial neural networks have provided effective solutions to problems such as classification, decision making and object recognition, thanks to their ability to abstract the training dataset. In particular, convolutional neural networks, which are inspired by the visual cortex of animals, are very successful in computer vision tasks.
In this study, a method based on deep learning which uses convolutional neural networks is proposed for real-time tracking of the face of a mouse. Convolutional neural networks are good at extracting hierarchical features from the training dataset. High level features contain semantic information, and low level features have high spatial resolution. Target information is extracted from the combination of low and high level features by a convolutional layer to achieve a robust and accurate tracker. Although the proposed method is specialized in tracking the face of a mouse, it can be adapted to any target by changing the training dataset.
Keywords: Convolutional Neural Networks, Machine Learning, Object Tracking
ÖZ
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
AKKAYA, İbrahim Batuhan
M.S., Department of Electrical and Electronics Engineering
Supervisor: Prof. Dr. Uğur Halıcı
September 2016, 97 pages
Laboratory mice are frequently used in biomedical studies. During these studies, the facial expressions of mice provide the researcher with important data, giving clues about many issues. For this reason, real-time tracking of the mouse face during an experiment provides output both for the researcher and for software that works directly on the face. The fact that the bodies of laboratory mice are the same color as their faces, and that mice are very mobile, makes tracking the face of a mouse quite difficult. In recent years, methods developed on the basis of artificial neural networks, thanks to their ability to abstract the training dataset, have offered effective solutions to problems in many areas such as classification, decision making and object recognition. In particular, convolutional neural networks, inspired by the visual cortex of animals, have given very successful results in visual applications.
In this study, a deep learning based method using a convolutional neural network is proposed for real-time tracking of the mouse face in videos. Convolutional neural networks are successful at extracting hierarchical features from the training dataset. High level features contain semantic information, and low level features have high spatial resolution. To obtain a robust and accurate tracker, target information is extracted from the low and high level features using a convolutional layer. Although the proposed method is specialized in tracking the mouse face, it can be adapted to any target by changing the training dataset.
Keywords: Convolutional Neural Networks, Machine Learning, Object Tracking
To my wife, and to the Akkaya and Öztürk families...
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my supervisor Prof. Dr. Uğur Halıcı for her supervision, encouragement and guidance. It was a great honor to work with her. I also would like to thank the METU Computer Vision and Smart Systems Research Laboratory and the Hacettepe University Neurological Sciences and Psychiatry Institute, Behavior Experiments Research Laboratory members for creating the mice database.
I wish to thank ASELSAN A.Ş. for giving me the opportunity to continue my postgraduate education.
I am thankful for the support of TÜBİTAK (The Scientific and Technological Research Council of Turkey) through the BİDEB 2210 graduate student fellowship during my M.Sc. education.
This study is partially supported under TÜBİTAK project 115E248, Automatic Evaluation of Pain Related Facial Expression in Mice (Mice-Mimic) Project.
I am also grateful to my wife Burcu for her support, patience and belief in me.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ÖZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
CHAPTERS
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Thesis . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 BACKGROUND INFORMATION ON DEEP LEARNING . . . . . . 7
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Biological Neuron . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Sigmoid Neurons . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Artificial Neural Network Architectures . . . . . . . . . . . 14
2.7 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . 15
2.8 Back Propagation Algorithm . . . . . . . . . . . . . . . . . 16
2.9 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Initialization and Optimizers . . . . . . . . . . . . . . . . . 28
2.11 Convolutional Neural Networks . . . . . . . . . . . . . . . . 35
2.12 Some Popular CNN Architectures . . . . . . . . . . . . . . . 39
3 LITERATURE SURVEY . . . . . . . . . . . . . . . . . . . . . . . . 43
4 PROPOSED METHOD . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . 55
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 Data Augmentation . . . . . . . . . . . . . . . . . 59
4.3 Off-line Training . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 On-line Tracking . . . . . . . . . . . . . . . . . . . . . . . . 62
5 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . . . . . . 65
5.1 Performance Criteria . . . . . . . . . . . . . . . . . . . . . . 65
Center Error . . . . . . . . . . . . . . 65
Region Overlap . . . . . . . . . . . . 66
Tracking Length . . . . . . . . . . . . 66
Failure Rate . . . . . . . . . . . . . . 67
5.2 Test networks . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Test Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Effect of Convolutional Layer . . . . . . . . . . . 73
5.4.2 Effect of Low Level Features and Convolutional Layer in Feature Fusion Networks . . . . . . . . . 75
5.4.3 Effect of Depth of Low Level Features . . . . . . . 77
5.4.4 Effect of Depth of High Level Features . . . . . . 78
5.4.5 Overall Comparison . . . . . . . . . . . . . . . . 80
5.5 System Performance . . . . . . . . . . . . . . . . . . . . . . 82
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
APPENDICES
A BLOCK DIAGRAMS OF TEST NETWORKS . . . . . . . . . . . . 93
LIST OF TABLES
TABLES
Table 2.1 Convolutional Layers of LeNet-5 Network . . . . . . . . . . . . . . 39
Table 2.2 Layers in VGG-CNN-F Network . . . . . . . . . . . . . . . . . . . 40
Table 2.3 Differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Table 2.4 Differences among ConvNet Networks of VGG . . . . . . . . . . . 41
Table 5.1 Summary of Test Networks . . . . . . . . . . . . . . . . . . . . . . 69
Table 5.2 Tracker speeds of the C^5_{2,5}−C1−F^3 Network and Test Networks . . 83
LIST OF FIGURES
FIGURES
Figure 2.1 Hierarchical Features . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 2.2 Dataset Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 2.3 Biological Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 2.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Figure 2.5 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 2.6 Recurrent Network . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2.7 Multi Layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 2.8 Network with dropout . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 2.9 LeNet-5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 2.10 Local connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 2.11 VGG-CNN-F Network . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4.1 Proposed Tracker Network . . . . . . . . . . . . . . . . . . . . . . 57
Figure 4.2 Video Record Setup . . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 4.3 Target Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Figure 4.4 Augmented Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 4.5 Tracker Failure Example . . . . . . . . . . . . . . . . . . . . . . . 63
Figure 4.6 Width Ratio Histogram . . . . . . . . . . . . . . . . . . . . . . . . 64
Figure 5.1 Performance Measures Correlation . . . . . . . . . . . . . . . . . 67
Figure 5.2 True Positive vs Region Overlap Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.3 True Positive vs Normalized Center Error Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.4 Failure Rate vs Region Overlap Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . 75
Figure 5.5 True Positive vs Region Overlap Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . . . . . 76
Figure 5.6 True Positive vs Normalized Center Error Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . 76
Figure 5.7 Failure Rate vs Region Overlap Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . . . . . 77
Figure 5.8 True Positive vs Region Overlap Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . . . . . 78
Figure 5.9 True Positive vs Normalized Center Error Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . 79
Figure 5.10 Failure Rate vs Region Overlap Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . . . . . 79
Figure 5.11 True Positive vs Region Overlap Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . 80
Figure 5.12 True Positive vs Normalized Center Error Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . 81
Figure 5.13 Failure Rate vs Region Overlap Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . 81
Figure 5.14 Robustness vs Accuracy Plot of All Trackers . . . . . . . . . . . . 82
Figure A.1 C^5_{0,5}−C0−F^4 Network . . . . . . . . . . . . . . . . . . . . . 93
Figure A.2 C^5_{0,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 94
Figure A.3 C^5_{2,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 94
Figure A.4 C^5_{3,5}−C0−F^4 Network . . . . . . . . . . . . . . . . . . . . . 95
Figure A.5 C^5_{3,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 95
Figure A.6 C^5_{4,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 96
Figure A.7 C^5_{1,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 96
Figure A.8 C^5_{2,4}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 97
Figure A.9 C^5_{2,4}−C2−F^3 Network . . . . . . . . . . . . . . . . . . . . . 97
LIST OF ABBREVIATIONS
MSE Mean Squared Error
CNN Convolutional Neural Network
FC Fully Connected
UHD Ultra High Definition
FPS Frame Per Second
SGD Stochastic Gradient Descent
NAG Nesterov Accelerated Gradient
RMS Root Mean Square
ADAM Adaptive Moment Estimation
VOT Visual Object Tracking
AUC Area Under Curve
ILSVRC ImageNet Large Scale Visual Recognition Challenge
VGG Visual Geometry Group
CHAPTER 1
INTRODUCTION
1.1 Motivation and Overview
In biomedical studies, laboratory mice are frequently used. In some studies, the facial expressions of the mice give important clues to researchers. However, detection and analysis of the face of a mouse requires extra labor, and automating that process would save time. Therefore, tracking the face of a mouse is an important application in biomedical areas. With a successful tracker, researchers need to define only the initial location of the mouse's face; the tracker algorithm then finds the face in the following frames by tracking it.
Recently, deep learning has become one of the most popular methods in the machine learning field. This artificial intelligence approach merges representation learning with classification or regression methods. There is no need for human intervention in deep learning algorithms: since features are learned from example data, the algorithms can easily adapt to the training data space. This property makes deep learning very adaptable to different kinds of problems. Some application fields of deep learning are object detection [1, 2, 3], object recognition [4, 5, 6], pose estimation [7], image segmentation [8], image stylization [9], image classification [10], age and gender classification [11], activity recognition [12] and object tracking.
In this study, the main goal is to implement a real-time algorithm that tracks the face of a mouse. In object tracking applications, the purpose is to track an initially defined target as long as the target is in the video frame. There are difficulties that trackers may encounter, such as fast and abrupt motion, variation in pose, cluttered background, occlusion, object deformation and illumination changes. A good tracker should be able to follow the target without being affected by these difficulties. In recent studies, deep learning algorithms have been used to overcome these difficulties in tracking applications.
Different kinds of methodology can be used to implement an object tracking algorithm. In recent years, deep learning algorithms have been used in object tracking and have achieved state-of-the-art performance. The Visual Object Tracking (VOT) challenge has been held every year since 2013. In the VOT challenge, trackers are benchmarked on test video sequences with different attributes. Deep learning based trackers entered the VOT challenge in 2015 for the first time. In the VOT 2015 challenge, three visual object trackers, namely MDNet [56], DeepSRDCF [48] and SODLT [58], were based on convolutional neural networks. Among 62 trackers, MDNet took first and DeepSRDCF took second place in terms of the region overlap ratio performance criterion. Since single object trackers based on deep learning algorithms perform better, this study focuses on deep learning based trackers. In this thesis, background information about deep learning, a literature survey on single object trackers that use artificial neural networks, the proposed method and tests of the proposed method are given.
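The region overlap ratio used as a performance criterion above is the intersection-over-union of the predicted and ground-truth bounding boxes. A minimal sketch follows; the (x, y, width, height) box format is an assumption for illustration, not the VOT toolkit's exact API.

```python
def region_overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x, y, w, h) tuples; 1.0 means identical boxes,
    0.0 means no overlap at all.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width/height of the intersection rectangle (clamped at zero).
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Identical boxes overlap fully:
print(region_overlap((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0
# A box shifted by half its width: intersection 50, union 150 → 1/3
print(region_overlap((0, 0, 10, 10), (5, 0, 10, 10)))
```

Ranking trackers by this measure rewards predictions that match the ground truth in both position and size, unlike center error, which ignores box size.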
The mouse face tracker proposed in this study is implemented with a deep learning algorithm. As the deep learning network model, deep bounding box regression with a convolutional neural network, which is very powerful in vision tasks, is used.
1.2 Organization of the Thesis
In Chapter 1, introductory information about this thesis is supplied to the reader. The motivation behind this thesis, its organization and its contributions are described in this part.
In Chapter 2, the background needed to understand why deep learning has become so powerful and how it works is explained. Although the first studies on artificial neural networks were made in the 1950s, they have gained popularity only in recent years. In order to understand what has made artificial neural networks so powerful today, a brief history of neural networks is presented. Deep learning is a mathematical algorithm inspired by the human brain, and a pure explanation of the mathematical model is not enough to understand its mechanism. Therefore, Chapter 2 also describes the biological model of the neuron and its basic mechanism. After the biological neuron model, different kinds of artificial neural networks are explained, from a single neuron model to deep neural network structures. The learning mechanism of neural networks with the back-propagation algorithm is also described, together with the weak points of training and how to overcome them.
In the recent past, deep learning algorithms have started to be used in tracking. Different methods with different architectures have been proposed. While early studies used multilayer perceptrons, recent studies have focused on convolutional neural networks with transfer learning. Tracking algorithms based on deep neural networks are reviewed in detail in Chapter 3.
In Chapter 4, the tracker proposed in this thesis is presented. The network architecture of the proposed tracker is composed of four networks used together. Firstly, the structure of the networks and the layers that form them are stated. For fast tracking, the neural network proposed in this thesis is trained purely off-line and no model adaptation is performed. Dataset generation and the training of the network are also explained in this chapter. Finally, how tracking is performed on a video sequence is described.
In Chapter 5, the performance of the proposed tracker is evaluated. Firstly, the performance criteria for tracking are stated and the performance measures are given. In addition, 9 test networks with different architectures are proposed in order to evaluate the performance of the proposed tracker. The performance of these trackers is presented with graphical visualizations.
1.3 Contributions
There have been some studies on tracking using deep neural network architectures. Most of these trackers are trained on-line starting from the first frame. Generally, they generate patches around the target and label them as positive or negative according to the heuristics they use. Their networks are trained with these patches. In these methods, a large number of forward passes, proportional to the number of patches, is made, and training is performed at test time. These forward passes and training are very time consuming; therefore, on-line trained trackers are very slow.
The tracker proposed in [13] is trained purely off-line and finds the track with a single forward pass. It uses pre-trained networks as feature extractors: two identical pre-trained networks extract semantic features from two consecutive frames, and an additional network, composed of fully connected layers, localizes the target from the concatenation of these features.
In this thesis, the tracker proposed in [13] is taken as the starting point. However, Held's tracker is trained with generic objects, whereas laboratory mice have characteristic properties that should be considered. Two of the most important ones are that laboratory mice are albino and that they are very mobile. Since the body and face of a laboratory mouse are the same color and mice move fast, tracking the face of a mouse is a challenging task. To overcome these problems, the neural network should adapt to mouse-specific features. Within this thesis, a mice dataset is generated in order to train the neural network. The target area is chosen to be square, since the face of a mouse usually fits in a square bounding box, and the network architecture is modified so that it tracks the target without deforming the square shape.
Although high level features are useful for identifying the object in a given image, they cannot localize the target precisely due to their large receptive field. If only high level features are used, the network cannot regress to the bounding box of the target precisely. In the proposed method, low and high level features are used together.
Concatenated features still contain spatial information, since there is no fully connected layer, which would distort spatial information, before the concatenation. Convolutional layers are better at exploiting spatial information. In addition, the depth of the input is taken into account in the convolution operation, which means that features related to the content along the depth are extracted as well. In the proposed method, the first layer of the last network, which is responsible for regressing to the target bounding box, is replaced with a convolutional layer in order to keep spatial information and merge all features along the depth.
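The depth-merging step described above can be sketched with a toy example. The feature-map sizes, channel counts and the 1x1 kernel below are illustrative assumptions, not the exact thesis architecture: they only show how a convolution mixes all depth channels while preserving spatial layout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concatenated features: low-level maps keep spatial
# detail, high-level maps are semantic; both assumed resampled to the
# same 16x16 grid and stacked along the depth axis.
low = rng.normal(size=(64, 16, 16))     # (depth, height, width)
high = rng.normal(size=(128, 16, 16))
fused_input = np.concatenate([low, high], axis=0)  # depth 192

# A 1x1 convolution: at every spatial position, all 192 depth channels
# are linearly mixed into 32 output channels. Unlike a fully connected
# layer, the 16x16 spatial layout is untouched.
weights = rng.normal(scale=0.01, size=(32, 192))
out = np.einsum('oc,chw->ohw', weights, fused_input)

print(out.shape)  # (32, 16, 16): spatial resolution preserved
```

A fully connected layer in the same place would flatten the 16x16 grid into one vector and lose which activation came from which position; the convolution keeps that information available for the bounding box regression that follows.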
With these contributions, the performance of the tracker is increased significantly.
CHAPTER 2
BACKGROUND INFORMATION ON DEEP LEARNING
From birth to death, the primal skills of humans improve subconsciously. The senses constantly supply data to the human brain, where this information is processed with a supervised feedback mechanism. The remarkable learning power of the human brain is what makes humans so adaptive and intelligent.
Artificial intelligence algorithms aim to achieve human-like decision making, visual perception and so on. Some training algorithms of artificial intelligence are inspired by biological learning: researchers try to build computational models of the training mechanism of the brain. Deep learning is a method for artificial intelligence inspired by the biological brain. It is a type of machine learning algorithm that learns from example data, called the training set. Human intervention such as feature extraction from the input is not necessary, since the algorithms can be trained purely with the training data. A deep learning network architecture is also called an artificial neural network due to its resemblance to the brain.
2.1 Overview
Although artificial neural networks have a long history, they have become much more popular recently. They became more powerful with the increasing amount of available training data. Over time, both the hardware and software environments for neural networks have improved, which makes it possible to train more complex and bigger networks with large amounts of training data. As a result, neural networks can solve more complex problems with lower error rates, and they have gained popularity.
Figure 2.1: Hierarchical features extracted from a deep neural network, from the Deep Learning Book [14]
Deep neural networks have a hierarchical structure: simple neural layers are connected on top of the previous one. If the whole network architecture is examined, a deep chain of connections is seen; this is why the approach is called deep learning. Due to this hierarchical structure, a neural network can learn complicated concepts. Every layer of the network generates features from the previous layer, and as depth increases, more complicated features of the input data can be learned. Figure 2.1 shows how a neural network represents its input hierarchically.
The performance of machine learning algorithms depends on the representation of the data. Each piece of information in this representation is called a feature. In classical machine learning algorithms, features of the data are supplied to the artificial intelligence system by feature extractors designed by humans, for decision making, classification, etc. Although many problems can be solved by supplying suitable features to machine learning algorithms, it is sometimes hard to decide which features should be extracted for a given problem. In neural networks, not only the outputs but also the features are learned from examples. This property is called representation learning.
Figure 2.2: Dataset size increase over time, from the Deep Learning Book [14]

With the help of representation learning, neural networks are able to adapt to new tasks without human intervention. One of the best examples of representation learning with a neural network is the auto-encoder. This algorithm is composed of two functions: an encoder, which extracts features from the original data, and a decoder, which reconstructs the original data from the features the encoder extracted. When this network is trained on a training dataset, it learns to extract features that are specific to that dataset.
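The encoder/decoder scheme described above can be sketched with a minimal linear auto-encoder trained by gradient descent. The toy data, dimensions and learning rate are illustrative assumptions; real auto-encoders use nonlinear layers and larger datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions that really live on a
# 2-dimensional subspace, so a 2-unit code can reconstruct them.
basis = rng.normal(size=(2, 8))
data = rng.normal(size=(200, 2)) @ basis

# Encoder W_e maps 8 -> 2 features; decoder W_d maps 2 -> 8 back.
W_e = rng.normal(scale=0.1, size=(8, 2))
W_d = rng.normal(scale=0.1, size=(2, 8))

lr = 0.02
for _ in range(5000):
    code = data @ W_e        # encoder: extract features
    recon = code @ W_d       # decoder: rebuild the input from features
    err = recon - data
    # Gradient descent on the mean squared reconstruction error.
    grad_d = code.T @ err / len(data)
    grad_e = data.T @ (err @ W_d.T) / len(data)
    W_d -= lr * grad_d
    W_e -= lr * grad_e

print(np.mean(err ** 2))  # reconstruction error shrinks toward zero
```

Because reconstruction is the only training signal, the learned code ends up capturing exactly the structure present in the training data, which is the sense in which the features are "specific to the training dataset."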
2.2 History
Although the first studies on artificial neural networks were made in the 1950s, successful commercial applications appeared in the 1990s. Until the 90s, available datasets were limited, and some skill was necessary to get good results from deep learning algorithms with limited data. As dataset size increases, the need for expertise decreases: with more training data, a deep learning algorithm becomes better at generalizing the input data and therefore gives better performance. Dataset sizes have increased over time; Figure 2.2 shows that they have grown remarkably. According to [14], if an artificial neural network is trained with 5,000 labeled examples per category, it will achieve acceptable performance for supervised learning; if it is trained with a dataset of 10 million labeled examples, it will match or exceed human performance. Reaching successful results with small datasets and exploiting large amounts of unlabeled data is still an important research area.
Another reason for the success of artificial neural networks is that the computational resources needed to run much larger networks are available today. According to the connectionism approach, animals become intelligent when their large number of neurons work together; an individual neuron, or a small number of them, is not capable of building an intelligent system. It appears that artificial neural networks also work like that.
Until recently, the number of neurons in artificial neural networks was very small compared to the biological neural system of the mammalian brain. Since the introduction of the hidden layer, the size of artificial neural networks has doubled roughly every 2.4 years. If biological neural networks are examined, it is seen that biological neurons are not densely connected relative to the number of neurons in the brain. There are approximately 86 billion neurons in the human brain [15], and they make approximately 10,000 connections per neuron. Today, some neural networks make nearly as many connections per neuron as cats, at around 8,000 connections per neuron [16]; artificial neural networks are thus close to the human brain in terms of connections per neuron. If the growth trend continues, by the 2050s artificial neural networks will have the same number of neurons as the human brain.
Growth in network size has been made possible by improved hardware and software infrastructure. As network size increases, the memory required for the weights and the computational power needed for training and evaluation increase. As stated before, artificial neural networks are composed of layers, each connected on top of the previous one: a layer takes the output of the previous one and operates on it, but the neurons within each layer work in parallel. With improvements in general purpose GPUs, distributed computing software on GPUs has become available. Deep learning frameworks such as Theano [17], Caffe [18] and TensorFlow [19] are designed to exploit this property: they use the parallel processing capability of GPUs to evaluate the network connectivity quickly. Today GPUs provide much faster computation than CPUs for neural networks, due to the large number of processors on a GPU. With the help of improvements in memory size and distributed computing, artificial neural network size has increased significantly.
Early networks were able to recognize only a limited number of categories, while modern networks can recognize more than 1,000 different categories. The object recognition contest ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is held each year.
Figure 2.3: Biological Neuron, from Wikimedia [21]
Algorithm results are evaluated by a performance criterion called the top-5 error rate: the algorithm outputs its 5 most probable classes among the 1,000, and a prediction is counted as erroneous if the correct class is not among them. In 2012, Krizhevsky et al. [10] reached state-of-the-art performance with convolutional neural networks, bringing the top-5 error from 26.1% down to 15.3%. The contest was won by deep convolutional networks in the following years as well. In 2015, ResNet [20] won ILSVRC 2015 with a 3.57% top-5 error, which is at human level. Deep learning is also used in many other fields such as speech recognition, image segmentation, pedestrian detection and object tracking.
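The top-5 error rate described above can be sketched directly; the toy scores below are illustrative, not ILSVRC data.

```python
def top5_error(predictions, labels):
    """Top-5 error rate: a sample counts as an error only if its true
    class is not among the five highest-scoring classes."""
    errors = 0
    for scores, label in zip(predictions, labels):
        # Indices of the five highest scores.
        top5 = sorted(range(len(scores)),
                      key=lambda c: scores[c], reverse=True)[:5]
        if label not in top5:
            errors += 1
    return errors / len(labels)

# One toy sample with 10 classes: the true class (index 4) only ranks
# fifth by score, yet it still counts as correct under top-5.
scores = [[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0]]
print(top5_error(scores, [4]))  # → 0.0
print(top5_error(scores, [5]))  # → 1.0 (index 5 ranks sixth)
```

This is why top-5 error is far more forgiving than top-1 accuracy on a 1,000-class problem, and why the 26.1% to 15.3% drop in 2012 was considered so significant.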
This shows that, with further improvements in computational resources and datasets, artificial neural networks may provide solutions to much more sophisticated problems in the future.
2.3 Biological Neuron
Neurons are specialized cells in the human brain. The human cognitive system is composed of a large number of neurons: around 86 billion exist in the human brain, and on average each makes 10,000 connections to others. In this network, each neuron behaves as an information processing unit. A single neuron is not very intelligent; it is the brain, a connection of a large number of neurons, that constitutes human cognition.
A typical neuron is composed of a soma, dendrites and an axon. Figure 2.3 shows an illustration of a neuron cell. The soma is the body of the cell. Dendrites can be thought of as the inputs to the neuron, and the axon as its output. Although the working mechanism of a neuron is very complicated in detail, simplified models can easily be expressed in algebraic form. The basic mechanism of a neuron is as follows.
If impulses that reach the neuron via its dendrites cause the soma potential to exceed some threshold value, the neuron fires, that is, it sends an electrical pulse along its axon. The axon of a neuron is connected to the dendrites of other neurons. Therefore, the pulse of one neuron excites other neurons and may cause them to fire. These consecutive firings constitute the human cognition system.
The firing process of a neuron is very slow compared to computers. Even the fastest neurons in the brain fire at a rate of around 200 Hz [22], which may seem far slower than commercial computers. However, neurons operate simultaneously, and the firing of one neuron may trigger more than one other. The coordinated performance of the 86 billion neurons in the brain is what makes humans so intelligent.
2.4 Perceptron
The power of the brain encouraged researchers to work on brain-like systems inspired by the biological neuron. One of the earliest types of artificial neural network is the perceptron. Perceptrons [23] were developed by Frank Rosenblatt in the late 1950s. The perceptron is a simple mathematical model of the biological neuron. It takes several binary inputs and produces one binary output; there is only one output, but any number of inputs can be defined. Figure 2.4 shows the graphical representation of a perceptron.
Binary inputs are multiplied by weights, which are real numbers. If the sum of the weighted inputs exceeds a predefined threshold value (also a real number), the neuron outputs 1; otherwise it outputs 0. The algebraic form of the perceptron is given in (2.1).
\[
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j + b \le \text{threshold} \\
1 & \text{if } \sum_j w_j x_j + b > \text{threshold}
\end{cases}
\tag{2.1}
\]
Figure 2.4: Graphical representation of perceptron
Mainly, the perceptron is used in decision-making problems. Let the inputs of the perceptron be some conditions and the output be whether an action should be performed or not. By choosing appropriate weights and a bias value, a decision-making algorithm is obtained. If more than one layer is used, more complex decision-making algorithms can be designed; in that case, the algorithm decides to do something by evaluating the decisions from the previous layer.
Another application in which the perceptron is used is basic logical operations. A NAND gate can be implemented with a perceptron, and since the NAND gate is universal (any logical operation can be implemented with NAND gates), any logical computation can be made with a collection of perceptrons. However, without automatic tuning, an artificial neural network does not provide any improvement over standard logical operations. Therefore, the need emerged for a learning algorithm that adjusts the weights and biases using data. In learning, the main purpose is to obtain the desired output by adjusting the weights. However, a small change in a weight may not affect the output of the perceptron, because the output is a step function. That makes training perceptrons hard. This difficulty is solved by sigmoid neurons.
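As an illustration, the perceptron rule (2.1) with hand-picked weights implementing a NAND gate can be sketched as follows (the function and parameter names are illustrative, not from the thesis):

```python
def perceptron(inputs, weights, bias, threshold=0.0):
    """Perceptron rule (2.1): output 1 if the weighted sum of the inputs
    plus the bias exceeds the threshold, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > threshold else 0

# NAND gate: with weights (-2, -2) and bias 3 the output is 0 only
# when both binary inputs are 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron((x1, x2), (-2, -2), 3))
```

With these weights, the weighted sums for the four input pairs are 3, 1, 1 and -1, so only the (1, 1) case falls at or below the threshold and yields 0.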
2.5 Sigmoid Neurons
The algebraic model of the sigmoid neuron is almost the same as that of the perceptron. The difference is that sigmoid neurons take real-valued inputs and output the sigmoid of the weighted sum of the inputs and the bias. The sigmoid function is plotted in Figure 2.5.

Figure 2.5: Sigmoid Function

The explicit function of the sigmoid neuron is given in equation (2.2). Any small change in a weight directly changes the output of the sigmoid neuron. Even though the change in the output is small, if corrections of the output error are repeated iteratively, satisfactory results are obtained. In fact, it is the smoothness of the sigmoid function that makes training possible. The effect of a weight change on the output is given by the partial derivative of the output with respect to that weight. The change in the output in terms of these partial derivatives is shown in equation (2.3), where w represents the weights and b represents the bias. If the output function is not differentiable, this effect cannot be expressed in algebraic form, which makes training very hard. In the following sections, how these properties make training possible will be explained.
\[
f(x) = \frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)} \tag{2.2}
\]

\[
\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b \tag{2.3}
\]
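The approximation in (2.3) can be checked numerically. A minimal sketch (names are illustrative): the derivative of the sigmoid output with respect to a weight is sigma'(z) x_j, and a small weight change produces almost exactly that change in the output.

```python
import math

def sigmoid_neuron(x, w, b):
    """Output of a sigmoid neuron, equation (2.2)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def d_output_d_w(x, w, b, j):
    """Partial derivative of the output w.r.t. weight j: a(1 - a) * x_j."""
    a = sigmoid_neuron(x, w, b)
    return a * (1 - a) * x[j]

# Equation (2.3): a small change dw in one weight changes the output by
# approximately (d output / d w_j) * dw.
x, w, b = [0.5, -1.0], [0.8, 0.3], 0.1
dw = 1e-4
approx = d_output_d_w(x, w, b, 0) * dw
actual = sigmoid_neuron(x, [w[0] + dw, w[1]], b) - sigmoid_neuron(x, w, b)
print(abs(approx - actual) < 1e-8)  # the two agree to first order
```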
2.6 Artificial Neural Network Architectures
Basically, if a large number of artificial neurons are connected to each other, the result is called an artificial neural network. Artificial neural networks are classified according to their connection types. There are two types of artificial neural network architecture, namely feed-forward and recurrent neural networks.
Figure 2.6: Recurrent Artificial Neural Network
If a network architecture contains backward connections, the network is called a recurrent neural network. These backward connections create loops in the network: neurons fire for a while until the network reaches a steady state, so the neural network constitutes a dynamical system. The states of the neurons (firing or not) change while the network is not in steady state, and the firing of neurons stimulates other neurons. When the network reaches a steady state, the states stabilize. It is expected that this steady state of the network corresponds to the desired data. Figure 2.6 shows an example of a recurrent neural network. Each line represents a connection in the direction of the arrow, and each connection has its own connection weight. Since the method proposed in this thesis does not include recurrent neural networks, they will not be detailed in the following sections.
In a feed-forward neural network, there are only connections in the forward direction. Feed-forward neural networks will be presented in detail in the following sections.
2.7 Multilayer Perceptron
Feed-forward neural networks are composed of three different types of layers: the input, output and hidden layers. Neurons in the input layer are called input neurons. Each of them is responsible for feeding data into the network; an input neuron can be thought of as a neuron with no input and one output, which is the data itself. Neurons in the hidden and output layers are regular artificial neurons with multiple inputs and one output. Neurons in the output layer (output neurons) emit the output data of the network.
Hidden neurons do not have a special property; they are called hidden because the outputs of these neurons are not observable by the user. Although the design of a neural network architecture can be tricky, the design of the input and output layers is very straightforward: the number of neurons in the input layer is equal to the dimension of the data, and the number of output neurons is equal to the number of outputs needed. For example, the number of neurons in the output layer is 1000 for a classification problem with 1000 different classes. There are different kinds of feed-forward neural networks. If the network consists of one layer in which all input neurons are connected to all output neurons, it is called a single layer perceptron or a single layer fully connected network. If the network contains one or more hidden layers, it is called a multilayer perceptron or simply a fully connected network. In these networks, the inputs of each hidden neuron are connected to all neurons in the preceding layer, and the output of each hidden neuron is connected to the inputs of every neuron in the following layer. Although these networks are called multilayer perceptrons, the neurons do not need to be perceptrons in general; they can be sigmoid neurons or neurons with different activation functions.
In fact, in recent neural networks, the ReLU activation is used more frequently. The ReLU (rectified linear unit) is a one-input, one-output function: if the weighted input is bigger than zero, the input is transferred to the output identically; otherwise, the output is zero. ReLU increases sparsity and overcomes the vanishing gradient problem that is faced with the sigmoid function.
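The ReLU and its gradient can be sketched in a few lines (illustrative, not from the thesis):

```python
def relu(z):
    """ReLU: transfer positive weighted inputs identically, clamp the rest to 0."""
    return z if z > 0 else 0.0

def relu_grad(z):
    """The gradient is exactly 1 for positive inputs, so it does not shrink
    as errors are propagated backward (unlike the sigmoid's gradient)."""
    return 1.0 if z > 0 else 0.0

print([relu(z) for z in (-2.0, 0.0, 1.5)])  # [0.0, 0.0, 1.5]
```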
A simple multilayer perceptron architecture with 2 hidden layers, 3 inputs and 1 output is given in Figure 2.7.
2.8 Back Propagation Algorithm
The main goal in training is to obtain the desired output for a given input. To evaluate how well this goal is achieved, a cost function is used. The cost function is a non-negative function that penalizes the error between the network output and the desired output. An example cost function is given in (2.4).
Figure 2.7: Multilayer perceptron architecture with 2 hidden layers, 3 input neurons and 1 output neuron
\[
C(w, b) = \frac{1}{2N} \sum_{x=1}^{N} \| y(x) - a(w, b) \|^2 \tag{2.4}
\]
In the equation, w corresponds to all weights and b to all biases in the network, N is the total number of training inputs, y(x) is the desired output, and a is the network output for the given w and b. For each x, a different a is obtained by feeding x forward through the network. This cost function is called the mean squared error (MSE) function. If it is examined, it can be seen that the cost is non-negative and gets close to zero when the network outputs, i.e. the predictions of the network, are close to the desired ones. Conversely, the cost grows with the square of the prediction error. As stated above, the main goal of a training algorithm is to reach the desired output by minimizing the cost function.
One of the most effective (and most widely used) algorithms for training feed-forward neural networks is the back propagation algorithm, which gained popularity after the publication of the famous paper [24] in 1986. Briefly, the back propagation algorithm computes the gradients of the cost function with respect to the weights and biases. The gradient gives the direction in the weight and bias space in which the cost function increases most. By subtracting the gradients from the weights and biases, the algorithm tries to minimize the cost.
Before getting into detail, let’s go over the notation that will be used in this thesis.
$w^l_{jk}$ denotes the connection weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. $b^l_j$ denotes the bias of the $j$th neuron in the $l$th layer. $a^l_j$ denotes the activation value of the $j$th neuron in the $l$th layer. The output value of an individual neuron is called its activation; in more algebraic terms, for sigmoid neurons it is the sigmoid of the weighted sum of the inputs and the bias. The activation function does not need to be the sigmoid; in general it is represented with $\sigma$. The generic form of the activation of a neuron is given in (2.5).
\[
a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \tag{2.5}
\]
In this thesis, vectorized function representation is used. Vectorization means that the function is applied to every element of an input in vector form. The vectorized representation of (2.5) is shown in (2.6).

\[
a^l = \sigma(w^l a^{l-1} + b^l) \tag{2.6}
\]
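The vectorized forward pass (2.6) can be sketched with NumPy; the layer sizes and names below are illustrative, not the network of the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(a, weights, biases):
    """Apply equation (2.6) layer by layer: a^l = sigma(w^l a^{l-1} + b^l)."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Toy network: 3 inputs -> 4 hidden -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (4, 3)), rng.normal(0, 0.1, (2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(feed_forward(np.array([1.0, 0.5, -0.5]), weights, biases).shape)  # (2,)
```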
Two assumptions about the cost function should be satisfied in order to apply back propagation. The first is that the cost function can be written as the average of the cost functions of the individual training samples. This assumption is necessary because, to compute the partial derivatives of the cost function with respect to the weights and biases, the partial derivatives of the cost of a single training sample are calculated first; the derivative of the cost function is then obtained by averaging these individual values. The second assumption is that the cost function can be written in terms of the outputs of the network.
In this section, the mean squared error function will be used to illustrate the back propagation algorithm.
The back propagation algorithm is based on some algebraic operations. One of them is the Hadamard product, the element-wise product of two matrices. In this thesis, the Hadamard product is denoted by $\odot$.
As stated above, the back propagation algorithm is based on taking partial derivatives of the cost function. In order to compute those derivatives, a term that represents an intermediate error should be defined: the error of the $j$th neuron in the $l$th layer, denoted $\delta^l_j$. This error will be helpful in simplifying the equations. Another useful quantity is the weighted input, which is the weighted sum of the inputs and the bias. In algebraic form, the weighted input is shown in (2.7).
\[
z^l = w^l a^{l-1} + b^l \tag{2.7}
\]
$\delta^l_j$ is defined in (2.8): it measures how much the cost changes for a small change in the weighted input, and can therefore be interpreted as an error.

\[
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j} \tag{2.8}
\]
Starting from the general cost definition, $\delta^l$ will be computed and related to the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. Back propagation can be defined with the help of four equations.
1. The first equation is the error at the output layer. The output error is the error in the cost function caused by its weighted input $z^L$, where the uppercase $L$ denotes the output layer. The derivation of the output error is as follows.

\[
\delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{2.9}
\]
By applying the chain rule to the derivative above, it can be expressed in terms of the output activations.

\[
\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \tag{2.10}
\]
The activation of a neuron depends only on its own weighted input. Therefore, when $k$ is not equal to $j$, $\partial a^L_k / \partial z^L_j$ vanishes, and the equation above can be written in a simpler form.

\[
\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \tag{2.11}
\]
Since the activation of a neuron is $\sigma(z^L_j)$, the output error can be written as (2.12).

\[
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{2.12}
\]
In matrix form, it can be represented with the Hadamard product.

\[
\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{2.13}
\]
The output error depends on the form of the cost function, where $\nabla_a C$ is the gradient of $C$ with respect to $a$. Computing the derivatives of a complex cost function may not be resource friendly; however, if an appropriate cost function is selected, the result is easily computable. If the MSE function (2.4), which is the example case of this thesis, is used, the derivative of the cost with respect to the activations is very simple.

\[
\partial C / \partial a^L_j = (a^L_j - y_j) \tag{2.14}
\]
If the error at the output is written in matrix form, it becomes (2.15), which can be easily computed.

\[
\delta^L = (a^L - y) \odot \sigma'(z^L) \tag{2.15}
\]
2. The second equation gives the error $\delta^l$ of a hidden layer in terms of the error of the next layer. An algebraic expression for $\delta^l$ in terms of $\delta^{l+1}$ is needed; it can be derived with the help of the chain rule.

\[
\delta^l_j = \frac{\partial C}{\partial z^l_j} \tag{2.16}
\]
\[
= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{2.17}
\]

\[
= \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \tag{2.18}
\]
To obtain a more simplified expression, $\partial z^{l+1}_k / \partial z^l_j$ will be derived. The more explicit form of $z^{l+1}_k$ is as follows.
\[
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k \tag{2.19}
\]
Differentiating with respect to $z^l_j$ yields a simpler form.
\[
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j) \tag{2.20}
\]
Substituting this into the error definition results in the backward error propagation:

\[
\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(z^l_j) \tag{2.21}
\]
In matrix form:

\[
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{2.22}
\]
By using (2.13) and (2.22), the errors in any layer can be computed. First, the output layer error $\delta^L$ is computed; using it, the error $\delta^{L-1}$ in the layer before the output can be computed, and so on. Iteratively, all errors in the network can be computed by backward error propagation.
3. The third equation is the rate of change of the cost with respect to the biases. It can also be derived with the help of the chain rule.

\[
\frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial a^l_k} \frac{\partial a^l_k}{\partial b^l_j} \tag{2.23}
\]
As noted above, the partial derivative of the activation with respect to the bias is zero when $k$ is not equal to $j$. The equation above can therefore be simplified as follows:

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial b^l_j} \tag{2.24}
\]
Writing the partial derivative of the activation with respect to the bias in terms of the partial derivative of the activation with respect to the weighted input and the partial derivative of the weighted input with respect to the bias, and substituting into the equation above, results in:

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \sum_k \frac{\partial a^l_j}{\partial z^l_k} \frac{\partial z^l_k}{\partial b^l_j} \tag{2.25}
\]
Since the partial derivative of the weighted input with respect to the bias is equal to 1 when $k$ equals $j$ and zero otherwise, the partial derivative of the cost with respect to the bias turns out to be the error term $\delta^l_j$.

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} = \frac{\partial C}{\partial z^l_j} = \delta^l_j \tag{2.26}
\]
4. The last equation needed by the back propagation algorithm is the rate of change of the cost with respect to the weights. It can be written as follows.

\[
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} \tag{2.27}
\]
Notice that the first term is $\delta^l_j$ and the second term corresponds to $a^{l-1}_k$ (check equation (2.7)). The simple form of the fourth equation can be written as:

\[
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{2.28}
\]
Since the four fundamental equations have been derived, the back propagation algorithm can be defined in terms of them. As stated before, its main purpose is to compute the partial derivatives of the cost function with respect to the weights and biases. After the partial derivatives are calculated, they are multiplied by a constant called the learning rate and subtracted from the weights and biases. In this way, the cost function is minimized. The step-by-step procedure of the back propagation algorithm is given below.
Calculation of gradients:
1. The training data is given to the network. The input corresponds to the activation of the input layer, $a^1$.
2. The input is fed forward layer by layer using $z^l = w^l a^{l-1} + b^l$ and the activation function.
3. After all activations are calculated, the output layer included, the output layer error is computed using the first equation (2.13).
4. The error is back propagated through to the input layer using the second equation (2.22).
5. The gradients of the cost function are calculated using the third (2.26) and fourth (2.28) equations.
Note that only the gradient of one training sample is considered above. In practice, however, training samples are given in batches. The whole back propagation training algorithm for batch training is as follows:
1. Initialize all weights and biases.
2. For each training sample in the batch, calculate the gradients according to the procedure above.
3. Apply gradient descent on the weights and biases by averaging the gradients of all samples in the training batch, where $\alpha$ is the learning rate and $N$ the number of samples in a batch.
\[
w^l_{jk} \rightarrow w^l_{jk} - \frac{\alpha}{N} \sum_{x=1}^{N} \frac{\partial C}{\partial w^l_{jk}} \tag{2.29}
\]

\[
b^l_j \rightarrow b^l_j - \frac{\alpha}{N} \sum_{x=1}^{N} \frac{\partial C}{\partial b^l_j} \tag{2.30}
\]
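The whole procedure above can be sketched in NumPy. This is a minimal illustration of the four equations and the batch update (2.29)-(2.30) for a small sigmoid network with the MSE cost; layer sizes and names are illustrative, not the implementation of the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradients of the MSE cost for one training sample."""
    # Steps 1-2: feed forward, storing weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    sp = lambda z: sigmoid(z) * (1 - sigmoid(z))  # sigma'(z)
    # Step 3: output error, equation (2.15).
    delta = (activations[-1] - y) * sp(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_w[-1] = np.outer(delta, activations[-2])  # equation (2.28)
    grads_b[-1] = delta                             # equation (2.26)
    # Step 4: back propagate the error, equation (2.22).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sp(zs[-l])
        grads_w[-l] = np.outer(delta, activations[-l - 1])
        grads_b[-l] = delta
    return grads_w, grads_b

def update_batch(batch, weights, biases, alpha):
    """One gradient-descent step (2.29)-(2.30): average the per-sample
    gradients, then subtract them scaled by the learning rate."""
    n = len(batch)
    sum_w = [np.zeros_like(W) for W in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        gw, gb = backprop(x, y, weights, biases)
        sum_w = [s + g for s, g in zip(sum_w, gw)]
        sum_b = [s + g for s, g in zip(sum_b, gb)]
    weights = [W - (alpha / n) * s for W, s in zip(weights, sum_w)]
    biases = [b - (alpha / n) * s for b, s in zip(biases, sum_b)]
    return weights, biases
```

Repeated calls to `update_batch` decrease the MSE cost on the batch, which is exactly the minimization described in the text.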
2.9 Regularization
As noted, the weights and biases are the free parameters of artificial neural networks. In modern networks, the number of free parameters may be very large. Although this is what makes neural networks so powerful, it also brings some disadvantages, one of the most important of which is overfitting.
Learning algorithms try to minimize the cost function, that is, the error between the training set and the network output. Since a neural network has so many free parameters, after some point it starts to memorize the training set and loses its ability to generalize from the input dataset. In other words, although the value of the cost function on the training dataset keeps decreasing, the cost on the test dataset starts to increase. This is called overfitting. In this part, how to avoid overfitting will be explained.
One method to avoid overfitting is to increase the size of the training dataset. If the training dataset does not cover the problem with a satisfactory number of cases, the network cannot generalize and gives correct results only for cases that are close to the ones in the dataset. If the amount of data in the dataset is increased, the network will output more accurate predictions.
There may be cases where no more training data can be supplied. In these cases, artificial data generation, called data augmentation, can be used. Consider the human face detection problem, where the training dataset obviously contains human faces. In order to expand the training dataset, rotating the training images by, say, up to 10 degrees, or mirroring them horizontally, may be used. In both cases the modified image still contains a human face and can be used as a training sample.
Another method is to decrease the number of free parameters in the network. Although this may solve the overfitting problem, decreasing the number of free parameters lowers the power of the artificial neural network.
If the problem is complex and no more training data can be generated, these two methods cannot be applied. Fortunately, there is another method that can be used with a fixed network size and a fixed dataset: regularization.
One of the most commonly used regularization techniques is L2 regularization, in other words weight decay. The idea is to add an extra term, called the regularization term, to the cost function. Usually, weight decay is not applied to the bias terms. The generic definition of the regularized cost is given in (2.31).
\[
C = C_0 + \frac{\lambda}{2n} \sum_w w^2 \tag{2.31}
\]
The sum of the squares of all weights is added to the cost function. The regularization term is scaled by $\lambda/2n$, where $n$ is the number of weight parameters and $\lambda$ is called the regularization parameter. As an example, the regularized mean squared error function is given in (2.32).
\[
C(w, b) = \frac{1}{2N} \sum_{x=1}^{N} \| y(x) - a \|^2 + \frac{\lambda}{2n} \sum_w w^2 \tag{2.32}
\]
Let's check how the regularization term affects training. To understand its effect, the weight update equation for the regularized cost should be derived. The rate of change of the regularized cost function with respect to the weights is given in (2.33).
\[
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \tag{2.33}
\]
The gradient descent weight update rule for the regularized mean squared error cost function is given by:

\[
w \rightarrow \left( 1 - \frac{\alpha \lambda}{n} \right) w - \alpha \frac{\partial C_0}{\partial w} \tag{2.34}
\]
As can be seen from equation (2.34), the regularization term rescales the weights by $1 - \alpha\lambda/n$. Since $\lambda$ is a positive value, in every iteration the weights are forced towards zero.
There is a variant of the L2 norm called the L1 norm. In the L1 norm, the sum of the absolute values of all weights is added to the cost function instead of the sum of squares. The general definition of the L1 regularized cost function is given in (2.35).
\[
C = C_0 + \frac{\lambda}{n} \sum_w |w| \tag{2.35}
\]
The gradient descent weight update for the L1 regularized cost function is as follows, again using the mean squared error as the cost function.

\[
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w) \tag{2.36}
\]

\[
w \rightarrow w' = w - \frac{\alpha \lambda}{n} \operatorname{sgn}(w) - \alpha \frac{\partial C_0}{\partial w} \tag{2.37}
\]
If the L1 norm is used, a constant value, $\frac{\alpha\lambda}{n}\operatorname{sgn}(w)$, is subtracted from the weights at each iteration.
Both L1 and L2 regularization try to minimize the weights. The L2 norm affects the weight update in proportion to the weight magnitude, while the L1 norm has a constant effect. If the weights are small, L1 drives them to zero more aggressively than the L2 norm, which may cause oscillation around zero. Since the effect of the L2 norm is proportional to the weight value, it does not cause oscillation, and regularization becomes faster than the L1 norm for large values of $w$.
If a regularized cost function is used in training, small weights will be preferred; a large weight is preferred only when it makes $C_0$ small. After training, only the weights that decrease the cost function will be large. In other words, the weights of the features that are most distinctive for the training set will be large, and learning distinctive features improves generalization.
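The contrast between the two update rules (2.34) and (2.37) can be sketched on a single weight (the values are illustrative):

```python
def l2_step(w, grad_c0, alpha, lam, n):
    """L2 update (2.34): rescale the weight by (1 - alpha*lam/n), then descend."""
    return (1 - alpha * lam / n) * w - alpha * grad_c0

def sign(w):
    return (w > 0) - (w < 0)

def l1_step(w, grad_c0, alpha, lam, n):
    """L1 update (2.37): subtract the constant-magnitude term alpha*lam/n * sgn(w)."""
    return w - (alpha * lam / n) * sign(w) - alpha * grad_c0

# With a zero data gradient, L2 shrinks the weight proportionally each step,
# while L1 subtracts a fixed amount regardless of the weight's magnitude.
w = 0.5
for _ in range(3):
    w = l2_step(w, grad_c0=0.0, alpha=0.1, lam=1.0, n=10)
print(w)  # 0.5 * 0.99^3
```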
Another regularization method is called dropout. Dropout is different from L2 and L1 regularization: the cost function is not modified; the network is modified instead. Normally, during the training of a neural network, all neurons participate in feed-forward and back propagation. With dropout, however, a randomly selected, predefined percentage (the dropout ratio) of the neurons is removed from the network, and feed-forward and back propagation are applied only to the remaining neurons. After the weight update, the removed neurons are restored and a new randomly selected set of neurons is removed.
An example of a multilayer perceptron network with dropout applied is shown in Figure 2.8.
Say the dropout ratio is 0.5, which is the most common case. When feed-forward is applied for inference, the full network is used, which means the number of active hidden neurons in inference is twice that in training. To compensate for this, the weights of the hidden neurons are divided by two.
Figure 2.8: Multi layer perceptron with dropout from [25]
Why dropout prevents overfitting and improves test performance is not straightforward. To understand it, imagine there are two identical networks trained with the same training dataset but with a different sample order. After training, they will most probably end up with different weights, and they will also overfit differently.
Consider using these two networks for inference on a specific input. They will give different results, and which network gives the true result is unclear. Voting, or averaging the outputs of the networks, may be a powerful strategy to decide the true output. Therefore, the average of differently overfitted networks is expected to give a result that is not overfitted and to improve test performance. Since different neurons are active in different training iterations, using dropout in training is like training many different networks, and the output of a network trained with dropout is a kind of average of the outputs of different networks.
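The mechanics described above, that is, masking random neurons during training and rescaling the weights for inference, can be sketched as follows (names are illustrative, not from the thesis):

```python
import random

def dropout_mask(n, ratio, rng=random):
    """Keep each of n hidden neurons with probability (1 - ratio)."""
    return [0.0 if rng.random() < ratio else 1.0 for _ in range(n)]

def apply_dropout(activations, mask):
    """Zero out the activations of the removed neurons during training."""
    return [a * m for a, m in zip(activations, mask)]

def scale_for_inference(weight_row, ratio):
    """At inference the full network is used, so hidden weights are scaled
    by (1 - ratio) -- a division by two for the common ratio of 0.5."""
    return [w * (1 - ratio) for w in weight_row]

random.seed(0)
acts = [0.2, 0.9, 0.5, 0.7]
mask = dropout_mask(len(acts), ratio=0.5)
print(apply_dropout(acts, mask))  # roughly half the activations zeroed
```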
2.10 Initialization and Optimizers
To get good performance from a network, its initialization and training are very important. Karpathy et al. give valuable information about these concepts in the lecture notes of the convolutional neural networks for visual recognition course at Stanford University [26]. Initialization and parameter update strategies are explained below.
The back propagation algorithm tries to minimize the cost function with respect to the free variables of the network, namely the weights and biases. In order to avoid local minima, the starting state of the free variables is very important; a good initialization methodology is crucial to obtain good performance from the trained network.
The final state of the weights is unknown before training. However, empirical results suggest that, with proper training data, roughly half of the weights will end up positive and half negative. Since the mean of the weights is close to zero, initializing all weights with zero may seem logical. However, when all weights are zero, the output is the same for every training sample, so the back propagation algorithm propagates the same error. Because of that, symmetry should be broken in the weight initialization. Although assigning zero to all weights is a bad idea, assigning random numbers close to zero, which breaks the symmetry, is applicable. One of the common approaches in weight initialization is assigning random numbers with a Gaussian distribution; a variance of 0.01 is a practical value.
However, it is not always true that initializing with small numbers provides better performance. When the weights are small, the back propagation algorithm computes small gradients, and these gradients become even smaller as they are propagated back through the network to the input layer. For a deep network, the gradients may become so small that the weight updates barely affect the network. For such cases, bigger initial values should be considered.
When a network is initialized with a Gaussian distribution, all neurons are initialized with the same variance. In that case, the variance of the output of neurons with a large number of inputs becomes larger. The variance of the weighted input $s$ is derived in equation (2.38), where it is assumed that the means of the weights and the inputs are 0. The equations show that the variance is proportional to the number of inputs.
\[
\begin{aligned}
\operatorname{Var}(s) &= \operatorname{Var}\left( \sum_i^n w_i x_i \right) \\
&= \sum_i^n \operatorname{Var}(w_i x_i) \\
&= \sum_i^n [E(w_i)]^2 \operatorname{Var}(x_i) + [E(x_i)]^2 \operatorname{Var}(w_i) + \operatorname{Var}(x_i)\operatorname{Var}(w_i) \\
&= \sum_i^n \operatorname{Var}(x_i)\operatorname{Var}(w_i) \\
&= \left( n \operatorname{Var}(w) \right) \operatorname{Var}(x)
\end{aligned}
\tag{2.38}
\]
A solution is to adjust the variance of the Gaussian with respect to the number of inputs. Glorot et al. [27] proposed an initializer based on this idea: a Gaussian distribution whose variance is two divided by the sum of the number of neurons in the current layer and the number of neurons in the next layer. The explicit equation is given in (2.39), where $n_l$ is the number of neurons in the $l$th layer. This initializer is called the Xavier initializer.
\[
\operatorname{Var}[w^l] = \frac{2}{n_l + n_{l+1}} \tag{2.39}
\]
In this thesis, the Caffe framework [18] is used. In Caffe, the Xavier initializer is implemented with respect to only the number of neurons in the input of the layer. The variance of the Gaussian distribution in Caffe's Xavier initializer is given in (2.40).
\[
\operatorname{Var}[w^l] = \frac{2}{n_l} \tag{2.40}
\]
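A minimal sketch of the two variance choices (2.39) and (2.40) as stated in the text; this follows the equations above, not Caffe's actual source code, and the names are illustrative.

```python
import numpy as np

def xavier_init(n_in, n_out, fan_in_only=False, rng=None):
    """Gaussian weights with variance 2/(n_in + n_out), equation (2.39);
    with fan_in_only=True, use 2/n_in as in (2.40)."""
    if rng is None:
        rng = np.random.default_rng()
    var = 2.0 / n_in if fan_in_only else 2.0 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

W = xavier_init(300, 100, rng=np.random.default_rng(0))
print(W.shape, float(W.var()))  # empirical variance near 2/400 = 0.005
```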
Another important aspect of training is the parameter update mechanism. As discussed in the back propagation section, the parameters are updated with gradient descent algorithms, where the gradients are calculated by the back propagation algorithm.
There are three types of gradient descent algorithm in terms of the number of training samples used per update. In stochastic gradient descent (SGD), the gradients are calculated and the parameters are updated for every individual training sample. The batch gradient descent algorithm updates the parameters by averaging the gradients of all training samples in the dataset; one pass over all training samples is called an epoch. In the mini-batch gradient descent algorithm, the training samples are divided into fixed-size groups; for every group, the gradients are averaged and the parameters are updated. The group size is called the batch size. Mini-batch gradient descent is the most frequently used algorithm when the number of samples in the training set is large.
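The grouping of samples into mini-batches can be sketched as follows (names are illustrative):

```python
import random

def minibatches(samples, batch_size, rng=random):
    """Shuffle once per epoch, then yield fixed-size groups of samples.
    The last group may be smaller if the dataset size is not divisible."""
    samples = list(samples)
    rng.shuffle(samples)
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

random.seed(0)
batches = list(minibatches(range(10), batch_size=3))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```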
There are also different variants of gradient descent in terms of the update mechanism. These algorithms try to optimize training speed and performance by modifying the gradient descent algorithm. An informative overview of gradient descent algorithms is given in [28] by Ruder. These parameter update methodologies are explained below, based on stochastic gradient descent.
In the standard stochastic gradient descent (SGD) algorithm, also called vanilla SGD, a parameter is updated by subtracting the weighted gradient from itself. The multiplier of the gradient is called the learning rate and is represented with $\alpha$. Vanilla SGD is given in (2.41).
\[
w = w - \alpha \frac{dC}{dw} \tag{2.41}
\]
The gradient of the cost function with respect to a parameter represents the direction of steepest increase in the cost. By updating the parameter with the negative of the gradient, the cost is pushed down in the steepest descent direction. However, the stochastic gradient descent algorithm has some drawbacks. SGD is very slow near ravines, which are common around local minima. Near ravines, some directions are steeper than the direction towards the minimum; therefore, the cost function starts to oscillate near the minimum instead of converging directly to it.
The momentum method [29] brings a solution to that problem. It is inspired by the momentum phenomenon in physics: an additional velocity term with momentum is added to the gradient descent algorithm, and the parameter update is made according to that velocity. With momentum, the cost changes more smoothly and oscillations are avoided. In addition, local minima can be escaped with the help of the momentum term.
The SGD-with-momentum parameter update is given in (2.42), where $\mu$ is the momentum hyper-parameter; a typical value for $\mu$ is 0.9.

\[
\begin{aligned}
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw} \\
w &= w + v_t
\end{aligned}
\tag{2.42}
\]
A variation of momentum SGD is called Nesterov accelerated gradient (NAG) [30]. In Nesterov's momentum method, the next probable state of the weights is estimated with the help of the velocity term, and the parameter update uses the gradient of the cost function with respect to those estimated weights. This performs better than standard momentum, since the effect of the momentum on the gradient is also considered. The parameter update function is given in (2.43).
\[
\begin{aligned}
w_{\text{ahead}} &= w + \mu v_{t-1} \\
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw_{\text{ahead}}} \\
w &= w + v_t
\end{aligned}
\tag{2.43}
\]
In practice, a derived version of the Nesterov update (2.43) that is more similar to momentum SGD is preferred. The derived version of NAG is given in (2.44).

\[
\begin{aligned}
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw} \\
w &= w - \mu v_{t-1} + (1 + \mu) v_t
\end{aligned}
\tag{2.44}
\]
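The two update rules (2.42) and (2.44) can be sketched on a single parameter; the quadratic cost used in the demo is illustrative.

```python
def momentum_step(w, v, grad, alpha=0.1, mu=0.9):
    """SGD with momentum, equation (2.42)."""
    v = mu * v - alpha * grad
    return w + v, v

def nesterov_step(w, v_prev, grad, alpha=0.1, mu=0.9):
    """Nesterov accelerated gradient in the derived form (2.44)."""
    v = mu * v_prev - alpha * grad
    w = w - mu * v_prev + (1 + mu) * v
    return w, v

# Minimize C(w) = w^2 (gradient 2w) starting from w = 1.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # close to the minimum at 0
```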
Another parameter update strategy that increases performance is annealing the learning rate over time. A high learning rate causes big steps in the cost domain, which is good at the beginning of training, since it is desirable to approach the global minimum as quickly as possible. However, near the minimum, the minimum point may be missed because of those big steps, and settling into it takes a long time. Decreasing the learning rate near the minimum decreases the step size, so the probability of settling into the minimum increases. Annealing the learning rate is a tricky business: if the learning rate is decreased too aggressively, reaching the minimum becomes very slow; if it is decreased too slowly, training takes a long time due to oscillations around the minimum.
There are different implementation strategies for annealing the learning rate. The first one is step decay: the learning rate is reduced by some factor after a predefined number of epochs. The appropriate annealing schedule may change from network to network, so ad-hoc parameters may be necessary; step decay is convenient for such cases because its parameters are easy to control. The other two strategies are exponential decay and 1/t decay. In exponential decay, the learning rate decreases exponentially over time; the update formula is shown in (2.45). In 1/t decay, the learning rate is inversely proportional to time; the update rule is given in (2.46).
α(t) = α0e−kt (2.45)
α(t) = α0/(1 + kt) (2.46)
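The three schedules can be written out directly; the sketch below is our own illustration, with arbitrary example values for the drop factor, the step interval and the decay constant k.

```python
import math

def step_decay(a0, t, drop=0.5, every=10):
    """Step decay: multiply the initial rate by `drop` once every `every` epochs."""
    return a0 * (drop ** (t // every))

def exp_decay(a0, t, k=0.1):
    """Exponential decay, equation (2.45): a(t) = a0 * e^(-k*t)."""
    return a0 * math.exp(-k * t)

def inv_t_decay(a0, t, k=0.1):
    """1/t decay, equation (2.46): a(t) = a0 / (1 + k*t)."""
    return a0 / (1 + k * t)
```

For example, with a0 = 1 and k = 0.1, the 1/t schedule halves the learning rate by epoch 10, while the exponential schedule reduces it to e^(-1) ≈ 0.37.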
The methods discussed up to now manipulate the parameters globally: the same learning rate and momentum are applied to all parameters. If these hyper-parameters are adapted individually for each parameter of the network, better performance can be achieved. Two methods that use per-parameter adaptive learning rates are explained as follows.
The first adaptive learning method is called AdaGrad, proposed by Duchi et al. [31]. It basically updates parameters with small gradients more and parameters with large gradients less, which helps the cost escape from saddle points. AdaGrad is also a good strategy for sparse data. The parameter update formula for AdaGrad is given in (2.47).

Gt = Gt−1 + (dC/dw)²
w = w − (α / √(Gt + ε)) ∗ dC/dw (2.47)
Gt is a variable that holds the sum of squared gradients; its size is equal to the parameter size. ε is a smoothing constant with a typical value of 1e-8. Note that the effective learning rate is reduced for parameters with large gradients and increased for those with small gradients. Empirical results show that ε is a very important parameter; without it, the AdaGrad optimizer performs much worse. The beauty of AdaGrad is that there is no need to tune the learning rate manually: AdaGrad adapts it with respect to the cumulative gradients per parameter. However, this adaptation has a drawback. Gt increases over time due to the accumulation of squared gradients, which eventually makes the effective learning rate so small that gradients no longer contribute to training. The following two algorithms are proposed to solve that problem.
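A minimal NumPy sketch of the AdaGrad update (2.47); the function and variable names are ours, and the values in the example are illustrative.

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad update, following (2.47)."""
    G = G + grad ** 2                     # per-parameter sum of squared gradients
    w = w - lr * grad / np.sqrt(G + eps)  # large accumulated G => smaller effective rate
    return w, G
```

Note that on the very first step the update magnitude is roughly `lr` for every parameter, regardless of gradient scale: gradients of 1.0 and 0.1 produce nearly equal steps, which is exactly the per-parameter normalization described above.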
Remember that the sum of squared gradients causes learning rate decay in AdaGrad. In AdaDelta [32], this term is replaced by a decaying average of squared gradients. The averaging and the decay protect the adaptation term from growing without bound; with this, the effective learning rate does not vanish. γ represents the decay hyper-parameter. After replacing the squared sum of gradients, the formula in (2.48) is obtained.
E[(dC/dw)²]t = γ ∗ E[(dC/dw)²]t−1 + (1 − γ) ∗ (dC/dw)²t
w = w − (α / √(E[(dC/dw)²]t + ε)) ∗ dC/dw (2.48)
The authors of AdaDelta state that the update term should have the same units as the parameter being updated. To achieve this, they defined another variable: the decaying average of squared parameter updates, where the parameter update is denoted ∆w. The equation of the decaying average of squared parameter updates is given in (2.49).

E[(∆w)²]t = γ ∗ E[(∆w)²]t−1 + (1 − γ) ∗ (∆w)²t (2.49)

Note that the denominator of the effective learning rate in (2.48) is the root mean square (RMS) of the gradient, and the square root of (2.49) can likewise be defined as the RMS of the parameter update term. To satisfy the unit balance, the authors used the RMS of the parameter update term in place of the learning rate. The previous value of the parameter update term is used instead of the current one, since the current value is not yet known. When all pieces are merged, the update in (2.50) is obtained.

∆w = (RMS(∆w)t−1 / RMS(dC/dw)t) ∗ dC/dw
w = w − ∆w (2.50)
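The full AdaDelta step can be sketched as follows (our own illustrative naming; as in the AdaDelta paper, RMS(x) is computed as sqrt(E[x²] + ε), and no learning rate is needed):

```python
import numpy as np

def adadelta_step(w, Eg2, Edw2, grad, gamma=0.95, eps=1e-6):
    """One AdaDelta update, following (2.49)-(2.50)."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2            # decaying avg of squared grads
    dw = np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * grad   # RMS(dw)_{t-1} / RMS(g)_t * g
    w = w - dw
    Edw2 = gamma * Edw2 + (1 - gamma) * dw ** 2            # decaying avg of squared updates
    return w, Eg2, Edw2
```

With both running averages initialized to zero, the first steps are small (driven only by ε) and grow as the update statistics accumulate.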
RMSProp [33] is another method that tries to solve the decaying effective learning rate problem of the AdaGrad optimizer. RMSProp and AdaDelta were developed independently at about the same time. RMSProp solves the problem in a simpler way: the learning rate is not discarded; only the squared sum of gradients is replaced by a decaying average of squared gradients. The update rule given in (2.48) is in fact the parameter update rule for RMSProp. Hinton suggests 0.9 for γ and 0.001 for α as good default values.
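Since RMSProp is exactly the intermediate form (2.48), its step is short to write down (an illustrative sketch with Hinton's suggested defaults):

```python
import numpy as np

def rmsprop_step(w, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: equation (2.48) with the learning rate kept."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2  # decaying avg of squared gradients
    w = w - lr * grad / np.sqrt(Eg2 + eps)       # normalized step; Eg2 no longer grows unboundedly
    return w, Eg2
```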
Another frequently used optimizer is the Adaptive Moment Estimation (Adam) optimizer [34]. Adam can be thought of as RMSProp with a momentum term: it keeps the decaying average of squared gradients and, in addition, a decaying average of past gradients, which is similar to momentum. The decaying average terms are given in equation (2.51): the first equation is the decaying average of past gradients and the second is the decaying average of squared gradients. Note that m is an estimate of the first moment (the mean) of the gradients and v is an estimate of the second moment.
mt = β1 ∗ mt−1 + (1 − β1) ∗ dC/dw
vt = β2 ∗ vt−1 + (1 − β2) ∗ (dC/dw)² (2.51)
These two moment estimates are initialized with zeros. Kingma et al. observed that the moments are biased toward zero, especially when β1 and β2 are close to one. To overcome this, they proposed the bias-corrected terms shown in (2.52).
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t) (2.52)

Parameter updates are made with respect to the bias-corrected moments. The parameter update rule for the Adam optimizer is given in equation (2.53). The hyper-parameter values suggested in the paper are β1 = 0.9, β2 = 0.999 and ε = 1e−8.

wt = wt−1 − (α / (√v̂t + ε)) ∗ m̂t (2.53)
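Equations (2.51)-(2.53) combine into one update step. The sketch below is our own illustration (function and variable names are ours), using the paper's suggested defaults; `t` is the 1-based step index used for bias correction.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, following (2.51)-(2.53)."""
    m = b1 * m + (1 - b1) * grad       # first-moment estimate (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction of (2.52)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Thanks to the bias correction, the very first step has magnitude close to `lr` even though m and v start at zero.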
2.11 Convolutional Neural Networks
The convolutional neural network is a variation of the multi-layer perceptron, inspired by the visual cortex of animals. In 1968, Hubel and Wiesel [35] studied the mammalian visual cortex and observed that some neurons in the visual cortex respond to local regions of the visual field, which they called receptive fields. Convolutional neural networks are designed with the help of that idea.
Convolutional neural networks are composed of a sequence of layers. Each layer transforms a given input array, which is usually 3-dimensional, into another array. The most frequently used layers in convolutional networks are convolutional, ReLU, pooling, normalization and fully connected layers. A convolutional network, namely LeNet-5, proposed by LeCun et al. [36], is given in Figure 2.9 as an example. This network is composed of 2 convolutional, 2 pooling and 3 fully connected layers. Each layer type is explained as follows.
In the convolutional layer, neurons make spatially contiguous local connections that form receptive fields. Convolutional neural networks are mostly used in vision applications, where inputs are 2D for gray-scale images and 3D for color images. In Figure 2.10, the connection graph of an example network is shown; for simplicity, the input is represented as 1D. Each neuron makes connections to 3 neurons in the previous layer. Neurons in layer m are affected by 3 neurons in layer m−1, and neurons in layer m+1 are affected by 5 neurons in layer m−1. That means the effective receptive field of neurons at layer m is 3, while that of neurons at layer m+1 is 5. If the connection graph is examined, it is seen that the whole input field is covered, as in the visual cortex.

Figure 2.9: LeNet-5 Architecture taken from [37]

Figure 2.10: Local connectivity in multi layer perceptron
One of the most important properties of the convolutional layer is weight sharing: within the same layer, connections are constrained to share the same weights. For example, in Figure 2.10, the weights of the 3 connections from layer m−1 to layer m are the same for all neurons in layer m; the shared weights are drawn in the same color. With weight sharing, the number of free parameters is greatly reduced.

Convolutional layers have 4 hyper-parameters that control how the layer operates on the input array: kernel size, number of kernels, stride and zero-padding.
1. Kernel size determines the receptive field of each neuron at the output of the convolutional layer. Kernels contain the connection weights of the layer. Assume there is a three-dimensional input image with 3 channels and the kernel size is selected as 5 x 5. Then one neuron at the output of the convolutional layer is connected to a 5 x 5 area in the input. The kernel has the same depth as the input in order to fully cover it; therefore, one output neuron makes connections to 75 input neurons (5 x 5 x 3). To determine the value of an output neuron, the neuron values in its receptive field are multiplied by the kernel values, and the sum of these products gives the neuron value. The output array is generated by sliding the kernel over the input image; this output is called a feature map, and the operation is known as convolution.

2. The kernel number is how many kernels there are in the convolutional layer. Each kernel is slid over the input image and generates one feature map, so the number of feature maps is determined by the number of kernels. These feature maps are concatenated in depth to form a three-dimensional output.

3. Stride is the step size of the kernel slide in the convolution operation. For an image of size 25 x 25 and a kernel of size 3 x 3, there are 23 kernel positions in the horizontal direction if the stride is 1. If the stride is 2, there are 12 kernel positions, since the kernel is slid by 2.

4. The convolution operation makes the output smaller than the input. For an image of size 25 x 25, a kernel of size 3 x 3 and stride 1, the output size will be 23 x 23 x number of kernels. It is practical to add zeros around the input image in order to increase the output size. Zero-padding is the thickness of the border of zeros added around the input. If zero-padding is selected as 1, the padded input size will be 27 x 27 and the output size will be 25 x 25, which is equal to the original input size. The most common usage of zero-padding is to equate the input and output sizes.
The output size of a convolutional layer can be calculated by the following formula, where O is the width of the output, W is the width of the input, F is the kernel size, P is the zero-padding and S is the stride.

O = (W − F + 2P)/S + 1 (2.54)
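Equation (2.54) translates directly into code (a small helper of our own naming):

```python
def conv_output_size(W, F, P, S):
    """Output width of a convolutional layer, equation (2.54): O = (W - F + 2P)/S + 1."""
    assert (W - F + 2 * P) % S == 0, "hyper-parameters do not tile the input evenly"
    return (W - F + 2 * P) // S + 1
```

This reproduces the examples from the list above: a 25 x 25 input with a 3 x 3 kernel gives 23 at stride 1, 12 at stride 2, and 25 with zero-padding of 1.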
Convolutional layers can still be trained with the back-propagation algorithm; the gradient of a shared weight is obtained by summing the gradients computed at every position where that weight is used.

The ReLU layer is responsible for adding non-linearity to its input. ReLU (rectified linear unit) is a one-input, one-output function: if the input is bigger than zero, it is transferred to the output unchanged; otherwise the output is zero. The ReLU layer doesn't change the input size.
The pooling layer takes a rectangular block from the convolution output and subsamples it. There are several pooling methods, such as averaging and taking the maximum; max-pooling, in which the maximum value in the rectangular block is selected, is the most frequently used one. By sliding the rectangular block over the convolutional output, the filter response is sub-sampled. This layer can also be considered a regularization layer.
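Max-pooling over a single 2D feature map can be sketched as follows (an illustrative, loop-based implementation of our own, not optimized code):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max-pooling of a 2D feature map: the maximum of each size x size block is kept."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = block.max()  # keep only the strongest response in the block
    return out
```

With `size == stride` (as in the 2 x 2, stride-2 pooling of LeNet-5), the blocks are non-overlapping and each spatial dimension is halved.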
The normalization layer is used to normalize its input, although in recent networks normalization has become less popular. Local response normalization is one type of normalization layer. According to [10], this normalization helps generalization: for the classification problem in that paper, the test error rate decreased from 13% to 11% with normalization. The normalization formula is given in (2.55), where a is the input, b is the output, N is the depth of the input, x, y and i index the position of the neuron in the first, second and third dimensions, respectively, and k, n, α and β are hyper-parameters used to tune the normalization.
b^i_{x,y} = a^i_{x,y} / ( k + α ∗ Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})² )^β (2.55)
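Equation (2.55) can be implemented as a sum of squared activations over neighboring feature maps at the same spatial position. The sketch below is our own illustration; the default hyper-parameter values (k = 2, n = 5, α = 1e-4, β = 0.75) are the ones used in [10].

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across feature maps, following (2.55).

    `a` has shape (N, H, W): N feature maps of size H x W.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)          # clamp the window at the depth boundaries
        hi = min(N - 1, i + n // 2)
        s = (a[lo:hi + 1] ** 2).sum(axis=0)  # squared sum over neighboring maps
        b[i] = a[i] / (k + alpha * s) ** beta
    return b
```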
The ReLU, pooling and normalization layers don't contain free parameters to train.
Table 2.1: Convolutional Layers of LeNet-5 Network

Layer | Number of kernels | Kernel size | Stride | Padding | Max-pool size | Max-pool stride
1     | 20                | 5x5         | 1      | 0       | 2x2           | 2
2     | 50                | 5x5         | 1      | 0       | 2x2           | 2
2.12 Some Popular CNN Architectures
Convolutional neural networks are very popular in the image processing area. They are capable of extracting features that are discriminative for the training dataset. However, training a convolutional neural network requires a large number of training samples. In recent years, the sizes of publicly available datasets have increased significantly, and convolutional neural networks trained with such rich datasets achieve state-of-the-art performance, especially in image recognition.

It has been observed that the features extracted from these networks are also very discriminative for objects that are not in the training dataset. Since they perform well across different kinds of applications, it is common for pre-trained networks to be used directly in other systems as feature extractors, or for their weights to be transferred and fine-tuned with the target dataset.
Some popular convolutional neural networks are given as follows.
LeNet-5 [37] is a convolutional neural network designed for handwritten digit recognition. It is composed of two convolutional layers, each followed by a pooling layer, and three fully connected layers. The network architecture is given in Figure 2.9. The fully connected layers have 500 and 10 neurons, in order. Details of the convolutional layers are given in Table 2.1.
The Visual Geometry Group (VGG) of the University of Oxford proposed several architectures in two papers, [38] and [39]. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), image recognition algorithms are evaluated: each algorithm makes top-5 guesses from 1000 classes on the test dataset, and performance is measured by the top-5 error.

For ILSVRC-2012, VGG proposed three different CNN architectures, called VGG-CNN-F, VGG-CNN-M and VGG-CNN-S, where F is the
Figure 2.11: VGG-CNN-F Network Architecture
Table 2.2: Layers in VGG-CNN-F Network

Layer | Number of kernels | Kernel size | Stride | Padding | Max-pool size | Max-pool stride
1     | 64                | 11x11       | 4      | 0       | 3x3           | 2
2     | 256               | 5x5         | 1      | 2       | 3x3           | 2
3     | 256               | 3x3         | 1      | 1       | -             | -
4     | 256               | 3x3         | 1      | 1       | -             | -
5     | 256               | 3x3         | 1      | 1       | 3x3           | 2
abbreviation of fast, M the abbreviation of medium and S the abbreviation of slow. They are trained with roughly 1.2 million images taken from the ImageNet [40, 41] dataset. They reach 16.7%, 13.7% and 13.1% top-5 error rates on the ILSVRC-2012 dataset, respectively.

The generic architecture of the VGG networks is given in Figure 2.11, and the layer details of VGG-CNN-F are given in Table 2.2. LRN corresponds to local response normalization.

All three networks share the same generic architecture; the differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S are given in Table 2.3.
Table 2.3: Differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S Net-works
Table 2.4: Differences among ConvNet Networks of VGG
In ILSVRC-2014, the Visual Geometry Group proposed 6 different convolutional neural networks with deeper architectures, called ConvNets. The depth of the ConvNets ranges from 11 to 19 weight layers. A summary of the ConvNets is given in Table 2.4, where convA-B corresponds to a convolutional layer with B kernels of size A x A, and FC-C corresponds to a fully connected layer with C neurons. ConvNets A, A-LRN, B, C, D and E achieve 10.4%, 10.5%, 9.9%, 8.8%, 8.1% and 8.0% top-5 error rates on the ILSVRC-2012 dataset, respectively (ILSVRC-2014 uses the ILSVRC-2012 dataset).
CHAPTER 3
LITERATURE SURVEY
In this chapter, related work in the field of object tracking that uses neural networks is presented.
Tracker algorithms can be divided into two classes, using either a generative or a discriminative approach. In the generative approach, a large number of target candidates is cropped around the target, and the tracker finds the most probable patch. In these methods, the patches are transformed into another space by feature extractor algorithms, and the resulting features are then scored by some classifier. In the discriminative approach, the target position is estimated by directly segmenting the input image into target and background.
When methods that use deep learning in tracking are examined, it is observed that the generative approach is mostly adopted. Since deep learning algorithms can learn a representation of the data from examples, they are very powerful in applications that require feature extraction.
In the following two papers, [42] and [43], a particle filter approach with deep learning feature extractors is used. In the deep learning tracker (DLT) method, Wang et al. [42] designed an object tracker that uses natural image features; the target is tracked with the particle filter principle with the help of these features. The feature extractor is updated in an on-line manner, so the tracker can adapt to difficulties like illumination change, occlusion, etc. The authors used a de-noising auto-encoder, a special type of neural network trained in an unsupervised way, as the feature extractor. The network is trained with 1 million images randomly selected from 80 million 32 x 32 natural images. It is stated that after the network is trained, it is able to extract features that are common to natural images; therefore, it is thought that these features will be distinctive for common targets.
One more layer is added on top of the de-noising auto-encoder for tracking: a multiple-input, one-output fully connected layer used to determine the confidence of the target. For on-line tracking, 1000 patches are drawn according to the particle filter approach. These patches are fed to the network and their confidences are calculated, and the target is determined according to the particle filter approach. If the maximum confidence is below some threshold value, the target appearance has changed significantly; in that case, the whole network is fine-tuned.
CNNTracker [43] is a generic object tracker which uses a convolutional neural network based model with the particle filter methodology. The network used in CNNTracker is composed of two convolutional layers, each with ReLU and max-pooling, and one fully connected layer. The first convolutional layer contains 6 filters of kernel size 5 x 5, followed by ReLU and 2 x 2 max-pooling layers. The second convolutional layer contains 12 filters with 5 x 5 kernel size, also followed by ReLU and 2 x 2 max-pooling layers. The fully connected layer is connected to the flattened output of the second max-pooling layer. The network is pre-trained off-line, and the trained layers are transferred to the network used in on-line tracking. The main difference between the off-line trained network and the on-line tracking network is the output stage: since the network in the pre-training phase is trained with the CIFAR-10 [44] dataset, which has 10 classes, its fully connected layer has ten output neurons; in the on-line tracking network, this 10-output fully connected layer is replaced with a 1-output fully connected layer. This network is fine-tuned during tracking. The output of the network corresponds to the likelihood of the input being the target. This probability is used in the particle filter, where the posterior probabilities of the target patches are calculated by Monte Carlo sampling. The patch with the maximum
probability is labeled as the target. The model update decision is made according to that posterior probability: if the maximum posterior probability of the patch labeled as target is smaller than T1 and larger than T2, where T2 < T1, then the model is updated. The logic behind this rule is as follows. If the posterior probability is larger than T1, the result is reliable and there is no need for training. If the posterior probability is smaller than T2, the result is not reliable; training with that data would distort the model, and tracker performance would decrease due to a weak target model.
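The update rule described above reduces to a simple interval check. The sketch below is our own paraphrase of that logic; the threshold values T1 and T2 are illustrative placeholders, not values taken from the paper.

```python
def should_update_model(p_max, T1=0.8, T2=0.2):
    """CNNTracker-style model-update rule: retrain only when the best
    posterior is uncertain, i.e. strictly between T2 and T1 (T2 < T1)."""
    if p_max >= T1:   # confident result: no need to train
        return False
    if p_max <= T2:   # unreliable result: training would distort the model
        return False
    return True       # T2 < p_max < T1: update the model
```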
One of the commonly used methods in object tracking is correlation filters. A correlation filter is trained with images that represent the target. Thanks to FFT-based training algorithms, correlation filters can be trained very fast. However, in order to achieve good performance, an input that successfully represents the target should be supplied to the correlation filter algorithm.

In the following two papers, [45] and [46], the representation learning property of deep networks is used in correlation filter based trackers.
In [45], a tracker algorithm is implemented by exploiting the specific properties of each convolutional layer. In this paper, one of the famous neural networks, namely VGG-Net [39], is adopted. VGG-Net has five convolutional layer groups; this method uses three of them, the third, fourth and fifth. The abstraction capability of deeper layers is higher than that of lower layers, so the output of the fifth convolutional layer represents more semantic features. However, due to the pooling layers in VGG-Net, spatial details are lost in the deeper layers. From the fifth layer down to the third, abstraction capability decreases but spatial detail increases. Therefore, by combining the properties of these three layers, a tracker that is both accurate and robust can be obtained.

In order to exploit the convolutional layer properties, an adaptive linear correlation filter is trained over the output of each of these convolutional layers. The location of the target is inferred from the outputs of the correlation filters, using a coarse-to-fine approach.
In [46], the authors proposed a tracker method based on convolutional neural networks: the activation values of convolutional layers are used in a discriminative correlation filter based tracker. The authors claim that no fine-tuning for a specific target is necessary, since convolutional layers are capable of generating generic features. In addition, these features contain semantic information about the object, which is very important for target tracking. According to the authors, activations of the first layer give better tracker performance, although deeper layers contain more complex features.
In the method proposed in [46], the data representation stage of two different correlation filter based trackers, namely DCF [47] and SRDCF [48], is replaced by convolutional features. In the DCF method, a discriminative correlation filter is learned from examples; in this case, the examples are the activations of convolutional layer outputs. The on-line update rule in [49] is used to approximate the solution efficiently using the DFT. When a new frame comes, the convolutional features of the target region are circularly correlated with the correlation filter, and the target location is chosen according to the maximum correlation score.
The periodicity assumption in DCF causes a periodic boundary effect, which limits the performance of the tracker. The SRDCF method eliminates this effect by adding a spatial regularization term to the cost function of the filter: the added term is a multiplier on the regularization term that increases proportionally to the distance from the target. This new regularization provides a significant performance boost. For target features, the imagenet-vgg-2048 network [38] is used, which is trained with the ImageNet dataset [40, 41]. Imagenet-vgg-2048 has five convolutional layers; in the proposed method, the ReLU outputs of the convolutional layers are used as features. Images are fed to the network after resizing to 224 x 224 and subtracting the mean. For gray-scale images, the same image is fed to the R, G and B channels. The extracted features are filtered with a Hann window.
Support vector machines are powerful algorithms for data classification, and in the generative approach they can be used as a target classifier. In the following paper, a tracker that uses both deep learning and a support vector machine is described.
In [50], Hong et al. proposed a tracker method that uses convolutional neural networks and a support vector machine. The algorithm tracks the target object with the help of a learned discriminative saliency map, which is generated by back-propagating positive samples through the network. In order to track the target, a sequential Bayesian filtering method is applied to the saliency maps. As the convolutional neural network, the network in [1] is adopted; it is trained off-line with a rich image dataset and is not fine-tuned during tracking.

In the algorithm, the outputs of a hidden layer are used as features. Target candidates are cropped around the target, and their features are classified with a support vector machine as foreground or background. The foreground images are used to generate the saliency map by back-propagating through the same convolutional neural network.
The saliency method was first proposed in [51]. The features in the hidden layer capture the semantic information of the target successfully; however, due to the pooling layers, spatial information is lost. That problem is solved with a generative target appearance model, which is built from the saliency maps of the foreground patches.

In this method, the representation learning property of convolutional neural networks is used: a part of a network pre-trained with a large dataset serves as the feature extractor. It is shown that the outputs of the hidden layer are able to capture semantic information successfully. In addition, the saliency map generated by back-propagating positive images helps to localize the target precisely.
In object tracking studies, there are some solutions that use a neural network purely to localize the target. The following study uses a convolutional neural network in order to evaluate convolutional features.
FCNT [52] uses the same property of convolutional neural networks as [45]: top-layer features work like a class detector, while lower layers carry target-specific properties which can be used to separate the target from its surroundings. The authors merged the power of both layers with a switch mechanism. It is shown that not all feature maps of a convolutional layer are necessary to track a specific target, so the authors proposed a feature map selection method to eliminate irrelevant and noisy features. VGG-Net [39] is used as the feature extractor. The proposed algorithm is as follows. Feature map selection is applied on the conv4-3 and conv5-3 layers. The selected feature maps of conv5-3 form a general network (GNet) and the selected feature maps of conv4-3 form a specific network (SNet). In the first frame, both networks are initialized with a foreground heat map regression method. When a new frame comes, it is fed forward through the network, and heat maps are generated by GNet and SNet. Finally, a distractor detection mechanism decides which
heat map defines the target. Generally, GNet detects the target. However, although GNet contains class-specific information, it may be distracted by other objects of the same class. Say the target is a human and there are two people in the frame; in that case, GNet outputs a heat map that shows two possible target positions. SNet, on the other hand, contains target-specific properties and is therefore more successful at discriminating between two objects of the same class, while GNet is more robust than SNet. This trade-off is managed by the distractor detection mechanism.
In most studies, the transfer learning method is applied by directly transferring connection weights. In the following study by Wang et al. [53], another approach is described.
Wang et al. proposed a convolutional neural network based object tracker using learned hierarchical features. They used the ASLA [54] tracker to evaluate the performance of their learned hierarchical features; however, the proposed feature learning method can be used with different trackers by replacing their feature representations.

In the method proposed in [53], a two-layer convolutional neural network is used to learn hierarchical features. This network is trained with auxiliary video sequences taken from the Hans van Hateren natural scene videos [55]. It is claimed that, when trained with a rich dataset, the network is able to learn features that are robust to different kinds of motion patterns. In order to increase the robustness of the learned features, a temporal slowness constraint is applied: an extra term is added to the loss function which penalizes the difference between the outputs of two consecutive frames. Due to this term, the network is forced to learn motion with small temporal change, which is the most common case in videos.
The authors propose a domain adaptation module to adapt the learned features to new videos. With this mechanism, the learned features are merged with target-specific ones, which makes the tracker robust to both complicated motions and target-specific appearance changes. In many domain adaptation mechanisms, the connection weights of the network are directly transferred to the test network. Wang et al., however, proposed a different approach: they added an additional term to the loss function of the test network which penalizes the difference between the weights of the pre-trained network and the test network. Therefore, the test network merges the advantages of both generic features and target-specific features.
As mentioned before, transfer learning is widely used in object tracking, and the networks whose weights are transferred are mostly trained for classification. Therefore, the features extracted from those networks relate to the notion of an object rather than the motion of an object. In the following study, the network is fine-tuned to learn motion-specific features.
A convolutional neural network based tracker with a novel visual tracking algorithm is proposed in [56]. The convolutional neural network in the algorithm is a combination of two groups of layers: the first group is called the shared layers, and the second the domain-specific layers, which are connected to the shared layers in the training phase. The shared part is composed of three convolutional layers and two fully connected layers; the convolutional layers are identical to conv1-3 of the VGG-M network [38], and the two fully connected layers correspond to fc4-5. The fully connected layers have 512 output units with dropout and ReLU layers. The domain-specific part contains multiple branches, each trained with an individual training domain: for K training sequences, there are K fully connected layers connected to the last fully connected layer of the shared part. Each is a binary classification layer with soft-max cross-entropy loss, whose function is to distinguish target and background. As stated above, the shared layers are common to all training sequences. The main purpose of this network architecture is to obtain a generic target representation in the shared layers: after training on each domain in an iterative way, it is expected that the features common to all training sequences are learned by the shared layers. At test time, the shared layers and the domain-specific layers are separated: the shared layers of the pre-trained convolutional neural network are preserved, and a new, untrained domain-specific layer is added on top of them. This new domain-specific layer is updated on-line.
When a new frame comes, candidate target patches are sampled randomly around the target. Each patch is fed to the network after resizing to the input layer size, and the output of the domain-specific layer shows how likely that patch is to correspond to the target. The network is small compared to AlexNet [10] or the VGG-Nets [39, 38]. The authors give three reasons for picking such a small network. First, the network classifies only two classes, so according to the authors there is no need for a complex model. Second, as the network goes deeper, it loses its spatial sensitivity. Third, since the target size is generally small, the input layer is chosen small, and a small input naturally decreases the depth of the network. It is also shown that a bigger network doesn't improve performance.
Training convolutional neural networks requires many training examples, and training takes a long time. To overcome these issues, a tracker that uses a purely on-line trained convolutional neural network is proposed in [57]. In tracking applications there is only one labeled example, the initialization frame, so an effective training strategy is required. The authors make three contributions to train the network effectively in a purely on-line manner.
The first contribution is a truncated structural loss function, which prevents the accumulation of tracking-error loss. One of the most frequently used cost functions is the mean squared error loss, and in the usual training scheme the vanilla mean squared error loss is used. In [57], an additional multiplicative term is added to the loss function. This term is given in equation (3.1), where Θ(y_n, y*) is the overlap ratio of the target y* and the patch y_n.
\Delta(y_n, y^*) = \left| \frac{2}{1 + \exp(-(\Theta(y_n, y^*) - 0.5))} - 1 \right| \in [0, 0.245]    (3.1)
This term can be thought of as an importance indicator: positive samples close to the target and negative samples far from the target have higher importance. In addition, patches with very small error are assumed to have no significant effect on the network and are discarded in training, which is equivalent to truncating the loss function around zero. With the truncated structural loss, only samples that are discriminative for the target contribute to training, and the effect of noisy patches is discarded.
The second contribution is a robust sample selection mechanism. Positive and negative samples are selected according to temporal relations and label noise. A positive pool and a negative pool hold the training samples. For each new frame, a predefined number of patches is cropped randomly around the target. Patches whose overlap ratio with the target is larger than 0.5 are labeled positive; the other patches are labeled negative. All of these patches are fed to the network, whose output gives the probability that a patch is the target. A predefined number of positive and negative patches is saved to the pools together with an additional quality term. The quality term is computed using the notion of label noise as follows. Patches are sorted from high to low target probability, a predefined number of high-probability patches is selected, and the average truncated loss of these patches is computed and subtracted from 1. Note that a negative patch with high target probability decreases the quality, so patches with noisy labels have low quality and do not affect the network much. The quality function is given in (3.2), where the set P contains the high-probability samples and L_n is the loss of an individual patch.
Q = 1 - \frac{1}{|P|} \sum_{n \in P} L_n    (3.2)
In training, positive samples are drawn with uniform probability over time and negative samples are drawn with a probability that decays exponentially with time, which satisfies a short-term negative, long-term positive memory restriction. The quality term is used as a multiplier in the loss function, so correctly labeled patches have a strong effect in training, while noisily labeled patches have almost no effect because their quality is low.
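A minimal sketch of the quality computation, assuming the target probabilities and truncated losses of one frame's saved patches are available as arrays (the function and argument names are hypothetical):

```python
import numpy as np

def sample_quality(target_probs, losses, top_k=10):
    """Quality term of equation (3.2). The top_k patches with the highest
    target probability form the set P; quality is one minus their average
    truncated loss, so a frame whose confident patches are mislabeled
    (high loss) receives a low quality and a weak training effect."""
    order = np.argsort(target_probs)[::-1]  # sort high to low probability
    top = order[:top_k]
    return 1.0 - np.mean(losses[top])
```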
The third contribution is a lazy updating scheme. The straightforward approach would be to train the network at every frame, but this is computationally expensive. The authors propose a method that decreases the training frequency and thus increases the speed of the tracker. In the training phase, the network is trained until the error reaches a predefined value ε, and it is not trained again until the loss exceeds 2ε. Since object appearance does not change frequently, a successful model gives small loss values for a long time; therefore the lazy updating scheme increases tracker speed without significantly affecting performance.
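The lazy updating scheme can be sketched as a simple control loop; `model_loss` and `train_step` are hypothetical callables standing in for loss evaluation and one training iteration, and the value of ε is an assumed example.

```python
def lazy_update(model_loss, train_step, epsilon=0.05):
    """Lazy updating: skip training while the loss stays at or below
    2*epsilon; once it exceeds 2*epsilon, train until it drops to
    epsilon or below, then stop again."""
    loss = model_loss()
    if loss <= 2 * epsilon:
        return loss          # model is still good enough: no update
    while loss > epsilon:    # retrain down to the target error
        loss = train_step()
    return loss
```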
The neural network architecture in [57] is a composition of four networks: three identical networks called single-cue CNNs and one fully connected network called the fusion network. A single-cue CNN takes a 32 x 32 image patch and outputs one probability term; its last layer is a fully connected layer with 8 inputs and 1 output. To merge the features of the three single-cue CNNs, this last fully connected layer is discarded and the 8-dimensional features are concatenated. The concatenation is fed to the fusion network, which consists of one fully connected layer with 24 inputs and 1 output corresponding to the probability term. The red, green and blue components of the input are given to the single-cue networks respectively; for a gray-level image, these components would be two locally normalized images and one gradient image. In each iteration, a different single-cue CNN is trained together with its last fully connected layer, in order. After the three single-cue CNNs are trained, the fusion network is trained on their 8-neuron outputs.
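The fusion step can be sketched as follows; the weight and bias names are hypothetical, and a sigmoid output is assumed for the probability term.

```python
import numpy as np

def fusion_forward(cue_features, w_fusion, b_fusion):
    """Merge three single-cue CNNs: each contributes an 8-dimensional
    feature vector (its final 8-to-1 layer is discarded), the vectors are
    concatenated into 24 values, and one 24-input, 1-output fully
    connected layer produces the target probability."""
    x = np.concatenate(cue_features)  # 3 x 8 -> 24
    return 1.0 / (1.0 + np.exp(-(w_fusion @ x + b_fusion)))
```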
Not all trackers use a generative approach in deep learning. The following study uses a discriminative approach to find the target location.
A convolutional neural network is used for target tracking in [58]. In the proposed method, the network is trained both off-line and on-line. The purpose of off-line training is to teach the network what an object is; in on-line training, the network is fine-tuned to adapt to the tracked target. During on-line training, mistakes may happen and deform the model. To avoid this, two neural networks are used concurrently at test time, and their results are used in collaboration to determine the target location. The convolutional neural network is composed of seven convolutional layers and three fully connected layers, with a multi-scale pooling scheme [59] applied between them. Most trackers that use a convolutional neural network have one output neuron giving the probability of being the target; the main difference of the network proposed in [58] is that it outputs a 50 x 50 probability map instead of a single value. Since the input size is 100 x 100, each output neuron corresponds to a 2 x 2 area, and the value of a neuron represents how likely that 2 x 2 area belongs to the target. The purpose of pre-training is to teach the network object-level features, so the ImageNet dataset [40, 41] is used; since a deep convolutional neural network is trained, the availability of a large number of images is very important. ImageNet contains approximately 500k images with labeled bounding boxes. For training, a pixel is labeled 1 if it is inside the bounding box and 0 otherwise. Some negative images, in which all pixels are labeled 0, are also introduced to the network. When a new frame arrives, a predefined number of patches of different sizes, centered at the target location in the previous frame, is cropped and fed forward through the network. The search starts with the smallest patch: if the sum of the network output is below a threshold, the patch is skipped, and the search continues from smallest to largest until the sum exceeds the threshold. The patch that exceeds the threshold is declared the target.
Model update frequency is important for tracker performance. If the network is updated too frequently, the model may be distorted by inaccurate results; if the update frequency is too low, the model may not adapt to appearance changes. To solve this problem, the authors propose two convolutional network structures: one adapts to short-term appearance changes and the other to long-term appearance. Both networks are fine-tuned with the initial frame. The long-term network is updated conservatively, while the short-term network is updated more aggressively. The cropped patch is fed to both networks, and the target location is determined from the more confident output.
In a generative model, target candidates are generated and the most probable patch is labeled as the target. In addition, most trackers update their model at test time in order to adapt to appearance changes. On-line training and the large number of forward passes required by the generative approach are time consuming, which makes these trackers very slow and impractical. In GOTURN [13], the authors propose a novel approach to neural network tracking: only one forward pass is necessary and there is no model update, so GOTURN can run at 100 fps. The network in GOTURN is a composition of three networks: two parallel convolutional networks and one fully connected network. The convolutional layers of CaffeNet [60], trained with 1.2 million images from the ImageNet dataset [40, 41], are used as the convolutional networks. The fully connected network is composed of four fully connected layers; three of them contain 4096 neurons, and the final layer has four output neurons. Each layer has dropout and ReLU layers. The hyper-parameters, such as neuron numbers and kernel sizes, are taken from CaffeNet. The whole network is fine-tuned with a training dataset generated from auxiliary sequences, and no model update is performed at test time.
The network takes two consecutive images as input. A region of double the target size is cropped around the target in the previous frame, and the same location is cropped in the current frame. The two cropped images are resized to 227 x 227, the input size of CaffeNet. When the two images are fed forward, the network output directly gives the position of the target with respect to the upper left corner of the crop from the previous frame. The network has four output neurons: the x and y positions of the upper left and lower right corners of the target, respectively. For training, consecutive images and the corresponding motion are given as training samples. In addition, still images are used in training: an image is shifted in some direction as if the object were moving, the non-shifted image is treated as the previous frame and the shifted image as the current frame, and cropping is applied in the same way as for video sequences. This mechanism generates augmented data for object motion, and since a large number of object classes can be found in image datasets, the algorithm becomes more robust to different kinds of objects.
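The input preparation can be sketched as follows, assuming an (x, y, w, h) box format; border padding and the resize to 227 x 227 are omitted for brevity.

```python
import numpy as np

def crop_pair(prev_frame, cur_frame, box):
    """Crop a window of twice the target size, centred on the target in
    the previous frame, from both the previous and the current frame.
    A full implementation would also pad at image borders and resize
    both crops to the 227 x 227 CaffeNet input size."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2          # target centre
    x0, y0 = max(cx - w, 0), max(cy - h, 0)  # double-size window
    x1, y1 = cx + w, cy + h
    return prev_frame[y0:y1, x0:x1], cur_frame[y0:y1, x0:x1]
```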
CHAPTER 4
PROPOSED METHOD
In this chapter, the proposed tracker, which tracks the face of a mouse in video sequences collected from the TÜBITAK 115E248 FARE-MIMIK project setup, is explained. The proposed tracker is based on a convolutional neural network. Given the target location in the previous frame, the tracker outputs the target location in the current frame.
4.1 Network Architecture
Deep neural networks are able to learn discriminative representations of the input dataset. Each layer of the network learns different features: while early layers learn simple features, deep-layer features contain semantic information about the objects in the image. Semantic features are very helpful for tracking, since the main purpose of a tracker is to determine the target location precisely, and they allow the target to be identified reliably. However, as network depth increases, spatial resolution decreases due to pooling layers; the receptive field of one pixel of a deep-layer response can be very large. Therefore, accurate target localization cannot be achieved using deep features alone. Although shallow-layer features contain only basic information, such as corners, they have good spatial resolution. By combining shallow- and deep-layer features, an accurate and robust tracker can be designed.
Training deep convolutional neural networks requires a large amount of training data, on the order of millions of examples. The dataset prepared for this work does not contain that much data. Therefore, the convolutional layers of a deep convolutional neural network pre-trained on a rich dataset are used as a feature extractor. These layers are called the feature extraction network, and its low- and high-level features are used to design an accurate and robust tracker.
Two consecutive frames are fed into the network, and the network is expected to output the target location. The concatenation of low- and high-level features of both the previous and the current frame is used to find the target location; therefore, two parallel feature extraction networks are needed.
The low- and high-level features of the feature extraction network are concatenated along the third dimension. To concatenate them, the spatial sizes of the features must be the same, which may not hold for some layers. A feature adaptation network is used to resize and concatenate the low- and high-level features of both the previous and the current frame.
The concatenated features are fed into the regression network, which is responsible for generating the target location from the input features. This network is trained with the application-specific dataset; therefore, it learns how to relate the motion of the target to the given features.
The neural network used in this thesis is composed of two parallel feature extraction networks, one feature adaptation network, and one regression network connected to the output of the feature adaptation network. A block diagram of the proposed network is given in Figure 4.1.
Various combinations of the network layers of the proposed method are examined in section 5.4.
4.2 Dataset
The mice dataset is generated from mouse videos recorded at the Hacettepe University Institute of Neurological Sciences and Psychiatry, Behavior Experiments Research Laboratory. The videos were recorded for a project that automatically grades the pain level of a mouse with the help of computer software; the proposed tracker will be used in this project as a mouse face tracker. Two types of videos are recorded.
Figure 4.1: Network architecture of the proposed network: two parallel convolutional networks (Conv1-Conv5 with ReLU, LRN and pooling layers) extract low- and high-level features from the previous and current frames; feature adaptation networks resize and concatenate these features, and a regression network (ExtraFC, FC6, FC7 with dropout) outputs the target state.
These are videos in which the mouse is in pain and videos in which it is not. Both types are used in training the tracker. However, videos in which the mouse is in the basal state (the state before the drug) are more valuable for tracker training, because in that state the mouse is more mobile.
For one video recording, six cameras are placed around the container of the mouse as in Figure 4.2. Videos are recorded in ultra-high definition (UHD), 3840 x 2048, at 25 frames per second (FPS). The videos in which the mouse is in the basal state are 30 minutes long, while the duration of the videos in which the mouse is drugged varies from 30 minutes to 1 hour. Recording is not interrupted, in order to observe the effect of the drug over time. The bounding box of the mouse face is labeled manually at the METU Intelligent Systems and Computer Vision laboratory.
To generate the dataset, frames in which the face of the mouse is not visible are discarded. Among the remaining frames, valid frames are determined: if the face of the mouse moves between the previous and the current frame, the current frame is labeled as valid. Valid frames from all videos, including both pain and basal videos, are used in training, which makes the tracker robust to state changes of the mouse.
Figure 4.2: Video recording setup
After deciding which frames are valid, the data and labels for the training dataset are prepared as follows. One valid frame is randomly selected together with its previous frame, and a search area is defined for it. The search area is a square region of interest 1.5 times the size of the target, centered at the target in the previous frame; an example is given in Figure 4.3. The target shape is defined as a square.
The search area is cropped from both the previous and the current frame; note that it is defined with respect to the target area of the previous frame. The two cropped frames are converted from RGB to BGR and resized to 227x227x3, the input format of the VGG-CNN-F network. The ImageNet dataset mean is then subtracted from both images, which is also required by VGG-CNN-F. The label for this data is a three-dimensional vector containing the x, y coordinates of the upper left corner of the target in the current frame, with respect to the upper left corner of the search area, and the target size. The label must be scaled, since the cropped images are resized to 227x227x3: the scaled label is obtained by dividing the label by the scale factor, which is the width of the search area divided by the width of the resized input image (227). The final label is obtained by multiplying the resized label by 10 in order to increase the loss term.
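The label scaling described above can be sketched as follows; the function name is hypothetical.

```python
def scale_label(label, search_width, input_size=227, loss_gain=10):
    """Scale a label (x, y, size), given in pixels of the original search
    area, to the resized 227 x 227 input: divide by the scale factor
    search_width / input_size, then multiply by the loss gain of 10
    used in the mice dataset to increase the loss term."""
    scale = search_width / input_size
    return tuple(v / scale * loss_gain for v in label)
```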
58
Page 77
Figure 4.3: The target area is shown as a green square and the search area as a blue square.
This data and label generation process is applied to all valid frames in random order. Randomization is important in training because, with ordered inputs, the network may memorize patterns, which usually decreases the test performance of a neural network.
4.2.1 Data Augmentation
Data augmentation is an important approach in training dataset generation, since augmented data improves target abstraction. For the mice dataset, vertical mirroring and random brightness changes are applied. The augmentation generates artificial data that could plausibly occur in real video sequences, and the brightness changes increase the robustness of the tracker to illumination changes. From one training sample, nineteen augmented samples are generated by the following procedure.
First, nine uniformly distributed random numbers are generated between 0.8 and 1.2 and used as brightness multipliers. Both the current and the previous frame are converted from RGB to the HSV colorspace, the V channel of each image is multiplied by a brightness multiplier, and the images are converted back
Figure 4.4: The upper ten images are original data with illumination change; the illumination multiplier is given below each image. The lower ten images are mirrored images with illumination change.
to the RGB color space. In this way, nine augmented samples are generated. In addition, vertical mirroring is applied to these nine augmented images and to the original image, producing ten more augmented samples. Example augmented images are given in Figure 4.4. Note that after vertical mirroring the target locations change, so the labels are updated accordingly. When the image is mirrored, the upper left corner of the target maps to its upper right corner; the vertical location and the width do not change within the search area, but the new horizontal location of the upper left corner is obtained by subtracting the old horizontal location from 227 minus the target width. In the mice dataset the target location is multiplied by 10, so the first parameter of the label, which corresponds to the horizontal location, must be multiplied by 10 after this subtraction.
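The mirrored-label update can be sketched as below, assuming the stored label is (x, y, size) already multiplied by the loss gain of 10; the function name is hypothetical.

```python
def mirror_label(label, input_size=227, loss_gain=10):
    """Update a label after left-right mirroring: the horizontal location
    is first unscaled, reflected as x' = input_size - width - x, and then
    multiplied by the loss gain again; the vertical location and the
    target size are unchanged within the search area."""
    x, y, size = label
    x_new = (input_size - size / loss_gain - x / loss_gain) * loss_gain
    return (x_new, y, size)
```

For example, a stored label (100, 50, 500), i.e. x = 10 and width = 50 in input pixels, maps to horizontal location 227 - 50 - 10 = 167, stored as 1670.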
4.3 Off-line Training
As explained in section 4.1, the tracker network is composed of four networks: two feature extraction networks, one feature adaptation network, and one regression network. The tracker architecture was given in Figure 4.1: the layers shown in blue are the feature extraction networks, the layers shown in yellow are the feature adaptation network, and the network shown in red is the regression network.
The convolutional layers of VGG-CNN-F are used as the feature extraction networks of the tracker, and their connection weights are transferred directly. The feature extraction networks are not trained during off-line training, because VGG-CNN-F is trained on a rich dataset containing roughly 1.2 million images of 1000 classes, such as animals, plants, foods and instruments. The features extracted from this rich dataset are known to be highly representative for a large number of objects, and further training with the mice dataset would corrupt them. Since the tracker proposed in this thesis uses both low- and high-level features of the feature extraction network, additional training of the convolutional layers would decrease tracker performance.
The feature adaptation network, shown in yellow, is responsible for equating the spatial sizes of the low- and high-level features. It is composed of max-pooling and concatenation operations, which contain no free parameters; since training applies only to free parameters, the feature adaptation network does not participate in training. The only network trained in the proposed method is the regression network.
The regression network is trained with the Adam optimizer using learning rate annealing, with the hyper-parameters suggested for Adam. Step decrease is used as the annealing policy: the initial learning rate is 1e-5, and it is multiplied by 0.5 every 1000 batches. The batch size is 50. All layers in the regression network are initialized with the Xavier initializer implemented in the Caffe deep learning framework.
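The annealing policy corresponds to the following schedule (an illustration of the step policy, not the Caffe solver configuration itself):

```python
def step_decay(initial_lr=1e-5, gamma=0.5, step=1000):
    """Step learning-rate annealing: start at initial_lr and multiply by
    gamma after every `step` batches, i.e.
    lr(batch) = initial_lr * gamma ** (batch // step)."""
    def lr_at(batch):
        return initial_lr * gamma ** (batch // step)
    return lr_at
```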
In training, the mean squared error loss between the predicted target location and the ground-truth target location is used. The loss function is given in (4.1), where y_n is the label, ŷ_n is the network output for the given input, and N is the batch size (50 in the proposed method).
C = \frac{1}{2N} \sum_{n=1}^{N} \| \hat{y}_n - y_n \|_2^2    (4.1)
The network is trained with the mice videos; the training dataset is generated according to the procedure explained in section 4.2.
4.4 On-line Tracking
In this section, the method used during on-line tracking is explained.
The first frame is read from the video, and the bounding box of the target in that frame is taken from the ground truth. The target area is expanded by a ratio of 1.5; this area is called the search area, as explained in section 4.2. The second frame is then read from the video; it is called the current frame, and the first frame is called the previous frame. Both frames are cropped to the search area. The cropped images are converted from RGB to BGR, resized to 227 x 227, and the ImageNet dataset mean is subtracted, since VGG-CNN-F is trained with BGR images of size 227 x 227 with the dataset mean subtracted. The crop from the previous frame is given as input to the first feature extraction network, and the crop from the current frame is given to the second. By simply feeding these two frames forward, the target location in the current frame is computed. The network has three output neurons, corresponding to the x, y coordinates of the upper left corner and the width of the bounding box; since the target area is always square, width and height are equal. Note that, since the network is trained with resized images, the network output is also scaled to 227 x 227. The scale factor is computed by dividing 227 by the width of the search area, and dividing the network output by this scale factor gives the target location with respect to the cropped current frame. Remember that the target location is multiplied by 10 in the training
Figure 4.5: The mouse turns its back to the camera and the tracker loses the target. Frame numbers (5319-5330) and width ratios with respect to the previous frame are given below each frame. Successful tracker results are shown in green and failed tracker results in red.
dataset; to compensate, the target location in the current frame is divided by 10. This determines the location of the target within the search area. However, for tracking, the target location in the full frame is needed: the coordinates of the upper left corner of the search area are added to the first two terms of the output vector, which are the x, y coordinates of the target in the search area.
After the target location is found, the current frame becomes the previous frame, whose target location is now known. A new frame is read from the video, and the same procedure is applied to find the target in it. This process continues until the end of the video or until the target is lost.
Experimental results show that when the tracker loses the target, the target size decreases dramatically over time and goes to zero. To identify target loss, a simple heuristic is used: the width of the current target is divided by the width of the previous target, and this division is called the width ratio. If the width ratio is smaller than the loss threshold, which is 0.9 in this application, the tracker is assumed to have failed for the current frame.
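The target-loss heuristic amounts to the following check (function name hypothetical):

```python
def tracker_failed(prev_width, cur_width, loss_threshold=0.9):
    """Width-ratio heuristic: when the tracker loses the target, the
    estimated target width shrinks rapidly, so a width ratio below the
    loss threshold (0.9 in this application) signals failure."""
    return (cur_width / prev_width) < loss_threshold
```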
Figure 4.6: Width ratio histogram on the 25 FPS test video.
In Figure 4.5, a tracker failure example is given. The tracker cannot track the target after frame 5325, since the mouse turns its back to the camera. Tracker outputs whose width ratio is below 0.9 are shown as red squares. Note that the red squares do not contain a mouse face, which shows that the width ratio assumption is valid for tracker failure evaluation in this example.
In Figure 4.6, the width ratio histogram for the test video is given. At a 25 FPS capture rate, the width ratio under natural movement of the mouse face mostly varies between 0.94 and 1.04; therefore, the 0.9 threshold value does not affect tracker performance.
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Performance Criteria
To evaluate the performance of a single-target object tracker, some performance criterion must be used. Several performance measures appear in the literature, some of them popular in tracking studies, but there is no standard performance measure for single object tracking. Cehovin et al. [61] evaluated popular performance measures with 13 trackers on 25 widely used video sequences. Based on these analyses, they state that some measures indicate the same aspects of performance. They select two uncorrelated performance measures and propose a visualization method based on the accuracy versus the robustness of the tracker. In this section, first the popular performance measures and visualization methods for single-target tracking are explained based on the study of Cehovin et al., and then the performance measures used in this thesis are stated.
The purpose of a performance measure in object tracking is to evaluate how well the object state assigned by the tracker matches the ground-truth object state. Some popular performance measures are as follows:
Center Error One of the oldest performance measures is the center error: the distance between the target center and the ground-truth center. The smaller the center error, the better the tracker. Center error is usually visualized by a center error versus frame plot; the average or the mean squared error can also be used as a numeric performance indicator. Center error is denoted by δ.
\delta_t = \| x_t^G - x_t^T \|    (5.1)
The center error measure may not be objective, because it greatly depends on the target size: for large objects the center error may be large even though the tracker is successful. To overcome this, the normalized center error is proposed, in which the center error is divided by the ground-truth target size. The same visualization techniques are used as for the center error. Normalized center error is denoted by δ̂_t.
\hat{\delta}_t = \left\| \frac{x_t^G - x_t^T}{\mathrm{size}(A_t^G)} \right\|    (5.2)
Region Overlap The region overlap measure is the ratio of the intersection of the target and ground-truth regions to their union. It is a good performance measure because both position and size are taken into account. Region overlap is denoted by φ.
\phi_t = \frac{A_t^G \cap A_t^T}{A_t^G \cup A_t^T}    (5.3)
The score used with region overlap is called the true positive measure: the number of frames whose region overlap is larger than a threshold, divided by the number of frames in the video sequence. It is denoted by P_τ.
P_\tau(\Lambda^G, \Lambda^T) = \frac{\left| \{ t \mid \phi_t > \tau \}_{t=1}^{N} \right|}{N}    (5.4)
Tracking Length The tracking length measure is the duration of tracking from the first frame to the frame at which the tracker fails. Failure can be decided using the region overlap term: if the region overlap falls below some threshold value τ, the tracker is considered to have failed. This measure is highly dependent on the video sequence: if there is a difficulty in the early frames, the tracker fails early. Since the remaining video is discarded after the tracker fails, the tracker is evaluated on only a limited duration of a video
Figure 5.1: Correlation of performance measures; the diagonal entries have the maximum correlation value of 1.
sequence. For this reason, tracking length is not considered a good performance measure. Tracking length is denoted by L_τ.
Failure Rate The failure rate is the ratio of the number of re-initializations of the tracker upon failure to the number of frames in the video sequence. Failure is detected using region overlap, as in the tracking length measure. Failure rate is denoted by F_τ, where τ is the region overlap threshold, called the re-initialization threshold. Unlike tracking length, failure rate evaluates the tracker over the whole video sequence.
The true positive, tracking length and failure rate measures are computed for a given threshold value τ. To visualize these performance measures, a performance measure versus threshold plot can be used; the area under the curve (AUC) of such a plot is a good numerical performance indicator.
Cehovin et al. provide an experimental analysis of these performance criteria and show that some of the measures are highly correlated. They compute a correlation matrix from the measure results of the selected tracker and video sequence pairs; the heat map of the correlations is given in Figure 5.1. There is high correlation among performance measures 1 to 3 and among measures 4 to 7, which means that it does not matter which measure is selected within each group: they give essentially the same result.
Performance evaluation on single target tracking is mainly focused on robustness and
accuracy of the tracker. Cehovin et al. proposed a simple accuracy versus robustness plot in order to compare object trackers.
Since the failure rate measures how well an algorithm tracks an object without losing it, it can be used as a robustness measure. However, the failure rate does not contain any information about accuracy. As an accuracy measure, one of the center error, region overlap or tracking length measures can be used. As stated before, the tracking length measure is not reliable because it does not use the whole video sequence. Region overlap uses both the size and the location of the target; therefore, it describes accuracy better than center error, which uses only the location. In addition, region overlap is highly correlated with the true positive and tracking length measures, which also shows its representative power.
In this thesis, different variations of the proposed mouse face tracker are compared with each other. For the comparison, an accuracy versus robustness plot is supplied. The AUC of the true positive versus region overlap curve is used as the accuracy term, and 1 minus the AUC of the failure rate versus re-initialization threshold curve is used as the robustness term.
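As a sketch, the two terms can be computed from the sampled curves as below; the trapezoidal rule is an assumption, since the thesis does not state the integration method used for the AUC.

```python
def auc(thresholds, values):
    # area under a measure-versus-threshold curve (trapezoidal rule)
    area = 0.0
    for i in range(1, len(thresholds)):
        area += 0.5 * (values[i] + values[i - 1]) * (thresholds[i] - thresholds[i - 1])
    return area

def accuracy_robustness(overlap_thr, true_pos, reinit_thr, fail_rate):
    # accuracy: AUC of true positive vs. region overlap threshold
    # robustness: 1 - AUC of failure rate vs. re-initialization threshold
    return auc(overlap_thr, true_pos), 1.0 - auc(reinit_thr, fail_rate)
```

Each tracker then contributes one (accuracy, robustness) point to the comparison plot.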
However, comparing the proposed trackers only with each other is not enough for a detailed performance evaluation. For that purpose, true positive versus region overlap threshold plots are given as a detailed accuracy measure. Although center error is highly correlated with region overlap, true positive versus center error threshold plots give detailed information about the target centering property. Failure rate versus region overlap threshold plots are supplied for a detailed robustness evaluation.
5.2 Test Networks
In Section 4.1, the pseudo-architecture of the proposed tracker is explained. In order to evaluate the performance of the proposed tracker, 9 test networks with different network architectures are proposed. One of the test networks is a variant of the state-of-the-art GOTURN [13] tracker. The performance of the 9 test networks is evaluated and compared. The architectures of these 9 networks are summarized in Table 5.1 and are presented in detail below. To name the trackers, the C5_{L,H}-Cx-Fy convention is used, where L is the depth of the low level feature and H is the depth of the high level feature in the feature
Table 5.1: Summary of Test Networks

| Network Name | Low Level Feature | High Level Feature | Regression Network: Convolutional Layers | Regression Network: Fully Connected Layers |
|---|---|---|---|---|
| C5_{0,5}-C0-F4 | - | Pool5 | - | 4 |
| C5_{0,5}-C1-F3 | - | Pool5 | 1 | 3 |
| C5_{3,5}-C0-F4 | ReLU output of Conv3 | Pool5 | - | 4 |
| C5_{3,5}-C1-F3 | ReLU output of Conv3 | Pool5 | 1 | 3 |
| C5_{1,5}-C1-F3 | Pool1 | Pool5 | 1 | 3 |
| C5_{2,5}-C1-F3 | Pool2 | Pool5 | 1 | 3 |
| C5_{4,5}-C1-F3 | ReLU output of Conv4 | Pool5 | 1 | 3 |
| C5_{2,4}-C1-F3 | Pool2 | ReLU output of Conv4 | 1 | 3 |
| C5_{2,4}-C2-F3 | Pool2 | ReLU output of Conv4 | 2 | 3 |
extractor network, x is the number of convolutional layers and y is the number of fully connected layers in the regression network. If low level features are not used in a network, L is equal to 0. If no convolutional layer is used in the regression network, x is equal to 0.
As the base test network architecture, the architecture of the generic single object tracker proposed by Held et al., namely GOTURN [13], is used. GOTURN is a state-of-the-art tracker that can run at 100 FPS, which is a very satisfactory speed for a neural network based tracking algorithm; neural network based trackers generally run at 0.8 FPS to 15 FPS due to on-line training.
Since it is trained with generic objects, GOTURN performs better with standard targets such as balls, cars and humans; if it were evaluated on the mice dataset, its performance would not be satisfactory. Therefore, a mouse face tracker that has the same architecture as GOTURN except for the output layer is implemented. The output layer is changed to a 3-neuron output layer instead of a 4-neuron one, because the mouse face is square shaped, so one of the width and height parameters is unnecessary. This network is called the C5_{0,5}-C0-F4 network. It is trained with the mice dataset like the other networks in order to increase the performance of GOTURN on mouse video sequences. Its block diagram is
given in Figure A.1.
In the original GOTURN implementation, the convolutional layers of AlexNet are used as the feature extraction network. In order to achieve faster operation, the convolutional layers of the VGG-CNN-F network are used in the C5_{0,5}-C0-F4 network instead of AlexNet. On the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012), both architectures achieved similar error rates; in addition, the VGG-CNN-F network is specialized for fast training and inference.
The regression network is composed of four fully connected layers. The first three fully connected layers each contain 4096 neurons with ReLU and 0.5 dropout. The final layer has three neurons that correspond to the target state: the three outputs represent the x, y coordinates of the upper left point of the target and the size of the target (scaled by 227/target size). The fully connected layers are called fc extra, fc6, fc7 and fc8, in order. The full network architecture is given in Figure A.3.
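The label scaling can be sketched as follows. The function names are illustrative, and `reference_size` stands for the normalizer in the scale factor 227/target size mentioned above; whether the normalizer is the target size itself or the search-area crop size is an assumption left as a parameter here.

```python
INPUT_SIZE = 227  # VGG-CNN-F input resolution

def scale_label(x, y, size, reference_size):
    # map the square target (upper-left corner and side length) into
    # the network's 227-pixel coordinate frame
    s = INPUT_SIZE / reference_size
    return (x * s, y * s, size * s)

def unscale_label(lx, ly, lsize, reference_size):
    # inverse transform: regression output back to image pixels
    s = reference_size / INPUT_SIZE
    return (lx * s, ly * s, lsize * s)
```

The inverse transform is what turns the 3-neuron network output back into a bounding box in image coordinates.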
While fully connected layers operate on all inputs independently, convolutional layers take spatial relations into account. The depth of the input is also considered in the convolution operation, which means that features related to the content along the depth dimension are extracted as well. Note that in the C5_{0,5}-C0-F4 network, the pool5 features of the feature extraction networks are concatenated, and since all layers up to conv5 are convolutional, the concatenation of pool5 features still contains spatial information. The features of the two different pool5 outputs should therefore be merged. For that purpose, the fc extra layer in the C5_{0,5}-C0-F4 network is replaced by a convolutional layer, namely conv extra. This convolutional layer has 256 kernels of size 3x3, with stride 1 and padding 1. This network is called the C5_{0,5}-C1-F3 network; its block diagram is given in Figure A.2.
The pool5 features represent the high level features. Although high level features contain semantic information, the receptive field of the neurons in the conv5 layer is very large, which makes it harder to localize the target precisely. Therefore, low level features should also be included in the network.
In the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks, the ReLU output of layer 3 is added to the network. In order to see the effect of a convolutional layer on the mixture of high and low
level features, the C5_{3,5}-C0-F4 network contains the fc extra layer and the C5_{3,5}-C1-F3 network contains the conv extra layer. The block diagrams of these networks are given in Figures A.4 and A.5, respectively.
In the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks, the ReLU output of the conv3 layer is used as the low level feature. However, it is not clear what the depth of the low level features should be. In order to measure the performance of different low level features, the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks are proposed.
In the C5_{1,5}-C1-F3 network, the conv5 features of the VGG-CNN-F network are used in order to make use of their high level representation property. For low level features with high spatial resolution, the pooling output of the conv1 layer is used. The input size of the VGG-CNN-F network is 227x227x3, which makes the size of the conv5 output 6x6x256 and the size of the pooling output of conv1, called the pool1 layer, 27x27x64. As stated above, the concatenation of conv5 and pool1 features is used as input to the regression network. The features are concatenated along the third dimension, so the spatial size of the pool1 layer must first be equated to that of the conv5 layer. In the feature adaptation network, a concatenation layer and two max-pooling layers with kernel size 7x7 and stride 4 are used on top of the pool1 layers in order to equate the spatial sizes. In the regression network, the conv extra, fc6, fc7 and fc8 layers are used.
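The spatial-size bookkeeping above can be checked with the standard pooling output formula; this sketch assumes zero padding and one such pooling layer per pool1 branch.

```python
def pool_out(size, kernel, stride):
    # spatial output size of a max-pooling layer without padding
    return (size - kernel) // stride + 1

# A 7x7, stride-4 max pool maps the 27x27 pool1 maps onto the 6x6
# spatial size of the conv5 output, so the feature maps can then be
# concatenated along the depth dimension.
pool1_matched = pool_out(27, 7, 4)
```

With matching 6x6 spatial sizes, the only remaining difference between the branches is their depth, which is exactly what concatenation along the third dimension allows.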
The C5_{2,5}-C1-F3 network uses pool2 features. In order to match the feature sizes, the feature adaptation network of the C5_{3,5}-C1-F3 network is used, together with the same regression network as in the C5_{1,5}-C1-F3 network. The block diagram of the C5_{2,5}-C1-F3 network is given in Figure A.7.
The C5_{4,5}-C1-F3 network uses the ReLU output of the conv4 layer, with the same regression network as in the C5_{1,5}-C1-F3 network. Its block diagram is given in Figure A.6.
In all the test networks so far, pool5 features are used as the high level features. It is known that high level features represent the input semantically; however, for the mouse face tracking problem there is only one class of input to be tracked, so such high level features may not be necessary. In order to evaluate this, the C5_{2,4}-C1-F3 network is defined. Its block diagram is given in Figure A.8. In that network,
pool2 features are used as the low level feature and the ReLU output of conv4 features as the high level feature. Note that their feature map sizes, 13x13, are equal to each other, so no pooling layer is necessary in the feature adaptation network to concatenate the features.
The depth of the network is decreased by one layer, since the conv5 layer of VGG-CNN-F is not used in the C5_{2,4}-C1-F3 network. In the C5_{2,4}-C2-F3 network, one more convolutional layer is added to keep the network depth the same as in the other networks except C5_{2,4}-C1-F3. It is expected that high level features of the mice dataset will be extracted by this conv5 extra layer. The block diagram of the C5_{2,4}-C2-F3 network is given in Figure A.9. The conv5 extra layer has the same properties as the conv extra layer, except that it has 512 kernels.
5.3 Test Procedure
The performance of the 9 different trackers is measured on 7 test video sequences, cropped from a video sequence that is not included in the training dataset. These test videos are selected considering the face visibility of the mouse.
The proposed tracker has a tracker failure detection mechanism, as explained in Section 4.4 on on-line tracking. In order to evaluate the trackers objectively, this property is turned off. Instead, a tracker is assumed to have failed if the region overlap between the target and ground-truth bounding boxes falls below a threshold value. After a tracker failure, the target location is re-initialized from the ground truth. This tracker failure threshold is called the re-initialization threshold.
True positive versus region overlap threshold and true positive versus center error threshold plots are given for a re-initialization threshold of 0. Each tracker is run on all test videos and the results are averaged to obtain the final plots. The area under the curve of these plots is a good indicator of performance; the AUC values are given to the right of the network names in the legend of each figure, and a higher AUC value means better performance.
Note that the failure rate performance measure, which is the best robustness indicator
among the mentioned performance measures, is measured for a given re-initialization threshold. In order to plot failure rate versus re-initialization threshold, the trackers must be run on all test videos over the whole re-initialization threshold interval. The trackers are therefore run on all test videos with 50 different threshold values, ranging from 0 to 0.98 in steps of 0.02. For the failure rate measure, a lower AUC value represents better performance.
The speed of a tracker is calculated by averaging its FPS values over all runs. There are 350 results for each tracker, since each tracker is run with 50 different re-initialization thresholds on 7 video sequences. Averaging the tracker speed over these 350 runs gives a reliable speed measure.
5.4 Performance Evaluation
In this section, the effect of a convolutional layer with high level features, the effect of a convolutional layer with the fusion of low and high level features, a comparison of low level feature depths, a comparison of high level feature depths, and an overall comparison of the test networks are presented with experimental results.
5.4.1 Effect of Convolutional Layer
As mentioned before, the C5_{0,5}-C0-F4 network shares the same architecture as the state-of-the-art GOTURN tracker except for the output layer. The C5_{0,5}-C1-F3 network is a slightly modified version of the C5_{0,5}-C0-F4 network in which the fc extra layer is replaced with the conv extra layer. Performance plots of both trackers are given in Figures 5.2, 5.3 and 5.4.
The experimental results show that a convolutional layer in the regression network improves the tracker in terms of both robustness and accuracy. The failure rate of the C5_{0,5}-C1-F3 network is lower than that of the C5_{0,5}-C0-F4 network between region overlap thresholds 0 and 0.5. Above threshold 0.5, the performance of the two networks is close to each other, because neither network is successful at tracking with high region overlap.
Figure 5.2: True Positive versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.656 and 0.702, respectively)
Figure 5.3: True Positive versus Normalized Center Error plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.741 and 0.743, respectively)
Figure 5.4: Failure Rate versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.465 and 0.432, respectively)
Although the normalized center error performance of the two networks is close, the region overlap performance of the C5_{0,5}-C1-F3 network is much better, which shows that the convolutional layer especially improves the bounding box accuracy. The experimental results support that an additional convolutional layer in the regression network is more successful than fully connected layers alone at merging features that come from different networks.
5.4.2 Effect of Low Level Features and Convolutional Layer in Feature Fusion Networks
In target localization, low level features are also necessary due to the wide receptive fields of high level features. It is expected that if low level features are used together with high level features, the accuracy of the tracker will increase. In this section, the effect of low level features is demonstrated with experimental results. The C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks are compared with the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks. Both the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks use the ReLU output of the conv3 layer. In addition, the effect of the convolutional layer when used with the fusion of high and low level features is evaluated. Performance plots are given in Figures 5.5, 5.6 and 5.7.
Figure 5.5: True Positive versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.656, 0.702, 0.318 and 0.770, respectively)
Figure 5.6: True Positive versus Normalized Center Error plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.741, 0.743, 0.520 and 0.786, respectively)
Figure 5.7: Failure Rate versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.465, 0.432, 0.611 and 0.362, respectively)
When the ReLU output of the conv3 layer is used with the conv extra layer, the performance of the tracker increases significantly, as expected. However, if the low level features are used with the fc extra layer, performance decreases significantly because of the large number of free parameters in the fc extra layer.
The conv5 layer contains 256 feature maps of size 6x6, and the conv3 layer contains 256 feature maps of size 13x13. The ReLU of conv3 features is downscaled to 6x6 by max pooling. In the C5_{3,5}-C0-F4 network, the features of 4 layers are concatenated, which makes 1024 feature maps of size 6x6. These features are flattened and given to the fc extra layer, which therefore has 36864 inputs and 4096 outputs, corresponding to 151 million free parameters in a single layer. As a result, the C5_{3,5}-C0-F4 network easily over-fits to the training dataset and loses its ability to generalize mouse movement behavior, which decreases tracker performance significantly.
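The parameter count quoted above is easy to verify (weight count only, biases excluded):

```python
# fc extra input in the C5_{3,5}-C0-F4 network: 1024 feature maps of
# size 6x6, flattened, fully connected to 4096 output neurons
inputs = 1024 * 6 * 6      # 36864 inputs
weights = inputs * 4096    # about 151 million free parameters
```

A 3x3 convolutional layer with 256 kernels over the same 1024-channel input would instead need only 3 * 3 * 1024 * 256 (about 2.4 million) weights, which is why the conv extra variant is far less prone to over-fitting.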
5.4.3 Effect of Depth of Low Level Features
It has been seen that low level features improve tracker performance. In this section, the effect of the depth of the low level features is evaluated. The C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks, which use the low level features of pool1, pool2, ReLU of conv3 and ReLU of conv4, respectively, are compared. They all use the conv extra layer, since
Figure 5.8: True Positive versus Region Overlap Ratio plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.804, 0.805, 0.770 and 0.574, respectively)
it has been shown that a fully connected layer cannot perform well with a large number of feature maps. Performance plots are given in Figures 5.8, 5.9 and 5.10.
When the ReLU output of conv4 is used as the low level feature, the performance of the tracker is worse than the others, because conv4 still has too large a receptive field to locate the target precisely.
If the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3 and C5_{3,5}-C1-F3 networks are examined, it is seen that they all perform well in terms of normalized center error. However, the C5_{2,5}-C1-F3 and C5_{1,5}-C1-F3 networks are better at region overlap, because the pool1 and pool2 features have smaller receptive fields while still being discriminative for the mouse face. The trackers that use pool1 and pool2 features therefore outperform the others, with the C5_{2,5}-C1-F3 network slightly better than the C5_{1,5}-C1-F3 network.
5.4.4 Effect of Depth of High Level Features
Pool2 features give good performance as low level features. The effect of the depth of the high level feature is examined by comparing the C5_{2,5}-C1-F3, C5_{2,4}-C1-F3 and C5_{2,4}-C2-F3 networks. In the C5_{2,4}-C1-F3 network, the ReLU output of the conv4 layer is used as the high level feature. In the C5_{2,4}-C2-F3 network, an additional convolutional layer is added. The aim of
Figure 5.9: True Positive versus Normalized Center Error plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.778, 0.788, 0.786 and 0.717, respectively)
Figure 5.10: Failure Rate versus Region Overlap Ratio plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.338, 0.337, 0.362 and 0.473, respectively)
Figure 5.11: True Positive versus Region Overlap Ratio plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.579, 0.706 and 0.805, respectively)
adding this layer is to obtain high level features extracted from the fusion of pool2 and the ReLU of conv4 with the mice dataset. Performance plots are given in Figures 5.11, 5.12 and 5.13.
The performance of the C5_{2,4}-C1-F3 network is lower than the others. The pool5 layer contains more semantic features than the conv4 layer, which increases the network's ability to identify the input object; in addition, the network depth is decreased in the C5_{2,4}-C1-F3 network. If the network depth is increased by adding an extra convolutional layer, tracker performance increases. However, the C5_{2,5}-C1-F3 network still performs better, because the VGG-CNN-F network is trained with a rich dataset compared to the mice dataset, which makes its features more representative.
5.4.5 Overall Comparison
The robustness versus accuracy plot is given in Figure 5.14. The AUC of true positive versus region overlap is used as the accuracy term, and 1 minus the AUC of failure rate versus re-initialization threshold is used as the robustness term.
It is seen that the trackers that use low and high level features with a convolutional layer
Figure 5.12: True Positive versus Normalized Center Error plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.704, 0.770 and 0.788, respectively)
Figure 5.13: Failure Rate versus Region Overlap Ratio plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.433, 0.411 and 0.337, respectively)
Figure 5.14: Robustness vs Accuracy Plot of All Trackers
in their regression networks give better performance in terms of both accuracy and robustness. The C5_{3,5}-C0-F4 network gives the worst performance even though it uses both low and high level features: it does not use a convolutional layer in its regression network and therefore over-fits to the dataset.
5.5 System Performance
All trackers are run on a workstation located in the METU Intelligent Systems and Computer Vision Laboratory. The workstation contains an Intel Core i7 3.3 GHz CPU and an NVidia Titan X GPU. The Caffe deep learning framework is used in the implementation of the networks. Since the architecture of artificial neural networks is very suitable for parallel computing, they run faster on a GPU; the proposed tracker runs 35 times faster on the GPU than on the CPU.
The mice dataset is stored on an HDD, and the training of the test networks is performed on the GPU. With this configuration, training one batch takes 0.8 seconds with a batch size of 50, and training a test network takes approximately 5000 batch iterations.
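These figures imply the following rough training cost (simple arithmetic, not a measured value):

```python
batch_size = 50
seconds_per_batch = 0.8
iterations = 5000

samples_seen = batch_size * iterations                   # 250000 training samples
training_minutes = iterations * seconds_per_batch / 60   # roughly 67 minutes per network
```

So each test network trains in about an hour on this hardware.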
Tracker speeds with non-optimized code are given in Table 5.2. Tracking a frame involves cropping and resizing the previous and current frames, feeding them forward through the network, and transforming the label from the search area to a bounding box in the image. The
Table 5.2: Tracker Speeds of the Test Networks

| Network Name | Throughput (FPS) |
|---|---|
| C5_{2,5}-C1-F3 | 113.77 |
| C5_{0,5}-C0-F4 | 127.61 |
| C5_{0,5}-C1-F3 | 126.87 |
| C5_{3,5}-C0-F4 | 105.39 |
| C5_{3,5}-C1-F3 | 113.42 |
| C5_{4,5}-C1-F3 | 113.96 |
| C5_{1,5}-C1-F3 | 125.19 |
| C5_{2,4}-C1-F3 | 98.51 |
| C5_{2,4}-C2-F3 | 117.54 |
size of the frames given to the tracker before cropping is 1280x720; depending on the image size, the resizing and cropping durations may change.
CHAPTER 6
CONCLUSION
The aim of this thesis is to design an object tracker specialized for the face of a mouse. For that purpose, a special convolutional neural network architecture that takes two consecutive frames and outputs the target bounding box is proposed. The convolutional neural network is trained with the mice dataset, which is generated from videos recorded at the Hacettepe University Institute of Neurological Sciences and Psychiatry, Behavior Experiments Research Laboratory. The face locations in the video sequences are labeled by members of the METU Computer Vision and Intelligent Systems Research Laboratory.
A deep neural network can learn how to represent data by training on a training dataset. Each layer of the network learns feature extractors of different complexity: while shallow layers learn basic features such as edges, deep layers learn more semantic features.
The aim in object tracking is to follow an object by defining its bounding box throughout a video sequence. To achieve this, a tracker should be able both to identify the object to be tracked and to define a precise bounding box for it. It is known that high level features contain semantic information about input images that is useful for object identification. However, high level features have large receptive fields, which makes precise localization of a given input hard. In this thesis, high level and low level features are merged for accurate and robust target tracking.
The tracker network is composed of four neural networks. Two of them are called feature extraction networks. Since a large amount of training data is necessary to train a convolutional neural network to obtain good feature extractors, the convolutional
layers of VGG-CNN-F, which was trained with the ImageNet dataset of 1.2 million images, are used as the feature extraction networks. The third network is called the feature adaptation network and is used to merge the low and high level features of the feature extraction networks. The fourth network is called the regression network, and it is the only network trained with the mice dataset. The VGG-CNN-F network is not fine-tuned, since fine-tuning it with the mice dataset, which has a limited amount of data compared to ImageNet, would distort its representative power. The optimal depths of the high and low level features are selected based on the experimental results. For a given pair of consecutive frames, high and low level features are extracted from the VGG-CNN-F networks and concatenated. The concatenated features are related to the bounding box of the target by the regression network, which learns how to define a bounding box from the features of consecutive mouse frames, since it is trained with the mice dataset.
The experimental results showed that an additional convolutional layer at the input of the regression network performs better, because the concatenation of features from the feature extraction networks contains spatial information that can be exploited by a convolutional layer. Although the proposed method is specialized in tracking the face of a mouse, it can be adapted to any target by changing the training dataset.
Neural network based object trackers are usually slow due to the on-line training they use to adapt to object changes during tracking. In this thesis, the regression network learns how to define a bounding box for the natural movements of a mouse face; in other words, it learns how the mouse moves, so there is no need for on-line adaptation. The proposed tracker defines the bounding box by just feeding the frames forward through the network and can therefore run at 113 FPS on a GPU.
Tracker performance can be increased further if more training data is used.
REFERENCES
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[2] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," http://arxiv.org/abs/1312.6229.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[4] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724, 2014.
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in ICML, pp. 647–655, 2014.
[6] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in European Conference on Computer Vision, pp. 834–849, Springer, 2014.
[7] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, 2014.
[8] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in European Conference on Computer Vision, pp. 297–312, Springer, 2014.
[9] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller, "Recognizing image style," arXiv preprint arXiv:1311.3715, 2013.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[11] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–42, 2015.
[12] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
[13] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 FPS with deep regression networks," arXiv preprint arXiv:1604.01802, 2016.
[14] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning." Book in preparation for MIT Press, 2016.
[15] S. Herculano-Houzel, "The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost," Proceedings of the National Academy of Sciences, vol. 109, no. Supplement 1, pp. 10661–10668, 2012.
[16] B. Catanzaro, "Deep learning with COTS HPC systems," 2013.
[17] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[21] Wikimedia, "Neuron," 2006. [Online; accessed August 02, 2016].
[22] A. H. Gittis, S. H. Moghadam, and S. du Lac, "Mechanisms of sustained high firing rates in two classes of vestibular nucleus neurons: differential contributions of resurgent Na, Kv3, and BK currents," Journal of Neurophysiology, vol. 104, no. 3, pp. 1625–1634, 2010.
[23] F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] A. Karpathy, "CS231n: Convolutional neural networks for visual recognition." http://cs231n.github.io. Accessed: 2016-08-20.
[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, pp. 249–256, 2010.
[28] S. Ruder, "An overview of gradient descent optimization algorithms." http://sebastianruder.com/optimizing-gradient-descent/index.html. Accessed: 2016-08-20.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[30] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)," in Doklady AN SSSR, vol. 269, pp. 543–547, 1983.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[32] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[33] T. Tieleman and G. Hinton, "Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
[34] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," The Journal of Physiology, vol. 195, no. 1, pp. 215–243, 1968.
[36] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
89
Page 108
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learningapplied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,pp. 2278–2324, 1998.
[38] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of thedevil in the details: Delving deep into convolutional nets,” arXiv preprintarXiv:1405.3531, 2014.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[40] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition,2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recog-nition challenge,” International Journal of Computer Vision, vol. 115, no. 3,pp. 211–252, 2015.
[42] N. Wang and D.-Y. Yeung, “Learning a deep compact image representationfor visual tracking,” in Advances in Neural Information Processing Systems 26(C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,eds.), pp. 809–817, Curran Associates, Inc., 2013.
[43] Y. Chen, X. Yang, B. Zhong, S. Pan, D. Chen, and H. Zhang, “Cnntracker:Online discriminative object tracking via deep convolutional neural network,”Applied Soft Computing, vol. 38, pp. 1088–1098, 2016.
[44] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tinyimages,” 2009.
[45] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional fea-tures for visual tracking,” in Proceedings of the IEEE International Conferenceon Computer Vision), 2015.
[46] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Convolutionalfeatures for correlation filter based visual tracking,” in Proceedings of the IEEEInternational Conference on Computer Vision Workshops, pp. 58–66, 2015.
[47] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object track-ing using adaptive correlation filters,” in Computer Vision and Pattern Recogni-tion (CVPR), 2010 IEEE Conference on, pp. 2544–2550, IEEE, 2010.
[48] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learning spatiallyregularized correlation filters for visual tracking,” in Proceedings of the IEEEInternational Conference on Computer Vision, pp. 4310–4318, 2015.
90
Page 109
[49] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimationfor robust visual tracking,” in British Machine Vision Conference, Nottingham,September 1-5, 2014, BMVA Press, 2014.
[50] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discrimi-native saliency map with convolutional neural network,” in Proceedings of the32nd International Conference on Machine Learning, 2015, Lille, France, 6-11July 2015, 2015.
[51] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional net-works: Visualising image classification models and saliency maps,” CoRR,vol. abs/1312.6034, 2013.
[52] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convo-lutional networks,” in 2015 IEEE International Conference on Computer Vision(ICCV), pp. 3119–3127, Dec 2015.
[53] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang, “Video tracking usinglearned hierarchical features,” IEEE Transactions on Image Processing, vol. 24,no. 4, pp. 1424–1435, 2015.
[54] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural localsparse appearance model,” in Computer vision and pattern recognition (CVPR),2012 IEEE Conference on, pp. 1822–1829, IEEE, 2012.
[55] C. Cadieu and B. A. Olshausen, “Learning transformational invariants from nat-ural movies,” in Advances in neural information processing systems, pp. 209–216, 2008.
[56] H. Nam and B. Han, “Learning multi-domain convolutional neural networks forvisual tracking,” arXiv preprint arXiv:1510.07945, 2015.
[57] H. Li, Y. Li, and F. Porikli, “Deeptrack: Learning discriminative feature repre-sentations by convolutional neural networks for visual tracking,” in Proceedingsof the British Machine Vision Conference, BMVA Press, 2014.
[58] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung, “Transferring rich feature hierar-chies for robust visual tracking,” arXiv preprint arXiv:1501.04587, 2015.
[59] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convo-lutional networks for visual recognition,” IEEE transactions on pattern analysisand machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[60] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-rama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embed-ding,” in Proceedings of the 22nd ACM international conference on Multimedia,pp. 675–678, ACM, 2014.
91
Page 110
[61] L. Cehovin, M. Kristan, and A. Leonardis, “Is my new tracker really betterthan yours?,” in IEEE Winter Conference on Applications of Computer Vision,pp. 540–547, IEEE, 2014.
92
Page 111
APPENDIX A
BLOCK DIAGRAMS OF TEST NETWORKS
[Block diagram: the previous frame and the current frame are processed by Feature Extraction Network #1 and Feature Extraction Network #2 (Conv1–Conv5 with ReLU, LRN and pooling). The Pool5 outputs of the two streams are concatenated and fed to a regression network (FCExtra, FC6, FC7, FC8 with dropout) that outputs the target state.]

Figure A.1: C5_{0,5} − C0 − F4 network. The network architecture of GOTURN is implemented by changing the feature extraction network and the output layer.
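The two-stream regression layout in Figure A.1 can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the random-projection `extract_features` stand-in and all dimensions are invented for illustration, and only the data flow (two feature streams, concatenation, a linear regression head producing a 4-dimensional target state) mirrors the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(frame):
    """Stand-in for a Conv1-Conv5 feature extraction stream:
    a fixed linear projection to a Pool5-sized feature vector."""
    W = np.ones((frame.size, 256)) / frame.size  # placeholder weights
    return frame.reshape(-1) @ W

def regress_target_state(prev_frame, curr_frame, fc_weights):
    # One feature extraction stream per frame
    f1 = extract_features(prev_frame)
    f2 = extract_features(curr_frame)
    # Concatenate the Pool5-level features and regress the target state
    fused = np.concatenate([f1, f2])   # shape (512,)
    return fused @ fc_weights          # shape (4,): a bounding-box state

prev = rng.random((64, 64))
curr = rng.random((64, 64))
fc = rng.random((512, 4))
state = regress_target_state(prev, curr, fc)
print(state.shape)  # (4,)
```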
93
Page 112
[Block diagram: the same two feature extraction networks; the Pool5 outputs are concatenated and passed through a ConvExtra layer with ReLU, then FC6, FC7 and FC8 with dropout, producing the target state.]

Figure A.2: C5_{0,5} − C1 − F3 network. The fully connected layer of the C5_{0,5} − C0 − F4 network is replaced with a convolutional layer.
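Replacing the extra fully connected layer with a convolutional one, as in Figure A.2, amounts to applying a per-location linear map to the concatenated feature maps instead of flattening them. A sketch of that idea using a 1x1 convolution (the channel counts and spatial size here are hypothetical, not the thesis configuration):

```python
import numpy as np

def conv_1x1(x, weights):
    """1x1 convolution over a (C, H, W) map: the same linear map applied
    at every spatial location, the convolutional counterpart of a
    fully connected layer."""
    c, h, w = x.shape
    out_c = weights.shape[0]  # weights: (out_C, in_C)
    return (weights @ x.reshape(c, h * w)).reshape(out_c, h, w)

rng = np.random.default_rng(0)
fused = rng.random((512, 6, 6))  # concatenated Pool5 feature maps
w = rng.random((256, 512))
out = conv_1x1(fused, w)
print(out.shape)  # (256, 6, 6)
```

Unlike a fully connected layer, this keeps the spatial structure of the concatenated features intact for the following layers.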
[Block diagram: the Pool2 and Pool5 outputs of both feature extraction networks are max-pooled and concatenated, then passed through the regression network (FCExtra, FC6, FC7 with dropout) to produce the target state.]

Figure A.3: Network architecture of the C5_{2,5} − C1 − F3 network.
94
Page 113
[Block diagram: the Conv3 ReLU output and the Pool5 output of both feature extraction networks are max-pooled and concatenated before the regression network (FCExtra, FC6, FC7, FC8 with dropout).]

Figure A.4: C5_{3,5} − C0 − F4 network. The ReLU output of the Conv3 layer is added as a low-level feature.
[Block diagram: as in Figure A.4, but the concatenated features pass through a ConvExtra layer with ReLU before FC6, FC7 and FC8 with dropout.]

Figure A.5: C5_{3,5} − C1 − F3 network. The fully connected layer of the C5_{3,5} − C0 − F4 network is replaced with a convolutional layer.
95
Page 114
[Block diagram: the Conv4 ReLU output and the Pool5 output of both feature extraction networks are max-pooled and concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.6: C5_{4,5} − C1 − F3 network. The ReLU output of the Conv4 layer is used as the low-level feature.
[Block diagram: the Pool1 and Pool5 outputs of both feature extraction networks are max-pooled and concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.7: C5_{1,5} − C1 − F3 network. Pool1 is used as the low-level feature.
96
Page 115
[Block diagram: the Pool2 output and the Conv4 ReLU output of both feature extraction networks are concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.8: C5_{2,4} − C1 − F3 network. Pool2 is used as the low-level feature and the ReLU output of the Conv4 layer is used as the high-level feature.
[Block diagram: as in Figure A.8, with an additional Conv5Extra layer (with ReLU) applied to the Conv4 ReLU output before concatenation.]

Figure A.9: C5_{2,4} − C2 − F3 network. A Conv5Extra layer is added to the C5_{2,4} − C1 − F3 network in order to extract dataset-specific high-level features.
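The low-level/high-level fusion used by the networks in Figures A.3 through A.9 requires bringing the low-level map down to the high-level map's spatial size before the two can be concatenated along the channel axis, which is what the max-pooling blocks in the diagrams do. A minimal NumPy sketch of that step (the channel counts and spatial sizes here are invented for illustration, not the thesis values):

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    return x[:, : h - h % k, : w - w % k].reshape(
        c, h // k, k, w // k, k).max(axis=(2, 4))

# Hypothetical shapes: a Pool2-like low-level map and a
# Conv4-ReLU-like high-level map from one feature stream.
low = np.random.default_rng(1).random((16, 24, 24))
high = np.random.default_rng(2).random((64, 6, 6))

# Pool the low-level map to the high-level spatial size (24 -> 6),
# then concatenate along the channel axis before the regressor.
low_pooled = max_pool2d(low, 24 // 6)        # (16, 6, 6)
fused = np.concatenate([low_pooled, high], axis=0)
print(fused.shape)  # (80, 6, 6)
```

The fused map carries both fine spatial detail from the early layer and the more abstract features of the late layer into the regression network.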