MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF MIDDLE EAST TECHNICAL UNIVERSITY
BY
IBRAHIM BATUHAN AKKAYA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF MASTER OF SCIENCE IN
ELECTRICAL AND ELECTRONICS ENGINEERING
SEPTEMBER 2016
Approval of the thesis:
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
submitted by IBRAHIM BATUHAN AKKAYA in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Electronics Engineering Department, Middle East Technical University by,
Prof. Dr. Gülbin Dural Ünver
Dean, Graduate School of Natural and Applied Sciences

Prof. Dr. Tolga Çiloğlu
Head of Department, Electrical and Electronics Engineering

Prof. Dr. Uğur Halıcı
Supervisor, Electrical and Electronics Engineering Department, METU
Examining Committee Members:
Prof. Dr. Gözde Bozdağı Akar
Electrical and Electronics Engineering Department, METU

Prof. Dr. Uğur Halıcı
Electrical and Electronics Engineering Department, METU

Assoc. Prof. Dr. İlkay Ulusoy
Electrical and Electronics Engineering Department, METU

Assoc. Prof. Dr. Emine Eren Koçak
Inst. of Neurological Sci. and Psychiatry, Hacettepe Uni.

Assist. Prof. Dr. Elif Vural
Electrical and Electronics Engineering Department, METU
Date: 09.09.2016
I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.
Name, Last Name: IBRAHIM BATUHAN AKKAYA
Signature :
ABSTRACT
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
AKKAYA, İbrahim Batuhan
M.S., Department of Electrical and Electronics Engineering
Supervisor: Prof. Dr. Uğur Halıcı
September 2016, 97 pages
Laboratory mice are frequently used in biomedical studies. The facial expressions of mice provide important data about various issues. For this reason, real-time tracking of the mouse face provides output both to the researcher and to software that operates directly on the face image. Since the body and face of a mouse are the same color and mice move fast, tracking the face of a mouse is a challenging task. In recent years, methods that use artificial neural networks have provided effective solutions to problems such as classification, decision making and object recognition, thanks to their ability to abstract the training dataset. In particular, convolutional neural networks, which are inspired by the visual cortex of animals, are very successful in computer vision tasks.
In this study, a method based on deep learning which uses convolutional neural networks is proposed for real-time tracking of the face of a mouse. Convolutional neural networks are good at extracting hierarchical features from the training dataset. High level features contain semantic information, and low level features have high spatial resolution. Target information is extracted from the combination of low and high level features by a convolutional layer to achieve a robust and accurate tracker. Although the proposed method is specialized in tracking the face of a mouse, it can be adapted to any target by changing the training dataset.
Keywords: Convolutional Neural Networks, Machine Learning, Object Tracking
ÖZ
MOUSE FACE TRACKING USING CONVOLUTIONAL NEURAL NETWORKS
AKKAYA, İbrahim Batuhan
M.S., Department of Electrical and Electronics Engineering
Supervisor: Prof. Dr. Uğur Halıcı
September 2016, 97 pages
Laboratory mice are frequently used in biomedical studies. During these studies, the facial expressions of mice provide the researcher with important data, giving clues about many issues. For this reason, real-time tracking of the mouse face during an experiment provides output both for the researcher and for software that works directly on the face. The fact that the bodies of laboratory mice are the same color as their faces, and that mice are very mobile, makes tracking the face of a mouse quite difficult. In recent years, methods developed on the basis of artificial neural networks, thanks to their ability to abstract the training dataset, have offered effective solutions to problems in many areas such as classification, decision making and object recognition. In particular, convolutional neural networks, inspired by the visual cortex of animals, have given very successful results in visual applications.
In this study, a deep learning based method using a convolutional neural network is proposed for real-time tracking of the mouse face in videos. Convolutional neural networks are successful at extracting hierarchical features from the training dataset. High level features contain semantic information, and low level features have high spatial resolution. To obtain a robust and accurate tracker, target information is extracted from the low and high level features using a convolutional layer. Although the proposed method is specialized in tracking the mouse face, it can be adapted to any target by changing the training dataset.
Keywords: Convolutional Neural Networks, Machine Learning, Object Tracking
To my wife, and to the Akkaya and Öztürk families...
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my supervisor Prof. Dr. Uğur Halıcı for her supervision, encouragement and guidance. It was a great honor to work with her. I also would like to thank the METU Computer Vision and Smart Systems Research Laboratory and the Hacettepe University Neurological Sciences and Psychiatry Institute, Behavior Experiments Research Laboratory members for creating the mice database.
I wish to thank ASELSAN A.Ş. for giving me the opportunity to continue my postgraduate education.
I am thankful for the support of TÜBİTAK (The Scientific and Technological Research Council of Turkey) through the BİDEB 2210 graduate student fellowship during my M.Sc. education.
This study is partially supported under TÜBİTAK project 115E248, Automatic Evaluation of Pain Related Facial Expression in Mice (Mice-Mimic) Project.
I am also grateful to my wife Burcu for her support, patience and belief in me.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ÖZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
CHAPTERS
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Thesis . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 BACKGROUND INFORMATION ON DEEP LEARNING . . . . . . 7
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Biological Neuron . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Sigmoid Neurons . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Artificial Neural Network Architectures . . . . . . . . . . . 14
2.7 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . 15
2.8 Back Propagation Algorithm . . . . . . . . . . . . . . . . . 16
2.9 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Initialization and Optimizers . . . . . . . . . . . . . . . . . 28
2.11 Convolutional Neural Networks . . . . . . . . . . . . . . . . 35
2.12 Some Popular CNN Architectures . . . . . . . . . . . . . . . 39
3 LITERATURE SURVEY . . . . . . . . . . . . . . . . . . . . . . . . 43
4 PROPOSED METHOD . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . 55
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 Data Augmentation . . . . . . . . . . . . . . . . . 59
4.3 Off-line Training . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 On-line Tracking . . . . . . . . . . . . . . . . . . . . . . . . 62
5 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . . . . . . 65
5.1 Performance Criteria . . . . . . . . . . . . . . . . . . . . . . 65
Center Error . . . . . . . . . . . . . . 65
Region Overlap . . . . . . . . . . . . 66
Tracking Length . . . . . . . . . . . . 66
Failure Rate . . . . . . . . . . . . . . 67
5.2 Test networks . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Test Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Effect of Convolutional Layer . . . . . . . . . . . 73
5.4.2 Effect of Low Level Features and Convolutional Layer in Feature Fusion Networks . . . . . . . . . 75
5.4.3 Effect of Depth of Low Level Features . . . . . . . 77
5.4.4 Effect of Depth of High Level Features . . . . . . 78
5.4.5 Overall Comparison . . . . . . . . . . . . . . . . 80
5.5 System Performance . . . . . . . . . . . . . . . . . . . . . . 82
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
APPENDICES
A BLOCK DIAGRAMS OF TEST NETWORKS . . . . . . . . . . . . 93
LIST OF TABLES
TABLES
Table 2.1 Convolutional Layers of LeNet-5 Network . . . . . . . . . . . . . . 39
Table 2.2 Layers in VGG-CNN-F Network . . . . . . . . . . . . . . . . . . . 40
Table 2.3 Differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Table 2.4 Differences among ConvNet Networks of VGG . . . . . . . . . . . 41
Table 5.1 Summary of Test Networks . . . . . . . . . . . . . . . . . . . . . . 69
Table 5.2 Tracker speeds of the C^5_{2,5}−C1−F^3 Network and Test Networks . . 83
LIST OF FIGURES
FIGURES
Figure 2.1 Hierarchical Features . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 2.2 Dataset Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 2.3 Biological Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 2.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Figure 2.5 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 2.6 Recurrent Network . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2.7 Multi Layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 2.8 Network with dropout . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 2.9 LeNet-5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 2.10 Local connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 2.11 VGG-CNN-F Network . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4.1 Proposed Tracker Network . . . . . . . . . . . . . . . . . . . . . . 57
Figure 4.2 Video Record Setup . . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 4.3 Target Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Figure 4.4 Augmented Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 4.5 Tracker Failure Example . . . . . . . . . . . . . . . . . . . . . . . 63
Figure 4.6 Width Ratio Histogram . . . . . . . . . . . . . . . . . . . . . . . . 64
Figure 5.1 Performance Measures Correlation . . . . . . . . . . . . . . . . . 67
Figure 5.2 True Positive vs Region Overlap Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.3 True Positive vs Normalized Center Error Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 5.4 Failure Rate vs Region Overlap Plot for C^5_{0,5}−C0−F^4 and C^5_{0,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . 75
Figure 5.5 True Positive vs Region Overlap Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . . . . . 76
Figure 5.6 True Positive vs Normalized Center Error Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . 76
Figure 5.7 Failure Rate vs Region Overlap Plot for C^5_{0,5}−C0−F^4, C^5_{0,5}−C1−F^3, C^5_{3,5}−C0−F^4 and C^5_{3,5}−C1−F^3 Networks . . . . . . . . . 77
Figure 5.8 True Positive vs Region Overlap Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . . . . . 78
Figure 5.9 True Positive vs Normalized Center Error Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . 79
Figure 5.10 Failure Rate vs Region Overlap Plot for C^5_{1,5}−C1−F^3, C^5_{2,5}−C1−F^3, C^5_{3,5}−C1−F^3 and C^5_{4,5}−C1−F^3 Networks . . . . . . . . . 79
Figure 5.11 True Positive vs Region Overlap Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . 80
Figure 5.12 True Positive vs Normalized Center Error Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . 81
Figure 5.13 Failure Rate vs Region Overlap Plot for C^5_{2,4}−C1−F^3, C^5_{2,4}−C2−F^3 and C^5_{2,5}−C1−F^3 Networks . . . . . . . . . . . . . . . . . . 81
Figure 5.14 Robustness vs Accuracy Plot of All Trackers . . . . . . . . . . . . 82
Figure A.1 C^5_{0,5}−C0−F^4 Network . . . . . . . . . . . . . . . . . . . . . 93
Figure A.2 C^5_{0,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 94
Figure A.3 C^5_{2,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 94
Figure A.4 C^5_{3,5}−C0−F^4 Network . . . . . . . . . . . . . . . . . . . . . 95
Figure A.5 C^5_{3,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 95
Figure A.6 C^5_{4,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 96
Figure A.7 C^5_{1,5}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 96
Figure A.8 C^5_{2,4}−C1−F^3 Network . . . . . . . . . . . . . . . . . . . . . 97
Figure A.9 C^5_{2,4}−C2−F^3 Network . . . . . . . . . . . . . . . . . . . . . 97
LIST OF ABBREVIATIONS
MSE Mean Squared Error
CNN Convolutional Neural Network
FC Fully Connected
UHD Ultra High Definition
FPS Frame Per Second
SGD Stochastic Gradient Descent
NAG Nesterov Accelerated Gradient
RMS Root Mean Square
ADAM Adaptive Moment Estimation
VOT Visual Object Tracking
AUC Area Under Curve
ILSVRC ImageNet Large Scale Visual Recognition Challenge
VGG Visual Geometry Group
CHAPTER 1
INTRODUCTION
1.1 Motivation and Overview
In biomedical studies, laboratory mice are frequently used. In some studies, the facial expressions of the mice give important clues to researchers. However, detection and analysis of the face of a mouse requires extra labor, and automating that process would save time. Therefore, tracking the face of a mouse is an important application in biomedical areas. With a successful tracker, researchers need to define only the initial location of the mouse's face; the tracker algorithm then finds the face in the following frames by tracking it.
Recently, deep learning has become one of the most popular methods in the machine learning field. This artificial intelligence approach merges representation learning with classification or regression methods. There is no need for human intervention in deep learning algorithms: since features are learned from example data, the algorithms can easily adapt to the training data space. This property makes deep learning very adaptable to different kinds of problems. Some application fields of deep learning are object detection [1, 2, 3], object recognition [4, 5, 6], pose estimation [7], image segmentation [8], image stylization [9], image classification [10], age and gender classification [11], activity recognition [12] and object tracking.
In this study, the main goal is to implement a real-time algorithm that tracks the face of a mouse. In object tracking applications, the purpose is to track an initially defined target as long as the target is in the video frame. There are difficulties that trackers may encounter, such as fast and abrupt motion, variation in pose, cluttered background, occlusion, object deformation and illumination changes. A good tracker should be able to follow the target without being affected by these difficulties. In recent studies, deep learning algorithms have been used to overcome these difficulties in tracking applications.
Different kinds of methodology can be used to implement an object tracking algorithm. In recent years, deep learning algorithms have been used in object tracking and have achieved state-of-the-art performance. The Visual Object Tracking (VOT) challenge has been held every year since 2013. In the VOT challenge, trackers are benchmarked on test video sequences with different attributes. Deep learning based trackers entered the VOT challenge in 2015 for the first time. In the VOT 2015 challenge, three visual object trackers, namely MDNet [56], DeepSRDCF [48] and SODLT [58], were based on convolutional neural networks. Among 62 trackers, MDNet took first and DeepSRDCF took second place in terms of the region overlap ratio performance criterion. Since single object trackers based on deep learning algorithms perform better, this study focuses on deep learning based trackers. In this thesis, background information about deep learning, a literature survey on single object trackers that use artificial neural networks, the proposed method and tests of the proposed method are given.
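The region overlap ratio used as a performance criterion above is the intersection-over-union of the predicted and ground-truth bounding boxes. A minimal sketch follows; the (x, y, width, height) box format is an assumption for illustration, not the VOT toolkit's exact API.

```python
def region_overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x, y, w, h) tuples; 1.0 means identical boxes,
    0.0 means no overlap at all.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width/height of the intersection rectangle (clamped at zero).
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Identical boxes overlap fully:
print(region_overlap((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0
# A box shifted by half its width: intersection 50, union 150 → 1/3
print(region_overlap((0, 0, 10, 10), (5, 0, 10, 10)))
```

Ranking trackers by this measure rewards predictions that match the ground truth in both position and size, unlike center error, which ignores box size.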
The mouse face tracker proposed in this study is implemented with a deep learning algorithm. As the deep learning network model, deep bounding box regression with a convolutional neural network, which is very powerful in vision tasks, is used.
1.2 Organization of the Thesis
In Chapter 1, introductory information about this thesis is supplied to the reader. The motivation behind this thesis, its organization and its contributions are described in this part.
In Chapter 2, the background needed to understand why deep learning has become so powerful and how it works is explained. Although the first studies on artificial neural networks were made in the 1950s, they have gained popularity only in recent years. In order to understand what has made artificial neural networks so powerful today, a brief history of neural networks is presented. Deep learning is a mathematical algorithm inspired by the human brain, and a pure explanation of the mathematical model is not enough to understand its mechanism. Therefore, Chapter 2 also describes the biological model of the neuron and its basic mechanism. After the biological neuron model, different kinds of artificial neural networks are explained, from a single neuron model to deep neural network structures. The learning mechanism of neural networks with the back-propagation algorithm is also described, together with the weak points of training and how to overcome them.
In the recent past, deep learning algorithms have started to be used in tracking. Different methods with different architectures have been proposed. While early studies used multilayer perceptrons, recent studies have focused on convolutional neural networks with transfer learning. Tracking algorithms based on deep neural networks are reviewed in detail in Chapter 3.
In Chapter 4, the tracker proposed in this thesis is presented. The network architecture of the proposed tracker is composed of four networks used together. Firstly, the structure of the networks and the layers that form them are stated. For fast tracking, the neural network proposed in this thesis is trained purely off-line and no model adaptation is performed. Dataset generation and the training of the network are also explained in this chapter. Finally, how tracking is performed on a video sequence is described.
In Chapter 5, the performance of the proposed tracker is evaluated. Firstly, the performance criteria for tracking are stated and the performance measures are given. In addition, 9 test networks with different architectures are proposed in order to evaluate the performance of the proposed tracker. The performance of these trackers is presented with graphical visualizations.
1.3 Contributions
There have been some studies on tracking using deep neural network architectures. Most of these trackers are trained on-line starting from the first frame. Generally, they generate patches around the target and label them as positive or negative according to the heuristics they use. Their networks are trained with these patches. In these methods, a large number of forward passes, proportional to the number of patches, is made, and training is performed at test time. These forward passes and training are very time consuming; therefore, on-line trained trackers are very slow.
The tracker proposed in [13] is trained purely off-line and finds the track with a single forward pass. It uses pre-trained networks as feature extractors: two identical pre-trained networks extract semantic features from two consecutive frames, and an additional network, composed of fully connected layers, localizes the target from the concatenation of these features.
In this thesis, the tracker proposed in [13] is taken as the starting point. However, Held's tracker is trained with generic objects, whereas laboratory mice have characteristic properties that should be considered. Two of the most important ones are that laboratory mice are albino and that they are very mobile. Since the body and face of a laboratory mouse are the same color and mice move fast, tracking the face of a mouse is a challenging task. To overcome these problems, the neural network should adapt to mouse-specific features. Within this thesis, a mice dataset is generated in order to train the neural network. The target area is chosen to be square, since the face of a mouse usually fits in a square bounding box, and the network architecture is modified so that it tracks the target without deforming the square shape.
Although high level features are useful for identifying the object in a given image, they cannot localize the target precisely due to their large receptive field. If only high level features are used, the network cannot regress to the bounding box of the target precisely. In the proposed method, low and high level features are used together.
Concatenated features still contain spatial information, since there is no fully connected layer, which would distort spatial information, before the concatenation. Convolutional layers are better at exploiting spatial information. In addition, the depth of the input is taken into account in the convolution operation, which means that features related to the content along the depth are extracted as well. In the proposed method, the first layer of the last network, which is responsible for regressing to the target bounding box, is replaced with a convolutional layer in order to keep spatial information and merge all features along the depth.
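The depth-merging step described above can be sketched with a toy example. The feature-map sizes, channel counts and the 1x1 kernel below are illustrative assumptions, not the exact thesis architecture: they only show how a convolution mixes all depth channels while preserving spatial layout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concatenated features: low-level maps keep spatial
# detail, high-level maps are semantic; both assumed resampled to the
# same 16x16 grid and stacked along the depth axis.
low = rng.normal(size=(64, 16, 16))     # (depth, height, width)
high = rng.normal(size=(128, 16, 16))
fused_input = np.concatenate([low, high], axis=0)  # depth 192

# A 1x1 convolution: at every spatial position, all 192 depth channels
# are linearly mixed into 32 output channels. Unlike a fully connected
# layer, the 16x16 spatial layout is untouched.
weights = rng.normal(scale=0.01, size=(32, 192))
out = np.einsum('oc,chw->ohw', weights, fused_input)

print(out.shape)  # (32, 16, 16): spatial resolution preserved
```

A fully connected layer in the same place would flatten the 16x16 grid into one vector and lose which activation came from which position; the convolution keeps that information available for the bounding box regression that follows.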
With these contributions, the performance of the tracker is increased significantly.
CHAPTER 2
BACKGROUND INFORMATION ON DEEP LEARNING
From birth to death, the primal skills of humans improve subconsciously. The senses constantly supply data to the human brain, where this information is processed with a supervised feedback mechanism. The remarkable learning power of the human brain is what makes humans so adaptive and intelligent.
Artificial intelligence algorithms aim to achieve human-like decision making, visual perception and so on. Some training algorithms of artificial intelligence are inspired by biological learning: researchers try to build computational models of the training mechanism of the brain. Deep learning is a method for artificial intelligence inspired by the biological brain. It is a type of machine learning algorithm that learns from example data, called the training set. Human intervention such as feature extraction from the input is not necessary, since the algorithms can be trained purely with the training data. A deep learning network architecture is also called an artificial neural network due to its resemblance to the brain.
2.1 Overview
Although artificial neural networks have a long history, they have become much more popular recently. They became more powerful with the increasing amount of available training data. Over time, both the hardware and software environments for neural networks have improved, which makes it possible to train more complex and bigger networks with large amounts of training data. As a result, neural networks can solve more complex problems with lower error rates, and they have gained popularity.
Figure 2.1: Hierarchical features extracted from a deep neural network, from the Deep Learning Book [14]
Deep neural networks have a hierarchical structure: simple neural layers are connected on top of the previous one. If the whole network architecture is examined, a deep chain of connections is seen; this is why the approach is called deep learning. Due to this hierarchical structure, a neural network can learn complicated concepts. Every layer of the network generates features from the previous layer, and as depth increases, more complicated features of the input data can be learned. Figure 2.1 shows how a neural network represents its input hierarchically.
The performance of machine learning algorithms depends on the representation of the data. Each piece of information in this representation is called a feature. In classical machine learning algorithms, features of the data are supplied to the artificial intelligence system by feature extractors designed by humans, for decision making, classification, etc. Although many problems can be solved by supplying suitable features to machine learning algorithms, it is sometimes hard to decide which features should be extracted for a given problem. In neural networks, not only the outputs but also the features are learned from examples. This property is called representation learning.
Figure 2.2: Dataset size increase over time, from the Deep Learning Book [14]

With the help of representation learning, neural networks are able to adapt to new tasks without human intervention. One of the best examples of representation learning with a neural network is the auto-encoder. This algorithm is composed of two functions: an encoder, which extracts features from the original data, and a decoder, which reconstructs the original data from the features the encoder extracted. When this network is trained on a training dataset, it learns to extract features that are specific to that dataset.
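The encoder/decoder scheme described above can be sketched with a minimal linear auto-encoder trained by gradient descent. The toy data, dimensions and learning rate are illustrative assumptions; real auto-encoders use nonlinear layers and larger datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions that really live on a
# 2-dimensional subspace, so a 2-unit code can reconstruct them.
basis = rng.normal(size=(2, 8))
data = rng.normal(size=(200, 2)) @ basis

# Encoder W_e maps 8 -> 2 features; decoder W_d maps 2 -> 8 back.
W_e = rng.normal(scale=0.1, size=(8, 2))
W_d = rng.normal(scale=0.1, size=(2, 8))

lr = 0.02
for _ in range(5000):
    code = data @ W_e        # encoder: extract features
    recon = code @ W_d       # decoder: rebuild the input from features
    err = recon - data
    # Gradient descent on the mean squared reconstruction error.
    grad_d = code.T @ err / len(data)
    grad_e = data.T @ (err @ W_d.T) / len(data)
    W_d -= lr * grad_d
    W_e -= lr * grad_e

print(np.mean(err ** 2))  # reconstruction error shrinks toward zero
```

Because reconstruction is the only training signal, the learned code ends up capturing exactly the structure present in the training data, which is the sense in which the features are "specific to the training dataset."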
2.2 History
Although the first studies on artificial neural networks were made in the 1950s, successful commercial applications appeared in the 1990s. Until the 90s, available datasets were limited, and some skill was necessary to get good results from deep learning algorithms with limited data. As dataset size increases, the need for expertise decreases: with more training data, a deep learning algorithm becomes better at generalizing the input data and therefore gives better performance. Dataset sizes have increased over time; Figure 2.2 shows that they have grown remarkably. According to [14], if an artificial neural network is trained with 5,000 labeled examples per category, it will achieve acceptable performance for supervised learning; if it is trained with a dataset of 10 million labeled examples, it will match or exceed human performance. Reaching successful results with small datasets and exploiting large amounts of unlabeled data is still an important research area.
Another reason for the success of artificial neural networks is that the computational resources needed to run much larger networks are available today. According to the connectionism approach, animals become intelligent when their large number of neurons work together; an individual neuron, or a small number of them, is not capable of building an intelligent system. It appears that artificial neural networks also work like that.
Until recently, the number of neurons in artificial neural networks was very small compared to the biological neural system of the mammalian brain. Since the introduction of the hidden layer, the size of artificial neural networks has doubled roughly every 2.4 years. If biological neural networks are examined, it is seen that biological neurons are not densely connected relative to the number of neurons in the brain. There are approximately 86 billion neurons in the human brain [15], and they make approximately 10,000 connections per neuron. Today, some neural networks make nearly as many connections per neuron as cats, at around 8,000 connections per neuron [16]; artificial neural networks are thus close to the human brain in terms of connections per neuron. If the growth trend continues, by the 2050s artificial neural networks will have the same number of neurons as the human brain.
Growth in network size has been made possible by improved hardware and software infrastructure. As network size increases, the memory required for the weights and the computational power needed for training and evaluation increase. As stated before, artificial neural networks are composed of layers, each connected on top of the previous one: a layer takes the output of the previous one and operates on it, but the neurons within each layer work in parallel. With improvements in general purpose GPUs, distributed computing software on GPUs has become available. Deep learning frameworks such as Theano [17], Caffe [18] and TensorFlow [19] are designed to exploit this property: they use the parallel processing capability of GPUs to evaluate the network connectivity quickly. Today GPUs provide much faster computation than CPUs for neural networks, due to the large number of processors on a GPU. With the help of improvements in memory size and distributed computing, artificial neural network size has increased significantly.
Early networks were able to recognize only a limited number of categories, while modern networks can recognize more than 1,000 different categories. The object recognition contest ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is held each year.
Figure 2.3: Biological Neuron, from Wikimedia [21]
Algorithm results are evaluated by a performance criterion called the top-5 error rate: the algorithm outputs its 5 most probable classes among the 1,000, and a prediction is counted as erroneous if the correct class is not among them. In 2012, Krizhevsky et al. [10] reached state-of-the-art performance with convolutional neural networks, bringing the top-5 error from 26.1% down to 15.3%. The contest was won by deep convolutional networks in the following years as well. In 2015, ResNet [20] won ILSVRC 2015 with a 3.57% top-5 error, which is at human level. Deep learning is also used in many other fields such as speech recognition, image segmentation, pedestrian detection and object tracking.
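The top-5 error rate described above can be sketched directly; the toy scores below are illustrative, not ILSVRC data.

```python
def top5_error(predictions, labels):
    """Top-5 error rate: a sample counts as an error only if its true
    class is not among the five highest-scoring classes."""
    errors = 0
    for scores, label in zip(predictions, labels):
        # Indices of the five highest scores.
        top5 = sorted(range(len(scores)),
                      key=lambda c: scores[c], reverse=True)[:5]
        if label not in top5:
            errors += 1
    return errors / len(labels)

# One toy sample with 10 classes: the true class (index 4) only ranks
# fifth by score, yet it still counts as correct under top-5.
scores = [[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0]]
print(top5_error(scores, [4]))  # → 0.0
print(top5_error(scores, [5]))  # → 1.0 (index 5 ranks sixth)
```

This is why top-5 error is far more forgiving than top-1 accuracy on a 1,000-class problem, and why the 26.1% to 15.3% drop in 2012 was considered so significant.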
This shows that, with further improvements in computational resources and datasets, artificial neural networks may provide solutions to much more sophisticated problems in the future.
2.3 Biological Neuron
Neurons are specialized cells in the human brain. The human cognitive system is composed of a large number of neurons: around 86 billion exist in the human brain, and on average each makes 10,000 connections to others. In this network, each neuron behaves as an information processing unit. A single neuron is not very intelligent; it is the brain, a connection of a large number of neurons, that constitutes human cognition.
A typical neuron is composed of a soma, dendrites and an axon. Figure 2.3 shows an illustration of a neuron cell. The soma is the body of the cell. Dendrites can be thought of as the inputs to the neuron, and the axon as its output. Although the working mechanism of a neuron is very complicated in detail, simplified models can easily be expressed in algebraic form. The basic mechanism of a neuron is as follows.
If impulses that reach the neuron via its dendrites cause the soma potential to exceed some threshold value, the neuron fires, that is, it sends an electrical pulse along its axon. The axon of a neuron is connected to the dendrites of other neurons. Therefore, the pulse of one neuron excites other neurons and may cause them to fire. These consecutive firings constitute the human cognition system.
The firing process of a neuron is very slow compared to computers. Even the fastest neurons in the brain fire at a rate of around 200 Hz [22], which may seem far slower than commercial computers. However, neurons operate simultaneously, and the firing of one neuron may trigger more than one other. The coordinated performance of the 86 billion neurons in the brain is what makes humans so intelligent.
2.4 Perceptron
The power of the brain encouraged researchers to work on brain-like systems inspired by the biological neuron. One of the earliest types of artificial neural network is the perceptron. Perceptrons [23] were developed by Frank Rosenblatt in the late 1950s. The perceptron is a simple mathematical model of the biological neuron. It takes several binary inputs and produces one binary output; there is only one output, but any number of inputs can be defined. Figure 2.4 shows the graphical representation of a perceptron.
Binary inputs are multiplied by weights, which are real numbers. If the sum of the weighted inputs exceeds a predefined threshold value (also a real number), the neuron outputs 1; otherwise it outputs 0. The algebraic form of the perceptron is given in (2.1).
\[
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j + b \le \text{threshold} \\
1 & \text{if } \sum_j w_j x_j + b > \text{threshold}
\end{cases}
\tag{2.1}
\]
Figure 2.4: Graphical representation of perceptron
Mainly, the perceptron is used in decision-making problems. Let the inputs of the perceptron be some conditions and the output be whether an action should be performed or not. By choosing appropriate weights and a bias value, a decision-making algorithm is obtained. If more than one layer is used, more complex decision-making algorithms can be designed; in that case, the algorithm decides to do something by evaluating the decisions from the previous layer.
Another application in which the perceptron is used is basic logical operations. A NAND gate can be implemented with a perceptron, and since the NAND gate is universal (any logical operation can be implemented with NAND gates), any logical computation can be made with a collection of perceptrons. However, without automatic tuning, an artificial neural network does not provide any improvement over standard logical operations. Therefore, the need emerged for a learning algorithm that adjusts the weights and biases using data. In learning, the main purpose is to obtain the desired output by adjusting the weights. However, a small change in a weight may not affect the output of the perceptron, because the output is a step function. That makes training perceptrons hard. This difficulty is solved by sigmoid neurons.
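As an illustration, the perceptron rule (2.1) with hand-picked weights implementing a NAND gate can be sketched as follows (the function and parameter names are illustrative, not from the thesis):

```python
def perceptron(inputs, weights, bias, threshold=0.0):
    """Perceptron rule (2.1): output 1 if the weighted sum of the inputs
    plus the bias exceeds the threshold, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > threshold else 0

# NAND gate: with weights (-2, -2) and bias 3 the output is 0 only
# when both binary inputs are 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron((x1, x2), (-2, -2), 3))
```

With these weights, the weighted sums for the four input pairs are 3, 1, 1 and -1, so only the (1, 1) case falls at or below the threshold and yields 0.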
2.5 Sigmoid Neurons
The algebraic model of the sigmoid neuron is almost the same as that of the perceptron. The difference is that sigmoid neurons take real-valued inputs and output the sigmoid of the weighted sum of the inputs and the bias. The sigmoid function is plotted in Figure 2.5.

Figure 2.5: Sigmoid Function

The explicit function of the sigmoid neuron is given in equation (2.2). Any small change in a weight directly changes the output of the sigmoid neuron. Even though the change in the output is small, if corrections of the output error are repeated iteratively, satisfactory results are obtained. In fact, it is the smoothness of the sigmoid function that makes training possible. The effect of a weight change on the output is given by the partial derivative of the output with respect to that weight. The change in the output in terms of these partial derivatives is shown in equation (2.3), where w represents the weights and b represents the bias. If the output function is not differentiable, this effect cannot be expressed in algebraic form, which makes training very hard. In the following sections, how these properties make training possible will be explained.
\[
f(x) = \frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)} \tag{2.2}
\]

\[
\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b \tag{2.3}
\]
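The approximation in (2.3) can be checked numerically. A minimal sketch (names are illustrative): the derivative of the sigmoid output with respect to a weight is sigma'(z) x_j, and a small weight change produces almost exactly that change in the output.

```python
import math

def sigmoid_neuron(x, w, b):
    """Output of a sigmoid neuron, equation (2.2)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def d_output_d_w(x, w, b, j):
    """Partial derivative of the output w.r.t. weight j: a(1 - a) * x_j."""
    a = sigmoid_neuron(x, w, b)
    return a * (1 - a) * x[j]

# Equation (2.3): a small change dw in one weight changes the output by
# approximately (d output / d w_j) * dw.
x, w, b = [0.5, -1.0], [0.8, 0.3], 0.1
dw = 1e-4
approx = d_output_d_w(x, w, b, 0) * dw
actual = sigmoid_neuron(x, [w[0] + dw, w[1]], b) - sigmoid_neuron(x, w, b)
print(abs(approx - actual) < 1e-8)  # the two agree to first order
```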
2.6 Artificial Neural Network Architectures
Basically, if a large number of artificial neurons are connected to each other, the result is called an artificial neural network. Artificial neural networks are classified according to their connection types. There are two types of artificial neural network architecture, namely feed-forward and recurrent neural networks.
Figure 2.6: Recurrent Artificial Neural Network
If a network architecture contains backward connections, the network is called a recurrent neural network. These backward connections create loops in the network: neurons fire for a while until the network reaches a steady state, so the neural network constitutes a dynamical system. The states of the neurons (firing or not) change while the network is not in steady state, and the firing of neurons stimulates other neurons. When the network reaches a steady state, the states stabilize. It is expected that this steady state of the network corresponds to the desired data. Figure 2.6 shows an example of a recurrent neural network. Each line represents a connection in the direction of the arrow, and each connection has its own connection weight. Since the method proposed in this thesis does not include recurrent neural networks, they will not be detailed in the following sections.
In a feed-forward neural network, there are only connections in the forward direction. Feed-forward neural networks will be presented in detail in the following sections.
2.7 Multilayer Perceptron
Feed-forward neural networks are composed of three different types of layers: the input, output and hidden layers. Neurons in the input layer are called input neurons. Each of them is responsible for feeding data into the network; an input neuron can be thought of as a neuron with no input and one output, which is the data itself. Neurons in the hidden and output layers are regular artificial neurons with multiple inputs and one output. Neurons in the output layer (output neurons) emit the output data of the network.
Hidden neurons do not have a special property; they are called hidden because the outputs of these neurons are not observable by the user. Although the design of a neural network architecture can be tricky, the design of the input and output layers is very straightforward: the number of neurons in the input layer is equal to the dimension of the data, and the number of output neurons is equal to the number of outputs needed. For example, the number of neurons in the output layer is 1000 for a classification problem with 1000 different classes. There are different kinds of feed-forward neural networks. If the network consists of one layer in which all input neurons are connected to all output neurons, it is called a single layer perceptron or a single layer fully connected network. If the network contains one or more hidden layers, it is called a multilayer perceptron or simply a fully connected network. In these networks, the inputs of each hidden neuron are connected to all neurons in the preceding layer, and the output of each hidden neuron is connected to the inputs of every neuron in the following layer. Although these networks are called multilayer perceptrons, the neurons do not need to be perceptrons in general; they can be sigmoid neurons or neurons with different activation functions.
In fact, in recent neural networks, the ReLU activation is used more frequently. The ReLU (rectified linear unit) is a one-input, one-output function: if the weighted input is bigger than zero, the input is transferred to the output identically; otherwise, the output is zero. ReLU increases sparsity and overcomes the vanishing gradient problem that is faced with the sigmoid function.
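The ReLU and its gradient can be sketched in a few lines (illustrative, not from the thesis):

```python
def relu(z):
    """ReLU: transfer positive weighted inputs identically, clamp the rest to 0."""
    return z if z > 0 else 0.0

def relu_grad(z):
    """The gradient is exactly 1 for positive inputs, so it does not shrink
    as errors are propagated backward (unlike the sigmoid's gradient)."""
    return 1.0 if z > 0 else 0.0

print([relu(z) for z in (-2.0, 0.0, 1.5)])  # [0.0, 0.0, 1.5]
```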
A simple multilayer perceptron architecture with 2 hidden layers, 3 inputs and 1 output is given in Figure 2.7.
2.8 Back Propagation Algorithm
The main goal in training is to obtain the desired output for a given input. To evaluate how well this goal is achieved, a cost function is used. The cost function is a non-negative function that penalizes the error between the network output and the desired output. An example cost function is given in (2.4).
Figure 2.7: Multilayer perceptron architecture with 2 hidden layers, 3 input neurons and 1 output neuron
\[
C(w, b) = \frac{1}{2N} \sum_{x=1}^{N} \| y(x) - a(w, b) \|^2 \tag{2.4}
\]
In the equation, w corresponds to all weights and b to all biases in the network, N is the total number of training inputs, y(x) is the desired output, and a is the network output for the given w and b. For each x, a different a is obtained by feeding x forward through the network. This cost function is called the mean squared error (MSE) function. If it is examined, it can be seen that the cost is non-negative and gets close to zero when the network outputs, i.e. the predictions of the network, are close to the desired ones. Conversely, the cost grows with the square of the prediction error. As stated above, the main goal of a training algorithm is to reach the desired output by minimizing the cost function.
One of the most effective (and most widely used) algorithms for training feed-forward neural networks is the back propagation algorithm, which gained popularity after the publication of the famous paper [24] in 1986. Briefly, the back propagation algorithm computes the gradients of the cost function with respect to the weights and biases. The gradient gives the direction in the weight and bias space in which the cost function increases most. By subtracting the gradients from the weights and biases, the algorithm tries to minimize the cost.
Before getting into detail, let’s go over the notation that will be used in this thesis.
$w^l_{jk}$ denotes the connection weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. $b^l_j$ denotes the bias of the $j$th neuron in the $l$th layer. $a^l_j$ denotes the activation value of the $j$th neuron in the $l$th layer. The output value of an individual neuron is called its activation; in more algebraic terms, for sigmoid neurons it is the sigmoid of the weighted sum of the inputs and the bias. The activation function does not need to be the sigmoid; in general it is represented with $\sigma$. The generic form of the activation of a neuron is given in (2.5).
\[
a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \tag{2.5}
\]
In this thesis, vectorized function representation is used. Vectorization means that the function is applied to every element of an input in vector form. The vectorized representation of (2.5) is shown in (2.6).

\[
a^l = \sigma(w^l a^{l-1} + b^l) \tag{2.6}
\]
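The vectorized forward pass (2.6) can be sketched with NumPy; the layer sizes and names below are illustrative, not the network of the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(a, weights, biases):
    """Apply equation (2.6) layer by layer: a^l = sigma(w^l a^{l-1} + b^l)."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Toy network: 3 inputs -> 4 hidden -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (4, 3)), rng.normal(0, 0.1, (2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(feed_forward(np.array([1.0, 0.5, -0.5]), weights, biases).shape)  # (2,)
```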
Two assumptions about the cost function should be satisfied in order to apply back propagation. The first is that the cost function can be written as the average of the cost functions of the individual training samples. This assumption is necessary because, to compute the partial derivatives of the cost function with respect to the weights and biases, the partial derivatives of the cost of a single training sample are calculated first; the derivative of the cost function is then obtained by averaging these individual values. The second assumption is that the cost function can be written in terms of the outputs of the network.
In this section, the mean squared error function will be used to illustrate the back propagation algorithm.
The back propagation algorithm is based on some algebraic operations. One of them is the Hadamard product, the element-wise product of two matrices. In this thesis, the Hadamard product is denoted by $\odot$.
As stated above, the back propagation algorithm is based on taking partial derivatives of the cost function. In order to compute those derivatives, a term that represents an intermediate error should be defined: the error of the $j$th neuron in the $l$th layer, denoted $\delta^l_j$. This error will be helpful in simplifying the equations. Another useful quantity is the weighted input, which is the weighted sum of the inputs and the bias. In algebraic form, the weighted input is shown in (2.7).
\[
z^l = w^l a^{l-1} + b^l \tag{2.7}
\]
$\delta^l_j$ is defined in (2.8): it measures how much the cost changes for a small change in the weighted input, and can therefore be interpreted as an error.

\[
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j} \tag{2.8}
\]
Starting from the general cost definition, $\delta^l$ will be computed and related to the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. Back propagation can be defined with the help of four equations.
1. The first equation is the error at the output layer. The output error is the error in the cost function caused by its weighted input $z^L$, where the uppercase $L$ denotes the output layer. The derivation of the output error is as follows.

\[
\delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{2.9}
\]
By applying the chain rule to the derivative above, it can be expressed in terms of the output activations.

\[
\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \tag{2.10}
\]
The activation of a neuron depends only on its own weighted input. Therefore, when $k$ is not equal to $j$, $\partial a^L_k / \partial z^L_j$ vanishes, and the equation above can be written in a simpler form.

\[
\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \tag{2.11}
\]
Since the activation of a neuron is $\sigma(z^L_j)$, the output error can be written as (2.12).

\[
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{2.12}
\]
In matrix form, it can be represented with the Hadamard product.

\[
\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{2.13}
\]
The output error depends on the form of the cost function, where $\nabla_a C$ is the gradient of $C$ with respect to $a$. Computing the derivatives of a complex cost function may not be resource friendly; however, if an appropriate cost function is selected, the result is easily computable. If the MSE function (2.4), which is the example case of this thesis, is used, the derivative of the cost with respect to the activations is very simple.

\[
\partial C / \partial a^L_j = (a^L_j - y_j) \tag{2.14}
\]
If the error at the output is written in matrix form, it becomes (2.15), which can be easily computed.

\[
\delta^L = (a^L - y) \odot \sigma'(z^L) \tag{2.15}
\]
2. The second equation gives the error $\delta^l$ of a hidden layer in terms of the error of the next layer. An algebraic expression for $\delta^l$ in terms of $\delta^{l+1}$ is needed; it can be derived with the help of the chain rule.

\[
\delta^l_j = \frac{\partial C}{\partial z^l_j} \tag{2.16}
\]
\[
= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{2.17}
\]

\[
= \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \tag{2.18}
\]
To obtain a more simplified expression, $\partial z^{l+1}_k / \partial z^l_j$ will be derived. The more explicit form of $z^{l+1}_k$ is as follows.
\[
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k \tag{2.19}
\]
Differentiating with respect to $z^l_j$ yields a simpler form.
\[
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j) \tag{2.20}
\]
Substituting this into the error definition results in the backward error propagation:

\[
\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(z^l_j) \tag{2.21}
\]
In matrix form:

\[
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{2.22}
\]
By using (2.13) and (2.22), the errors in any layer can be computed. First, the output layer error $\delta^L$ is computed; using it, the error $\delta^{L-1}$ in the layer before the output can be computed, and so on. Iteratively, all errors in the network can be computed by backward error propagation.
3. The third equation is the rate of change of the cost with respect to the biases. It can also be derived with the help of the chain rule.

\[
\frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial a^l_k} \frac{\partial a^l_k}{\partial b^l_j} \tag{2.23}
\]
As noted above, the partial derivative of the activation with respect to the bias is zero when $k$ is not equal to $j$. The equation above can therefore be simplified as follows:

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial b^l_j} \tag{2.24}
\]
Writing the partial derivative of the activation with respect to the bias in terms of the partial derivative of the activation with respect to the weighted input and the partial derivative of the weighted input with respect to the bias, and substituting into the equation above, results in:

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \sum_k \frac{\partial a^l_j}{\partial z^l_k} \frac{\partial z^l_k}{\partial b^l_j} \tag{2.25}
\]
Since the partial derivative of the weighted input with respect to the bias is equal to 1 when $k$ equals $j$ and zero otherwise, the partial derivative of the cost with respect to the bias turns out to be the error term $\delta^l_j$.

\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} = \frac{\partial C}{\partial z^l_j} = \delta^l_j \tag{2.26}
\]
4. The last equation needed by the back propagation algorithm is the rate of change of the cost with respect to the weights. It can be written as follows.

\[
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} \tag{2.27}
\]
Notice that the first term is $\delta^l_j$ and the second term corresponds to $a^{l-1}_k$ (check equation (2.7)). The simple form of the fourth equation can be written as:

\[
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{2.28}
\]
Since the four fundamental equations have been derived, the back propagation algorithm can be defined in terms of them. As stated before, its main purpose is to compute the partial derivatives of the cost function with respect to the weights and biases. After the partial derivatives are calculated, they are multiplied by a constant called the learning rate and subtracted from the weights and biases. In this way, the cost function is minimized. The step-by-step procedure of the back propagation algorithm is given below.
Calculation of gradients:
1. The training data is given to the network. The input corresponds to the activation of the input layer, $a^1$.
2. The input is fed forward layer by layer using $z^l = w^l a^{l-1} + b^l$ and the activation function.
3. After all activations are calculated, the output layer included, the output layer error is computed using the first equation (2.13).
4. The error is back propagated through to the input layer using the second equation (2.22).
5. The gradients of the cost function are calculated using the third (2.26) and fourth (2.28) equations.
Note that only the gradient of one training sample is considered above. In practice, however, training samples are given in batches. The whole back propagation training algorithm for batch training is as follows:
1. Initialize all weights and biases.
2. For each training sample in the batch, calculate the gradients according to the procedure above.
3. Apply gradient descent on the weights and biases by averaging the gradients of all samples in the training batch, where $\alpha$ is the learning rate and $N$ the number of samples in a batch.
\[
w^l_{jk} \rightarrow w^l_{jk} - \frac{\alpha}{N} \sum_{x=1}^{N} \frac{\partial C}{\partial w^l_{jk}} \tag{2.29}
\]

\[
b^l_j \rightarrow b^l_j - \frac{\alpha}{N} \sum_{x=1}^{N} \frac{\partial C}{\partial b^l_j} \tag{2.30}
\]
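The whole procedure above can be sketched in NumPy. This is a minimal illustration of the four equations and the batch update (2.29)-(2.30) for a small sigmoid network with the MSE cost; layer sizes and names are illustrative, not the implementation of the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradients of the MSE cost for one training sample."""
    # Steps 1-2: feed forward, storing weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    sp = lambda z: sigmoid(z) * (1 - sigmoid(z))  # sigma'(z)
    # Step 3: output error, equation (2.15).
    delta = (activations[-1] - y) * sp(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_w[-1] = np.outer(delta, activations[-2])  # equation (2.28)
    grads_b[-1] = delta                             # equation (2.26)
    # Step 4: back propagate the error, equation (2.22).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sp(zs[-l])
        grads_w[-l] = np.outer(delta, activations[-l - 1])
        grads_b[-l] = delta
    return grads_w, grads_b

def update_batch(batch, weights, biases, alpha):
    """One gradient-descent step (2.29)-(2.30): average the per-sample
    gradients, then subtract them scaled by the learning rate."""
    n = len(batch)
    sum_w = [np.zeros_like(W) for W in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        gw, gb = backprop(x, y, weights, biases)
        sum_w = [s + g for s, g in zip(sum_w, gw)]
        sum_b = [s + g for s, g in zip(sum_b, gb)]
    weights = [W - (alpha / n) * s for W, s in zip(weights, sum_w)]
    biases = [b - (alpha / n) * s for b, s in zip(biases, sum_b)]
    return weights, biases
```

Repeated calls to `update_batch` decrease the MSE cost on the batch, which is exactly the minimization described in the text.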
2.9 Regularization
As noted, the weights and biases are the free parameters of artificial neural networks. In modern networks, the number of free parameters may be very large. Although this is what makes neural networks so powerful, it also brings some disadvantages, one of the most important of which is overfitting.
Learning algorithms try to minimize the cost function, that is, the error between the training set and the network output. Since a neural network has so many free parameters, after some point it starts to memorize the training set and loses its ability to generalize from the input dataset. In other words, although the value of the cost function on the training dataset keeps decreasing, the cost on the test dataset starts to increase. This is called overfitting. In this part, how to avoid overfitting will be explained.
One method to avoid overfitting is to increase the size of the training dataset. If the training dataset does not cover the problem with a satisfactory number of cases, the network cannot generalize and gives correct results only for cases that are close to the ones in the dataset. If the amount of data in the dataset is increased, the network will output more accurate predictions.
There may be cases where no more training data can be supplied. In these cases, artificial data generation, called data augmentation, can be used. Consider the human face detection problem, where the training dataset obviously contains human faces. In order to expand the training dataset, rotating the training images by, say, up to 10 degrees, or mirroring them horizontally, may be used. In both cases the modified image still contains a human face and can be used as a training sample.
Another method is to decrease the number of free parameters in the network. Although this may solve the overfitting problem, decreasing the number of free parameters lowers the power of the artificial neural network.
If the problem is complex and no more training data can be generated, these two methods cannot be applied. Fortunately, there is another method that can be used with a fixed network size and a fixed dataset: regularization.
One of the most commonly used regularization techniques is L2 regularization, in other words weight decay. The idea is to add an extra term, called the regularization term, to the cost function. Usually, weight decay is not applied to the bias terms. The generic definition of the regularized cost is given in (2.31).
\[
C = C_0 + \frac{\lambda}{2n} \sum_w w^2 \tag{2.31}
\]
The sum of the squares of all weights is added to the cost function. The regularization term is scaled by $\lambda/2n$, where $n$ is the number of weight parameters and $\lambda$ is called the regularization parameter. As an example, the regularized mean squared error function is given in (2.32).
\[
C(w, b) = \frac{1}{2N} \sum_{x=1}^{N} \| y(x) - a \|^2 + \frac{\lambda}{2n} \sum_w w^2 \tag{2.32}
\]
Let's check how the regularization term affects training. To understand its effect, the weight update equation for the regularized cost should be derived. The rate of change of the regularized cost function with respect to the weights is given in (2.33).
\[
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \tag{2.33}
\]
The gradient descent weight update rule for the regularized mean squared error cost function is given by:

\[
w \rightarrow \left( 1 - \frac{\alpha \lambda}{n} \right) w - \alpha \frac{\partial C_0}{\partial w} \tag{2.34}
\]
As can be seen from equation (2.34), the regularization term rescales the weights by $1 - \alpha\lambda/n$. Since $\lambda$ is a positive value, in every iteration the weights are forced towards zero.
There is a variant of the L2 norm called the L1 norm. In the L1 norm, the sum of the absolute values of all weights is added to the cost function instead of the sum of squares. The general definition of the L1 regularized cost function is given in (2.35).
\[
C = C_0 + \frac{\lambda}{n} \sum_w |w| \tag{2.35}
\]
The gradient descent weight update for the L1 regularized cost function is as follows, again using the mean squared error as the cost function.

\[
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w) \tag{2.36}
\]

\[
w \rightarrow w' = w - \frac{\alpha \lambda}{n} \operatorname{sgn}(w) - \alpha \frac{\partial C_0}{\partial w} \tag{2.37}
\]
If the L1 norm is used, a constant value, $\frac{\alpha\lambda}{n}\operatorname{sgn}(w)$, is subtracted from the weights at each iteration.
Both L1 and L2 regularization try to minimize the weights. The L2 norm affects the weight update in proportion to the weight magnitude, while the L1 norm has a constant effect. If the weights are small, L1 drives them to zero more aggressively than the L2 norm, which may cause oscillation around zero. Since the effect of the L2 norm is proportional to the weight value, it does not cause oscillation, and regularization becomes faster than the L1 norm for large values of $w$.
If a regularized cost function is used in training, small weights will be preferred; a large weight is preferred only when it makes $C_0$ small. After training, only the weights that decrease the cost function will be large. In other words, the weights of the features that are most distinctive for the training set will be large, and learning distinctive features improves generalization.
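The contrast between the two update rules (2.34) and (2.37) can be sketched on a single weight (the values are illustrative):

```python
def l2_step(w, grad_c0, alpha, lam, n):
    """L2 update (2.34): rescale the weight by (1 - alpha*lam/n), then descend."""
    return (1 - alpha * lam / n) * w - alpha * grad_c0

def sign(w):
    return (w > 0) - (w < 0)

def l1_step(w, grad_c0, alpha, lam, n):
    """L1 update (2.37): subtract the constant-magnitude term alpha*lam/n * sgn(w)."""
    return w - (alpha * lam / n) * sign(w) - alpha * grad_c0

# With a zero data gradient, L2 shrinks the weight proportionally each step,
# while L1 subtracts a fixed amount regardless of the weight's magnitude.
w = 0.5
for _ in range(3):
    w = l2_step(w, grad_c0=0.0, alpha=0.1, lam=1.0, n=10)
print(w)  # 0.5 * 0.99^3
```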
Another regularization method is called dropout. Dropout is different from L2 and L1 regularization: the cost function is not modified; the network is modified instead. Normally, during the training of a neural network, all neurons participate in feed-forward and back propagation. With dropout, however, a randomly selected, predefined percentage (the dropout ratio) of the neurons is removed from the network, and feed-forward and back propagation are applied only to the remaining neurons. After the weight update, the removed neurons are restored and a new randomly selected set of neurons is removed.
An example of a multilayer perceptron network with dropout applied is shown in Figure 2.8.
Say the dropout ratio is 0.5, which is the most common case. When feed-forward is applied for inference, the full network is used, which means the number of active hidden neurons in inference is twice that in training. To compensate for this, the weights of the hidden neurons are divided by two.
Figure 2.8: Multi layer perceptron with dropout from [25]
Why dropout prevents overfitting and improves test performance is not straightforward. To understand it, imagine there are two identical networks trained with the same training dataset but with a different sample order. After training, they will most probably end up with different weights, and they will also overfit differently.
Consider using these two networks for inference on a specific input. They will give different results, and which network gives the true result is unclear. Voting, or averaging the outputs of the networks, may be a powerful strategy to decide the true output. Therefore, the average of differently overfitted networks is expected to give a result that is not overfitted and to improve test performance. Since different neurons are active in different training iterations, using dropout in training is like training many different networks, and the output of a network trained with dropout is a kind of average of the outputs of different networks.
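The mechanics described above, that is, masking random neurons during training and rescaling the weights for inference, can be sketched as follows (names are illustrative, not from the thesis):

```python
import random

def dropout_mask(n, ratio, rng=random):
    """Keep each of n hidden neurons with probability (1 - ratio)."""
    return [0.0 if rng.random() < ratio else 1.0 for _ in range(n)]

def apply_dropout(activations, mask):
    """Zero out the activations of the removed neurons during training."""
    return [a * m for a, m in zip(activations, mask)]

def scale_for_inference(weight_row, ratio):
    """At inference the full network is used, so hidden weights are scaled
    by (1 - ratio) -- a division by two for the common ratio of 0.5."""
    return [w * (1 - ratio) for w in weight_row]

random.seed(0)
acts = [0.2, 0.9, 0.5, 0.7]
mask = dropout_mask(len(acts), ratio=0.5)
print(apply_dropout(acts, mask))  # roughly half the activations zeroed
```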
2.10 Initialization and Optimizers
To get good performance from a network, its initialization and training are very important. Karpathy et al. give valuable information about these concepts in the lecture notes of the convolutional neural networks for visual recognition course at Stanford University [26]. Initialization and parameter update strategies are explained below.
The back propagation algorithm tries to minimize the cost function with respect to the free variables of the network, namely the weights and biases. In order to avoid local minima, the starting state of the free variables is very important; a good initialization methodology is crucial to obtain good performance from the trained network.
The final state of the weights is unknown before training. However, empirical results suggest that, with proper training data, roughly half of the weights will end up positive and half negative. Since the mean of the weights is close to zero, initializing all weights with zero may seem logical. However, when all weights are zero, the output is the same for every training sample, so the back propagation algorithm propagates the same error. Because of that, symmetry should be broken in the weight initialization. Although assigning zero to all weights is a bad idea, assigning random numbers close to zero, which breaks the symmetry, is applicable. One of the common approaches in weight initialization is assigning random numbers with a Gaussian distribution; a variance of 0.01 is a practical value.
However, it is not always true that initializing with small numbers provides better performance. When the weights are small, the back propagation algorithm computes small gradients, and these gradients become even smaller as they are propagated back through the network to the input layer. For a deep network, the gradients may become so small that the weight updates barely affect the network. For such cases, bigger initial values should be considered.
When a network is initialized with a Gaussian distribution, all neurons are initialized with the same variance. In that case, the variance of the output of neurons with a large number of inputs becomes larger. The variance of the weighted input $s$ is derived in equation (2.38), where it is assumed that the means of the weights and the inputs are 0. The equations show that the variance is proportional to the number of inputs.
\[
\begin{aligned}
\operatorname{Var}(s) &= \operatorname{Var}\left( \sum_i^n w_i x_i \right) \\
&= \sum_i^n \operatorname{Var}(w_i x_i) \\
&= \sum_i^n [E(w_i)]^2 \operatorname{Var}(x_i) + [E(x_i)]^2 \operatorname{Var}(w_i) + \operatorname{Var}(x_i)\operatorname{Var}(w_i) \\
&= \sum_i^n \operatorname{Var}(x_i)\operatorname{Var}(w_i) \\
&= \left( n \operatorname{Var}(w) \right) \operatorname{Var}(x)
\end{aligned}
\tag{2.38}
\]
A solution is to adjust the variance of the Gaussian with respect to the number of inputs. Glorot et al. [27] proposed an initializer based on this idea: a Gaussian distribution whose variance is two divided by the sum of the number of neurons in the current layer and the number of neurons in the next layer. The explicit equation is given in (2.39), where $n_l$ is the number of neurons in the $l$th layer. This initializer is called the Xavier initializer.
\[
\operatorname{Var}[w^l] = \frac{2}{n_l + n_{l+1}} \tag{2.39}
\]
In this thesis, the Caffe framework [18] is used. In Caffe, the Xavier initializer is implemented with respect to only the number of neurons in the input of the layer. The variance of the Gaussian distribution in Caffe's Xavier initializer is given in (2.40).
\[
\operatorname{Var}[w^l] = \frac{2}{n_l} \tag{2.40}
\]
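A minimal sketch of the two variance choices (2.39) and (2.40) as stated in the text; this follows the equations above, not Caffe's actual source code, and the names are illustrative.

```python
import numpy as np

def xavier_init(n_in, n_out, fan_in_only=False, rng=None):
    """Gaussian weights with variance 2/(n_in + n_out), equation (2.39);
    with fan_in_only=True, use 2/n_in as in (2.40)."""
    if rng is None:
        rng = np.random.default_rng()
    var = 2.0 / n_in if fan_in_only else 2.0 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

W = xavier_init(300, 100, rng=np.random.default_rng(0))
print(W.shape, float(W.var()))  # empirical variance near 2/400 = 0.005
```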
Another important aspect of training is the parameter update mechanism. As discussed in the back propagation section, the parameters are updated with gradient descent algorithms, where the gradients are calculated by the back propagation algorithm.
There are three types of gradient descent algorithm in terms of the number of training samples used per update. In stochastic gradient descent (SGD), the gradients are calculated and the parameters are updated for every individual training sample. The batch gradient descent algorithm updates the parameters by averaging the gradients of all training samples in the dataset; one pass over all training samples is called an epoch. In the mini-batch gradient descent algorithm, the training samples are divided into fixed-size groups; for every group, the gradients are averaged and the parameters are updated. The group size is called the batch size. Mini-batch gradient descent is the most frequently used algorithm when the number of samples in the training set is large.
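The grouping of samples into mini-batches can be sketched as follows (names are illustrative):

```python
import random

def minibatches(samples, batch_size, rng=random):
    """Shuffle once per epoch, then yield fixed-size groups of samples.
    The last group may be smaller if the dataset size is not divisible."""
    samples = list(samples)
    rng.shuffle(samples)
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

random.seed(0)
batches = list(minibatches(range(10), batch_size=3))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```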
There are also different variants of gradient descent in terms of the update mechanism. These algorithms try to optimize training speed and performance by modifying the gradient descent algorithm. An informative overview of gradient descent algorithms is given in [28] by Ruder. These parameter update methodologies are explained below, based on stochastic gradient descent.
In the standard stochastic gradient descent (SGD) algorithm, also called vanilla SGD, a parameter is updated by subtracting the weighted gradient from itself. The multiplier of the gradient is called the learning rate and is represented with $\alpha$. Vanilla SGD is given in (2.41).
\[
w = w - \alpha \frac{dC}{dw} \tag{2.41}
\]
The gradient of the cost function with respect to a parameter represents the direction of steepest increase in the cost. By updating the parameter with the negative of the gradient, the cost is pushed down in the steepest descent direction. However, the stochastic gradient descent algorithm has some drawbacks. SGD is very slow near ravines, which are common around local minima. Near ravines, some directions are steeper than the direction towards the minimum; therefore, the cost function starts to oscillate near the minimum instead of converging directly to it.
The momentum method [29] brings a solution to that problem. It is inspired by the momentum phenomenon in physics: an additional velocity term with momentum is added to the gradient descent algorithm, and the parameter update is made according to that velocity. With momentum, the cost changes more smoothly and oscillations are avoided. In addition, local minima can be escaped with the help of the momentum term.
The SGD-with-momentum parameter update is given in (2.42), where $\mu$ is the momentum hyper-parameter; a typical value for $\mu$ is 0.9.

\[
\begin{aligned}
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw} \\
w &= w + v_t
\end{aligned}
\tag{2.42}
\]
A variation of momentum SGD is called Nesterov accelerated gradient (NAG) [30]. In Nesterov's momentum method, the next probable state of the weights is estimated with the help of the velocity term, and the parameter update uses the gradient of the cost function with respect to those estimated weights. This performs better than standard momentum, since the effect of the momentum on the gradient is also considered. The parameter update function is given in (2.43).
\[
\begin{aligned}
w_{\text{ahead}} &= w + \mu v_{t-1} \\
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw_{\text{ahead}}} \\
w &= w + v_t
\end{aligned}
\tag{2.43}
\]
In practice, a derived version of the Nesterov update (2.43) that is more similar to momentum SGD is preferred. The derived version of NAG is given in (2.44).

\[
\begin{aligned}
v_t &= \mu v_{t-1} - \alpha \frac{dC}{dw} \\
w &= w - \mu v_{t-1} + (1 + \mu) v_t
\end{aligned}
\tag{2.44}
\]
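The two update rules (2.42) and (2.44) can be sketched on a single parameter; the quadratic cost used in the demo is illustrative.

```python
def momentum_step(w, v, grad, alpha=0.1, mu=0.9):
    """SGD with momentum, equation (2.42)."""
    v = mu * v - alpha * grad
    return w + v, v

def nesterov_step(w, v_prev, grad, alpha=0.1, mu=0.9):
    """Nesterov accelerated gradient in the derived form (2.44)."""
    v = mu * v_prev - alpha * grad
    w = w - mu * v_prev + (1 + mu) * v
    return w, v

# Minimize C(w) = w^2 (gradient 2w) starting from w = 1.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # close to the minimum at 0
```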
Another parameter update strategy that increases performance is annealing the learning rate over time. A high learning rate causes big steps in the cost domain, which is good at the beginning of training, since it is desirable to approach the global minimum as quickly as possible. However, near the minimum, the minimum point may be missed because of those big steps, and settling into it takes a long time. Decreasing the learning rate near the minimum decreases the step size, so the probability of settling into the minimum increases. Annealing the learning rate is a tricky business: if the learning rate is decreased too aggressively, reaching the minimum becomes very slow; if it is decreased too slowly, training takes a long time due to oscillations around the minimum.
There are different implementation strategies for annealing the learning rate. The first one is step decay: the learning rate is reduced by some factor after a predefined number of epochs. The appropriate annealing schedule may change from network to network, so ad-hoc parameters may be necessary; step decay is convenient for such cases because its parameters are easy to control. The other two strategies are exponential decay and 1/t decay. In exponential decay, the learning rate decreases exponentially over time; the update formula is shown in (2.45). In 1/t decay, the learning rate is inversely proportional to time; the update rule is given in (2.46).
α(t) = α0e−kt (2.45)
α(t) = α0/(1 + kt) (2.46)
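The three schedules can be written out directly; the sketch below is our own illustration, with arbitrary example values for the drop factor, the step interval and the decay constant k.

```python
import math

def step_decay(a0, t, drop=0.5, every=10):
    """Step decay: multiply the initial rate by `drop` once every `every` epochs."""
    return a0 * (drop ** (t // every))

def exp_decay(a0, t, k=0.1):
    """Exponential decay, equation (2.45): a(t) = a0 * e^(-k*t)."""
    return a0 * math.exp(-k * t)

def inv_t_decay(a0, t, k=0.1):
    """1/t decay, equation (2.46): a(t) = a0 / (1 + k*t)."""
    return a0 / (1 + k * t)
```

For example, with a0 = 1 and k = 0.1, the 1/t schedule halves the learning rate by epoch 10, while the exponential schedule reduces it to e^(-1) ≈ 0.37.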
The methods discussed up to now manipulate the parameters globally: the same learning rate and momentum are applied to all parameters. If these hyper-parameters are adapted individually for each parameter of the network, better performance can be achieved. Two methods that use per-parameter adaptive learning rates are explained as follows.
The first adaptive learning method is called AdaGrad, proposed by Duchi et al. [31]. It basically updates parameters with small gradients more and parameters with large gradients less, which helps the cost escape from saddle points. AdaGrad is also a good strategy for sparse data. The parameter update formula for AdaGrad is given in (2.47).

Gt = Gt−1 + (dC/dw)²
w = w − (α / √(Gt + ε)) ∗ dC/dw (2.47)
Gt is a variable that holds the sum of squared gradients; its size is equal to the parameter size. ε is a smoothing constant with a typical value of 1e-8. Note that the effective learning rate is reduced for parameters with large gradients and increased for those with small gradients. Empirical results show that ε is a very important parameter; without it, the AdaGrad optimizer performs much worse. The beauty of AdaGrad is that there is no need to tune the learning rate manually: AdaGrad adapts it with respect to the cumulative gradients per parameter. However, this adaptation has a drawback. Gt increases over time due to the accumulation of squared gradients, which eventually makes the effective learning rate so small that gradients no longer contribute to training. The following two algorithms are proposed to solve that problem.
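A minimal NumPy sketch of the AdaGrad update (2.47); the function and variable names are ours, and the values in the example are illustrative.

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad update, following (2.47)."""
    G = G + grad ** 2                     # per-parameter sum of squared gradients
    w = w - lr * grad / np.sqrt(G + eps)  # large accumulated G => smaller effective rate
    return w, G
```

Note that on the very first step the update magnitude is roughly `lr` for every parameter, regardless of gradient scale: gradients of 1.0 and 0.1 produce nearly equal steps, which is exactly the per-parameter normalization described above.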
Remember that the sum of squared gradients causes learning rate decay in AdaGrad. In AdaDelta [32], this term is replaced by a decaying average of squared gradients. The averaging and the decay protect the adaptation term from growing without bound; with this, the effective learning rate does not vanish. γ represents the decay hyper-parameter. After replacing the squared sum of gradients, the formula in (2.48) is obtained.
E[(dC/dw)²]t = γ ∗ E[(dC/dw)²]t−1 + (1 − γ) ∗ (dC/dw)²t
w = w − (α / √(E[(dC/dw)²]t + ε)) ∗ dC/dw (2.48)
The authors of AdaDelta state that the update term should have the same units as the parameter being updated. To achieve this, they defined another variable: the decaying average of squared parameter updates, where the parameter update is denoted ∆w. The equation of the decaying average of squared parameter updates is given in (2.49).

E[(∆w)²]t = γ ∗ E[(∆w)²]t−1 + (1 − γ) ∗ (∆w)²t (2.49)

Note that the denominator of the effective learning rate in (2.48) is the root mean square (RMS) of the gradient, and the square root of (2.49) can likewise be defined as the RMS of the parameter update term. To satisfy the unit balance, the authors used the RMS of the parameter update term in place of the learning rate. The previous value of the parameter update term is used instead of the current one, since the current value is not yet known. When all pieces are merged, the update in (2.50) is obtained.

∆w = (RMS(∆w)t−1 / RMS(dC/dw)t) ∗ dC/dw
w = w − ∆w (2.50)
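The full AdaDelta step can be sketched as follows (our own illustrative naming; as in the AdaDelta paper, RMS(x) is computed as sqrt(E[x²] + ε), and no learning rate is needed):

```python
import numpy as np

def adadelta_step(w, Eg2, Edw2, grad, gamma=0.95, eps=1e-6):
    """One AdaDelta update, following (2.49)-(2.50)."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2            # decaying avg of squared grads
    dw = np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * grad   # RMS(dw)_{t-1} / RMS(g)_t * g
    w = w - dw
    Edw2 = gamma * Edw2 + (1 - gamma) * dw ** 2            # decaying avg of squared updates
    return w, Eg2, Edw2
```

With both running averages initialized to zero, the first steps are small (driven only by ε) and grow as the update statistics accumulate.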
RMSProp [33] is another method that tries to solve the decaying effective learning rate problem of the AdaGrad optimizer. RMSProp and AdaDelta were developed independently at about the same time. RMSProp solves the problem in a simpler way: the learning rate is not discarded; only the squared sum of gradients is replaced by a decaying average of squared gradients. The update rule given in (2.48) is in fact the parameter update rule for RMSProp. Hinton suggests 0.9 for γ and 0.001 for α as good default values.
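Since RMSProp is exactly the intermediate form (2.48), its step is short to write down (an illustrative sketch with Hinton's suggested defaults):

```python
import numpy as np

def rmsprop_step(w, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: equation (2.48) with the learning rate kept."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2  # decaying avg of squared gradients
    w = w - lr * grad / np.sqrt(Eg2 + eps)       # normalized step; Eg2 no longer grows unboundedly
    return w, Eg2
```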
Another frequently used optimizer is the Adaptive Moment Estimation (Adam) optimizer [34]. Adam can be thought of as RMSProp with a momentum term: it keeps the decaying average of squared gradients and, in addition, a decaying average of past gradients, which is similar to momentum. The decaying average terms are given in equation (2.51): the first equation is the decaying average of past gradients and the second is the decaying average of squared gradients. Note that m is an estimate of the first moment (the mean) of the gradients and v is an estimate of the second moment.
mt = β1 ∗ mt−1 + (1 − β1) ∗ dC/dw
vt = β2 ∗ vt−1 + (1 − β2) ∗ (dC/dw)² (2.51)
These two moment estimates are initialized with zeros. Kingma et al. observed that the moments are biased toward zero, especially when β1 and β2 are close to one. To overcome this, they proposed the bias-corrected terms shown in (2.52).
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t) (2.52)

Parameter updates are made with respect to the bias-corrected moments. The parameter update rule for the Adam optimizer is given in equation (2.53). The hyper-parameter values suggested in the paper are β1 = 0.9, β2 = 0.999 and ε = 1e−8.

wt = wt−1 − (α / (√v̂t + ε)) ∗ m̂t (2.53)
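Equations (2.51)-(2.53) combine into one update step. The sketch below is our own illustration (function and variable names are ours), using the paper's suggested defaults; `t` is the 1-based step index used for bias correction.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, following (2.51)-(2.53)."""
    m = b1 * m + (1 - b1) * grad       # first-moment estimate (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction of (2.52)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Thanks to the bias correction, the very first step has magnitude close to `lr` even though m and v start at zero.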
2.11 Convolutional Neural Networks
The convolutional neural network is a variation of the multi-layer perceptron, inspired by the visual cortex of animals. In 1968, Hubel and Wiesel [35] studied the mammalian visual cortex and observed that some neurons in the visual cortex respond to local regions of the visual field, which they called receptive fields. Convolutional neural networks are designed with the help of that idea.
Convolutional neural networks are composed of a sequence of layers. Each layer transforms a given input array, which is usually 3-dimensional, into another array. The most frequently used layers in convolutional networks are convolutional, ReLU, pooling, normalization and fully connected layers. A convolutional network, namely LeNet-5, proposed by LeCun et al. [36], is given in Figure 2.9 as an example. This network is composed of 2 convolutional, 2 pooling and 3 fully connected layers. Each layer type is explained as follows.
In the convolutional layer, neurons make spatially contiguous local connections that form receptive fields. Convolutional neural networks are mostly used in vision applications, where inputs are 2D for gray-scale images and 3D for color images. In Figure 2.10, the connection graph of an example network is shown; for simplicity, the input is represented as 1D. Each neuron makes connections to 3 neurons in the previous layer. Neurons in layer m are affected by 3 neurons in layer m−1, and neurons in layer m+1 are affected by 5 neurons in layer m−1. That means the effective receptive field of neurons at layer m is 3, while that of neurons at layer m+1 is 5. If the connection graph is examined, it is seen that the whole input field is covered, as in the visual cortex.

Figure 2.9: LeNet-5 Architecture taken from [37]

Figure 2.10: Local connectivity in multi layer perceptron
One of the most important properties of the convolutional layer is weight sharing: within the same layer, connections are constrained to share the same weights. For example, in Figure 2.10, the weights of the 3 connections from layer m−1 to layer m are the same for all neurons in layer m; the shared weights are drawn in the same color. With weight sharing, the number of free parameters is greatly reduced.

Convolutional layers have 4 hyper-parameters that control how the layer operates on the input array: kernel size, number of kernels, stride and zero-padding.
1. Kernel size determines the receptive field of each neuron at the output of the convolutional layer. Kernels contain the connection weights of the layer. Assume there is a three-dimensional input image with 3 channels and the kernel size is selected as 5 x 5. Then one neuron at the output of the convolutional layer is connected to a 5 x 5 area in the input. The kernel has the same depth as the input in order to fully cover it; therefore, one output neuron makes connections to 75 input neurons (5 x 5 x 3). To determine the value of an output neuron, the neuron values in its receptive field are multiplied by the kernel values, and the sum of these products gives the neuron value. The output array is generated by sliding the kernel over the input image; this output is called a feature map, and the operation is known as convolution.

2. The kernel number is how many kernels there are in the convolutional layer. Each kernel is slid over the input image and generates one feature map, so the number of feature maps is determined by the number of kernels. These feature maps are concatenated in depth to form a three-dimensional output.

3. Stride is the step size of the kernel slide in the convolution operation. For an image of size 25 x 25 and a kernel of size 3 x 3, there are 23 kernel positions in the horizontal direction if the stride is 1. If the stride is 2, there are 12 kernel positions, since the kernel is slid by 2.

4. The convolution operation makes the output smaller than the input. For an image of size 25 x 25, a kernel of size 3 x 3 and stride 1, the output size will be 23 x 23 x number of kernels. It is practical to add zeros around the input image in order to increase the output size. Zero-padding is the thickness of the border of zeros added around the input. If zero-padding is selected as 1, the padded input size will be 27 x 27 and the output size will be 25 x 25, which is equal to the original input size. The most common usage of zero-padding is to equate the input and output sizes.
The output size of a convolutional layer can be calculated by the following formula, where O is the width of the output, W is the width of the input, F is the kernel size, P is the zero-padding and S is the stride.

O = (W − F + 2P)/S + 1 (2.54)
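Equation (2.54) translates directly into code (a small helper of our own naming):

```python
def conv_output_size(W, F, P, S):
    """Output width of a convolutional layer, equation (2.54): O = (W - F + 2P)/S + 1."""
    assert (W - F + 2 * P) % S == 0, "hyper-parameters do not tile the input evenly"
    return (W - F + 2 * P) // S + 1
```

This reproduces the examples from the list above: a 25 x 25 input with a 3 x 3 kernel gives 23 at stride 1, 12 at stride 2, and 25 with zero-padding of 1.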
Convolutional layers can still be trained with the back-propagation algorithm; the gradient of a shared weight is obtained by summing the gradients computed at every position where that weight is used.

The ReLU layer is responsible for adding non-linearity to its input. ReLU (rectified linear unit) is a one-input, one-output function: if the input is bigger than zero, it is transferred to the output unchanged; otherwise the output is zero. The ReLU layer doesn't change the input size.
The pooling layer takes a rectangular block from the convolution output and subsamples it. There are several pooling methods, such as averaging and taking the maximum; max-pooling, in which the maximum value in the rectangular block is selected, is the most frequently used one. By sliding the rectangular block over the convolutional output, the filter response is sub-sampled. This layer can also be considered a regularization layer.
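Max-pooling over a single 2D feature map can be sketched as follows (an illustrative, loop-based implementation of our own, not optimized code):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max-pooling of a 2D feature map: the maximum of each size x size block is kept."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = block.max()  # keep only the strongest response in the block
    return out
```

With `size == stride` (as in the 2 x 2, stride-2 pooling of LeNet-5), the blocks are non-overlapping and each spatial dimension is halved.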
The normalization layer is used to normalize its input, although in recent networks normalization has become less popular. Local response normalization is one type of normalization layer. According to [10], this normalization helps generalization: for the classification problem in that paper, the test error rate decreased from 13% to 11% with normalization. The normalization formula is given in (2.55), where a is the input, b is the output, N is the depth of the input, x, y and i index the position of the neuron in the first, second and third dimensions, respectively, and k, n, α and β are hyper-parameters used to tune the normalization.
b^i_{x,y} = a^i_{x,y} / ( k + α ∗ Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})² )^β (2.55)
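Equation (2.55) can be implemented as a sum of squared activations over neighboring feature maps at the same spatial position. The sketch below is our own illustration; the default hyper-parameter values (k = 2, n = 5, α = 1e-4, β = 0.75) are the ones used in [10].

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across feature maps, following (2.55).

    `a` has shape (N, H, W): N feature maps of size H x W.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)          # clamp the window at the depth boundaries
        hi = min(N - 1, i + n // 2)
        s = (a[lo:hi + 1] ** 2).sum(axis=0)  # squared sum over neighboring maps
        b[i] = a[i] / (k + alpha * s) ** beta
    return b
```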
The ReLU, pooling and normalization layers don't contain free parameters to train.
Table 2.1: Convolutional Layers of LeNet-5 Network

Layer | Number of kernels | Kernel size | Stride | Padding | Max-pool size | Max-pool stride
1     | 20                | 5x5         | 1      | 0       | 2x2           | 2
2     | 50                | 5x5         | 1      | 0       | 2x2           | 2
2.12 Some Popular CNN Architectures
Convolutional neural networks are very popular in the image processing area. They are capable of extracting features that are discriminative for the training dataset. However, training a convolutional neural network requires a large number of training samples. In recent years, the sizes of publicly available datasets have increased significantly, and convolutional neural networks trained with such rich datasets achieve state-of-the-art performance, especially in image recognition.

It has been observed that the features extracted from these networks are also very discriminative for objects that are not in the training dataset. Since they perform well across different kinds of applications, it is common for pre-trained networks to be used directly in other systems as feature extractors, or for their weights to be transferred and fine-tuned with the target dataset.
Some popular convolutional neural networks are given as follows.
LeNet-5 [37] is a convolutional neural network designed for handwritten digit recognition. It is composed of two convolutional layers, each followed by a pooling layer, and three fully connected layers. The network architecture is given in Figure 2.9. The fully connected layers have 500 and 10 neurons, in order. Details of the convolutional layers are given in Table 2.1.
The Visual Geometry Group (VGG) of the University of Oxford proposed several architectures in two papers, [38] and [39]. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), image recognition algorithms are evaluated: each algorithm makes top-5 guesses from 1000 classes on the test dataset, and performance is measured by the top-5 error.

For ILSVRC-2012, VGG proposed three different CNN architectures, called VGG-CNN-F, VGG-CNN-M and VGG-CNN-S, where F is the
Figure 2.11: VGG-CNN-F Network Architecture
Table 2.2: Layers in VGG-CNN-F Network

Layer | Number of kernels | Kernel size | Stride | Padding | Max-pool size | Max-pool stride
1     | 64                | 11x11       | 4      | 0       | 3x3           | 2
2     | 256               | 5x5         | 1      | 2       | 3x3           | 2
3     | 256               | 3x3         | 1      | 1       | -             | -
4     | 256               | 3x3         | 1      | 1       | -             | -
5     | 256               | 3x3         | 1      | 1       | 3x3           | 2
abbreviation of fast, M the abbreviation of medium and S the abbreviation of slow. They are trained with roughly 1.2 million images taken from the ImageNet [40, 41] dataset. They reach 16.7%, 13.7% and 13.1% top-5 error rates on the ILSVRC-2012 dataset, respectively.

The generic architecture of the VGG networks is given in Figure 2.11, and the layer details of VGG-CNN-F are given in Table 2.2. LRN corresponds to local response normalization.

All three networks share the same generic architecture; the differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S are given in Table 2.3.
Table 2.3: Differences among VGG-CNN-F, VGG-CNN-M and VGG-CNN-S Net-works
Table 2.4: Differences among ConvNet Networks of VGG
In ILSVRC-2014, the Visual Geometry Group proposed 6 different convolutional neural networks with deeper architectures, called ConvNets. The depth of the ConvNets ranges from 11 to 19 weight layers. A summary of the ConvNets is given in Table 2.4, where convA-B corresponds to a convolutional layer with B kernels of size A x A, and FC-C corresponds to a fully connected layer with C neurons. ConvNets A, A-LRN, B, C, D and E achieve 10.4%, 10.5%, 9.9%, 8.8%, 8.1% and 8.0% top-5 error rates on the ILSVRC-2012 dataset, respectively (ILSVRC-2014 uses the ILSVRC-2012 dataset).
CHAPTER 3
LITERATURE SURVEY
In this chapter, related work in the field of object tracking that uses neural networks is presented.
Tracker algorithms can be divided into two classes, using either a generative or a discriminative approach. In the generative approach, a large number of target candidates is cropped around the target, and the tracker finds the most probable patch. In these methods, the patches are transformed into another space by feature extractor algorithms, and the resulting features are then scored by some classifier. In the discriminative approach, the target position is estimated by directly segmenting the input image into target and background.
When methods that use deep learning in tracking are examined, it is observed that the generative approach is mostly adopted. Since deep learning algorithms can learn a representation of the data from examples, they are very powerful in applications that require feature extraction.
In the following two papers, [42] and [43], a particle filter approach with deep learning feature extractors is used. In the deep learning tracker (DLT) method, Wang et al. [42] designed an object tracker that uses natural image features; the target is tracked with the particle filter principle with the help of these features. The feature extractor is updated in an on-line manner, so the tracker can adapt to difficulties like illumination change, occlusion, etc. The authors used a de-noising auto-encoder, a special type of neural network trained in an unsupervised way, as the feature extractor. The network is trained with 1 million images randomly selected from 80 million 32 x 32 natural images. It is stated that after the network is trained, it is able to extract features that are common to natural images; therefore, it is thought that these features will be distinctive for common targets.
One more layer is added on top of the de-noising auto-encoder for tracking: a multiple-input, one-output fully connected layer used to determine the confidence of the target. For on-line tracking, 1000 patches are drawn according to the particle filter approach. These patches are fed to the network and their confidences are calculated, and the target is determined according to the particle filter approach. If the maximum confidence is below some threshold value, the target appearance has changed significantly; in that case, the whole network is fine-tuned.
CNNTracker [43] is a generic object tracker which uses a convolutional neural network based model with the particle filter methodology. The network used in CNNTracker is composed of two convolutional layers, each with ReLU and max-pooling, and one fully connected layer. The first convolutional layer contains 6 filters of kernel size 5 x 5, followed by ReLU and 2 x 2 max-pooling layers. The second convolutional layer contains 12 filters with 5 x 5 kernel size, also followed by ReLU and 2 x 2 max-pooling layers. The fully connected layer is connected to the flattened output of the second max-pooling layer. The network is pre-trained off-line, and the trained layers are transferred to the network used in on-line tracking. The main difference between the off-line trained network and the on-line tracking network is the output stage: since the network in the pre-training phase is trained with the CIFAR-10 [44] dataset, which has 10 classes, its fully connected layer has ten output neurons; in the on-line tracking network, this 10-output fully connected layer is replaced with a 1-output fully connected layer. This network is fine-tuned during tracking. The output of the network corresponds to the likelihood of the input being the target. This probability is used in the particle filter, where the posterior probabilities of the target patches are calculated by Monte Carlo sampling. The patch with the maximum
probability is labeled as the target. The model update decision is made according to that posterior probability: if the maximum posterior probability of the patch labeled as target is smaller than T1 and larger than T2, where T2 < T1, then the model is updated. The logic behind this rule is as follows. If the posterior probability is larger than T1, the result is reliable and there is no need for training. If the posterior probability is smaller than T2, the result is not reliable; training with that data would distort the model, and tracker performance would decrease due to a weak target model.
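The update rule described above reduces to a simple interval check. The sketch below is our own paraphrase of that logic; the threshold values T1 and T2 are illustrative placeholders, not values taken from the paper.

```python
def should_update_model(p_max, T1=0.8, T2=0.2):
    """CNNTracker-style model-update rule: retrain only when the best
    posterior is uncertain, i.e. strictly between T2 and T1 (T2 < T1)."""
    if p_max >= T1:   # confident result: no need to train
        return False
    if p_max <= T2:   # unreliable result: training would distort the model
        return False
    return True       # T2 < p_max < T1: update the model
```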
One of the commonly used methods in object tracking is correlation filters. A correlation filter is trained with images that represent the target. Thanks to FFT-based training algorithms, correlation filters can be trained very fast. However, in order to achieve good performance, an input that successfully represents the target should be supplied to the correlation filter algorithm.

In the following two papers, [45] and [46], the representation learning property of deep networks is used in correlation filter based trackers.
In [45], a tracker algorithm is implemented by exploiting the specific properties of each convolutional layer. In this paper, one of the famous neural networks, namely VGG-Net [39], is adopted. VGG-Net has five convolutional layer groups; this method uses three of them, the third, fourth and fifth. The abstraction capability of deeper layers is higher than that of lower layers, so the output of the fifth convolutional layer represents more semantic features. However, due to the pooling layers in VGG-Net, spatial details are lost in the deeper layers. From the fifth layer down to the third, abstraction capability decreases but spatial detail increases. Therefore, by combining the properties of these three layers, a tracker that is both accurate and robust can be obtained.

In order to exploit the convolutional layer properties, an adaptive linear correlation filter is trained over the output of each of these convolutional layers. The location of the target is inferred from the outputs of the correlation filters, using a coarse-to-fine approach.
In [46], the authors proposed a tracker method based on convolutional neural networks: the activation values of convolutional layers are used in a discriminative correlation filter based tracker. The authors claim that no fine-tuning for a specific target is necessary, since convolutional layers are capable of generating generic features. In addition, these features contain semantic information about the object, which is very important for target tracking. According to the authors, activations of the first layer give better tracker performance, although deeper layers contain more complex features.
In the method proposed in [46], the data representation stage of two different correlation filter based trackers, namely DCF [47] and SRDCF [48], is replaced by convolutional features. In the DCF method, a discriminative correlation filter is learned from examples; in this case, the examples are the activations of convolutional layer outputs. The on-line update rule in [49] is used to approximate the solution efficiently using the DFT. When a new frame comes, the convolutional features of the target region are circularly correlated with the correlation filter, and the target location is chosen according to the maximum correlation score.
The periodicity assumption in DCF causes a periodic boundary effect, which limits the performance of the tracker. The SRDCF method eliminates this effect by adding a spatial regularization term to the cost function of the filter: the added term is a multiplier on the regularization term that increases proportionally to the distance from the target. This new regularization provides a significant performance boost. For target features, the imagenet-vgg-2048 network [38] is used, which is trained with the ImageNet dataset [40, 41]. Imagenet-vgg-2048 has five convolutional layers; in the proposed method, the ReLU outputs of the convolutional layers are used as features. Images are fed to the network after resizing to 224 x 224 and subtracting the mean. For gray-scale images, the same image is fed to the R, G and B channels. The extracted features are filtered with a Hann window.
Support vector machines are powerful algorithms for data classification, and in the generative approach they can be used as a target classifier. In the following paper, a tracker that uses both deep learning and a support vector machine is described.
In [50], Hong et al. proposed a tracker method that uses convolutional neural networks and a support vector machine. The algorithm tracks the target object with the help of a learned discriminative saliency map, which is generated by back-propagating positive samples through the network. In order to track the target, a sequential Bayesian filtering method is applied to the saliency maps. As the convolutional neural network, the network in [1] is adopted; it is trained off-line with a rich image dataset and is not fine-tuned during tracking.

In the algorithm, the outputs of a hidden layer are used as features. Target candidates are cropped around the target, and their features are classified with a support vector machine as foreground or background. The foreground images are used to generate the saliency map by back-propagating through the same convolutional neural network.
The saliency method was first proposed in [51]. The features in the hidden layer capture the semantic information of the target successfully; however, due to the pooling layers, spatial information is lost. That problem is solved with a generative target appearance model, which is built from the saliency maps of the foreground patches.

In this method, the representation learning property of convolutional neural networks is used: a part of a network pre-trained with a large dataset serves as the feature extractor. It is shown that the outputs of the hidden layer are able to capture semantic information successfully. In addition, the saliency map generated by back-propagating positive images helps to localize the target precisely.
In object tracking studies, there are some solutions that use a neural network purely to localize the target. The following study uses a convolutional neural network in order to evaluate convolutional features.
FCNT [52] uses the same property of convolutional neural networks as [45]: top-layer features work like a class detector, while lower layers carry target-specific properties which can be used to separate the target from its surroundings. The authors merged the power of both layers with a switch mechanism. It is shown that not all feature maps of a convolutional layer are necessary to track a specific target, so the authors proposed a feature map selection method to eliminate irrelevant and noisy features. VGG-Net [39] is used as the feature extractor. The proposed algorithm is as follows. Feature map selection is applied on the conv4-3 and conv5-3 layers. The selected feature maps of conv5-3 form a general network (GNet) and the selected feature maps of conv4-3 form a specific network (SNet). In the first frame, both networks are initialized with a foreground heat map regression method. When a new frame comes, it is fed forward through the network, and heat maps are generated by GNet and SNet. Finally, a distractor detection mechanism decides which
heat map defines the target. Generally, GNet detects the target. However, although GNet contains class-specific information, it may be distracted by other objects of the same class. Say the target is a human and there are two people in the frame; in that case, GNet outputs a heat map that shows two possible target positions. SNet, on the other hand, contains target-specific properties and is therefore more successful at discriminating between two objects of the same class, while GNet is more robust than SNet. This trade-off is managed by the distractor detection mechanism.
In most studies, the transfer learning method is applied by directly transferring connection weights. In the following study by Wang et al. [53], another approach is described.
Wang et al. proposed a convolutional neural network based object tracker using learned hierarchical features. They used the ASLA [54] tracker to evaluate the performance of their learned hierarchical features; however, the proposed feature learning method can be used with different trackers by replacing their feature representations.

In the method proposed in [53], a two-layer convolutional neural network is used to learn hierarchical features. This network is trained with auxiliary video sequences taken from the Hans van Hateren natural scene videos [55]. It is claimed that, when trained with a rich dataset, the network is able to learn features that are robust to different kinds of motion patterns. In order to increase the robustness of the learned features, a temporal slowness constraint is applied: an extra term is added to the loss function which penalizes the difference between the outputs of two consecutive frames. Due to this term, the network is forced to learn motion with small temporal change, which is the most common case in videos.
The authors propose a domain adaptation module to adapt the learned features to new videos. With this mechanism, the learned features are merged with target-specific ones, which makes the tracker robust to both complicated motions and target-specific appearance changes. In many domain adaptation mechanisms, the connection weights of the network are directly transferred to the test network. Wang et al., however, proposed a different approach: they added an additional term to the loss function of the test network which penalizes the difference between the weights of the pre-trained network and the test network. Therefore, the test network merges the advantages of both generic features and target-specific features.
As mentioned before, transfer learning is widely used in object tracking, and the networks whose weights are transferred are mostly trained for classification. Therefore, the features extracted from those networks relate to the notion of an object rather than the motion of an object. In the following study, the network is fine-tuned to learn motion-specific features.
A convolutional neural network based tracker with a novel visual tracking algorithm is proposed in [56]. The convolutional neural network in the algorithm is a combination of two groups of layers: the first group is called the shared layers, and the second the domain-specific layers, which are connected to the shared layers in the training phase. The shared part is composed of three convolutional layers and two fully connected layers; the convolutional layers are identical to conv1-3 of the VGG-M network [38], and the two fully connected layers correspond to fc4-5. The fully connected layers have 512 output units with dropout and ReLU layers. The domain-specific part contains multiple branches, each trained with an individual training domain: for K training sequences, there are K fully connected layers connected to the last fully connected layer of the shared part. Each is a binary classification layer with soft-max cross-entropy loss, whose function is to distinguish target and background. As stated above, the shared layers are common to all training sequences. The main purpose of this network architecture is to obtain a generic target representation in the shared layers: after training on each domain in an iterative way, it is expected that the features common to all training sequences are learned by the shared layers. At test time, the shared layers and the domain-specific layers are separated: the shared layers of the pre-trained convolutional neural network are preserved, and a new, untrained domain-specific layer is added on top of them. This new domain-specific layer is updated on-line.
When a new frame comes, candidate target patches are sampled randomly around the target. Each patch is fed to the network after resizing to the input layer size, and the output of the domain-specific layer shows how likely that patch is to correspond to the target. The network is small compared to AlexNet [10] or the VGG-Nets [39, 38]. The authors give three reasons for picking such a small network. First, the network classifies only two classes, so according to the authors there is no need for a complex model. Second, as the network goes deeper, it loses its spatial sensitivity. Third, since the target size is generally small, the input layer is chosen small, and a small input naturally decreases the depth of the network. It is also shown that a bigger network doesn't improve performance.
Training convolutional neural networks requires many training examples, and training takes a long time. To overcome these issues, a tracker that uses a purely on-line trained convolutional neural network is proposed in [57]. In tracking applications there is only one labeled example, the initialization frame, so an effective training strategy is required. The authors make three contributions to train the network effectively in a purely on-line manner.
The first contribution is a truncated structural loss function, which prevents the accumulation of tracking-error loss. One of the most frequently used cost functions is the mean squared error loss, and in the usual training scheme the vanilla mean squared error loss is used. In [57], an additional multiplicative term is added to the loss function. This term is given in equation (3.1), where Θ(y_n, y*) is the overlap ratio of the target y* and the patch y_n.
\Delta(y_n, y^*) = \left| \frac{2}{1 + \exp(-(\Theta(y_n, y^*) - 0.5))} - 1 \right| \in [0, 0.245]    (3.1)
This term can be thought of as an importance indicator: positive samples close to the target and negative samples far from the target have higher importance. In addition, patches with very small error are assumed to have no significant effect on the network and are discarded in training, which is equivalent to truncating the loss function around zero. With the truncated structural loss, only samples that are discriminative for the target contribute to training, and the effect of noisy patches is discarded.
The second contribution is a robust sample selection mechanism. Positive and negative samples are selected according to temporal relations and label noise. A positive pool and a negative pool hold the training samples. For each new frame, a predefined number of patches is cropped randomly around the target. Patches whose overlap ratio with the target is larger than 0.5 are labeled positive; the other patches are labeled negative. All of these patches are fed to the network, whose output gives the probability that a patch is the target. A predefined number of positive and negative patches is saved to the pools together with an additional quality term. The quality term is computed using the notion of label noise as follows. Patches are sorted from high to low target probability, a predefined number of high-probability patches is selected, and the average truncated loss of these patches is computed and subtracted from 1. Note that a negative patch with high target probability decreases the quality, so patches with noisy labels have low quality and do not affect the network much. The quality function is given in (3.2), where the set P contains the high-probability samples and L_n is the loss of an individual patch.
Q = 1 - \frac{1}{|P|} \sum_{n \in P} L_n    (3.2)
In training, positive samples are drawn with uniform probability over time and negative samples are drawn with a probability that decays exponentially with time, which satisfies a short-term negative, long-term positive memory restriction. The quality term is used as a multiplier in the loss function, so correctly labeled patches have a strong effect in training, while noisily labeled patches have almost no effect because their quality is low.
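A minimal sketch of the quality computation, assuming the target probabilities and truncated losses of one frame's saved patches are available as arrays (the function and argument names are hypothetical):

```python
import numpy as np

def sample_quality(target_probs, losses, top_k=10):
    """Quality term of equation (3.2). The top_k patches with the highest
    target probability form the set P; quality is one minus their average
    truncated loss, so a frame whose confident patches are mislabeled
    (high loss) receives a low quality and a weak training effect."""
    order = np.argsort(target_probs)[::-1]  # sort high to low probability
    top = order[:top_k]
    return 1.0 - np.mean(losses[top])
```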
The third contribution is a lazy updating scheme. The straightforward approach would be to train the network at every frame, but this is computationally expensive. The authors propose a method that decreases the training frequency and thus increases the speed of the tracker. In the training phase, the network is trained until the error reaches a predefined value ε, and it is not trained again until the loss exceeds 2ε. Since object appearance does not change frequently, a successful model gives small loss values for a long time; therefore the lazy updating scheme increases tracker speed without significantly affecting performance.
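The lazy updating scheme can be sketched as a simple control loop; `model_loss` and `train_step` are hypothetical callables standing in for loss evaluation and one training iteration, and the value of ε is an assumed example.

```python
def lazy_update(model_loss, train_step, epsilon=0.05):
    """Lazy updating: skip training while the loss stays at or below
    2*epsilon; once it exceeds 2*epsilon, train until it drops to
    epsilon or below, then stop again."""
    loss = model_loss()
    if loss <= 2 * epsilon:
        return loss          # model is still good enough: no update
    while loss > epsilon:    # retrain down to the target error
        loss = train_step()
    return loss
```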
The neural network architecture in [57] is a composition of four networks: three identical networks called single-cue CNNs and one fully connected network called the fusion network. A single-cue CNN takes a 32 x 32 image patch and outputs one probability term; its last layer is a fully connected layer with 8 inputs and 1 output. To merge the features of the three single-cue CNNs, this last fully connected layer is discarded and the 8-dimensional features are concatenated. The concatenation is fed to the fusion network, which consists of one fully connected layer with 24 inputs and 1 output corresponding to the probability term. The red, green and blue components of the input are given to the single-cue networks respectively; for a gray-level image, these components would be two locally normalized images and one gradient image. In each iteration, a different single-cue CNN is trained together with its last fully connected layer, in order. After the three single-cue CNNs are trained, the fusion network is trained on their 8-neuron outputs.
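The fusion step can be sketched as follows; the weight and bias names are hypothetical, and a sigmoid output is assumed for the probability term.

```python
import numpy as np

def fusion_forward(cue_features, w_fusion, b_fusion):
    """Merge three single-cue CNNs: each contributes an 8-dimensional
    feature vector (its final 8-to-1 layer is discarded), the vectors are
    concatenated into 24 values, and one 24-input, 1-output fully
    connected layer produces the target probability."""
    x = np.concatenate(cue_features)  # 3 x 8 -> 24
    return 1.0 / (1.0 + np.exp(-(w_fusion @ x + b_fusion)))
```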
Not all trackers use a generative approach in deep learning. The following study uses a discriminative approach to find the target location.
A convolutional neural network is used for target tracking in [58]. In the proposed method, the network is trained both off-line and on-line. The purpose of off-line training is to teach the network what an object is; in on-line training, the network is fine-tuned to adapt to the tracked target. During on-line training, mistakes may happen and deform the model. To avoid this, two neural networks are used concurrently at test time, and their results are used in collaboration to determine the target location. The convolutional neural network is composed of seven convolutional layers and three fully connected layers, with a multi-scale pooling scheme [59] applied between them. Most trackers that use a convolutional neural network have one output neuron giving the probability of being the target; the main difference of the network proposed in [58] is that it outputs a 50 x 50 probability map instead of a single value. Since the input size is 100 x 100, each output neuron corresponds to a 2 x 2 area, and the value of a neuron represents how likely that 2 x 2 area belongs to the target. The purpose of pre-training is to teach the network object-level features, so the ImageNet dataset [40, 41] is used; since a deep convolutional neural network is trained, the availability of a large number of images is very important. ImageNet contains approximately 500k images with labeled bounding boxes. For training, a pixel is labeled 1 if it is inside the bounding box and 0 otherwise. Some negative images, in which all pixels are labeled 0, are also introduced to the network. When a new frame arrives, a predefined number of patches of different sizes, centered at the target location in the previous frame, is cropped and fed forward through the network. The search starts with the smallest patch: if the sum of the network output is below a threshold, the patch is skipped, and the search continues from smallest to largest until the sum exceeds the threshold. The patch that exceeds the threshold is declared the target.
Model update frequency is important for tracker performance. If the network is updated too frequently, the model may be distorted by inaccurate results; if the update frequency is too low, the model may not adapt to appearance changes. To solve this problem, the authors propose two convolutional network structures: one adapts to short-term appearance changes and the other to long-term appearance. Both networks are fine-tuned with the initial frame. The long-term network is updated conservatively, while the short-term network is updated more aggressively. The cropped patch is fed to both networks, and the target location is determined from the more confident output.
In a generative model, target candidates are generated and the most probable patch is labeled as the target. In addition, most trackers update their model at test time in order to adapt to appearance changes. On-line training and the large number of forward passes required by the generative approach are time consuming, which makes these trackers very slow and impractical. In GOTURN [13], the authors propose a novel approach to neural network tracking: only one forward pass is necessary and there is no model update, so GOTURN can run at 100 fps. The network in GOTURN is a composition of three networks: two parallel convolutional networks and one fully connected network. The convolutional layers of CaffeNet [60], trained with 1.2 million images from the ImageNet dataset [40, 41], are used as the convolutional networks. The fully connected network is composed of four fully connected layers; three of them contain 4096 neurons, and the final layer has four output neurons. Each layer has dropout and ReLU layers. The hyper-parameters, such as neuron numbers and kernel sizes, are taken from CaffeNet. The whole network is fine-tuned with a training dataset generated from auxiliary sequences, and no model update is performed at test time.
The network takes two consecutive images as input. A region of double the target size is cropped around the target in the previous frame, and the same location is cropped in the current frame. The two cropped images are resized to 227 x 227, the input size of CaffeNet. When the two images are fed forward, the network output directly gives the position of the target with respect to the upper left corner of the crop from the previous frame. The network has four output neurons: the x and y positions of the upper left and lower right corners of the target, respectively. For training, consecutive images and the corresponding motion are given as training samples. In addition, still images are used in training: an image is shifted in some direction as if the object were moving, the non-shifted image is treated as the previous frame and the shifted image as the current frame, and cropping is applied in the same way as for video sequences. This mechanism generates augmented data for object motion, and since a large number of object classes can be found in image datasets, the algorithm becomes more robust to different kinds of objects.
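The input preparation can be sketched as follows, assuming an (x, y, w, h) box format; border padding and the resize to 227 x 227 are omitted for brevity.

```python
import numpy as np

def crop_pair(prev_frame, cur_frame, box):
    """Crop a window of twice the target size, centred on the target in
    the previous frame, from both the previous and the current frame.
    A full implementation would also pad at image borders and resize
    both crops to the 227 x 227 CaffeNet input size."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2          # target centre
    x0, y0 = max(cx - w, 0), max(cy - h, 0)  # double-size window
    x1, y1 = cx + w, cy + h
    return prev_frame[y0:y1, x0:x1], cur_frame[y0:y1, x0:x1]
```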
CHAPTER 4
PROPOSED METHOD
In this chapter, the proposed tracker, which tracks the face of a mouse in video sequences collected from the TÜBITAK 115E248 FARE-MIMIK project setup, is explained. The proposed tracker is based on a convolutional neural network. Given the target location in the previous frame, the tracker outputs the target location in the current frame.
4.1 Network Architecture
Deep neural networks are able to learn discriminative representations of the input dataset. Each layer of the network learns different features: while early layers learn simple features, deep-layer features contain semantic information about the objects in the image. Semantic features are very helpful for tracking, since the main purpose of a tracker is to determine the target location precisely, and they allow the target to be identified reliably. However, as network depth increases, spatial resolution decreases due to pooling layers; the receptive field of one pixel of a deep-layer response can be very large. Therefore, accurate target localization cannot be achieved using deep features alone. Although shallow-layer features contain only basic information, such as corners, they have good spatial resolution. By combining shallow- and deep-layer features, an accurate and robust tracker can be designed.
Training deep convolutional neural networks requires a large amount of training data, on the order of millions of examples. The dataset prepared for this work does not contain that much data. Therefore, the convolutional layers of a deep convolutional neural network pre-trained on a rich dataset are used as a feature extractor. These layers are called the feature extraction network, and its low- and high-level features are used to design an accurate and robust tracker.
Two consecutive frames are fed into the network, and the network is expected to output the target location. The concatenation of low- and high-level features of both the previous and the current frame is used to find the target location; therefore, two parallel feature extraction networks are needed.
The low- and high-level features of the feature extraction network are concatenated along the third dimension. To concatenate them, the spatial sizes of the features must be the same, which may not hold for some layers. A feature adaptation network is used to resize and concatenate the low- and high-level features of both the previous and the current frame.
The concatenated features are fed into the regression network, which is responsible for generating the target location from the input features. This network is trained with the application-specific dataset; therefore, it learns how to relate the motion of the target to the given features.
The neural network used in this thesis is composed of two parallel feature extraction networks, one feature adaptation network, and one regression network connected to the output of the feature adaptation network. A block diagram of the proposed network is given in Figure 4.1.
Various combinations of the network layers of the proposed method are examined in section 5.4.
4.2 Dataset
The mice dataset is generated from mouse videos recorded at the Hacettepe University Institute of Neurological Sciences and Psychiatry, Behavior Experiments Research Laboratory. The videos were recorded for a project that automatically grades the pain level of a mouse with the help of computer software; the proposed tracker will be used in this project as a mouse face tracker. Two types of videos are recorded.
Figure 4.1: Network architecture of the proposed network: two parallel convolutional networks (Conv1-Conv5 with ReLU, LRN and pooling layers) extract low- and high-level features from the previous and current frames; feature adaptation networks resize and concatenate these features, and a regression network (ExtraFC, FC6, FC7 with dropout) outputs the target state.
These are videos in which the mouse is in pain and videos in which it is not. Both types are used in training the tracker. However, videos in which the mouse is in the basal state (the state before the drug) are more valuable for tracker training, because in that state the mouse is more mobile.
For one video recording, six cameras are placed around the container of the mouse as in Figure 4.2. Videos are recorded in ultra-high definition (UHD), 3840 x 2048, at 25 frames per second (FPS). The videos in which the mouse is in the basal state are 30 minutes long, while the duration of the videos in which the mouse is drugged varies from 30 minutes to 1 hour. Recording is not interrupted, in order to observe the effect of the drug over time. The bounding box of the mouse face is labeled manually at the METU Intelligent Systems and Computer Vision laboratory.
To generate the dataset, frames in which the face of the mouse is not visible are discarded. Among the remaining frames, valid frames are determined: if the face of the mouse moves between the previous and the current frame, the current frame is labeled as valid. Valid frames from all videos, including both pain and basal videos, are used in training, which makes the tracker robust to state changes of the mouse.
Figure 4.2: Video recording setup
After deciding which frames are valid, the data and labels for the training dataset are prepared as follows. One valid frame is randomly selected together with its previous frame, and a search area is defined for it. The search area is a square region of interest 1.5 times the size of the target, centered at the target in the previous frame; an example is given in Figure 4.3. The target shape is defined as a square.
The search area is cropped from both the previous and the current frame; note that it is defined with respect to the target area of the previous frame. The two cropped frames are converted from RGB to BGR and resized to 227x227x3, the input format of the VGG-CNN-F network. The ImageNet dataset mean is then subtracted from both images, which is also required by VGG-CNN-F. The label for this data is a three-dimensional vector containing the x, y coordinates of the upper left corner of the target in the current frame, with respect to the upper left corner of the search area, and the target size. The label must be scaled, since the cropped images are resized to 227x227x3: the scaled label is obtained by dividing the label by the scale factor, which is the width of the search area divided by the width of the resized input image (227). The final label is obtained by multiplying the resized label by 10 in order to increase the loss term.
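The label scaling described above can be sketched as follows; the function name is hypothetical.

```python
def scale_label(label, search_width, input_size=227, loss_gain=10):
    """Scale a label (x, y, size), given in pixels of the original search
    area, to the resized 227 x 227 input: divide by the scale factor
    search_width / input_size, then multiply by the loss gain of 10
    used in the mice dataset to increase the loss term."""
    scale = search_width / input_size
    return tuple(v / scale * loss_gain for v in label)
```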
58
Page 77
Figure 4.3: The target area is shown as a green square and the search area as a blue square.
This data and label generation process is applied to all valid frames in random order. Randomization is important in training because, with ordered inputs, the network may memorize patterns, which usually decreases the test performance of a neural network.
4.2.1 Data Augmentation
Data augmentation is an important approach in training dataset generation, since augmented data improves target abstraction. For the mice dataset, vertical mirroring and random brightness changes are applied. The augmentation generates artificial data that could plausibly occur in real video sequences, and the brightness changes increase the robustness of the tracker to illumination changes. From one training sample, nineteen augmented samples are generated by the following procedure.
First, nine uniformly distributed random numbers are generated between 0.8 and 1.2 and used as brightness multipliers. Both the current and the previous frame are converted from RGB to the HSV colorspace, the V channel of each image is multiplied by a brightness multiplier, and the images are converted back
Figure 4.4: The upper ten images are original data with illumination change; the illumination multiplier is given below each image. The lower ten images are mirrored images with illumination change.
to the RGB color space. In this way, nine augmented samples are generated. In addition, vertical mirroring is applied to these nine augmented images and to the original image, producing ten more augmented samples. Example augmented images are given in Figure 4.4. Note that after vertical mirroring the target locations change, so the labels are updated accordingly. When the image is mirrored, the upper left corner of the target maps to its upper right corner; the vertical location and the width do not change within the search area, but the new horizontal location of the upper left corner is obtained by subtracting the old horizontal location from 227 minus the target width. In the mice dataset the target location is multiplied by 10, so the first parameter of the label, which corresponds to the horizontal location, must be multiplied by 10 after this subtraction.
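The mirrored-label update can be sketched as below, assuming the stored label is (x, y, size) already multiplied by the loss gain of 10; the function name is hypothetical.

```python
def mirror_label(label, input_size=227, loss_gain=10):
    """Update a label after left-right mirroring: the horizontal location
    is first unscaled, reflected as x' = input_size - width - x, and then
    multiplied by the loss gain again; the vertical location and the
    target size are unchanged within the search area."""
    x, y, size = label
    x_new = (input_size - size / loss_gain - x / loss_gain) * loss_gain
    return (x_new, y, size)
```

For example, a stored label (100, 50, 500), i.e. x = 10 and width = 50 in input pixels, maps to horizontal location 227 - 50 - 10 = 167, stored as 1670.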
4.3 Off-line Training
As explained in section 4.1, the tracker network is composed of four networks: two feature extraction networks, one feature adaptation network, and one regression network. The tracker architecture was given in Figure 4.1: the layers shown in blue are the feature extraction networks, the layers shown in yellow are the feature adaptation network, and the network shown in red is the regression network.
The convolutional layers of VGG-CNN-F are used as the feature extraction networks of the tracker, and their connection weights are transferred directly. The feature extraction networks are not trained during off-line training, because VGG-CNN-F is trained on a rich dataset containing roughly 1.2 million images of 1000 classes, such as animals, plants, foods and instruments. The features extracted from this rich dataset are known to be highly representative for a large number of objects, and further training with the mice dataset would corrupt them. Since the tracker proposed in this thesis uses both low- and high-level features of the feature extraction network, additional training of the convolutional layers would decrease tracker performance.
The feature adaptation network, shown in yellow, is responsible for equating the spatial sizes of the low- and high-level features. It is composed of max-pooling and concatenation operations, which contain no free parameters; since training applies only to free parameters, the feature adaptation network does not participate in training. The only network trained in the proposed method is the regression network.
The regression network is trained with the Adam optimizer using learning rate annealing, with the hyper-parameters suggested for Adam. Step decrease is used as the annealing policy: the initial learning rate is 1e-5, and it is multiplied by 0.5 every 1000 batches. The batch size is 50. All layers in the regression network are initialized with the Xavier initializer implemented in the Caffe deep learning framework.
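The annealing policy corresponds to the following schedule (an illustration of the step policy, not the Caffe solver configuration itself):

```python
def step_decay(initial_lr=1e-5, gamma=0.5, step=1000):
    """Step learning-rate annealing: start at initial_lr and multiply by
    gamma after every `step` batches, i.e.
    lr(batch) = initial_lr * gamma ** (batch // step)."""
    def lr_at(batch):
        return initial_lr * gamma ** (batch // step)
    return lr_at
```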
In training, the mean squared error loss between the predicted target location and the ground-truth target location is used. The loss function is given in (4.1), where y_n is the label, ŷ_n is the network output for the given input, and N is the batch size (50 in the proposed method).
C = \frac{1}{2N} \sum_{n=1}^{N} \| \hat{y}_n - y_n \|_2^2    (4.1)
The network is trained with the mice videos; the training dataset is generated according to the procedure explained in section 4.2.
4.4 On-line Tracking
In this section, the method used during on-line tracking is explained.
The first frame is read from the video, and the bounding box of the target in that frame is taken from the ground truth. The target area is expanded by a ratio of 1.5; this area is called the search area, as explained in section 4.2. The second frame is then read from the video; it is called the current frame, and the first frame is called the previous frame. Both frames are cropped to the search area. The cropped images are converted from RGB to BGR, resized to 227 x 227, and the ImageNet dataset mean is subtracted, since VGG-CNN-F is trained with BGR images of size 227 x 227 with the dataset mean subtracted. The crop from the previous frame is given as input to the first feature extraction network, and the crop from the current frame is given to the second. By simply feeding these two frames forward, the target location in the current frame is computed. The network has three output neurons, corresponding to the x, y coordinates of the upper left corner and the width of the bounding box; since the target area is always square, width and height are equal. Note that, since the network is trained with resized images, the network output is also scaled to 227 x 227. The scale factor is computed by dividing 227 by the width of the search area, and dividing the network output by this scale factor gives the target location with respect to the cropped current frame. Remember that the target location is multiplied by 10 in the training
Figure 4.5: The mouse turns its back to the camera and the tracker loses the target. Frame numbers (5319-5330) and width ratios with respect to the previous frame are given below each frame. Successful tracker results are shown in green and failed tracker results in red.
dataset; to compensate, the target location in the current frame is divided by 10. This determines the location of the target within the search area. However, for tracking, the target location in the full frame is needed: the coordinates of the upper left corner of the search area are added to the first two terms of the output vector, which are the x, y coordinates of the target in the search area.
After the target location is found, the current frame becomes the previous frame, whose target location is now known. A new frame is read from the video, and the same procedure is applied to find the target in it. This process continues until the end of the video or until the target is lost.
Experimental results show that when the tracker loses the target, the target size decreases dramatically over time and goes to zero. To identify target loss, a simple heuristic is used: the width of the current target is divided by the width of the previous target, and this division is called the width ratio. If the width ratio is smaller than the loss threshold, which is 0.9 in this application, the tracker is assumed to have failed for the current frame.
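The target-loss heuristic amounts to the following check (function name hypothetical):

```python
def tracker_failed(prev_width, cur_width, loss_threshold=0.9):
    """Width-ratio heuristic: when the tracker loses the target, the
    estimated target width shrinks rapidly, so a width ratio below the
    loss threshold (0.9 in this application) signals failure."""
    return (cur_width / prev_width) < loss_threshold
```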
Figure 4.6: Width ratio histogram on the 25 FPS test video.
In Figure 4.5, a tracker failure example is given. The tracker cannot track the target after frame 5325, since the mouse turns its back to the camera. Tracker outputs whose width ratio is below 0.9 are shown as red squares. Note that the red squares do not contain a mouse face, which shows that the width ratio assumption is valid for tracker failure evaluation in this example.
In Figure 4.6, the width ratio histogram for the test video is given. At a 25 FPS capture rate, the width ratio under natural movement of the mouse face mostly varies between 0.94 and 1.04; therefore, the 0.9 threshold value does not affect tracker performance.
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Performance Criteria
To evaluate the performance of a single-target object tracker, some performance criterion must be used. Several performance measures appear in the literature, some of them popular in tracking studies, but there is no standard performance measure for single object tracking. Cehovin et al. [61] evaluated popular performance measures with 13 trackers on 25 widely used video sequences. Based on these analyses, they state that some measures indicate the same aspects of performance. They select two uncorrelated performance measures and propose a visualization method based on the accuracy versus the robustness of the tracker. In this section, first the popular performance measures and visualization methods for single-target tracking are explained based on the study of Cehovin et al., and then the performance measures used in this thesis are stated.
The purpose of a performance measure in object tracking is to evaluate how well the object state assigned by the tracker matches the ground-truth object state. Some popular performance measures are as follows:
Center Error One of the oldest performance measures is the center error: the distance between the target center and the ground-truth center. The smaller the center error, the better the tracker. Center error is usually visualized by a center error versus frame plot; the average or the mean squared error can also be used as a numeric performance indicator. Center error is denoted by δ.
\delta_t = \| x_t^G - x_t^T \|    (5.1)
The center error measure may not be objective, because it greatly depends on the target size: for large objects the center error may be large even though the tracker is successful. To overcome this, the normalized center error is proposed, in which the center error is divided by the ground-truth target size. The same visualization techniques are used as for the center error. Normalized center error is denoted by δ̂_t.
\hat{\delta}_t = \left\| \frac{x_t^G - x_t^T}{\mathrm{size}(A_t^G)} \right\|    (5.2)
Region Overlap The region overlap measure is the ratio of the intersection of the target and ground-truth regions to their union. It is a good performance measure because both position and size are taken into account. Region overlap is denoted by φ.
\phi_t = \frac{A_t^G \cap A_t^T}{A_t^G \cup A_t^T}    (5.3)
The score used with region overlap is called the true positive measure: the number of frames whose region overlap is larger than a threshold, divided by the number of frames in the video sequence. It is denoted by P_τ.
P_\tau(\Lambda^G, \Lambda^T) = \frac{\left| \{ t \mid \phi_t > \tau \}_{t=1}^{N} \right|}{N}    (5.4)
Tracking Length The tracking length measure is the duration of tracking from the first frame to the frame at which the tracker fails. Failure can be decided using the region overlap term: if the region overlap falls below some threshold value τ, the tracker is considered to have failed. This measure is highly dependent on the video sequence: if there is a difficulty in the early frames, the tracker fails early. Since the remaining video is discarded after the tracker fails, the tracker is evaluated on only a limited duration of a video
Figure 5.1: Correlation of performance measures; the diagonal entries have the maximum correlation value of 1.
sequence. For this reason, tracking length is not considered a good performance measure. Tracking length is denoted by L_τ.
Failure Rate The failure rate is the ratio of the number of re-initializations of the tracker upon failure to the number of frames in the video sequence. Failure is detected using region overlap, as in the tracking length measure. Failure rate is denoted by F_τ, where τ is the region overlap threshold, called the re-initialization threshold. Unlike tracking length, failure rate evaluates the tracker over the whole video sequence.
The true positive, tracking length and failure rate measures are computed for a given threshold value τ. To visualize these performance measures, a performance measure versus threshold plot can be used; the area under the curve (AUC) of such a plot is a good numerical performance indicator.
Cehovin et al. provide an experimental analysis of these performance criteria and show that some of the measures are highly correlated. They compute a correlation matrix from the measure results of the selected tracker and video sequence pairs; the heat map of the correlations is given in Figure 5.1. There is high correlation among performance measures 1 to 3 and among measures 4 to 7, which means that it does not matter which measure is selected within each group: they give essentially the same result.
Performance evaluation on single target tracking is mainly focused on robustness and
accuracy of the tracker. Cehovin et al. proposed a simple accuracy versus robustness plot in order to compare object trackers.
Since the failure rate measures how well an algorithm tracks an object without losing it, it can be used as a robustness measure. However, the failure rate does not contain any information about accuracy. As an accuracy measure, one of the center error, region overlap or tracking length measures can be used. As stated before, the tracking length measure is not reliable because it does not use the whole video sequence. Region overlap uses both the size and the location of the target; therefore, it describes accuracy better than center error, which uses only the location. In addition, region overlap is highly correlated with the true positive and tracking length measures, which also shows its representative power.
In this thesis, different variations of the proposed mouse face tracker are compared with each other. For the comparison, an accuracy versus robustness plot is supplied. The AUC of the true positive versus region overlap curve is used as the accuracy term, and 1 minus the AUC of the failure rate versus re-initialization threshold curve is used as the robustness term.
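As a sketch, the two terms can be computed from the sampled curves as below; the trapezoidal rule is an assumption, since the thesis does not state the integration method used for the AUC.

```python
def auc(thresholds, values):
    # area under a measure-versus-threshold curve (trapezoidal rule)
    area = 0.0
    for i in range(1, len(thresholds)):
        area += 0.5 * (values[i] + values[i - 1]) * (thresholds[i] - thresholds[i - 1])
    return area

def accuracy_robustness(overlap_thr, true_pos, reinit_thr, fail_rate):
    # accuracy: AUC of true positive vs. region overlap threshold
    # robustness: 1 - AUC of failure rate vs. re-initialization threshold
    return auc(overlap_thr, true_pos), 1.0 - auc(reinit_thr, fail_rate)
```

Each tracker then contributes one (accuracy, robustness) point to the comparison plot.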
However, comparing the proposed trackers only with each other is not enough for a detailed performance evaluation. For that purpose, true positive versus region overlap threshold plots are given as a detailed accuracy measure. Although center error is highly correlated with region overlap, true positive versus center error threshold plots give detailed information about the target centering property. Failure rate versus region overlap threshold plots are supplied for a detailed robustness evaluation.
5.2 Test Networks
In Section 4.1, the pseudo-architecture of the proposed tracker is explained. In order to evaluate the performance of the proposed tracker, 9 test networks with different network architectures are proposed. One of the test networks is a variant of the state-of-the-art GOTURN [13] tracker. The performance of the 9 test networks is evaluated and compared. The architectures of these 9 networks are summarized in Table 5.1 and are presented in detail below. To name the trackers, the C5_{L,H}-Cx-Fy convention is used, where L is the depth of the low level feature and H is the depth of the high level feature in the feature
Table 5.1: Summary of Test Networks

| Network Name | Low Level Feature | High Level Feature | Regression Network: Convolutional Layers | Regression Network: Fully Connected Layers |
|---|---|---|---|---|
| C5_{0,5}-C0-F4 | - | Pool5 | - | 4 |
| C5_{0,5}-C1-F3 | - | Pool5 | 1 | 3 |
| C5_{3,5}-C0-F4 | ReLU output of Conv3 | Pool5 | - | 4 |
| C5_{3,5}-C1-F3 | ReLU output of Conv3 | Pool5 | 1 | 3 |
| C5_{1,5}-C1-F3 | Pool1 | Pool5 | 1 | 3 |
| C5_{2,5}-C1-F3 | Pool2 | Pool5 | 1 | 3 |
| C5_{4,5}-C1-F3 | ReLU output of Conv4 | Pool5 | 1 | 3 |
| C5_{2,4}-C1-F3 | Pool2 | ReLU output of Conv4 | 1 | 3 |
| C5_{2,4}-C2-F3 | Pool2 | ReLU output of Conv4 | 2 | 3 |
extractor network, x is the number of convolutional layers and y is the number of fully connected layers in the regression network. If low level features are not used in a network, L is equal to 0. If no convolutional layer is used in the regression network, x is equal to 0.
As the base test network architecture, the architecture of the generic single object tracker proposed by Held et al., namely GOTURN [13], is used. GOTURN is a state-of-the-art tracker that can run at 100 FPS, which is a very satisfactory speed for a neural network based tracking algorithm; neural network based trackers generally run at 0.8 FPS to 15 FPS due to on-line training.
Since it is trained with generic objects, GOTURN performs better with standard targets such as balls, cars and humans; if it were evaluated on the mice dataset, its performance would not be satisfactory. Therefore, a mouse face tracker that has the same architecture as GOTURN except for the output layer is implemented. The output layer is changed to a 3-neuron output layer instead of a 4-neuron one, because the mouse face is square shaped, so one of the width and height parameters is unnecessary. This network is called the C5_{0,5}-C0-F4 network. It is trained with the mice dataset like the other networks in order to increase the performance of GOTURN on mouse video sequences. Its block diagram is
given in Figure A.1.
In the original GOTURN implementation, the convolutional layers of AlexNet are used as the feature extraction network. In order to achieve faster operation, the convolutional layers of the VGG-CNN-F network are used in the C5_{0,5}-C0-F4 network instead of AlexNet. On the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012), both architectures achieved similar error rates; in addition, the VGG-CNN-F network is specialized for fast training and inference.
The regression network is composed of four fully connected layers. The first three fully connected layers each contain 4096 neurons with ReLU and 0.5 dropout. The final layer has three neurons that correspond to the target state: the three outputs represent the x, y coordinates of the upper left point of the target and the size of the target (scaled by 227/target size). The fully connected layers are called fc extra, fc6, fc7 and fc8, in order. The full network architecture is given in Figure A.3.
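The label scaling can be sketched as follows. The function names are illustrative, and `reference_size` stands for the normalizer in the scale factor 227/target size mentioned above; whether the normalizer is the target size itself or the search-area crop size is an assumption left as a parameter here.

```python
INPUT_SIZE = 227  # VGG-CNN-F input resolution

def scale_label(x, y, size, reference_size):
    # map the square target (upper-left corner and side length) into
    # the network's 227-pixel coordinate frame
    s = INPUT_SIZE / reference_size
    return (x * s, y * s, size * s)

def unscale_label(lx, ly, lsize, reference_size):
    # inverse transform: regression output back to image pixels
    s = reference_size / INPUT_SIZE
    return (lx * s, ly * s, lsize * s)
```

The inverse transform is what turns the 3-neuron network output back into a bounding box in image coordinates.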
While fully connected layers operate on all inputs independently, convolutional layers take spatial relations into account. The depth of the input is also considered in the convolution operation, which means that features related to the content along the depth dimension are extracted as well. Note that in the C5_{0,5}-C0-F4 network, the pool5 features of the feature extraction networks are concatenated, and since all layers up to conv5 are convolutional, the concatenation of pool5 features still contains spatial information. The features of the two different pool5 outputs should therefore be merged. For that purpose, the fc extra layer in the C5_{0,5}-C0-F4 network is replaced by a convolutional layer, namely conv extra. This convolutional layer has 256 kernels of size 3x3, with stride 1 and padding 1. This network is called the C5_{0,5}-C1-F3 network; its block diagram is given in Figure A.2.
The pool5 features represent the high level features. Although high level features contain semantic information, the receptive field of the neurons in the conv5 layer is very large, which makes it harder to localize the target precisely. Therefore, low level features should also be included in the network.
In the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks, the ReLU output of layer 3 is added to the network. In order to see the effect of a convolutional layer on the mixture of high and low
level features, the C5_{3,5}-C0-F4 network contains the fc extra layer and the C5_{3,5}-C1-F3 network contains the conv extra layer. The block diagrams of these networks are given in Figures A.4 and A.5, respectively.
In the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks, the ReLU output of the conv3 layer is used as the low level feature. However, it is not clear what the depth of the low level features should be. In order to measure the performance of different low level features, the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks are proposed.
In the C5_{1,5}-C1-F3 network, the conv5 features of the VGG-CNN-F network are used in order to make use of their high level representation property. For low level features with high spatial resolution, the pooling output of the conv1 layer is used. The input size of the VGG-CNN-F network is 227x227x3, which makes the size of the conv5 output 6x6x256 and the size of the pooling output of conv1, called the pool1 layer, 27x27x64. As stated above, the concatenation of conv5 and pool1 features is used as input to the regression network. The features are concatenated along the third dimension, so the spatial size of the pool1 layer must first be equated to that of the conv5 layer. In the feature adaptation network, a concatenation layer and two max-pooling layers with kernel size 7x7 and stride 4 are used on top of the pool1 layers in order to equate the spatial sizes. In the regression network, the conv extra, fc6, fc7 and fc8 layers are used.
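The spatial-size bookkeeping above can be checked with the standard pooling output formula; this sketch assumes zero padding and one such pooling layer per pool1 branch.

```python
def pool_out(size, kernel, stride):
    # spatial output size of a max-pooling layer without padding
    return (size - kernel) // stride + 1

# A 7x7, stride-4 max pool maps the 27x27 pool1 maps onto the 6x6
# spatial size of the conv5 output, so the feature maps can then be
# concatenated along the depth dimension.
pool1_matched = pool_out(27, 7, 4)
```

With matching 6x6 spatial sizes, the only remaining difference between the branches is their depth, which is exactly what concatenation along the third dimension allows.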
The C5_{2,5}-C1-F3 network uses pool2 features. In order to match the feature sizes, the feature adaptation network of the C5_{3,5}-C1-F3 network is used, together with the same regression network as in the C5_{1,5}-C1-F3 network. The block diagram of the C5_{2,5}-C1-F3 network is given in Figure A.7.
The C5_{4,5}-C1-F3 network uses the ReLU output of the conv4 layer, with the same regression network as in the C5_{1,5}-C1-F3 network. Its block diagram is given in Figure A.6.
In all the test networks so far, pool5 features are used as the high level features. It is known that high level features represent the input semantically; however, for the mouse face tracking problem there is only one class of input to be tracked, so such high level features may not be necessary. In order to evaluate this, the C5_{2,4}-C1-F3 network is defined. Its block diagram is given in Figure A.8. In that network,
pool2 features are used as the low level feature and the ReLU output of conv4 features as the high level feature. Note that their feature map sizes, 13x13, are equal to each other, so no pooling layer is necessary in the feature adaptation network to concatenate the features.
The depth of the network is decreased by one layer, since the conv5 layer of VGG-CNN-F is not used in the C5_{2,4}-C1-F3 network. In the C5_{2,4}-C2-F3 network, one more convolutional layer is added to keep the network depth the same as in the other networks except C5_{2,4}-C1-F3. It is expected that high level features of the mice dataset will be extracted by this conv5 extra layer. The block diagram of the C5_{2,4}-C2-F3 network is given in Figure A.9. The conv5 extra layer has the same properties as the conv extra layer, except that it has 512 kernels.
5.3 Test Procedure
The performance of the 9 different trackers is measured on 7 test video sequences, cropped from a video sequence that is not included in the training dataset. These test videos are selected considering the face visibility of the mouse.
The proposed tracker has a tracker failure detection mechanism, as explained in Section 4.4 on on-line tracking. In order to evaluate the trackers objectively, this property is turned off. Instead, a tracker is assumed to have failed if the region overlap between the target and ground-truth bounding boxes falls below a threshold value. After a tracker failure, the target location is re-initialized from the ground truth. This tracker failure threshold is called the re-initialization threshold.
True positive versus region overlap threshold and true positive versus center error threshold plots are given for a re-initialization threshold of 0. Each tracker is run on all test videos and the results are averaged to obtain the final plots. The area under the curve of these plots is a good indicator of performance; the AUC values are given to the right of the network names in the legend of each figure, and a higher AUC value means better performance.
Note that the failure rate performance measure, which is the best robustness indicator
among the mentioned performance measures, is measured for a given re-initialization threshold. In order to plot failure rate versus re-initialization threshold, the trackers must be run on all test videos over the whole re-initialization threshold interval. The trackers are therefore run on all test videos with 50 different threshold values, ranging from 0 to 0.98 in steps of 0.02. For the failure rate measure, a lower AUC value represents better performance.
The speed of a tracker is calculated by averaging its FPS values over all runs. There are 350 results for each tracker, since each tracker is run with 50 different re-initialization thresholds on 7 video sequences. Averaging the tracker speed over these 350 runs gives a reliable speed measure.
5.4 Performance Evaluation
In this section, the effect of a convolutional layer with high level features, the effect of a convolutional layer with the fusion of low and high level features, a comparison of low level feature depths, a comparison of high level feature depths, and an overall comparison of the test networks are presented with experimental results.
5.4.1 Effect of Convolutional Layer
As mentioned before, the C5_{0,5}-C0-F4 network shares the same architecture as the state-of-the-art GOTURN tracker except for the output layer. The C5_{0,5}-C1-F3 network is a slightly modified version of the C5_{0,5}-C0-F4 network in which the fc extra layer is replaced with the conv extra layer. Performance plots of both trackers are given in Figures 5.2, 5.3 and 5.4.
The experimental results show that a convolutional layer in the regression network improves the tracker in terms of both robustness and accuracy. The failure rate of the C5_{0,5}-C1-F3 network is lower than that of the C5_{0,5}-C0-F4 network between region overlap thresholds 0 and 0.5. Above threshold 0.5, the performance of the two networks is close to each other, because neither network is successful at tracking with high region overlap.
Figure 5.2: True Positive versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.656 and 0.702, respectively)
Figure 5.3: True Positive versus Normalized Center Error plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.741 and 0.743, respectively)
Figure 5.4: Failure Rate versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks (AUC: 0.465 and 0.432, respectively)
Although the normalized center error performance of the two networks is close, the region overlap performance of the C5_{0,5}-C1-F3 network is much better, which shows that the convolutional layer especially improves the bounding box accuracy. The experimental results support that an additional convolutional layer in the regression network is more successful than fully connected layers alone at merging features that come from different networks.
5.4.2 Effect of Low Level Features and Convolutional Layer in Feature Fusion Networks
In target localization, low level features are also necessary due to the wide receptive fields of high level features. It is expected that if low level features are used together with high level features, the accuracy of the tracker will increase. In this section, the effect of low level features is demonstrated with experimental results. The C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks are compared with the C5_{0,5}-C0-F4 and C5_{0,5}-C1-F3 networks. Both the C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks use the ReLU output of the conv3 layer. In addition, the effect of the convolutional layer when used with the fusion of high and low level features is evaluated. Performance plots are given in Figures 5.5, 5.6 and 5.7.
Figure 5.5: True Positive versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.656, 0.702, 0.318 and 0.770, respectively)
Figure 5.6: True Positive versus Normalized Center Error plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.741, 0.743, 0.520 and 0.786, respectively)
Figure 5.7: Failure Rate versus Region Overlap Ratio plot for the C5_{0,5}-C0-F4, C5_{0,5}-C1-F3, C5_{3,5}-C0-F4 and C5_{3,5}-C1-F3 networks (AUC: 0.465, 0.432, 0.611 and 0.362, respectively)
When the ReLU output of the conv3 layer is used with the conv extra layer, the performance of the tracker increases significantly, as expected. However, if the low level features are used with the fc extra layer, performance decreases significantly because of the large number of free parameters in the fc extra layer.
The conv5 layer contains 256 feature maps of size 6x6, and the conv3 layer contains 256 feature maps of size 13x13. The ReLU of conv3 features is downscaled to 6x6 by max pooling. In the C5_{3,5}-C0-F4 network, the features of 4 layers are concatenated, which makes 1024 feature maps of size 6x6. These features are flattened and given to the fc extra layer, which therefore has 36864 inputs and 4096 outputs, corresponding to 151 million free parameters in a single layer. As a result, the C5_{3,5}-C0-F4 network easily over-fits to the training dataset and loses its ability to generalize mouse movement behavior, which decreases tracker performance significantly.
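The parameter count quoted above is easy to verify (weight count only, biases excluded):

```python
# fc extra input in the C5_{3,5}-C0-F4 network: 1024 feature maps of
# size 6x6, flattened, fully connected to 4096 output neurons
inputs = 1024 * 6 * 6      # 36864 inputs
weights = inputs * 4096    # about 151 million free parameters
```

A 3x3 convolutional layer with 256 kernels over the same 1024-channel input would instead need only 3 * 3 * 1024 * 256 (about 2.4 million) weights, which is why the conv extra variant is far less prone to over-fitting.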
5.4.3 Effect of Depth of Low Level Features
It has been seen that low level features improve tracker performance. In this section, the effect of the depth of the low level features is evaluated. The C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks, which use the low level features of pool1, pool2, ReLU of conv3 and ReLU of conv4, respectively, are compared. They all use the conv extra layer, since
Figure 5.8: True Positive versus Region Overlap Ratio plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.804, 0.805, 0.770 and 0.574, respectively)
it has been shown that a fully connected layer cannot perform well with a large number of feature maps. Performance plots are given in Figures 5.8, 5.9 and 5.10.
When the ReLU output of conv4 is used as the low level feature, the performance of the tracker is worse than the others, because conv4 still has too large a receptive field to locate the target precisely.
If the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3 and C5_{3,5}-C1-F3 networks are examined, it is seen that they all perform well in terms of normalized center error. However, the C5_{2,5}-C1-F3 and C5_{1,5}-C1-F3 networks are better at region overlap, because the pool1 and pool2 features have smaller receptive fields while still being discriminative for the mouse face. The trackers that use pool1 and pool2 features therefore outperform the others, with the C5_{2,5}-C1-F3 network slightly better than the C5_{1,5}-C1-F3 network.
5.4.4 Effect of Depth of High Level Features
Pool2 features give good performance as low level features. The effect of the depth of the high level feature is examined by comparing the C5_{2,5}-C1-F3, C5_{2,4}-C1-F3 and C5_{2,4}-C2-F3 networks. In the C5_{2,4}-C1-F3 network, the ReLU output of the conv4 layer is used as the high level feature. In the C5_{2,4}-C2-F3 network, an additional convolutional layer is added. The aim of
Figure 5.9: True Positive versus Normalized Center Error plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.778, 0.788, 0.786 and 0.717, respectively)
Figure 5.10: Failure Rate versus Region Overlap Ratio plot for the C5_{1,5}-C1-F3, C5_{2,5}-C1-F3, C5_{3,5}-C1-F3 and C5_{4,5}-C1-F3 networks (AUC: 0.338, 0.337, 0.362 and 0.473, respectively)
Figure 5.11: True Positive versus Region Overlap Ratio plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.579, 0.706 and 0.805, respectively)
adding this layer is to obtain high level features extracted from the fusion of pool2 and the ReLU of conv4 with the mice dataset. Performance plots are given in Figures 5.11, 5.12 and 5.13.
The performance of the C5_{2,4}-C1-F3 network is lower than the others. The pool5 layer contains more semantic features than the conv4 layer, which increases the network's ability to identify the input object; in addition, the network depth is decreased in the C5_{2,4}-C1-F3 network. If the network depth is increased by adding an extra convolutional layer, tracker performance increases. However, the C5_{2,5}-C1-F3 network still performs better, because the VGG-CNN-F network is trained with a rich dataset compared to the mice dataset, which makes its features more representative.
5.4.5 Overall Comparison
The robustness versus accuracy plot is given in Figure 5.14. The AUC of true positive versus region overlap is used as the accuracy term, and 1 minus the AUC of failure rate versus re-initialization threshold is used as the robustness term.
It is seen that the trackers that use low and high level features with a convolutional layer
Figure 5.12: True Positive versus Normalized Center Error plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.704, 0.770 and 0.788, respectively)
Figure 5.13: Failure Rate versus Region Overlap Ratio plot for the C5_{2,4}-C1-F3, C5_{2,4}-C2-F3 and C5_{2,5}-C1-F3 networks (AUC: 0.433, 0.411 and 0.337, respectively)
Figure 5.14: Robustness vs Accuracy Plot of All Trackers
in their regression networks give better performance in terms of both accuracy and robustness. The C5_{3,5}-C0-F4 network gives the worst performance even though it uses both low and high level features: it does not use a convolutional layer in its regression network and therefore over-fits to the dataset.
5.5 System Performance
All trackers are run on a workstation located in the METU Intelligent Systems and Computer Vision Laboratory. The workstation contains an Intel Core i7 3.3 GHz CPU and an NVidia Titan X GPU. The Caffe deep learning framework is used in the implementation of the networks. Since the architecture of artificial neural networks is very suitable for parallel computing, they run faster on a GPU; the proposed tracker runs 35 times faster on the GPU than on the CPU.
The mice dataset is stored on an HDD, and the training of the test networks is performed on the GPU. With this configuration, training one batch takes 0.8 seconds with a batch size of 50, and training a test network takes approximately 5000 batch iterations.
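These figures imply the following rough training cost (simple arithmetic, not a measured value):

```python
batch_size = 50
seconds_per_batch = 0.8
iterations = 5000

samples_seen = batch_size * iterations                   # 250000 training samples
training_minutes = iterations * seconds_per_batch / 60   # roughly 67 minutes per network
```

So each test network trains in about an hour on this hardware.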
Tracker speeds with non-optimized code are given in Table 5.2. Tracking a frame involves cropping and resizing the previous and current frames, feeding them forward through the network, and transforming the label from the search area to a bounding box in the image. The
Table 5.2: Tracker Speeds of the Test Networks

| Network Name | Throughput (FPS) |
|---|---|
| C5_{2,5}-C1-F3 | 113.77 |
| C5_{0,5}-C0-F4 | 127.61 |
| C5_{0,5}-C1-F3 | 126.87 |
| C5_{3,5}-C0-F4 | 105.39 |
| C5_{3,5}-C1-F3 | 113.42 |
| C5_{4,5}-C1-F3 | 113.96 |
| C5_{1,5}-C1-F3 | 125.19 |
| C5_{2,4}-C1-F3 | 98.51 |
| C5_{2,4}-C2-F3 | 117.54 |
size of the frames given to the tracker before cropping is 1280x720; depending on the image size, the resizing and cropping durations may change.
CHAPTER 6
CONCLUSION
The aim of this thesis is to design an object tracker specialized for the face of a mouse. For that purpose, a special convolutional neural network architecture that takes two consecutive frames and outputs the target bounding box is proposed. The convolutional neural network is trained with the mice dataset, which is generated from videos recorded at the Hacettepe University Institute of Neurological Sciences and Psychiatry, Behavior Experiments Research Laboratory. The face locations in the video sequences are labeled by members of the METU Computer Vision and Intelligent Systems Research Laboratory.
A deep neural network can learn how to represent data by training on a training dataset. Each layer of the network learns feature extractors of different complexity: while shallow layers learn basic features such as edges, deep layers learn more semantic features.
The aim in object tracking is to follow an object by defining its bounding box throughout a video sequence. To achieve this, a tracker should be able both to identify the object to be tracked and to define a precise bounding box for it. It is known that high level features contain semantic information about input images that is useful for object identification. However, high level features have large receptive fields, which makes precise localization of a given input hard. In this thesis, high level and low level features are merged for accurate and robust target tracking.
The tracker network is composed of four neural networks. Two of them are called feature extraction networks. Since a large amount of training data is necessary to train a convolutional neural network to obtain good feature extractors, the convolutional
layers of VGG-CNN-F, which was trained with the ImageNet dataset of 1.2 million images, are used as the feature extraction networks. The third network is called the feature adaptation network and is used to merge the low and high level features of the feature extraction networks. The fourth network is called the regression network, and it is the only network trained with the mice dataset. The VGG-CNN-F network is not fine-tuned, since fine-tuning it with the mice dataset, which has a limited amount of data compared to ImageNet, would distort its representative power. The optimal depths of the high and low level features are selected based on the experimental results. For a given pair of consecutive frames, high and low level features are extracted from the VGG-CNN-F networks and concatenated. The concatenated features are related to the bounding box of the target by the regression network, which learns how to define a bounding box from the features of consecutive mouse frames, since it is trained with the mice dataset.
The experimental results showed that an additional convolutional layer at the input of the regression network performs better, because the concatenation of features from the feature extraction networks contains spatial information that can be exploited by a convolutional layer. Although the proposed method is specialized in tracking the face of a mouse, it can be adapted to any target by changing the training dataset.
Neural network based object trackers are usually slow due to the on-line training they use to adapt to object changes during tracking. In this thesis, the regression network learns how to define a bounding box for the natural movements of a mouse face; in other words, it learns how the mouse moves, so there is no need for on-line adaptation. The proposed tracker defines the bounding box by just feeding the frames forward through the network and can therefore run at 113 FPS on a GPU.
Tracker performance can be increased further if more training data is used.
REFERENCES
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[2] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," http://arxiv.org/abs/1312.6229.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[4] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724, 2014.
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in ICML, pp. 647–655, 2014.
[6] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in European Conference on Computer Vision, pp. 834–849, Springer, 2014.
[7] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, 2014.
[8] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in European Conference on Computer Vision, pp. 297–312, Springer, 2014.
[9] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller, "Recognizing image style," arXiv preprint arXiv:1311.3715, 2013.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[11] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–42, 2015.
[12] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
[13] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 FPS with deep regression networks," arXiv preprint arXiv:1604.01802, 2016.
[14] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning." Book in preparation for MIT Press, 2016.
[15] S. Herculano-Houzel, "The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost," Proceedings of the National Academy of Sciences, vol. 109, no. Supplement 1, pp. 10661–10668, 2012.
[16] B. Catanzaro, "Deep learning with COTS HPC systems," 2013.
[17] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[21] Wikimedia, "Neuron," 2006. [Online; accessed August 02, 2016].
[22] A. H. Gittis, S. H. Moghadam, and S. du Lac, "Mechanisms of sustained high firing rates in two classes of vestibular nucleus neurons: differential contributions of resurgent Na, Kv3, and BK currents," Journal of Neurophysiology, vol. 104, no. 3, pp. 1625–1634, 2010.
[23] F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] A. Karpathy, "CS231n: Convolutional neural networks for visual recognition." http://cs231n.github.io. Accessed: 2016-08-20.
[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, pp. 249–256, 2010.
[28] S. Ruder, "An overview of gradient descent optimization algorithms." http://sebastianruder.com/optimizing-gradient-descent/index.html. Accessed: 2016-08-20.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[30] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)," in Doklady AN SSSR, vol. 269, pp. 543–547, 1983.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[32] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[33] T. Tieleman and G. Hinton, "Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
[34] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," The Journal of Physiology, vol. 195, no. 1, pp. 215–243, 1968.
[36] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
89
Page 108
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learningapplied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,pp. 2278–2324, 1998.
[38] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of thedevil in the details: Delving deep into convolutional nets,” arXiv preprintarXiv:1405.3531, 2014.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[40] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition,2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recog-nition challenge,” International Journal of Computer Vision, vol. 115, no. 3,pp. 211–252, 2015.
[42] N. Wang and D.-Y. Yeung, “Learning a deep compact image representationfor visual tracking,” in Advances in Neural Information Processing Systems 26(C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,eds.), pp. 809–817, Curran Associates, Inc., 2013.
[43] Y. Chen, X. Yang, B. Zhong, S. Pan, D. Chen, and H. Zhang, “Cnntracker:Online discriminative object tracking via deep convolutional neural network,”Applied Soft Computing, vol. 38, pp. 1088–1098, 2016.
[44] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tinyimages,” 2009.
[45] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional fea-tures for visual tracking,” in Proceedings of the IEEE International Conferenceon Computer Vision), 2015.
[46] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Convolutionalfeatures for correlation filter based visual tracking,” in Proceedings of the IEEEInternational Conference on Computer Vision Workshops, pp. 58–66, 2015.
[47] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object track-ing using adaptive correlation filters,” in Computer Vision and Pattern Recogni-tion (CVPR), 2010 IEEE Conference on, pp. 2544–2550, IEEE, 2010.
[48] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learning spatiallyregularized correlation filters for visual tracking,” in Proceedings of the IEEEInternational Conference on Computer Vision, pp. 4310–4318, 2015.
90
Page 109
[49] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimationfor robust visual tracking,” in British Machine Vision Conference, Nottingham,September 1-5, 2014, BMVA Press, 2014.
[50] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discrimi-native saliency map with convolutional neural network,” in Proceedings of the32nd International Conference on Machine Learning, 2015, Lille, France, 6-11July 2015, 2015.
[51] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional net-works: Visualising image classification models and saliency maps,” CoRR,vol. abs/1312.6034, 2013.
[52] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convo-lutional networks,” in 2015 IEEE International Conference on Computer Vision(ICCV), pp. 3119–3127, Dec 2015.
[53] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang, “Video tracking usinglearned hierarchical features,” IEEE Transactions on Image Processing, vol. 24,no. 4, pp. 1424–1435, 2015.
[54] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural localsparse appearance model,” in Computer vision and pattern recognition (CVPR),2012 IEEE Conference on, pp. 1822–1829, IEEE, 2012.
[55] C. Cadieu and B. A. Olshausen, “Learning transformational invariants from nat-ural movies,” in Advances in neural information processing systems, pp. 209–216, 2008.
[56] H. Nam and B. Han, “Learning multi-domain convolutional neural networks forvisual tracking,” arXiv preprint arXiv:1510.07945, 2015.
[57] H. Li, Y. Li, and F. Porikli, “Deeptrack: Learning discriminative feature repre-sentations by convolutional neural networks for visual tracking,” in Proceedingsof the British Machine Vision Conference, BMVA Press, 2014.
[58] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung, “Transferring rich feature hierar-chies for robust visual tracking,” arXiv preprint arXiv:1501.04587, 2015.
[59] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convo-lutional networks for visual recognition,” IEEE transactions on pattern analysisand machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[60] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-rama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embed-ding,” in Proceedings of the 22nd ACM international conference on Multimedia,pp. 675–678, ACM, 2014.
91
Page 110
[61] L. Cehovin, M. Kristan, and A. Leonardis, “Is my new tracker really betterthan yours?,” in IEEE Winter Conference on Applications of Computer Vision,pp. 540–547, IEEE, 2014.
92
Page 111
APPENDIX A
BLOCK DIAGRAMS OF TEST NETWORKS
[Block diagram: the previous frame and the current frame are processed by Feature Extraction Network #1 and Feature Extraction Network #2 (Conv1–Conv5 with ReLU, LRN and pooling). The Pool5 outputs of the two streams are concatenated and fed to a regression network (FCExtra, FC6, FC7, FC8 with dropout) that outputs the target state.]

Figure A.1: C5_{0,5} − C0 − F4 network. The network architecture of GOTURN is implemented by changing the feature extraction network and the output layer.
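The two-stream regression layout in Figure A.1 can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the random-projection `extract_features` stand-in and all dimensions are invented for illustration, and only the data flow (two feature streams, concatenation, a linear regression head producing a 4-dimensional target state) mirrors the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(frame):
    """Stand-in for a Conv1-Conv5 feature extraction stream:
    a fixed linear projection to a Pool5-sized feature vector."""
    W = np.ones((frame.size, 256)) / frame.size  # placeholder weights
    return frame.reshape(-1) @ W

def regress_target_state(prev_frame, curr_frame, fc_weights):
    # One feature extraction stream per frame
    f1 = extract_features(prev_frame)
    f2 = extract_features(curr_frame)
    # Concatenate the Pool5-level features and regress the target state
    fused = np.concatenate([f1, f2])   # shape (512,)
    return fused @ fc_weights          # shape (4,): a bounding-box state

prev = rng.random((64, 64))
curr = rng.random((64, 64))
fc = rng.random((512, 4))
state = regress_target_state(prev, curr, fc)
print(state.shape)  # (4,)
```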
93
Page 112
[Block diagram: the same two feature extraction networks; the Pool5 outputs are concatenated and passed through a ConvExtra layer with ReLU, then FC6, FC7 and FC8 with dropout, producing the target state.]

Figure A.2: C5_{0,5} − C1 − F3 network. The fully connected layer of the C5_{0,5} − C0 − F4 network is replaced with a convolutional layer.
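Replacing the extra fully connected layer with a convolutional one, as in Figure A.2, amounts to applying a per-location linear map to the concatenated feature maps instead of flattening them. A sketch of that idea using a 1x1 convolution (the channel counts and spatial size here are hypothetical, not the thesis configuration):

```python
import numpy as np

def conv_1x1(x, weights):
    """1x1 convolution over a (C, H, W) map: the same linear map applied
    at every spatial location, the convolutional counterpart of a
    fully connected layer."""
    c, h, w = x.shape
    out_c = weights.shape[0]  # weights: (out_C, in_C)
    return (weights @ x.reshape(c, h * w)).reshape(out_c, h, w)

rng = np.random.default_rng(0)
fused = rng.random((512, 6, 6))  # concatenated Pool5 feature maps
w = rng.random((256, 512))
out = conv_1x1(fused, w)
print(out.shape)  # (256, 6, 6)
```

Unlike a fully connected layer, this keeps the spatial structure of the concatenated features intact for the following layers.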
[Block diagram: the Pool2 and Pool5 outputs of both feature extraction networks are max-pooled and concatenated, then passed through the regression network (FCExtra, FC6, FC7 with dropout) to produce the target state.]

Figure A.3: Network architecture of the C5_{2,5} − C1 − F3 network.
94
Page 113
[Block diagram: the Conv3 ReLU output and the Pool5 output of both feature extraction networks are max-pooled and concatenated before the regression network (FCExtra, FC6, FC7, FC8 with dropout).]

Figure A.4: C5_{3,5} − C0 − F4 network. The ReLU output of the Conv3 layer is added as a low-level feature.
[Block diagram: as in Figure A.4, but the concatenated features pass through a ConvExtra layer with ReLU before FC6, FC7 and FC8 with dropout.]

Figure A.5: C5_{3,5} − C1 − F3 network. The fully connected layer of the C5_{3,5} − C0 − F4 network is replaced with a convolutional layer.
95
Page 114
[Block diagram: the Conv4 ReLU output and the Pool5 output of both feature extraction networks are max-pooled and concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.6: C5_{4,5} − C1 − F3 network. The ReLU output of the Conv4 layer is used as the low-level feature.
[Block diagram: the Pool1 and Pool5 outputs of both feature extraction networks are max-pooled and concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.7: C5_{1,5} − C1 − F3 network. Pool1 is used as the low-level feature.
96
Page 115
[Block diagram: the Pool2 output and the Conv4 ReLU output of both feature extraction networks are concatenated, then passed through ConvExtra with ReLU and FC6, FC7, FC8 with dropout.]

Figure A.8: C5_{2,4} − C1 − F3 network. Pool2 is used as the low-level feature and the ReLU output of the Conv4 layer is used as the high-level feature.
[Block diagram: as in Figure A.8, with an additional Conv5Extra layer (with ReLU) applied to the Conv4 ReLU output before concatenation.]

Figure A.9: C5_{2,4} − C2 − F3 network. A Conv5Extra layer is added to the C5_{2,4} − C1 − F3 network in order to extract dataset-specific high-level features.
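The low-level/high-level fusion used by the networks in Figures A.3 through A.9 requires bringing the low-level map down to the high-level map's spatial size before the two can be concatenated along the channel axis, which is what the max-pooling blocks in the diagrams do. A minimal NumPy sketch of that step (the channel counts and spatial sizes here are invented for illustration, not the thesis values):

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    return x[:, : h - h % k, : w - w % k].reshape(
        c, h // k, k, w // k, k).max(axis=(2, 4))

# Hypothetical shapes: a Pool2-like low-level map and a
# Conv4-ReLU-like high-level map from one feature stream.
low = np.random.default_rng(1).random((16, 24, 24))
high = np.random.default_rng(2).random((64, 6, 6))

# Pool the low-level map to the high-level spatial size (24 -> 6),
# then concatenate along the channel axis before the regressor.
low_pooled = max_pool2d(low, 24 // 6)        # (16, 6, 6)
fused = np.concatenate([low_pooled, high], axis=0)
print(fused.shape)  # (80, 6, 6)
```

The fused map carries both fine spatial detail from the early layer and the more abstract features of the late layer into the regression network.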