AUTOMATIC RECOGNITION OF PRIMATE BEHAVIORS AND SOCIAL
INTERACTIONS FROM VIDEOS
A Dissertation Presented
By
Nastaran Ghadar
to
The Department of Electrical & Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
May 2015
NORTHEASTERN UNIVERSITY
College of Engineering
Department of Electrical and Computer Engineering

Abstract
Doctor of Philosophy
Automatic Recognition of Primate Behaviors and Social Interactions

List of Figures

2.2 Environment setup and lens installation.
2.3 Primate research: a group of four primates viewed from different cameras in the pen, with a different setting than Figure 2.1.
2.4 A sample image in NorPix software from different . . .
3.1 An example of a static background subtraction algorithm; image taken from [11].
3.2 Background normalization using the static background image.
5.1 2D example of the visual hull approximation algorithm. C1, C2, C3 are different views with corresponding silhouettes S1, S2, S3. The yellow area is the approximation of the visual hull; the area enclosed by black lines is the actual visual hull; and the blue shape in the center is the object.
6.1 A sample image of locomotion activity. The primate shown with the red box is moving, but no other primate has motivated this movement.
6.2 This series of images, from top right to bottom left, shows the chasing and avoiding activities happening between the two primates marked with red circles.
6.3 This series of images, from top right to bottom left, shows the avoiding activity for the primate marked with the red circle. Note that this activity is not a result of chasing in this case.
6.4 The decision tree used to evaluate our test set. The leaf nodes show the decision made based on the feature values.
rapid opening and closing of mouth and lips), and affiliative behavior [7], where
certain behaviors may co-occur (e.g., animals can lipsmack and move). The main
advantage of video recordings is that they eliminate human intrusion and can replace direct observation, at the cost of having to review multiple perspectives and long videotapes. To monitor eating and drinking behaviors, there are mechanisms to automatically log the time and quantity of intake, but so far no automatic solution exists for evaluating the behaviors of individuals in groups. While observational methodologies have not undergone major changes, the ways the data are interpreted have evolved with statistical methods: beyond statistics on the duration, frequency, and latency of behaviors, analyses now often include context in terms of the preceding behavior. Sociograms provide another perspective on social behavior and relationships, representing associations between individuals with lines whose
thickness depends on the strength of association [8]. In this research our interest
is in some of the specific behaviors presented in Table 1.1, where we utilize the
focal observation method to annotate the recordings and create a data set to train
and validate our models. Specifically, we focus on the types of behaviors that can be interpreted from the animals' relative positions.
Label         Comment
Aggression    rough behavior or biting
Chase         pursuit
Displace      subject leaves when approached
Explore       inspects objects other than food
Fear grimace  subject bares teeth
Forage        searching, presumably for food
Freeze        subject is inactive; may move eyes
Groom         with hands or mouth
Lipsmack      rapid movement of lips
Locomotion    motion of entire body
Play          grunting, wrestling, jumping, etc.
Stationary    immobile, moving head or arm
Threat        scream, lunge, ground beating, etc.
Vigilant      subject scans environment with eyes
Table 1.1: Example macaque behaviors often encoded in human observations.
To better understand some of these behaviors, we discuss them separately.
Dominance: Rhesus macaques naturally live in social groups, and over time they establish a linear "dominance hierarchy." This hierarchy may change over time and depends on many factors (age, sex, aggression, perhaps intelligence); it can also depend on the support of other primates in the group. As the term suggests, primates with a higher rank in the hierarchy tend to be more dominant, i.e., they displace lower-ranked individuals from resources (mates, space, food). They also tend to have higher reproductive success, either by mating more often or by having more resources to invest in their offspring. Rank is established through play, interactions, and affiliative interactions (and, somewhat tautologically, that is exactly how it is measured). Interestingly, the maintenance of social position and the social knowledge of one's rank is one of the proposed explanations for why humans evolved large brains.
Grooming: One of the most common activities among primates is grooming.
Grooming other primates is an important mechanism that shows their affection
for each other. There are several reasons why a primate might groom another: subordinate animals tend to groom more dominant ones; males groom females for sexual purposes; and mothers groom infants for the practical purpose of keeping their fur clean. What is certain is that grooming strengthens the bonds between individuals and holds the primate social structure together.
Communication: This includes scents, body postures, gestures, and vocaliza-
tions. Some of these appear to be autonomic responses indicating emotional states:
fear, excitement, confidence, anger. Others seem to have a more specific purpose:
loud ranging calls in indri, howler monkeys and gibbons; quiet contact calls in
lemurs to keep the group together; fear calls in lost infants, or on spotting preda-
tors. From our human perspective, we often find it easier to associate sounds with
specific meaning, but among non-human primates, gestures and actions are often
used. Presentation and mounting behaviors are often used to defuse potentially
aggressive situations. Yawns exposing teeth are often threats, as is direct eye
contact. Facial expression is important too. This is most obvious in chimpanzees, whose expressions often appear all too human-like, but other primates also use stereotyped eyelid flashes or lip smacks.
Aggressive and affiliative behavior: As mentioned before, many behaviors
exist to keep the group structure running smoothly for the members of the group.
There are occasions though when these behaviors (especially aggression) are di-
rected outside the group.
Distance-related behaviors: These include locomotion (running, jumping, walking, and climbing) and specific aspects of foraging behavior.
1.3 Background and Related Work
In this section, we provide a selected review of closely related work. In the com-
puter vision community, many studies employ videos of animals as standard data
sets to develop new algorithms, especially for tracking or behavior recognition.
Most of the presented methodologies on animal analysis are conducted in highly
controlled environments, for instance, with a static camera, in a well-defined loca-
tion, with static background, and with no environmental factors interfering, such
as occlusion, different illumination conditions, and interfering objects [9, 10]. One
common scenario for a controlled environment would be monitoring applications,
where there is a static background and a static camera [12]. This setting makes it
straightforward to learn the static background and easily obtain the foreground by
looking for deviations from the background. More sophisticated techniques have
also been introduced. Khan et al. [13] developed a system that can automatically
generate the three dimensional trajectory of primates in an outdoor environment.
Their purpose is to evaluate the navigational abilities of non-human primates.
Their system extracts primate kinematic features such as path length, speed, and
other variables impossible for an unaided observer to note. From trajectories, they
computed and validated a path length measurement and proposed a method for
automatic behavior detection. Also, their system is used to examine the gender
differences in spatial navigation of rhesus primates. They set the environment
in a way to avoid occlusion, i.e. an open environment with minimal perturba-
tions, and they did not analyze the social interactions between primates, but put
their focus on individual actions. Chaumont et al. [14] proposed a computerized
method and a software called Mice Profiler, that uses geometrical primitives to
model and track social interactions in mice. Their system monitors a comprehen-
sive repertoire of behavioral states and temporal evolution, which is utilized for
identifying the key events that trigger social contact. Balch et al. [15] proposed an
automated labeling system to study social insect behaviors. Their ultimate goal
is to automatically create executable models of animal behavior. An algorithm
proposed by Burghardt and Calic [17] detects animal faces using Haar features
and then tracks animals; such algorithms would not work for animals whose faces
are not visible or hard to track. Other approaches [18, 19] have the user mark
or extract the location of the animal by hand. This, of course, is extraordinarily
time-consuming. Khorami et al. [10] proposed an approach that detects multiple types of animals in an entirely unsupervised manner. Walther
et al. [20] apply saliency maps to the multi-agent tracking of low-contrast,
translucent targets in underwater footage. Haering et al. [21] use neural network
algorithms to detect high-level events, such as hunts, by classifying and tracking
moving object blobs. Tweed and Calway [22] proposed an approach that achieves
multiple object tracking by developing a periodic model of animal motion and
exploiting conditional density propagation to track flocks of birds. Ramanan and
Forsyth [23] proposed an interesting method, where they use low-level detectors
and a mean shift construct to create an appearance model for the animal and
use it to detect the animal in future frames. Their method takes into account
temporal coherency when building appearance models of animals. While they
present very good results in their paper, they only deal with three different animal
species and with cases that have no occlusion. Everingham et al. [24] proposed an
approach that combines a minimal manually labeled set with an object tracking
technique to gradually improve the detection model; however, they only deal with
human faces. Gibson et al. [25] and Hannuna et al. [26] try to address the issue
of animal behavior classification by detecting and classifying animal gait by ap-
plying statistical analysis to sparse motion information extracted from wildlife footage. Burghardt et al. [17] present an algorithm that tracks animal faces in wildlife rushes and populates a database [27] with appropriate semantics defining their basic locomotive behavior. Their detection algorithm is an adapted version of a human face detection method that exploits Haar-like features and the AdaBoost classification algorithm [28]; for tracking, it uses the Kanade-Lucas-Tomasi method, fusing it
with a specific interest model applied to the detected face region. They achieved
reliable detection and temporally smooth tracking of animal faces. Furthermore,
the tracking information is exploited to classify locomotive behavior of the tracked
animal, e.g., a lion walking left or trotting towards the camera. Finally, the extracted
metadata about the presence of the animal, together with its locomotive behav-
ior, creates a strong prior in the process of learning animal models as well as in
extracting the additional semantic information about the animal’s behavior and
environment. The presented algorithm is a part of a large content-based retrieval
system [29] within the ICBR project that focuses on the computer vision research
challenges in the domain of wildlife documentary production. This algorithm is
close to what we are presenting in this project.
1.4 Description of Framework of Dissertation
In this dissertation, we developed a general framework for detecting, localizing,
tracking, and reconstructing images of social animals in a 3D observation environ-
ment. Finally, using these results, we were able to extract elementary behaviors
from videos.
As evident from the cited literature, the necessary components have developed
sufficiently in recent years to allow computational scientists to undertake the chal-
lenge of creating a framework for modeling and recognizing behavior of individuals
in their social groups. The structure of this dissertation is as follows:
1. Recording behaviors with multi-channel audio and video data: In
Chapter 2, I will discuss the details of data collection and how we acquired
our data for our experiments.
2. Detecting individual primates in the pen: In Chapter 3, I will start
with the definition of object detection. There are several algorithms currently
available in the literature for object detection, and each has its advantages
and disadvantages. After introducing these algorithms and discussing where
they work best, I define the framework of our detection algorithm and why
we chose the proposed methods.
3. Tracking individuals over time: In Chapter 4, I will introduce some of
the most common algorithms for object tracking and when we would expect
to get a good performance out of them. Finally I will discuss the details of
our tracking algorithm.
4. Calibration and 3D visual hull reconstruction of primates: In Chap-
ter 5, I will explain the details necessary for us to obtain a 3D silhouette of
the primates in the pen and decide whether having a 3D system is helpful
or not.
5. Recognizing individual behaviors: In Chapter 6, I will discuss the activ-
ities we are interested in. After that, I will describe an algorithm to recognize
them.
6. Experimental results: In Chapter 7, I will present the results of each
section separately and discuss them.
7. Conclusion, discussion, and future work: In Chapter 8, I will discuss
the pros and cons of our algorithm and how one can improve it in terms of
efficiency and performance.
Chapter 2
Data Collection and Preparation
2.1 Acknowledgement
The data collection described in this chapter was carried out entirely by our collaborators at OHSU. All the data was collected by the OHSU team, which was led by Dr. Shafran. I would like to acknowledge Alireza Bayesteh Tashk, Guillaume Thibault, and Meysam Asgari for the grunt work they did over two years of collecting the data. I would also like to acknowledge Dr. Kristine Coleman, Nicola Robertson, and Megan McClintik for conducting the animal studies, and Dr. Kathy Grant for her input in the process.
2.2 Experimental Setup
Overall, five groups of animals were observed; each group consisted of 4 or 6 rhesus macaques held in a pen (approximately 12 ft long x 7 ft deep x 7 ft high) at the Oregon National Primate Research Center (ONPRC), using a protocol approved by OHSU's Institutional Review Board.
Individuals from isolated cages were put into the pen, and their behavioral activities were recorded for two days from about 7am to 7pm; there was no recording when the lights were off. After a week, their behavior was recorded again for two days. By this time they had established a dominance hierarchy, i.e., a stable phase. Two more two-day sessions were recorded to observe the effect of an escalating series of perturbations, i.e., a perturbed phase. The major perturbations applied were: 1) Human Impostor (introducing an unfamiliar human near the cage or pen for 15 minutes), 2) Resource Competition (modulating certain resources, for instance preferred resting areas, toys, and treats), and 3) Social Instability (removal of the most dominant individual for the entire last week). These perturbations created the chance to observe the interactions that establish social dominance hierarchies.
Figure 2.1: Primate research: a group of four primates viewed from different cameras in the pen.
2.3 Recording Behaviors with Multi-channel Audio and Video Data
Automating recognition of behaviors requires capturing all the information rele-
vant for detecting individuals in the pen, tracking their movement over time and
recognizing their vocalizations.
In the video domain, to avoid occlusions and to maximize coverage of the entire volume of the pen, we recorded behaviors using cameras from multiple perspectives: three cameras (GC1380CH, 2/3" CCD) with wide-aperture lenses (Optron 5mm f/2.8) on three corners of the pen, and a fourth camera (GC1380CH, 2/3" CCD) with a wide-angle fisheye lens (Edmund Optics NT62-274, focal length 1.8mm, F1.4, 185 x 185 degrees) on top of the pen. Ideally, the pen should be uniformly illuminated to avoid blotchy over-exposed and dark under-exposed regions in the image, but this is very difficult to achieve. We minimized the illumination variation by relying on several overhead incandescent tube lights, supplemented by a light box mounted at floor level. The lights were programmed to switch off during the night hours, about 7pm to 7am. Figure 2.2 shows the camera setup, and Figures 2.3 and 2.1 show a typical camera frame from four views for the two groups of primates.
Additionally, to simplify the task of identifying the individuals in the video recordings, we color-coded the collars on the monkeys in each group. Collars were powder-coated with one of six colors: purple, green, orange, blue, red, and yellow for the group of six monkeys, and green, yellow, black, and red for the group of four monkeys.
High-level synchronization of frames from the four cameras was obtained by triggering the cameras to capture each frame with a common trigger signal (National Instruments pulse generation module). The trigger signals were controlled and programmed with high-level software, StreamPix 5 from NorPix, on a dedicated data collection workstation.
The measurement matrix $C$ maps the state estimate $Q$ to the expected new measurement:

$$C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

Predict the next state of the primate from the last state and the predicted motion:

$$Q_{\mathrm{estimate}} = A \, Q_{\mathrm{estimate}} + B \, u$$

Predict the next covariance:

$$P = A P A^{\mathsf{T}} + E_x$$

Compute the Kalman gain from the predicted measurement covariance $C P C^{\mathsf{T}} + E_z$:

$$K = P C^{\mathsf{T}} \left( C P C^{\mathsf{T}} + E_z \right)^{-1}$$

Update the state estimate:

$$Q_{\mathrm{estimate}} = Q_{\mathrm{estimate}} + K \left( Q_{\mathrm{measurement}} - C \, Q_{\mathrm{estimate}} \right)$$
Using this method, the tracking accuracy improves. The advantage of this algorithm is that when a primate leaves the scene and comes back, it can be re-acquired using the color information from the primate's collar. Without this information, using only nearest-neighbor correspondence or a Kalman filter, it is very hard to accurately track an object that leaves the scene and comes back at a different location with a different direction of movement. A minimal sketch of the Kalman cycle above is given below.
Chapter 5
Calibration and 3D Reconstruction
5.1 Camera Calibration
For more than a decade, researchers in computer vision have been interested in digitizing time-varying events, recorded by video cameras from multiple viewpoints, into 3D scenes. Usually the events in the videos are human activities, and the ultimate goal is to let the observer view the event from any arbitrary viewpoint; this is called free-viewpoint video. Some of the applications of converting a scene into 3D models are: 1) 3D tele-immersion, 2) digitizing rare cultural performances, 3) sports action, and 4) generating content for 3D
video-based realistic training and demos for surgery, medicine and other technical
fields.
Currently, in all multi-camera systems [84–90], calibration and synchronization must be done in an offline calibration stage before the actual video is captured: a person has to go to the scene with a calibration object, such as a planar calibration grid or a point LED, and shots of the calibration object are taken from different angles. This offline step makes calibration cumbersome; if the cameras move and calibration is needed more than once, the task has to be repeated every time.
5.1.1 Explicit Camera Calibration
Physical camera parameters are commonly divided into extrinsic and intrinsic pa-
rameters. Extrinsic parameters are needed to transform object coordinates to a
camera centered coordinate frame. In multi-camera systems, the extrinsic pa-
rameters also describe the relationship between the cameras. The pinhole camera
model is based on the principle of co-linearity, where each point in the object space
is projected by a straight line through the projection center into the image plane.
The intrinsic camera parameters include the effective focal length, the scale factor, and the image center. This information is usually provided by the camera manufacturer.
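As a concrete illustration of the pinhole model and the intrinsic/extrinsic split, the following MATLAB sketch projects a single 3D world point into pixel coordinates. All numeric values are assumed for illustration and are not our calibrated parameters.

f  = 1800;  cx = 690;  cy = 520;       % focal length and principal point (pixels)
K  = [f 0 cx; 0 f cy; 0 0 1];          % intrinsic matrix (zero skew, unit scale)
R  = eye(3);                           % extrinsic rotation: world to camera frame
t  = [0; 0; 5];                        % extrinsic translation (meters)

Xw = [0.5; 0.2; 1.0];                  % a 3D point in world coordinates
Xc = R * Xw + t;                       % transform into the camera frame
x  = K * Xc;                           % project along the line through the center
u  = x(1) / x(3);                      % homogeneous divide: pixel column
v  = x(2) / x(3);                      % homogeneous divide: pixel row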
5.2 Visual Hull

The earliest attempts at reconstructing 3D models from images used the silhouettes of objects as sources of shape information. A 2D silhouette is the set of closed contours that outline the projection of the object onto the image plane. Segmenting the silhouettes from the rest of the image and combining silhouettes taken from different views provides a Shape-From-Silhouette (SFS) reconstruction. The result of SFS construction is an upper bound on the real object's shape rather than a lower bound, which is a big advantage for obstacle avoidance in robotics or for visibility analysis in navigation. One advantage of the SFS technique is that silhouettes are easy to compute in simple situations, such as an indoor environment with static illumination and static cameras (without these assumptions it can be difficult to extract accurate silhouettes from the images, because of shadows or moving backgrounds). Another application of SFS estimation is the field of motion capture [94]. On the other hand, these techniques also have disadvantages. Usually the algorithms are slow, which is an issue for real-time applications. The silhouette calculations are relatively sensitive to noise, such as poor camera calibration, which makes the resulting 3D shapes inaccurate. Furthermore, the result of any SFS algorithm is just an approximation of the actual object's shape, especially if there are only a limited number of cameras; therefore, this approach is not practical for applications like detailed shape recognition or realistic shape reconstruction of objects [94].
Laurentini introduced the term Visual Hull in 1991 [92]. If the camera intrinsic and extrinsic parameters are known from calibration, then the visual hull of an object [100, 101, 103] can be computed by intersecting the visual cones corresponding to silhouettes captured from multiple views. The visual hull of a 3D object S is the maximal volume consistent with the silhouettes of S. A formal definition of the Visual Hull (VH) was first introduced by Laurentini [100] as follows:

"The visual hull V H(S,R) of an object S relative to a viewing region R is a region of E3 such that, for each point P ∈ V H(S,R) and each viewpoint V ∈ R, the half-line starting at V and passing through P contains at least a point of S." [100]

From this definition it is easy to see that S ⊆ V H(S,R). Directly building visual hulls by intersecting the visual cones is very difficult in practice, because the curved and irregular surfaces of objects lead to complex geometrical representations of their cones. Therefore, approximation methods are preferred: polyhedral surface-based approaches [101] and volume-based approaches [102] are normally used for this purpose. We adopt the latter approach for its efficiency. Algorithm 5.2 shows pseudocode for the approach.
1. Divide the 3D space of interest into N × N × N discrete voxels vn, n = 1, ..., N^3.

2. Initialize all the N^3 voxels as object voxels.

3. For n = 1 to N^3 {
— For k = 1 to K {
—— Project vn into the k-th image plane by the projection function Pk;
—— If the projected area Pk(vn) lies completely outside Sk, then classify vn as a non-object voxel;
— }
}

4. The visual hull VH is approximated by the union of all the object voxels.
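The following MATLAB sketch is one possible implementation of this voxel-carving procedure. It assumes per-view silhouette masks S{k} and calibrated projections Kmat{k}, R{k}, t{k} are already available (hypothetical variable names); for brevity, each voxel is tested only at its center rather than over its full projected area, which is a simplification of step 3.

N = 64;                                   % voxels per dimension
[gx, gy, gz] = ndgrid(linspace(0, 1, N)); % voxel centers in the space of interest
occupied = true(N, N, N);                 % step 2: all voxels start as object voxels

for k = 1:K                               % loop over the K calibrated views
    for n = 1:numel(occupied)
        if ~occupied(n), continue; end    % already carved away
        Xw = [gx(n); gy(n); gz(n)];
        x  = Kmat{k} * (R{k} * Xw + t{k});          % project voxel center into view k
        u  = round(x(1) / x(3));
        v  = round(x(2) / x(3));
        inside = u >= 1 && u <= size(S{k}, 2) && ...
                 v >= 1 && v <= size(S{k}, 1) && S{k}(v, u);
        if ~inside
            occupied(n) = false;          % carve: center falls outside silhouette k
        end
    end
end
% Step 4: the visual hull is approximated by the remaining object voxels.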
Another, more efficient way to calculate an approximation of the visual hull is a volume-based approach [96–99]. Even though this technique is easy and fast, it has a big disadvantage: the resulting shape is significantly larger than the true object shape, which makes it feasible only for applications in which an approximation suffices [94]. Modern approaches use surface-based representations instead of a volumetric representation of the scene, which allows regularization in an energy-minimization framework. These techniques are more robust to outliers and to erroneous camera calibration. Furthermore, they try to overcome the inability to reconstruct concavities, which do not affect the silhouettes, by additionally using stereo-based methods: photo-inconsistent voxels are repeatedly discarded, resulting in smoother reconstructions that aim at photo-consistency [95].

Figure 5.1: 2D example of the visual hull approximation algorithm. C1, C2, C3 are different views with corresponding silhouettes S1, S2, S3. The yellow area is the approximation of the visual hull; the area enclosed by black lines is the actual visual hull; and the blue shape in the center is the object.
5.3 Calibration and Visual Hull Reconstruction of Primates
5.3.1 Multiview Environment and Calibration
In order to determine the visual hull corresponding to a set of primate silhouettes, the cameras that produced the images must be calibrated. This means that the intrinsic camera parameters (such as focal length and principal point) and the pose must be (at least approximately) known, so camera calibration is another necessary step in building our 3D vision-assisted observation environment. We use the four cameras from different views as a quantitative sensor to recover 3D measurements of the observed scene from 2D images; for our study, with a calibrated camera we can measure, for example, how far a primate is from the camera or the height of the primate. Here we briefly introduce the calibration algorithm applied in our system and some specifications of the environment. The calibration algorithm we used is very similar to [?]; it estimates the intrinsic parameters, including focal length, principal point, skew coefficient, and distortions, and the extrinsic parameters, including rotations and translations.
5.3.2 3D Visual Hull Reconstruction of Primates
After calibration, we used the primate detection results to reconstruct the 3D
visual hulls of the primates in the pen. For each view, we have a detection log
that gives us the bounding boxes around primates; combining the detection results
and the foregrounds obtained from the background subtraction technique, we can
get a better estimate of the location and shape of primates in 2D. For each frame,
we created a binary image with primates as foreground and the rest as background,
in each view. Finally, we used these images to create the approximate 3D visual
hull of primates. Since we only have four cameras obtaining an accurate 3D visual
hull of the primates was not feasible, therefore; we decided to proceed with the
processing of videos in 2D, and to fuse the information we get from each view
separately at the end.
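The sketch below combines a simple background-difference foreground with the detected bounding boxes to produce one binary silhouette image, as described above. The threshold, file names, and box coordinates are illustrative assumptions, not our actual settings.

frame = double(rgb2gray(imread('view1_frame.png')));       % hypothetical file names
bg    = double(rgb2gray(imread('view1_background.png')));

fg = abs(frame - bg) > 25;               % foreground: deviation from the background
mask = false(size(fg));
boxes = [120 80 60 90; 300 200 70 85];   % detections, one [x y width height] per row
for i = 1:size(boxes, 1)
    x = boxes(i, 1);  y = boxes(i, 2);
    w = boxes(i, 3);  h = boxes(i, 4);
    mask(y:y+h, x:x+w) = true;           % keep foreground only inside detections
end
silhouette = fg & mask;                  % binary image: primates as foreground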
Chapter 6
Activity Recognition Based on Spatial Relation
6.1 Activity Recognition
Initial work on activity recognition involved extracting a large description from a video sequence. This could be a table of motion scale, rate, and position within a segmented figure [76], or a table of the presence of motion at each location [77]. Both of these techniques were able to distinguish some range of activities, but because each was a full description of a video sequence, rather than a set of features extracted from a sequence, it was difficult to use them as the building blocks of a more complicated system.
Another approach to activity recognition is the use of explicit models of the activities to be recognized. Domains where this approach has been applied include face and facial expression recognition [78] and human pose modeling [79]. These techniques can be very effective, but by their nature they cannot offer general models of the information in video in the way that less domain-specific features can. Furthermore, these types of methods require high-quality shots of the face with few occlusions, which is not available in many scenarios such as ours: the faces of the primates are dark, which makes it very hard to distinguish their facial expressions, and they are occluded in many frames.
Recent work in activity recognition has been largely based on local spatio-temporal
features. Many of these features seem to be inspired by the success of statistical
models of local features in object recognition. In both domains, features are first
detected by some interest point detector running over all locations at multiple
scales. Local maxima of the detector are taken to be the center of a local spatial
or spatio-temporal patch, which is extracted and summarized by some descrip-
tor. Most of the time, these features are then clustered and assigned to words
in a codebook, allowing the use of bag-of-words models from statistical natural
language processing. Since at this point our system relies on finding the identities and locations of primates in consecutive frames, recognizing spatio-temporal activities is the most natural direction to pursue.
6.2 Primate Activity Recognition
The task of primate activity recognition is to use the primate locations and identities given by the tracking output to detect interesting activities that we may want to explore or monitor. Some of these activities were mentioned in Table 1.1 in the first chapter. Technically speaking, however, not all of the activities can be detected or classified, even by human observers. For example, it is very hard for the camera to capture activities related to tiny features, such as the lips or teeth of the primates; these features are small and easily occluded. Some other activities are too hard or too complex to classify correctly, because many categories could be interpreted as one action. For example, "play" can include moving, jumping, wrestling, and grunting, which makes it hard to classify correctly. Therefore, we focus on activities that are not subject to interpretation, which we can classify ourselves without needing experts to validate our labels for the training data.
6.2.1 Velocity Measures
Fortunately, there are several interesting activities that are important and techni-
cally easy to detect and interpret. These activities include stationary, locomotion,
chasing and avoiding. All these activities can be defined only by the position
trajectories of the centers of the primates, which are available from the tracking outputs. Specifically, we assume there are two basic activities: stationary and moving. Moving includes self-moving and pairwise moving. We define self-moving as "locomotion," which involves only one primate. Pairwise moving is defined as activities that involve two primates moving simultaneously with a causal relationship between them. As there can be many interesting activities in the pairwise moving class, we only consider chasing and avoiding as examples in this work. Each of the interesting activities is defined by a few heuristics that we developed from observing the sample videos. In the following, we give a detailed illustration of these heuristic features.
1. Stationary: the velocity of a primate is smaller than a predefined threshold Th1 at all times.

2. Moving: the velocity of a primate is greater than the predefined threshold Th1 for a predefined number of frames.

3. Locomotion: the primate is "moving" but does not take part in any known pairwise activity. Figure 6.1 shows an example of locomotion.
4. Chasing: Suppose there are two primates, M1 and M2, whose position trajectories are denoted $\vec{p}_1$ and $\vec{p}_2$, with first derivatives (velocities) $\vec{v}_1$ and $\vec{v}_2$. Without loss of generality, assume M1 is chasing M2; then we have the following necessary conditions:
Figure 6.1: A sample image of locomotion activity. The primate shown with the red box is moving, but no other primate has motivated this movement.
$$F_1 : |\vec{v}_1| > Th_1$$
$$F_2 : |\vec{v}_2| > Th_1$$
$$F_3 : \arccos\left(\frac{\vec{v}_1 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_1| \, |\vec{p}_2 - \vec{p}_1|}\right) < Th_2$$
$$F_4 : |\vec{p}_2 - \vec{p}_1| < Th_3$$
where $F_i$ denotes the $i$-th heuristic feature computed to determine the chasing activity. The intuitions behind these heuristic constraints are straightforward. The first two conditions ensure that both primates are moving. The third indicates that the chasing primate is trying to get close to the chased one. Finally, the last condition constrains the two primates to be not too far from each other, so that the distance between them does not grow much as one follows the other. Figure 6.2 shows an example of chasing.
Figure 6.2: This series of images, from top right to bottom left, shows the chasing and avoiding activities happening between the two primates marked with red circles.
5. Avoiding: The avoiding heuristics can be defined similarly to chasing. Again, if we assume that primate M1 is chasing primate M2, then primate M2 is avoiding primate M1. Avoiding can be described by the following conditions:
$$F_1 : |\vec{v}_1| > Th_1$$
$$F_2 : |\vec{v}_2| > Th_1$$
$$F_3 : \arccos\left(\frac{\vec{v}_2 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_2| \, |\vec{p}_2 - \vec{p}_1|}\right) > Th_2$$
Similar to chasing, the first two conditions ensure that both primates are moving. The third indicates that the avoiding primate is trying to get farther from the chasing one. Figure 6.3 shows an example of avoiding; a sketch of how these heuristic features can be computed from the trajectories follows it.
Figure 6.3: This series of images, from top right to bottom left, shows the avoiding activity for the primate marked with the red circle. Note that in this case the activity is not a result of chasing.
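As an illustration, the MATLAB sketch below computes the four heuristic features for a pair of primates from their tracked trajectories. The variable names and the evaluation at a single frame are illustrative simplifications; p1 and p2 are assumed to be T-by-2 matrices of tracked centers from the tracker output.

v1 = diff(p1);                         % per-frame velocity of M1
v2 = diff(p2);                         % per-frame velocity of M2
t  = size(v1, 1);                      % evaluate the features at the last frame
d  = p2(t, :) - p1(t, :);              % displacement from M1 toward M2

F1 = norm(v1(t, :));                                         % speed of M1
F2 = norm(v2(t, :));                                         % speed of M2
F3 = acos(dot(v1(t, :), d) / (norm(v1(t, :)) * norm(d)));    % heading angle of M1
F4 = norm(d);                                                % distance between them
% For the avoiding test, the analogous angle is computed with v2:
F3avoid = acos(dot(v2(t, :), d) / (norm(v2(t, :)) * norm(d)));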
With the heuristic features above, the primate activity recognition problem is equivalent to a binary decision tree. For each primate, we calculated the values of all the heuristic features on the training set and labeled them with the different activities. We then fed this information to a binary decision tree in MATLAB to find the optimal cut point for each threshold. To avoid over-fitting, we used k-fold cross-validation with k = 10. Figure 6.4 shows the decision tree we built for our activity classification.
If F1 < Th1, then primate M1 is stationary.
— else if F2 < Th1, then primate M1 is locomotive.
—— else if F3 > Th2, then primate M1 is avoiding primate M2.
——— else if F4 < Th3, then primate M1 is chasing primate M2.
———— else primate M1 is locomotive.
Figure 6.4: The decision tree used to evaluate our test set. The leaf nodes show the decision made based on the feature values; the cut points learned were Th1 = 9.3, Th2 = 0.86, and Th3 = 318.
The algorithm above shows the decision process for primate M1. It answers the question, "What is the activity of a given primate, M1, with a given set of features Fi?"
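The decision rule above can also be written as a small MATLAB function; a minimal sketch follows. The cut points are those reported in Figure 6.4, while the function name and label strings are illustrative assumptions.

function label = classifyActivity(F1, F2, F3, F4)
% Classify one primate's activity from the heuristic features,
% using the cut points learned by the decision tree (Figure 6.4).
Th1 = 9.3;    % velocity threshold
Th2 = 0.86;   % heading-angle threshold (radians)
Th3 = 318;    % distance threshold (pixels)
if F1 < Th1
    label = 'stationary';
elseif F2 < Th1
    label = 'locomotion';   % M1 moves, but its partner does not
elseif F3 > Th2
    label = 'avoiding';     % heading away from the other primate
elseif F4 < Th3
    label = 'chasing';      % heading toward the other, and close enough
else
    label = 'locomotion';   % moving, but no pairwise relation holds
end
end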
Chapter 7
Experimental Results
7.1 Experiments
For our experiments, we used a 2.39 GHz (2-processor) CPU with 48 GB of RAM. We ran all of our experiments in MATLAB; the detection algorithm was implemented in C++ with OpenCV and called from MATLAB through MEX files.
There are several hours of data recorded from four cameras in the primates' pen. However, these data are not annotated, and for our experiments we had to label them manually. We had to create training sets for each view, as well as test sets on which to run our algorithm and compare its results against the manual labels. Since labeling primates is very time consuming and we are not experts in recognizing all activities, for our
test set we used two different data sets: one with the first group of primates and the other with the second group. In each of these test sets, we focused on activities related to the relative positions of the primates, as explained in the previous chapter.
The two data sets are named 20121026 (video 1) and 20130619 (video 2). Data set 20121026 is a video of 400 frames in which six primates are observed; data set 20130619 is a video of 700 frames in which four primates are observed. The second group of primates (the group of four) was generally much less hostile than the first group (the group of six), and most of the time they were sitting around. We looked for portions of video that contained the full number of primates and chose portions in which the primates were moving and engaged in interesting activities. Initially, we annotated primates from each of the four views and tested our detection algorithm on all four views; however, as we will see in the "Detection" section, views 3 and 4 do not carry much extra information beyond the combination of views 1 and 2, and furthermore, because of the structure of the pen and the benches, the primates occluded each other in many frames, which we had to discard. Therefore, for our tracking and activity recognition algorithms we focused on view 1 and view 2. Figure 7.1 shows a sample image frame from the four views.
Figure 7.1: Sample image frame from four views.
7.2 2D Primate Detection
The challenge of detection comes from multiple factors. First, due to the setup of the environment, the illumination varies across locations and may also change over time, so we cannot simply rely on background subtraction or on illumination-sensitive features. Second, although the primates wear collars of different colors, the collars are easily occluded when the animals move, and become indistinguishable when the illumination is low. The main challenge in detecting primates with HOG features is the variable shape of the primate body. The reason
that HOG can successfully detect pedestrians, for instance, is that the contours
of all standing human beings look similar. The ratio between width and height is
almost constant. However, the contour of a crouching monkey is quite different
from that of a jumping one.
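For illustration, the sketch below shows HOG-plus-linear-SVM training in MATLAB (using functions from the Computer Vision and Statistics Toolboxes) rather than our actual OpenCV/C++ implementation; the window size, file lists, and test window are assumptions.

winSize = [128 128];                      % assumed detection window (pixels)
X = [];  Y = [];
for i = 1:numel(posFiles)                 % posFiles: cropped primate samples (assumed)
    I = imresize(imread(posFiles{i}), winSize);
    X = [X; extractHOGFeatures(I)];  Y = [Y; 1];
end
for i = 1:numel(negFiles)                 % negFiles: background crops (assumed)
    I = imresize(imread(negFiles{i}), winSize);
    X = [X; extractHOGFeatures(I)];  Y = [Y; 0];
end
model = fitcsvm(X, Y, 'KernelFunction', 'linear');   % linear SVM on HOG features

% Score one sliding-window crop from a test frame
feat = extractHOGFeatures(imresize(window, winSize));
[label, score] = predict(model, feat);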
For each view, we trained a separate detector, using about 5000 positive training samples (primates) and 2000 negative samples (non-primates) per view. We used the two test videos mentioned above to evaluate the detectors' performance. The results are shown in Tables 7.1 and 7.2, where TP stands for true positive, FP for false positive, and FN for false negative. The PR curve in Figure 7.3 shows the relation between precision and recall as the SVM threshold is varied. From Figure 7.3, we can see that view 2 and view 4 perform better than view 3 and view 1. This is reasonable: in view 2 and view 4 the background is simpler and the primates are usually separated. In view 1, the background is strongly cluttered, so there are many false positives. In view 3, the primates on the benches often occlude each other and the illumination is low in the floor area, so it is difficult to locate primates and many false negatives occur. Figure 7.2 is a good illustration of these points.
Figure 7.2: Primate detection in 2D. In column one, green boxes are the ground truth and red boxes are the detection results. Column two shows the silhouettes extracted by background subtraction over the detected bounding boxes.
Table 7.1: 2D primate detection results from 4 views, video 1