AUTOMATIC RECOGNITION OF PRIMATE BEHAVIORS AND SOCIAL
INTERACTIONS FROM VIDEOS
A Dissertation Presented
By
Nastaran Ghadar
to
The Department of Electrical & Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
May 2015
NORTHEASTERN UNIVERSITY
College of Engineering
Department of Electrical and Computer Engineering

Abstract
Doctor of Philosophy
Automatic Recognition of Primate Behaviors and Social Interactions

List of Figures

2.2 Environment setup and lens installation.
2.3 Primate research: a group of four primates viewed from different cameras in the pen, with a different setting than Figure 2.1.
2.4 A sample image in NorPix software from different . . .
3.1 An example of a static background subtraction algorithm; image taken from [11].
3.2 Background normalization using the static background image.
5.1 2D example of the visual hull approximation algorithm. C1, C2, C3 are different views with corresponding silhouettes S1, S2, S3. The yellow area is the approximation of the visual hull; the area enclosed by black lines is the actual visual hull; and the blue shape in the center is the object.
6.1 A sample image of locomotion activity. The primate shown with the red box is moving, but no other primate has motivated this movement.
6.2 This series of images, from top right to bottom left, shows the chasing and avoiding activities happening between the two primates marked with red circles.
6.3 This series of images, from top right to bottom left, shows the avoiding activity for the primate marked with the red circle. Note that this activity is not a result of chasing in this case.
6.4 The decision tree used to evaluate our test set. The leaf nodes show the decision made based on the feature values.
rapid opening and closing of mouth and lips), and affiliative behavior [7], where
certain behaviors may co-occur (e.g., animals can lipsmack and move). The main
advantage of video recordings is that they eliminate human intrusion and can replace direct observation, at the cost of having to review multiple perspectives and long videotapes. To monitor eating and drinking behaviors, there are mechanisms to automatically log the time and quantity of intake, but so far no automatic solution exists for evaluating the behaviors of individuals in groups. While observational methodologies have not undergone major changes, the ways the data are interpreted have evolved with statistical methods: beyond statistics on the duration, frequency, and latency of behaviors, analyses now often include context in terms of the preceding behavior. Sociograms provide another perspective on social behavior and relationships, representing associations between individuals with lines whose
thickness depends on the strength of association [8]. In this research our interest
is in some of the specific behaviors presented in Table 1.1, where we utilize the
focal observation method to annotate the recordings and create a data set to train
and validate our models. Specifically, we focus on the types of behaviors that can be interpreted from the animals' relative positions.
Label         Comment
Aggression    rough behavior or biting
Chase         pursuit
Displace      subject leaves when approached
Explore       inspects objects other than food
Fear grimace  subject bares teeth
Forage        searching, presumably for food
Freeze        subject is inactive; may move eyes
Groom         with hands or mouth
Lipsmack      rapid movement of lips
Locomotion    motion of entire body
Play          grunting, wrestling, jumping, etc.
Stationary    immobile, moving head or arm
Threat        scream, lunge, ground beating, etc.
Vigilant      subject scans environment with eyes
Table 1.1: Example macaque behaviors often encoded in human observations.
To better understand some of these behaviors, we discuss them separately.
Dominance: Rhesus macaques naturally live in social groups, and over time they establish a linear "dominance hierarchy." This hierarchy may change over time and depends on many factors (age, sex, aggression, perhaps intelligence); it can also depend on the support of other primates in the group. As the term suggests, primates with a higher rank in the hierarchy tend to be more dominant, i.e., they displace lower-ranked individuals from resources (mates, space, food). They also tend to have higher reproductive success, either by mating more often or by having more resources to invest in their offspring. Rank is established through play, interactions, and affiliative interactions (and, somewhat tautologically, that is exactly how it is measured). Interestingly, the maintenance of social position and the social knowledge of one's rank is one of the proposed explanations for why humans evolved large brains.
Grooming: One of the most common activities among primates is grooming.
Grooming other primates is an important mechanism that shows their affection
for each other. There are several reasons why a primate might groom another: subordinate animals tend to groom more dominant ones; males groom females for sexual purposes; and mothers groom infants for the practical purpose of keeping their fur clean. What is certain is that grooming strengthens the bonds between individuals and holds the primate social structure together.
Communication: This includes scents, body postures, gestures, and vocaliza-
tions. Some of these appear to be autonomic responses indicating emotional states:
fear, excitement, confidence, anger. Others seem to have a more specific purpose:
loud ranging calls in indri, howler monkeys and gibbons; quiet contact calls in
lemurs to keep the group together; fear calls in lost infants, or on spotting preda-
tors. From our human perspective, we often find it easier to associate sounds with
specific meaning, but among non-human primates, gestures and actions are often
used. Presentation and mounting behaviors are often used to defuse potentially
aggressive situations. Yawns exposing teeth are often threats, as is direct eye
contact. Facial expression is important too. This is most obvious in chimpanzees, whose expressions often appear all too human-like, but other primates also use stereotyped eyelid flashes or lip smacks.
Aggressive and affiliative behavior: As mentioned before, many behaviors
exist to keep the group structure running smoothly for the members of the group.
There are occasions though when these behaviors (especially aggression) are di-
rected outside the group.
Distance-related behaviors: These include locomotion (running, jumping, walking, and climbing) and specific aspects of foraging behavior.
1.3 Background and Related Work
In this section, we provide a selected review of closely related work. In the com-
puter vision community, many studies employ videos of animals as standard data
sets to develop new algorithms, especially for tracking or behavior recognition.
Most of the presented methodologies on animal analysis are conducted in highly
controlled environments, for instance, with a static camera, in a well-defined loca-
tion, with static background, and with no environmental factors interfering, such
as occlusion, different illumination conditions, and interfering objects [9, 10]. One
common scenario for a controlled environment would be monitoring applications,
where there is a static background and a static camera [12]. This setting makes it
straightforward to learn the static background and easily obtain the foreground by
looking for deviations from the background. More sophisticated techniques have
also been introduced. Khan et al. [13] developed a system that can automatically
generate the three dimensional trajectory of primates in an outdoor environment.
Their purpose is to evaluate the navigational abilities of non-human primates.
Their system extracts primate kinematic features such as path length, speed, and
other variables impossible for an unaided observer to note. From trajectories, they
computed and validated a path length measurement and proposed a method for
automatic behavior detection. Also, their system is used to examine the gender
differences in spatial navigation of rhesus primates. They set the environment
in a way to avoid occlusion, i.e. an open environment with minimal perturba-
tions, and they did not analyze the social interactions between primates, but put
their focus on individual actions. Chaumont et al. [14] proposed a computerized
method and a software called Mice Profiler, that uses geometrical primitives to
model and track social interactions in mice. Their system monitors a comprehen-
sive repertoire of behavioral states and temporal evolution, which is utilized for
identifying the key events that trigger social contact. Balch et al. [15] proposed an
automated labeling system to study social insect behaviors. Their ultimate goal
is to automatically create executable models of animal behavior. An algorithm
proposed by Burghardt and Calic [17] detects animal faces using Haar features
and then tracks animals; such algorithms would not work for animals whose faces
are not visible or hard to track. Other approaches [18, 19] have the user mark
or extract the location of the animal by hand. This, of course, is extraordinarily
time-consuming. Khorami et al. [10] proposed an approach that detects multiple types of animals in an entirely unsupervised manner. Walther
et al. [20] apply saliency maps to the multi-agent tracking of low-contrast,
translucent targets in underwater footage. Haering et al. [21] use neural network
algorithms to detect high-level events, such as hunts, by classifying and tracking
moving object blobs. Tweed and Calway [22] proposed an approach that achieves
multiple object tracking by developing a periodic model of animal motion and
exploiting conditional density propagation to track flocks of birds. Ramanan and
Forsyth [23] proposed an interesting method, where they use low-level detectors
and a mean shift construct to create an appearance model for the animal and
use it to detect the animal in future frames. Their method takes into account
temporal coherency when building appearance models of animals. While they
present very good results in their paper, they only deal with three different animal
species and with cases that have no occlusion. Everingham et al. [24] proposed an
approach that combines a minimal manually labeled set with an object tracking
technique to gradually improve the detection model; however, they only deal with
human faces. Gibson et al. [25] and Hannuna et al. [26] try to address the issue
of animal behavior classification by detecting and classifying animal gait by ap-
plying statistical analysis to sparse motion information extracted from wildlife footage. Burghardt et al. [17] present an algorithm that tracks animal faces in wildlife rushes and populates a database [27] with appropriate semantics defining their basic locomotive behavior. Their detection algorithm is an adapted version of a human face detection method that exploits Haar-like features and the AdaBoost classification algorithm [28]; for tracking, it uses the Kanade-Lucas-Tomasi method, fusing it
with a specific interest model applied to the detected face region. They achieved
reliable detection and temporally smooth tracking of animal faces. Furthermore,
the tracking information is exploited to classify locomotive behavior of the tracked
animal, e.g., a lion walking left or trotting towards the camera. Finally, the extracted
metadata about the presence of the animal, together with its locomotive behav-
ior, creates a strong prior in the process of learning animal models as well as in
extracting the additional semantic information about the animal’s behavior and
environment. The presented algorithm is a part of a large content-based retrieval
system [29] within the ICBR project that focuses on the computer vision research
challenges in the domain of wildlife documentary production. This algorithm is
close to what we are presenting in this project.
1.4 Description of Framework of Dissertation
In this dissertation, we developed a general framework for detecting, localizing,
tracking, and reconstructing images of social animals in a 3D observation environ-
ment. Finally, using these results, we were able to extract elementary behaviors
from videos.
As evident from the cited literature, the necessary components have developed
sufficiently in recent years to allow computational scientists to undertake the chal-
lenge of creating a framework for modeling and recognizing behavior of individuals
in their social groups. The structure of this dissertation is as follows:
1. Recording behaviors with multi-channel audio and video data: In
Chapter 2, I will discuss the details of data collection and how we acquired
our data for our experiments.
2. Detecting individual primates in the pen: In Chapter 3, I will start
with the definition of object detection. There are several algorithms currently
available in the literature for object detection, and each has its advantages
and disadvantages. After introducing these algorithms and discussing where
they work best, I define the framework of our detection algorithm and why
we chose the proposed methods.
3. Tracking individuals over time: In Chapter 4, I will introduce some of
the most common algorithms for object tracking and when we would expect
to get a good performance out of them. Finally I will discuss the details of
our tracking algorithm.
4. Calibration and 3D visual hull reconstruction of primates: In Chap-
ter 5, I will explain the details necessary for us to obtain a 3D silhouette of
the primates in the pen and decide whether having a 3D system is helpful
or not.
5. Recognizing individual behaviors: In Chapter 6, I will discuss the activ-
ities we are interested in. After that, I will describe an algorithm to recognize
them.
6. Experimental results: In Chapter 7, I will present the results of each
section separately and discuss them.
7. Conclusion, discussion, and future work: In Chapter 8, I will discuss
the pros and cons of our algorithm and how one can improve it in terms of
efficiency and performance.
Chapter 2
Data Collection and Preparation
2.1 Acknowledgement
The data collection described in this chapter was carried out entirely by our collaborators at OHSU. All the data was collected by the OHSU team, which was led by Dr. Shafran. I would like to acknowledge Alireza Bayesteh Tashk, Guillaume Thibault, and Meysam Asgari for the grunt work they did over two years of collecting the data. I would also like to acknowledge Dr. Kristine Coleman, Nicola Robertson, and Megan McClintik for conducting the animal studies, and Dr. Kathy Grant for her input in the process.
2.2 Experimental Setup
Overall, five groups of animals were observed; each group consisted of 4 or 6 rhesus macaques held in a pen (approximately 12 ft long x 7 ft deep x 7 ft high) at the Oregon National Primate Research Center (ONPRC), using a protocol approved by OHSU's Institutional Review Board.
Individuals from isolated cages were put into the pen, and their behavioral activities were recorded for two days from about 7am to 7pm; there was no recording when the lights were off. After a week, their behavior was recorded again for two days. By this time they had established a dominance hierarchy, i.e., a stable phase. Two more two-day sessions were recorded to observe the effect of an escalating series of perturbations, i.e., a perturbed phase. The major perturbations applied were: 1) Human Impostor (introducing an unfamiliar human near the cage or pen for 15 minutes), 2) Resource Competition (modulating certain resources, for instance preferred resting areas, toys, and treats), and 3) Social Instability (removal of the most dominant individual for the entire last week). These perturbations created the chance to observe the interactions that establish social dominance hierarchies.
Figure 2.1: Primate research: a group of four primates viewed from different cameras in the pen.
2.3 Recording Behaviors with Multi-channel Audio and Video Data
Automating recognition of behaviors requires capturing all the information rele-
vant for detecting individuals in the pen, tracking their movement over time and
recognizing their vocalizations.
In the video domain, to avoid occlusions and to maximize coverage of the entire volume of the pen, we recorded behaviors using cameras from multiple perspectives: three cameras (GC1380CH, 2/3" CCD) with wide-aperture lenses (Optron 5mm f/2.8) on three corners of the pen, and a fourth camera (GC1380CH, 2/3" CCD) with a wide-angle fisheye lens (Edmund Optics NT62-274, focal length 1.8mm, F1.4, 185 x 185 degrees) on top of the pen. Ideally, the pen should be uniformly illuminated to avoid blotchy over-exposed and dark under-exposed regions in the image, but this is very difficult to achieve. We minimized the illumination variation by relying on several overhead incandescent tube lights, supplemented by a light box mounted at floor level. The lights were programmed to switch off during the night hours, about 7pm to 7am. Figure 2.2 shows the camera setup, and Figures 2.3 and 2.1 show a typical camera frame from four views for the two groups of primates.
Additionally, to simplify the task of identifying the individuals in the video recordings, we color-coded the collars on the monkeys in each group. Collars were powder-coated with one of six colors: purple, green, orange, blue, red, and yellow for the group of six monkeys, and green, yellow, black, and red for the group of four monkeys.
High-level synchronization of frames from the four cameras was obtained by triggering the cameras to capture each frame with a common trigger signal (National Instruments pulse generation module). The trigger signals were controlled and programmed with high-level software, StreamPix 5 from NorPix, on a dedicated data collection workstation.
The measurement matrix $C$ maps the state estimate $Q$ to the expected new measurement:

$$C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

Predict the next state of the primate from the last state and the predicted motion:

$$Q_{\mathrm{estimate}} = A \, Q_{\mathrm{estimate}} + B \, u$$

Predict the next covariance:

$$P = A P A^{\mathsf{T}} + E_x$$

Compute the Kalman gain from the predicted measurement covariance $C P C^{\mathsf{T}} + E_z$:

$$K = P C^{\mathsf{T}} \left( C P C^{\mathsf{T}} + E_z \right)^{-1}$$

Update the state estimate:

$$Q_{\mathrm{estimate}} = Q_{\mathrm{estimate}} + K \left( Q_{\mathrm{measurement}} - C \, Q_{\mathrm{estimate}} \right)$$
Using this method, the tracking accuracy improves. The advantage of this algorithm is that when a primate leaves the scene and comes back, it can be re-acquired using the color information from the primate's collar. Without this information, using only nearest-neighbor correspondence or a Kalman filter, it is very hard to accurately track an object that leaves the scene and comes back at a different location with a different direction of movement. A minimal sketch of the Kalman cycle above is given below.
Chapter 5
Calibration and 3D Reconstruction
5.1 Camera Calibration
For more than a decade, researchers in computer vision have been interested in digitizing time-varying events, recorded by video cameras from multiple viewpoints, into 3D scenes. Usually the events in the videos are human activities, and the ultimate goal is to let the observer view the event from any arbitrary viewpoint; this is called free-viewpoint video. Some of the applications of converting a scene into 3D models are: 1) 3D tele-immersion, 2) digitizing rare cultural performances, 3) sports action, and 4) generating content for 3D
video-based realistic training and demos for surgery, medicine and other technical
fields.
Currently, in all multi-camera systems [84–90], calibration and synchronization must be done in an offline calibration stage before the actual video is captured: a person has to go to the scene with a calibration object, such as a planar calibration grid or a point LED, and shots of the calibration object are taken from different angles. This offline step makes calibration cumbersome; if the cameras move and calibration is needed more than once, the task has to be repeated every time.
5.1.1 Explicit Camera Calibration
Physical camera parameters are commonly divided into extrinsic and intrinsic pa-
rameters. Extrinsic parameters are needed to transform object coordinates to a
camera centered coordinate frame. In multi-camera systems, the extrinsic pa-
rameters also describe the relationship between the cameras. The pinhole camera
model is based on the principle of co-linearity, where each point in the object space
is projected by a straight line through the projection center into the image plane.
The intrinsic camera parameters include the effective focal length, the scale factor, and the image center. This information is usually provided by the camera manufacturer.
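As a concrete illustration of the pinhole model and the intrinsic/extrinsic split, the following MATLAB sketch projects a single 3D world point into pixel coordinates. All numeric values are assumed for illustration and are not our calibrated parameters.

f  = 1800;  cx = 690;  cy = 520;       % focal length and principal point (pixels)
K  = [f 0 cx; 0 f cy; 0 0 1];          % intrinsic matrix (zero skew, unit scale)
R  = eye(3);                           % extrinsic rotation: world to camera frame
t  = [0; 0; 5];                        % extrinsic translation (meters)

Xw = [0.5; 0.2; 1.0];                  % a 3D point in world coordinates
Xc = R * Xw + t;                       % transform into the camera frame
x  = K * Xc;                           % project along the line through the center
u  = x(1) / x(3);                      % homogeneous divide: pixel column
v  = x(2) / x(3);                      % homogeneous divide: pixel row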
5.2 Visual Hull

The earliest attempts at reconstructing 3D models from images used the silhouettes of objects as sources of shape information. A 2D silhouette is the set of closed contours that outline the projection of the object onto the image plane. Segmenting the silhouettes from the rest of the image and combining silhouettes taken from different views provides a Shape-From-Silhouette (SFS) reconstruction. The result of SFS construction is an upper bound on the real object's shape rather than a lower bound, which is a big advantage for obstacle avoidance in robotics or for visibility analysis in navigation. One advantage of the SFS technique is that silhouettes are easy to compute in simple situations, such as an indoor environment with static illumination and static cameras (without these assumptions it can be difficult to extract accurate silhouettes from the images, because of shadows or moving backgrounds). Another application of SFS estimation is the field of motion capture [94]. On the other hand, these techniques also have disadvantages. Usually the algorithms are slow, which is an issue for real-time applications. The silhouette calculations are relatively sensitive to noise, such as poor camera calibration, which makes the resulting 3D shapes inaccurate. Furthermore, the result of any SFS algorithm is just an approximation of the actual object's shape, especially if there are only a limited number of cameras; therefore, this approach is not practical for applications like detailed shape recognition or realistic shape reconstruction of objects [94].
Laurentini introduced the term Visual Hull in 1991 [92]. If the camera intrinsic and extrinsic parameters are known from calibration, then the visual hull of an object [100, 101, 103] can be computed by intersecting the visual cones corresponding to silhouettes captured from multiple views. The visual hull of a 3D object S is the maximal volume consistent with the silhouettes of S. A formal definition of the Visual Hull (VH) was first introduced by Laurentini [100] as follows:

"The visual hull V H(S,R) of an object S relative to a viewing region R is a region of E3 such that, for each point P ∈ V H(S,R) and each viewpoint V ∈ R, the half-line starting at V and passing through P contains at least a point of S." [100]

From this definition it is easy to see that S ⊆ V H(S,R). Directly building visual hulls by intersecting the visual cones is very difficult in practice, because the curved and irregular surfaces of objects lead to complex geometrical representations of their cones. Therefore, approximation methods are preferred: polyhedral surface-based approaches [101] and volume-based approaches [102] are normally used for this purpose. We adopt the latter approach for its efficiency. Algorithm 5.2 shows pseudocode for the approach.
1. Divide the 3D space of interest into N × N × N discrete voxels vn, n = 1, ..., N^3.

2. Initialize all the N^3 voxels as object voxels.

3. For n = 1 to N^3 {
— For k = 1 to K {
—— Project vn into the k-th image plane by the projection function Pk;
—— If the projected area Pk(vn) lies completely outside Sk, then classify vn as a non-object voxel;
— }
}

4. The visual hull VH is approximated by the union of all the object voxels.
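The following MATLAB sketch is one possible implementation of this voxel-carving procedure. It assumes per-view silhouette masks S{k} and calibrated projections Kmat{k}, R{k}, t{k} are already available (hypothetical variable names); for brevity, each voxel is tested only at its center rather than over its full projected area, which is a simplification of step 3.

N = 64;                                   % voxels per dimension
[gx, gy, gz] = ndgrid(linspace(0, 1, N)); % voxel centers in the space of interest
occupied = true(N, N, N);                 % step 2: all voxels start as object voxels

for k = 1:K                               % loop over the K calibrated views
    for n = 1:numel(occupied)
        if ~occupied(n), continue; end    % already carved away
        Xw = [gx(n); gy(n); gz(n)];
        x  = Kmat{k} * (R{k} * Xw + t{k});          % project voxel center into view k
        u  = round(x(1) / x(3));
        v  = round(x(2) / x(3));
        inside = u >= 1 && u <= size(S{k}, 2) && ...
                 v >= 1 && v <= size(S{k}, 1) && S{k}(v, u);
        if ~inside
            occupied(n) = false;          % carve: center falls outside silhouette k
        end
    end
end
% Step 4: the visual hull is approximated by the remaining object voxels.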
Another, more efficient way to calculate an approximation of the visual hull is a volume-based approach [96–99]. Even though this technique is easy and fast, it has a big disadvantage: the resulting shape is significantly larger than the true object shape, which makes it feasible only for applications in which an approximation suffices [94]. Modern approaches use surface-based representations instead of a volumetric representation of the scene, which allows regularization in an energy-minimization framework. These techniques are more robust to outliers and to erroneous camera calibration. Furthermore, they try to overcome the inability to reconstruct concavities, which do not affect the silhouettes, by additionally using stereo-based methods: photo-inconsistent voxels are repeatedly discarded, resulting in smoother reconstructions that aim at photo-consistency [95].

Figure 5.1: 2D example of the visual hull approximation algorithm. C1, C2, C3 are different views with corresponding silhouettes S1, S2, S3. The yellow area is the approximation of the visual hull; the area enclosed by black lines is the actual visual hull; and the blue shape in the center is the object.
5.3 Calibration and Visual Hull Reconstruction of Primates
5.3.1 Multiview Environment and Calibration
In order to determine the visual hull corresponding to a set of primate silhouettes, the cameras that produced the images must be calibrated. This means that the intrinsic camera parameters (such as focal length and principal point) and the pose must be (at least approximately) known, so camera calibration is another necessary step in building our 3D vision-assisted observation environment. We use the four cameras from different views as a quantitative sensor to recover 3D measurements of the observed scene from 2D images; for our study, with a calibrated camera we can measure, for example, how far a primate is from the camera or the height of the primate. Here we briefly introduce the calibration algorithm applied in our system and some specifications of the environment. The calibration algorithm we used is very similar to [?]; it estimates the intrinsic parameters, including focal length, principal point, skew coefficient, and distortions, and the extrinsic parameters, including rotations and translations.
5.3.2 3D Visual Hull Reconstruction of Primates
After calibration, we used the primate detection results to reconstruct the 3D
visual hulls of the primates in the pen. For each view, we have a detection log
that gives us the bounding boxes around primates; combining the detection results
and the foregrounds obtained from the background subtraction technique, we can
get a better estimate of the location and shape of primates in 2D. For each frame,
we created a binary image with primates as foreground and the rest as background,
in each view. Finally, we used these images to create the approximate 3D visual
hull of primates. Since we only have four cameras obtaining an accurate 3D visual
hull of the primates was not feasible, therefore; we decided to proceed with the
processing of videos in 2D, and to fuse the information we get from each view
separately at the end.
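The sketch below combines a simple background-difference foreground with the detected bounding boxes to produce one binary silhouette image, as described above. The threshold, file names, and box coordinates are illustrative assumptions, not our actual settings.

frame = double(rgb2gray(imread('view1_frame.png')));       % hypothetical file names
bg    = double(rgb2gray(imread('view1_background.png')));

fg = abs(frame - bg) > 25;               % foreground: deviation from the background
mask = false(size(fg));
boxes = [120 80 60 90; 300 200 70 85];   % detections, one [x y width height] per row
for i = 1:size(boxes, 1)
    x = boxes(i, 1);  y = boxes(i, 2);
    w = boxes(i, 3);  h = boxes(i, 4);
    mask(y:y+h, x:x+w) = true;           % keep foreground only inside detections
end
silhouette = fg & mask;                  % binary image: primates as foreground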
Chapter 6
Activity Recognition Based on Spatial Relation
6.1 Activity Recognition
Initial work on activity recognition involved extracting a large description from a video sequence. This could be a table of motion scale, rate, and position within a segmented figure [76], or a table of the presence of motion at each location [77]. Both of these techniques were able to distinguish some range of activities, but because each was a full description of a video sequence, rather than a set of features extracted from a sequence, it was difficult to use them as the building blocks of a more complicated system.
Another approach to activity recognition is the use of explicit models of the activities to be recognized. Domains where this approach has been applied include face and facial expression recognition [78] and human pose modeling [79]. These techniques can be very effective, but by their nature they cannot offer general models of the information in video in the way that less domain-specific features can. Furthermore, these types of methods require high-quality shots of the face with few occlusions, which is not available in many scenarios such as ours: the faces of the primates are dark, which makes it very hard to distinguish their facial expressions, and they are occluded in many frames.
Recent work in activity recognition has been largely based on local spatio-temporal
features. Many of these features seem to be inspired by the success of statistical
models of local features in object recognition. In both domains, features are first
detected by some interest point detector running over all locations at multiple
scales. Local maxima of the detector are taken to be the center of a local spatial
or spatio-temporal patch, which is extracted and summarized by some descrip-
tor. Most of the time, these features are then clustered and assigned to words
in a codebook, allowing the use of bag-of-words models from statistical natural
language processing. Since at this point our system relies on finding the identities and locations of primates in consecutive frames, recognizing spatio-temporal activities is the most natural direction to pursue.
6.2 Primate Activity Recognition
The task of primate activity recognition is to use the primate locations and identities given by the tracking output to detect interesting activities that we may want to explore or monitor. Some of these activities were mentioned in Table 1.1 in the first chapter. Technically speaking, however, not all of the activities can be detected or classified, even by human observers. For example, it is very hard for the camera to capture activities related to tiny features, such as the lips or teeth of the primates; these features are small and easily occluded. Some other activities are too hard or too complex to classify correctly, because many categories could be interpreted as one action. For example, "play" can include moving, jumping, wrestling, and grunting, which makes it hard to classify correctly. Therefore, we focus on activities that are not subject to interpretation, which we can classify ourselves without needing experts to validate our labels for the training data.
6.2.1 Velocity Measures
Fortunately, there are several interesting activities that are important and techni-
cally easy to detect and interpret. These activities include stationary, locomotion,
chasing and avoiding. All these activities can be defined only by the position
trajectories of the centers of the primates, which are available from the tracking outputs. Specifically, we assume there are two basic activities: stationary and moving. Moving includes self-moving and pairwise moving. We define self-moving as "locomotion," which involves only one primate. Pairwise moving is defined as activities that involve two primates moving simultaneously with a causal relationship between them. As there can be many interesting activities in the pairwise moving class, we only consider chasing and avoiding as examples in this work. Each of the interesting activities is defined by a few heuristics that we developed from observing the sample videos. In the following, we give a detailed illustration of these heuristic features.
1. Stationary: the velocity of a primate is smaller than a predefined threshold Th1 at all times.

2. Moving: the velocity of a primate is greater than the predefined threshold Th1 for a predefined number of frames.

3. Locomotion: the primate is "moving" but does not take part in any known pairwise activity. Figure 6.1 shows an example of locomotion.
4. Chasing: Suppose there are two primates, M1 and M2, whose position trajectories are denoted $\vec{p}_1$ and $\vec{p}_2$, with first derivatives (velocities) $\vec{v}_1$ and $\vec{v}_2$. Without loss of generality, assume M1 is chasing M2; then we have the following necessary conditions:
Figure 6.1: A sample image of locomotion activity. The primate shown with the red box is moving, but no other primate has motivated this movement.
$$F_1 : |\vec{v}_1| > Th_1$$
$$F_2 : |\vec{v}_2| > Th_1$$
$$F_3 : \arccos\left(\frac{\vec{v}_1 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_1| \, |\vec{p}_2 - \vec{p}_1|}\right) < Th_2$$
$$F_4 : |\vec{p}_2 - \vec{p}_1| < Th_3$$
where $F_i$ denotes the $i$-th heuristic feature computed to determine the chasing activity. The intuitions behind these heuristic constraints are straightforward. The first two conditions ensure that both primates are moving. The third indicates that the chasing primate is trying to get close to the chased one. Finally, the last condition constrains the two primates to be not too far from each other, so that the distance between them does not grow much as one follows the other. Figure 6.2 shows an example of chasing.
Figure 6.2: This series of images, from top right to bottom left, shows the chasing and avoiding activities happening between the two primates marked with red circles.
5. Avoiding: The avoiding heuristics can be defined similarly to chasing. Again, if we assume that primate M1 is chasing primate M2, then primate M2 is avoiding primate M1. Avoiding can be described by the following conditions:
$$F_1 : |\vec{v}_1| > Th_1$$
$$F_2 : |\vec{v}_2| > Th_1$$
$$F_3 : \arccos\left(\frac{\vec{v}_2 \cdot (\vec{p}_2 - \vec{p}_1)}{|\vec{v}_2| \, |\vec{p}_2 - \vec{p}_1|}\right) > Th_2$$
Similar to chasing, the first two conditions ensure that both primates are moving. The third indicates that the avoiding primate is trying to get farther from the chasing one. Figure 6.3 shows an example of avoiding; a sketch of how these heuristic features can be computed from the trajectories follows it.
Figure 6.3: This series of images, from top right to bottom left, shows the avoiding activity for the primate marked with the red circle. Note that in this case the activity is not a result of chasing.
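As an illustration, the MATLAB sketch below computes the four heuristic features for a pair of primates from their tracked trajectories. The variable names and the evaluation at a single frame are illustrative simplifications; p1 and p2 are assumed to be T-by-2 matrices of tracked centers from the tracker output.

v1 = diff(p1);                         % per-frame velocity of M1
v2 = diff(p2);                         % per-frame velocity of M2
t  = size(v1, 1);                      % evaluate the features at the last frame
d  = p2(t, :) - p1(t, :);              % displacement from M1 toward M2

F1 = norm(v1(t, :));                                         % speed of M1
F2 = norm(v2(t, :));                                         % speed of M2
F3 = acos(dot(v1(t, :), d) / (norm(v1(t, :)) * norm(d)));    % heading angle of M1
F4 = norm(d);                                                % distance between them
% For the avoiding test, the analogous angle is computed with v2:
F3avoid = acos(dot(v2(t, :), d) / (norm(v2(t, :)) * norm(d)));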
With the heuristic features above, the primate activity recognition problem is equivalent to a binary decision tree. For each primate, we calculated the values of all the heuristic features on the training set and labeled them with the different activities. We then fed this information to a binary decision tree in MATLAB to find the optimal cut point for each threshold. To avoid over-fitting, we used k-fold cross-validation with k = 10. Figure 6.4 shows the decision tree we built for our activity classification.
If F1 < Th1, then primate M1 is stationary.
— else if F2 < Th1, then primate M1 is locomotive.
—— else if F3 > Th2, then primate M1 is avoiding primate M2.
——— else if F4 < Th3, then primate M1 is chasing primate M2.
———— else primate M1 is locomotive.
Figure 6.4: The decision tree used to evaluate our test set. The leaf nodes show the decision made based on the feature values; the cut points learned were Th1 = 9.3, Th2 = 0.86, and Th3 = 318.
The algorithm above shows the decision process for primate M1. It answers the question, "What is the activity of a given primate, M1, with a given set of features Fi?"
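The decision rule above can also be written as a small MATLAB function; a minimal sketch follows. The cut points are those reported in Figure 6.4, while the function name and label strings are illustrative assumptions.

function label = classifyActivity(F1, F2, F3, F4)
% Classify one primate's activity from the heuristic features,
% using the cut points learned by the decision tree (Figure 6.4).
Th1 = 9.3;    % velocity threshold
Th2 = 0.86;   % heading-angle threshold (radians)
Th3 = 318;    % distance threshold (pixels)
if F1 < Th1
    label = 'stationary';
elseif F2 < Th1
    label = 'locomotion';   % M1 moves, but its partner does not
elseif F3 > Th2
    label = 'avoiding';     % heading away from the other primate
elseif F4 < Th3
    label = 'chasing';      % heading toward the other, and close enough
else
    label = 'locomotion';   % moving, but no pairwise relation holds
end
end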
Chapter 7
Experimental Results
7.1 Experiments
For our experiments, we used a 2.39 GHz (2-processor) CPU with 48 GB of RAM. We ran all of our experiments in MATLAB; the detection algorithm was implemented in C++ with OpenCV and called from MATLAB through MEX files.
There are several hours of data recorded from four cameras in the primates' pen. However, these data are not annotated, and for our experiments we had to label them manually. We had to create training sets for each view, as well as test sets on which to run our algorithm and compare its results against the manual labels. Since labeling primates is very time consuming and we are not experts in recognizing all activities, for our
test set we used two different data sets: one with the first group of primates and the other with the second group. In each of these test sets, we focused on activities related to the relative positions of the primates, as explained in the previous chapter.
The two data sets are named 20121026 (video 1) and 20130619 (video 2). Data set 20121026 is a video of 400 frames in which six primates are observed; data set 20130619 is a video of 700 frames in which four primates are observed. The second group of primates (the group of four) was generally much less hostile than the first group (the group of six), and most of the time they were sitting around. We looked for portions of video that contained the full number of primates and chose portions in which the primates were moving and engaged in interesting activities. Initially, we annotated primates from each of the four views and tested our detection algorithm on all four views; however, as we will see in the "Detection" section, views 3 and 4 do not carry much extra information beyond the combination of views 1 and 2, and furthermore, because of the structure of the pen and the benches, the primates occluded each other in many frames, which we had to discard. Therefore, for our tracking and activity recognition algorithms we focused on view 1 and view 2. Figure 7.1 shows a sample image frame from the four views.
Figure 7.1: Sample image frame from four views.
7.2 2D Primate Detection
The challenge of detection comes from multiple factors. First, due to the setup of the environment, the illumination varies across locations and may also change over time, so we cannot simply rely on background subtraction or on illumination-sensitive features. Second, although the primates wear collars of different colors, the collars are easily occluded when the animals move, and become indistinguishable when the illumination is low. The main challenge in detecting primates with HOG features is the variable shape of the primate body. The reason
that HOG can successfully detect pedestrians, for instance, is that the contours
of all standing human beings look similar. The ratio between width and height is
almost constant. However, the contour of a crouching monkey is quite different
from that of a jumping one.
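For illustration, the sketch below shows HOG-plus-linear-SVM training in MATLAB (using functions from the Computer Vision and Statistics Toolboxes) rather than our actual OpenCV/C++ implementation; the window size, file lists, and test window are assumptions.

winSize = [128 128];                      % assumed detection window (pixels)
X = [];  Y = [];
for i = 1:numel(posFiles)                 % posFiles: cropped primate samples (assumed)
    I = imresize(imread(posFiles{i}), winSize);
    X = [X; extractHOGFeatures(I)];  Y = [Y; 1];
end
for i = 1:numel(negFiles)                 % negFiles: background crops (assumed)
    I = imresize(imread(negFiles{i}), winSize);
    X = [X; extractHOGFeatures(I)];  Y = [Y; 0];
end
model = fitcsvm(X, Y, 'KernelFunction', 'linear');   % linear SVM on HOG features

% Score one sliding-window crop from a test frame
feat = extractHOGFeatures(imresize(window, winSize));
[label, score] = predict(model, feat);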
For each view, we trained a separate detector, using about 5000 positive training samples (primates) and 2000 negative samples (non-primates) per view. We used the two test videos mentioned above to evaluate the detectors' performance. The results are shown in Tables 7.1 and 7.2, where TP stands for true positive, FP for false positive, and FN for false negative. The PR curve in Figure 7.3 shows the relation between precision and recall as the SVM threshold is varied. From Figure 7.3, we can see that view 2 and view 4 perform better than view 3 and view 1. This is reasonable: in view 2 and view 4 the background is simpler and the primates are usually separated. In view 1, the background is strongly cluttered, so there are many false positives. In view 3, the primates on the benches often occlude each other and the illumination is low in the floor area, so it is difficult to locate primates and many false negatives occur. Figure 7.2 is a good illustration of these points.
Figure 7.2: Primate detection in 2D. In column one, green boxes are the ground truth and red boxes are the detection results. Column two shows the silhouettes extracted by background subtraction over the detected bounding boxes.
Table 7.1: 2D primate detection results from 4 views, video 1