View-invariant action recognition based on Artificial Neural Networks

Alexandros Iosifidis, Anastasios Tefas, Member, IEEE, and Ioannis Pitas, Fellow, IEEE
Abstract—In this paper, a novel view-invariant action recognition method based on neural network representation and recognition is proposed. The novel representation of action videos is based on learning spatially related human body posture prototypes using Self Organizing Maps (SOM). Fuzzy distances from human body posture prototypes are used to produce a time-invariant action representation. Multilayer perceptrons are used for action classification. The algorithm is trained using data from a multi-camera setup. An arbitrary number of cameras can be used in order to recognize actions using a Bayesian framework. The proposed method can also be applied to videos depicting interactions between humans, without any modification. The use of information captured from different viewing angles leads to high classification performance. The proposed method is the first one that has been tested in such challenging experimental setups, a fact that demonstrates its effectiveness in dealing with most of the open issues in action recognition.

Index Terms—Human action recognition, Fuzzy Vector Quantization, Multi-layer Perceptrons, Bayesian Frameworks.
I. INTRODUCTION

Human action recognition is an active research field, due to its importance in a wide range of applications, such as intelligent surveillance [1], human-computer interaction [2], content-based video compression and retrieval [3], augmented reality [4], etc. The term action is often confused with the terms activity and movement. An action (sometimes also called a movement) refers to a single period of a human motion pattern, such as a walking step. Activities consist of a number of actions/movements, i.e., dancing consists of successive repetitions of several actions, e.g., walk, jump, wave hand, etc. Actions are usually described by using either features based on motion information and optical flow [5], [6], or features devised mainly for action representation [7], [8]. Although the use of such features leads to satisfactory action recognition results, their computation is expensive. Thus, in order to perform action recognition at high frame rates, the use of simpler action representations is required. Neurobiological studies [9] have concluded that the human brain can perceive actions by observing only the human body poses (postures) during action execution. Thus, actions can be described as sequences of consecutive human body poses, in terms of human body silhouettes [10], [11], [12].

After describing actions, action classes are usually learned by training pattern recognition algorithms, such as Artificial Neural Networks (ANNs) [13], [14], [15], Support Vector Machines (SVMs) [16], [17] and discriminant dimensionality reduction techniques [18]. In most applications, the camera
A. Iosifidis, A. Tefas and I. Pitas are with the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece. E-mail: {aiosif,tefas,pitas}@aiia.csd.auth.gr.
viewing angle is not fixed and human actions are observed from arbitrary camera viewpoints. Several researchers have highlighted the significant impact of camera viewing angle variations on action recognition performance [19], [20]. This is the so-called viewing angle effect. To provide view-independent methods, the use of multi-camera setups has been adopted [21], [22], [23]. By observing the human body from different viewing angles, a view-invariant action representation is obtained. This representation is subsequently used to describe and recognize actions.

Although multi-view methods address the viewing angle effect properly, they impose a restrictive recognition setup, which is difficult to meet in real systems [24]. Specifically, they assume the same camera setup in both the training and recognition phases. Furthermore, the human under consideration must be visible from all the synchronized cameras. However, an action recognition method should not be based on such assumptions, as several issues may arise in the recognition phase. Humans inside a scene may be visible from an arbitrary number of cameras and may be captured from arbitrary viewing angles. Inspired by this setting, a novel approach to view-independent action recognition is proposed. Aiming at the generic action recognition problem, a novel view-invariant action recognition method based on ANNs is proposed in this paper. The proposed approach does not require the use of the same number of cameras in the training and recognition phases. An action captured by an arbitrary number $N$ of cameras is described by a number of successive human body postures. The similarity of every human body posture to body posture prototypes, determined in the training phase by a self-organizing neural network, is used to provide a time-invariant action representation. Action recognition is performed for each of the $N$ cameras by using a Multi-Layer Perceptron (MLP), i.e., a feed-forward neural network. Action recognition results are subsequently combined to recognize the unknown action. The proposed method performs view-independent action recognition, using an uncalibrated multi-camera setup. The combination of the recognition outcomes that correspond to different viewing angles leads to action recognition with high recognition accuracy.

The main contributions of this paper are: a) the use of Self Organizing Maps (SOM) for identifying the basic posture prototypes of all the actions, b) the use of cumulative fuzzy distances from the SOM in order to achieve time-invariant action representations, c) the use of a Bayesian framework to combine the recognition results produced for each camera, and d) the solution of the camera viewing angle identification problem using combined neural networks.

The remainder of this paper is structured as follows.
An overview of the recognition framework used in the proposed approach and a short discussion of the action recognition task are given in Section I-A. Section II presents details of the processing steps performed in the proposed method. Experiments assessing the performance of the proposed method are described in Section III. Finally, conclusions are drawn in Section IV.
A. Problem Statement
Let an arbitrary number $N_C$ of cameras capture a scene at a given time instance. These cameras form an $N_C$-view camera setup. This camera setup can be a converging one or not. In the first case, the space which can be captured by all the $N_C$ cameras is referred to as the capture volume. In the latter case, the cameras forming the camera setup are placed in such positions that there is no space which can be simultaneously captured by all the cameras. A converging and a non-converging camera setup are illustrated in Figure 1. $N_t$ video frames from a specific camera, $\mathbf{f}_i, \ i = 1, \dots, N_t$, form a single-view video $\mathbf{f} = [\mathbf{f}_1^T, \mathbf{f}_2^T, \dots, \mathbf{f}_{N_t}^T]^T$.

Fig. 1. a) A converging and b) a non-converging camera setup.
Actions can be periodic (e.g., walk, run) or not (e.g., bend, sit). The term elementary action refers to a single human action pattern. In the case of periodic actions, the term elementary action refers to a single period of the motion pattern, such as a walk step. In the case of non-periodic actions, the term elementary action refers to the whole motion pattern, i.e., a bend sequence. Let $\mathcal{A}$ be a set of $N_A$ elementary action classes $\{a_1, \dots, a_{N_A}\}$, such as walk, jump, run, etc. Let a person perform an elementary action $a_j, \ 1 \leq j \leq N_A$, captured by $N \leq N_C$ cameras. This results in the creation of $N$ single-view action videos $\mathbf{f}_i = [\mathbf{f}_{i1}^T, \mathbf{f}_{i2}^T, \dots, \mathbf{f}_{i N_{t_j}}^T]^T, \ i = 1, \dots, N$, each depicting the action from a different viewing angle. Since different elementary actions have different durations, the number of video frames $N_{t_j}, \ j = 1, \dots, N_A$, that constitute elementary action videos of different action classes varies. For example, a run period consists of only 10 video frames, whereas a sit sequence consists of 40 video frames on average, at 25 fps. The action recognition task is the classification of one or more action videos, depicting an unknown person performing an elementary action, in one of the known action classes specified by the action class set $\mathcal{A}$. In the following,
we present the main challenges for an
action recognition method:

- The person can be seen from $N \leq N_C$ cameras. The case $N < N_C$ can occur either when the camera setup is not a converging one, when the person performs the action outside the capture volume of a converging camera setup, or in the case of occlusions.
- During the action execution, the person may change movement direction. This affects the viewing angle from which he/she is captured by each camera. The identification of the camera position with respect to the person's body is referred to as the camera viewing angle identification problem, and it needs to be solved in order to perform view-independent action recognition.
- Each camera may capture the person from an arbitrary distance. This affects the size of the human body projection on each camera plane.
- Elementary actions differ in duration. This is observed in different realizations of the same action performed by different persons, or even by the same person at different times or under different circumstances.
- The method should allow continuous action recognition over time.
- Elementary action classes highly overlap in the video frame space, i.e., the same body postures may appear in different actions. For example, many postures of the classes jump in place and jump forward are identical. Furthermore, variations in style can be observed between two different realizations of the same action, performed either by the same person or by different persons. Considering these observations, the body postures of a person performing one action may be the same as the body postures of another person performing a different action. Moreover, there are certain body postures that uniquely characterize certain action classes. An action representation should take all these observations into account in order to lead to a good action recognition method.
- Cameras forming the camera setup may differ in resolution and frame rate, and synchronization errors may occur in real camera setups; that is, there might be a delay between the video frames produced by different cameras.
- The use of multi-camera setups involves the need for camera calibration for a specific camera setting. Action recognition algorithms typically need to be retrained even for small variations of the camera positions.
The objective is the combination of all the available information coming from all the available cameras depicting the person under consideration, in order to recognize the unknown action. The proposed method copes with all the above mentioned issues and constraints. To the best of the authors' knowledge, this is the first time that an action recognition method has been tested against all these scenarios with very good performance.
II. PROPOSED METHOD

In this section, each step of the proposed method is described in detail. The extraction of posture data, used as input data in the remaining steps, is presented in subsection II-A. The use of a Self Organizing Map (SOM) to determine human body posture prototypes is described in subsection II-B. Action representation and classification are presented in subsections II-C and II-D, respectively. A variation of the original algorithm that exploits the observations' viewing angle information is presented in subsections II-E and II-F. Finally, subsection II-G presents the procedure followed in the recognition phase.
A. Preprocessing Phase
As previously described, an elementary action is captured by $N$ cameras in elementary action videos consisting of $N_{t_j}, \ 1 \leq j \leq N_A$, video frames that depict one action period. The number $N_{t_j}$ may vary over action classes, as well as over elementary action videos coming from the same action class. Multi-period action videos are manually split into elementary action videos, which are subsequently used for training and testing in the elementary action recognition case. In the case of videos showing many action periods (continuous action recognition), a sliding window of possibly overlapping video segments having a suitably chosen length $N_{t_w}$ is used, and recognition is performed at each window position, as will be described in Section III-D. In the following, the term action video will refer to an elementary action video.

Moving object segmentation techniques [25], [26] are applied to each action video frame to create binary images depicting the person's body in white and the background in black. These images are centered at the person's center of mass. Bounding boxes of size equal to the maximum bounding box enclosing the person's body are extracted and rescaled to $N_H \times N_W$ pixels to produce binary posture images of fixed size. Eight binary posture images of eight actions (walk, run, jump in place, jump forward, bend, sit, fall and wave one hand), taken from various viewing angles, are shown in Figure 2.
Fig. 2. Posture images of eight actions taken from various
viewing angles.
Binary posture images are represented as matrices and these matrices are vectorized to produce posture vectors $\mathbf{p} \in \mathbb{R}^D, \ D = N_H N_W$. That is, each posture image is finally represented by a posture vector $\mathbf{p}$. Thus, every action video consisting of $N_{t_j}$ video frames is represented by a set of posture vectors $\mathbf{p}_i \in \mathbb{R}^D, \ i = 1, \dots, N_{t_j}$. In the experiments presented in this paper, the values $N_H = 64$, $N_W = 64$ have been used and binary posture images were scanned column-wise. A sketch of this preprocessing step is given below.
B. Posture Prototypes Identification
In the training phase, posture vectors $\mathbf{p}_i, \ i = 1, \dots, N_p$, $N_p$ being the total number of posture vectors constituting all the $N_T$ training action videos (having $N_{t_j}$ video frames each), are used to produce action-independent posture prototypes, without exploiting the known action labels. To produce spatially related posture prototypes, a SOM is used [27]. The use of the SOM leads to a topographic map (lattice) of the input data, in which the spatial locations of the resulting prototypes in the lattice are indicative of intrinsic statistical features of the input postures. The training procedure for constructing the SOM is based on three procedures:

1) Competition: For each of the training posture vectors $\mathbf{p}_i$, its Euclidean distance from every SOM weight $\mathbf{w}_{Sj} \in \mathbb{R}^D, \ j = 1, \dots, N_S$, is calculated. The winning neuron is the one that gives the smallest distance:

$j^{*} = \arg\min_{j} \| \mathbf{p}_i - \mathbf{w}_{Sj} \|_2$.   (1)
2) Cooperation: The winning neuron $j^{*}$ indicates the center of a topological neighborhood $h_{j^{*}}$. Neurons are excited depending on their lateral distance from this neuron. A typical choice of $h_{j^{*}}$ is the Gaussian function:

$h_{j^{*}k}(n) = \exp\left( -\frac{r_{j^{*}k}^2}{2\sigma^2(n)} \right)$,   (2)

where $k$ corresponds to the neuron at hand, $n$ is the iteration of the algorithm, $r_{j^{*}k}$ is the Euclidean distance between neurons $j^{*}$ and $k$ in the lattice space and $\sigma$ is the effective width of the topological neighborhood. $\sigma$ is a function of $n$: $\sigma(n) = \sigma(0) \exp(-\frac{n}{N_0})$, where $N_0$ is the total number of training iterations. $\sigma(0) = \frac{l_w + l_h}{4}$ in our experiments, where $l_w$ and $l_h$ are the lattice width and height, respectively.

3) Adaptation: At this step, each neuron is adapted with respect to its lateral distance from the winning neuron as follows:

$\mathbf{w}_{Sk}(n+1) = \mathbf{w}_{Sk}(n) + \eta(n) \, h_{j^{*}k}(n) \, (\mathbf{p}_i - \mathbf{w}_{Sk}(n))$,   (3)

where $\eta(n)$ is the learning-rate parameter: $\eta(n) = \eta(0) \exp(-\frac{n}{N_0})$, with $\eta(0) = 0.1$ in our experiments. A sketch of this training loop is given below.

The optimal number of update iterations is determined by performing a comparative study on the produced lattices. In a preliminary study, we conducted experiments using a variety of iteration numbers for the update procedure. Specifically, we trained the algorithm using 20, 50, 60, 80, 100 and 150 update iterations. Comparing the produced lattices, we found that the quality of the produced posture prototypes does not change for update iteration numbers greater than 60.
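The three training procedures can be illustrated with the following Python sketch. It is not the authors' implementation: the initialization of the weights from random training postures and the shuffled per-sample updates are assumptions, while the winner selection, Gaussian neighborhood and adaptation step follow Eqs. (1)-(3).

```python
import numpy as np

def train_som(postures, lw=12, lh=12, n_iter=60, eta0=0.1, seed=0):
    """Train a lw x lh SOM on posture vectors (rows of `postures`)."""
    n_neurons = lw * lh
    rng = np.random.default_rng(seed)
    # Assumption: initialize weights from random training postures
    # (requires at least lw*lh training posture vectors).
    W = postures[rng.choice(len(postures), n_neurons, replace=False)].astype(float)
    # 2-D lattice coordinates of each neuron, used for lateral distances.
    grid = np.array([(x, y) for y in range(lh) for x in range(lw)], dtype=float)
    sigma0 = (lw + lh) / 4.0                    # sigma(0) as in the paper
    for n in range(n_iter):
        sigma = sigma0 * np.exp(-n / n_iter)    # shrinking neighborhood width
        eta = eta0 * np.exp(-n / n_iter)        # decaying learning rate
        for p in postures[rng.permutation(len(postures))]:
            # Competition, Eq. (1): winner has minimum Euclidean distance.
            j = np.argmin(np.linalg.norm(W - p, axis=1))
            # Cooperation, Eq. (2): Gaussian neighborhood in lattice space.
            r2 = np.sum((grid - grid[j]) ** 2, axis=1)
            h = np.exp(-r2 / (2.0 * sigma ** 2))
            # Adaptation, Eq. (3): move neurons toward the posture vector.
            W += eta * h[:, None] * (p - W)
    return W
```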
The optimal lattice topology is determined using the cross-validation procedure, which assesses the ability of a learning algorithm to generalize to data it was not trained on. That is, the learning algorithm is trained using all but some of the training data, which are subsequently used for testing. This procedure is applied multiple times (folds). The test action videos used to determine the optimal lattice topology were all the action videos of a specific person not included in the training set. A $12 \times 12$ lattice of posture prototypes, produced using action videos of the action classes walk, run, jump in place, jump forward, bend, sit, fall and wave one hand, captured from eight viewing angles (0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°, with respect to the person's body), is depicted in Figure 3.

Fig. 3. A 12 x 12 SOM produced by posture frames of eight actions captured from eight viewing angles.

As can be seen, some posture prototypes correspond to body postures that appear in more than one action. For example, posture prototypes at lattice locations (1,f), (1,g), (7,d) describe postures of actions jump in place, jump forward and sit, while posture prototypes at lattice locations (3,k), (6,l), (8,l) describe postures of actions walk and run. Moreover, some posture prototypes correspond to postures that appear in only one action class. For example, posture prototypes at lattice locations (1,i), (10,i), (12,e) describe postures of action bend, while posture prototypes at lattice locations (4,f), (3,i), (5,j) describe postures of action wave one hand. Furthermore, one can notice that similar posture prototypes lie in adjacent lattice positions. This results in a better posture prototype organization. To illustrate the advantage given by the SOM posture prototype representation in the action recognition task, Figure 4 presents the winning neurons in the training set used to produce the $12 \times 12$ lattice of Figure 3, for each of the action classes. In this figure, only the winning neurons are shown, while the grayscale value of the enclosing square is a function of their number of wins. That is, after determining the SOM, the similarity between all the posture vectors belonging to the training action videos and the SOM neurons was computed, and the winning neuron corresponding to each posture vector was determined. The grayscale value of the enclosing squares is high for neurons having a large number of wins and small for those having a small one. Thus, neurons enclosed in squares with high grayscale values correspond to human body poses appearing more often in each action type. As can be seen, the posture prototypes representing each action are quite different and concentrate in neighboring parts of the lattice. Thus, one can expect that the more non-overlapping these maps are, the more discriminant a representation they offer for action recognition.
Fig. 4. Winning neurons for eight actions: a) walk, b) run, c) jump in place, d) jump forward, e) bend, f) sit, g) fall and h) wave one hand.
By observing the lattice shown in Figure 3, it can be seen that the spatial organization of the posture prototypes defines areas where posture prototypes correspond to different viewing angles. To illustrate this, Figure 5 presents the winning neurons for all action videos that correspond to a specific viewing angle. It can be seen that the winning neurons corresponding to different views are quite well separated. Thus, the same representation can also be used for viewing angle identification, since the maps that correspond to different views are quite non-overlapping. Overall, the SOM posture prototype representation has enough discriminant power to provide a good representation space for the action posture prototype vectors.
Fig. 5. Winning neurons for eight views: a) 0°, b) 45°, c) 90°, d) 135°, e) 180°, f) 225°, g) 270° and h) 315°.
C. Action Representation
Let posture vectors $\mathbf{p}_i, \ i = 1, \dots, N_{t_j}, \ j = 1, \dots, N_A$, constitute an action video. Fuzzy distances of every $\mathbf{p}_i$ to all the SOM weights $\mathbf{w}_{Sk}, \ k = 1, \dots, N_S$, are calculated to determine the similarity of every posture vector with every posture prototype:

$d_{ik} = \left( \| \mathbf{p}_i - \mathbf{w}_{Sk} \|_2 \right)^{-\frac{2}{m-1}}$,   (4)

where $m$ is the fuzzification parameter ($m > 1$). Its optimal value is determined by applying the cross-validation procedure. We have experimentally found that a value of $m = 1.1$ provides a satisfactory action representation and, thus, this value is used in all the experiments presented in this paper. Fuzzy distances allow for a smooth distance representation between posture vectors and posture prototypes.

After the calculation of the fuzzy distances, each posture vector is mapped to the distance vector $\mathbf{d}_i = [d_{i1}, d_{i2}, \dots, d_{iN_S}]^T$. Distance vectors $\mathbf{d}_i, \ i = 1, \dots, N_{t_j}$, are normalized to produce membership vectors $\mathbf{u}_i = \frac{\mathbf{d}_i}{\sum_{k=1}^{N_S} d_{ik}}, \ \mathbf{u}_i \in \mathbb{R}^{N_S}$, which correspond to the final representations of the posture vectors in the SOM posture space. The mean vector $\mathbf{s} = \frac{1}{N_{t_j}} \sum_{i=1}^{N_{t_j}} \mathbf{u}_i, \ \mathbf{s} \in \mathbb{R}^{N_S}$, of all the $N_{t_j}$ membership vectors comprising the action video is called the action vector and represents the action video.

The use of the mean vector leads to a duration-invariant action representation. That is, we expect the normalized cumulative membership of a specific action to be invariant to the duration of the action. This expectation is supported by the observation discussed in Subsection II-B and illustrated in Figure 4. Given that the winning SOM neurons corresponding to different actions are quite well separated, we expect that the distribution of fuzzy memberships to the SOM neurons will characterize actions. Finally, the action vectors representing all $N_T$ training action videos, $\mathbf{s}_j, \ j = 1, \dots, N_T$, are normalized to have zero mean and unit standard deviation. In the test phase, all the $N$ action vectors $\mathbf{s}_k, \ k = 1, \dots, N$, that correspond to $N$ test action videos depicting the person from different viewing angles are normalized accordingly. A sketch of this mapping is given below.
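In Python, the mapping from an action video to its action vector might look as follows. This is a sketch under stated assumptions: the sum-to-one normalization of the membership vectors is inferred from the term "membership vector", and the epsilon guard on the inverse-power distance of Eq. (4) is an implementation detail not specified in the paper.

```python
import numpy as np

def action_vector(postures, W, m=1.1):
    """Map an action video (rows = posture vectors) to its action vector s."""
    # Fuzzy distances to every posture prototype, Eq. (4); the epsilon
    # guards against division by zero when a posture equals a prototype.
    dist = np.linalg.norm(postures[:, None, :] - W[None, :, :], axis=2)
    d = np.maximum(dist, 1e-12) ** (-2.0 / (m - 1.0))
    # Membership vectors: each row normalized to sum to one.
    u = d / d.sum(axis=1, keepdims=True)
    # Duration-invariant action vector: mean membership over all frames.
    return u.mean(axis=0)
```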
D. Single-view Action Classification
As previously described, action recognition performs the classification of an unknown incoming action, captured in $N \leq N_C$ action videos, to one of the $N_A$ known action classes $a_j, \ j = 1, \dots, N_A$, contained in an action class set $\mathcal{A}$. Using the SOM posture prototype representation, which leads to spatially related posture prototypes, and expecting that action videos of every action class will be described by spatially related posture vectors, an MLP is proposed for the action classification task, consisting of $N_S$ inputs (equal to the dimensionality of the action vectors $\mathbf{s}$), $N_A$ outputs (each corresponding to an action class $a_j, \ j = 1, \dots, N_A$) and using the hyperbolic tangent activation function $f_{sigmoid}(x) = a \tanh(bx)$, where the values $a = 1.7159$ and $b = \frac{2}{3}$ were chosen [28], [29].

In the training phase, all $N_T$ training action vectors $\mathbf{s}_i, \ i = 1, \dots, N_T$, accompanied by their action labels, are used to define the MLP weights $\mathbf{W}_A$ using the Backpropagation algorithm [30]. The target outputs corresponding to each action vector, $\mathbf{o}_i = [o_{i1}, \dots, o_{iN_A}]^T$, are set to $o_{ik} = 0.95$ for action vectors belonging to action class $k$ and $o_{ik} = -0.95$ otherwise, $k = 1, \dots, N_A$. For each of the action vectors $\mathbf{s}_i$, the MLP response $\hat{\mathbf{o}}_i = [\hat{o}_{i1}, \dots, \hat{o}_{iN_A}]^T$ is calculated by:

$\hat{o}_{ik} = f_{sigmoid}(\mathbf{s}_i^T \mathbf{w}_{Ak})$,   (5)

where $\mathbf{w}_{Ak}$ is a vector that contains the MLP weights corresponding to output $k$.

The training procedure is performed in an on-line form, i.e., adjustments of the MLP weights are performed for each training action vector. After feeding a training action vector $\mathbf{s}_i$ and calculating the MLP response $\hat{\mathbf{o}}_i$, the modification of the weight that connects neurons $i$ and $j$ follows the update rule:

$\Delta W_{Aji}(n+1) = c \, \Delta W_{Aji}(n) + \eta \, \delta_j(n) \, y_i(n)$,   (6)

where $\delta_j(n)$ is the local gradient of the $j$-th neuron, $y_i$ is the output of the $i$-th neuron, $\eta$ is the learning rate, $c$ is a positive number called the momentum constant ($\eta = 0.05$ and $c = 0.1$ in the experiments presented in this paper) and $n$ is the iteration number. Action vectors are introduced to the MLP in a random sequence. This procedure is applied until the Mean Square Error (MSE) falls under an acceptable error rate $\varepsilon$:

$E = \left[ \frac{1}{N_T} \sum_{i=1}^{N_T} \| \hat{\mathbf{o}}_i - \mathbf{o}_i \|_2^2 \right]^{\frac{1}{2}} < \varepsilon$.   (7)

The optimal MSE threshold is determined by performing the cross-validation procedure using different values for the MSE threshold $\varepsilon$ and for the number of training iterations $N_{bp}$ of the algorithm. We used values of $\varepsilon$ equal to 0.1, 0.01 and 0.001 and values of $N_{bp}$ equal to 100, 500 and 1000. We found the best combination to be $\varepsilon = 0.01$ and $N_{bp} = 1000$ and used these values in all our experiments. A sketch of this training procedure is given below.
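The training rule of Eqs. (5)-(7) can be sketched as follows in Python. It is a minimal sketch that treats the classifier as a single weight layer, as the form of Eq. (5) suggests; the weight initialization, the random seed and the gradient bookkeeping are assumptions, and a hidden layer would add the usual backpropagated local gradients.

```python
import numpy as np

A_ACT, B_ACT = 1.7159, 2.0 / 3.0  # activation constants from [28], [29]

def f_sigmoid(x):
    return A_ACT * np.tanh(B_ACT * x)

def train_action_mlp(S, labels, n_classes, eta=0.05, c=0.1, eps=0.01,
                     n_bp=1000, seed=0):
    """Online delta-rule training with momentum, sketching Eqs. (5)-(7).

    S: (N_T, N_S) matrix of training action vectors; labels: integer classes."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((S.shape[1], n_classes))
    dW = np.zeros_like(W)
    # Target outputs: 0.95 for the true class, -0.95 otherwise.
    O = -0.95 * np.ones((len(S), n_classes))
    O[np.arange(len(S)), labels] = 0.95
    for _ in range(n_bp):
        for i in rng.permutation(len(S)):
            z = S[i] @ W
            # Local gradient: error times the activation derivative.
            delta = (O[i] - f_sigmoid(z)) * A_ACT * B_ACT * (1 - np.tanh(B_ACT * z) ** 2)
            dW = c * dW + eta * np.outer(S[i], delta)  # Eq. (6), with momentum
            W += dW
        # Eq. (7): stop once the root mean square error drops below eps.
        if np.sqrt(np.mean((f_sigmoid(S @ W) - O) ** 2)) < eps:
            break
    return W
```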
In the test phase, a set $\mathcal{S}$ of action vectors $\mathbf{s}_i, \ i = 1, \dots, N$, corresponding to $N$ action videos captured from all the $N$ available cameras depicting the person, is obtained. To classify $\mathcal{S}$ to one of the $N_A$ action classes $a_j$ specified by the action class set $\mathcal{A}$, each of the $N$ action vectors $\mathbf{s}_i$ is fed to the MLP and $N$ responses are obtained:

$\hat{\mathbf{o}}_i = [\hat{o}_{i1}, \dots, \hat{o}_{iN_A}], \ i = 1, \dots, N$.   (8)

Each action video is classified to the action class $a_j, \ j = 1, \dots, N_A$, that corresponds to the maximum MLP output:

$\hat{a}_i = \arg\max_{j} \hat{o}_{ij}, \ i = 1, \dots, N, \ j = 1, \dots, N_A$.   (9)

Thus, a vector $\hat{\mathbf{a}} = [\hat{a}_1, \dots, \hat{a}_N]^T \in \mathbb{R}^N$ containing all the recognized action classes is obtained. Finally, expecting that most of the recognized actions $\hat{a}_i$ will correspond to the actual action class of $\mathcal{S}$, $\mathcal{S}$ is classified to an action class by performing majority voting over the action classes indicated in $\hat{\mathbf{a}}$.

Using this approach, view-independent action recognition is achieved. Furthermore, as the number $N$ of action vectors forming $\mathcal{S}$ may vary, a generic multi-view action recognition method is obtained. In the above described procedure, no viewing angle information is used in the combination of the classification results $\hat{a}_i, \ i = 1, \dots, N$, that correspond to each of the $N$ cameras, to produce the final recognition result. As noted before, actions look quite different when they are captured from different viewing angles. Thus, some views may be more discriminant for certain actions. For example, actions walk and run are well distinguished when they are captured from a side view, but they seem similar when they are captured from the front view. In addition, actions wave one hand and jump in place are well distinguished when they are captured from the front or back view, but not from the side views. Therefore, instead of majority voting, a more sophisticated method can be used to combine all the available information and produce the final action recognition result by exploiting the viewing angle information. In the following, a procedure based on a Bayesian framework is proposed in order to combine the action recognition results from all $N$ cameras. A sketch of the simpler majority-voting combination is given below.
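The per-camera decisions of Eqs. (8)-(9) and the majority vote can be sketched as follows; mlp_forward stands for the trained classifier of the previous paragraphs and is a hypothetical helper, e.g. lambda s: f_sigmoid(s @ W) with the weights from the training sketch above.

```python
import numpy as np

def classify_action_set(S, mlp_forward, n_classes):
    """Classify a set of per-camera action vectors by majority voting."""
    votes = np.zeros(n_classes, dtype=int)
    for s in S:
        o_hat = mlp_forward(s)              # Eq. (8): per-class responses
        votes[int(np.argmax(o_hat))] += 1   # Eq. (9): per-camera decision
    return int(np.argmax(votes))            # class with the most votes
```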
E. Combination of single-view action classification results using a Bayesian Framework
The classification of an action vector set $\mathcal{S}$, consisting of $N \leq N_C$ action vectors $\mathbf{s}_i, \ i = 1, \dots, N$, each corresponding to an action video coming from a specific camera used for recognition, in one of the action classes $a_j, \ j = 1, \dots, N_A$, of the action class set $\mathcal{A}$, can be performed using a probabilistic framework. Each of the $N$ action vectors $\mathbf{s}_i$ of $\mathcal{S}$ is fed to the MLP and $N$ vectors containing the responses $\hat{\mathbf{o}}_i = [\hat{o}_{i1}, \hat{o}_{i2}, \dots, \hat{o}_{iN_A}]^T$ are obtained. The problem to be solved is to classify $\mathcal{S}$ in one of the action classes $a_j$ given these observations, i.e., to estimate the probability $P(a_j | \hat{\mathbf{o}}_1^T, \hat{\mathbf{o}}_2^T, \dots, \hat{\mathbf{o}}_{N_C}^T)$ of every action class given the MLP responses. In the case where $N < N_C$, the $N_C - N$ MLP outputs $\hat{\mathbf{o}}_i$ of the missing cameras will be set to zero, as no recognition result is provided for these cameras. Since the MLP responses are real-valued, the estimation of $P(a_j | \hat{\mathbf{o}}_1^T, \hat{\mathbf{o}}_2^T, \dots, \hat{\mathbf{o}}_{N_C}^T)$ is very difficult. Let $\hat{a}_i$ denote the action recognition result corresponding to the test action vector $\mathbf{s}_i$ representing the action video captured by the $i$-th camera, taking values in the action class set $\mathcal{A}$. Without loss of generality, $\mathbf{s}_i, \ i = 1, \dots, N$, is assumed to be classified to the action class that provides the highest MLP response, i.e., $\hat{a}_i = \arg\max_j \hat{o}_{ij}$. That is, the problem to be solved is to estimate the probabilities $P(a_j | \hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C})$ of every action class $a_j$, given the discrete variables $\hat{a}_i, \ i = 1, \dots, N_C$. Let $P(a_j)$ denote the a priori probability of action class $a_j$ and $P(\hat{a}_i)$ the a priori probability of recognizing $\hat{a}_i$ from camera $i$. Let $P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C})$ be the joint probability of all the $N_C$ cameras observing one of the $N_A$ action classes $a_j$. Furthermore, the conditional probabilities $P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C} | a_j)$ that camera 1 recognizes action class $\hat{a}_1$, camera 2 recognizes action class $\hat{a}_2$, etc., given that the actual action class of $\mathcal{S}$ is $a_j$, can be calculated. Using these probabilities, the probability $P(a_j | \hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C})$ of action class $a_j, \ j = 1, \dots, N_A$, given the classification results $\hat{a}_i$, can be estimated using the Bayes formula:

$P(a_j | \hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C}) = \frac{P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C} | a_j) \, P(a_j)}{\sum_{l=1}^{N_A} P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C} | a_l) \, P(a_l)}$.   (10)
In the case of equiprobable action classes, $P(a_j) = \frac{1}{N_A}$. If this is not the case, $P(a_j)$ should be set to the real prior values and the training data should be chosen accordingly. Expecting that the training and evaluation data come from the same distributions, $P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C} | a_j)$ can be estimated during the training procedure. In the evaluation phase, $\mathcal{S}$ can be classified to the action class providing the maximum conditional probability, i.e., $\hat{a}_{\mathcal{S}} = \arg\max_j P(a_j | \hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C})$. However, such a system cannot be applied in a straightforward manner, since the number of combinations of all the $N_C$ cameras providing $N_A + 1$ action recognition results is enormous. The case $N_A + 1$ refers to the situation where a camera does not provide an action recognition result, because the person is not visible from this camera. For example, in the case of $N_C = 8$ and $N_A = 8$, the number of all possible combinations is equal to $(N_A + 1)^{N_C} = 43046721$. Thus, in order to estimate the probabilities $P(\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{N_C} | a_j)$, an enormous training data set would be needed.

To overcome this difficulty, the action classification task can be applied to each of the $N$ available cameras independently, and the $N$ classification results can subsequently be combined to classify the action vector set $\mathcal{S}$ to one of the action classes $a_j$. That is, for camera $i$, the probability $P(a_j | \hat{a}_i)$ of action class $a_j$ given the recognized action class $\hat{a}_i$ can be estimated using the Bayes formula:

$P(a_j | \hat{a}_i) = \frac{P(\hat{a}_i | a_j) \, P(a_j)}{\sum_{l=1}^{N_A} P(\hat{a}_i | a_l) \, P(a_l)}$.   (11)
As previously described, since the person can move freely, the viewing angle can vary for each camera: if a camera captures the person from the front viewing angle at a given time instance, a change in his/her motion direction may result in this camera capturing him/her from a side viewing angle at a subsequent time instance. Since the viewing angle has proven to be very important in the action recognition task, the viewing angle information should be exploited to improve the action recognition accuracy. Let $P(\hat{a}_i, \hat{v}_i)$ denote the joint probability that camera $i$ recognizes action class $\hat{a}_i$ captured from viewing angle $\hat{v}_i$. Using $P(\hat{a}_i, \hat{v}_i)$, the probability $P(a_j | \hat{a}_i, \hat{v}_i)$ of action class $a_j$, given $\hat{a}_i$ and $\hat{v}_i$, can be estimated by:

$P(a_j | \hat{a}_i, \hat{v}_i) = \frac{P(\hat{a}_i, \hat{v}_i | a_j) \, P(a_j)}{\sum_{l=1}^{N_A} P(\hat{a}_i, \hat{v}_i | a_l) \, P(a_l)}$.   (12)
The conditional probabilities $P(\hat{a}_i, \hat{v}_i | a_j)$ are estimated in the training phase. After training the MLP, the action vectors corresponding to the training action videos are fed to the MLP in order to obtain its responses. Each action vector is classified to the action class providing the highest MLP response. Exploiting the action and viewing angle labels accompanying the training action videos, the conditional probabilities $P(\hat{a}_i, \hat{v}_i | a_j)$ corresponding to the training set are calculated. Finally, the $P(a_j | \hat{a}_i, \hat{v}_i)$ are obtained using Equation (12). In the recognition phase, each camera provides an action recognition result. The viewing angle corresponding to each camera should also be estimated automatically to obtain $\hat{v}_i$. A procedure to this end, which exploits the vicinity property of the SOM posture prototype representation, is presented in the next subsection. After obtaining $\hat{a}_i$ and $\hat{v}_i, \ i = 1, \dots, N$, the action vector set $\mathcal{S}$ is classified to the action class providing the maximum probability sum [31], i.e., $\hat{a}_{\mathcal{S}} = \arg\max_j \sum_{i=1}^{N} P(a_j | \hat{a}_i, \hat{v}_i)$. A sketch of this estimation and combination procedure is given below.
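In Python, a count-based estimate of $P(\hat{a}_i, \hat{v}_i | a_j)$ and the sum-rule combination might look as follows. This is a sketch under stated assumptions: probabilities are estimated by frequency counts pooled over all cameras, and the guards against empty counts are implementation details not specified in the paper.

```python
import numpy as np

def estimate_cond_probs(train_triplets, n_classes, n_views):
    """Estimate P(a_hat, v_hat | a_true) by counting training decisions.

    train_triplets: iterable of (a_hat, v_hat, a_true) index triplets."""
    counts = np.zeros((n_classes, n_views, n_classes))
    for a_hat, v_hat, a_true in train_triplets:
        counts[a_hat, v_hat, a_true] += 1
    totals = np.maximum(counts.sum(axis=(0, 1), keepdims=True), 1)
    return counts / totals  # cond[a_hat, v_hat, a_true]

def bayes_combine(decisions, cond, priors):
    """Sum-rule combination of per-camera (a_hat, v_hat) decisions."""
    score = np.zeros(cond.shape[2])
    for a_hat, v_hat in decisions:
        joint = cond[a_hat, v_hat, :] * priors    # P(a_hat, v_hat | a_j) P(a_j)
        score += joint / max(joint.sum(), 1e-12)  # Eq. (12): posterior per camera
    return int(np.argmax(score))  # argmax_j sum_i P(a_j | a_hat_i, v_hat_i)
```

With equiprobable action classes, priors would simply be np.full(n_classes, 1.0 / n_classes).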
In (12), the denominator $\sum_{l=1}^{N_A} P(\hat{a}_i, \hat{v}_i | a_l) \, P(a_l) = P(\hat{a}_i, \hat{v}_i)$ refers to the probability that action vectors corresponding to action videos captured from the recognized viewing angle $\hat{v}_i$ belong to the recognized action class $\hat{a}_i$. This probability is indicative of the ability of each viewing angle to correctly recognize actions. Some views may offer more information to correctly classify some of the action vectors to an action class. In the case where $\hat{v}_i$ is capable of distinguishing $\hat{a}_i$ from all the other action classes, $P(\hat{a}_i, \hat{v}_i)$ will be equal to its actual value, thus having a small impact on the final decision $P(a_j | \hat{a}_i, \hat{v}_i)$. In the case where $\hat{v}_i$ confuses some action classes, $P(\hat{a}_i, \hat{v}_i)$ will have a value either higher or smaller than its actual one and will influence the decision $P(a_j | \hat{a}_i, \hat{v}_i)$. For example, consider the case of recognizing action class bend from the front viewing angle. Because this action is well distinguished from all the other actions when it is captured from the front viewing angle, $P(\hat{a}_i, \hat{v}_i)$ will not influence the final decision. The case of recognizing action classes jump in place and sit from a 270° side viewing angle is different. Action videos belonging to these action classes captured from this viewing angle are confused. Specifically, all the action videos recognized to belong to action class jump in place actually belong to this class, while some action videos recognized to belong to action class sit actually belong to action class jump in place. In this case, $P(\hat{a}_i, \hat{v}_i)$ will be of high value for the action class sit, thus providing a low value of $P(a_j | \hat{a}_i, \hat{v}_i)$, while $P(\hat{a}_i, \hat{v}_i)$ will be of low value for the action class jump in place, thus providing a high value of $P(a_j | \hat{a}_i, \hat{v}_i)$. That is, the recognition of action class sit from the 270° side viewing angle is ambiguous, as it is probable that the action video belongs to action class jump in place, while the recognition of the action class jump in place from the same viewing angle is of high confidence, as action videos recognized to belong to action class jump in place actually belong to this class.

The term $P(\hat{a}_i, \hat{v}_i | a_j)$ in (12) refers to the probability of the $i$-th action vector being detected as an action video captured from viewing angle $\hat{v}_i$ and classified to action class $\hat{a}_i$, given that the actual action class of this action vector is $a_j$. This probability is indicative of the similarity between actions when they are captured from different viewing angles and provides an estimate of the action discrimination offered by each viewing angle. Figure 6 illustrates the probabilities $P(\hat{a}_i, \hat{v}_i | a_j), \ j = 1, \dots, N_A, \ i = 1, \dots, N_C, \ \hat{a}_i = a_j$, i.e., the probabilities to correctly classify an action video belonging to action class $a_j$ from each of the viewing angles $\hat{v}_i$, for an action class set $\mathcal{A}$ = {walk, run, jump in place, jump forward, bend, sit, fall, wave one hand}, produced using a $12 \times 12$ SOM and an 8-camera setup, in which each camera captures the person from one of the eight viewing angles $\mathcal{V}$ = {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. In this figure, it can be seen that action vectors belonging to action classes jump in place, bend and fall are almost always correctly classified from every viewing angle. On the other hand, action vectors belonging to the remaining actions are more difficult to classify correctly from some viewing angles. As expected, the side views are the most capable in terms of classification for action classes walk, run, jump forward and sit, while in the case of action class wave one hand the best views are the frontal and the back ones.

Fig. 6. Single-view action classification results presented as input to the Bayesian framework for eight actions captured from eight viewing angles.
F. Camera Viewing Angle Identification
The proposed method utilizes a multi-camera setup. The person that performs an action can move freely, and this affects the viewing angles from which he/she is captured by each camera. Exploiting the SOM posture prototype representation in Figure 3 and the observation that posture prototypes corresponding to each viewing angle lie in different lattice locations, as presented in Figure 5, a second MLP is proposed to identify the viewing angle $\hat{v}_i$ from which the person is captured by each camera. Similarly to the MLP used in action classification, it consists of $N_S$ input nodes (equal to the dimensionality of the action vectors $\mathbf{s}$), $N_C$ outputs (each corresponding to a viewing angle) and uses the hyperbolic tangent function as activation function. Its training procedure is similar to the one presented in Subsection II-D. However, this time, the target outputs are set to $o_{ik} = 0.95, \ k = 1, \dots, N_C$, for action vectors belonging to action videos captured from the $k$-th viewing angle, and $o_{ik} = -0.95$ otherwise.

In the test phase, each of the $N$ action vectors $\mathbf{s}_i$ constituting an action video set $\mathcal{S}$, which corresponds to the same action captured from different viewing angles, is introduced to this MLP, and the viewing angle $\hat{v}_i$ corresponding to each action vector is recognized based on the maximum MLP response:

$\hat{v}_i = \arg\max_{j} \hat{o}_{ij}, \ i = 1, \dots, N, \ j = 1, \dots, N_C$.   (13)
G. Action Recognition (test phase)
Let a person perform an action captured by $N \leq N_C$ cameras. In the case of elementary action recognition, this action is captured in $N$ action videos, while in the case of continuous action recognition, a sliding window consisting of $N_{t_w}$ video frames is used to create the $N$ action videos used to perform action recognition at every window location. These videos are preprocessed as discussed in Subsection II-A to produce $N \times N_t$ posture vectors $\mathbf{p}_{ij}, \ i = 1, \dots, N, \ j = 1, \dots, N_t$, where $N_t = N_{t_j}$ or $N_t = N_{t_w}$ in the elementary and the continuous action recognition tasks, respectively. Fuzzy distances $d_{ijk}$ from all the test posture vectors $\mathbf{p}_{ij}$ to every SOM posture prototype $\mathbf{w}_{Sk}, \ k = 1, \dots, N_S$, are calculated and a set $\mathcal{S}$ of $N$ test action vectors $\mathbf{s}_i$ is obtained. These test action vectors are fed to the action recognition MLP and $N$ action recognition results $\hat{a}_i$ are obtained. In the case of majority voting, the action vector set $\mathcal{S}$ is classified to the action class $a_j$ that has the most votes. In the Bayesian framework case, the $N$ action vectors $\mathbf{s}_i$ are fed to the viewing angle identification MLP to recognize the corresponding viewing angles $\hat{v}_i$, and the action vector set $\mathcal{S}$ is classified to the action class $a_j$ that provides the highest cumulative probability according to the Bayesian decision. Figure 7 illustrates the procedure followed in the recognition phase for the majority voting and the Bayesian framework cases; the complete test phase is also sketched in code below.

Fig. 7. Action recognition system overview (test phase).
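Putting the pieces together, the test phase might be wired up as follows, reusing the earlier sketches (posture_vector, action_vector, f_sigmoid, bayes_combine); all of these names are illustrative helpers rather than an API from the paper, and the normalization statistics mu and sd are assumed to have been stored during training.

```python
import numpy as np

def recognize(action_videos, W_som, W_act, W_view, cond, priors, mu, sd):
    """Classify one action captured by N cameras (Bayesian framework case).

    action_videos: list of N lists of binary posture masks, one per camera."""
    decisions = []
    for masks in action_videos:
        P = np.stack([posture_vector(m) for m in masks])   # Subsection II-A
        s = (action_vector(P, W_som) - mu) / sd            # Subsection II-C
        a_hat = int(np.argmax(f_sigmoid(s @ W_act)))       # Eq. (9)
        v_hat = int(np.argmax(f_sigmoid(s @ W_view)))      # Eq. (13)
        decisions.append((a_hat, v_hat))
    return bayes_combine(decisions, cond, priors)          # Eq. (12), sum rule
```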
III. EXPERIMENTS
The experiments conducted in order to evaluate the performance of the proposed method are presented in this section. To demonstrate the ability of the proposed method to correctly classify actions performed by different persons, as variations in action execution speed and style may be observed, the leave-one-person-out cross-validation procedure was applied to the i3DPost multi-view action recognition database [32]; these experiments are discussed in subsection III-C. Subsection III-D discusses the operation of the proposed method in the case of multi-period action videos. Subsection III-E presents its robustness in the case of action recognition at different frame rates between the training and test phases. In subsection III-F, a comparative study that deals with the ability of every viewing angle to correctly classify actions is presented. The case of different camera setups in the training and recognition phases is discussed in subsection III-G, while the case of action recognition using an arbitrary number of cameras at the test phase is presented in subsection III-H. The ability of the proposed approach to perform action recognition in the case of human interactions is discussed in subsection III-I.
A. The i3DPost multi-view database
The i3DPost multi-view database [32] contains 80 high-resolution image sequences depicting eight persons performing eight actions and two person interactions. Eight cameras, having a wide 45° viewing angle difference to provide 360° coverage of the capture volume, were placed on a ring of 8 m diameter at a height of 2 m above the studio floor. The studio had a blue background. The actions performed in 64 video sequences are: walk (wk), run (rn), jump in place (jp), jump forward (jf), bend (bd), fall (fl), sit on a chair (st) and wave one hand (wo). The remaining 16 sequences depict two persons interacting. These interactions are: shake hand (sh) and pull down (pl).
B. The IXMAS multi-view database
The INRIA (Institut National de Recherche en Informatique et Automatique) Xmas Motion Acquisition Sequences (IXMAS) database [22] contains 330 low-resolution (291 x 390 pixels) image sequences depicting 10 persons performing 11 actions. Each sequence has been captured by five cameras. The persons freely change position and orientation. The actions performed are: check watch, cross arms, scratch head, sit down, get up, turn around, walk in a circle, wave hand, punch, kick and pick up. Binary images denoting the persons' bodies are provided with the database.
C. Cross-validation in i3DPost multi-view database
The cross-validation procedure described in Subsection II-B was applied to the i3DPost eight-view database, using the action video sequences of the eight persons. Action videos were manually extracted and binary action videos were obtained by thresholding the blue color in the HSV color space. Figure 8a illustrates the recognition rates obtained for various SOM lattice topologies, for the majority voting and the Bayesian framework cases. It can be seen that high recognition rates were observed. The optimal topology was found to be a $12 \times 12$ lattice. A recognition rate equal to 93.9% was obtained for the majority voting case. The Bayesian approach outperforms the majority voting one, providing a recognition rate equal to 94.04% for the view-independent approach. As can be seen, the use of viewing angle information results in an increase of the recognition ability. The best recognition rate was found to be equal to 94.87%, for the Bayesian approach incorporating the viewing angle recognition results. The confusion matrix corresponding to the best recognition result is presented in Table I. In this matrix, rows represent the actual action classes and columns the recognition results. As can be seen, actions which contain discriminant body postures, such as bend, fall and wave one hand, are perfectly classified. Actions having a large number of similar body postures, such as walk-run or jump in place-jump forward-sit, are more difficult to classify correctly. However, even for these cases, the classification accuracy is very high.
Fig. 8. a) Action recognition rates vs. various SOM lattice dimensions. b) Recognition results in the case of continuous action recognition.
D. Continuous action recognition
This section presents the functionality of the proposed method in the case of continuous (multi-period) action recognition. Eight multi-period videos, each corresponding to one viewing angle, depicting one of the persons of the i3DPost eight-view action database, were manually created by concatenating single-period action videos. The algorithm was trained using the action videos depicting the remaining seven persons, using a $12 \times 12$ lattice and combining the classification results corresponding to each camera with the Bayesian framework. In the test phase, a sliding window of $N_{t_w} = 21$ video frames was used and recognition was performed at every sliding window position. A majority vote filter, of size equal to 11 video frames, was applied to every classification result.

Figure 8b illustrates the results of this experiment. In this figure, the ground truth is illustrated by a continuous line and the recognition results by a dashed one. In the first 20 frames, no action recognition result was produced, as the algorithm needs 21 frames (equal to the sliding window length $N_{t_w}$) to perform action recognition. Moreover, a delay is observed in the classification results, as the algorithm uses observations that refer to past video frames ($t, t-1, \dots, t-N_t+1$). This delay was found to be between 12 and 21 video frames. Only one recognition error occurred, at the transition between actions sit and fall. A sketch of this windowed procedure is given below.
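A windowed wrapper around the recognition pipeline might look as follows in Python; recognize_set stands for a function classifying one set of per-camera frame windows (for example, the recognize sketch above) and is a hypothetical helper.

```python
def continuous_recognition(cams, recognize_set, Ntw=21, Nfilt=11):
    """Sliding-window recognition with a majority-vote filter over decisions.

    cams: list of N per-camera frame lists, all of the same length."""
    decisions, smoothed = [], []
    for t in range(Ntw - 1, len(cams[0])):
        windows = [frames[t - Ntw + 1:t + 1] for frames in cams]
        decisions.append(recognize_set(windows))
        # Majority filter over the last Nfilt raw window decisions.
        recent = decisions[-Nfilt:]
        smoothed.append(max(set(recent), key=recent.count))
    return smoothed
```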
TABLE I
CONFUSION MATRIX FOR EIGHT ACTIONS (empty cells are zero).

     wk    rn    jp    jf    bd    st    fl    wo
wk   0.95  0.05
rn   0.05  0.95
jp               0.92  0.02        0.06
jf               0.05  0.90        0.05
bd                           1
st               0.13              0.87
fl                                       1
wo                                             1
E. Action recognition in different video frame rates
To simulate the situation of recognizing actions using cameras of different frame rates between the training and test phases, an experiment was set up as follows. The cross-validation procedure, using a $12 \times 12$ lattice and the Bayesian framework, was applied for different camera frame rates in the test phase. That is, in the training phase, the action videos depicting the training persons were used to train the algorithm using their actual number of frames. In the test phase, the number of frames constituting the action videos was reduced, in order to achieve recognition at a lower frame rate. That is, for action recognition at half the frame rate, the test action videos consisted of the even-numbered frames, i.e., $n_t \bmod 2 = 0$, where $n_t$ is the frame number of each video frame and $\bmod$ refers to the modulo operator. In the case of a test frame rate equal to $1/3$ of the training frame rate, only frames with frame number $n_t \bmod 3 = 0$ were used, etc. In the general case, where the test to training video frame rate ratio was equal to $\frac{1}{K}$, the test action videos consisted of the video frames satisfying $n_t \bmod K = 0$ (a one-line sketch of this subsampling is given at the end of this subsection). Figure 9a shows the results for various values of $K$. It can be seen that the frame rate variation between the training and test phases does not influence the performance of the proposed method. In fact, it was observed that, for certain actions, a single posture frame that depicts a well distinguished posture of the action is enough to produce a correct classification result. This verifies the observation made in Subsection II-B that human body postures of different actions are placed at different positions on the lattice. Thus, the corresponding neurons are responsible for recognizing the correct action class. To verify this, the algorithm was tested using single body posture masks that depict a person from various viewing angles. The results of this experiment are illustrated in Figure 10. In this figure, it can be seen that the MLP can correctly classify action vectors that correspond to single human body postures. Even for difficult cases, such as walk 0° or run 315°, the MLP can correctly recognize the action at hand.
Fig. 9. a) Recognition results for different video frame rates between the training and test phases. b) Action recognition rates vs. various occlusion levels.
Fig. 10. MLP responses for single human body posture images as
input.
F. Actions versus viewing angle

A comparative study that specifies the action discrimination from different viewing angles is presented in this subsection. Using a $12 \times 12$ lattice and the Bayesian framework, the cross-validation procedure was applied to the action videos depicting the actions from each of the eight viewing angles {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. That is, eight single-view elementary action recognition procedures were performed. Figure 11 presents the recognition rates achieved for each of the actions. In this figure, the probability to correctly recognize an incoming action from every viewing angle is presented; e.g., the probability to correctly recognize a walking sequence captured from the frontal view is equal to 77.7%. As expected, for most action classes the side views are the most discriminant ones and result in the best recognition rates. In the case of action wave one hand, the frontal and the back views are the most discriminant ones and result in the best recognition rates. Finally, well distinguished actions, such as bend and fall, are well recognized from any viewing angle. This can be explained by the fact that the body postures that describe them are quite distinctive from any viewing angle.

Table II presents the overall action recognition accuracy achieved in every single-view action recognition experiment, i.e., the probability to correctly recognize an incoming action of any class from a given viewing angle. For example, the probability to correctly recognize one of the eight actions captured from the frontal view is equal to 77.5%. As can be seen, the side views result in better recognition rates, because most of the actions are well discriminated when observed from side views. The best recognition accuracy is equal to 86.1% and comes from a side view (135°). It should be noted that the use of the Bayesian framework improves the recognition accuracy, as the combination of these recognition outputs leads to a recognition rate equal to 94.87%, as discussed in Subsection III-C.
Fig. 11. Recognition rates of different actions when observed
from different viewing angles.
TABLE II
RECOGNITION RATES OBTAINED FOR EACH VIEWING ANGLE.

 0°      45°     90°     135°    180°    225°    270°    315°
 77.5%   80.9%   82.4%   86.1%   74.1%   80.5%   79.2%   80.2%
G. Action recognition using reduced camera setups
In this subsection, a comparative study of different reduced camera setups is presented. Using a $12 \times 12$ lattice, the cross-validation procedure was applied for different reduced test camera setups. In the training phase, all the action videos depicting the training persons from all eight cameras were used to train the proposed algorithm. In the test phase, only the action videos depicting the test person from the cameras specified by the reduced camera setup were used. Because the movement direction of the eight persons varies, the cameras in these experiments do not correspond to a specific viewing angle, i.e., camera #1 may or may not depict the person's front view. Figure 12 presents the recognition rates achieved by applying this procedure for eight different camera setups. As can be seen, a recognition rate equal to 92.3% was achieved using only 4 cameras having a 90° viewing angle difference to provide 360° coverage of the capture volume. It can also be seen that, even for 2 cameras placed at arbitrary viewing angles, a recognition rate greater than 83% is achieved.
H. Recognition in occlusion
To simulate the situation of recognizing actions using an arbitrary number of cameras, an experiment was set up as follows. The cross-validation procedure, using a $12 \times 12$ lattice and the Bayesian framework, was applied for a varying number of cameras in the test phase. That is, in the training phase, the action videos depicting the training persons from all eight cameras were used to train the algorithm. In the test phase, the number of cameras and the capturing views of the test person's action videos were randomly chosen. This experiment was applied for a varying number of cameras depicting the test person. The recognition rates achieved in these experiments can be seen in Figure 9b. Intuitively, we would expect the action recognition accuracy to be low when using a small number of cameras in the test phase, due to the viewing angle effect. By using a large number of cameras in the test phase, the viewing angle effect should be addressed properly, resulting in an increased action recognition accuracy. Using one arbitrarily chosen camera, a recognition rate equal to 79% was obtained, while using four arbitrarily chosen cameras, the recognition rate increased to 90%. Recognition rates equal to 94.85% and 94.87% were observed using five and eight cameras, respectively. As expected, the use of multiple cameras resulted in an increase of the action recognition accuracy. This experiment illustrates the ability of the proposed approach to recognize actions with high accuracy, in the case of recognition using an arbitrary number of cameras that depict the person from arbitrary viewing angles.
Fig. 12. Recognition rates for the eight different camera
setups.
I. Recognition of human interactions
As previously described, the action recognition task refers to the classification of actions performed by one person. To demonstrate the ability of the proposed approach to correctly classify actions performed by more than one person, i.e., human interactions, the cross-validation procedure was applied to the i3DPost eight-view database including the action videos that depict human interactions, i.e., shake hand and pull down. A recognition rate equal to 94.5% was observed for a lattice of $13 \times 13$ neurons and the Bayesian framework. An example of a $13 \times 13$ lattice is shown in Figure 13. The confusion matrix of this experiment can be seen in Table III.
Fig. 13. A 13 x 13 SOM produced by posture frames of actions and interactions.
To illustrate the continuous action recognition functionality of a system that can recognize interactions, an experiment was set up as follows: the algorithm was trained using the action videos depicting seven persons of the i3DPost action dataset, including the two interactions (shake hand and pull down), using a $13 \times 13$ lattice topology and the Bayesian framework. The original action videos depicting the side views of the eighth person performing these interactions were tested using a sliding window of $N_{t_w} = 21$ video frames. Figure 14 illustrates qualitative results of this procedure. When the two persons were separated, each person was tracked in subsequent frames using a closest-area blob tracking algorithm and the binary images depicting each person were fed to the algorithm for action/interaction recognition. When the two persons interacted, the whole binary image was introduced to the algorithm for recognition. In order to use the remaining cameras, more sophisticated blob tracking methods could be used [33], [34]. As can be seen, the proposed method can be extended to recognize interactions in a continuous recognition setup.
TABLE III
CONFUSION MATRIX FOR EIGHT ACTIONS AND TWO INTERACTIONS (empty cells are zero).

     wk    rn    jp    jf    bd    hs    pl    st    fl    wo
wk   0.95  0.05
rn   0.05  0.95
jp               0.81  0.10                    0.02        0.07
jf               0.05  0.95
bd                           1
hs                                 1
pl                                       1
st               0.13                          0.87
fl                                                   1
wo               0.03  0.05                                0.92
Fig. 14. Continuous recognition of human interactions.
J. Comparison against other methods
In this section, we compare the proposed method with state-of-the-art methods, recently proposed in the literature, aiming at view-independent action recognition using multi-camera setups. Table IV illustrates comparison results with three methods, evaluating their performance on the i3DPost multi-view action recognition database using all the cameras constituting the database camera setup. In [35], the authors performed the LOOCV procedure on an action class set consisting of the actions walk, run, jump in place, jump forward and bend. The authors in [36] included the action wave one hand in their experimental setup and performed the LOOCV procedure using six actions; they also removed the action run in order to perform the LOOCV procedure using five actions. Finally, the authors in [37] applied the LOOCV procedure using all eight actions appearing in the database. As can be seen in Table IV, the proposed method outperforms all the aforementioned methods.
TABLE IV
COMPARISON RESULTS ON THE I3DPOST MULTI-VIEW ACTION RECOGNITION DATABASE.

                  5 actions   8 actions   5 actions   6 actions
Method [35]       90%         -           -           -
Method [37]       -           90.88%      -           -
Method [36]       -           -           97.5%       89.58%
Proposed method   94.4%       94.87%      97.8%       95.33%
In order to compare our method with other methods using the IXMAS action recognition dataset, we performed the LOOCV procedure using the binary images provided with the database. In an off-line procedure, each image sequence was split into smaller segments, in order to produce action videos. Subsequently, the LOOCV procedure was performed using different SOM topologies and the Bayesian framework approach. Using a $13 \times 13$ SOM, an action recognition rate equal to 89.8% was obtained. Table V illustrates comparison results with three methods evaluating their performance on the IXMAS multi-view action recognition database. As can be seen, the proposed method outperforms these methods, providing up to 8.5% improvement in action classification accuracy.
TABLE V
COMPARISON RESULTS ON THE IXMAS MULTI-VIEW ACTION RECOGNITION DATABASE.

Method [38]   Method [39]   Method [40]   Proposed method
81.3%         81%           80.6%         89.8%
IV. CONCLUSION

A very powerful framework based on Artificial Neural Networks has been proposed for action recognition. The proposed method highlights the strength of ANNs in representing and classifying visual information. An SOM-based human body posture representation is combined with Multilayer Perceptrons. Action and viewing angle classification are achieved independently for all cameras. A Bayesian framework is exploited in order to provide the optimal combination of the action classification results coming from all available cameras. The effectiveness of the proposed method in challenging problem setups has been demonstrated by experimentation. To the authors' knowledge, there is no other method in the literature that can deal with all the presented challenges in action recognition. Furthermore, it has been shown that the same framework can be applied to human interaction recognition between persons, without any modification.
ACKNOWLEDGMENT

The research leading to these results has received funding from the Collaborative European Project MOBISERV FP7-248434 (http://www.mobiserv.eu), An Integrated Intelligent Home Environment for the Provision of Health, Nutrition and Mobility Services to the Elderly.
REFERENCES

[1] L. Weilun, H. Jungong, and P. With, "Flexible human behavior analysis framework for video surveillance applications," International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 920121, 9 pages, 2010.
[2] P. Barr, J. Noble, and R. Biddle, "Video game values: Human-computer interaction and games," Interacting with Computers, vol. 19, no. 2, pp. 180-195, Mar. 2007.
[3] B. Song, E. Tuncel, and A. Chowdhury, "Towards a multi-terminal video compression algorithm by integrating distributed source coding with geometrical constraints," Journal of Multimedia, vol. 2, no. 3, pp. 9-16, 2007.
[4] T. Höllerer, S. Feiner, D. Hallaway, B. Bell, M. Lanzagorta, D. Brown, S. Julier, Y. Baillot, and L. Rosenblum, "User interface management techniques for collaborative mobile augmented reality," Computers and Graphics, vol. 25, no. 5, pp. 799-810, Oct. 2001.
[5] S. Ali and M. Shah, "Human action recognition in videos using kinematic features and multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 288-303, Feb. 2010.
[6] J. Hoey and J. Little, "Representation and recognition of complex human motion," in Proceedings of IEEE Conference on Computer Vision, vol. 1, 2000, pp. 752-759.
[7] Y. Wang and G. Mori, "Hidden part models for human action recognition: Probabilistic versus max margin," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1310-1323, July 2011.
[8] H. J. Seo and P. Milanfar, "Action recognition from one example," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 867-882, May 2011.
[9] M. Giese and T. Poggio, "Neural mechanisms for the recognition of biological movements," Nature Reviews Neuroscience, vol. 4, no. 3, pp. 179-192, Mar. 2003.
[10] N. Gkalelis, A. Tefas, and I. Pitas, "Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1511-1521, Nov. 2008.
[11] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2247-2253, 2007.
[12] A. Iosifidis, A. Tefas, and I. Pitas, "Activity based person identification using fuzzy representation and discriminant learning," IEEE Transactions on Information Forensics and Security, no. 99.
[13] M. Ursino, E. Magosso, and C. Cuppini, "Recognition of abstract objects via neural oscillators: interaction among topological organization, associative memory and gamma band synchronization," IEEE Transactions on Neural Networks, vol. 20, no. 2, pp. 316-335, 2009.
[14] M. Schmitt, "On the sample complexity of learning for networks of spiking neurons with nonlinear synaptic interactions," IEEE Transactions on Neural Networks, vol. 15, no. 5, pp. 995-1001, 2004.
[15] B. Ruf and M. Schmitt, "Self-organization of spiking neurons using action potential timing," IEEE Transactions on Neural Networks, vol. 9, no. 3, pp. 575-578, 1998.
[16] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings of International Conference on Pattern Recognition, vol. 3, 2004, pp. 32-36.
[17] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proceedings of IEEE International Conference on Computer Vision, vol. 1, 2007.
[18] A. Iosifidis, A. Tefas, N. Nikolaidis, and I. Pitas, "Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis," Computer Vision and Image Understanding, 2011.
[19] S. Yu, D. Tan, and T. Tan, "Modeling the effect of view angle variation on appearance-based gait recognition," in Proceedings of Asian Conference on Computer Vision, vol. 1, Jan. 2006, pp. 807-816.
[20] O. Masoud and N. Papanikolopoulos, "A method for human action recognition," Image and Vision Computing, vol. 21, no. 8, pp. 729-743, Aug. 2003.
[21] M. Ahmad and S. Lee, "Human action recognition using shape and CLG-motion flow from multi-view image sequences," Pattern Recognition, vol. 41, no. 7, pp. 2237-2252, July 2008.
[22] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249-257, Nov./Dec. 2006.
[23] M. Ahmad and S. Lee, "HMM-based human action recognition using multiview image sequences," in Proceedings of IEEE International Conference on Pattern Recognition, vol. 1, 2006, pp. 263-266.
[24] F. Qureshi and D. Terzopoulos, "Surveillance camera scheduling: A virtual vision approach," in Proceedings of Third ACM International Workshop on Video Surveillance and Sensor Networks, vol. 12, Nov. 2005, pp. 269-283.
[25] Y. Benezeth, P. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, "Review and evaluation of commonly-implemented background subtraction algorithms," in 19th International Conference on Pattern Recognition (ICPR 2008), 2008, pp. 1-4.
[26] M. Piccardi, "Background subtraction techniques: a review," in IEEE International Conference on Systems, Man and Cybernetics, vol. 4, 2005, pp. 3099-3104.
[27] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
[28] S. Haykin, Neural Networks and Learning Machines. Upper Saddle River, New Jersey, 2008.
[29] Y. Le Cun, "Efficient learning and second order methods," tutorial presented at Neural Information Processing Systems, vol. 5, 1993.
[30] P. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard University, 1974.
[31] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[32] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, "The i3DPost multi-view and 3D human action/interaction database," in 6th Conference on Visual Media Production, Nov. 2009, pp. 159-168.
[33] N. Papadakis and A. Bugeau, "Tracking with occlusions via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 144-157, 2010.
[34] O. Lanz, "Approximate Bayesian multibody tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1436-1449, 2006.
[35] N. Gkalelis, N. Nikolaidis, and I. Pitas, "View independent human movement recognition from multi-view video exploiting a circular invariant posture representation," in IEEE International Conference on Multimedia and Expo, 2009, pp. 394-397.
[36] M. B. Holte, T. B. Moeslund, N. Nikolaidis, and I. Pitas, "3D human action recognition for multi-view camera systems," in First Joint 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIM/3DPVT) Conference, 2011.
[37] A. Iosifidis, N. Nikolaidis, and I. Pitas, "Movement recognition exploiting multi-view information," in International Workshop on Multimedia Signal Processing, 2010, pp. 427-431.
[38] D. Weinland, E. Boyer, and R. Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proceedings of International Conference on Computer Vision, 2007, pp. 1-7.
[39] D. Tran and A. Sorokin, "Human activity recognition with metric learning," in Computer Vision - ECCV 2008, 2008, pp. 548-561.
[40] F. Lv and R. Nevatia, "Single view human action recognition using key pose matching and Viterbi path searching," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.