Faculty of Engineering
Master Degree in Artificial Intelligence and Robotics
Person-tracking and gesture-driven interaction with a mobile robot using the Kinect sensor
Supervisor: Prof. Luca Iocchi
Candidate: Taigo Maria Bonanni
Academic Year 2010/2011
After having introduced various feasible solutions for the target representation, we now describe a set of possible answers to the second question: Which image features should be used? The choice of the features that describe the target is a key point in the implementation of a tracker: on the one hand, the features should be chosen with respect to the target representation used; on the other hand, they should be chosen for their uniqueness, so that the target can be easily detected in the feature space. As for the target representation, in the following we propose some well-known solutions:
Color: it provides relevant information for the recognition of the target, usually coupled with a histogram-based representation. There are different color spaces, such as RGB, HSV and HSL; the choice among them depends on their robustness against changes in both illumination and surface orientation of the target (especially for geometrically complex shapes);
Texture: it describes target properties, such as regularity and smoothness, by measuring the intensity variations of a surface. The target is partitioned into a mosaic of different texture regions, which can be used for information search and retrieval. Compared to color features, textures are less sensitive to changes in light conditions;
Edges: target boundaries generate strong changes in the intensity of an image, and these changes are identified through edge detection. Like textures, edges are less sensitive to illumination changes than color features. Edges are also a good feature selection when tracking the boundaries of the target;
Optical Flow: it provides a dense set of motion vectors defining the translation of the pixels in a region; for each pixel in a frame, optical flow associates a vector pointing towards the position of the same pixel in the next frame. This association is performed using a brightness constraint, assuming that the intensity of corresponding pixels is constant in consecutive frames. This feature is commonly used for motion-based segmentation and tracking applications.
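As mentioned above, the color feature is usually coupled with a histogram-based representation. The following minimal sketch (plain Python, with an illustrative bin count) quantizes RGB pixels into a normalized histogram and compares two histograms by intersection:

```python
def color_histogram(pixels, bins=4):
    """Quantize RGB pixels (values 0-255) into a normalized
    bins x bins x bins histogram, flattened into a single list."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins  # width of each quantization interval
    q = lambda c: min(c // step, bins - 1)  # channel value -> bin index
    for r, g, b in pixels:
        hist[q(r) * bins * bins + q(g) * bins + q(b)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A tracker would build the model histogram from the target region once, then score candidate regions in later frames by their intersection with it.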
2.3.3 Object Detection
At this point, a tracking algorithm requires a method to detect the target. To this end, we can distinguish two approaches: either the detection is based on the information extracted from a single frame, or it relies on temporal information obtained by analysing sequences of frames; the second case is more complex, but also more robust and reliable than the first one, reducing the chances of false detections. The simplest way to extract temporal information is to compare two consecutive frames, highlighting all the regions that differ (a procedure called frame differencing); then, the tracker (see Section 2.3.4) matches the correspondences of the target from one frame to the following one.
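Frame differencing can be sketched in a few lines, assuming grayscale frames stored as 2D lists (the threshold value is illustrative):

```python
def frame_difference(prev, curr, threshold=25):
    """Binary mask of the regions that differ between two consecutive
    grayscale frames: 1 where the intensity change exceeds `threshold`."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
        for prow, crow in zip(prev, curr)
    ]
```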
Point Detectors: used to find interest points in the frames, like the corners of objects, showing a meaningful texture. These points of interest should be invariant with respect to both the pose of the camera and changes in light conditions. Two examples of point detectors are the Harris Corner Detection algorithm (Harris and Stephens, 1988), an improvement of Moravec's interest operator presented in Moravec (1979), and the SIFT detector (Lowe, 2004);
Supervised Learning: the system learns to detect the target using training sets composed of different views of the same object. Given this set, supervised-learning algorithms compute a matching function mapping the input to the desired output. In the object detection scenario, training samples consist of pairs of object features associated with a manually defined object class. Feature selection is critical for achieving a good classification, hence the choice should be made in such a way that the features discriminate each class from the others;
Background Subtraction: the detection is performed by building a representation of the scene, called the background model, and then, for each image, looking for differences from that model: relevant changes, as opposed to small changes that may depend on noise, identify a moving object. The modified regions are then clustered, if possible, into connected components corresponding to the target. Background subtraction can be performed in several ways, for example using color-based or spatial-based information about the scene;
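A minimal background-subtraction sketch maintains a per-pixel running-average model of grayscale frames; the learning rate and threshold below are illustrative values, not taken from this work:

```python
class BackgroundModel:
    """Per-pixel running-average background model; pixels far from the
    model are flagged as foreground."""
    def __init__(self, first_frame, alpha=0.05, threshold=30):
        self.bg = [row[:] for row in first_frame]  # copy the first frame
        self.alpha = alpha          # learning rate of the model
        self.threshold = threshold  # intensity difference for foreground

    def apply(self, frame):
        mask = []
        for i, row in enumerate(frame):
            mrow = []
            for j, v in enumerate(row):
                mrow.append(1 if abs(v - self.bg[i][j]) > self.threshold else 0)
                # update the model slowly, so gradual changes are absorbed
                self.bg[i][j] = (1 - self.alpha) * self.bg[i][j] + self.alpha * v
            mask.append(mrow)
        return mask
```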
Segmentation: in this approach, the frame is segmented into regions which are perceived as similar. The goal is to simplify how the image is represented, into a form which is easier to analyse. Once the pixels are clustered into regions, the target can be located by searching for particular features, such as color intensities, textures or edges.
2.3.4 Object Tracking
This represents the last step in the implementation of a tracking algorithm; the goal of a tracker is to locate the position of the target in every frame. In this section, we finally answer the last question proposed: How should the motion, appearance and shape of the object be modeled? This last step can be performed in two different ways: in the first case, for each frame, the detection phase returns possible target regions and the tracker matches the target in the image; in the second case, target regions and their correspondences are estimated directly, updating the location from the previous frame. In both cases, the model representing the target constrains the type of motions that can be applied to it. For example, if the target is described using a point, then only a translational motion can be considered, while more complex representations of the target lead to a more accurate description of its motion.
Point Tracking: the target detected in consecutive frames is described using significant points; the association of these points with the target is based on the state of the previous frame, which can include target position and motion. This approach requires an external object detector to locate the targets in every frame;
Kernel Tracking : the target is represented through a rectangular or an
elliptical model, also called kernel. Objects are tracked by computing
the motion of the kernel in consecutive frames;
Silhouette Tracking: this can be considered a particular form of object segmentation because, once the model is computed, the silhouette is tracked by either shape matching or contour evolution. A silhouette-based target tracker looks for the object region in each frame, using a model generated from the previous frames through color histograms, object edges or the object contour.
2.4 Gesture Recognition
Gesture recognition is a relevant topic in both language technology and computer science, whose aim is to understand human gestures through different possible approaches, presented further on. We define a gesture (Mitra and Acharya, 2007) as a meaningful motion physically executed by, for example, the face, head, hands, arms or body. The importance of defining systems capable of understanding gestures, performed by one or more users, lies in what they represent for us: an innate and simple means of communication, by which we can easily express significant information and interact with the environment; hence, gesture recognition is needed to process information not conveyed through more common means such as speech.
Gesture recognition is the cornerstone of a wide variety of applications (Lisetti and Schiano, 2000), for example in the following fields:
Sign language recognition: design of techniques for translating the symbols expressed by sign language into text (analogous to speech recognition tools for computers);
Virtual and Remote control: gestures represent an alternative means for controlling systems, for example to select content on a television or to manipulate a virtual environment;
Video games: players' gestures are used within video games, instead of keyboards and other devices, to offer a more entertaining and interactive experience;
Patient rehabilitation: robots assist patients, for example in posture rehabilitation, by analysing the readings of sensors installed on particular suits the patients wear;
Human-robot and Human-computer interaction: in the former, gestures are used to command a robot, and more generally to influence its behaviour, or to interact with it as a peer; in the latter, gestures substitute common input devices such as keyboard and mouse.
The main issue to face in gesture recognition is the intrinsic ambiguity of the gestures humans perform, which may depend on different languages or cultures, or on the particular domain of application. For example, we can enumerate at least three different ways to perform a "stop" gesture: closing the hand in a fist, waving both hands over the head or raising a hand with the palm facing forward. Furthermore, similar to handwriting and speech, gestures are usually performed differently by different individuals, and even by the same individual across different instances. Moreover, gestures can be static, in which case we speak of posture recognition, or dynamic, consisting of three phases called respectively pre-stroke, stroke and post-stroke. In some domains, such as sign language recognition, gestures can be made of both static and dynamic elements.
Gestures can be classified into three main categories, closely related to the field of application:
• Hand and arm gestures : recognition of hand poses and sign languages;
• Head and face gestures : recognition of head-related motions, such as:
a) nodding or shaking of head; b) direction of eye gaze; c) raising the
eyebrows; d) opening the mouth to speak; e) winking; f) flaring the
nostrils; g) expression of emotions;
• Body gestures: estimation of full body motion, as in: a) tracking movements of people interacting; b) navigation of virtual environments; c) body-pose analysis for medical rehabilitation and athletic training.
Obviously, gesture recognition needs a sensing subsystem perceiving body position, orientation, configuration and movements in order to accomplish its goal. These perceptions are usually acquired either through gestural interfaces or using video sensors. Regardless of how the acquisition of meaningful data is performed, gesture recognition can be implemented through several alternative techniques, presented in the following sections.
2.4.1 Hidden Markov Model
The HMM is a statistical model in which the system modeled is a Markov process with hidden states. The main difference between a regular Markov model and a hidden Markov model lies in observability: in the former, the state is visible to the observer, and therefore the state transition probabilities are the only parameters; in the latter, only the output, dependent on the state, is visible, and each state is characterized by a probability distribution over the possible output tokens. Transitions between states are represented by a pair of probabilities, defined as follows:
1. Transition probability, providing the probability of undergoing the transition;
2. Output probability, defining, given a state, the conditional probability of emitting each symbol from a finite alphabet.
A generic HMM λ = (A, B, Π), shown in Figure 2.3, is described as follows:
• a sequence of observations O = O_1, ..., O_T, observed at times t = 1, ..., T;
• a set of N states s_1, ..., s_N;
• a set of k discrete observation symbols v_1, ..., v_k;
• a state-transition matrix A = {a_ij}, where a_ij is the transition probability from state s_i at time t to state s_j at time t + 1:
a_ij = P(s_j at t + 1 | s_i at t), for 1 ≤ i, j ≤ N
• an observation symbol probability matrix B = {b_jk}, where b_jk is the probability of emitting symbol v_k from state s_j;
• an initial probability distribution for the states:
Π = {π_j}, j = 1, 2, ..., N, where π_j = P(s_j at t = 1)
Figure 2.3: HMM for gesture recognition composed of five states
Each HMM is built to recognize a single gesture, involving elegant and efficient algorithms to perform the following steps:
1. Evaluation: determines the probability that the observed sequence is generated by the HMM, using the Forward-Backward algorithm;
2. Training: adjusts the parameters to refine the model, using the Baum-Welch algorithm;
3. Decoding: recovers the sequence of states, using the Viterbi algorithm.
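The evaluation step can be illustrated with a minimal forward pass (plain Python; A, B and pi follow the definitions above, with observations given as symbol indices):

```python
def forward(obs, A, B, pi):
    """Evaluation: probability that the HMM (A, B, pi) generated the
    observation sequence `obs`, via the forward pass of the
    Forward-Backward algorithm."""
    N = len(pi)
    # initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    # induction over the remaining observations
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
            for j in range(N)
        ]
    # termination: sum over the final states
    return sum(alpha)
```

A recognizer would evaluate every model λ_i on the observed sequence and select the gesture whose HMM yields the highest probability.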
A global gesture recognition system consists of a set of HMMs (λ_1, λ_2, ..., λ_M), where λ_i is the HMM model for a generic gesture and M is the total number of gestures to be recognized. Yamato et al. (1992) is the first work addressing this problem, using a discrete HMM to recognize six classes of tennis strokes. In Starner and Pentland (1995) and Weaver et al. (1998), an HMM-based, real-time system is presented to recognize sentence-level American Sign Language, without using an explicit model of the fingers.
2.4.2 Finite State Machine
Gestures are modeled through FSMs as ordered state sequences in a spatio-temporal configuration space. The number of states composing the FSM varies among the different recognizers, depending on the complexity of the gestures performed by the users. Gestures, represented through sets of points (e.g. sampled positions of the hand, head or body) in a 2D plane, are recognized as a trajectory from a continuous stream of sensor data constituting an ensemble of trajectories. The training of the model is performed off-line, using data sets as rich as possible in order to derive and refine the parameters for each state of the FSM. Once trained, the finite state machine can also be used for real-time gesture recognition. When the user performs a gesture, the recognizer decides whether to remain in the current state of the FSM or jump to the next state, based on the parameters of the input; if the recognition system reaches the final state of the FSM, then the gesture performed by the user has been recognized. The state-based representation can be extended to accommodate multiple models for the representation of different gestures, or even different phases of the same gesture. Membership in a state is determined by how well the state models can represent the current observation.
Davis and Shah (1994) presented an FSM model-based approach to recognize hand gestures, modeling four distinct phases of a generic gesture, switching between static positions and motion of hand and fingers. Gesture recognition is based on the hand vector displacement between the input and the reference gestures. Hong et al. (2000) presented another FSM-based approach for gesture learning and recognition: each gesture is described by an ordered state sequence, using spatial clustering and temporal alignment. First, state machines are trained using a training set of images for each gesture; then, the system is used to recognize gestures from an unknown input image sequence. In Yeasin and Chaudhuri (2000), a user performs gestures in front of a camera. The gesture is executed from an arbitrary spatio-temporal configuration and its trajectory is continuously captured by the sensor; then, the acquired data are temporally segmented into subsequences characterized by uniform dynamics along single directions, so that meaningful gestures may be defined as sequences of elementary directions. For example, the simple sequence right-left-right-left can represent a waving gesture.
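The direction-sequence idea can be sketched as a tiny state machine; the states below are written by hand for illustration, whereas a real recognizer would derive them from training data:

```python
class WaveFSM:
    """FSM recognizer for a waving gesture encoded as the
    elementary-direction sequence right-left-right-left."""
    SEQUENCE = ["right", "left", "right", "left"]

    def __init__(self):
        self.state = 0  # index of the next expected direction

    def step(self, direction):
        """Feed one segmented direction; return True when the final
        state is reached (the gesture is recognized)."""
        if direction == self.SEQUENCE[self.state]:
            self.state += 1   # advance to the next state
        elif direction == self.SEQUENCE[0]:
            self.state = 1    # restart from a matching first move
        else:
            self.state = 0    # reset on any other direction
        if self.state == len(self.SEQUENCE):
            self.state = 0
            return True
        return False
```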
2.4.3 Particle Filtering
Particle filters are sophisticated model estimation techniques based on simulation, usually used to estimate Bayesian models where the latent, or hidden, variables are connected in a Markov chain, but where the state space of the latent variables is typically continuous rather than discrete. Filtering refers to determining the distribution of the hidden variables at a specific time, considering all the observations up to that time; particle filters are so named because they perform approximate "filtering" using a set of "particles" (differently-weighted samples of the distribution). As an alternative to the Extended Kalman filter (EKF) or the Unscented Kalman filter (UKF), particle filters offer better accuracy than those approaches, given a sufficient number of samples. The key idea for estimating the state of a dynamic system from sensor readings is to represent probability densities by sets of samples. As a result, particle filters can represent a wide range of probability densities, allowing real-time estimation of non-linear, non-Gaussian dynamic systems (Arulampalam et al., 2001). The state of a tracked object at time t is described by a vector X_t, while the vector Y_t represents all the observations y_1, y_2, ..., y_t. The probability density distribution is approximated
by a weighted sample set S_t = { ⟨x_t^(i), w_t^(i)⟩ | i = 1, ..., N_p }. Here, each sample x_t^(i) represents a hypothetical state of the target, and w_t^(i) represents the corresponding discrete sampling probability of the sample x_t^(i), such that:

∑_{i=1}^{N_p} w_t^(i) = 1
The evolution of the sample set is described iteratively by propagating each sample according to a motion model. Each sample is weighted in terms of the observations, and N_p samples are drawn with replacement by choosing a particular sample with posterior probability w_t^(i) = P(y_t | X_t = x_t^(i)). At each iteration, the mean state of the object is estimated as:

E(S_t) = ∑_{i=1}^{N_p} w_t^(i) x_t^(i)
Since particle filters model uncertainty through the posterior probability density, this approach provides a robust tracking framework suitable for gesture recognition systems. For example, Black and Jepson (1998) presented a mixed-state condensation algorithm, based on particle filtering, to recognize a number of different gestures by analysing their temporal trajectories.
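A single propagate-weight-estimate-resample iteration can be sketched for a 1D state (the motion and observation noise levels are illustrative, not taken from any cited system):

```python
import math
import random

def particle_filter_step(particles, motion, measurement,
                         noise_std=1.0, meas_std=2.0):
    """One iteration of a 1D particle filter: propagate the samples,
    weight them by the observation likelihood, estimate the mean state,
    then resample with replacement."""
    n = len(particles)
    # 1. propagate each sample through a noisy motion model
    particles = [x + motion + random.gauss(0.0, noise_std) for x in particles]
    # 2. weight each sample by a Gaussian observation likelihood
    weights = [math.exp(-0.5 * ((x - measurement) / meas_std) ** 2)
               for x in particles]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]  # enforce sum_i w_t^(i) = 1
    # 3. mean state E(S_t) = sum_i w_t^(i) * x_t^(i)
    estimate = sum(w * x for w, x in zip(weights, particles))
    # 4. resample with replacement, proportionally to the weights
    particles = random.choices(particles, weights=weights, k=n)
    return particles, estimate
```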
2.4.4 Soft Computing Approaches
Soft computing is a set of techniques providing adaptable information-processing capability to handle real-life ambiguous situations. It aims to exploit the tolerance for imprecision, uncertainty, approximate reasoning and partial truth in order to achieve tractability, robustness and low-cost solutions. Sensor outputs are often affected by inherent uncertainty; relevant, sensor-independent, invariant features are extracted from these outputs, followed by gesture classification. Recognition systems may be designed to be fully trained before use, or may adapt dynamically to the current user. Soft computing tools, such as fuzzy sets, artificial neural networks (ANNs), time-delay neural networks (TDNNs) and others, exhibit overall good performance in handling these issues. In particular, the flexible nature of ANNs enables connectionist approaches to incorporate learning in data-rich environments. This characteristic, coupled with the robustness of the approach, is useful for developing recognition systems.
Yang and Ahuja (1998) is an example of a TDNN-based approach for hand gesture recognition of American Sign Language. Rowley and Kanade (1998) and Tian et al. (2001) are two multilayer-perceptron-based approaches, respectively for face detection and facial expression analysis, used in face gesture recognition.
Part II
Implementation
Chapter 3
Design and System Architecture
3.1 Introduction
In Chapter 1 we described the motivations of this work: the definition of a new kind of social robot, based on a novel human-robot interaction paradigm which reduces the effort required of the human, in terms of both the knowledge and the skills the user has to exhibit; that is, a robot accessible and easy to use for everyone, not only for system experts. In our view, the best way to achieve this goal is to rely on a means of communication natural for everyone: gesturing. Clearly, the key point is to find an approach to the gesture recognition problem as simple as possible, in order to guarantee the simplicity we are looking for; hence, with our solution we provide the implementation of a simple yet robust vision-based, gesture-driven interaction, which does not require any graspable interface, allowing a human operator to interact with the robot as he would with another person. Being vision-based, our architecture also has to support the capability to identify the human the robot is interacting with in three-dimensional space, following him over time while waiting for a gesture to be performed.
Summarizing, for our "human-friendlier" robot we provide a feasible solution to two different problems:
Person-Tracking: the robot detects and tracks its target, in our case a human, keeping him at the center of the camera's frame while following his movements, waiting for possible gestures to recognize;
Gesture-based Interaction: when the target performs a gesture, the robot modifies its behaviour according to the recognized gesture.
We describe our solutions to the Person-Tracking and Gesture-based Interaction problems in Chapter 4 and Chapter 5, respectively.
Figure 3.1: Complete schema of the application.
The application, whose diagram is shown in Figure 3.1, follows these steps:
• a reference model of the target is defined;
• tracking is enabled by default, hence the target is tracked over time by actuating the Kinect;
• if a particular gesture, defined as the switch gesture, is performed, the gesture-driven interaction subsystem is enabled, the robot is re-oriented according to the final position of the Kinect and the sensor is no longer actuated;
• the robot performs actions according to the gestures performed by the user;
• if the switch gesture is executed again, the Kinect is actuated again to perform tracking, while the interaction subsystem is paused.
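The steps above can be condensed into a control-loop sketch; all the callables are hypothetical placeholders standing in for the tracking and gesture-recognition subsystems described in Chapters 4 and 5:

```python
def run_application(frames, detect_target, track, recognize_gesture, act,
                    is_switch_gesture):
    """Control loop of Figure 3.1: track the target while tracking is
    enabled, and toggle between tracking and gesture-driven interaction
    whenever the switch gesture is recognized."""
    tracking_enabled = True  # tracking is enabled by default
    for frame in frames:
        if tracking_enabled:
            target = detect_target(frame)
            track(target)  # actuate the Kinect to keep the target centered
        gesture = recognize_gesture(frame)
        if gesture is None:
            continue  # no gesture in this frame
        if is_switch_gesture(gesture):
            tracking_enabled = not tracking_enabled  # swap subsystems
        else:
            act(gesture)  # robot behaviour driven by the gesture
```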
Further on, we present the different components our system is made of.
In Section 3.2, we discuss in detail all the devices composing the hardware,
shown in Figure 3.2, while in Section 3.3 we present the software for control-
ling our platform.
Figure 3.2: A view of the system architecture, composed of Erratic, Kinect and a Pan-Tilt Unit.
3.2 Hardware Components
As previously mentioned, in this section we present the hardware of our platform, along with the reasons for our choices: the first part addresses the robotic platform, the second provides a detailed description of the sensor for our vision-based, gesture-driven interaction, and the last presents the device used to actuate the sensor.
3.2.1 Erratic Robot
The Erratic, abbreviated ERA, is a differential-drive mobile robotic platform, named after the Latin word errare (which means to wander). The ERA, shown in Figure 3.3, is a versatile and powerful system, capable of carrying a wide payload of robotic components; equipped with an on-board PC, it supports a full range of different sensors, including sonars, laser rangefinders, IR floor sensors, stereo cameras and pan-tilt units. However, the robotic platform is not the most critical choice for our system with respect to the achievement of our goal. We chose the Erratic because it is suitable for indoor structured environments and robust enough for standard tasks such as social robotics, patrolling, surveillance and security, but we could have used several other robots, such as the Magellan or the Pioneer, which are equivalent to the one we used.
3.2.2 Kinect Sensor
The Kinect (Figure 3.4) is a commercial off-the-shelf device by Microsoft for the Xbox 360 console, a technological breakthrough that brought the gaming experience to a completely new level (as this thesis and other works prove, it is also useful for purposes other than entertainment). It is, alongside the well-known Wii Remote and other devices, a so-called Multi-modal Interface, which can be thought of as a multi-purpose bundle of hardware, consisting of different sensors for data acquisition; in this case, the Kinect features an RGB camera, a depth sensor and a multi-array microphone.

Figure 3.3: A view of the ERA equipped with a Hokuyo URG Laser.

Through these components, the device offers players a new kind of interaction: a more natural interface based on user motion, gestures and speech recognition.
The success of the Kinect, both in the video game market and in HRI research, can be explained by two different reasons:
• Thanks to its capabilities, namely gesture and speech recognition together with motion capture of multiple users, it represents a technological milestone, presented as a consumer-level product;
• It constitutes a completely new type of user interface, which allows the human, on the one hand, to interact with the robotic system as with another person and, on the other hand, to keep his hands free for other interfaces, in the pursuit of more complex ways of interaction requiring the manipulation of a large amount of different data.
Figure 3.4: A view of the Kinect.
RGB Camera
The RGB device installed in the Kinect is a traditional mono-camera, similar to those used in webcams and mobile phones, capable of VGA resolution (640x480 pixels) at 30 frames per second.
Depth Sensor
The depth sensor is the most important device featured by the Kinect and the main reason for its success. Based on the technology of a range camera developed by PrimeSense, an Israeli company committed to the research and development of control systems that require no graspable devices, it consists of two different components: an infrared laser transmitter and a monochrome CMOS receiver. The former projects a pattern of infrared beams towards the environment (see Figure 3.5 and Figure 3.6); the latter captures the reflected rays and, from the deformation of the observed pattern (a structured-light technique), computes the depth of the 3D scene, providing a high-quality reconstruction of it. Furthermore, it is very important to point out that the sensor is capable of computing depth data regardless of the visible ambient light conditions, even in pitch black.
Figure 3.5: The infrared ray projection on the scene, recognizable by the bright dots, which also identifies the field of view of the Kinect.
Microphone Array
The microphone array consists of four microphone capsules, each channel processing 16-bit audio at a sampling rate of 16 kHz; it is used to calibrate the environment through the analysis of the sound reflections on walls and objects.
3.2.3 Pan-Tilt Unit
The pan-tilt unit (Figure 3.7) is a system used to supply motion to the sensors installed upon it, usually stereo or mono cameras. Despite its simplicity (it consists of a small chassis with two actuators), this device is extremely useful. To understand its importance, we provide the following example. Think of a mobile robot, equipped with a camera, patrolling an environment in which the mobility of the platform is reduced (e.g. by debris or a crowd); now, let us define the task the robot has to accomplish, which is to perform data acquisition of the surroundings.

Figure 3.6: View of the projection pattern of the laser transmitter.

At this point, we assume the robot cannot move: if it is provided with a pan-tilt, the sensor can be moved independently of the motion of the platform, so the task will be accomplished; otherwise, since the motion of the camera is dependent on that of the robot, the camera will not move and the task will not be completed.
Pan-tilt units, whose usefulness we hope to have convinced the reader of, provide two additional degrees of freedom to the sensor installed upon them, through the following movements:
Pan motion: rotation on the horizontal plane, also known as the panning plane, analogous to the yaw rotation of an aircraft;
Tilt motion: rotation on the vertical plane, called the tilting plane, similar to the pitch rotation of an aircraft.
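As a sketch of how such a unit is driven during tracking, the pixel offset of the target from the image center can be mapped to pan and tilt corrections; the frame size and field of view below are nominal Kinect values, assumed here purely for illustration:

```python
def centering_offsets(target_px, frame_size=(640, 480),
                      fov_deg=(57.0, 43.0)):
    """Convert the target's pixel position into pan and tilt corrections
    (in degrees) that re-center it in the frame."""
    w, h = frame_size
    fov_h, fov_v = fov_deg
    x, y = target_px
    # offset from the image center, normalized to [-0.5, 0.5]
    dx = (x - w / 2) / w
    dy = (y - h / 2) / h
    pan = dx * fov_h    # positive: rotate right
    tilt = -dy * fov_v  # positive: look up (image y grows downward)
    return pan, tilt
```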
Figure 3.7: Pan-Tilt system equipped on our ERA.
The reason we used a pan-tilt unit in our system is the need to decouple the motion of the sensor from the movement of the robot. Through the actuated Kinect we can track the human, keeping him in the center of the sensor's reference frame, while the robot roams the scene for other purposes, for example moving in circles around the target to mark him. Although the sensor has its own motorized pivot, we used an external pan-tilt for two distinct reasons: on the one hand, the impossibility of performing a movement on the panning plane, since the pivot provides motion only on the tilting plane; on the other hand, the limitations of the framework used to communicate with the Kinect, which does not support pivot control.
3.3 Software Components
In this section, we present the different software components used to control our system: Player is a low-level framework used to control both the robotic platform and the pan-tilt unit, OpenNI is one of the best SDKs available to communicate with the Kinect, and NITE is a powerful middleware, fully integrable in OpenNI, used for the gesture recognition part.
3.3.1 Player
Player1 is a widely known framework which provides a simple interface for the control of robotic platforms, both real and simulated (in the latter case it is used alongside Stage or Gazebo, respectively a 2D and a 3D multi-robot simulator). Based on the client/server paradigm, Player accepts control software modules written in any programming language, as long as TCP sockets are supported, and these can be executed on any computer connected to the robot to be controlled.
It supports a wide range of robots (e.g. Roomba, Erratic, Magellan, Pioneer and many others) and plenty of different sensors (e.g. sonars, lasers, infrared transmitters/receivers). On the server side, Player communicates with the devices by means of predefined drivers, providing the client with simple and reusable interfaces, called proxies. This feature guarantees complete portability of the clients across any supported robot, equipped with any supported sensor.
For example (see Figure 3.8 and Figure 3.9), Player's server may run on a Magellan robot equipped with a SICK LMS-200 laser, while the client simply accesses two proxies, one called laser and the other called position, the latter referring to the mobile robot base. Thanks to the portability offered by the framework, the same client could be used for an Erratic robot equipped with a Hokuyo URG laser, because the difference in mobile base and sensor is handled on the server side by Player, which provides the client with the same interfaces named previously.
1 http://playerstage.sourceforge.net/
The low-level control of a robot relies on the motherboard and its controller, which reads data acquired by the sensors (e.g. through a USB connection) and sends commands to the actuators; the high-level control, provided by the Player server, is performed using proxies like the following:
Figure 3.8: Two examples of possible connections with two different robots: (a) Player connected to an Erratic robot; (b) Player connected to a Magellan robot. In both cases the application communicates with the Player client through a proxy, the client with the Player server through a TCP connection, and the server with the hardware through drivers. It is worth noting that, on the client side, the interface provided is the same.
position2d: basic service to control the motion of the robot and to read, via dead reckoning based on the motor encoders, the position of the robot itself;

ptz proxy: provides control for 3 hobby-type servos, for example to command the actuators of a pan-tilt-zoom camera.
Compared to the other frameworks presented further on, each chosen for its strengths with respect to competing products, Player is an obvious choice when one wants a direct and simple interaction with a robot.
The other possible approach is the implementation of the drivers for all
Figure 3.9: Two examples of connection with two different laser sensors: (a) Player interfaced with a Hokuyo URG laser; (b) Player interfaced with a SICK laser. Also in this case, Player provides the same client-side interface for both sensors.
the devices installed in the robot itself; clearly, this approach is extremely time consuming, feasible only in highly critical scenarios, where it is preferable to design ad-hoc software instead of relying on third-party frameworks. Moreover, using Player we always have the possibility of testing our application in different scenarios, like rescue robotics, by simply changing the robot, without worrying about modifications to our software.
3.3.2 OpenNI
As explained in Section 2.2, both HRI and human-computer interaction are moving toward a novel interaction paradigm, based on communication means that have to be natural and intuitive for humans, defining the so-called Natural Interaction. This is the main purpose of OpenNI2, where NI stands for Natural Interaction, a cross-platform framework developed by PrimeSense which provides APIs for implementing applications, mostly based on speech/gesture recognition and body tracking.
OpenNI enables a two-directional communication with, on the one hand:

• video and audio sensors for perceiving the environment (which have to be compliant with the standards of the framework);

• middlewares which, once data are acquired from the aforementioned sensors, return meaningful information, for example about the motion of a target.

On the other hand (see Figure 3.10), OpenNI communicates with applications which, through OpenNI and the middlewares, extract data from the sensors and use them for their purposes. OpenNI offers programmers the portability of applications written using its libraries: a sensor used to perform video acquisition can easily be substituted without the need to modify the code.
Figure 3.10: Abstract view of the layers of OpenNI communication: the application level, the OpenNI interfaces, the middleware components and the sensor level.

Following the breakthrough of the Kinect, a broad variety of frameworks beyond OpenNI arose to enable communication with the device, such as OpenKinect3 and the Point Cloud Library4 (to cite only the best known). After a thorough analysis of their strengths and weaknesses, we chose OpenNI, since it turned out to be the most suitable framework for our application, in terms of both usability and performance.

2 http://www.openni.org/
3.3.3 NITE
NITE Middleware is another multi-platform framework developed by PrimeSense, which offers different functionalities fully integrable into OpenNI (see Figure 3.11). Consisting of several computer-vision algorithms and APIs for gesture recognition, it is essentially an engine whose task is to understand how the user interacts with the surrounding environment.
NITE relies primarily on two control paradigms, which in turn are based on the aforementioned computer-vision and gesture-recognition algorithms:
3 http://openkinect.org/
4 http://pointclouds.org/
• Hand control : it occurs when a user interacts with his counterpart,
which can be a computer or a television, through hand gestures (e.g.
to browse media contents);
• Full body control : commonly associated with videogaming experiences,
the goal of this paradigm is the extraction of skeleton features to be
used as control inputs.
Figure 3.11: Layered view of NITE Middleware (the NITE engine and the NITE controls), focusing on its integration with OpenNI.
Instead of implementing a gesture-recognition algorithm ourselves, we decided to use this framework for two reasons: on the one hand, it is designed to communicate with the Kinect sensor; on the other hand, it provides an easy-to-use and robust engine for the recognition of different gestures.
3.3.4 OpenCV
OpenCV5, the open-source computer-vision library, is a very powerful framework developed by Willow Garage, which offers several APIs mainly focused on real-time computer vision. It features a wide range of functions for many different purposes, such as image transformations, machine-learning approaches for detection and recognition, tracking, and feature matching.
Within the scope of our application, this framework has been used during the tests of the person-tracking part, to visualize the data acquired by the Kinect and to output the results of the different algorithms implemented.
5 http://opencv.willowgarage.com/wiki/
Chapter 4
Person-Tracking
4.1 Introduction
One of the requirements for an effective human-robot interaction level is the achievement of a significant degree of awareness between the entities involved. From the machine perspective, a way to make a robot aware of the environment is to provide it with sensors, to acquire data from the world, and algorithms, to interpret these data in meaningful ways. In our case, on the one hand, the sensor is the Kinect device, already introduced in Chapter 3; on the other hand, a set of computer-vision-based algorithms guarantees awareness of the robot's counterpart, the human.
In this chapter we present our tracking subsystem, shown in Figure 4.1, through the investigation of three different approaches, analysing which technique exhibits the best performance in terms of person-tracking success rate, given the novelty of the hardware configuration presented in the previous chapter. In Section 4.2 we discuss our first approach, based on the tracking of the user's center of mass. Section 4.3 addresses a modified version of the previous implementation, adding a proportional controller to command the pan-tilt actuators. Finally, in Section 4.4, we detail a completely different approach, based on blob tracking.
Figure 4.1: Main steps of the person-tracking subsystem: define the target model, detect the target in the current frame, compute the position offset, and center the target.
4.2 CoM Tracking
In this first approach, we decided to rely upon OpenNI as much as possible, for two distinct reasons: on the one hand, we wanted to fully assess the real capabilities of the Kinect device, using the framework designed for it, in situations quite different from the ones the sensor was intended for; on the other hand, this approach saves programming time by directly using the provided APIs. The only assumption for this algorithm is the following:
A1 Due to physical limitations, given a Kinect and a pan-tilt system, only one target can be tracked at a time (although there can be more than one in the scene).
The CoM tracking algorithm (Algorithm 1) requires as an initial step the calibration of the body, in order to estimate the height of the user, the length of his limbs and the position of the joints, with the additional possibility of considering only regions of interest, like the torso, instead of the whole body. Once the calibration is performed, using a set of functions provided by OpenNI we can compute the projective coordinates of the center of mass with respect to the current frame f captured by the Kinect:
com_f = (x_f, y_f)^T    (4.1)
and then, using also the depth information acquired by the sensor, we calculate the world coordinates,
COM_f = (X_f, Y_f, Z_f)^T    (4.2)
derived according to the following set of equations:
X_f = Z_f (x_f − W/2) PS / FD    (4.3)
Y_f = Z_f (y_f − H/2) PS / FD    (4.4)
where
• X_f, Y_f, Z_f: 3D world coordinates of the center of mass; in particular, Z_f is the depth associated to the CoM, read by the sensor;

• x_f, y_f: projective coordinates of the center of mass (see Figure 4.4);

• W, H, PS, FD: respectively the width and height (in pixels) of the frame, the pixel size and the focal distance of the sensor.
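To make the conversion concrete, the following sketch applies (4.3) and (4.4) to a center of mass expressed in projective coordinates. The default values for W, H, PS and FD are illustrative placeholders, not the actual calibration constants of the sensor.

```python
def projective_to_world(x_f, y_f, Z_f, W=640, H=480, PS=0.0001, FD=0.05):
    """Convert the projective CoM (x_f, y_f) plus its depth Z_f into
    world coordinates (X_f, Y_f, Z_f), per equations (4.3)-(4.4)."""
    X_f = Z_f * (x_f - W / 2) * PS / FD
    Y_f = Z_f * (y_f - H / 2) * PS / FD
    return X_f, Y_f, Z_f
```

Note that a CoM lying at the exact center of the frame maps to X_f = Y_f = 0, meaning the target is already on the optical axis and no re-orientation is needed.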
Figure 4.2: Reference frame of the Kinect.
Once the spatial coordinates are computed, we need to compute the new pan and tilt angles, the new input commands for the motors, in order to re-orient the Kinect according to the motion of the target. Considering the reference frame of the Kinect, shown in Figure 4.2, and by means of basic geometry (see Figure 4.3), the angles associated to the movements of the user are calculated as follows:

∆Pan = atan2(X_f, Z_f)    (4.5)
∆Tilt = atan2(Y_f, Z_f)    (4.6)
The final positions Pan_f and Tilt_f, which determine the pointing bearing, are obtained from the previous positions of the pan-tilt, Pan_{f−1} and Tilt_{f−1}, and the angles computed in (4.5) and (4.6):

Pan_f = Pan_{f−1} + ∆Pan    (4.7)
Tilt_f = Tilt_{f−1} + ∆Tilt    (4.8)
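The angle update of (4.5)-(4.8) can be sketched as follows; the function name is ours, and the two-argument arctangent is used so that the sign of the offset follows the side of the optical axis the target is on.

```python
import math

def pan_tilt_command(X_f, Y_f, Z_f, pan_prev, tilt_prev):
    """Compute the new absolute pan/tilt commands from the target's
    world coordinates: offset angles via the two-argument arctangent
    (4.5)-(4.6), added to the previous positions (4.7)-(4.8)."""
    d_pan = math.atan2(X_f, Z_f)    # (4.5)
    d_tilt = math.atan2(Y_f, Z_f)   # (4.6)
    return pan_prev + d_pan, tilt_prev + d_tilt   # (4.7), (4.8)
```

For a target straight ahead (X_f = Y_f = 0) the command leaves the pan-tilt unchanged, while a target as far to the side as it is deep (X_f = Z_f) yields a 45-degree pan offset.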
Algorithm 1: CoM Tracking Algorithm

Input:
    F: current frame
    θ_{f−1}: pan angle at frame f−1
    φ_{f−1}: tilt angle at frame f−1
Output:
    com_f: projective coordinates of the user's center of mass
    COM_f: spatial coordinates of the user's center of mass
    θ_f: desired pan value
    φ_f: desired tilt value

 1  forall frames F do
 2      User_f ← GetUser(F)                          /* extract the user */
 3      com_f ← GetUserCoM(User_f)                   /* projective CoM of the user */
 4      COM_f ← ConvertProjectiveToRealWorld(com_f)  /* projective → spatial */
 5      ∆θ ← atan2(X_f, Z_f)                         /* offset angles */
 6      ∆φ ← atan2(Y_f, Z_f)
 7      θ_f ← ∆θ + θ_{f−1}                           /* desired pan-tilt values */
 8      φ_f ← ∆φ + φ_{f−1}
 9      θ_{f−1} ← θ_f                                /* update current values */
10      φ_{f−1} ← φ_f
Figure 4.3: ∆Pan computation: CN represents the position offset of the target between the previous and current frame, OC is the depth of the target in the current frame. The angle is derived by computing the arctangent of CN over OC. [∆Tilt is computed analogously, with respect to the Y and Z axes.]
4.3 CoM Tracking with P Controller
After several tests involving different people acting as targets, we discarded the former approach due to an unexpectedly high percentage of target losses, caused mostly by fast movements of the user and related to the nature of the algorithms underlying the OpenNI functions. The assumption behind the design of these algorithms is that the Kinect is either fixed on a surface (e.g. a table or a TV, where it is most likely to be found) or moving smoothly (e.g. for the 3D reconstruction of a static object). In our case such a hypothesis is violated by mounting the sensor on top of a pan-tilt unit, driving the motors toward the desired final position without any chance to slow down the execution of the displacement. Hence, the system could not guarantee either a slow or a smooth movement, once commanded.
To solve the problem of losing the tracked target, we designed an alternative version of the previous algorithm, called the CoM tracking with P controller algorithm (Algorithm 2), adding a proportional controller in order to achieve the smoothness we were looking for and to reduce, possibly to zero, the probability of target loss.

Figure 4.4: Result of the user's detection and the computation of his center of mass, labeled 1, using OpenNI.
4.4 Blob Tracking
Although the idea of a controller appeared to be the optimal way to solve the target-loss issue due to shaky movements, in this case too the outcome was not as satisfactory as expected. Rather than discarding the whole approach and its implementation, we attempted to modify it further, substituting the existing controller with a PID (proportional-integral-derivative) controller and then spending time tuning all the parameters of the algorithm and of the controller.
However, these modifications did not guarantee the high degree of robustness we needed for our purposes, mainly due to limitations of the framework (namely, the required static position of the Kinect). Therefore,
Algorithm 2: CoM Tracking with P Controller Algorithm

Input:
    F: current frame
    θ_{f−1}: pan angle at frame f−1
    φ_{f−1}: tilt angle at frame f−1
    K_p: proportional gain of the controller
Output:
    com_f: projective coordinates of the user's center of mass
    COM_f: spatial coordinates of the user's center of mass
    θ_f: desired pan value
    φ_f: desired tilt value

 1  forall frames F do
 2      User_f ← GetUser(F)                          /* extract the user */
 3      com_f ← GetUserCoM(User_f)                   /* projective CoM of the user */
 4      COM_f ← ConvertProjectiveToRealWorld(com_f)  /* projective → spatial */
 5      ∆θ ← atan2(X_f, Z_f)                         /* offset angles */
 6      ∆φ ← atan2(Y_f, Z_f)
 7      θ_f ← ∆θ + θ_{f−1}                           /* desired pan-tilt values */
 8      φ_f ← ∆φ + φ_{f−1}
 9      while |θ_f − θ_{f−1}| ≥ ε do
10          θ_{f−1} ← θ_{f−1} + K_p (θ_f − θ_{f−1})  /* proportional pan update */
11      while |φ_f − φ_{f−1}| ≥ ε do
12          φ_{f−1} ← φ_{f−1} + K_p (φ_f − φ_{f−1})  /* proportional tilt update */
13      θ_{f−1} ← θ_f                                /* update current values */
14      φ_{f−1} ← φ_f
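The proportional stepping at the heart of this variant can be sketched as follows; this is our reading of the update rule, with K_p, ε and the iteration cap as hypothetical values, not tuned constants from the thesis.

```python
def p_converge(current, target, kp=0.25, eps=1e-3, max_iters=10_000):
    """Drive `current` (a pan or tilt angle) toward `target` in
    proportional steps, reproducing the smooth approach the P
    controller is meant to provide."""
    for _ in range(max_iters):
        if abs(target - current) < eps:
            break
        current += kp * (target - current)   # proportional correction
    return current
```

Since each step shrinks the remaining error by a factor (1 − K_p), the angle converges geometrically to the setpoint instead of jumping there in one motion.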
we discarded our initial "conservative" OpenNI-based approach in favour of the one presented in this section.
This version of the tracking is based on the extraction, for each frame acquired by the sensor, of the most promising cluster of points, called a blob, choosing the one with the lowest average depth, whose centroid is then tracked. With this algorithm we lose the ability to simply detect the users in the scene and the precision in the estimation of the target's center of mass, the main features of the former approaches. With respect to the OpenNI APIs, our algorithm is not able to:

• directly locate the users in the scene;

• distinguish between objects and people (even if OpenNI exhibits problems in some conditions as well).
To cope with these limitations, we need to introduce another assumption, besides A1:

A2 The environment is wide enough to allow the target to be the nearest entity to the robot, without occlusions (e.g. narrow walls).
On the one hand, this guarantees that the blob we start to track is really related to the user, not to a desk or a closet, so that the performance in this simplified domain can be compared to the implementations previously presented (still maintaining a lower precision in the extraction of the center of mass). On the other hand, using the blob tracking algorithm (Algorithm 5) we achieve the best performance in terms of reliability and robustness with respect to the tracking problem. Finally, it is worth noting that Assumption A2 could be significantly relaxed by adopting different (or additional) heuristics besides the one proposed here. This has not been accomplished due to lack of time and will be considered as future work.
After this discussion and the brief comparison between the approaches presented so far, here we propose an introductory sketch of the behaviour of the algorithm at the core of this section:
1. for each frame f, the algorithm looks for the pixel with minimum depth in a region of interest (ROI), defined around the center of the image acquired by the Kinect;

2. background elimination is performed only in the ROI, by segmenting the image and keeping only the foreground points that fall within a given distance threshold of the minimum depth computed before;

3. starting from the resulting segmented frame, the algorithm clusters the foreground points into different blobs (in the best scenario, only one blob will be created);

4. the most promising blob is selected, and its centroid, represented analogously to a center of mass as shown in (4.1) and (4.2), is computed and then tracked.
Figure 4.5: Depth information of the scene acquired by the Kinect.
Restricting the initial frame to the region of interest corresponds to a reduction of the field of view of the sensor, and is performed to avoid, as much as possible, problems that may arise in case assumption A2 does not completely hold. The first two steps presented in the sketch are performed by the background elimination algorithm (Algorithm 3), which takes as input the frame captured by the Kinect and returns a cropped, segmented version of it, following these steps:
1. it takes the frame acquired by the sensor, shown in Figure 4.5, and creates a new frame associated to the ROI of the original image;

2. it looks for the pixel of the new frame with the lowest depth;

3. it re-scans the ROI frame, separating the background pixels (set to black) from the foreground ones (set to white) (see Figure 4.6).
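The three steps above can be sketched compactly on a plain list-of-lists depth image (depths in millimetres, 0 meaning no reading); the ROI bounds and the threshold ε are hypothetical parameters, not the values used in the thesis.

```python
def background_elimination(depth, roi, eps=300):
    """Segment the ROI of a depth frame: pixels whose depth lies within
    eps of the ROI minimum become foreground (True), all others
    background (False). roi = (min_w, max_w, min_h, max_h)."""
    min_w, max_w, min_h, max_h = roi
    least = min(depth[j][i]                  # lowest depth in the ROI
                for j in range(min_h, max_h)
                for i in range(min_w, max_w)
                if depth[j][i] > 0)          # ignore missing readings
    return [[0 < depth[j][i] <= least + eps
             for i in range(min_w, max_w)]
            for j in range(min_h, max_h)]
```

The returned boolean mask plays the role of the black/white segmented frame fed to the blob expansion step.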
Figure 4.6: Background elimination performed by the algorithm.
The image returned by the background elimination step is then used as input for another algorithm, the blob expansion algorithm (Algorithm 4), whose purpose is to cluster all the points that lie in the ROI, according to the following steps:

1. all the pixels are marked as unvisited: since the algorithm is recursive, this avoids possible stack-overflow problems that may arise from repeatedly re-visiting the same pixels;

2. it scans the segmented frame starting from the origin of the image;

3. every time it encounters an unvisited foreground pixel, it creates a blob, setting that pixel as its centroid;
Algorithm 3: Background Elimination Algorithm

Input:
    F: current frame
    p_{i,j}: pixel at column i, row j of the current frame
    depth(p): depth associated to pixel p
    minW, MaxW: lower/upper bound of the ROI width
    minH, MaxH: lower/upper bound of the ROI height
Output:
    SF: segmented frame
    I_sf: set containing the foreground pixels of the ROI
    O_sf: set containing the background pixels of the ROI

 1  forall frames F do
 2      SF ← CopyROI(F)                      /* copy the ROI from the original frame */
 3      LeastDepth ← GetLeastDepth(SF)       /* minimum depth in the ROI */
 4      forall pixels p_{i,j} do
 5          if (minW ≤ i ≤ MaxW) ∧ (minH ≤ j ≤ MaxH) then
 6              if depth(p_{i,j}) − LeastDepth ≤ ε then
 7                  I_sf ← I_sf ∪ {p_{i,j}}  /* foreground */
 8              else
 9                  O_sf ← O_sf ∪ {p_{i,j}}  /* background */
10      SF ← BuildFilteredImage(I_sf, O_sf)  /* build the segmented ROI */
Algorithm 4: Blob Expansion Algorithm

Input:
    SF: segmented frame
    p_k: k-th pixel of the segmented frame
    depth(p): depth associated to a pixel
    u_n: n-th neighbour of the pixel p_k
    I_sf: set containing the foreground pixels of the ROI
    O_sf: set containing the background pixels of the ROI
Output:
    B: set containing all the clustered blobs

 1  forall segmented frames SF do
 2      forall pixels p_k do
 3          if (p_k is Unvisited) ∧ (p_k ∈ I_sf) then
 4              blob_i ← CreateBlob(p_k)             /* new blob, p_k as centroid */
 5              forall neighbours u_n of p_k do
 6                  if |depth(u_n) − depth(p_k)| ≤ ε then
 7                      Grow(blob_i)                 /* update the size of the blob */
 8                      BlobExpansion(blob_i)        /* recursive expansion */
 9              B ← B ∪ {blob_i}                     /* store the blob */
4. starting from the centroid, the algorithm visits all its neighbours, trying to recursively expand the blob as much as possible while maintaining a rectangular/square shape;

5. when the blob cannot be further expanded, it looks for another unvisited foreground pixel and, if one is found, it repeats the previous steps until the whole image has been scanned;

6. it finally returns the set containing all the blobs created.
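The steps above can be sketched as a flood fill over the foreground mask. We use an explicit stack instead of recursion, precisely to sidestep the stack-overflow risk mentioned in step 1, and the rectangular-shape approximation is omitted for brevity; blobs here are simply lists of (row, col) pixels.

```python
def blob_expansion(mask):
    """Cluster foreground pixels (True) of a segmented frame into blobs
    using a 4-connected flood fill. Returns a list of blobs."""
    H, W = len(mask), len(mask[0])
    visited = [[False] * W for _ in range(H)]
    blobs = []
    for j in range(H):
        for i in range(W):
            if mask[j][i] and not visited[j][i]:
                blob, stack = [], [(j, i)]       # new blob seeded here
                visited[j][i] = True
                while stack:
                    r, c = stack.pop()
                    blob.append((r, c))
                    # expand to unvisited 4-connected foreground neighbours
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < H and 0 <= nc < W
                                and mask[nr][nc] and not visited[nr][nc]):
                            visited[nr][nc] = True
                            stack.append((nr, nc))
                blobs.append(blob)
    return blobs
```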
Figure 4.7: Approximation to a rectangle/square of the most promising blob returned by the blob expansion algorithm.
After the execution of the aforementioned algorithm, we choose the blob (if the execution returned more than one) with the lowest average depth, computed over the pixels belonging to that blob. Then, since the blob is expanded by approximating its shape to a rectangle/square (see Figure 4.7), it is quite easy to geometrically derive the projective coordinates of the centroid. At this stage, using (4.3) and (4.4) we can compute the world coordinates and finally, according to (4.5) and (4.6), we obtain the commands for the pan-tilt system to re-align the Kinect, in order to keep the target in the center of its frame.
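The selection step can be sketched as follows: among the clustered blobs, pick the one with the lowest average depth and take its centroid as the new tracking target. The function names are ours, not OpenNI's, and blobs are assumed to be lists of (row, col) pixels over a list-of-lists depth frame.

```python
def best_blob_centroid(blobs, depth):
    """Select the blob with the lowest average depth and return the
    projective (pixel) coordinates of its centroid."""
    def avg_depth(blob):
        return sum(depth[r][c] for r, c in blob) / len(blob)
    best = min(blobs, key=avg_depth)            # nearest blob on average
    x_c = sum(c for _, c in best) / len(best)   # column → x
    y_c = sum(r for r, _ in best) / len(best)   # row → y
    return x_c, y_c
```

The returned projective centroid is then converted to world coordinates with (4.3)-(4.4) and turned into pan-tilt commands with (4.5)-(4.6), exactly as for the CoM in the earlier approaches.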
In Section 6.2 we detail the experiments, and the relative results, performed to assess the reliability of the tracking subsystem under static conditions of the robot, actuating only the Kinect.
Algorithm 5: Blob Tracking Algorithm

Input:
    F: current frame
    θ_{f−1}: pan angle at frame f−1
    φ_{f−1}: tilt angle at frame f−1
Output:
    centroid_f: projective coordinates of the best blob's centroid
    CENTROID_f: spatial coordinates of the best blob's centroid
    θ_f: desired pan value
    φ_f: desired tilt value

 1  forall frames F do
 2      SF ← BackgroundElimination(F)                          /* background elimination */
 3      B ← BlobExpansion(SF)                                  /* blob expansion */
 4      best ← BestBlob(B)                                     /* best blob choice */
 5      centroid_f ← GetProjectiveCentroid(best)               /* projective centroid */
 6      CENTROID_f ← ConvertProjectiveToRealWorld(centroid_f)  /* projective → spatial */
 7      ∆θ ← atan2(X_f, Z_f)                                   /* offset angles */
 8      ∆φ ← atan2(Y_f, Z_f)
 9      θ_f ← ∆θ + θ_{f−1}                                     /* desired pan-tilt values */
10      φ_f ← ∆φ + φ_{f−1}
11      θ_{f−1} ← θ_f                                          /* update the angles */
12      φ_{f−1} ← φ_f
Chapter 5
Gesture-driven Interaction
5.1 Introduction
In Chapter 2 we presented the gesture recognition problem, addressing a set of different techniques for the implementation of a gesture classifier as well as several application fields.
In accordance with our motivation for a friendlier robot to interact with, one providing simpler interfaces usable not only by system experts, in this chapter we propose the implementation of our gesture-driven interaction system, starting from the following premises:

A3 to achieve an easily usable gesture-based interaction system, we do not make use of any graspable user interface, aiming to implement a vision-based gesture recognizer. Of course, we are aware that this requires the camera to continuously point at the user, but this is realistic for our application;

A4 within the whole set of possible gestures, already discussed in Section 2.4, our system is designed to recognize only hand gestures;

A5 we analyse only a small subset of all the possible hand gestures a user can perform, mapping these gestures to actions the robot will execute.

We begin the investigation of this subsystem by providing a mathematical representation of gesture, suitable for a vision-based system:
Table 6.3: Performance analysis for the joint experiment: in the Following and Gesture columns we counted the number of failures for each run, meaning respectively the following of a wrong target (e.g. a wall or a desk) and the wrong recognition of a gesture.
Experiments 6.4 Joint Evaluation
gesture recognition becomes highly difficult, due to the position of the Kinect, approximately 40 centimeters from the ground, and the average position of the hand of the target, circa 1.20 meters. Furthermore, with too high an angular offset between the entities, the tracking also becomes less robust, since it is quite likely that the target moves out of the field of view of the sensor, leading the robot to follow wrong targets.
Chapter 7
Conclusions
A challenging aim of human-robot interaction is to design desirable robotic platforms which can be perceived as mass-consumption products, leading to a worldwide diffusion. A recurring problem in HRI is represented by the common interfaces employed by humans to communicate with robots, which usually require significant effort and skills, turning out to be usable only by specialists. While this is an acceptable constraint in scenarios like rescue robotics, which is unsuitable for inexperienced operators due to its challenging conditions, other non-critical or less critical scenarios require simpler paradigms to drive interactive systems: this aspect is particularly relevant when designing socially interactive robots.
This ease of control can be achieved by defining new communication means that reduce the human effort needed for the interaction, for example by minimizing the complexity of user interfaces. To this end, in this work we propose a new approach for a mobile social robot, which provides the user with a natural communication interface inspired by the interaction models humans have among themselves. This is achieved by discarding any wearable or graspable input device and instead equipping the robot with a video sensor, for a vision-based gesture-driven interaction system. Through gestures, users can easily interact with the robot as they would with another human, relying on a communication interface suitable for everyone, from the specialist to the novice. Gesturing is an easy and expressive way for people to convey
meaningful information; hence, a gesture-driven interaction system is an optimal choice for our purpose of designing a socially interactive robot that may seem friendlier, that is accessible to everyone, and that gives users the illusion of interacting with a peer of theirs.
Our implementation of a friendlier social robot exhibits good performance with respect to the tasks presented in Chapter 2. As shown in Chapter 6, the tracking algorithm is significantly robust within the range of view of the Kinect, proving a very reliable choice for indoor applications. The experiments also confirm the achievement of a gesture-driven interaction system usable by non-expert operators, which is one of the aims of our work. It is worth highlighting that our system allows the use of the Kinect on mobile platforms, one of the goals we set at the beginning of this work, which is not achievable through the current approaches based on the different frameworks meant for the Kinect. Although we accomplished our goals, the approach detailed in this thesis presents some aspects which can be improved as future work.
First, the Kinect proved to be a very reliable sensor indoors, but quite useless for outdoor applications. A possible and interesting solution to this problem is to install a stereo camera on the system, to make the platform suitable also for outdoor environments, defining a switching paradigm to alternate data acquisition between the Kinect and the stereo camera when the robot moves from an indoor to an outdoor environment, or vice versa.
Second, the reliability of the person-tracking subsystem can be increased by defining different (or additional) heuristics besides the one we proposed in Section 4.4, for example by integrating adaptive techniques that modify the size of the ROI of the frame acquired by the sensor according to the distance of the user.
Third, we already mentioned how issues arising from the cardinality of our vocabulary may be solved by defining complex sequences of gestures. Clearly, this solution is not always feasible, because a sequence of gestures may be excessively complex and exhibit unacceptable failure rates. In order to maintain a simple gesture-driven interaction together with a high success rate, a good choice is to use frameworks different from NITE, or to implement an
ad-hoc gesture recognition subsystem, even if this is a rather time-consuming
approach.
Fourth, an important improvement for the human-following interaction is to implement a robust trajectory-following algorithm, using PID controllers, to cope with the motion of the target. Moreover, to make gesture recognition more robust in such a case, the best solution is to raise the sensor and the pan-tilt unit by at least one meter, to overcome problems related to overly short distances.
Finally, even if the topic is not addressed in this work, it would be interesting to integrate additional human-oriented perception systems, for example speech recognition. In this case, one could take advantage of the hardware already installed on the robot, the array of microphones of the Kinect, to define an even more immersive, natural and multimodal paradigm for the interaction between humans and social robots.
Acknowledgements
First of all, I would like to thank my parents, Monica and Gabriele, for
raising me and making me what I am. Thank you for always being by my
side during this long, too long, journey.
A warm thanks goes to Luca Iocchi, Daniele Nardi and Giorgio Grisetti,
for giving me suggestions to go on and the chance to prove myself.
A big thanks to Gabriele Randelli, for tutoring me during this thesis,
becoming a friend, not only a mentor. Thank you for all the things you
taught me.
A special thanks to my girlfriend, Martina, and my closest buddies Gioia,
Danilo, Alessio and Riccardo. Just thank you, for everything. Words cannot
describe years spent together.
A hug and a thank you to my "lab" friends: John, Scardax, Andrea "Entropia" D., Andrea "Penna" P., Mingo (aka "Meravijosa"), Federica, Mara,