UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
GRADO EN INGENIERÍA INFORMÁTICA
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic localization in robotics
Fernando Rubio Perona
July 2014
UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
Departamento de Sistemas Informáticos
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic localization in robotics
Author: Fernando Rubio Perona
Supervisors: María Julia Flores Gallego, Jesús Martínez Gómez
Collaborators: Ann Nicholson, Alex Black
July 2014
Abstract

This project presents a solution based on Bayesian Artificial Intelligence for the problem of semantic localization in autonomous robots. We have developed a methodology that covers the following steps: (1) Image processing and discretization for creating feature-based scene descriptors. (2) Learning of static Bayesian Networks and the Naïve Bayes classifier.
Chapter 4 explains the whole development process in detail and defines the first objective of the project. The first section defines the dataset and the second one covers the tools used. The remaining sections of this chapter represent each of the steps of the experimentation process, specifying the input and output information and the tool used at each step. We include a diagram of the different functionality of each step, in order to give a global view of the whole process.

Chapter 5 is entirely devoted to the results. In this chapter we try to achieve the two objectives related to information extraction. We will go through the different plots and tables in order to obtain information about the behaviour of the number of variables and of the DBNs.

The last chapter corresponds to the conclusions and the future work that we propose.
Chapter 2
Bayesian Artificial Intelligence
2.1. Notation
Random variables are represented by upper-case roman letters: X, Y, etc. A set of variables is written with the same letter in bold: X, Y, etc., and the variables of such a set are represented by the same letter with a subscript:

X = {X1, X2, ..., Xn} (2.1)

The set of values that a variable can take is written with the Greek character Ω and its arity by |Ω|; it usually carries a subscript indicating the variable it refers to: ΩX.

The values of a variable are represented by lower-case letters: a, b, etc., and a value a belongs to a variable X if a ∈ ΩX.

If the variable is binary, we can write one of its values with the same character as the variable in lower case, and the opposite value with the same letter with a horizontal line above it: ΩX = {x, x̄}.

Probabilities are represented by the upper-case character P. P(X) represents the probabilities of all the values that the variable X can take. For example, if a ∈ ΩX, the probability of this specific value is written P(X = a), or simply P(a).

The most common way to represent a probability distribution is a table where each value of the variable has a probability. For example, given a variable X with ΩX = {a1, a2, ..., am}, we can see in Table 2.1 how the probabilities are represented.
X      Probability
a1     P(X = a1)
a2     P(X = a2)
⋮      ⋮
am     P(X = am)
Table 2.1: Example table of probabilities
We represent the intersection between variables with a comma (,) or with ∩. This denotes the joint probability, as we will see in the corresponding section.

The conditional relationship is represented by the vertical bar ( | ). The term on the left is the hypothesis we want to know about, and the term on the right is the events we already know. An example of this notation is P(X | Y).

The symbol ⊥ represents independence between variables. Independence can also be represented by the upper-case letter I, but the expressions are written differently depending on the character used. We will see these differences later.
2.2. The probability in our solution
The solutions explained in this project, and most concepts and techniques of Bayesian A.I., are based on probabilistic inference. First of all, what is probabilistic inference? It is the ability to infer some evidence with a certain degree of probability from the observation of other evidence in a set of variables. This means that if we see some characteristic in an image, we can obtain a probability for the place where the robot is. For example, if we are in a house and we see a fridge, we can say with high probability that the robot is in the kitchen, although it could be in another room. If we see a mirror, the robot may be in a bathroom or in a bedroom with the same probability, or perhaps in another room with a lower probability.
To calculate the probability distribution for each variable we need to capture the relationships between variables. As we will see later, for this purpose we will use models able to represent those dependences and independences: Bayesian Networks [Jensen & Nielsen, 2007; Korb & Nicholson, 2010; Pearl, 1988]. The construction of these networks normally requires two phases, structure learning and parametric learning, since the second phase depends on the first one. A model can also be obtained from an expert using knowledge engineering, but this process is difficult and slow, and needs specific techniques for knowledge elicitation. Besides, it requires the availability of an expert in the domain to be modelled. Because of these difficulties, and thanks to the increasing popularity and development of Machine Learning techniques, automatic learning of Bayesian Networks from data is also possible and very commonly done [Cooper & Herskovits, 1992; Heckerman et al., 1995; Neapolitan, 2003]. We will mainly focus on this second approach. Then, we will usually calculate the probability distribution from previous experience; this experience data is called training data.
In the following subsections, we will then introduce basic concepts about probability,
which are necessary to understand how Bayesian Networks work.
2.2. THE PROBABILITY IN OUR SOLUTION 11
2.2.1. Marginal Probability
This distribution gives the probability that each value or state of a variable has of appearing, and it is calculated differently depending on the type of variable (discrete or continuous).

Discrete variables have a probability for each value, and we calculate it by counting from the data how many times the value appears and dividing by the number of training cases. As the variables we work with are random, these values can change depending on the training data, so normally we work with frequencies.

When the variables are continuous, we calculate the probabilities with an estimation approach such as a Gaussian distribution.

An example of a discrete variable could be to determine where the robot is. We have training data consisting of images, each labelled with one of five types of room, so the variable Room has five values: kt (kitchen), be (bedroom), ba (bathroom), lr (living-room) and cr (corridor). We obtain the probabilities by counting how many times each room appears in the labels and dividing by the total number of images. We can see in Table 2.2 that the robot is more likely to be in a big room like the living-room.
Room Probability
kt 0.20
be 0.20
ba 0.15
lr 0.30
cr 0.15
Table 2.2: Room Probability P (Room)
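The counting procedure just described can be sketched in a few lines of Python. The list of labels below is hypothetical (it is not the thesis dataset), chosen so that the resulting frequencies match Table 2.2:

```python
from collections import Counter

# Hypothetical room labels, one per training image (20 images in total),
# chosen so that the resulting frequencies match Table 2.2.
labels = ["lr", "kt", "be", "ba", "cr", "lr", "kt", "be", "lr", "ba",
          "cr", "lr", "kt", "be", "lr", "kt", "be", "lr", "ba", "cr"]

counts = Counter(labels)                      # occurrences of each room
total = len(labels)                           # number of training cases
p_room = {room: counts[room] / total for room in sorted(counts)}

print(p_room)
# {'ba': 0.15, 'be': 0.2, 'cr': 0.15, 'kt': 0.2, 'lr': 0.3}
```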
Now we go through all the images and count how many times we see a mirror and a fridge. Each of these variables has two values: whether we see the object or not.

But marginal probabilities alone are not enough to obtain results. As we can see in Table 2.3, if there is a mirror in the image we still do not know where the robot is, and the same holds for the fridge. But we know that if we see the fridge, the robot has a high probability of being in the kitchen, so we have to relate the Room variable with the Fridge probability.
Mirror   Probability
m̄        0.70
m        0.30

Fridge   Probability
f̄        0.90
f        0.10

Table 2.3: Probability of variable Mirror, P(Mirror), and of variable Fridge, P(Fridge)
2.2.2. Joint Probability
The union of variables is described by the joint probability distribution. It is the probability of observing several pieces of evidence at the same time. It is a probability distribution over the observed variables, and it has as many entries as there are combinations of the possible values of the variables.

In the previous example we have a mirror and a fridge; these are our variables, and each one has two values, depending on whether the object appears or not. With this, the joint probability table looks like Table 2.4. We obtain this probability by counting how many times both objects appear in the same image, how many times only one of them appears, and how many times neither does.
      m̄      m
f̄     0.61   0.29
f     0.09   0.01
Table 2.4: Joint probability for Mirror and Fridge variables P (Mirror,Fridge)
In this case there are only two variables, but the size of the table grows exponentially with the number of variables N, as we can see in Equation 2.2. That means that if we have 50 binary variables the size of the table will be 2^50 ≃ 10^15, which is impossible to calculate and store; this will be alleviated by factorisation in BNs. And now we can relate the room and the objects information, as we see in Table 2.5.

size of the table = ∏_{k=1}^{N} |Ω_k| (2.2)

      kt     be     ba     lr     cr
m̄     0.20   0.15   0.05   0.20   0.10
m     0      0.05   0.10   0.10   0.05
Table 2.5: Joint probability for Mirror and Room variables P (Room,Mirror)
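Equation 2.2 is easy to check in code. The sketch below computes the size of a full joint table from the variable arities; the 50-variable figure shows why storing the full joint is hopeless:

```python
from math import prod

def joint_table_size(arities):
    """Size of the full joint table: the product of the arities (Equation 2.2)."""
    return prod(arities)

print(joint_table_size([2, 2]))    # Mirror and Fridge: 4 entries, as in Table 2.4
print(joint_table_size([2] * 50))  # 50 binary variables: 2**50, about 10**15 entries
```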
2.2.3. Conditional Probability
We can measure the probability distribution of a variable when we know the values of other variables with the conditional probability. For example, what is the probability of being in the bedroom when we see a mirror? The conditional probability is calculated by Equation 2.3.

P(X | Y) = P(X, Y) / P(Y) (2.3)

We now answer that question with the conditional probability of being in a room when we see, or do not see, an object, in this case a mirror. We observe in Table 2.6 these
probabilities, which are obtained from the tables above. If we see a mirror, it is more likely to be in the bathroom or in the living-room than in other rooms.
      kt     be     ba     lr     cr
m̄     0.29   0.21   0.07   0.29   0.14
m     0      0.17   0.33   0.33   0.17

Table 2.6: Cond. prob. of the rooms given the value for Mirror: P(Room | Mirror)
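Table 2.6 can be reproduced from the joint distribution of Table 2.5 by applying Equation 2.3 directly. A minimal sketch, with `m`/`not_m` standing for the mirror being seen or not:

```python
# Joint distribution P(Room, Mirror) from Table 2.5
# ("not_m" is the barred value, i.e. no mirror observed).
joint = {
    "not_m": {"kt": 0.20, "be": 0.15, "ba": 0.05, "lr": 0.20, "cr": 0.10},
    "m":     {"kt": 0.00, "be": 0.05, "ba": 0.10, "lr": 0.10, "cr": 0.05},
}

def conditional(joint_row):
    """Equation 2.3: P(Room | Mirror=v) = P(Room, Mirror=v) / P(Mirror=v)."""
    p_v = sum(joint_row.values())          # marginal P(Mirror=v)
    return {room: p / p_v for room, p in joint_row.items()}

# Reproduces the m row of Table 2.6: bathroom and living-room tie at 0.33.
p_room_given_m = conditional(joint["m"])
print({room: round(p, 2) for room, p in p_room_given_m.items()})
# {'kt': 0.0, 'be': 0.17, 'ba': 0.33, 'lr': 0.33, 'cr': 0.17}
```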
2.2.4. Chain Rule
This rule allows us to calculate any joint probability distribution through conditional probabilities. If we have a set of variables X = {X1, X2, ..., Xn}, we can link the conditional probabilities with the joint probability through Equation 2.3, as seen before, and then we obtain this:
P(Xn, ..., X1) = P(Xn | Xn−1, ..., X1) * P(Xn−1, ..., X1) (2.4)

Now we repeat the process with Xn−1:

P(Xn−1, ..., X1) = P(Xn−1 | Xn−2, ..., X1) * P(Xn−2, ..., X1) (2.5)
And we repeat this until we reach P(X1). Finally, joining all these operations, we obtain the product:

P(Xn, ..., X1) = P(X1) * ∏_{k=2}^{n} P(Xk | Xk−1, ..., X1) (2.6)
In a case with three variables, like Room, Fridge and Mirror, we can use the Chain Rule to calculate their joint probability, as we can see in Equation 2.7. Notice that the ordering of the variables used to produce this chain is not arbitrary: it depends on the structure of dependences between the variables but, as we will show in subsection 2.4.1, the fact that the graph underlying a Bayesian Network is acyclic guarantees that a valid ordering can always be found.

P(Room, Fridge, Mirror) = P(Room | Fridge, Mirror) * P(Fridge | Mirror) * P(Mirror) (2.7)
2.2.5. Independence
When we work with conditional probabilities, a basic concept is independence. A variable X is independent of another variable Y when knowing Y does not affect the probability of X (2.8).

X ⊥ Y ≡ P(X | Y) = P(X) (2.8)

This is known as marginal independence, and it can also be represented as I(X | ∅ | Y). An example of independence in our case would be seeing a lamp, with the probabilities of Table 2.7. The lamp does not give us information about where the robot is: the probabilities are the same whether we see a lamp or not.
      kt     be     ba     lr     cr
l̄     0.20   0.20   0.15   0.30   0.15
l     0.20   0.20   0.15   0.30   0.15

Table 2.7: Cond. prob. of the rooms given the value for Lamp: P(Room | Lamp)
It is not easy to find marginal independences between variables, so we need another kind of independence. Conditional independence occurs when observing an event Y does not affect the conditional probability P(X | Z):

X ⊥ Y | Z ≡ P(X | Y, Z) = P(X | Z) (2.9)

As before, conditional independence can also be represented as I(X | Z | Y). In this example we see a desk and a TV, and we have Tables 2.8 and 2.9. We can see how knowing the value of TV does not affect the conditional probability of Room given Desk.
      kt     be     ba     lr     cr
d̄     0.28   0.07   0.25   0.21   0.19
d     0.09   0.37   0.02   0.42   0.10

Table 2.8: Cond. prob. of the rooms given the value for Desk: P(Room | Desk)
         kt     be     ba     lr     cr
d̄, t̄v   0.28   0.07   0.25   0.21   0.19
d̄, tv   0.28   0.07   0.25   0.21   0.19
d, t̄v   0.09   0.37   0.02   0.42   0.10
d, tv   0.09   0.37   0.02   0.42   0.10

Table 2.9: Cond. prob. of the rooms given Desk and TV: P(Room | Desk, TV)
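The conditional independence shown in Tables 2.8 and 2.9 can be verified mechanically: adding TV to the conditioning side leaves every row unchanged. A small sketch, with the values copied from the tables and each row listed in the order kt, be, ba, lr, cr:

```python
# P(Room | Desk) from Table 2.8; each row lists [kt, be, ba, lr, cr].
p_room_given_desk = {
    "not_d": [0.28, 0.07, 0.25, 0.21, 0.19],
    "d":     [0.09, 0.37, 0.02, 0.42, 0.10],
}

# P(Room | Desk, TV) from Table 2.9.
p_room_given_desk_tv = {
    ("not_d", "not_tv"): [0.28, 0.07, 0.25, 0.21, 0.19],
    ("not_d", "tv"):     [0.28, 0.07, 0.25, 0.21, 0.19],
    ("d", "not_tv"):     [0.09, 0.37, 0.02, 0.42, 0.10],
    ("d", "tv"):         [0.09, 0.37, 0.02, 0.42, 0.10],
}

# Room is independent of TV given Desk iff every (desk, tv) row
# equals the corresponding desk-only row.
independent = all(
    row == p_room_given_desk[desk]
    for (desk, tv), row in p_room_given_desk_tv.items()
)
print(independent)  # True
```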
2.3. Bayes’ Theorem
Bayes' Theorem is one of the most important formulas in probability theory. This theorem, formulated by Reverend Thomas Bayes, is the result of the mathematical manipulation of conditional probabilities. If we have P(X | Y) and P(Y | X), we can express both in terms of the same joint probability, as we see in Equations 2.10 and 2.11.

P(X | Y) = P(Y, X) / P(Y) (2.10)

P(Y | X) = P(Y, X) / P(X) (2.11)

Then we equate both equations and solve for one of the conditional probabilities, obtaining the Bayes' Theorem of Equation 2.12.

P(X | Y) = P(Y | X) P(X) / P(Y) (2.12)

It asserts that the probability of a hypothesis X conditioned on Y is equal to its likelihood P(Y | X) multiplied by the prior probability P(X), normalized by dividing by P(Y) to obtain a conditional probability that sums to 1.
We will see the importance of Bayes' Theorem with the next example. Suppose that we have no access to the training data, and that the information we have is given by an expert. He tells us the probabilities of seeing the different objects given the room the robot is in, and the probabilities the robot has of being in each of these rooms (Table 2.2). An example of this is the desk information represented in Table 2.10.
      d̄      d
kt    0.80   0.20
be    0.20   0.80
ba    0.95   0.05
lr    0.40   0.60
cr    0.70   0.30

Table 2.10: Conditional probability of seeing a desk given the value for the room,
P(Desk | Room)
These probabilities are much easier to obtain than the probabilities of Table 2.8. That is why Bayes' Theorem is so important: it allows us to link the probabilities of seeing an object in a room with the probabilities of being in a room when we see an object.
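As a check, the expert's table P(Desk | Room) from Table 2.10, together with the prior P(Room) from Table 2.2, recovers the d row of Table 2.8 via Bayes' Theorem. A sketch:

```python
# P(Desk = d | Room), the d column of Table 2.10, and P(Room) from Table 2.2.
p_d_given_room = {"kt": 0.20, "be": 0.80, "ba": 0.05, "lr": 0.60, "cr": 0.30}
p_room = {"kt": 0.20, "be": 0.20, "ba": 0.15, "lr": 0.30, "cr": 0.15}

# Bayes' Theorem: P(Room | d) = P(d | Room) * P(Room) / P(d).
unnorm = {r: p_d_given_room[r] * p_room[r] for r in p_room}
p_d = sum(unnorm.values())                 # the normalizer P(Desk = d)
p_room_given_d = {r: v / p_d for r, v in unnorm.items()}

print({r: round(p, 2) for r, p in p_room_given_d.items()})
# {'kt': 0.09, 'be': 0.37, 'ba': 0.02, 'lr': 0.42, 'cr': 0.1}
```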
Another example of the importance of Bayes' Theorem is its use in the field of medicine: it links the probability of the symptoms given the disease with the probability of having a certain disease when we know the symptoms.
2.4. Bayesian Networks
A Bayesian Network is a Directed Acyclic Graph (DAG) where each node has an associated conditional probability distribution.

The nodes are random variables, and the directed arcs represent a direct dependency between them. If we have the link X → Y, it means that the variable X is a parent of Y. The conditional probability tables (CPTs) contain the conditional probability of the node's variable given its parents, P(X | pa(X)). In Figure 2.1 we can see an example of a Bayesian Network with the probabilities related to each node.
Figure 2.1: Bayesian Network with 3 variables and the probabilities associated to each node, in the form P(X | pa(X)).
There are many methods to obtain these relationships and probability tables. The main way to get them is to analyse a set of training data with different techniques. We can also add links previously established by an expert to this process, or we can use a hybrid approach where the learning algorithms accept expert information as priors which can be input into the algorithm. We talk more about these methods in the CaMML section.
2.4.1. The Markov Property
In Figure 2.2 we see the different relationships between a variable A and the rest of the variables in a graph representing a Bayesian Network. Variables in the green area, labelled as C, are its parents, represented by pa(A). Those in the blue zone (labelled as B) are the non-descendants of the variable A excluding its parents, and we identify them with nonde(A). And those in red are its descendants, denoted de(A). The Markov property says that a variable is conditionally independent of its non-descendants given its parents. We could also reach this result using the concept of d-separation (see [Jensen & Nielsen, 2007], chapter 2, or [Korb & Nicholson, 2010], chapter 2, for further detail).

A ⊥ nonde(A) | pa(A) (2.13)
Figure 2.2: Relationships between variables in a Bayesian Network.
2.4.2. Inference
As we know, a joint probability can be expressed by conditional probabilities through the chain rule (Equation 2.6). Once we know the relationships (links/edges in the Bayesian Network) and given the Markov property, we can reduce the expression to Equation 2.14, because a variable is independent of its non-descendants given its parents.

P(Xn, ..., X1) = ∏_{k=1}^{n} P(Xk | pa(Xk)) (2.14)
The Markov property only says that a variable is independent of its non-descendants given its parents, yet in Equation 2.14 the descendants also disappear from the right-hand sides. We can do this because Bayesian Networks are DAGs, so there is always an ordering of the variables whose decomposition into conditional probabilities leaves no descendants on the right-hand side of any factor. We can see a Bayesian Network in Figure 2.3; in this example we will find such an ordering. The easiest way to build it is to start with the nodes that have no ancestors.

The first step is to apply the Chain Rule, expanding first the nodes without descendants and then their parents, and so on:
P(A, B, C, D, E) = P(E | D, C, B, A) * P(D | C, B, A) * P(C | B, A) * P(B | A) * P(A) (2.15)
Figure 2.3: Bayesian Network example with 5 variables.
Then we apply the Markov property, keeping only the parents in each conditioning set and erasing the remaining non-descendants:

P(A, B, C, D, E) = P(E | D) * P(D | C, B) * P(C | A) * P(B | A) * P(A) (2.16)

If we suppose that all the variables are binary, the biggest table we now have to calculate has size 2^3, whereas before we had a table of size 2^5. The reduction in memory cost is significant even though we only have 5 binary variables: from 2^5 = 32 to 3 × 4 (P(E | D), P(C | A) and P(B | A), with four entries each in their CPTs) plus 8 (P(D | C, B)) plus 2 (P(A)), i.e., 22 entries.
This reduction is much clearer for larger networks. Suppose a network with 50 binary variables (which can still be considered small) with the following structure: 10 variables have no parents (20 entries), 10 have 1 parent (10 × 4 = 40 entries), 10 have 2 parents (10 × 8 = 80 entries), 10 have 3 parents (10 × 16 = 160 entries) and 10 have 4 parents (10 × 32 = 320 entries). That means that in total we need to store 620 entries/values¹, which implies a huge reduction with respect to 2^50 ≃ 10^15. Indeed, the latter is not manageable. This simple example shows how important and necessary factorisation is, and how Bayesian Networks succeed in performing this factorisation using the independences the network structure is able to model.
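The entry counts above are easy to reproduce: a binary node with k binary parents needs a CPT with 2^(k+1) entries. A quick check of the 50-variable example:

```python
def cpt_entries(num_parents):
    """Entries in the CPT of a binary node with the given number of binary parents."""
    return 2 ** (num_parents + 1)

# Ten nodes with 0 parents, ten with 1 parent, ..., ten with 4 parents.
total = sum(10 * cpt_entries(k) for k in range(5))
print(total)     # 620 entries
print(2 ** 50)   # size of the full joint table, for comparison
```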
Remember that the conditional tables stored in a Bayesian Network correspond to each variable given its parents. If we have a case with specific values for the variables of the example above, like P(a, b, c, d, e), we only have to find the corresponding values in the tables and multiply them:

P(a, b, c, d, e) = P(e | d) * P(d | c, b) * P(c | a) * P(b | a) * P(a) (2.17)
Note that we will usually perform queries on Bayesian Networks: we won't directly ask for joint probabilities, but will rather want to compute posterior probabilities given some evidence or observations. Besides, these queries won't normally involve all the variables (for big structures), thanks to the independences. For that purpose, Bayes' Theorem will be used, and the computations involved will be optimized internally using inference techniques, whose description is out of the scope of this work (see [Korb & Nicholson, 2010], chapter 3).

¹These computations omit that we can derive some values, since we know that some probabilities sum up to 1.0; for example, given P(x) we can get P(x̄) as 1.0 − P(x).
2.5. Dynamic Bayesian Networks
The term dynamic refers to the temporal relationships between variables. In this approach we consider that variables have different states over time. In cases like robot localization or a meteorological problem, it can be very significant to have temporal information. For example, if the robot is in a bedroom and it moves half a meter, it is probably still in the same place, or at most in the corridor.
Static Bayesian Networks are not able to model temporal relationships between variables. One possible way to represent temporal links is to add a copy of the variables for each different time moment, changing their names so that we can identify both the variable and the time instant. We have to define a new domain of interpretation for this purpose.

Suppose our domain has n variables V = {V1, V2, ..., Vn}, each one represented as a node in the static network. If the current time step is represented by t, the previous steps are represented by t − 1, ..., t − m, where t − (i + 1) is the immediate predecessor of t − i, and the posterior steps are represented by t + 1, ..., t + r, where t + (i + 1) is the immediate successor of t + i. Each time step is called a time-slice.
Once we have the nodes, it is the turn of the arcs. Now we have two types of relationships between nodes. The first is the relationship between variables in the same time-slice; these are called intra-slice arcs, X_i^t → X_j^t. Usually the intra-slice arcs are the same in each time-slice, because the structure does not usually change over time.

The second type of relationship is called inter-slice, or temporal, arcs. This includes relationships between the same variable over time, X_i^t → X_i^{t+1}, and between different variables over time, X_i^t → X_j^{t+1}. In most cases, the value of a variable at one time step affects the value of the same variable in the next step.
There are some rules for these temporal arcs. A variable of a posterior time-slice cannot be an antecedent of a variable in a previous time-slice: it makes no sense for a previous state to be modified by a posterior value. The other rule is that a temporal arc cannot span more than a single time step. This is because the state of the world at a particular time depends only on the previous state and any action taken in it.
Figure 2.4: General structure of a Dynamic Bayesian Network.

To obtain the Conditional Probability Table for a node X_i^t we can use the same method as in Bayesian Networks, but now we have two types of arcs and, therefore, two types of parents: the inter-slice parents X_i^{t−1}, Z_1^{t−1}, ..., Z_r^{t−1} and the intra-slice parents Y_1^t, ..., Y_m^t. The CPT is:

P(X_i^t | Y_1^t, ..., Y_m^t, X_i^{t−1}, Z_1^{t−1}, ..., Z_r^{t−1})

Once we have the relationships and the CPTs, the inference process is the same as in Bayesian Networks.
2.6. Classification
This section is dedicated to statistical classification in machine learning [Mitchell, 1997]. Classification is the problem of categorizing a new input case when we have previous knowledge of the problem, based on a training set whose categories are known. If we consider that all the categories are values of a variable, the problem tries to predict the output of this variable when we know or observe the values of other variables. These other variables (input values) are called attributes, features or predictive variables. The variable to be predicted (output) is known as the Class variable; classification can be seen as a labelling task, where the possible labels are the possible values or states that the Class can take.

Algorithms that implement classification are called classifiers. The task of classification in machine learning is also known as supervised learning, a name that distinguishes it from unsupervised learning, or clustering. The adjective supervised comes from the fact that classifiers can learn from previously classified instances, and the possible values for the Class are known in advance, while in unsupervised learning the algorithm has to extract a grouping that is initially unknown. Thus, classification algorithms learn from a training
dataset where we know the value of the Class. Once a classifier is trained, it will be able to give us a value/label for the Class when a new instance or case is input; this instance has observations only for the predictive attributes, and we ask about the Class variable.

In order to evaluate the effectiveness of the classifier we use test data, already labelled but not used for training the model, for the sake of fairness and to evaluate generalization (we have to avoid overfitting²). These data contain different input cases. The test process consists of obtaining a value for the Class and comparing it with the real value of each case. In this process we obtain information such as the accuracy (or hit rate) and the confusion matrix. The hit rate represents the percentage of correctly classified instances. The confusion matrix, or error matrix, gives this information in a more detailed way: each column of the matrix represents the instances of a predicted class, while each row represents the instances of an actual class.
2.6.1. Probabilistic Classifier
A probabilistic classifier gives us a probability distribution for the Class variable when it receives an input sample. The basis of the classifier is a conditional distribution P(C | Y), where the input Y is the known part and C is the Class variable whose value we want to predict. The target of the classifier is to obtain the value of the variable C that maximizes that probability. This will be the value predicted by the classifier, as we can see in Equation 2.18.

c = argmax_c P(C = c | Y) (2.18)
Bayesian Networks can be used as probabilistic classifiers. Bayesian classifiers use Bayes' Theorem and the inference process of the networks to obtain the probabilities of the class and give us the best value, as we can see in Equation 2.19.

c = argmax_c ( P(Y | C = c) P(C = c) / P(Y) ) (2.19)

As the value of P(Y) is the same for every value c, and it is only used to normalize, we can remove this part of the operation, leaving us the next equation:

c = argmax_c ( P(Y | C = c) P(C = c) ) (2.20)

For example, given a binary class, if P(c | Y) = 0.21 and P(c̄ | Y) = 0.79, Y will be assigned to class c̄.
²Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from the trend.
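Equation 2.18 reduces, in code, to an argmax over the posterior distribution. A minimal sketch using the binary-class example above, where `not_c` plays the role of the barred value:

```python
# Posterior distribution for a binary Class, as in the example above.
posterior = {"c": 0.21, "not_c": 0.79}

# The predicted label is the class value with maximum posterior probability.
predicted = max(posterior, key=posterior.get)
print(predicted)  # not_c
```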
2.6.2. Naïve Bayes Classifier

Naïve Bayes [Domingos & Pazzani, 1997] is the simplest classifier based on Bayes' Theorem, because it assumes that all variables are independent given the Class. Naïve Bayes links the Class with all the variables: the Class is the parent of all of them. The graph structure of Naïve Bayes is like the one shown in Figure 2.5. As we can see, the CPTs of the variables are very simple, since they involve only a marginal probability for the class variable, P(C), and P(Xi | C) for the rest. The number of entries for these tables is |Ω_Xi| × |Ω_C|.
Figure 2.5: General structure for a Naïve Bayes Classifier.
This classifier is very simple and easy to implement. It reduces training time, and its results are good in some areas, like spam detection [Sahami et al., 1998]. We use Naïve Bayes to compare against the results of more complex models, since it is a good baseline to test them: if the simplest classifier obtains better results, this suggests that the networks obtained with complex methods are useless.
In the Bayesian Networks section we explained how the probability is obtained, and that is how Naïve Bayes obtains its probabilities. If the Class variable is C and the rest are X1, X2, ..., Xn, then P(X1, X2, ..., Xn | C) is obtained by Equation 2.21 and, replacing it in 2.20, we get Equation 2.22.

P(X1, X2, ..., Xn | C = c) = P(X1 | C = c) * P(X2 | C = c) * ... * P(Xn | C = c) (2.21)

c = argmax_c ( P(X1 | C = c) * P(X2 | C = c) * ... * P(Xn | C = c) * P(C = c) ) (2.22)

When learning Naïve Bayes, the structure is already fixed, which makes the process much faster, since structural learning is a complex task on which research is still active. Naïve Bayes only performs parametric learning, that is, the estimation of the values/parameters in the CPTs, which are already simple, as indicated above.
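Equation 2.22 can be sketched directly. The prior P(Room) below is taken from Table 2.2, while the conditional probabilities of seeing a fridge or a mirror in each room are made-up, illustrative numbers (they do not come from the thesis data):

```python
p_room = {"kt": 0.20, "be": 0.20, "ba": 0.15, "lr": 0.30, "cr": 0.15}
# Hypothetical P(object seen | Room) tables for the two predictive variables.
p_fridge_given_room = {"kt": 0.95, "be": 0.05, "ba": 0.05, "lr": 0.10, "cr": 0.05}
p_mirror_given_room = {"kt": 0.10, "be": 0.60, "ba": 0.90, "lr": 0.40, "cr": 0.20}

def predict(saw_fridge, saw_mirror):
    """Naive Bayes prediction (Equation 2.22): argmax over unnormalized scores."""
    scores = {}
    for room in p_room:
        pf = p_fridge_given_room[room] if saw_fridge else 1 - p_fridge_given_room[room]
        pm = p_mirror_given_room[room] if saw_mirror else 1 - p_mirror_given_room[room]
        scores[room] = pf * pm * p_room[room]   # P(X1|c) * P(X2|c) * P(c)
    return max(scores, key=scores.get)

print(predict(saw_fridge=True, saw_mirror=False))  # kt
```

Seeing a fridge and no mirror points to the kitchen; with these illustrative tables, seeing a mirror and no fridge would instead point to the bathroom.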
2.7. CaMML
As already introduced, Bayesian A.I. is a powerful framework, and Bayesian Networks (BNs) allow us to make predictions, perform classification, and study the behaviour of variables and the relationships between them in a simple way. They have been broadly used, from the mid-eighties until today, due to their double capacity: (1) representation of knowledge and uncertainty, and (2) well-established algorithms for inference. That is why we have chosen this approach.

Once this framework is accepted as a reasonable option for intelligent systems, we have to work on the construction of a particular Bayesian Network able to represent the problem domain we aim to model. One of the possibilities for BN construction is expert elicitation. However, experts do not usually know how to perform this modelling process, or they give us useless or even adverse information. We may need a complex and long process of knowledge engineering in order to obtain a good model. Another possibility is to use an algorithm able to learn the model, which involves the use of Machine Learning techniques.
2.7.1. Learning Bayesian Networks
In order to learn a network (semi)automatically from data, the first element needed is the dataset to learn from. The most common dataset format consists of a list where each row is a case. If our problem has n variables, each case in the dataset has n values that represent a record, i.e., the particular value of every variable. These values can be discrete or continuous, but the CaMML [Wallace et al., 2005] algorithm is only able to work with discrete data. In some cases a few values may be missing; we represent this with “?” or “*”. Usually the BN learner is capable of dealing with such data using specific techniques. In our case, CaMML is not able to use any of these techniques and does not accept cases with missing values. Notice this is not a critical problem, since there exist algorithms for imputing missing values [Farhangfar et al., 2008].
As long as a BN has a Class variable C, it can be used as a classifier, since we can compute P(C | X) for all states of C (see Equation 2.18), where X = {X1, ..., Xn} is the set of predictive variables. To construct a classifier, in this case a BN, we will use a training dataset. As indicated before, other datasets can be used to evaluate the performance of the learned classifier: test and validation data. Finally, the aim of a classifier is to predict the class value of a new case whose label is unknown, so that the model can automatically classify new instances. So, datasets in machine learning are initially used to learn the model, but this kind of information will also be used for future prediction, classification, validation, etc.
Algorithms for learning BNs have to provide techniques for learning the DAG structure and also mechanisms for estimating the parameters of the CPTs from data. There is one key limitation when learning BNs from observational data only: there is usually no unique BN that represents the joint distribution. More formally, two BNs in the same statistical equivalence class (SEC) can be parametrized to give an identical joint probability distribution. There is no way to distinguish between the two using only observational data (although they may be distinguished given experimental data). That is why many algorithms based on search techniques use the SEC space which, obviously, is also smaller, so the search will be more efficient [Chickering, 1995].

BN structural learning algorithms can be classified into constraint-based and metric-based. Constraint-based methods (e.g., PC [Spirtes et al., 2000], RAI [Yehezkel & Lerner, 2009]) use information about conditional independences gained by performing statistical significance tests on the data. Metric-based methods (e.g., K2 [Cooper & Herskovits, 1992], CaMML [Wallace & Korb, 1999]) search for a BN that minimizes or maximizes a metric; many different metrics have been used (e.g., K2 uses the BDe