UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
GRADO EN INGENIERÍA INFORMÁTICA
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic localization in robotics
Fernando Rubio Perona
July 2014
UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
Departamento de Sistemas Informáticos
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic localization in robotics
Author: Fernando Rubio Perona
Supervisors: María Julia Flores Gallego, Jesús Martínez Gómez
Collaborators: Ann Nicholson, Alex Black
July 2014
Abstract

This project presents a solution based on Bayesian Artificial Intelligence for the problem of semantic localization in autonomous robots. We have developed a methodology that covers the following steps: (1) Image processing and discretization for creating feature-based scene descriptors. (2) Learning of static Bayesian Networks and the Naïve Bayes classifier.
Chapter 4 explains the whole development process in detail and defines the first objective of the project. The first section defines the dataset and the second one covers the tools used. The remaining sections of this chapter represent each of the steps of the experimentation process, specifying the input and output information and the tool used at each step. We include a diagram of the different functionality of each step, in order to give a global view of the whole process.

Chapter 5 is entirely devoted to the results. In this chapter we try to achieve the two objectives related to information extraction. We will go through the different plots and tables in order to obtain information about the behaviour of the number of variables and of the DBNs.

The last chapter corresponds to the conclusions and the future work that we propose.
Chapter 2
Bayesian Artificial Intelligence
2.1. Notation
Random variables are represented by upper-case roman letters: X, Y, etc. A set of variables is written with the same letter in bold: X, Y, etc., and the variables of such a set are represented by the same letter with a subscript:

X = {X1, X2, ..., Xn} (2.1)

The set of values that a variable can take is written with the Greek character Ω and its arity by |Ω|; it usually carries a subscript indicating the variable it refers to: ΩX.

The values of a variable are represented by lower-case letters: a, b, etc., and a value a belongs to a variable X if a ∈ ΩX.

If the variable is binary, we can write one of its values with the same character as the variable in lower case, and the opposite value with the same letter with a horizontal line above it: ΩX = {x, x̄}.

Probabilities are represented by the upper-case character P. P(X) represents the probabilities of all the values that the variable X can take. For example, if a ∈ ΩX, the probability of this specific value is written P(X = a), or simply P(a).

The most common way to represent a probability distribution is a table where each value of the variable has a probability. For example, given a variable X with ΩX = {a1, a2, ..., am}, we can see in Table 2.1 how the probabilities are represented.
X      Probability
a1     P(X = a1)
a2     P(X = a2)
⋮      ⋮
am     P(X = am)
Table 2.1: Example table of probabilities
We represent the intersection between variables with a comma (,) or with ∩. This denotes the joint probability, as we will see in the corresponding section.

The conditional relationship is represented by the vertical bar ( | ). The term on the left is the hypothesis we want to know about, and the term on the right is the events we already know. An example of this notation is P(X | Y).

The symbol ⊥ represents independence between variables. Independence can also be represented by the upper-case letter I, but the expressions are written differently depending on the character used. We will see these differences later.
2.2. The probability in our solution
The solutions explained in this project, and most concepts and techniques of Bayesian A.I., are based on probabilistic inference. First of all, what is probabilistic inference? It is the ability to infer some evidence with a certain degree of probability from the observation of other evidence in a set of variables. This means that if we see some characteristic in an image, we can obtain a probability for the place where the robot is. For example, if we are in a house and we see a fridge, we can say with high probability that the robot is in the kitchen, although it could be in another room. If we see a mirror, the robot may be in a bathroom or in a bedroom with the same probability, or perhaps in another room with a lower probability.
To calculate the probability distribution for each variable we need to capture the relationships between variables. As we will see later, for this purpose we will use models able to represent those dependences and independences: Bayesian Networks [Jensen & Nielsen, 2007; Korb & Nicholson, 2010; Pearl, 1988]. The construction of these networks normally requires two phases, structure learning and parametric learning, since the second phase depends on the first one. A model can also be obtained from an expert using knowledge engineering, but this process is difficult and slow, and needs specific techniques for knowledge elicitation. Besides, it requires the availability of an expert in the domain to be modelled. Because of these difficulties, and thanks to the increasing popularity and development of Machine Learning techniques, automatic learning of Bayesian Networks from data is also possible and very commonly done [Cooper & Herskovits, 1992; Heckerman et al., 1995; Neapolitan, 2003]. We will mainly focus on this second approach. Then, we will usually calculate the probability distribution from previous experience; this experience data is called training data.
In the following subsections, we will then introduce basic concepts about probability,
which are necessary to understand how Bayesian Networks work.
2.2. THE PROBABILITY IN OUR SOLUTION 11
2.2.1. Marginal Probability
This distribution gives the probability that each value or state of a variable has of appearing, and it is calculated differently depending on the type of variable (discrete or continuous).

Discrete variables have a probability for each value, and we calculate it by counting from the data how many times the value appears and dividing by the number of training cases. As the variables we work with are random, these values can change depending on the training data, so normally we work with frequencies.

When the variables are continuous, we calculate the probabilities with an estimation approach such as a Gaussian distribution.

An example of a discrete variable could be to determine where the robot is. We have training data consisting of images, each labelled with one of five types of room, so the variable Room has five values: kt (kitchen), be (bedroom), ba (bathroom), lr (living-room) and cr (corridor). We obtain the probabilities by counting how many times each room appears in the labels and dividing by the total number of images. We can see in Table 2.2 that the robot is more likely to be in a big room like the living-room.
Room Probability
kt 0.20
be 0.20
ba 0.15
lr 0.30
cr 0.15
Table 2.2: Room Probability P (Room)
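The counting procedure just described can be sketched in a few lines of Python. The list of labels below is hypothetical (it is not the thesis dataset), chosen so that the resulting frequencies match Table 2.2:

```python
from collections import Counter

# Hypothetical room labels, one per training image (20 images in total),
# chosen so that the resulting frequencies match Table 2.2.
labels = ["lr", "kt", "be", "ba", "cr", "lr", "kt", "be", "lr", "ba",
          "cr", "lr", "kt", "be", "lr", "kt", "be", "lr", "ba", "cr"]

counts = Counter(labels)                      # occurrences of each room
total = len(labels)                           # number of training cases
p_room = {room: counts[room] / total for room in sorted(counts)}

print(p_room)
# {'ba': 0.15, 'be': 0.2, 'cr': 0.15, 'kt': 0.2, 'lr': 0.3}
```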
Now we go through all the images and count how many times we see a mirror and a fridge. Each of these variables has two values: whether we see the object or not.

But marginal probabilities alone are not enough to obtain results. As we can see in Table 2.3, if there is a mirror in the image we still do not know where the robot is, and the same holds for the fridge. But we know that if we see the fridge, the robot has a high probability of being in the kitchen, so we have to relate the Room variable with the Fridge probability.
Mirror   Probability
m̄        0.70
m        0.30

Fridge   Probability
f̄        0.90
f        0.10

Table 2.3: Probability of variable Mirror, P(Mirror), and of variable Fridge, P(Fridge)
2.2.2. Joint Probability
The union of variables is described by the joint probability distribution. It is the probability of observing several pieces of evidence at the same time. It is a probability distribution over the observed variables, and it has as many entries as there are combinations of the possible values of the variables.

In the previous example we have a mirror and a fridge; these are our variables, and each one has two values, depending on whether the object appears or not. With this, the joint probability table looks like Table 2.4. We obtain this probability by counting how many times both objects appear in the same image, how many times only one of them appears, and how many times neither does.
      m̄      m
f̄     0.61   0.29
f     0.09   0.01
Table 2.4: Joint probability for Mirror and Fridge variables P (Mirror,Fridge)
In this case there are only two variables, but the size of the table grows exponentially with the number of variables N, as we can see in Equation 2.2. That means that if we have 50 binary variables the size of the table will be 2^50 ≃ 10^15, which is impossible to calculate and store; this will be alleviated by factorisation in BNs. And now we can relate the room and the objects information, as we see in Table 2.5.

size of the table = ∏_{k=1}^{N} |Ω_k| (2.2)

      kt     be     ba     lr     cr
m̄     0.20   0.15   0.05   0.20   0.10
m     0      0.05   0.10   0.10   0.05
Table 2.5: Joint probability for Mirror and Room variables P (Room,Mirror)
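Equation 2.2 is easy to check in code. The sketch below computes the size of a full joint table from the variable arities; the 50-variable figure shows why storing the full joint is hopeless:

```python
from math import prod

def joint_table_size(arities):
    """Size of the full joint table: the product of the arities (Equation 2.2)."""
    return prod(arities)

print(joint_table_size([2, 2]))    # Mirror and Fridge: 4 entries, as in Table 2.4
print(joint_table_size([2] * 50))  # 50 binary variables: 2**50, about 10**15 entries
```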
2.2.3. Conditional Probability
We can measure the probability distribution of a variable when we know the values of other variables with the conditional probability. For example, what is the probability of being in the bedroom when we see a mirror? The conditional probability is calculated by Equation 2.3.

P(X | Y) = P(X, Y) / P(Y) (2.3)

We now answer that question with the conditional probability of being in a room when we see, or do not see, an object, in this case a mirror. We observe in Table 2.6 these
probabilities, which are obtained from the tables above. If we see a mirror, it is more likely to be in the bathroom or in the living-room than in other rooms.
      kt     be     ba     lr     cr
m̄     0.29   0.21   0.07   0.29   0.14
m     0      0.17   0.33   0.33   0.17

Table 2.6: Cond. prob. of the rooms given the value for Mirror: P(Room | Mirror)
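Table 2.6 can be reproduced from the joint distribution of Table 2.5 by applying Equation 2.3 directly. A minimal sketch, with `m`/`not_m` standing for the mirror being seen or not:

```python
# Joint distribution P(Room, Mirror) from Table 2.5
# ("not_m" is the barred value, i.e. no mirror observed).
joint = {
    "not_m": {"kt": 0.20, "be": 0.15, "ba": 0.05, "lr": 0.20, "cr": 0.10},
    "m":     {"kt": 0.00, "be": 0.05, "ba": 0.10, "lr": 0.10, "cr": 0.05},
}

def conditional(joint_row):
    """Equation 2.3: P(Room | Mirror=v) = P(Room, Mirror=v) / P(Mirror=v)."""
    p_v = sum(joint_row.values())          # marginal P(Mirror=v)
    return {room: p / p_v for room, p in joint_row.items()}

# Reproduces the m row of Table 2.6: bathroom and living-room tie at 0.33.
p_room_given_m = conditional(joint["m"])
print({room: round(p, 2) for room, p in p_room_given_m.items()})
# {'kt': 0.0, 'be': 0.17, 'ba': 0.33, 'lr': 0.33, 'cr': 0.17}
```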
2.2.4. Chain Rule
This rule allows us to calculate any joint probability distribution through conditional probabilities. If we have a set of variables X = {X1, X2, ..., Xn}, we can link the conditional probabilities with the joint probability through Equation 2.3, as seen before, and then we obtain this:
P(Xn, ..., X1) = P(Xn | Xn−1, ..., X1) * P(Xn−1, ..., X1) (2.4)

Now we repeat the process with Xn−1:

P(Xn−1, ..., X1) = P(Xn−1 | Xn−2, ..., X1) * P(Xn−2, ..., X1) (2.5)
And we repeat this until we reach P(X1). Finally, joining all these operations, we obtain the product:

P(Xn, ..., X1) = P(X1) * ∏_{k=2}^{n} P(Xk | Xk−1, ..., X1) (2.6)
In a case with three variables, like Room, Fridge and Mirror, we can use the Chain Rule to calculate their joint probability, as we can see in Equation 2.7. Notice that the ordering of the variables used to produce this chain is not arbitrary: it depends on the structure of dependences between the variables but, as we will show in subsection 2.4.1, the fact that the graph underlying a Bayesian Network is acyclic guarantees that a valid ordering can always be found.

P(Room, Fridge, Mirror) = P(Room | Fridge, Mirror) * P(Fridge | Mirror) * P(Mirror) (2.7)
2.2.5. Independence
When we work with conditional probabilities, a basic concept is independence. A variable X is independent of another variable Y when knowing Y does not affect the probability of X (2.8).

X ⊥ Y ≡ P(X | Y) = P(X) (2.8)

This is known as marginal independence, and it can also be represented as I(X | ∅ | Y). An example of independence in our case would be seeing a lamp, with the probabilities of Table 2.7. The lamp does not give us information about where the robot is: the probabilities are the same whether we see a lamp or not.
      kt     be     ba     lr     cr
l̄     0.20   0.20   0.15   0.30   0.15
l     0.20   0.20   0.15   0.30   0.15

Table 2.7: Cond. prob. of the rooms given the value for Lamp: P(Room | Lamp)
It is not easy to find marginal independences between variables, so we need another kind of independence. Conditional independence occurs when observing an event Y does not affect the conditional probability P(X | Z):

X ⊥ Y | Z ≡ P(X | Y, Z) = P(X | Z) (2.9)

As before, conditional independence can also be represented as I(X | Z | Y). In this example we see a desk and a TV, and we have Tables 2.8 and 2.9. We can see how knowing the value of TV does not affect the conditional probability of Room given Desk.
      kt     be     ba     lr     cr
d̄     0.28   0.07   0.25   0.21   0.19
d     0.09   0.37   0.02   0.42   0.10

Table 2.8: Cond. prob. of the rooms given the value for Desk: P(Room | Desk)
         kt     be     ba     lr     cr
d̄, t̄v   0.28   0.07   0.25   0.21   0.19
d̄, tv   0.28   0.07   0.25   0.21   0.19
d, t̄v   0.09   0.37   0.02   0.42   0.10
d, tv   0.09   0.37   0.02   0.42   0.10

Table 2.9: Cond. prob. of the rooms given Desk and TV: P(Room | Desk, TV)
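The conditional independence shown in Tables 2.8 and 2.9 can be verified mechanically: adding TV to the conditioning side leaves every row unchanged. A small sketch, with the values copied from the tables and each row listed in the order kt, be, ba, lr, cr:

```python
# P(Room | Desk) from Table 2.8; each row lists [kt, be, ba, lr, cr].
p_room_given_desk = {
    "not_d": [0.28, 0.07, 0.25, 0.21, 0.19],
    "d":     [0.09, 0.37, 0.02, 0.42, 0.10],
}

# P(Room | Desk, TV) from Table 2.9.
p_room_given_desk_tv = {
    ("not_d", "not_tv"): [0.28, 0.07, 0.25, 0.21, 0.19],
    ("not_d", "tv"):     [0.28, 0.07, 0.25, 0.21, 0.19],
    ("d", "not_tv"):     [0.09, 0.37, 0.02, 0.42, 0.10],
    ("d", "tv"):         [0.09, 0.37, 0.02, 0.42, 0.10],
}

# Room is independent of TV given Desk iff every (desk, tv) row
# equals the corresponding desk-only row.
independent = all(
    row == p_room_given_desk[desk]
    for (desk, tv), row in p_room_given_desk_tv.items()
)
print(independent)  # True
```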
2.3. Bayes’ Theorem
Bayes' Theorem is one of the most important formulas in probability theory. This theorem, formulated by Reverend Thomas Bayes, is the result of the mathematical manipulation of conditional probabilities. If we have P(X | Y) and P(Y | X), we can express both in terms of the same joint probability, as we see in Equations 2.10 and 2.11.

P(X | Y) = P(Y, X) / P(Y) (2.10)

P(Y | X) = P(Y, X) / P(X) (2.11)

Then we equate both equations and solve for one of the conditional probabilities, obtaining the Bayes' Theorem of Equation 2.12.

P(X | Y) = P(Y | X) P(X) / P(Y) (2.12)

It asserts that the probability of a hypothesis X conditioned on Y is equal to its likelihood P(Y | X) multiplied by the prior probability P(X), normalized by dividing by P(Y) to obtain a conditional probability that sums to 1.
We will see the importance of Bayes' Theorem with the next example. Suppose that we have no access to the training data, and that the information we have is given by an expert. He tells us the probabilities of seeing the different objects given the room the robot is in, and the probabilities the robot has of being in each of these rooms (Table 2.2). An example of this is the desk information represented in Table 2.10.
      d̄      d
kt    0.80   0.20
be    0.20   0.80
ba    0.95   0.05
lr    0.40   0.60
cr    0.70   0.30

Table 2.10: Conditional probability of seeing a desk given the value for the room,
P(Desk | Room)
These probabilities are much easier to obtain than the probabilities of Table 2.8. That is why Bayes' Theorem is so important: it allows us to link the probabilities of seeing an object in a room with the probabilities of being in a room when we see an object.
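As a check, the expert's table P(Desk | Room) from Table 2.10, together with the prior P(Room) from Table 2.2, recovers the d row of Table 2.8 via Bayes' Theorem. A sketch:

```python
# P(Desk = d | Room), the d column of Table 2.10, and P(Room) from Table 2.2.
p_d_given_room = {"kt": 0.20, "be": 0.80, "ba": 0.05, "lr": 0.60, "cr": 0.30}
p_room = {"kt": 0.20, "be": 0.20, "ba": 0.15, "lr": 0.30, "cr": 0.15}

# Bayes' Theorem: P(Room | d) = P(d | Room) * P(Room) / P(d).
unnorm = {r: p_d_given_room[r] * p_room[r] for r in p_room}
p_d = sum(unnorm.values())                 # the normalizer P(Desk = d)
p_room_given_d = {r: v / p_d for r, v in unnorm.items()}

print({r: round(p, 2) for r, p in p_room_given_d.items()})
# {'kt': 0.09, 'be': 0.37, 'ba': 0.02, 'lr': 0.42, 'cr': 0.1}
```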
Another example of the importance of Bayes' Theorem is its use in the field of medicine: it links the probability of the symptoms given the disease with the probability of having a certain disease when we know the symptoms.
2.4. Bayesian Networks
A Bayesian Network is a Directed Acyclic Graph (DAG) where each node has an associated conditional probability distribution.

The nodes are random variables, and the directed arcs represent a direct dependency between them. If we have the link X → Y, it means that the variable X is a parent of Y. The conditional probability tables (CPTs) contain the conditional probability of the node's variable given its parents, P(X | pa(X)). In Figure 2.1 we can see an example of a Bayesian Network with the probabilities related to each node.
Figure 2.1: Bayesian Network with 3 variables and the probabilities associated to each node, in the form P(X | pa(X)).
There are many methods to obtain these relationships and probability tables. The main way to get them is to analyse a set of training data with different techniques. We can also add links previously established by an expert to this process, or we can use a hybrid approach where the learning algorithms accept expert information as priors which can be input into the algorithm. We talk more about these methods in the CaMML section.
2.4.1. The Markov Property
In Figure 2.2 we see the different relationships between a variable A and the rest of the variables in a graph representing a Bayesian Network. Variables in the green area, labelled as C, are its parents, represented by pa(A). Those in the blue zone (labelled as B) are the non-descendants of the variable A excluding its parents, and we identify them with nonde(A). And those in red are its descendants, denoted de(A). The Markov property says that a variable is conditionally independent of its non-descendants given its parents. We could also reach this result using the concept of d-separation (see [Jensen & Nielsen, 2007], chapter 2, or [Korb & Nicholson, 2010], chapter 2, for further detail).

A ⊥ nonde(A) | pa(A) (2.13)
Figure 2.2: Relationships between variables in a Bayesian Network.
2.4.2. Inference
As we know, a joint probability can be expressed by conditional probabilities through the chain rule (Equation 2.6). Once we know the relationships (links/edges in the Bayesian Network) and given the Markov property, we can reduce the expression to Equation 2.14, because a variable is independent of its non-descendants given its parents.

P(Xn, ..., X1) = ∏_{k=1}^{n} P(Xk | pa(Xk)) (2.14)
The Markov property only says that a variable is independent of its non-descendants given its parents, yet in Equation 2.14 the descendants also disappear from the right-hand sides. We can do this because Bayesian Networks are DAGs, so there is always an ordering of the variables whose decomposition into conditional probabilities leaves no descendants on the right-hand side of any factor. We can see a Bayesian Network in Figure 2.3; in this example we will find such an ordering. The easiest way to build it is to start with the nodes that have no ancestors.

The first step is to apply the Chain Rule, expanding first the nodes without descendants and then their parents, and so on:
P(A, B, C, D, E) = P(E | D, C, B, A) * P(D | C, B, A) * P(C | B, A) * P(B | A) * P(A) (2.15)
Figure 2.3: Bayesian Network example with 5 variables.
Then we apply the Markov property, keeping only the parents in each conditioning set and erasing the remaining non-descendants:

P(A, B, C, D, E) = P(E | D) * P(D | C, B) * P(C | A) * P(B | A) * P(A) (2.16)

If we suppose that all the variables are binary, the biggest table we now have to calculate has size 2^3, whereas before we had a table of size 2^5. The reduction in memory cost is significant even though we only have 5 binary variables: from 2^5 = 32 to 3 × 4 (P(E | D), P(C | A) and P(B | A), with four entries each in their CPTs) plus 8 (P(D | C, B)) plus 2 (P(A)), i.e., 22 entries.
This reduction is much clearer for larger networks. Suppose a network with 50 binary variables (which can still be considered small) with the following structure: 10 variables have no parents (20 entries), 10 have 1 parent (10 × 4 = 40 entries), 10 have 2 parents (10 × 8 = 80 entries), 10 have 3 parents (10 × 16 = 160 entries) and 10 have 4 parents (10 × 32 = 320 entries). That means that in total we need to store 620 entries/values¹, which implies a huge reduction with respect to 2^50 ≃ 10^15. Indeed, the latter is not manageable. This simple example shows how important and necessary factorisation is, and how Bayesian Networks succeed in performing this factorisation using the independences the network structure is able to model.
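The entry counts above are easy to reproduce: a binary node with k binary parents needs a CPT with 2^(k+1) entries. A quick check of the 50-variable example:

```python
def cpt_entries(num_parents):
    """Entries in the CPT of a binary node with the given number of binary parents."""
    return 2 ** (num_parents + 1)

# Ten nodes with 0 parents, ten with 1 parent, ..., ten with 4 parents.
total = sum(10 * cpt_entries(k) for k in range(5))
print(total)     # 620 entries
print(2 ** 50)   # size of the full joint table, for comparison
```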
Remember that the conditional tables stored in a Bayesian Network correspond to each variable given its parents. If we have a case with specific values for the variables of the example above, like P(a, b, c, d, e), we only have to find the corresponding values in the tables and multiply them:

P(a, b, c, d, e) = P(e | d) * P(d | c, b) * P(c | a) * P(b | a) * P(a) (2.17)
Note that we will usually perform queries on Bayesian Networks: we won't directly ask for joint probabilities, but will rather want to compute posterior probabilities given some evidence or observations. Besides, these queries won't normally involve all the variables (for big structures), thanks to the independences. For that purpose, Bayes' Theorem will be used, and the computations involved will be optimized internally using inference techniques, whose description is out of the scope of this work (see [Korb & Nicholson, 2010], chapter 3).

¹These computations omit that we can derive some values, since we know that some probabilities sum up to 1.0; for example, given P(x) we can get P(x̄) as 1.0 − P(x).
2.5. Dynamic Bayesian Networks
The term dynamic refers to the temporal relationships between variables. In this approach we consider that variables have different states over time. In cases like robot localization or a meteorological problem, it can be very significant to have temporal information. For example, if the robot is in a bedroom and it moves half a meter, it is probably still in the same place, or at most in the corridor.
Static Bayesian Networks are not able to model temporal relationships between variables. One possible way to represent temporal links is to add a copy of the variables for each different time moment, changing their names so that we can identify both the variable and the time instant. We have to define a new domain of interpretation for this purpose.

Suppose our domain has n variables V = {V1, V2, ..., Vn}, each one represented as a node in the static network. If the current time step is represented by t, the previous steps are represented by t − 1, ..., t − m, where t − (i + 1) is the immediate predecessor of t − i, and the posterior steps are represented by t + 1, ..., t + r, where t + (i + 1) is the immediate successor of t + i. Each time step is called a time-slice.
Once we have the nodes, it is the turn of the arcs. Now we have two types of relationships between nodes. The first is the relationship between variables in the same time-slice; these are called intra-slice arcs, X_i^t → X_j^t. Usually the intra-slice arcs are the same in each time-slice, because the structure does not usually change over time.

The second type of relationship is called inter-slice, or temporal, arcs. This includes relationships between the same variable over time, X_i^t → X_i^{t+1}, and between different variables over time, X_i^t → X_j^{t+1}. In most cases, the value of a variable at one time step affects the value of the same variable in the next step.
There are some rules for these temporal arcs. A variable of a posterior time-slice cannot be an antecedent of a variable in a previous time-slice: it makes no sense for a previous state to be modified by a posterior value. The other rule is that a temporal arc cannot span more than a single time step. This is because the state of the world at a particular time depends only on the previous state and any action taken in it.
Figure 2.4: General structure of a Dynamic Bayesian Network.

To obtain the Conditional Probability Table for a node X_i^t we can use the same method as in Bayesian Networks, but now we have two types of arcs and, therefore, two types of parents: the inter-slice parents X_i^{t−1}, Z_1^{t−1}, ..., Z_r^{t−1} and the intra-slice parents Y_1^t, ..., Y_m^t. The CPT is:

P(X_i^t | Y_1^t, ..., Y_m^t, X_i^{t−1}, Z_1^{t−1}, ..., Z_r^{t−1})

Once we have the relationships and the CPTs, the inference process is the same as in Bayesian Networks.
2.6. Classification
This section is dedicated to statistical classification in machine learning [Mitchell, 1997]. Classification is the problem of categorizing a new input case when we have previous knowledge of the problem, based on a training set whose categories are known. If we consider that all the categories are values of a variable, the problem tries to predict the output of this variable when we know or observe the values of other variables. These other variables (input values) are called attributes, features or predictive variables. The variable to be predicted (output) is known as the Class variable; classification can be seen as a labelling task, where the possible labels are the possible values or states that the Class can take.

Algorithms that implement classification are called classifiers. The task of classification in machine learning is also known as supervised learning, a name that distinguishes it from unsupervised learning, or clustering. The adjective supervised comes from the fact that classifiers can learn from previously classified instances, and the possible values for the Class are known in advance, while in unsupervised learning the algorithm has to extract a grouping that is initially unknown. Thus, classification algorithms learn from a training
dataset where we know the value of the Class. Once a classifier is trained, it will be able to give us a value/label for the Class when a new instance or case is input; this instance has observations only for the predictive attributes, and we ask about the Class variable.

In order to evaluate the effectiveness of the classifier we use test data, already labelled but not used for training the model, for the sake of fairness and to evaluate generalization (we have to avoid overfitting²). These data contain different input cases. The test process consists of obtaining a value for the Class and comparing it with the real value of each case. In this process we obtain information such as the accuracy (or hit rate) and the confusion matrix. The hit rate represents the percentage of correctly classified instances. The confusion matrix, or error matrix, gives this information in a more detailed way: each column of the matrix represents the instances of a predicted class, while each row represents the instances of an actual class.
2.6.1. Probabilistic Classifier
A probabilistic classifier gives us a probability distribution for the Class variable when it receives an input sample. The basis of the classifier is a conditional distribution P(C | Y), where the input Y is the known part and C is the Class variable whose value we want to predict. The target of the classifier is to obtain the value of the variable C that maximizes that probability. This will be the value predicted by the classifier, as we can see in Equation 2.18.

c = argmax_c P(C = c | Y) (2.18)
Bayesian Networks can be used as probabilistic classifiers. Bayesian classifiers use Bayes' Theorem and the inference process of the networks to obtain the probabilities of the class and give us the best value, as we can see in Equation 2.19.

c = argmax_c ( P(Y | C = c) P(C = c) / P(Y) ) (2.19)

As the value of P(Y) is the same for every value c, and it is only used to normalize, we can remove this part of the operation, leaving us the next equation:

c = argmax_c ( P(Y | C = c) P(C = c) ) (2.20)

For example, given a binary class, if P(c | Y) = 0.21 and P(c̄ | Y) = 0.79, Y will be assigned to class c̄.
²Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from the trend.
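Equation 2.18 reduces, in code, to an argmax over the posterior distribution. A minimal sketch using the binary-class example above, where `not_c` plays the role of the barred value:

```python
# Posterior distribution for a binary Class, as in the example above.
posterior = {"c": 0.21, "not_c": 0.79}

# The predicted label is the class value with maximum posterior probability.
predicted = max(posterior, key=posterior.get)
print(predicted)  # not_c
```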
2.6.2. Naïve Bayes Classifier

Naïve Bayes [Domingos & Pazzani, 1997] is the simplest classifier based on Bayes' Theorem, because it assumes that all variables are independent given the Class. Naïve Bayes links the Class with all the variables: the Class is the parent of all of them. The graph structure of Naïve Bayes is like the one shown in Figure 2.5. As we can see, the CPTs of the variables are very simple, since they involve only a marginal probability for the class variable, P(C), and P(Xi | C) for the rest. The number of entries for these tables is |Ω_Xi| × |Ω_C|.
Figure 2.5: General structure for a Naïve Bayes Classifier.
This classifier is very simple and easy to implement. It reduces training time, and its results are good in some areas, like spam detection [Sahami et al., 1998]. We use Naïve Bayes to compare against the results of more complex models, since it is a good baseline to test them: if the simplest classifier obtains better results, this suggests that the networks obtained with complex methods are useless.
In the Bayesian Networks section we explained how the probability is obtained, and that is how Naïve Bayes obtains its probabilities. If the Class variable is C and the rest are X1, X2, ..., Xn, then P(X1, X2, ..., Xn | C) is obtained by Equation 2.21 and, replacing it in 2.20, we get Equation 2.22.

P(X1, X2, ..., Xn | C = c) = P(X1 | C = c) * P(X2 | C = c) * ... * P(Xn | C = c) (2.21)

c = argmax_c ( P(X1 | C = c) * P(X2 | C = c) * ... * P(Xn | C = c) * P(C = c) ) (2.22)

When learning Naïve Bayes, the structure is already fixed, which makes the process much faster, since structural learning is a complex task on which research is still active. Naïve Bayes only performs parametric learning, that is, the estimation of the values/parameters in the CPTs, which are already simple, as indicated above.
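Equation 2.22 can be sketched directly. The prior P(Room) below is taken from Table 2.2, while the conditional probabilities of seeing a fridge or a mirror in each room are made-up, illustrative numbers (they do not come from the thesis data):

```python
p_room = {"kt": 0.20, "be": 0.20, "ba": 0.15, "lr": 0.30, "cr": 0.15}
# Hypothetical P(object seen | Room) tables for the two predictive variables.
p_fridge_given_room = {"kt": 0.95, "be": 0.05, "ba": 0.05, "lr": 0.10, "cr": 0.05}
p_mirror_given_room = {"kt": 0.10, "be": 0.60, "ba": 0.90, "lr": 0.40, "cr": 0.20}

def predict(saw_fridge, saw_mirror):
    """Naive Bayes prediction (Equation 2.22): argmax over unnormalized scores."""
    scores = {}
    for room in p_room:
        pf = p_fridge_given_room[room] if saw_fridge else 1 - p_fridge_given_room[room]
        pm = p_mirror_given_room[room] if saw_mirror else 1 - p_mirror_given_room[room]
        scores[room] = pf * pm * p_room[room]   # P(X1|c) * P(X2|c) * P(c)
    return max(scores, key=scores.get)

print(predict(saw_fridge=True, saw_mirror=False))  # kt
```

Seeing a fridge and no mirror points to the kitchen; with these illustrative tables, seeing a mirror and no fridge would instead point to the bathroom.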
2.7. CaMML
As already introduced, Bayesian A.I. is a powerful framework, and Bayesian Networks (BNs) allow us to make predictions, perform classification, and study the behaviour of variables and the relationships between them in a simple way. They have been broadly used, from the mid-eighties until today, due to their double capacity: (1) representation of knowledge and uncertainty, and (2) well-established algorithms for inference. That is why we have chosen this approach.

Once this framework is accepted as a reasonable option for intelligent systems, we have to work on the construction of a particular Bayesian Network able to represent the problem domain we aim to model. One of the possibilities for BN construction is expert elicitation. However, experts do not usually know how to perform this modelling process, or they give us useless or even adverse information. We may need a complex and long process of knowledge engineering in order to obtain a good model. Another possibility is to use an algorithm able to learn the model, which involves the use of Machine Learning techniques.
2.7.1. Learning Bayesian Networks
In order to learn a network (semi)automatically from data, the first element needed is the dataset to learn from. The most common dataset format consists of a list where each row is a case. If our problem has n variables, each case in the dataset has n values that represent a record, i.e., the particular value of every variable. These values can be discrete or continuous, but the CaMML [Wallace et al., 2005] algorithm is only able to work with discrete data. In some cases a few values may be missing; we represent this with “?” or “*”. Usually the BN learner is capable of dealing with such data using specific techniques. In our case, CaMML is not able to use any of these techniques and does not accept cases with missing values. Notice this is not a critical problem, since there exist algorithms for imputing missing values [Farhangfar et al., 2008].
As long as a BN has a Class variable C, it can be used as a classifier, since we can compute P(C | X) for all states of C (see Equation 2.18), where X = {X1, ..., Xn} is the set of predictive variables. To construct a classifier, in this case a BN, we will use a training dataset. As indicated before, other datasets can be used to evaluate the performance of the learned classifier: test and validation data. Finally, the aim of a classifier is to predict the class value of a new case whose label is unknown, so that the model can automatically classify new instances. So, datasets in machine learning are initially used to learn the model, but this kind of information will also be used for future prediction, classification, validation, etc.
Algorithms for learning BNs have to provide techniques for learning the DAG structure and also mechanisms for estimating the parameters of the CPTs from data. There is one key limitation when learning BNs from observational data only: there is usually no unique BN that represents the joint distribution. More formally, two BNs in the same statistical equivalence class (SEC) can be parametrized to give an identical joint probability distribution. There is no way to distinguish between the two using only observational data (although they may be distinguished given experimental data). That is why many algorithms based on search techniques use the SEC space which, obviously, is also smaller, so the search will be more efficient [Chickering, 1995].

BN structural learning algorithms can be classified into constraint-based and metric-based. Constraint-based methods (e.g., PC [Spirtes et al., 2000], RAI [Yehezkel & Lerner, 2009]) use information about conditional independences gained by performing statistical significance tests on the data. Metric-based methods (e.g., K2 [Cooper & Herskovits, 1992], CaMML [Wallace & Korb, 1999]) search for a BN that minimizes or maximizes a metric; many different metrics have been used (e.g., K2 uses the BDe