Page 1
© Chair and Institute of Industrial Engineering and Ergonomics, RWTH Aachen University
Simulation of Discrete Event Systems
Unit 10 and 11
Bayesian Networks and Dynamic Bayesian Networks
Winter Term 2017/2018
Prof. Dr.-Ing. Dipl.-Wirt.-Ing. Sven Tackenberg
Benedikt Andrew Latos M.Sc.RWTH
Chair and Institute of Industrial Engineering and Ergonomics
RWTH Aachen University
Bergdriesch 27
52062 Aachen
phone: 0241 80 99 440
email: [email protected]
Page 2
Contents
1. Introduction
2. Background
- Bayes theorem and rules of probability
- Maximum a posteriori hypothesis
- Bayesian methodology to calculate posterior distributions
3. Bayesian networks
- Approach
- Definition
- Inference in simple Bayesian networks
4. Introduction to Dynamic Bayesian networks
5. Formalism of Dynamic Bayesian Networks
Page 3
(Figure: classification of system models) model → static vs. dynamic → time-invariant vs. time-varying → linear vs. nonlinear → continuous states vs. discrete states → time-driven vs. event-driven → deterministic vs. stochastic → continuous-time vs. discrete-time
Focus of lecture and exercise: dynamic, discrete-state, event-driven, stochastic models
Page 4
1. Introduction
Page 5
What are Bayesian networks helpful for?
(Figure: cycle Observe → Decide → Learn)
Experts are persons who need specific expertise to process their tasks.
Typical procedure of experts:
Decisions are made based on observations
Decisions lead to actions
An action causes good or bad results
The results lead to learning by the expert
Experts often have to make decisions based on incomplete and conflicting information.
The best decision is, in general, the one which minimizes the risk!
Bayesian networks are used to build up an expert system.
Page 6
Motivation
! What is the Bayesian methodology used for?
The Bayesian methodology is a statistical approach to modeling and simulating discrete-event systems under uncertainty.
! What is the concept of the Bayesian methodology?
The basic assumption is that the state variables can be represented by probability mass functions (discrete variables) or probability density functions (continuous variables).
! Why is it worth considering the Bayesian methodology?
The Bayes theorem and the associated rules of probability are a consistent and powerful basis for algorithms manipulating probability mass functions and probability density functions directly. Based on the Bayes theorem, conclusions can be drawn to identify optimal decisions.
Page 7
Think about…
… if you see that there are clouds, what is the probability that there will soon be rain? → p(rain | clouds)
… if you know that it is raining, because you hear it patter on the roof, what is the probability that there are clouds? → p(clouds | rain)
Is p(rain | clouds) equal to p(clouds | rain)?
Page 8
Repetition of relevant definitions and formulas of probability theory
Probability of event A: $P(A)$
Probability of event A under the condition of event B: $P(A|B)$
Bayes' formula (theoretical basis of Bayesian networks):
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
This formula enables the conversion of the probability of event A under the condition of event B into the probability of event B under the condition of event A.
Formula of the total probability:
$$P(A) = \sum_i P(A|B_i)\,P(B_i)$$
The absolute probability of A can be calculated based on the conditional probabilities of A.
Page 9
Rules of probability
1. Product rule: The joint probability of A and B is:
$$P(A, B) = P(B|A)\,P(A) = P(A|B)\,P(B)$$
(B conditioned on A, and A conditioned on B, respectively)
2. Independence: The random variables A and B are independent if the joint probability distribution can be factorized as:
$$P(A, B) = P(A)\,P(B)$$
3. Sum rule: If the hypotheses B1, ..., Bn are mutually exclusive and therefore form a partition of the set B, the marginal likelihood of the data is:
$$P(A) = \sum_i P(A, B_i) = \sum_i P(A|B_i)\,P(B_i)$$
Hence, the Bayes theorem can be expanded:
$$P(B_i|A) = \frac{P(A|B_i)\,P(B_i)}{\sum_{i'} P(A|B_{i'})\,P(B_{i'})}$$
Note: in the context of the Bayesian methodology, the random variables A and B are named D and h.
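These rules are easy to check numerically. The following minimal Python sketch (with made-up illustrative numbers, not taken from the lecture) verifies the sum rule and the expanded Bayes theorem for a two-hypothesis partition:

```python
# Minimal numeric check of the sum rule and the expanded Bayes theorem
# (hypothetical numbers chosen for illustration only).
P_B = {True: 0.3, False: 0.7}          # P(B_i), a partition with two hypotheses
P_A_given_B = {True: 0.9, False: 0.2}  # P(A | B_i)

# Sum rule: P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(P_A_given_B[b] * P_B[b] for b in P_B)

# Expanded Bayes theorem: P(B_i | A) = P(A | B_i) P(B_i) / P(A)
P_B_given_A = {b: P_A_given_B[b] * P_B[b] / P_A for b in P_B}

print(P_A)                        # 0.9*0.3 + 0.2*0.7 = 0.41
print(P_B_given_A[True])          # 0.27 / 0.41 ≈ 0.6585
print(sum(P_B_given_A.values()))  # the posterior sums to 1
```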
Page 10
Causal networks – Introduction
Causal networks are a precursor of Bayesian networks!
They are a formalism to describe causal dependence within given situations, consisting of:
A set of variables
Each variable can have different (finitely or infinitely many) states
A set of directed arcs
Each variable must be in one of the defined states, but the current state may be unknown!
A → B: a state of variable A directly causes the occurrence of states of variable B
Page 11
Example of a causal network
Causal network (figure): W → C; C → K; C → M; D → K
W (Winter): {true, false}
C (Slippery roads): {true, false}
D (Klaus drank alcohol): {true, false}
K (Klaus has an accident): {true, false}
M (Mike has an accident): {true, false}
The formalism describes causal dependence within the given situation, consisting of:
Season of the year: variable W | states {true, false} | has a significant impact on the condition of the street
Condition of the street: variable C | states {true, false} | describes the slipperiness of the street and has a significant impact on the risk of an accident of Klaus (K) or Mike (M)
Occurrence of an accident: variables K and M | states {true, false} | describe the occurrence of an accident of Klaus (K) or Mike (M)
Condition of Klaus: variable D | states {true, false} | describes whether Klaus has drunk alcohol
Page 12
Dependency and conditional dependency
Two variables A and B of a causal network are designated as dependent if the probabilities of the states of variable A depend on the state of variable B and vice versa:
$$P(A, B) \neq P(A)\,P(B)$$
Two variables A and B of a causal network are designated as conditionally dependent if A and B are dependent for specific states Z and independent for all other states $\bar{Z}$:
$$P(A, B \mid Z) \neq P(A \mid Z)\,P(B \mid Z) \quad\text{and}\quad P(A, B \mid \bar{Z}) = P(A \mid \bar{Z})\,P(B \mid \bar{Z})$$
Page 13
Dependencies (1/2)
Serial dependency: W → C → M
W (Winter): {true, false}
C (Slippery roads): {true, false}
M (Mike has an accident): {true, false}
Variables W and M are independent if the condition of the road C is known.
If the condition of the street is known, the season has no impact on the probability of an accident.
Branch (diverging connection): K ← C → M
C (Slippery roads): {true, false}
K (Klaus has an accident): {true, false}
M (Mike has an accident): {true, false}
Variables K and M are independent if the condition of the road C is known.
If K has an accident and the condition of the street is unknown, the probability that the street is slippery increases. Consequently, the probability of an accident of M increases as well.
Page 14
Dependencies (2/2)
Merge (converging connection): D → K ← C
D (Klaus drank alcohol): {true, false}
C (Slippery roads): {true, false}
K (Klaus has an accident): {true, false}
Variables D and C become dependent on each other if the state of variable K is known.
If Klaus (K) has an accident and the street is not slippery, then the probability that he has drunk alcohol increases.
Page 15
2. Background
Page 16
Example: Diagnosis of scarce faults
An X-ray test of a track is performed. Two cases are possible:
The object has hairline cracks
The object has no hairline cracks
If the object has hairline cracks, the measurement indicates a crack in 98% of the cases.
If the object has no hairline cracks, the measurement indicates no crack in 97% of the cases.
Hairline cracks occur in only 0.8% of the produced tracks.
? Calculate the probability that, given a measurement indicating hairline cracks, the track really has cracks.
Page 17
Bayes theorem
The Bayes theorem goes back to the seminal work of the English reverend Thomas Bayes in the 18th century on games of chance.
To answer this question, the Bayes theorem is used.
Formula:
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
P(h): a priori probability of a hypothesis h (or a model), representing the initial degree of belief
P(D): a priori probability of the data D (observations)
P(h|D): a posteriori probability of hypothesis h under the condition of given data D
P(D|h): probability of data D under the condition of hypothesis h
Its systematic application is known as the Bayesian methodology.
Two meanings of probability:
Frequencies of outcomes in random experiments, e.g. repeated rolling of a die
Degrees of belief in propositions that do not necessarily involve random experiments, e.g. the probability that a certain production machine will fail, given the evidence of a poor surface quality of the workpiece
Page 18
Example: Diagnosis of scarce faults
If the object has hairline cracks, the measurement indicates a crack in 98% of the cases.
If the object has no hairline cracks, the measurement indicates no crack in 97% of the cases.
Hairline cracks occur in only 0.8% of the produced tracks.
? Calculate the probability that, given a measurement indicating hairline cracks, there really are cracks.
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
P(h): a priori probability of a hypothesis h (or a model), representing the initial degree of belief
P(D): a priori probability of the data D (observations)
P(h|D): a posteriori probability of hypothesis h under the condition of given data D
P(D|h): probability of data D under the condition of hypothesis h
Applied to the example (⊕: the data shows a crack; scrap: the track has a crack):
$$P(\mathrm{scrap} \mid \oplus) = \frac{P(\oplus \mid \mathrm{scrap})\,P(\mathrm{scrap})}{P(\oplus)}$$
P(scrap | ⊕): track has a crack, given the data shows a crack
P(⊕ | scrap): data shows a crack, given the track has a crack
P(scrap): track has a crack
P(⊕): probability that the data shows a crack
Page 19
Example: Diagnosis of scarce faults
$$P(\mathrm{scrap} \mid \oplus) = \frac{P(\oplus \mid \mathrm{scrap})\,P(\mathrm{scrap})}{P(\oplus)}$$
Given values (¬ means "not"):
$P(\mathrm{scrap}) = 0.008$, hence $P(\neg\mathrm{scrap}) = 0.992$
$P(\oplus \mid \mathrm{scrap}) = 0.98$, hence $P(\ominus \mid \mathrm{scrap}) = 0.02$
$P(\ominus \mid \neg\mathrm{scrap}) = 0.97$, hence $P(\oplus \mid \neg\mathrm{scrap}) = 0.03$
Auxiliary calculation (probability that the data shows a crack):
$$P(\oplus) = P(\oplus \mid \mathrm{scrap})\,P(\mathrm{scrap}) + P(\oplus \mid \neg\mathrm{scrap})\,P(\neg\mathrm{scrap}) = 0.98 \cdot 0.008 + 0.03 \cdot 0.992 = 0.0376$$
$$P(\mathrm{scrap} \mid \oplus) = \frac{0.98 \cdot 0.008}{0.0376} \approx 0.21$$
The probability that a positively tested track also has hairline cracks is only 21%!
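For reference, the same calculation as a short Python sketch (values as on this slide):

```python
# Bayes theorem applied to the hairline-crack example.
p_scrap = 0.008            # P(scrap): prior probability of a crack
p_pos_given_scrap = 0.98   # P(+ | scrap): test detects an existing crack
p_pos_given_ok = 0.03      # P(+ | not scrap): false positive rate

# Total probability of a positive test result
p_pos = p_pos_given_scrap * p_scrap + p_pos_given_ok * (1 - p_scrap)

# Posterior: probability of a real crack given a positive test
p_scrap_given_pos = p_pos_given_scrap * p_scrap / p_pos

print(round(p_pos, 4))             # 0.0376
print(round(p_scrap_given_pos, 2)) # 0.21
```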
Page 20
Bayesian methodology
The objective function for Bayesian parameter estimation is the most likely hypothesis given the observations. The hypothesis hMAP representing the maximum of the probability mass is called the maximum a posteriori hypothesis:
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$
• The choice of P(h) and P(D|h) represents the a priori knowledge and assumptions of the modeler concerning the application domain.
• The hypotheses are regarded as functions of the observations, which can be adapted iteratively to the state of knowledge of an observer.
• If all hypotheses have the same a priori probability, the equation above can be simplified further and only the term P(D|h) has to be maximized.
Each hypothesis maximizing P(D|h) is called the maximum likelihood hypothesis (hML):
$$h_{ML} = \arg\max_{h \in H} P(D|h)$$
Page 21
Example of Bayesian methodology (I)
Workpieces of only one type are stored in a pallet cage. Two cases are possible:
A produced workpiece is faultless (index g for "good")
A produced workpiece is defective (index b for "bad")
Due to a new manufacturing process, the prior probability distribution of the frequency of faultless and defective workpieces is unknown.
? Calculate the posterior distribution of the proportion of faultless workpieces step by step (over the produced workpieces) on the basis of the Bayesian methodology.
The input data are a sample of N workpieces, randomly drawn from the line!
The workpieces in the sample are tested independently!
Page 22
Example of Bayesian methodology (II)
Workpieces of only one type are stored in a pallet cage.
A produced workpiece is faultless (index g for "good")
A produced workpiece is defective (index b for "bad")
The proportions to be estimated under hypothesis h on the basis of the sample of size N are:
$$h = (\hat{p}_g, \hat{p}_b) = (\hat{p}_g,\, 1 - \hat{p}_g)$$ with $\hat{p}_g$: estimated proportion of "good" workpieces
The properties of the sample can be described sufficiently by the following aggregated quantities:
$n_g$: frequency of "good" workpieces after N tests, and $n_b = N - n_g$: frequency of "bad" workpieces
The probability of observing exactly $n_g$ faultless workpieces in the sample follows the binomial distribution.
Page 23
Binomial distribution
The binomial distribution is probably the most important discrete distribution.
Consider an experiment with n trials:
Each trial can result in one of two states {a, b}
The probability of a (and of b) is the same in each trial
The number of occurrences of {a} is X
The probability that a specific number of {a} appears is:
$$p(X = x) = \binom{n}{x}\,p^x\,(1-p)^{n-x}$$
The distribution is defined by n and p.
Mean value: $\mu = n \cdot p$; Variance: $\sigma^2 = n \cdot p\,(1-p)$
Example: If an accident occurs, every tenth person of the population is able to provide initial medical treatment.
How large is the probability that there are 0, 1, …, up to 10 persons out of a total of 10 who are able to provide initial medical treatment? (e.g. $p(X = 1)$: one person is able to provide treatment)
$$p(X = 0) = \binom{10}{0}\,0.1^0\,(1 - 0.1)^{10} = 0.3487$$
$$p(X = 1) = \binom{10}{1}\,0.1^1\,(1 - 0.1)^{9} = 0.3874$$
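The pmf values above can be reproduced with a few lines of Python (the function name binom_pmf is ours, chosen for illustration):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# First-aid example from the slide: n = 10 persons, p = 0.1
for x in range(3):
    print(x, round(binom_pmf(x, 10, 0.1), 4))
# 0 0.3487
# 1 0.3874
# 2 0.1937
```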
Page 24
Example of Bayesian methodology (III)
The binomial distribution represents the generative model of the data P(D|h) under hypothesis h:
$$P(n_g \mid p_g, N) = \frac{N!}{(N - n_g)!\,n_g!}\,p_g^{n_g}\,(1 - p_g)^{N - n_g}$$
(probability of observing exactly $n_g$ faultless workpieces in the sample)
Bayesian methodology (remember): the objective function for Bayesian parameter estimation is the most likely hypothesis given the observations:
$$h_{MAP} = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$
Due to the new manufacturing process, there is no knowledge regarding the proportion of faultless workpieces.
The prior probability of the corresponding hypothesis h is therefore described by a uniform distribution for the parameter $p_g$:
$$f_p(p_g \mid N = 0, n_g = 0) = \frac{\Gamma(2)}{\Gamma(1)\,\Gamma(1)}\,p_g^0\,(1 - p_g)^0 = 1$$
Page 25
Example of Bayesian methodology (IV)
With each measurement observation, the initial uniform distribution is transformed into a Beta-type posterior distribution for the independent parameter $p_g$.
According to the Bayesian methodology we can define the a posteriori probability density:
$$f_p(p_g \mid N, n_g) = \frac{\Gamma(N + 2)}{\Gamma(n_g + 1)\,\Gamma(N - n_g + 1)}\,p_g^{n_g}\,(1 - p_g)^{N - n_g} \;\sim\; \mathrm{Beta}(n_g + 1,\ N - n_g + 1)$$
Incremental measuring of the workpieces drawn from the production line leads to the samples:
after N = 5 measurements, ng = 3 workpieces turned out to be faultless
after N = 10 measurements, ng = 6 workpieces turned out to be faultless
after N = 15 measurements, ng = 9 workpieces turned out to be faultless
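As an illustration, the following Python sketch evaluates these posteriors numerically; it assumes the Beta(ng + 1, N − ng + 1) form given above and uses scipy only for the Beta density:

```python
# Sketch of the incremental Beta posterior for the proportion p_g of
# faultless workpieces (Beta(n_g + 1, N - n_g + 1), uniform prior).
import numpy as np
from scipy.stats import beta

samples = [(3, 5), (6, 10), (9, 15)]   # (n_g, N) from the slide
p = np.linspace(0, 1, 501)

for n_g, N in samples:
    posterior = beta.pdf(p, n_g + 1, N - n_g + 1)
    p_map = p[np.argmax(posterior)]    # MAP estimate = n_g / N for the uniform prior
    print(f"N={N:2d}, n_g={n_g}: MAP ≈ {p_map:.2f}")
# MAP ≈ 0.60 in all three cases; the posterior merely sharpens around it.
```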
Page 26
Example of Bayesian methodology (V)
(Figure: the prior density $f_p(p_g)$ and the posterior densities $f_p(p_g \mid n_g = 3, N = 5)$, $f_p(p_g \mid n_g = 6, N = 10)$ and $f_p(p_g \mid n_g = 9, N = 15)$ plotted over $p_g$, together with the corresponding MAP estimates $\hat{p}_g^{MAP}(n_g, N)$; with growing sample size the posterior concentrates around the MAP estimate.)
Page 27
Example of Bayesian methodology (VI)
Conversely, when using the maximum likelihood estimator instead of the maximum a posteriori estimator, we have the point estimate:
$$\hat{p}_g^{ML} = \arg\max_{p_g} P(n_g \mid p_g, N) = \arg\max_{p_g} \frac{N!}{(N - n_g)!\,n_g!}\,p_g^{n_g}\,(1 - p_g)^{N - n_g}$$
For instance, the maximum likelihood value for the first sample that had been drawn from the line (N = 5, ng = 3) is:
$$\hat{p}_g^{ML} = \arg\max_{p_g}\; p_g^3\,(1 - p_g)^2$$
$$\frac{d}{dp_g}\left[p_g^3\,(1 - p_g)^2\right] = 3 p_g^2 (1 - p_g)^2 + p_g^3 \cdot 2 (1 - p_g)(-1) = 0 \;\Rightarrow\; \hat{p}_g^{ML} = \frac{3}{5}$$
Obviously, the maximum likelihood estimate is equivalent to the relative frequency of the faultless workpieces in the tested sample!
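A quick numeric sanity check of this result (a simple grid search instead of the analytic derivative):

```python
# Numeric check of the maximum likelihood estimate for N = 5, n_g = 3:
# maximize p^3 (1 - p)^2 over a fine grid.
import numpy as np

p = np.linspace(0, 1, 100001)
likelihood = p**3 * (1 - p)**2
print(p[np.argmax(likelihood)])  # 0.6 = 3/5, the relative frequency
```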
Page 28
3. Bayesian Networks
Page 29
Example of a Bayesian Network (1/9)
(Network figure: Winter → Sprinkler; Winter → Rain; Sprinkler → Wet Grass; Rain → Wet Grass; Rain → Wet Road)
ΘWinter|∅ is:
Winter = true   Winter = false
0.6             0.4
Page 30
Example of a Bayesian Network (2/9)
(Network figure as before; ¬ represents "false")
ΘRain|Winter is:
Winter   Rain   ¬Rain
true     0.8    0.2
false    0.1    0.9
Page 31
Example of a Bayesian Network (3/9)
(Network figure as before; ¬ represents "false")
ΘWet Grass|Sprinkler,Rain is:
Sprinkler   Rain    Wet Grass   ¬Wet Grass
true        true    0.95        0.05
true        false   0.9         0.1
false       true    0.8         0.2
false       false   0           1
Page 32
Example of a Bayesian Network (4/9)
(Network figure as before; ¬ represents "false")
ΘWet Road|Rain is:
Rain    Wet Road   ¬Wet Road
true    0.7        0.3
false   0          1
Page 33
Example of a Bayesian Network (5/9)
(Network figure as before; ¬ represents "false")
Probability distribution described by a Bayesian Network
Allocation of interest:
ω(Winter) = true
ω(Sprinkler) = false
ω(Rain) = true
ω(Wet Grass) = true
ω(Wet Road) = true
ΘWinter|∅:
Winter   ¬Winter
0.6      0.4
Probability of winter: P(Winter) = 0.6
ΘSprinkler|Winter:
Winter   Sprinkler   ¬Sprinkler
true     0.2         0.8
false    0.75        0.25
Probability of "Winter" and "Sprinkler" not used:
P(W ∧ ¬S) = 0.6 ∙ 0.8 = 0.48
Page 34
Example of a Bayesian Network (6/9)
(Network figure as before; ¬ represents "false")
Probability distribution described by a Bayesian Network
Allocation of interest:
ω(Winter) = true
ω(Sprinkler) = false
ω(Rain) = true
ω(Wet Grass) = true
ω(Wet Road) = true
ΘRain|Winter:
Winter   Rain   ¬Rain
true     0.8    0.2
false    0.1    0.9
Probability of "Winter" and "Sprinkler" not used:
P(W ∧ ¬S) = 0.48
Probability of "Winter", "Sprinkler" not used, and "Rain":
P(W ∧ ¬S ∧ R) = 0.48 ∙ 0.8 = 0.384
Page 35
Example of a Bayesian Network (7/9)
(Network figure as before; ¬ represents "false")
Probability distribution described by a Bayesian Network
Allocation of interest:
ω(Winter) = true
ω(Sprinkler) = false
ω(Rain) = true
ω(Wet Grass) = true
ω(Wet Road) = true
ΘWet Grass|Sprinkler,Rain:
Sprinkler   Rain    Wet Grass   ¬Wet Grass
true        true    0.95        0.05
true        false   0.9         0.1
false       true    0.8         0.2
false       false   0           1
Probability of "Winter", "Sprinkler" not used, and "Rain":
P(W ∧ ¬S ∧ R) = 0.384
Probability of "Winter", "Sprinkler" not used, "Rain", and "Wet Grass":
P(W ∧ ¬S ∧ R ∧ WG) = 0.384 ∙ 0.8 = 0.3072
Page 36
Example of a Bayesian Network (8/9)
(Network figure as before; ¬ represents "false")
Probability distribution described by a Bayesian Network
Allocation of interest:
ω(Winter) = true
ω(Sprinkler) = false
ω(Rain) = true
ω(Wet Grass) = true
ω(Wet Road) = true
ΘWet Road|Rain is:
Rain    Wet Road   ¬Wet Road
true    0.7        0.3
false   0          1
Probability of "Winter", "Sprinkler" not used, "Rain", and "Wet Grass":
P(W ∧ ¬S ∧ R ∧ WG) = 0.3072
Probability of "Winter", "Sprinkler" not used, "Rain", "Wet Grass", and "Wet Road":
P(W ∧ ¬S ∧ R ∧ WG ∧ WR) = 0.3072 ∙ 0.7 = 0.21504
Page 37
Example of a Bayesian Network (9/9)
(Network figure as before; ¬ represents "false")
Probability distribution described by a Bayesian Network
Allocation of interest:
ω(Winter) = true
ω(Sprinkler) = false
ω(Rain) = true
ω(Wet Grass) = true
ω(Wet Road) = true
Summarized:
$$\Pr(\omega) = \Theta_{W|\emptyset} \cdot \Theta_{\neg S|W} \cdot \Theta_{R|W} \cdot \Theta_{WG|\neg S \wedge R} \cdot \Theta_{WR|R}$$
This basically corresponds to the chain rule of probabilities:
$$\Pr(\varphi_1 \wedge \dots \wedge \varphi_n) = \Pr(\varphi_1 \mid \varphi_2 \wedge \dots \wedge \varphi_n)\,\Pr(\varphi_2 \mid \varphi_3 \wedge \dots \wedge \varphi_n) \cdots \Pr(\varphi_n)$$
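To make the chain-rule evaluation concrete, here is a short Python sketch that recomputes Pr(ω) from the CPTs of the previous slides; the dictionary names are ours, chosen for illustration:

```python
# Evaluating Pr(ω) for the allocation above by multiplying the CPT entries
# along the factorization (CPT values taken from the previous slides).
theta_W = {True: 0.6, False: 0.4}                    # P(Winter)
theta_S_given_W = {True: 0.2, False: 0.75}           # P(Sprinkler=true | Winter)
theta_R_given_W = {True: 0.8, False: 0.1}            # P(Rain=true | Winter)
theta_WG_given_SR = {(True, True): 0.95, (True, False): 0.9,
                     (False, True): 0.8, (False, False): 0.0}
theta_WR_given_R = {True: 0.7, False: 0.0}           # P(Wet Road=true | Rain)

w, s, r, wg, wr = True, False, True, True, True      # allocation of interest
prob = (theta_W[w]
        * (theta_S_given_W[w] if s else 1 - theta_S_given_W[w])
        * (theta_R_given_W[w] if r else 1 - theta_R_given_W[w])
        * (theta_WG_given_SR[(s, r)] if wg else 1 - theta_WG_given_SR[(s, r)])
        * (theta_WR_given_R[r] if wr else 1 - theta_WR_given_R[r]))
print(prob)  # 0.6 * 0.8 * 0.8 * 0.8 * 0.7 = 0.21504
```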
Page 38
Approach (I)
To classify and predict a discrete event system model with uncertainty, it is necessary to make assumptions about the statistical independence of variables.
Reason: the number of alternatives to factorize the joint probability distribution increases exponentially with the number of variables:
$$P(X_1, X_2) = P(X_2|X_1) \cdot P(X_1) = P(X_1|X_2) \cdot P(X_2)$$
$$P(X_1, X_2, X_3) = P(X_1|X_2, X_3) \cdot P(X_2|X_3) \cdot P(X_3) = P(X_2|X_1, X_3) \cdot P(X_1|X_3) \cdot P(X_3) = \dots$$
Conditional independence: the random variables X and Y are conditionally independent given Z if it holds:
$$P(X, Y|Z) = P(X|Z) \cdot P(Y|Z) \;\Leftrightarrow\; P(X|Y, Z) = P(X|Z)$$
Bayesian networks
… encode conditional independence assumptions among subsets of random system variables
… are represented by a directed acyclic graphical model, with:
- directed arcs between nodes (model structure)
- conditional probability tables related to the random system variables (model parameters)
Page 39
Approach (II)
Bayesian networks – semantics of the graphical model:
Nodes: random variables serving as state variables and observables of the system model
Directed arcs: causal dependencies of the system model, from which the conditional independence of the random system variables follows
If a directed arc is drawn from node X ("Rain") to node Y ("Wet Road"), node X is called the parent node of Y and Y is called the child node of X (example: Rain → Wet Road, parent → child)
Nodes without parent nodes are called root nodes (example: Clouds → Rain → Wet Road, where "Clouds" is the root node)
A directed path from node X to Y is said to exist if one can find a valid sequence of nodes starting from X and ending in Y such that each node in the sequence is a parent of the following node in the sequence
Each random variable Y with the parent nodes X1, ..., Xn is associated with a conditional probability table (CPT) encoding the conditional probability P(Y = y | X1 = x1, ..., Xn = xn) (example: parents X1 = Sprinkler, X2 = Rain with child Y)
Page 40
Definition of a Bayesian network
Definition of a discrete Bayesian network (BN): a discrete Bayesian network is represented by the parameter tuple
$$\lambda_{BN} = (G, \Theta)$$
G is a directed, acyclic graph. Its nodes represent discrete random variables Xi (i = 1, ..., n). The graph encodes: "A node is conditionally independent of its non-descendants, given its parents."
(Figure: example graph with nodes Clouds, Rain, Wet Road and Slippery Road, illustrating the "given parents, non-descendants are irrelevant" statement, and a CPT with row index m and column indices r1, r2 for the parent value combinations.)
Θi = (aimr) are the conditional probability tables (CPT) of the nodes of the network with the components (values):
$$a_{imr} = P(X_i = x_m \mid \mathrm{Parents}(X_i) = w_r)$$
where m = 1, ..., |Xi| enumerates the values of the discrete random variable Xi, and the index r of the CPT columns enumerates the possible combinations of values wr of the associated parent nodes, r = 1, ..., |Parent1(Xi)| ∙ |Parent2(Xi)| ∙ … (if the node is a root node, r is simply 1).
The column vectors in the CPTs always sum to one.
Page 41
Factorization of the joint probability distribution
1. Proposition: The joint probability distribution of a discrete Bayesian network with the random variables X1, X2, …, Xn can be factorized as follows:
$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$$
(Parents(Xi): predecessors of Xi. Therefore, a transformation is only forward directed!)
Note: The factorization mechanism is directly associated with the graphical model:
Compared to a fully interlinked and structurally uninformative graph, the number of alternatives to factorize the joint probability distribution can be significantly reduced.
A graphical model can be developed from first principles and established theories about cause and effect relationships.
Note: Several valid factorizations can exist for a given joint probability distribution of a Bayesian model.
Page 42
Example of a Bayesian network (I)
A production machine (M) tends to produce a significant amount of defective parts.
Causes:
Its drive (D) is over-heated
The control electronics (E) are disturbed
The shop floor temperature (T) influences the over-heating of the drive (D)
The shop floor temperature (T) depends on the season (S), because there is no air conditioning system
The functioning of the control electronics (E) is affected by grid (G) voltage jitters and by the shop floor temperature (T)
Graphical model of conditional independencies (figure): S → T; T → D; T → E; G → E; D → M; E → M
(Image source: 1000steine.de)
Page 43
Example of a Bayesian network (II)
Graphical model of conditional independencies (figure): S → T; T → D; T → E; G → E; D → M; E → M
Random system variables of the system model:
X1 = M with binary states: {normal productivity, low productivity} = {m, ¬m}
X2 = E with binary states: {faultless, disturbed} = {e, ¬e}
X3 = D with binary states: {normal, over-heated} = {d, ¬d}
X4 = G with binary states: {no voltage jitters, significant jitters} = {g, ¬g}
X5 = T with ternary states: {high, normal, low} = {h, n, l}
X6 = S with quaternary states: {winter, spring, summer, fall} = {w, p, s, f}
Page 44
Example of a Bayesian network (III)
Example conditional probability table (CPTT) of the variable temperature (T), with states {high, normal, low} = {h, n, l} relating to the season:
            S = w   S = p   S = s   S = f
P(T = h|.)  0.05    0.10    0.75    0.10
P(T = n|.)  0.20    0.30    0.20    0.30
P(T = l|.)  0.75    0.60    0.05    0.60
Example conditional probability table (CPTM) of the production machine (M), with electronics faultless e / disturbed ¬e and drive normal d / over-heated ¬d:
             E = e ∧ D = d   E = e ∧ D = ¬d   E = ¬e ∧ D = d   E = ¬e ∧ D = ¬d
P(M = m|.)   0.94            0.01             0.025            0.01
P(M = ¬m|.)  0.06            0.99             0.975            0.99
Page 45
Example of a Bayesian network (IV)
Remember: the joint probability distribution encoded by a discrete Bayesian network with the random variables X1, X2, …, Xn can be factorized as follows:
$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$$
For the example, the following parameter setting is developed:
$$P(M, E, D, G, T, S) = P(M|E, D)\,P(E|G, T)\,P(D|T)\,P(T|S)\,P(S)\,P(G)$$
(graphically: P(M|E,D) at the Machine node, P(E|G,T) at the Control Electronics node, P(D|T) at the Drive node, P(T|S) at the Temperature node, P(S) at the Season root and P(G) at the Grid root)
Page 46
Inference in Bayesian networks (I)
Overall goal
Probability calculation with Bayesian networks, also referred to as "inference":
Estimation of the probability mass functions of non-observable (hidden) random variables in the network, if (some) states of observable variables are known.
(Network figure as before; root/parent nodes at the top, child nodes at the bottom)
! If, due to the network structure, the child nodes are observable and hidden causes have to be estimated, the inference is called a diagnosis or bottom-up inference.
Example: P("significant grid voltage jitters" | "low productivity of machine"), i.e. from Machine up to Grid
! If root nodes or parent nodes are observable and effects have to be estimated, the inference is called a prognosis or top-down inference.
Example: P("low productivity of machine" | "over-heated drive"), i.e. from Drive down to Machine
Page 47
Inference in Bayesian networks (II)
! Inference in Bayesian networks is very flexible: the states of arbitrary network nodes can be fixed as evidence, and the probability distributions of the other nodes can then be updated.
Example: P(... | "winter season", "significant grid voltage jitters")
But… the exact calculation of probability values is in general an NP-hard problem.
Therefore, we only present closed-form solutions for chains of variables (like Markov chains) and a simple tree in this introductory course.
Page 48
Diagnosis in chains (I)
Case 1: Dual chain: X → Y, and {Y = y} is observed
Remember the Bayes theorem:
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
P(h): a priori probability of a hypothesis h (or a model) representing the initial degree of belief
P(D): a priori probability of the data D (observations)
P(h|D): a posteriori probability of hypothesis h under the condition of given data D
P(D|h): probability of data D under the condition of hypothesis h
Applied to the chain, where P(Y = y) is the a priori probability of the observed data and P(Y = y | X = x) is the probability of the observed y under the condition of x:
$$\mathrm{Belief}(x) \equiv P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\,P(X = x)}{P(Y = y)} = \frac{P(Y = y \mid X = x)\,P(X = x)}{\sum_{x'} P(Y = y \mid X = x')\,P(X = x')}$$
In shorthand notation:
$$\mathrm{Belief}(x) \equiv p(x|y) = \frac{p(y|x)\,p(x)}{\sum_{x'} p(y|x')\,p(x')} = c \cdot p(x) \cdot l(x)$$
with: $c = \left(\sum_{x'} p(y|x')\,p(x')\right)^{-1}$ and $l(x) = p(y|x)$
Page 49
Diagnosis in chains (II)
Case 1: Dual chain: X → Y, and {Y = y} is observed
Example: Grid → Control Electronics; observed is {control electronics = "disturbed"}
Assumptions:
P(E = "faultless" | G = "no jitters") ≡ p(e|g) = 0.9 ⟹ p(¬e|g) = 0.1 (faultless electronics given no grid voltage jitters)
P(E = "disturbed" | G = "significant jitters") ≡ p(¬e|¬g) = 0.8 ⟹ p(e|¬g) = 0.2 (disturbed electronics given significant grid voltage jitters)
P(G = "no jitters") ≡ p(g) = 0.95 ⟹ p(¬g) = 0.05 (probability of the occurrence of jitters)
$$\mathrm{Belief}(g) = c \cdot p(g) \cdot l(g) = c \cdot p(g) \cdot p(\neg e \mid g) = c \cdot 0.95 \cdot 0.1 = c \cdot 0.095$$
$$\mathrm{Belief}(\neg g) = c \cdot p(\neg g) \cdot l(\neg g) = c \cdot p(\neg g) \cdot p(\neg e \mid \neg g) = c \cdot 0.05 \cdot 0.8 = c \cdot 0.04$$
$$c \cdot 0.095 + c \cdot 0.04 = 1 \;\Rightarrow\; c = (0.095 + 0.04)^{-1} \approx 7.4074$$
$$\mathrm{Belief}(g) \approx 0.70 \quad\text{and}\quad \mathrm{Belief}(\neg g) \approx 0.30$$
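The same update as a Python sketch (the state names are ours, chosen for illustration):

```python
# Belief update for the dual chain G -> E with observation E = "disturbed"
# (probabilities from the assumptions above).
p_g = 0.95                   # P(G = "no jitters")
p_note_given_g = 0.1         # P(¬e | g)
p_note_given_notg = 0.8      # P(¬e | ¬g)

unnorm = {"g": p_g * p_note_given_g,                 # 0.095
          "not_g": (1 - p_g) * p_note_given_notg}    # 0.040
c = 1 / sum(unnorm.values())                         # normalization constant ≈ 7.4074
belief = {k: c * v for k, v in unnorm.items()}
print(belief)  # {'g': ≈0.70, 'not_g': ≈0.30}
```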
Page 50
Diagnosis in chains (III)
Case 2: Triple chain: X → Y → Z, and {Z = z} is observed
$$\mathrm{Belief}(x) = p(x|z) = \frac{1}{p(z)}\,p(x) \cdot p(z|x) = c \cdot p(x) \cdot l(x)$$
with the likelihood function (probability of the observed z under the condition of x, marginalizing over y; note that p(z|y, x) = p(z|y) in a chain):
$$l(x) = \sum_y p(z|y, x) \cdot p(y|x) = \sum_y p(z|y) \cdot p(y|x)$$
Example: Grid → Electronics → Machine ⟹ observed is {machine = "low productivity"}
Assumptions:
P(M = "normal productivity" | E = "faultless") ≡ p(m|e) = 0.95 ⟹ p(¬m|e) = 0.05
P(M = "low productivity" | E = "disturbed") ≡ p(¬m|¬e) = 0.85 ⟹ p(m|¬e) = 0.15
P(E = "faultless" | G = "no jitters") ≡ p(e|g) = 0.9 ⟹ p(¬e|g) = 0.1
P(E = "disturbed" | G = "significant jitters") ≡ p(¬e|¬g) = 0.8 ⟹ p(e|¬g) = 0.2
P(G = "no jitters") ≡ p(g) = 0.95 ⟹ p(¬g) = 0.05
Page 51
Diagnosis in chains (IV)
From the last slide, with $l(x) = \sum_y p(z|y)\,p(y|x)$ (low productivity given the grid state, marginalizing over the faultless/disturbed electronics):
$$l(g) = p(\neg m \mid g) = p(\neg m \mid e) \cdot p(e \mid g) + p(\neg m \mid \neg e) \cdot p(\neg e \mid g) = 0.05 \cdot 0.9 + 0.85 \cdot 0.1 = 0.13$$
$$l(\neg g) = p(\neg m \mid \neg g) = p(\neg m \mid e) \cdot p(e \mid \neg g) + p(\neg m \mid \neg e) \cdot p(\neg e \mid \neg g) = 0.05 \cdot 0.2 + 0.85 \cdot 0.8 = 0.69$$
$$\mathrm{Belief}(g) = c \cdot p(g) \cdot l(g) = c \cdot 0.95 \cdot 0.13 = c \cdot 0.1235$$
$$\mathrm{Belief}(\neg g) = c \cdot p(\neg g) \cdot l(\neg g) = c \cdot 0.05 \cdot 0.69 = c \cdot 0.0345$$
$$c = (0.1235 + 0.0345)^{-1} \;\Rightarrow\; \mathrm{Belief}(g) \approx 0.78 \quad\text{and}\quad \mathrm{Belief}(\neg g) \approx 0.22$$
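The same two-step update as a Python sketch, with the marginalization over the hidden middle node E made explicit:

```python
# Belief update for the triple chain G -> E -> M with observation
# M = "low productivity", marginalizing over the hidden middle node E.
p_g = 0.95
p_e_given = {"g": 0.9, "not_g": 0.2}       # P(e | G)
p_notm_given = {"e": 0.05, "not_e": 0.85}  # P(¬m | E)

def likelihood(g_state):
    """l(x) = sum_y P(z | y) P(y | x) for the observed z = ¬m."""
    p_e = p_e_given[g_state]
    return p_notm_given["e"] * p_e + p_notm_given["not_e"] * (1 - p_e)

unnorm = {"g": p_g * likelihood("g"), "not_g": (1 - p_g) * likelihood("not_g")}
c = 1 / sum(unnorm.values())
print({k: round(c * v, 2) for k, v in unnorm.items()})  # {'g': 0.78, 'not_g': 0.22}
```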
Page 52
Diagnosis in chains (V)
Case 3: n-tuple chain: X1 → ... → Xn, and {Xn = xn} is observed
$$\mathrm{Belief}(x_1) = p(x_1 \mid x_n) = \frac{1}{p(x_n)}\,p(x_1)\,p(x_n \mid x_1) = c \cdot p(x_1) \cdot l(x_1)$$
with the likelihood obtained by marginalizing over all intermediate variables:
$$l(x_1) = p(x_n \mid x_1) = \sum_{x_2} \cdots \sum_{x_{n-1}} p(x_n \mid x_{n-1}) \cdots p(x_3 \mid x_2)\,p(x_2 \mid x_1) = \sum_{x_2} p(x_2 \mid x_1) \sum_{x_3} p(x_3 \mid x_2) \cdots \sum_{x_{n-1}} p(x_n \mid x_{n-1})$$
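The nested sums can be evaluated efficiently as a backward pass of matrix-vector products. The following Python sketch (the function name and matrix layout are our choices) implements this and reproduces the triple-chain result:

```python
# Backward (diagnosis) pass in a chain X1 -> ... -> Xn when X_n = x_n is
# observed: l(x_i) = sum_{x_{i+1}} P(x_{i+1} | x_i) l(x_{i+1}),
# initialized with l(x_n) = 1 for the observed value and 0 otherwise.
import numpy as np

def chain_diagnosis(prior, transitions, observed_state):
    """prior: P(X1); transitions: list of row-stochastic matrices P(X_{t+1}|X_t)."""
    n_states = transitions[-1].shape[1]
    l = np.zeros(n_states)
    l[observed_state] = 1.0                 # evidence on the last node
    for T in reversed(transitions):         # propagate the likelihood backwards
        l = T @ l                           # l(x_i) = sum_j T[i, j] * l(x_j)
    unnorm = prior * l
    return unnorm / unnorm.sum()            # Belief(x1) = c * p(x1) * l(x1)

# Reproduces the triple-chain example: G -> E -> M, observed M = ¬m (index 1)
prior = np.array([0.95, 0.05])                   # [g, ¬g]
T_GE = np.array([[0.9, 0.1], [0.2, 0.8]])        # P(E | G)
T_EM = np.array([[0.95, 0.05], [0.15, 0.85]])    # P(M | E)
print(chain_diagnosis(prior, [T_GE, T_EM], 1))   # ≈ [0.78, 0.22]
```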
Page 53
Topologies of trees
(Figure: a tree and a multiply connected tree)
Note: in a tree there is no node that merges arcs.
Page 54
Diagnosis in a simple tree
Case 4: Simple tree: X1 → Y ← X2 with Y → Z, and {Z = z} is observed
$$\mathrm{Bel}(x_1) = p(x_1 \mid z) = \frac{1}{p(z)}\,p(x_1)\,p(z \mid x_1) = c \cdot p(x_1) \cdot l(x_1)$$
$$l(x_1) = \sum_y \sum_{x_2} p(z \mid y, x_1, x_2)\,p(y \mid x_1, x_2)\,p(x_2) = \sum_y p(z \mid y) \sum_{x_2} p(y \mid x_1, x_2)\,p(x_2)$$
Moreover, it is possible to derive exact inference algorithms for trees with multiple layers as well as multiply connected trees. Multiply connected trees are converted into multiple-layer trees. These algorithms are given in KOCH (2000).
Page 55
4. Introduction to Dynamic Bayesian Networks
Page 56
Approach (I)
In the previous lecture the approach of static Bayesian networks with discrete random variables was introduced. It is able to encode prior knowledge and independence assumptions of a problem domain both efficiently and consistently in a graphical model, and it allows the system state to be inferred from incomplete data.
In this lecture the primary question is how we can exploit the methodology of Bayesian networks to model and simulate stochastic processes. These processes were already analyzed in the 7th and 8th lectures. As in the case of Markov chains, we are only interested in the total probability p(x′, x) of a transition from state x to state x′ and do not distinguish the events triggering the state transition.
For instance, it is possible to represent a discrete-state, discrete-time Markov chain as a Bayesian network. The time-indexed random variable Ot, defined over the integers 1, 2, …, encodes the observable state of the chain in each time step t of the process:
(Figure: time slices t = 1, 2, 3, …, T with the chain O1 → O2 → O3 → … → OT)
$$\pi_0 = \big(P(O_1 = 1),\ P(O_1 = 2),\ \dots\big), \qquad \mathbf{P} = [p_{ij}] = \begin{pmatrix} p_{11} & p_{12} & \cdots \\ p_{21} & p_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad \text{e.g. } p_{12} = P(O_{t+1} = 2 \mid O_t = 1)$$
Page 57
Approach (II)
Clearly, we can make use of the structure of the graphical model according to Proposition 1 of the previous lecture to factorize the joint probability distribution of the observables:
$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2 \mid O_1)\,P(O_3 \mid O_2) \cdots P(O_T \mid O_{T-1})$$
Furthermore, we showed in the previous lecture how to compute the bottom-up inference (diagnosis) in such a Markov chain using the Bayes theorem.
According to the factorization of the joint distribution, the predictive power of this simple process model is limited, because the state transition mechanism considers only two neighboring time slices. In other words, if we have modeled the state sequence {O1, ..., Ot} and we want to predict the future state of the stochastic process Ot+1, the simple Markovian chain model considers only the distribution of the probability mass related to Ot in conjunction with the single-step transition probabilities pij. The previous instances of the process are irrelevant, given the present state.
This minimum chain model is also called a first-order Markov chain, because only two consecutive time slices are linked in the graphical process model. The first-order Markov chain can be considered as the minimum structure of a dynamic Bayesian network.
Page 58
High-order Markov chains
A significantly larger predictive power of the chain model is possible (without recoding states, see 8th lecture!) if the present state of the chain (t) does not only depend on the state in the previous time slice (t−1) but also on additional time slices in the past of the process (t−2, t−3, …). If the "memory depth" of the model is 2, it is called a second-order Markov chain:
(Figure: time slices t = 1, 2, 3, …, T with the chain O1 → O2 → O3 → … → OT and additional arcs Ot−2 → Ot)
Clearly, the joint probability distribution of the second-order Markov chain can be factorized as:
$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2 \mid O_1)\,P(O_3 \mid O_2, O_1)\,P(O_4 \mid O_3, O_2) \cdots P(O_T \mid O_{T-1}, O_{T-2})$$
1. Proposition: The joint probability distribution of a discrete-state, discrete-time Markov chain of order k can be factorized in each time step T as:
$$P(O_1, O_2, \dots, O_T) = P(O_1)\,P(O_2 \mid O_1) \cdots P(O_k \mid O_{k-1}, \dots, O_1) \cdot P(O_{k+1} \mid O_k, \dots, O_1)\,P(O_{k+2} \mid O_{k+1}, \dots, O_2) \cdots P(O_T \mid O_{T-1}, \dots, O_{T-k})$$
Page 59
Markov chains with hidden variables (I)
Markov chains (MC) of finite order k are able to simulate significant memory capacity, but the number of model parameters N = |λ| (λ represents the parameter tuple) that are stored in the prior and conditional probability tables grows exponentially with the order.
Consider a stochastic process with three states ot ∈ {1, 2, 3}. We have:
• First-order MC: N1 = (3−1) + 3(3−1) (initial state probabilities plus transition matrix; rows must sum up to 1)
• Second-order MC: N2 = (3−1) + 3(3−1) + 3²(3−1)
• k-th order MC: Nk = (3−1) + 3(3−1) + … + 3^k(3−1)
In order to avoid this rapid growth of the number of parameters and to be able to model processes with latent dependency structures leading to long-range correlations, the approach of Markov chains with hidden variables was invented in the engineering sciences. These Hidden Markov Models (HMM) distinguish a not directly observable state process {Qt} that satisfies the Markov property and a non-Markovian observation process {Ot} that depends on the state process. This kind of dynamic Bayesian network with hidden (latent) state variables has the following structure:
(Graph of a Hidden Markov Model: hidden chain Q1 → Q2 → Q3 → … → QT with emissions Qt → Ot in each time slice t = 1, …, T)
Page 60
Application examples of HMM
Phoneme and word recognition on the basis of adequately sampled and encoded acoustic spectra in speech recognition
Classification of human behavior when interacting with anthropomorphic robots
Prediction of event sequences in communication engineering and human-computer interaction
(Figure: functional model of a speech recognition system by Prof. Schukat, Jena University)
Page 61
Markov chains with hidden variables (II)
2. Proposition: The joint probability distribution of a Hidden Markov Model can be factorized in each time step T as:
$$P(Q_1, \dots, Q_T, O_1, \dots, O_T) = P(Q_1)\,P(O_1 \mid Q_1) \prod_{t=2}^{T} P(Q_t \mid Q_{t-1})\,P(O_t \mid Q_t)$$
1. Def.: A discrete-time, discrete-state Hidden Markov Model is represented by the parameter tuple $\lambda_{HMM} = (Q, O, \Pi, A, B)$, where
- Q is a set of hidden states, mapped in the following onto the integers {1, 2, ..., J}
- O is a set of observable states, mapped in the following onto the integers {1, 2, ..., K}
- $\Pi = (\pi_1, \dots, \pi_J)$ encodes the start vector indicating the initial distribution of the probability mass over the hidden states, with $\pi_j = P(Q_1 = j)$ (j = 1...J)
- $A = (a_{ij})$ with $a_{ij} = P(Q_t = j \mid Q_{t-1} = i)$ encodes the transition matrix of the hidden process (i, j = 1...J)
- $B = (b_{jk})$ with $b_{jk} = P(O_t = k \mid Q_t = j)$ encodes the emission matrix of the observable states given the hidden states (j = 1...J, k = 1...K).
Therefore, the distribution of the probability mass $\pi(t)$ over the observable states in time step t, given the initial distribution $\Pi$, the transition matrix A and the emission matrix B, can be calculated as follows:
$$P(O_t) = \pi(t) = \Pi\,A^{t-1}\,B$$
Page 62
HMM example
A fluid in a chemical reactor has two states Q = {1 (non-toxic), 2 (toxic)}. According to the molecular properties of the fluid, its state can change spontaneously (e.g. due to temperature jitters) from the non-toxic state to the toxic state with probability p12 = 0.01 at any time instant. This state switching is irreversible. Laboratory studies have shown that the temporal unfolding of the state process can be represented with a sufficiently high level of accuracy by a first-order Markov chain model. Initially, the fluid is filled into the reactor in the non-toxic state.
The measurement of the state of the fluid can only be carried out with the help of an integrated sensor. A direct state observation is not possible. The sensor is fast enough to finish the measurement within the same time instant. The sensor identifies the toxic state with a reliability of 99.9% and the non-toxic state with a reliability of 95%.
How is the probability mass distributed over the observable states in time step t = 4 when the system is initialized in the non-toxic state?
Solution:
$$\Pi = \begin{pmatrix} 1.00 & 0.00 \end{pmatrix}, \qquad A = \begin{pmatrix} 0.99 & 0.01 \\ 0.00 & 1.00 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.95 & 0.05 \\ 0.001 & 0.999 \end{pmatrix}$$
$$P(O_{t=4}) = \Pi\,A^{3}\,B \approx \begin{pmatrix} 0.92 & 0.08 \end{pmatrix}$$
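The solution can be verified in a few lines of Python (row-vector convention as in Def. 1):

```python
# HMM example: distribution over the observable states at t = 4,
# P(O_4) = Π · A^3 · B.
import numpy as np

Pi = np.array([1.0, 0.0])          # initial distribution over hidden states
A = np.array([[0.99, 0.01],        # hidden transition matrix
              [0.00, 1.00]])
B = np.array([[0.95, 0.05],        # emission matrix
              [0.001, 0.999]])

p_obs = Pi @ np.linalg.matrix_power(A, 3) @ B
print(p_obs.round(2))  # ≈ [0.92 0.08]
```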
Page 63
5. Formalism of Dynamic Bayesian Networks
Page 64
Network definition
2. Def.: A discrete-state, discrete-time dynamic Bayesian network is represented by the parameter tuple $\lambda_{DBN} = (G_1, G_{tr}, \{\Pi^i\}_{i \in \{1, \dots, I\}}, \{CPT^j\}_{j \in \{1, \dots, J\}})$, where
- $G_1$ is a directed, acyclic graph of start nodes in the first time slice (t = 1) encoding the initial distribution of the probability mass, with the same meaning as in a static Bayesian network: "Each node is conditionally independent of its non-descendants, given its parents",
- $G_{tr}$ is a directed, acyclic graph of transition nodes in the replicated time slices encoding the transition probabilities between time steps, again with the same meaning as in a static Bayesian network,
- $\Pi^i = (\pi^i_{km})$ encode the start vectors or start matrices of observable as well as hidden random variables $X_1^i$ of the start nodes in the first time slice (t = 1), with the components $\pi^i_{1m} = P(X_1^i = m)$ ($i = 1, \dots, |G_1|$; $m = 1, \dots, |X_1^i|$) if $X_1^i$ is a root node, or $\pi^i_{km} = P(X_1^i = m \mid \mathrm{Parents}(X_1^i) = w_1^k)$ ($k = 1, \dots, |\mathrm{Parents}(X_1^i)|$) if $X_1^i$ is not a root node,
- $CPT^j = (a^j_{km})$ encode the transition matrices regarding observable as well as hidden random variables $X_{tr}^j$ in the replicated time slices (t = 2, 3, …), with the components $a^j_{km} = P(X_{tr}^j = m \mid \mathrm{Parents}(X_{tr}^j) = w_{tr}^k)$ ($j = 1, \dots, |G_{tr}(t{=}2)| - 1$; $k = 1, \dots, |\mathrm{Parents}(X_{tr}^j)|$; $m = 1, \dots, |X_{tr}^j|$).
Page 65
Basic DBN structures with parameterization for binary states (I)
1. First-order Markov chain: $G_1$ contains $O_1$; $G_{tr}$ contains $O_{t-1} \to O_t$
$$\Pi_1 = \big(P(O_1 = 1),\ P(O_1 = 2)\big)$$
$$CPT_1 = \mathbf{P} = \begin{pmatrix} P(O_t = 1 \mid O_{t-1} = 1) & P(O_t = 2 \mid O_{t-1} = 1) \\ P(O_t = 1 \mid O_{t-1} = 2) & P(O_t = 2 \mid O_{t-1} = 2) \end{pmatrix}$$
2. HMM: $G_1$ contains $Q_1$; $G_{tr}$ contains $Q_{t-1} \to Q_t$ and $Q_t \to O_t$
$$\Pi_1 = \big(P(Q_1 = 1),\ P(Q_1 = 2)\big)$$
$$CPT_1 = \mathbf{A} = \begin{pmatrix} P(Q_t = 1 \mid Q_{t-1} = 1) & P(Q_t = 2 \mid Q_{t-1} = 1) \\ P(Q_t = 1 \mid Q_{t-1} = 2) & P(Q_t = 2 \mid Q_{t-1} = 2) \end{pmatrix}$$
$$CPT_2 = \mathbf{B} = \begin{pmatrix} P(O_t = 1 \mid Q_t = 1) & P(O_t = 2 \mid Q_t = 1) \\ P(O_t = 1 \mid Q_t = 2) & P(O_t = 2 \mid Q_t = 2) \end{pmatrix}$$
Page 66
Basic DBN structures with parameterization for binary states (II)
3. Autoregressive HMM: $G_1$ contains $Q_1 \to O_1$; $G_{tr}$ contains $Q_{t-1} \to Q_t$, $Q_t \to O_t$ and $O_{t-1} \to O_t$
$$\Pi_1 = \big(P(Q_1 = 1),\ P(Q_1 = 2)\big)$$
$$\Pi_2 = \begin{pmatrix} P(O_1 = 1 \mid Q_1 = 1) & P(O_1 = 2 \mid Q_1 = 1) \\ P(O_1 = 1 \mid Q_1 = 2) & P(O_1 = 2 \mid Q_1 = 2) \end{pmatrix}$$
$$CPT_1 = \mathbf{A} = \begin{pmatrix} P(Q_t = 1 \mid Q_{t-1} = 1) & P(Q_t = 2 \mid Q_{t-1} = 1) \\ P(Q_t = 1 \mid Q_{t-1} = 2) & P(Q_t = 2 \mid Q_{t-1} = 2) \end{pmatrix}$$
$$CPT_2 = \begin{pmatrix} P(O_t = 1 \mid Q_t = 1, O_{t-1} = 1) & P(O_t = 2 \mid Q_t = 1, O_{t-1} = 1) \\ P(O_t = 1 \mid Q_t = 2, O_{t-1} = 1) & P(O_t = 2 \mid Q_t = 2, O_{t-1} = 1) \\ P(O_t = 1 \mid Q_t = 1, O_{t-1} = 2) & P(O_t = 2 \mid Q_t = 1, O_{t-1} = 2) \\ P(O_t = 1 \mid Q_t = 2, O_{t-1} = 2) & P(O_t = 2 \mid Q_t = 2, O_{t-1} = 2) \end{pmatrix}$$
Page 67
Basic DBN structures with parameterization for binary states (III)
4. Factorial HMM: $G_1$ contains $Q_1^1$ and $Q_1^2$; $G_{tr}$ contains $Q_{t-1}^1 \to Q_t^1$, $Q_{t-1}^2 \to Q_t^2$ and $Q_t^1, Q_t^2 \to O_t$
$$\Pi_1 = \big(P(Q_1^1 = 1),\ P(Q_1^1 = 2)\big), \qquad \Pi_2 = \big(P(Q_1^2 = 1),\ P(Q_1^2 = 2)\big)$$
$$CPT_1 = \begin{pmatrix} P(Q_t^1 = 1 \mid Q_{t-1}^1 = 1) & P(Q_t^1 = 2 \mid Q_{t-1}^1 = 1) \\ P(Q_t^1 = 1 \mid Q_{t-1}^1 = 2) & P(Q_t^1 = 2 \mid Q_{t-1}^1 = 2) \end{pmatrix}, \qquad CPT_2 = \begin{pmatrix} P(Q_t^2 = 1 \mid Q_{t-1}^2 = 1) & P(Q_t^2 = 2 \mid Q_{t-1}^2 = 1) \\ P(Q_t^2 = 1 \mid Q_{t-1}^2 = 2) & P(Q_t^2 = 2 \mid Q_{t-1}^2 = 2) \end{pmatrix}$$
$$CPT_3 = \begin{pmatrix} P(O_t = 1 \mid Q_t^1 = 1, Q_t^2 = 1) & P(O_t = 2 \mid Q_t^1 = 1, Q_t^2 = 1) \\ P(O_t = 1 \mid Q_t^1 = 2, Q_t^2 = 1) & P(O_t = 2 \mid Q_t^1 = 2, Q_t^2 = 1) \\ P(O_t = 1 \mid Q_t^1 = 1, Q_t^2 = 2) & P(O_t = 2 \mid Q_t^1 = 1, Q_t^2 = 2) \\ P(O_t = 1 \mid Q_t^1 = 2, Q_t^2 = 2) & P(O_t = 2 \mid Q_t^1 = 2, Q_t^2 = 2) \end{pmatrix}$$
Page 68
Factorization in DBN
3. Def.: A DBN over two consecutive time slices with the aggregated random state variables $X = (X_t^1, X_t^2, \dots, X_t^n)$ and $X' = (X_{t+1}^1, X_{t+1}^2, \dots, X_{t+1}^n)$ is a net fragment $G_{tr}$ and represents the probability distribution of a transition model according to
$$P_{tr}(X' \mid X) = \prod_{i=1}^{n} P(X'^i \mid \mathrm{Parents}(X'^i))$$
3. Proposition: The joint probability distribution of the aggregated random state variables of a DBN can be factorized in each time slice T according to:
$$P(X_1, \dots, X_T) = \Pi_1^{DBN}(X_1) \prod_{t=2}^{T} P_{tr}^{DBN}(X_t \mid X_{t-1})$$
$\Pi_1^{DBN}(.)$ represents the initial probability distribution of the aggregated state variables in the first time slice and $P_{tr}^{DBN}(.)$ represents the transition model defined by Def. 3.
Page 69
Open Questions?