TDT4171 Artificial Intelligence Methods
Lecture 3 & 4 – Probabilistic Reasoning Over Time
Norwegian University of Science and Technology
Helge Langseth, IT-VEST 310
[email protected]
Outline
1 Leftovers from last time
   Inference
2 Probabilistic Reasoning over Time
   Set-up
   Basic speech recognition
   Inference: Filtering, prediction, smoothing
   Inference for Hidden Markov models
   Kalman Filters
   Dynamic Bayesian networks
   Summary
3 Speech recognition
   Speech as probabilistic inference
   Speech sounds
   Word sequences
Summary from last time
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of the joint distribution
Generally easy to construct – also for non-experts
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs

Announcements

The first assignment is due next Friday
Deliver it using It’s Learning
There will be no lecture next week!
Inference tasks
Simple queries: compute the posterior marginal P(X_i | E = e), e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)
Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?
Inference tasks – Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network (B = Burglary, E = Earthquake, A = Alarm, J = JohnCalls, M = MaryCalls):

P(B | j, m) = P(B, j, m)/P(j, m)
            = α P(B, j, m)
            = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:

P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(n) space; O(n · d^n) time for the naive sum, O(d^n) once the summations are moved inwards.
Enumeration algorithm
function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
      extend e with value x_i for X
      Q(x_i) ← Enum-All(Vars[bn], e)
  return Normalize(Q(X))

function Enum-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
      then return P(y | Pa(Y)) × Enum-All(Rest(vars), e)
      else return Σ_y P(y | Pa(Y)) × Enum-All(Rest(vars), e_y)
           where e_y is e extended with Y = y
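A minimal runnable Python sketch of this algorithm for the burglary network may help; the CPT encoding and function names below are my own choices, and the probabilities are the standard AIMA burglary numbers.

def p_b(v, e): return 0.001 if v else 0.999
def p_e(v, e): return 0.002 if v else 0.998
def p_a(v, e):
    p = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}[(e["B"], e["E"])]
    return p if v else 1.0 - p
def p_j(v, e):
    p = 0.90 if e["A"] else 0.05
    return p if v else 1.0 - p
def p_m(v, e):
    p = 0.70 if e["A"] else 0.01
    return p if v else 1.0 - p

CPTS = {"B": p_b, "E": p_e, "A": p_a, "J": p_j, "M": p_m}
ORDER = ["B", "E", "A", "J", "M"]   # topological order of the network

def enum_all(variables, e):
    """Recursive depth-first sum over all unobserved variables."""
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    if y in e:
        return CPTS[y](e[y], e) * enum_all(rest, e)
    return sum(CPTS[y](v, {**e, y: v}) * enum_all(rest, {**e, y: v})
               for v in (True, False))

def enumeration_ask(X, e):
    """Posterior distribution over X given evidence e."""
    q = {v: enum_all(ORDER, {**e, X: v}) for v in (True, False)}
    norm = sum(q.values())
    return {v: p / norm for v, p in q.items()}

print(enumeration_ask("B", {"J": True, "M": True}))
# -> roughly {True: 0.284, False: 0.716}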
Evaluation tree
[Evaluation tree for the burglary query: the tree branches on the values of e and a, multiplying leaf factors such as P(b) = .001, P(e) = .002, P(¬e) = .998, P(a | b, e), P(j | a) = .90, P(j | ¬a) = .05, P(m | a) = .70 and P(m | ¬a) = .01 along each path.]

Enumeration is inefficient, as we repeat the computation of, e.g., P(j | a) P(m | a) for each value of e.
⇒ Nice to know that better methods are available...
Summary of Chapter 14
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of the joint distribution
Generally easy to construct – also for non-experts
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
Efficient inference calculations are available (but the good ones are outside the scope of this course)

What you should know:

How to build models (and verify them using Conditional Independence and Causality)
What drives the...
  model building burden
  complexity of inference
Time and uncertainty
Motivation: the world changes; we need to track and predict it.
Static (vehicle diagnosis) vs. dynamic (diabetes management)

Basic idea: copy state and evidence variables for each time step

Rain_t = Does it rain at time t?

This assumes discrete time; the step size depends on the problem. Here: a time step is one day, I guess (?)
Markov processes (Markov chains)

If we want to construct a Bayes net from these variables, then what are the parents?

Assume we have observations of Rain_0, Rain_1, ..., Rain_t and want to predict whether or not it rains on day t+1: P(Rain_t+1 | Rain_0, Rain_1, ..., Rain_t)

Try to build a BN over Rain_0, Rain_1, ..., Rain_t+1:

P(Rain_t+1) ≠ P(Rain_t+1 | Rain_t); base it on Rain_t.
P(Rain_t+1 | Rain_t) ≈ P(Rain_t+1 | Rain_t, Rain_t−1) (Do you agree?)

First-order Markov process:
P(Rain_t+1 | Rain_0, ..., Rain_t) = P(Rain_t+1 | Rain_t)
"Future is conditionally independent of Past given Present"

k'th-order Markov process:
P(Rain_t+1 | Rain_0, ..., Rain_t) = P(Rain_t+1 | Rain_t−k+1, ..., Rain_t)
Markov processes as Bayesian networks

If we want to construct a Bayes net from these variables, then what are the parents?

Markov assumption: X_t depends on a bounded subset of X_0:t−1

First-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−1)
Second-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−2, X_t−1)

[Figure: chains over X_t−2, X_t−1, X_t, X_t+1, X_t+2; in the first-order chain each node has only its predecessor as parent, in the second-order chain each node also has its predecessor's predecessor as parent.]
Is a first-order Markov process suitable?
The first-order Markov assumption is not exactly true in the real world!

Possible fixes:
1 Increase the order of the Markov process
2 Augment the state, e.g., add Temp_t and Pressure_t

State augmentation is enough!

Any k'th-order Markov process can be expressed as a first-order Markov process – we focus on first-order processes from now on.

"Proof":
1 Assume for simplicity that the process contains only the variable X, and that we have a second-order Markov process.
2 Create a new variable X'_t identical to X_t−1.
3 Let X_t+1 have both X_t and X'_t as parents.
4 Do this for all t. The augmented model is a first-order Markov process.
Speech as probabilistic inference
How can we recognize speech?
Speech signals are noisy, variable, ambiguous

What is the most likely word sequence, given the speech signal?

Why not choose Words to maximize P(Words | signal)? Use Bayes' rule:

P(Words | signal) = α P(signal | Words) P(Words)

I.e., the problem decomposes into an acoustic model + a language model

Need to be able to do the required calculations!
Generation of Speech
The sound signal - Characteristics
[Waveform plot: amplitude vs. time (s) for a short speech segment.]

Sound is dynamic, and we must take this into account to represent it faithfully.

Sound is a "wavy" signal-train, with amplitude and frequency information changing all the time.

Volume of speech ↔ global change of amplitudes
Speed of speech ↔ global change of frequencies

Most information is carried by the frequencies around 1 kHz
The raw sound for recognition/classification
[Four waveform plots, amplitude vs. time (s): the raw signals of the words "Start", "Stop", "Left", and "Right".]
Phones
All human speech is composed from 40-50 phones, determined by the configuration of articulators

Form an intermediate (hidden) level between words and signal ⇒ speech of a word = uttering a sequence of phones.

ARPAbet designed for American English:

[iy] beat      [b] bet       [p] pet
[ih] bit       [ch] Chet     [r] rat
[ey] bet       [d] debt      [s] set
[ao] bought    [hh] hat      [th] thick
[ow] boat      [hv] high     [dh] that
[er] Bert      [l] let       [w] wet
[ix] roses     [ng] sing     [en] button
...            ...           ...
E.g., “ceiling” is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]
Markov processes and speech
Assume we observe phonemes directly. Let X_t be the phoneme uttered inside frame t:

X_t is a single, discrete variable.
X_t takes on a value from the state-space {1, 2, ..., N}, where N is the total number of phonemes.
An observation sequence is {x_1, x_2, ..., x_T} (use x_1:T as a shorthand).
It is common to assume a Markov process for speech signals.
(Observable) Markov processes; full set of assumptions
Stationary process: the transition model P(X_t | pa(X_t)) is fixed for all t

k'th-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−k:t−1)

Parameters:
  Transition matrix T: P(X_t | X_t−k:t−1)
  Prior distribution π: P(X_0:k−1)
Hidden Markov models
Phonemes are not observable themselves
Phoneme X_t is partially disclosed by the sound signal in frame t (or our representation of that). We call the observation E_t.

Reasonable assumptions to make:

Stationary process: the transition model P(X_t | pa(X_t)) is fixed for all t
k'th-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−k:t−1)
Sensor Markov assumption: P(E_t | X_1:t, E_1:t−1) = P(E_t | X_t)
Hidden Markov models as Bayesian networks
[Figure: chain X_0 → X_1 → X_2 → X_3 → X_4, with each E_t (t = 1, ..., 4) a child of X_t.]

The variables X_t are discrete and one-dimensional; the variables E_t are vectors of variables used to represent the sound signal in that frame.
Example of Hidden Markov Model from the book
[Figure: chain ... Rain_t−1 → Rain_t → Rain_t+1 ..., with Umbrella_t a child of each Rain_t.]

Transition model P(R_t | R_t−1): P(R_t = t | R_t−1 = t) = 0.7, P(R_t = t | R_t−1 = f) = 0.3
Sensor model P(U_t | R_t): P(U_t = t | R_t = t) = 0.9, P(U_t = t | R_t = f) = 0.2
Recognition of isolated words
Let e_1:T denote the observation of a sound signal over T frames.

Must define a model to find the likelihood P(e_1:T | word) for an isolated word:

P(word | e_1:T) = α P(e_1:T | word) P(word)

Prior probability P(word) by counting word frequencies.

This leaves us with the problem of calculating P(e_1:T | word) to do single-word speech recognition.
Top level design
[Figure: the sound signal is passed through a Transform into features o_1:T, which are sent in parallel to one HMM per word (word_1, ..., word_n); each HMM returns a probability p_j to the Classifier.]

The top-level structure for the classifier has one HMM per word. Note that the same data is sent to all models, and that the probability p_j = P(e_1:T | word_j) is returned from the HMMs.
Inference tasks
Filtering: P(X_t | e_1:t). This is the belief state – input to the decision process of a rational agent. Also, as an artifact of the calculation scheme, we get the probability needed for speech recognition if we are interested.

Prediction: P(X_t+k | e_1:t) for k > 0. Evaluation of possible action sequences; like filtering without the evidence.

Smoothing: P(X_k | e_1:t) for 0 ≤ k < t. Better estimate of past states – essential for learning.

Most likely explanation: argmax_x_1:t P(x_1:t | e_1:t). Speech recognition, decoding with a noisy channel.
Filtering
Aim: devise a recursive state estimation algorithm:

P(X_t+1 | e_1:t+1) = Some-Func(P(X_t | e_1:t), e_t+1)

P(X_t+1 | e_1:t+1) = P(X_t+1, e_1:t, e_t+1)/P(e_1:t+1)
                   = P(e_t+1 | X_t+1, e_1:t) · P(X_t+1 | e_1:t) · P(e_1:t)/P(e_1:t+1)
                   = α · P(e_t+1 | X_t+1) · P(X_t+1 | e_1:t)
                         [evidence]          [prediction]

So, filtering is a prediction updated by evidence.
Filtering (cont'd)

Prediction by summing out X_t:

P(X_t+1 | e_1:t+1) = α · P(e_t+1 | X_t+1) · P(X_t+1 | e_1:t)
                   = α · P(e_t+1 | X_t+1) · Σ_xt P(X_t+1 | x_t, e_1:t) P(x_t | e_1:t)
                   = α · P(e_t+1 | X_t+1) · Σ_xt P(X_t+1 | x_t) P(x_t | e_1:t)

where the last sum is P(X_t+1 | e_1:t), computed using what we already have.

All relevant information is contained in f_1:t = P(X_t | e_1:t); belief revision using f_1:t+1 = Forward(f_1:t, e_t+1).

Note! Time and space requirements for calculating f_1:t+1 are constant (independent of t).
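As a concrete illustration, here is a small Python sketch of one Forward step for the umbrella world; the function and variable names are my own, while the probabilities are the book's umbrella numbers.

def forward(f, umbrella):
    """f = (P(rain | e_1:t), P(not rain | e_1:t)); returns f for t+1."""
    # Prediction: sum out X_t using P(rain_t+1 | rain_t) = 0.7 etc.
    pred_rain = 0.7 * f[0] + 0.3 * f[1]
    pred_no = 0.3 * f[0] + 0.7 * f[1]
    # Update with the evidence likelihoods, then normalize (the alpha).
    like_rain = 0.9 if umbrella else 0.1
    like_no = 0.2 if umbrella else 0.8
    num = (like_rain * pred_rain, like_no * pred_no)
    norm = num[0] + num[1]
    return (num[0] / norm, num[1] / norm)

f = (0.5, 0.5)              # prior P(Rain_0)
for e in [True, True]:      # umbrella observed on days 1 and 2
    f = forward(f, e)
print(f)                    # -> (0.883..., 0.116...)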
Filtering example

[Figure: Rain_0 → Rain_1 → Rain_2, with Umbrella_1 and Umbrella_2 as children of Rain_1 and Rain_2; the umbrella is observed on both days.]

P(X_0) = ⟨0.5, 0.5⟩

P(X_1) = Σ_x0 P(X_1 | x_0) · P(x_0)
       = ⟨0.7, 0.3⟩ · 0.5 + ⟨0.3, 0.7⟩ · 0.5 = ⟨0.5, 0.5⟩

P(X_1 | e_1) = α · P(e_1 | X_1) P(X_1)
             = α · ⟨0.9 · 0.5, 0.2 · 0.5⟩ = ⟨0.818, 0.182⟩

P(X_2 | e_1) = Σ_x1 P(X_2 | x_1) · P(x_1 | e_1)
             = ⟨0.7, 0.3⟩ · 0.818 + ⟨0.3, 0.7⟩ · 0.182 = ⟨0.627, 0.373⟩

P(X_2 | e_1:2) = α · P(e_2 | X_2) · P(X_2 | e_1)
               = α · ⟨0.9, 0.2⟩ · ⟨0.627, 0.373⟩ = α · ⟨0.565, 0.075⟩ = ⟨0.883, 0.117⟩
Prediction

P(X_t+k+1 | e_1:t) = Σ_xt+k P(X_t+k+1 | x_t+k) P(x_t+k | e_1:t)

Again we have a recursive formulation – this time over k...

As k → ∞, P(x_t+k | e_1:t) tends to the stationary distribution of the Markov chain. This means that the effect of e_1:t vanishes as k grows, and the predictions become more and more dubious.

Mixing time depends on how stochastic the chain is ("how persistent X is")
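A quick numerical check of this mixing behaviour (my own illustration): starting from the filtered distribution P(X_2 | e_1:2) = ⟨0.883, 0.117⟩ of the umbrella example, repeated prediction steps approach ⟨0.5, 0.5⟩.

p = (0.883, 0.117)          # P(Rain_2 | e_1:2) from the filtering example
for k in range(10):
    # One prediction step with the umbrella transition model.
    p = (0.7 * p[0] + 0.3 * p[1], 0.3 * p[0] + 0.7 * p[1])
    print(k + 3, p)         # day 3 gives (0.653, 0.347); day 12 is near (0.5, 0.5)

Here ⟨0.5, 0.5⟩ is the stationary distribution because the umbrella transition matrix is symmetric (doubly stochastic), so the uniform distribution is its fixed point.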
Prediction – Example

From the filtering example: P(X_2 | e_1:2) = ⟨0.883, 0.117⟩.

P(X_3 | e_1:2) = Σ_x2 P(X_3 | x_2) · P(x_2 | e_1:2)
              = ⟨0.7, 0.3⟩ · 0.883 + ⟨0.3, 0.7⟩ · 0.117 = ⟨0.653, 0.347⟩

P(X_4 | e_1:2) = Σ_x3 P(X_4 | x_3) · P(x_3 | e_1:2)
              = ⟨0.7, 0.3⟩ · 0.653 + ⟨0.3, 0.7⟩ · 0.347 = ⟨0.561, 0.439⟩

...

P(X_10 | e_1:2) = Σ_x9 P(X_10 | x_9) · P(x_9 | e_1:2)
               = ⟨0.7, 0.3⟩ · 0.501 + ⟨0.3, 0.7⟩ · 0.499 = ⟨0.500, 0.500⟩

lim_k→∞ P(X_t+k | e_1:t) = ⟨1/2, 1/2⟩ — but why?
Example: Automatic recognition of hand-written digits
We have a system that can "recognise" hand-written digits:

[Figure: images of hand-written digits (0, 3, 6, 8) fed to a box computing P(image | Digit).]

Takes a binary image of a handwritten digit as input
Returns P(image | Digit)
(The system we will consider is not very good)
Internals of recogniser – Naive Bayes

An image is a 16 × 16 matrix of binary variables Image_i,j: Image_i,j = true if pixel (i, j) is white, false otherwise.

How should we proceed? We need a model for P(image | Digit). Note that image is 256-dimensional.

Note! The different digits distribute white spots differently in the image ⇒ combine single-pixel information to find the digit.

In this example we assume that each location contributes independently (Naive Bayes model):

P(image | Digit) = ∏_i ∏_j P(image_i,j | Digit)
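A minimal Python sketch of this likelihood, assuming per-pixel probabilities theta[d][i][j] = P(Image_i,j = white | Digit = d) have been estimated from training data; all names here are hypothetical.

import numpy as np

def log_likelihood(image, theta_d):
    """log P(image | digit) under the pixel-independence assumption.

    image: 16x16 boolean array; theta_d: 16x16 array of white-pixel probabilities."""
    eps = 1e-9  # guard against log(0) for pixels never/always white in training
    theta_d = np.clip(theta_d, eps, 1 - eps)
    return np.sum(np.where(image, np.log(theta_d), np.log(1 - theta_d)))

def classify(image, theta, prior):
    """argmax over d of log P(digit = d) + log P(image | digit = d)."""
    scores = [np.log(prior[d]) + log_likelihood(image, theta[d])
              for d in range(10)]
    return int(np.argmax(scores))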
Scaling up: ZIP-codes
We want to build a system that can decode hand-written ZIP-codes for letters to Norway.

[Figure: four digit images feeding Digit_1, Digit_2, Digit_3, Digit_4.]

There is structure in this:

ZIP-codes always have 4 digits
Some ZIP-codes are more frequent than others (e.g., 0xxx – 13xx for Oslo, 50xx for Bergen, 70xx for Trondheim)
Some ZIP-codes are not used, e.g. 5022 does not exist
...but some illegal numbers are often used, e.g. 7000 meaning "wherever in Trondheim"

Can we utilise the internal structure to improve the digit recogniser?
How to model the internal structure of ZIP-codes
Take 1: Full model

[Figure: fully connected model over Digit_1, ..., Digit_4.]

The full model includes all relations between digits:
  7465 is good, 7365 is not

The problem is related to the size of the CPTs:
  How many numbers are needed to represent P(Digit_4 | Pa(Digit_4))?
  What if we want to use this system to recognise KID numbers (> 10 digits)?

Take 2: Markov model

[Figure: chain Digit_1 → Digit_2 → Digit_3 → Digit_4.]

The reduced model includes only some relations between digits:
  Can represent "If it starts with 7 and digit number three is 6, then the second one is probably 4"
  Cannot represent "If it starts with 9 then digit number four is probably not 7"

What about making the model stationary?
  Does not seem appropriate here.
  Might be necessary and/or reasonable for KID recognition, though.
Inference (filtering)
Step 1: First digit classified as a 4! (Not good! I told you!)

[Figure: the image of the first digit under Digit_1, with candidate readings 1, 4, 7; Digit_2 – Digit_4 are still unclassified. Shown: P(Digit_1 | image_1).]

So what happened?

The Naive Bayes method supplies P(image_1 | Digit_1)

Using the calculation rule, the system finds

P(Digit_1 | image_1) = α · P(image_1 | Digit_1) · P(Digit_1)
Inference (filtering)
Step 2: Second digit classified as a 4.

[Figure: the image of the second digit under Digit_2; shown: P(Digit_2 | image_1, image_2).]

So what happened?

The Naive Bayes method supplies P(image_2 | Digit_2)

Using the calculation rule, the system finds

P(Digit_2 | image_1, image_2) = α · P(image_2 | Digit_2) · Σ_digit_1 P(Digit_2 | digit_1) P(digit_1 | image_1)

To do the classification, the system used the information that
  the image is a very typical "4"
  7 → 4 is probable
  4 → 4 is not very probable, but possible

Can this structural information also be used "backwards"?

If the 2nd digit is 4, then the 1st digit is probably a 7, not a 4

This is called smoothing
Smoothing
[Figure: chain X_0 → X_1 → ... → X_k → ... → X_t, with evidence E_1, ..., E_k, ..., E_t.]

Calculate P(X_k | e_1:t) by dividing the evidence e_1:t into e_1:k and e_k+1:t:

P(X_k | e_1:t) = P(X_k | e_1:k, e_k+1:t)
              = P(X_k, e_1:k, e_k+1:t)/P(e_1:k, e_k+1:t)
              = P(e_k+1:t | X_k, e_1:k) · P(X_k | e_1:k) · P(e_1:k)/P(e_1:k, e_k+1:t)
              = α · P(X_k | e_1:k) · P(e_k+1:t | X_k)
              = α · f_1:k · b_k+1:t

where b_k+1:t = P(e_k+1:t | X_k).
Backward message computed by a backwards recursion:

P(e_k+1:t | X_k) = Σ_xk+1 P(e_k+1:t | X_k, x_k+1) P(x_k+1 | X_k)
                = Σ_xk+1 P(e_k+1:t | x_k+1) P(x_k+1 | X_k)
                = Σ_xk+1 P(e_k+1 | x_k+1) · P(e_k+2:t | x_k+1) · P(x_k+1 | X_k)

So...

b_k+1:t = P(e_k+1:t | X_k) = Σ_xk+1 P(e_k+1 | x_k+1) · b_k+2:t(x_k+1) · P(x_k+1 | X_k)
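A small Python sketch of this backward recursion for the umbrella world, reusing the forward message from the filtering example; names are my own, probabilities are the book's.

def backward(b, umbrella):
    """Given b_k+2:t, return b_k+1:t, indexed (rain, not rain)."""
    like_rain = 0.9 if umbrella else 0.1
    like_no = 0.2 if umbrella else 0.8
    return (0.7 * like_rain * b[0] + 0.3 * like_no * b[1],
            0.3 * like_rain * b[0] + 0.7 * like_no * b[1])

def normalize(v):
    s = sum(v)
    return tuple(x / s for x in v)

b = backward((1.0, 1.0), True)      # b_2:2 from b_3:2 = <1, 1>, evidence e_2
print(b)                            # -> (0.69, 0.41)
f11 = (0.818, 0.182)                # f_1:1 from the filtering example
print(normalize((f11[0] * b[0], f11[1] * b[1])))  # P(X_1 | e_1:2) ~ (0.883, 0.117)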
Smoothing example

[Figure: Rain_0 → Rain_1 → Rain_2, with Umbrella_1 and Umbrella_2 observed.]

f_0 = ⟨0.5, 0.5⟩,  f_1:1 = ⟨0.818, 0.182⟩,  f_1:2 = ⟨0.883, 0.117⟩

b_3:2 = P(e_3:2 | X_2) = ⟨1, 1⟩ (void evidence)

P(X_2 | e_1:2) = α · f_1:2 · b_3:2 = α · ⟨0.883, 0.117⟩ · ⟨1, 1⟩ = ⟨0.883, 0.117⟩

b_2:2 = P(e_2:2 | X_1) = Σ_x2 P(e_2 | x_2) · b_3:2(x_2) · P(x_2 | X_1)
      = (0.9 · 1 · ⟨0.7, 0.3⟩) + (0.2 · 1 · ⟨0.3, 0.7⟩) = ⟨0.690, 0.410⟩

P(X_1 | e_1:2) = α · f_1:1 · b_2:2 = α · ⟨0.818, 0.182⟩ · ⟨0.690, 0.410⟩ = ⟨0.883, 0.117⟩
Smoothing example — conclusion
[Figure: the umbrella network annotated with the messages (True/False components):
forward:  ⟨0.500, 0.500⟩ (t = 0), ⟨0.818, 0.182⟩ (t = 1), ⟨0.883, 0.117⟩ (t = 2)
backward: ⟨0.690, 0.410⟩ (t = 1), ⟨1.000, 1.000⟩ (t = 2)
smoothed: ⟨0.883, 0.117⟩ (t = 1), ⟨0.883, 0.117⟩ (t = 2)]

Forward–backward algorithm: cache the forward messages as we move forward.
Time linear in t (polytree inference), space O(t · |f|)
How to classify ZIP-codes?
[Figure: chain Digit_1 → Digit_2 → Digit_3 → Digit_4 with one image per digit.]

Can we take the most probable digit per image and use that for classification?

NO! The most likely sequence IS NOT the sequence of most likely states!
Most likely explanation
Most likely sequence ≠ sequence of most likely states!

Most likely path to each x_t+1 = most likely path to some x_t plus one more step:

max_x1...xt P(x_1, ..., x_t, X_t+1 | e_1:t+1)
  = max_x1...xt P(x_1, ..., x_t, X_t+1, e_1:t+1)/P(e_1:t+1)
  = max_x1...xt α · P(e_t+1 | x_1, ..., x_t, X_t+1, e_1:t) · P(X_t+1 | x_1, ..., x_t, e_1:t) · P(x_1, ..., x_t | e_1:t)
  = max_x1...xt α · P(e_t+1 | X_t+1) · P(X_t+1 | x_t) · P(x_1, ..., x_t | e_1:t)
  = α · P(e_t+1 | X_t+1) · max_xt ( P(X_t+1 | x_t) max_x1...xt−1 P(x_1, ..., x_t−1, x_t | e_1:t) )
Identical to filtering, except f_1:t is replaced by

m_1:t = max_x1...xt−1 P(x_1, ..., x_t−1, X_t | e_1:t),

i.e., m_1:t(i) gives the probability of the most likely path to state i. The update has the sum replaced by max, giving the Viterbi algorithm:

m_1:t+1 = P(e_t+1 | X_t+1) max_xt ( P(X_t+1 | x_t) · m_1:t )
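A compact Python sketch of the Viterbi update for the umbrella world (my own implementation); it reproduces the m-values in the example that follows.

T = [[0.7, 0.3], [0.3, 0.7]]        # T[i][j] = P(X_t+1 = j | X_t = i); 0 = rain
def O(u):                           # likelihood vector P(e | X = j)
    return [0.9, 0.2] if u else [0.1, 0.8]

def viterbi(evidence):
    m = [0.8182, 0.1818]            # m_1:1 = P(X_1 | e_1); e_1 already folded in
    back = []                       # backpointers for path recovery
    for e in evidence[1:]:
        scores = [[T[i][j] * m[i] for i in range(2)] for j in range(2)]
        back.append([max(range(2), key=lambda i: scores[j][i])
                     for j in range(2)])
        m = [O(e)[j] * max(scores[j]) for j in range(2)]
    path = [max(range(2), key=lambda j: m[j])]   # best final state
    for bp in reversed(back):                    # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path)), m

path, m = viterbi([True, True, False, True, True])
print(path)   # -> [0, 0, 1, 0, 0]: rain, rain, no rain, rain, rain
print(m)      # -> final message m_1:5 ~ [0.0210, 0.0024]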
Viterbi example
[Figure: state-space trellis for Rain_1, ..., Rain_5 with observations umbrella = (true, true, false, true, true); bold arcs mark the most likely paths.]

         m_1:1    m_1:2    m_1:3    m_1:4    m_1:5
true     .8182    .5155    .0361    .0334    .0210
false    .1818    .0491    .1237    .0173    .0024

m_1:t+1 = P(e_t+1 | X_t+1) max_xt ( P(X_t+1 | x_t) · m_1:t )
Simplifications for Hidden Markov models
X_t is a single, discrete variable (usually E_t is too)
Domain of X_t is {1, ..., S}

Transition matrix T_ij = P(X_t = j | X_t−1 = i), e.g.,

T = ( 0.7  0.3 )
    ( 0.3  0.7 )

Sensor matrix O_t for each time step, with diagonal elements P(e_t | X_t = i). For instance, with U_1 = true we get

O_1 = ( P(u_1 | x_1)       0        )   ( 0.9   0  )
      (      0        P(u_1 | ¬x_1) ) = (  0   0.2 )

Forward and backward messages as column vectors:

f_1:t+1 = α O_t+1 T^⊤ f_1:t
b_k+1:t = T O_k+1 b_k+2:t

The forward-backward algorithm needs time O(S²t) and space O(St)
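These matrix formulas transcribe directly into numpy; a short sketch with the umbrella numbers (the code is my own).

import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])                  # T_ij = P(X_t = j | X_t-1 = i)

def O(u):
    """Diagonal sensor matrix for umbrella observation u."""
    return np.diag([0.9, 0.2]) if u else np.diag([0.1, 0.8])

def forward(f, u):
    f = O(u) @ T.T @ f
    return f / f.sum()                      # normalization plays the role of alpha

def backward(b, u):
    return T @ O(u) @ b

f = np.array([0.5, 0.5])
for u in [True, True]:
    f = forward(f, u)
print(f)                                    # -> [0.883 0.117]
print(backward(np.ones(2), True))           # b_2:2 = [0.69 0.41]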
Hidden Markov Models at work: Bach
[Figure: HMM with hidden chain Chord_0 → Chord_1 → Chord_2 and observed Melody_1, Melody_2.]
http://www.anc.inf.ed.ac.uk/demos/hmmbach/demo1.html
Kalman filters
Modelling systems described by a set of continuous variables, e.g., tracking a flying bird — X_t = (X, Y, Z, Ẋ, Ẏ, Ż).

Also: airplanes, robots, ecosystems, economies, chemical plants, planets, ...

"Noisy" observations, continuous variables, dynamic model

[Figure: chain X_t → X_t+1 with observations Z_t, Z_t+1.]

Gaussian prior, linear Gaussian transition model and sensor model
Continuous variables
Need a way to define a conditional density function for a child variable given continuous parents

Most common is the linear Gaussian model, e.g.:

P(X_t = x_t | X_t−1 = x_t−1) = N(a · x_t−1 + b, σ²)(x_t)
                             = (1/(σ √(2π))) · exp( −(1/2) · ((x_t − (a · x_t−1 + b))/σ)² )

The mean of X_t varies linearly with x_t−1; the variance is fixed

Linear variation and fixed variance may be unreasonable over the full range, but may work OK if the likely range of X_t is narrow
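A one-function sanity check of the density above, with toy numbers of my own choosing.

import math

def linear_gaussian(x_t, x_prev, a, b, sigma):
    """Density of x_t given x_prev under the linear Gaussian model."""
    mean = a * x_prev + b
    return math.exp(-0.5 * ((x_t - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(linear_gaussian(1.0, 0.5, a=2.0, b=0.0, sigma=1.0))  # at the mean: ~0.399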
Continuous variables (cont’d)
An all-continuous network with linear Gaussian distributions ⇒ the full joint distribution is a multivariate Gaussian
Updating Gaussian distributions
Prediction step: if P(X_t | e_1:t) is Gaussian, then the prediction

P(X_t+1 | e_1:t) = ∫ P(X_t+1 | x_t) P(x_t | e_1:t) dx_t

is Gaussian.

If P(X_t+1 | e_1:t) is Gaussian, then the updated distribution

P(X_t+1 | e_1:t+1) = α P(e_t+1 | X_t+1) P(X_t+1 | e_1:t)

is Gaussian.

Hence P(X_t | e_1:t) is multivariate Gaussian N(µ_t, Σ_t) for all t.

General (nonlinear, non-Gaussian) process: the description of the posterior grows unboundedly as t → ∞
Simple 1-D example
Task: measure the Norwegian population's level of job satisfaction on a monthly basis

Scale: real numbers (typical values from −5 to +5)

Indirect measurement: ask a random subset of N people

Modelling assumptions:

The true value cannot be measured (N ≪ 4.5 · 10⁶), but the measurements (Z_t) are correlated with the true value (X_t):
  P(z_t | X_t = x_t) = N(x_t, σ_z²)(z_t)

The true level at time t is related to the level at time t − 1:
  P(x_t | X_t−1 = x_t−1) = N(x_t−1, σ_x²)(x_t)

That is, we have a Gaussian random walk
Simple 1-D example (cont’d)
Gaussian random walk on the X-axis, transition s.d. σ_x, sensor s.d. σ_z:

µ_t+1 = ((σ_t² + σ_x²) z_t+1 + σ_z² µ_t) / (σ_t² + σ_x² + σ_z²)

σ²_t+1 = ((σ_t² + σ_x²) σ_z²) / (σ_t² + σ_x² + σ_z²)
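A Python sketch of this 1-D update, written in the equivalent gain form; variable names and the toy prior are my own, and z_1 = 2.5 matches the plot below.

def kalman_1d(mu, var, z, var_x, var_z):
    """One predict+update step; var_x, var_z are transition/sensor variances."""
    var_pred = var + var_x                  # variance after the random walk
    k = var_pred / (var_pred + var_z)       # scalar Kalman gain
    mu_new = mu + k * (z - mu)              # same as ((var_pred)z + var_z*mu)/(var_pred + var_z)
    var_new = (1 - k) * var_pred            # same as var_pred*var_z/(var_pred + var_z)
    return mu_new, var_new

mu, var = 0.0, 1.0                          # prior on job satisfaction
mu, var = kalman_1d(mu, var, z=2.5, var_x=2.0, var_z=1.0)
print(mu, var)                              # -> 1.875 0.75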
[Plot: P(X) vs. X position, showing the prior P(x_0), the prediction P(x_1), and the posterior P(x_1 | z_1 = 2.5), which lies between the prediction and the observation z_1.]
General Kalman update
Transition and sensor models:

P(x_t+1 | x_t) = N(F x_t, Σ_x)(x_t+1)
P(z_t | x_t) = N(H x_t, Σ_z)(z_t)

F is the matrix for the transition; Σ_x the transition noise covariance
H is the matrix for the sensors; Σ_z the sensor noise covariance

The filter computes the following update:

µ_t+1 = F µ_t + K_t+1 (z_t+1 − H F µ_t)
Σ_t+1 = (I − K_t+1 H)(F Σ_t F^⊤ + Σ_x)

where K_t+1 = (F Σ_t F^⊤ + Σ_x) H^⊤ (H (F Σ_t F^⊤ + Σ_x) H^⊤ + Σ_z)^−1 is the Kalman gain matrix.

Note! Σ_t and K_t are independent of the observation sequence, so they can be computed offline.
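A numpy transcription of this update, assuming F, H, Sx (= Σ_x) and Sz (= Σ_z) are given matrices of compatible shapes; a sketch, not a tuned implementation.

import numpy as np

def kalman_step(mu, Sigma, z, F, H, Sx, Sz):
    pred_mu = F @ mu                            # predicted mean
    pred_Sigma = F @ Sigma @ F.T + Sx           # predicted covariance
    S = H @ pred_Sigma @ H.T + Sz               # innovation covariance
    K = pred_Sigma @ H.T @ np.linalg.inv(S)     # Kalman gain
    mu_new = pred_mu + K @ (z - H @ pred_mu)
    Sigma_new = (np.eye(len(mu)) - K @ H) @ pred_Sigma
    return mu_new, Sigma_new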
2-D tracking example: filtering
[Plot: 2D filtering example, X vs. Y, showing the true trajectory, the noisy observations, and the filtered estimate.]
2-D tracking example: smoothing
[Plot: 2D smoothing example, X vs. Y, showing the true trajectory, the noisy observations, and the smoothed estimate.]
Where Kalman Filtering falls apart
Kalman filters cannot be applied if the transition model is nonlinear

The Extended Kalman Filter models the transition as locally linear around x_t = µ_t. It fails if the system is locally non-smooth.

The Switching Kalman Filter can be used to handle discontinuities
Dynamic Bayesian networks
X_t and E_t can contain arbitrarily many variables, in a replicated Bayes net.

[Figure: two DBN fragments. Left: the umbrella DBN with prior P(R_0) = 0.7, transition P(R_1 | R_0) = 0.7 (R_0 = t) / 0.3 (R_0 = f), and sensor P(U_1 | R_1) = 0.9 (R_1 = t) / 0.2 (R_1 = f). Right: a battery-monitoring DBN with Battery_0 → Battery_1, X_0 → X_1, and observations BMeter_1 and Z_1.]
DBNs vs. HMMs
Every HMM is a single-variable DBN; every discrete DBN is an HMM

[Figure: a two-slice DBN with variables X_t, Y_t, Z_t and their t+1 copies, sparsely connected.]

Sparse dependencies ⇒ exponentially fewer parameters

E.g., with 20 Boolean state variables with three parents each, the DBN has 20 × 2³ = 160 parameters, while the corresponding HMM has 2²⁰ × 2²⁰ ≈ 10¹²
DBNs vs. Kalman filters
Every Kalman filter model is a DBN, but few DBNs are KFs, as the real world requires non-Gaussian posteriors:
Where are my keys? What's the battery charge? Does this system work?

[Figure: the battery-monitoring DBN extended with a sensor-status chain BMBroken_0 → BMBroken_1 feeding BMeter_1.]

[Plot: E(Battery) over time steps 15-30, showing E(Battery | ...5555005555...), E(Battery | ...5555000000...), P(BMBroken | ...5555000000...), and P(BMBroken | ...5555005555...).]
Summary – Representation and inference in temporal models

Temporal models use state and sensor variables replicated over time

Markov assumptions and a stationarity assumption, so we need
  Transition model P(X_t | X_t−1)
  Sensor model P(E_t | X_t)

Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step

Hidden Markov models have a single discrete state variable; used for speech recognition

Kalman filters allow n state variables, linear Gaussian, O(n³) update

Dynamic Bayes nets subsume HMMs and Kalman filters; exact update is intractable, but approximations exist
Speech as probabilistic inference
Let us return to the question of how to recognize speech

Speech signals are noisy, variable, ambiguous

Classify as the Words that maximize P(Words | signal)? Use Bayes' rule:

P(Words | signal) = α P(signal | Words) P(Words)

I.e., the problem decomposes into an acoustic model + a language model

The Words are the hidden state sequence; the signal is the observation sequence

We use Hidden Markov Models to model this
Speech sounds
Raw signal is the microphone displacement as a function of time; processed into overlapping 30 ms frames, each described by features

[Figure: an analog acoustic signal, its sampled and quantized digital version, and the resulting frames with feature vectors, e.g. [10 15 38], [52 47 82], [22 63 24], [89 94 11], [10 12 73].]

Frame features are typically formants (peaks in the power spectrum)
Phone models
Frame features in P(features | phone) are summarized by
  an integer in [0 ... 255] (using vector quantization); or
  the parameters of a mixture of Gaussians

Three-state phones: each phone has three phases (Onset, Mid, End)
E.g., [t] has a silent Onset, explosive Mid, hissing End
⇒ P(features | phone, phase)

Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right
E.g., [t] in "star" is written [t(s,aa)] (different from "tar"!)

Triphones are useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
E.g., [t] in "eighth" has the tongue against the front teeth
Phone model example
Phone HMM for [m]:
[Figure: three-state phone HMM with states Onset → Mid → End → FINAL; the arcs carry the probabilities 0.3, 0.7, 0.9, 0.1, 0.4 and 0.6 for the self-loops and exits.]

Output probabilities for the phone HMM:

Onset:     Mid:       End:
C1: 0.5    C3: 0.2    C4: 0.1
C2: 0.2    C4: 0.7    C6: 0.5
C3: 0.3    C5: 0.1    C7: 0.4
Word pronunciation models
Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model

[Figure: pronunciation HMM for "tomato": [t] branches to [ow] (0.2) or [ah] (0.8), then [m] branches to [ey] (0.5) or [aa] (0.5), followed by [t] [ow]; the remaining transitions have probability 1.0.]

P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.1
P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.4

Structure is created manually, transition probabilities learned from data
Isolated words
Phone models + word models fix the likelihood P(e_1:t | word) for an isolated word:

P(word | e_1:t) = α P(e_1:t | word) P(word)

Prior probability P(word) by counting word frequencies

P(e_1:t | word) can be computed recursively: define

ℓ_1:t = P(X_t, e_1:t)

and use the recursive update

ℓ_1:t+1 = Forward(ℓ_1:t, e_t+1)

and then P(e_1:t | word) = Σ_xt ℓ_1:t(x_t)

Isolated-word dictation systems with training reach 95% – 99% accuracy
Continuous speech
Not just a sequence of isolated-word recognition problems!

Adjacent words are highly correlated

The sequence of most likely words is not equal to the most likely sequence of words

Segmentation: there are few gaps in speech

Cross-word coarticulation, e.g., "next thing" ≈ "nexing" in daily speech

Continuous speech recognition is hard; currently the best systems manage 60% – 80% accuracy
Language model
The prior probability of a word sequence is given by the chain rule:

P(w_1 ⋯ w_n) = ∏_{i=1}^{n} P(w_i | w_1 ⋯ w_{i−1})

Simplify using a bigram model:

P(w_i | w_1 ⋯ w_{i−1}) ≈ P(w_i | w_{i−1})

Train by counting all word pairs in a large text corpus

More sophisticated models (trigrams, grammars, etc.) help, but only a little bit
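A tiny bigram model trained by counting, as described above; the corpus and the add-alpha smoothing constant are my own toy choices.

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev, alpha=0.1):
    """P(w | prev) with add-alpha smoothing to avoid zero probabilities."""
    V = len(unigrams)
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * V)

print(p_bigram("cat", "the"))   # "the cat" follows 2 of the 3 "the"s: ~0.58 after smoothing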
Summary – Speech
Since the mid-1970s, speech recognition has been formulated as probabilistic inference

Evidence = speech signal; hidden variables = word and phone sequences

"Context" effects (coarticulation etc.) are handled by augmenting the state

Variability in human speech (speed, timbre, etc.) and background noise make continuous speech recognition in real settings an open problem