Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Jan 24, 2016
Lecture 7. Dynamic Bayesian Networks: Trees
• Bayesian Network = Factored probability mass function
• Definitions: Node, Edge, Graph, Directed, Acyclic, Tree, and Graph Semantics
• Belief Propagation; Sum-Product Algorithm
• Example: hidden Markov model
• Dynamic Bayesian Networks = BN with dynamics
• Viterbi approximate inference
• Continuous-valued observations
• Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping
Example of a Bayesian Network
• Does knowing a person’s HEIGHT tell you anything useful about the length of his or her HAIR?
• H = height
• L = hair length
• Are H and L independent, i.e., is p(H,L) = p(H)p(L)?
Solution to the Example
• H and L are approximately conditionally independent, given a person's gender
  – p(H,L,G) = p(H|G) p(L|G) p(G)
  – G = "male" or "female"
• If you don't know the gender, then H and L are not independent, because knowing the height can help you to guess the person's gender
  – p(H,L) = Σ_G p(H|G) p(L|G) p(G) ≠ p(H) p(L)
[Graph: G → H, G → L]
Components of a Bayesian Network
• Nodes = Random Variables
• Directed Edges (Arrows) = Dependencies among variables
• Parent: a variable's "parents" are the variables upon which it is dependent, e.g., G is the parent of both H and L (H and L are "sister nodes"; H and L are the "daughters" of G).
Components of a Bayesian Network
• Probability distributions: One per variable
  – Number of columns = number of different values the variable can take
  – Number of rows = number of different values the variable's parents can take
H               Tall    Short
P(H|G=Male)     0.7     0.3
P(H|G=Female)   0.3     0.7

G               Male    Female
P(G)            0.5     0.5

L               Long    Short
P(L|G=Male)     0.1     0.9
P(L|G=Female)   0.9     0.1
Inference in a Bayesian Network
• Problem: suppose we know that a person is tall. Estimate the probability that he or she has short hair.
• p(L|H) = p(L,H) / Σ_L p(L,H)
• p(L,H) = Σ_G p(G,H,L) = Σ_G p(L|G) p(G) p(H|G)  (computed numerically in the sketch below)
[Graph: G → H, G → L; H is observed (drawn as a square), G and L are hidden (drawn as circles)]
• Unobserved ("Hidden") Variable: drawn as a circle
• Observed Variable: drawn as a square
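The inference above can be checked with a few lines of code. This is a minimal sketch using the probability tables from the previous slide; it marginalizes out the hidden gender G and then normalizes over L.

```python
# Tables from the slides: p(G), p(H|G), p(L|G)
p_G = {"Male": 0.5, "Female": 0.5}
p_H_given_G = {"Male": {"Tall": 0.7, "Short": 0.3},
               "Female": {"Tall": 0.3, "Short": 0.7}}
p_L_given_G = {"Male": {"Long": 0.1, "Short": 0.9},
               "Female": {"Long": 0.9, "Short": 0.1}}

H_obs = "Tall"

# p(L, H=tall) = sum over G of p(L|G) p(G) p(H=tall|G)
p_LH = {L: sum(p_L_given_G[G][L] * p_G[G] * p_H_given_G[G][H_obs] for G in p_G)
        for L in ("Long", "Short")}

# p(L | H=tall) = p(L, H=tall) / sum over L of p(L, H=tall)
Z = sum(p_LH.values())
p_L_given_H = {L: p_LH[L] / Z for L in p_LH}
print(p_L_given_H)   # {'Long': 0.34, 'Short': 0.66}
```

With these tables, knowing that the person is tall raises the probability of short hair from 0.5 to 0.66.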
A More Interesting Inference Problem
• Does knowing a person's height tell you how much shampoo he or she uses?
• S = amount of shampoo; depends on hair length
• p(L,H,G,S) = p(S|L) p(L|G) p(H|G) p(G)
[Graph: G → H, G → L, L → S]
A More Interesting Inference Problem
• Suppose we observe that H = "tall". Then:
  p(S="a lot", H="tall") = Σ_L p(S="a lot" | L) Σ_G p(L|G) p(G) p(H="tall"|G)
• The Reason for Bayesian Networks: Modularity of the Graph ⇒ Modularity of Computation (sum over G, independent of S; then for each S, sum over L; see the sketch below)
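Here is a minimal sketch of that modular computation. The tables for G, H, and L are from the earlier slides; the table p(S|L) is a made-up assumption, since the lecture gives no numbers for shampoo use.

```python
# Tables from the slides, plus a hypothetical p(S|L)
p_G = {"Male": 0.5, "Female": 0.5}
p_H_given_G = {"Male": {"Tall": 0.7, "Short": 0.3},
               "Female": {"Tall": 0.3, "Short": 0.7}}
p_L_given_G = {"Male": {"Long": 0.1, "Short": 0.9},
               "Female": {"Long": 0.9, "Short": 0.1}}
p_S_given_L = {"Long": {"a lot": 0.8, "a little": 0.2},   # hypothetical values
               "Short": {"a lot": 0.2, "a little": 0.8}}

H_obs = "Tall"

# Step 1: sum over G, independent of S
p_LH = {L: sum(p_L_given_G[G][L] * p_G[G] * p_H_given_G[G][H_obs] for G in p_G)
        for L in ("Long", "Short")}

# Step 2: for each value of S, sum over L
p_SH = {S: sum(p_S_given_L[L][S] * p_LH[L] for L in p_LH)
        for S in ("a lot", "a little")}
print(p_SH["a lot"])   # p(S="a lot", H="tall")
```

The point of the two-step structure is that the inner sum over G is computed once and reused for every value of S, which is exactly the modularity the graph exposes.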
Definition: Graph
• Node: any unique identifier (letter, integer, phoneme, word)
• Node List: a set of nodes, without repetition
– {a,c,b,e,d,f} is a valid Node List
– {a,c,c,b,e,d,f,f } is not
• Edge: a unique identifier, linked to a pair of nodes
• Edge List: a set of edges, without repetition
– { 1:ac, 2:cb, 3:cd, 4:ce, 5:df, 6:ef }
• Graph: A Node List and an Edge List, such that all edges connect nodes selected from the Node List
Directed Graph
• Directed graph: a graph in which each edge is directed, i.e., the order of the node pair is important
• Example:
  – Nodes = {a,b,c,d,e,f}
  – Edges = { 1:ac, 2:bc, 3:cd, 4:ce, 5:df, 6:ef }
• Parent node/Mother node: the node listed first on an edge
  – { a, b, c, c, d, e } are the parents of { c, c, d, e, f, f }
• Daughter node: the node listed second on an edge (see the sketch below)
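A small sketch of these definitions: a directed graph stored as a node list plus an edge list, with parents and daughters read off each edge. The data are the example above; the dictionary layout is just one convenient representation.

```python
nodes = ["a", "b", "c", "d", "e", "f"]
edges = {1: ("a", "c"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("c", "e"), 5: ("d", "f"), 6: ("e", "f")}

# The parent (mother) is the node listed first on an edge; the daughter is second.
parents_of = {n: [] for n in nodes}
daughters_of = {n: [] for n in nodes}
for parent, daughter in edges.values():
    parents_of[daughter].append(parent)
    daughters_of[parent].append(daughter)

print(daughters_of["c"])   # ['d', 'e']
print(parents_of["f"])     # ['d', 'e']
```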
Acyclic Graph
• Acyclic Graph: a graph that contains no cycles, i.e., there is at most one path from any node to any other node
• It is always possible to turn a Cyclic Graph into an Acyclic Graph by deleting edges
[Figure: the example graph with edge 6:ef deleted, leaving it acyclic]
Tree
• A tree is a directed, acyclic graph, with the following extra limitation:
  – No node has more than one parent
  – There is a unique root node
  – There is a unique directed path from the root node to any other node
A Few Trees
[Figure: several example trees]
Descendants and Non-descendants in a Tree
[Figure: a tree with node B marked, showing the descendants of Node B and the non-descendants of Node B]
Graph Semantics
• The "semantics" of a graph is the set of meanings applied to its nodes and edges
• Example: Finite State Diagram
  – Each NODE represents a state that the system can occupy for one period of time
  – Each EDGE represents a possible state transition
• Example: Markov State Diagram
  – Same as a regular Finite State Diagram, but
  – There can be more than one edge leaving a node
  – When multiple edges leave a node, each has an associated probability (a possible transition matrix is sketched below)
[Figure: three-state Markov diagram (states 1, 2, 3) with transition probabilities 0.7, 0.3, 0.9, 0.1, and 1.0]
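One possible transition matrix for the three-state diagram is sketched below. Which transition carries which probability is an assumption (a left-to-right topology); the figure itself is not reproduced here.

```python
transition = {
    1: {1: 0.7, 2: 0.3},   # from state 1: stay with 0.7, move to state 2 with 0.3
    2: {2: 0.9, 3: 0.1},   # from state 2: stay with 0.9, move to state 3 with 0.1
    3: {3: 1.0},           # state 3 is absorbing
}
# The edges leaving each node must sum to 1, as required of a Markov state diagram.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in transition.values())
```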
Bayesian Network
• A "Bayesian Network" is a particular SEMANTICS for any directed graph
  – Each NODE is a random variable
  – The probability distribution for each node depends only on its parents
• Example:
  – p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Factored Probabilities
• Computations using a graph are simplified because the probability mass function (PMF) is factored
• Factorization of the PMF is shown by the graph
  – p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Belief Propagation
• Belief propagation =
  – Given knowledge about some of the variables in the graph,
  – Compute the posterior probabilities of all other variables
• Because the PMF is factored, belief propagation is entirely local
• For example, given p(f|c), we can compute p(a,f) locally:
  p(a,f) = Σ_c Σ_b p(f|c) p(c|a,b) p(b) p(a)
Belief Propagation: Easier in Trees
• In general, belief propagation in an arbitrary graph is difficult (wait until next lecture), but belief propagation in a tree can use an efficient algorithm called the "sum-product algorithm."
The Sum-Product Algorithm
  – Dv = The set of all observed descendants of node v
  – Nv = The set of all observed nondescendants of node v
1. Propagate Up: for each variable v,
   1. For each daughter di, compute the sum:
      p(Ddi | v) = Σ_di p(di | v) p(Ddi | di)
   2. Combine different daughters d1,…,dN using the product:
      p(Dv | v) = Π_i p(Ddi | v)
2. Propagate Down: for each variable v, whose mother is m:
   1. Combine the other daughters of m, s1,…,sM, using the product:
      p(m, Nv) = p(m, Nm) Π_{i: si≠v} p(Dsi | m)
   2. Compute the sum:
      p(v, Nv) = Σ_m p(v | m) p(m, Nv)
3. Multiply: p(v | observations) = p(v, Nv) p(Dv | v) / Σ_v p(v, Nv) p(Dv | v)
Example #1: Six-Node Network
• Propagating Up:
  – Nodes g, f, b, and e have no daughters, so we define
    • p(Dg | g) = 1
    • p(Df | f) = 1
    • p(De | e) = 1
    • p(Db | b) = 1
[Figure: tree a → c; c → b, d, e; d → f, g; observed nodes b, f, g shown shaded]
Example #1: Six-Node Network
• Propagating Up:
  – Node d: Sums
    • First daughter: g is observed, so p(Dg | d) = Σ_g p(g | d) p(Dg | g) = p(g | d)
    • Second daughter: f is observed, so p(Df | d) = p(f | d)
  – Node d: Product: p(Dd | d) = p(g,f|d) = p(g|d) p(f|d)
Example #1: Six-Node Network
• Propagating Up:
  – Node c: Sums
    • First daughter: p(Db | c) = Σ_b p(Db | b) p(b | c) = p(b | c)
    • Second daughter: p(Dd | c) = Σ_d p(d|c) p(Dd | d)
    • Third daughter: p(De | c) = Σ_e p(De | e) p(e | c) = Σ_e p(e|c) = 1
  – Node c: Product: p(Dc | c) = p(Db|c) p(Dd|c) p(De|c) = p(b|c) p(f,g|c) = p(b,f,g|c)
Example #1: Six-Node Network
• Propagating Up:
  – Node a: Sums
    • One Daughter: p(Dc | a) = p(b,f,g|a) = Σ_c p(c|a) p(Dc|c)
  – Node a: Product: p(Da | a) = p(Dc | a)
Example #1
• Propagating Down:
  – p(a, Na) = p(a)
Example #1
• Propagating Down:
  – Node c: Product
    • None: p(a, Nc) = p(a, Na) because c has no sisters
  – Node c: Sum (Marginalize out the mother)
    • p(c, Nc) = p(c) = Σ_a p(c|a) p(a, Nc)
Example #1
• Propagating Down:
  – Node d: Product
    • p(c, Nd) = p(c, Nc) p(Db|c) p(De|c) = p(c) p(b|c)
  – Node d: Sum
    • p(d, Nd) = p(d, b) = Σ_c p(d|c) p(c, b)
Example #1
• Propagating Down:
  – Node e: Product
    • p(c, Ne) = p(c, Nc) p(Db|c) p(Dd|c)
  – Node e: Sum
    • p(e, Ne) = Σ_c p(e|c) p(c, Ne)
Example #1
• Multiply:
  p(a|b,f,g) = p(a,Na) p(Da|a) / Σ_a p(a,Na) p(Da|a)
  p(c|b,f,g) = p(c,Nc) p(Dc|c) / Σ_c p(c,Nc) p(Dc|c)
  p(d|b,f,g) = p(d,Nd) p(Dd|d) / Σ_d p(d,Nd) p(Dd|d)
  p(e|b,f,g) = p(e,Ne) p(De|e) / Σ_e p(e,Ne) p(De|e)
• A complete numerical pass through this example is sketched below.
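The following is a minimal numerical sketch of Example #1. The tree structure (a → c; c → b, d, e; d → f, g) and the message-passing steps follow the slides above, but all conditional probability tables are made up, the variables are taken as binary, and the observed values b=0, f=1, g=1 are arbitrary.

```python
import numpy as np

p_a   = np.array([0.6, 0.4])                     # p(a), hypothetical
p_c_a = np.array([[0.7, 0.3], [0.2, 0.8]])       # p(c|a), rows indexed by a
p_b_c = np.array([[0.9, 0.1], [0.4, 0.6]])       # p(b|c)
p_d_c = np.array([[0.5, 0.5], [0.1, 0.9]])       # p(d|c)
p_e_c = np.array([[0.8, 0.2], [0.3, 0.7]])       # p(e|c)
p_f_d = np.array([[0.6, 0.4], [0.25, 0.75]])     # p(f|d)
p_g_d = np.array([[0.55, 0.45], [0.05, 0.95]])   # p(g|d)

b_obs, f_obs, g_obs = 0, 1, 1                    # arbitrary observed values

# Propagate up
D_d   = p_g_d[:, g_obs] * p_f_d[:, f_obs]        # p(Dd|d) = p(g|d) p(f|d)
D_b_c = p_b_c[:, b_obs]                          # p(Db|c) = p(b|c)
D_d_c = p_d_c @ D_d                              # p(Dd|c) = sum_d p(d|c) p(Dd|d)
D_e_c = p_e_c.sum(axis=1)                        # p(De|c) = sum_e p(e|c) = 1
D_c   = D_b_c * D_d_c * D_e_c                    # p(Dc|c)
D_a   = p_c_a @ D_c                              # p(Da|a) = sum_c p(c|a) p(Dc|c)

# Propagate down
N_a = p_a                                        # p(a, Na) = p(a)
N_c = p_c_a.T @ N_a                              # p(c, Nc) = sum_a p(c|a) p(a)
N_d = p_d_c.T @ (N_c * D_b_c * D_e_c)            # p(d, Nd) = sum_c p(d|c) p(c, b)
N_e = p_e_c.T @ (N_c * D_b_c * D_d_c)            # p(e, Ne)

# Multiply and normalize
def posterior(N, D):
    joint = N * D
    return joint / joint.sum()

print("p(a|b,f,g) =", posterior(N_a, D_a))
print("p(c|b,f,g) =", posterior(N_c, D_c))
print("p(d|b,f,g) =", posterior(N_d, D_d))
print("p(e|b,f,g) =", posterior(N_e, np.ones(2)))   # p(De|e) = 1
```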
Example #2: Hidden Markov Model
• qt = hidden state variable at time t, 1 ≤ t ≤ T
  – "N states": 1 ≤ qt ≤ N
• xt = discrete observation at time t, 1 ≤ t ≤ T
  – xt quantized to K different levels or codes: 1 ≤ xt ≤ K
• The "speech recognition problem:"
  – Several different "word models" available
    • "Word model" ≡ Parameter values for p(qt|qt-1) and p(xt|qt)
  – For each word model, compute the likelihood p(x1,…,xT)
  – The "correct" word is the one with max p(x1,…,xT)
[Figure: HMM as a chain q1 → … → qt → qt+1 → … → qT-1 → qT, with each state qt having its observation xt as a daughter]
Example #2: Hidden Markov Model
• Propagate Up (the "backward algorithm"):
  – p(DqT | qT) = p(xT|qT)
  – …
  – p(Dqt | qt)
    • Sum: p(Dqt+1 | qt) = Σ_qt+1 p(qt+1 | qt) p(Dqt+1 | qt+1)
    • Product: p(Dqt | qt) = p(xt | qt) p(Dqt+1 | qt)
  – …
Example #2: Hidden Markov Model
• Propagate Down (the "forward algorithm"):
  – p(q1, Nq1) = p(q1)
  – …
  – p(qt+1, Nqt+1)
    • Product: p(qt, Nqt+1) = p(xt | qt) p(qt, Nqt)
    • Sum: p(qt+1, Nqt+1) = Σ_qt p(qt+1|qt) p(qt, Nqt+1)
  – …
Example #2: Hidden Markov Model
• Multiply:
  – p(qT|x1,…,xT) = p(qT,NqT) p(DqT | qT) / Σ_qT p(qT,NqT) p(DqT | qT)
• The speech recognition problem:
  – p(x1,…,xT) = Σ_qT p(qT,NqT) p(DqT | qT)
• Small-vocabulary isolated word recognition: the operations above are repeated for every word model. Choose the word model with maximum p(x1,…,xT). (A sketch of the forward-backward computations follows below.)
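Below is a minimal sketch of the forward ("propagate down") and backward ("propagate up") passes for a discrete HMM, using the slide conventions p(qt, Nqt) and p(Dqt | qt). The transition matrix A, emission matrix B, initial distribution pi, and observation sequence are all hypothetical.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],     # A[i, j] = p(q_{t+1}=j | q_t=i), hypothetical
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
B  = np.array([[0.6, 0.3, 0.1],     # B[i, k] = p(x_t=k | q_t=i), hypothetical
               [0.2, 0.5, 0.3],
               [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])      # p(q_1)
x  = [0, 1, 1, 2, 2]                # hypothetical observed codes x_1 ... x_T

T, N = len(x), len(pi)

# Backward pass: beta[t, i] = p(Dq_t | q_t=i); the slide convention includes p(x_t|q_t)
beta = np.zeros((T, N))
beta[T-1] = B[:, x[T-1]]
for t in range(T-2, -1, -1):
    beta[t] = B[:, x[t]] * (A @ beta[t+1])

# Forward pass: alpha[t, i] = p(q_t=i, Nq_t) = p(x_1..x_{t-1}, q_t=i)
alpha = np.zeros((T, N))
alpha[0] = pi
for t in range(T-1):
    alpha[t+1] = A.T @ (alpha[t] * B[:, x[t]])

likelihood = np.sum(alpha[T-1] * beta[T-1])          # p(x_1, ..., x_T)
posterior_qT = alpha[T-1] * beta[T-1] / likelihood   # p(qT | x_1, ..., x_T)
print("p(x1..xT)      =", likelihood)
print("p(qT|x1..xT)   =", posterior_qT)
```

For isolated word recognition, this likelihood would be recomputed with each word model's A and B, and the word with the largest p(x1,…,xT) would be chosen.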
Three More Useful Concepts
• Dynamic Bayesian Network
  – Equal to a Bayesian Network with a periodically repeating central portion
• Max-Product Algorithm
  – An approximate form of inference, used to find the hidden variable sequence q1,…,qM that maximizes p(q1,…,qM,x1,…,xN)
• Continuous-Valued Random Variables
  – Continuous-valued observations: PDF replaces PMF; no effect on the sum-product algorithm; works well
  – Continuous-valued hidden variables: integral replaces sum in the sum-product algorithm; often incomputable
Hidden Markov Model is an Example of a Dynamic Bayesian Network
• A "Dynamic" Bayesian Network is a Bayesian Network with 3 parts:
  – The initial part, corresponding to the first frame
    • q1 and x1
    • PMF: q1 has no parents, parent of x1 is q1
  – The periodic part, which is duplicated T-2 times in order to match a speech signal with T frames
    • qt and xt
    • PMF: The parent of qt is qt-1, the parent of xt is qt
  – The final part, corresponding to the last frame
    • qT and xT
• In an HMM, the final part is the same as the periodic part, but that's not true for all DBNs
The Max-Product Algorithm
• Let the hidden variables in any Bayesian network (dynamic or non-dynamic) be called q1,…,qM
• Let the observed variables be called x1,…,xN
• The Max-Product Algorithm finds
  (q1*,…,qM*) = argmax p(q1,…,qM,x1,…,xN)
• Algorithm detail: exactly like the sum-product algorithm, except that every sum is replaced by a maximum
• Common use: continuous speech recognition
  – 1 ≤ qt ≤ N, where N = Nw*Lw, Nw = number of words, Lw = length of each word
  – Word wi (0 ≤ i ≤ Nw-1) is the sequence of states iLw+1 ≤ qt ≤ (i+1)Lw
  – The max-product algorithm computes the optimum word sequence (w1*,…,wK*)
The Max-Product Algorithm
  – Dv* = The set of all OPTIMUM descendants of node v
    – Observed descendants: set to their observed value
    – Hidden/unobserved descendants: set to their optimum values, d*
  – Nv* = The set of all OPTIMUM nondescendants of node v
  – (Nv-m)* = An optimized set including all nondescendants except variable m
1. Propagate Up: for each hidden variable v,
   1. For each daughter di, compute the max:
      p(di*, Ddi* | v) = max_di p(di | v) p(Ddi* | di)
   2. Combine different daughters d1,…,dN using the product:
      p(Dv* | v) = Π_i p(di*, Ddi* | v)
2. Propagate Down: for each variable v, whose mother is m:
   1. Combine the other daughters of m, s1,…,sM, using the product:
      p(m, (Nv-m)*) = p(m, Nm*) Π_{i: si≠v} p(si*, Dsi* | m)
   2. Compute the max (choose optimum value of m*):
      p(v, Nv*) = max_m p(v | m) p(m, (Nv-m)*)
Max-Product Termination
• The maximum alignment probability is
  max p(q1,…,qM,x1,…,xN) = max_v p(v,Nv*) p(Dv* | v)
• The optimum hidden variable sequence is the set of all optimum hidden nondescendants of v (stored in the pointer Nv*), plus the set of all optimum hidden descendants of v (stored in the pointer Dv*), plus the optimum value of v itself:
  (q1*,…,qM*) = argmax p(v,Nv*) p(Dv* | v)
• For an HMM, the max-product recursion is the Viterbi algorithm; a sketch follows below.
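This is a minimal Viterbi-style sketch of the max-product algorithm for the discrete HMM of the earlier sketch; it is written in the conventional forward form (with back-pointers) rather than the up/down message form of the slides, and A, B, pi, and x are hypothetical.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],   # A[i, j] = p(q_{t+1}=j | q_t=i), hypothetical
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
B  = np.array([[0.6, 0.3, 0.1],   # B[i, k] = p(x_t=k | q_t=i), hypothetical
               [0.2, 0.5, 0.3],
               [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])
x  = [0, 1, 1, 2, 2]
T, N = len(x), len(pi)

# delta[t, j] = max over q_1..q_{t-1} of p(q_1..q_{t-1}, q_t=j, x_1..x_t)
delta = np.zeros((T, N))
psi   = np.zeros((T, N), dtype=int)          # back-pointers
delta[0] = pi * B[:, x[0]]
for t in range(1, T):
    scores   = delta[t-1][:, None] * A       # scores[i, j] = delta[t-1, i] * A[i, j]
    psi[t]   = scores.argmax(axis=0)         # best predecessor for each state j
    delta[t] = scores.max(axis=0) * B[:, x[t]]

# Termination and back-trace: the optimum state sequence q_1*, ..., q_T*
q_star = [int(delta[T-1].argmax())]
for t in range(T-1, 0, -1):
    q_star.insert(0, int(psi[t, q_star[0]]))
print("max alignment probability =", delta[T-1].max())
print("optimum state sequence    =", q_star)
```

Every sum of the forward algorithm has been replaced by a maximum, and the argmax indices are stored so the optimum sequence can be read back at the end.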
Continuous-Valued Variables
• Either the observed variables or the hidden variables can be continuous-valued.
• A continuous-valued variable has a probability density function (PDF) instead of a probability mass function (PMF).
• Observed variables:
  – Discrete: variable xi depends on some hidden variables qj, qk, so its PMF is given by the table p(xi|qj,qk)
  – Continuous: the PDF is written the same way, but now it refers to some function of a continuous-valued variable xi, for example the Gaussian
    p(xi|qj,qk) = exp(-(xi-μjk)²/(2σjk²)) / (σjk√(2π))
  – Computation of the Sum-Product algorithm proceeds without change (see the sketch below)
• Hidden variables:
  – Now, instead of the sum in the sum-product algorithm, it is necessary to use an integral.
  – Most such integrals have no analytic solution, and must be approximated numerically.
  – The only analytically available solution is the Kalman filter, for the case when ALL hidden variables are continuous with Gaussian distribution.
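The sketch below illustrates the point for continuous observations: the discrete emission lookup B[:, xt] is simply replaced by evaluating a Gaussian PDF per state, and the forward recursion (here in its conventional form) is otherwise unchanged. All parameter values are made up.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],       # hypothetical transition matrix
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
mu    = np.array([0.0, 2.0, 4.0])     # hypothetical per-state means
sigma = np.array([1.0, 0.5, 1.5])     # hypothetical per-state standard deviations
x = [0.1, 1.8, 2.2, 3.9, 4.5]         # hypothetical continuous observations

def emission(xt):
    # p(x_t | q_t = i) as a Gaussian PDF, one value per state i
    return np.exp(-(xt - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

alpha = pi * emission(x[0])
for xt in x[1:]:
    alpha = (A.T @ alpha) * emission(xt)
print("p(x1..xT) =", alpha.sum())     # a density value, not a probability mass
```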
Example #3: Audiovisual Speech Recognition (AVSR)
• A viseme is a group of phonemes with identical lip and tongue blade features, e.g., vt=BILABIAL may correspond to qt in {p,b,m}.
• Mapping from phonemes to visemes may be probabilistic, e.g., /o/ may not always look ROUNDED_HIGH, thus p(vt|qt) is a PMF
• xt and yt may be discrete or continuous
[Figure: two-stream DBN. Audio phoneme states q1 … qt … qT form a Markov chain; each qt has an audio spectral observation xt and a viseme state vt as daughters; each viseme state vt has a video observation yt as its daughter]
AVSR: Max-Product Algorithm
• Propagate Up: maximization step, for the three daughters of qt (a one-frame sketch follows below)
  – p(xt | qt)
  – p(qt+1*, Dqt+1* | qt) = max_qt+1 p(Dqt+1* | qt+1) p(qt+1 | qt)
  – p(vt*, Dvt* | qt) = max_vt p(yt | vt) p(vt | qt)
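A minimal one-frame sketch of this propagate-up step for qt: the viseme message max_vt p(yt|vt) p(vt|qt), followed by the product over the three daughters of qt. The phoneme/viseme inventory sizes, all tables, and the message arriving from qt+1 are hypothetical.

```python
import numpy as np

Nq, Nv = 4, 2                              # hypothetical numbers of phoneme and viseme states
p_v_q = np.array([[0.9, 0.1],              # p(v_t | q_t), rows indexed by q_t
                  [0.8, 0.2],
                  [0.1, 0.9],
                  [0.3, 0.7]])
p_y_v = np.array([0.4, 0.05])              # p(y_t | v_t) for the observed video frame y_t
p_x_q = np.array([0.2, 0.5, 0.1, 0.3])     # p(x_t | q_t) for the observed audio frame x_t
msg_from_future = np.array([0.6, 0.3, 0.8, 0.1])   # p(q_{t+1}*, Dq_{t+1}* | q_t), assumed given

# Maximization step over the viseme daughter: for each q_t, pick the best v_t
scores    = p_y_v[None, :] * p_v_q         # scores[q, v] = p(y_t|v) p(v|q)
v_star    = scores.argmax(axis=1)          # optimum viseme for each value of q_t
msg_video = scores.max(axis=1)             # p(v_t*, Dv_t* | q_t)

# Product step: combine the three daughters of q_t
D_qt = p_x_q * msg_from_future * msg_video # p(Dq_t* | q_t)
print("optimum viseme per phoneme state:", v_star)
print("p(Dqt* | qt):", D_qt)
```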
AVSR: Max-Product Algorithm
• Propagate Up: product step, multiply together the three daughters of qt
  – p(Dqt* | qt) = p(vt*, Dvt* | qt) p(qt+1*, Dqt+1* | qt) p(xt | qt)
AVSR: Max-Product Algorithm
• Propagate Down, product step: multiply probabilities of the mother and sisters of qt+1:
– p(qt, (Nqt+1-qt)*) = p(qt, Nqt*) p(vt*, Dvt* | qt) p(xt | qt)
AVSR: Max-Product Algorithm
• Propagate Down, maximization step: choose the optimum value of the mother of qt+1:
  – p(qt+1*, Nqt+1*) = max_qt p(qt, (Nqt+1-qt)*) p(qt+1 | qt)
AVSR: Max-Product Algorithm
• Optimum Alignment Probability:
  max p(q1,…,qT,v1,…,vT,x1,…,xT,y1,…,yT) = max_qT p(qT,NqT*) p(DqT* | qT)
• Optimally Aligned Hidden-Variable Sequence:
  (q1*,…,qT*,v1*,…,vT*) = argmax p(qT,NqT*) p(DqT* | qT)
Summary
• Bayesian Network = Factored probability mass function
• Graphs, Trees, and Graph Semantics
• Belief Propagation = Progressive Marginalization
• Example: hidden Markov model
• Dynamic Bayesian Networks
• The Max-Product Algorithm
• Continuous-Valued Observations
• Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping