Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Jan 24, 2016
Lecture 7. Dynamic Bayesian Networks: Trees
• Bayesian Network = Factored probability mass function
• Definitions: Node, Edge, Graph, Directed, Acyclic, Tree, and Graph Semantics
• Belief Propagation; Sum-Product Algorithm
• Example: hidden Markov model
• Dynamic Bayesian Networks = BN with dynamics
• Viterbi approximate inference
• Continuous-valued observations
• Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping
Example of a Bayesian Network
• Does knowing a person’s HEIGHT tell you anything useful about the length of his or her HAIR?
• H = height
• L = hair length
• Are H and L independent, i.e., is p(H,L) = p(H)p(L)?
Solution to the Example
• H and L are approximately conditionally independent, given a person's gender
  – p(H,L,G) = p(H|G) p(L|G) p(G)
  – G = "male" or "female"
• If you don't know the gender, then H and L are not independent, because knowing the height can help you to guess the person's gender
  – p(H,L) = Σ_G p(H|G) p(L|G) p(G) ≠ p(H) p(L)
[Graph: G → H, G → L]
Components of a Bayesian Network
• Nodes = Random Variables
• Directed Edges (Arrows) = Dependencies among variables
• Parent: a variable's "parents" are the variables upon which it is dependent, e.g., G is the parent of both H and L (H and L are "sister nodes"; H and L are the "daughters" of G).
Components of a Bayesian Network
• Probability distributions: One per variable
  – Number of columns = number of different values the variable can take
  – Number of rows = number of different values the variable's parents can take
H               Tall    Short
P(H|G=Male)     0.7     0.3
P(H|G=Female)   0.3     0.7

G               Male    Female
P(G)            0.5     0.5

L               Long    Short
P(L|G=Male)     0.1     0.9
P(L|G=Female)   0.9     0.1
Inference in a Bayesian Network
• Problem: suppose we know that a person is tall. Estimate the probability that he or she has short hair.
• p(L|H) = p(L,H) / Σ_L p(L,H)
• p(L,H) = Σ_G p(G,H,L) = Σ_G p(L|G) p(G) p(H|G)  (computed numerically in the sketch below)
[Graph: G → H, G → L; H is observed (drawn as a square), G and L are hidden (drawn as circles)]
• Unobserved ("Hidden") Variable: drawn as a circle
• Observed Variable: drawn as a square
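The inference above can be checked with a few lines of code. This is a minimal sketch using the probability tables from the previous slide; it marginalizes out the hidden gender G and then normalizes over L.

```python
# Tables from the slides: p(G), p(H|G), p(L|G)
p_G = {"Male": 0.5, "Female": 0.5}
p_H_given_G = {"Male": {"Tall": 0.7, "Short": 0.3},
               "Female": {"Tall": 0.3, "Short": 0.7}}
p_L_given_G = {"Male": {"Long": 0.1, "Short": 0.9},
               "Female": {"Long": 0.9, "Short": 0.1}}

H_obs = "Tall"

# p(L, H=tall) = sum over G of p(L|G) p(G) p(H=tall|G)
p_LH = {L: sum(p_L_given_G[G][L] * p_G[G] * p_H_given_G[G][H_obs] for G in p_G)
        for L in ("Long", "Short")}

# p(L | H=tall) = p(L, H=tall) / sum over L of p(L, H=tall)
Z = sum(p_LH.values())
p_L_given_H = {L: p_LH[L] / Z for L in p_LH}
print(p_L_given_H)   # {'Long': 0.34, 'Short': 0.66}
```

With these tables, knowing that the person is tall raises the probability of short hair from 0.5 to 0.66.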
A More Interesting Inference Problem
• Does knowing a person's height tell you how much shampoo he or she uses?
• S = amount of shampoo; depends on hair length
• p(L,H,G,S) = p(S|L) p(L|G) p(H|G) p(G)
[Graph: G → H, G → L, L → S]
A More Interesting Inference Problem
• Suppose we observe that H = "tall". Then:
  p(S="a lot", H="tall") = Σ_L p(S="a lot" | L) Σ_G p(L|G) p(G) p(H="tall"|G)
• The Reason for Bayesian Networks: Modularity of the Graph ⇒ Modularity of Computation (sum over G, independent of S; then for each S, sum over L; see the sketch below)
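Here is a minimal sketch of that modular computation. The tables for G, H, and L are from the earlier slides; the table p(S|L) is a made-up assumption, since the lecture gives no numbers for shampoo use.

```python
# Tables from the slides, plus a hypothetical p(S|L)
p_G = {"Male": 0.5, "Female": 0.5}
p_H_given_G = {"Male": {"Tall": 0.7, "Short": 0.3},
               "Female": {"Tall": 0.3, "Short": 0.7}}
p_L_given_G = {"Male": {"Long": 0.1, "Short": 0.9},
               "Female": {"Long": 0.9, "Short": 0.1}}
p_S_given_L = {"Long": {"a lot": 0.8, "a little": 0.2},   # hypothetical values
               "Short": {"a lot": 0.2, "a little": 0.8}}

H_obs = "Tall"

# Step 1: sum over G, independent of S
p_LH = {L: sum(p_L_given_G[G][L] * p_G[G] * p_H_given_G[G][H_obs] for G in p_G)
        for L in ("Long", "Short")}

# Step 2: for each value of S, sum over L
p_SH = {S: sum(p_S_given_L[L][S] * p_LH[L] for L in p_LH)
        for S in ("a lot", "a little")}
print(p_SH["a lot"])   # p(S="a lot", H="tall")
```

The point of the two-step structure is that the inner sum over G is computed once and reused for every value of S, which is exactly the modularity the graph exposes.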
Definition: Graph
• Node: any unique identifier (letter, integer, phoneme, word)
• Node List: a set of nodes, without repetition
– {a,c,b,e,d,f} is a valid Node List
– {a,c,c,b,e,d,f,f } is not
• Edge: a unique identifier, linked to a pair of nodes
• Edge List: a set of edges, without repetition
– { 1:ac, 2:cb, 3:cd, 4:ce, 5:df, 6:ef }
• Graph: A Node List and an Edge List, such that all edges connect nodes selected from the Node List
Directed Graph
• Directed graph: a graph in which each edge is directed, i.e., the order of the node pair is important
• Example:
  – Nodes = {a,b,c,d,e,f}
  – Edges = { 1:ac, 2:bc, 3:cd, 4:ce, 5:df, 6:ef }
• Parent node/Mother node: the node listed first on an edge
  – { a, b, c, c, d, e } are the parents of { c, c, d, e, f, f }
• Daughter node: the node listed second on an edge (see the sketch below)
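A small sketch of these definitions: a directed graph stored as a node list plus an edge list, with parents and daughters read off each edge. The data are the example above; the dictionary layout is just one convenient representation.

```python
nodes = ["a", "b", "c", "d", "e", "f"]
edges = {1: ("a", "c"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("c", "e"), 5: ("d", "f"), 6: ("e", "f")}

# The parent (mother) is the node listed first on an edge; the daughter is second.
parents_of = {n: [] for n in nodes}
daughters_of = {n: [] for n in nodes}
for parent, daughter in edges.values():
    parents_of[daughter].append(parent)
    daughters_of[parent].append(daughter)

print(daughters_of["c"])   # ['d', 'e']
print(parents_of["f"])     # ['d', 'e']
```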
Acyclic Graph
• Acyclic Graph: a graph that contains no cycles, i.e., there is at most one path from any node to any other node
• It is always possible to turn a Cyclic Graph into an Acyclic Graph by deleting edges
[Figure: the example graph with edge 6:ef deleted, leaving it acyclic]
Tree
• A tree is a directed, acyclic graph, with the following extra limitation:
  – No node has more than one parent
  – There is a unique root node
  – There is a unique directed path from the root node to any other node
A Few Trees
[Figure: several example trees]
Descendants and Non-descendants in a Tree
[Figure: a tree with node B marked, showing the descendants of Node B and the non-descendants of Node B]
Graph Semantics
• The "semantics" of a graph is the set of meanings applied to its nodes and edges
• Example: Finite State Diagram
  – Each NODE represents a state that the system can occupy for one period of time
  – Each EDGE represents a possible state transition
• Example: Markov State Diagram
  – Same as a regular Finite State Diagram, but
  – There can be more than one edge leaving a node
  – When multiple edges leave a node, each has an associated probability (a possible transition matrix is sketched below)
[Figure: three-state Markov diagram (states 1, 2, 3) with transition probabilities 0.7, 0.3, 0.9, 0.1, and 1.0]
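One possible transition matrix for the three-state diagram is sketched below. Which transition carries which probability is an assumption (a left-to-right topology); the figure itself is not reproduced here.

```python
transition = {
    1: {1: 0.7, 2: 0.3},   # from state 1: stay with 0.7, move to state 2 with 0.3
    2: {2: 0.9, 3: 0.1},   # from state 2: stay with 0.9, move to state 3 with 0.1
    3: {3: 1.0},           # state 3 is absorbing
}
# The edges leaving each node must sum to 1, as required of a Markov state diagram.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in transition.values())
```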
Bayesian Network
• A "Bayesian Network" is a particular SEMANTICS for any directed graph
  – Each NODE is a random variable
  – The probability distribution for each node depends only on its parents
• Example:
  – p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Factored Probabilities
• Computations using a graph are simplified because the probability mass function (PMF) is factored
• Factorization of the PMF is shown by the graph
  – p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Belief Propagation
• Belief propagation =
  – Given knowledge about some of the variables in the graph,
  – Compute the posterior probabilities of all other variables
• Because the PMF is factored, belief propagation is entirely local
• For example, given p(f|c), we can compute p(a,f) locally:
  p(a,f) = Σ_c Σ_b p(f|c) p(c|a,b) p(b) p(a)
Belief Propagation: Easier in Trees
• In general, belief propagation in an arbitrary graph is difficult (wait until next lecture), but belief propagation in a tree can use an efficient algorithm called the "sum-product algorithm."
The Sum-Product Algorithm
  – Dv = The set of all observed descendants of node v
  – Nv = The set of all observed nondescendants of node v
1. Propagate Up: for each variable v,
   1. For each daughter di, compute the sum:
      p(Ddi | v) = Σ_di p(di | v) p(Ddi | di)
   2. Combine different daughters d1,…,dN using the product:
      p(Dv | v) = Π_i p(Ddi | v)
2. Propagate Down: for each variable v, whose mother is m:
   1. Combine the other daughters of m, s1,…,sM, using the product:
      p(m, Nv) = p(m, Nm) Π_{i: si≠v} p(Dsi | m)
   2. Compute the sum:
      p(v, Nv) = Σ_m p(v | m) p(m, Nv)
3. Multiply: p(v | observations) = p(v, Nv) p(Dv | v) / Σ_v p(v, Nv) p(Dv | v)
Example #1: Six-Node Network
• Propagating Up:
  – Nodes g, f, b, and e have no daughters, so we define
    • p(Dg | g) = 1
    • p(Df | f) = 1
    • p(De | e) = 1
    • p(Db | b) = 1
[Figure: tree a → c; c → b, d, e; d → f, g; observed nodes b, f, g shown shaded]
Example #1: Six-Node Network
• Propagating Up:
  – Node d: Sums
    • First daughter: g is observed, so p(Dg | d) = Σ_g p(g | d) p(Dg | g) = p(g | d)
    • Second daughter: f is observed, so p(Df | d) = p(f | d)
  – Node d: Product: p(Dd | d) = p(g,f|d) = p(g|d) p(f|d)
Example #1: Six-Node Network
• Propagating Up:
  – Node c: Sums
    • First daughter: p(Db | c) = Σ_b p(Db | b) p(b | c) = p(b | c)
    • Second daughter: p(Dd | c) = Σ_d p(d|c) p(Dd | d)
    • Third daughter: p(De | c) = Σ_e p(De | e) p(e | c) = Σ_e p(e|c) = 1
  – Node c: Product: p(Dc | c) = p(Db|c) p(Dd|c) p(De|c) = p(b|c) p(f,g|c) = p(b,f,g|c)
Example #1: Six-Node Network
• Propagating Up:
  – Node a: Sums
    • One Daughter: p(Dc | a) = p(b,f,g|a) = Σ_c p(c|a) p(Dc|c)
  – Node a: Product: p(Da | a) = p(Dc | a)
Example #1
• Propagating Down:
  – p(a, Na) = p(a)
Example #1
• Propagating Down:
  – Node c: Product
    • None: p(a, Nc) = p(a, Na) because c has no sisters
  – Node c: Sum (Marginalize out the mother)
    • p(c, Nc) = p(c) = Σ_a p(c|a) p(a, Nc)
Example #1
• Propagating Down:
  – Node d: Product
    • p(c, Nd) = p(c, Nc) p(Db|c) p(De|c) = p(c) p(b|c)
  – Node d: Sum
    • p(d, Nd) = p(d, b) = Σ_c p(d|c) p(c, b)
Example #1
• Propagating Down:
  – Node e: Product
    • p(c, Ne) = p(c, Nc) p(Db|c) p(Dd|c)
  – Node e: Sum
    • p(e, Ne) = Σ_c p(e|c) p(c, Ne)
Example #1
• Multiply:
  p(a|b,f,g) = p(a,Na) p(Da|a) / Σ_a p(a,Na) p(Da|a)
  p(c|b,f,g) = p(c,Nc) p(Dc|c) / Σ_c p(c,Nc) p(Dc|c)
  p(d|b,f,g) = p(d,Nd) p(Dd|d) / Σ_d p(d,Nd) p(Dd|d)
  p(e|b,f,g) = p(e,Ne) p(De|e) / Σ_e p(e,Ne) p(De|e)
• A complete numerical pass through this example is sketched below.
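The following is a minimal numerical sketch of Example #1. The tree structure (a → c; c → b, d, e; d → f, g) and the message-passing steps follow the slides above, but all conditional probability tables are made up, the variables are taken as binary, and the observed values b=0, f=1, g=1 are arbitrary.

```python
import numpy as np

p_a   = np.array([0.6, 0.4])                     # p(a), hypothetical
p_c_a = np.array([[0.7, 0.3], [0.2, 0.8]])       # p(c|a), rows indexed by a
p_b_c = np.array([[0.9, 0.1], [0.4, 0.6]])       # p(b|c)
p_d_c = np.array([[0.5, 0.5], [0.1, 0.9]])       # p(d|c)
p_e_c = np.array([[0.8, 0.2], [0.3, 0.7]])       # p(e|c)
p_f_d = np.array([[0.6, 0.4], [0.25, 0.75]])     # p(f|d)
p_g_d = np.array([[0.55, 0.45], [0.05, 0.95]])   # p(g|d)

b_obs, f_obs, g_obs = 0, 1, 1                    # arbitrary observed values

# Propagate up
D_d   = p_g_d[:, g_obs] * p_f_d[:, f_obs]        # p(Dd|d) = p(g|d) p(f|d)
D_b_c = p_b_c[:, b_obs]                          # p(Db|c) = p(b|c)
D_d_c = p_d_c @ D_d                              # p(Dd|c) = sum_d p(d|c) p(Dd|d)
D_e_c = p_e_c.sum(axis=1)                        # p(De|c) = sum_e p(e|c) = 1
D_c   = D_b_c * D_d_c * D_e_c                    # p(Dc|c)
D_a   = p_c_a @ D_c                              # p(Da|a) = sum_c p(c|a) p(Dc|c)

# Propagate down
N_a = p_a                                        # p(a, Na) = p(a)
N_c = p_c_a.T @ N_a                              # p(c, Nc) = sum_a p(c|a) p(a)
N_d = p_d_c.T @ (N_c * D_b_c * D_e_c)            # p(d, Nd) = sum_c p(d|c) p(c, b)
N_e = p_e_c.T @ (N_c * D_b_c * D_d_c)            # p(e, Ne)

# Multiply and normalize
def posterior(N, D):
    joint = N * D
    return joint / joint.sum()

print("p(a|b,f,g) =", posterior(N_a, D_a))
print("p(c|b,f,g) =", posterior(N_c, D_c))
print("p(d|b,f,g) =", posterior(N_d, D_d))
print("p(e|b,f,g) =", posterior(N_e, np.ones(2)))   # p(De|e) = 1
```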
Example #2: Hidden Markov Model
• qt = hidden state variable at time t, 1 ≤ t ≤ T
  – "N states": 1 ≤ qt ≤ N
• xt = discrete observation at time t, 1 ≤ t ≤ T
  – xt quantized to K different levels or codes: 1 ≤ xt ≤ K
• The "speech recognition problem:"
  – Several different "word models" available
    • "Word model" ≡ Parameter values for p(qt|qt-1) and p(xt|qt)
  – For each word model, compute the likelihood p(x1,…,xT)
  – The "correct" word is the one with max p(x1,…,xT)
[Figure: HMM as a chain q1 → … → qt → qt+1 → … → qT-1 → qT, with each state qt having its observation xt as a daughter]
Example #2: Hidden Markov Model
• Propagate Up (the "backward algorithm"):
  – p(DqT | qT) = p(xT|qT)
  – …
  – p(Dqt | qt)
    • Sum: p(Dqt+1 | qt) = Σ_qt+1 p(qt+1 | qt) p(Dqt+1 | qt+1)
    • Product: p(Dqt | qt) = p(xt | qt) p(Dqt+1 | qt)
  – …
Example #2: Hidden Markov Model
• Propagate Down (the "forward algorithm"):
  – p(q1, Nq1) = p(q1)
  – …
  – p(qt+1, Nqt+1)
    • Product: p(qt, Nqt+1) = p(xt | qt) p(qt, Nqt)
    • Sum: p(qt+1, Nqt+1) = Σ_qt p(qt+1|qt) p(qt, Nqt+1)
  – …
Example #2: Hidden Markov Model
• Multiply:
  – p(qT|x1,…,xT) = p(qT,NqT) p(DqT | qT) / Σ_qT p(qT,NqT) p(DqT | qT)
• The speech recognition problem:
  – p(x1,…,xT) = Σ_qT p(qT,NqT) p(DqT | qT)
• Small-vocabulary isolated word recognition: the operations above are repeated for every word model. Choose the word model with maximum p(x1,…,xT). (A sketch of the forward-backward computations follows below.)
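Below is a minimal sketch of the forward ("propagate down") and backward ("propagate up") passes for a discrete HMM, using the slide conventions p(qt, Nqt) and p(Dqt | qt). The transition matrix A, emission matrix B, initial distribution pi, and observation sequence are all hypothetical.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],     # A[i, j] = p(q_{t+1}=j | q_t=i), hypothetical
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
B  = np.array([[0.6, 0.3, 0.1],     # B[i, k] = p(x_t=k | q_t=i), hypothetical
               [0.2, 0.5, 0.3],
               [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])      # p(q_1)
x  = [0, 1, 1, 2, 2]                # hypothetical observed codes x_1 ... x_T

T, N = len(x), len(pi)

# Backward pass: beta[t, i] = p(Dq_t | q_t=i); the slide convention includes p(x_t|q_t)
beta = np.zeros((T, N))
beta[T-1] = B[:, x[T-1]]
for t in range(T-2, -1, -1):
    beta[t] = B[:, x[t]] * (A @ beta[t+1])

# Forward pass: alpha[t, i] = p(q_t=i, Nq_t) = p(x_1..x_{t-1}, q_t=i)
alpha = np.zeros((T, N))
alpha[0] = pi
for t in range(T-1):
    alpha[t+1] = A.T @ (alpha[t] * B[:, x[t]])

likelihood = np.sum(alpha[T-1] * beta[T-1])          # p(x_1, ..., x_T)
posterior_qT = alpha[T-1] * beta[T-1] / likelihood   # p(qT | x_1, ..., x_T)
print("p(x1..xT)      =", likelihood)
print("p(qT|x1..xT)   =", posterior_qT)
```

For isolated word recognition, this likelihood would be recomputed with each word model's A and B, and the word with the largest p(x1,…,xT) would be chosen.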
Three More Useful Concepts
• Dynamic Bayesian Network
  – Equal to a Bayesian Network with a periodically repeating central portion
• Max-Product Algorithm
  – An approximate form of inference, used to find the hidden variable sequence q1,…,qM that maximizes p(q1,…,qM,x1,…,xN)
• Continuous-Valued Random Variables
  – Continuous-valued observations: PDF replaces PMF; no effect on the sum-product algorithm; works well
  – Continuous-valued hidden variables: integral replaces sum in the sum-product algorithm; often incomputable
Hidden Markov Model is an Example of a Dynamic Bayesian Network
• A "Dynamic" Bayesian Network is a Bayesian Network with 3 parts:
  – The initial part, corresponding to the first frame
    • q1 and x1
    • PMF: q1 has no parents, parent of x1 is q1
  – The periodic part, which is duplicated T-2 times in order to match a speech signal with T frames
    • qt and xt
    • PMF: The parent of qt is qt-1, the parent of xt is qt
  – The final part, corresponding to the last frame
    • qT and xT
• In an HMM, the final part is the same as the periodic part, but that's not true for all DBNs
The Max-Product Algorithm
• Let the hidden variables in any Bayesian network (dynamic or non-dynamic) be called q1,…,qM
• Let the observed variables be called x1,…,xN
• The Max-Product Algorithm finds
  (q1*,…,qM*) = argmax p(q1,…,qM,x1,…,xN)
• Algorithm detail: exactly like the sum-product algorithm, except that every sum is replaced by a maximum
• Common use: continuous speech recognition
  – 1 ≤ qt ≤ N, where N = Nw*Lw, Nw = number of words, Lw = length of each word
  – Word wi (0 ≤ i ≤ Nw-1) is the sequence of states iLw+1 ≤ qt ≤ (i+1)Lw
  – The max-product algorithm computes the optimum word sequence (w1*,…,wK*)
The Max-Product Algorithm
  – Dv* = The set of all OPTIMUM descendants of node v
    – Observed descendants: set to their observed value
    – Hidden/unobserved descendants: set to their optimum values, d*
  – Nv* = The set of all OPTIMUM nondescendants of node v
  – (Nv-m)* = An optimized set including all nondescendants except variable m
1. Propagate Up: for each hidden variable v,
   1. For each daughter di, compute the max:
      p(di*, Ddi* | v) = max_di p(di | v) p(Ddi* | di)
   2. Combine different daughters d1,…,dN using the product:
      p(Dv* | v) = Π_i p(di*, Ddi* | v)
2. Propagate Down: for each variable v, whose mother is m:
   1. Combine the other daughters of m, s1,…,sM, using the product:
      p(m, (Nv-m)*) = p(m, Nm*) Π_{i: si≠v} p(si*, Dsi* | m)
   2. Compute the max (choose optimum value of m*):
      p(v, Nv*) = max_m p(v | m) p(m, (Nv-m)*)
Max-Product Termination
• The maximum alignment probability is
  max p(q1,…,qM,x1,…,xN) = max_v p(v,Nv*) p(Dv* | v)
• The optimum hidden variable sequence is the set of all optimum hidden nondescendants of v (stored in the pointer Nv*), plus the set of all optimum hidden descendants of v (stored in the pointer Dv*), plus the optimum value of v itself:
  (q1*,…,qM*) = argmax p(v,Nv*) p(Dv* | v)
• For an HMM, the max-product recursion is the Viterbi algorithm; a sketch follows below.
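This is a minimal Viterbi-style sketch of the max-product algorithm for the discrete HMM of the earlier sketch; it is written in the conventional forward form (with back-pointers) rather than the up/down message form of the slides, and A, B, pi, and x are hypothetical.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],   # A[i, j] = p(q_{t+1}=j | q_t=i), hypothetical
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
B  = np.array([[0.6, 0.3, 0.1],   # B[i, k] = p(x_t=k | q_t=i), hypothetical
               [0.2, 0.5, 0.3],
               [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])
x  = [0, 1, 1, 2, 2]
T, N = len(x), len(pi)

# delta[t, j] = max over q_1..q_{t-1} of p(q_1..q_{t-1}, q_t=j, x_1..x_t)
delta = np.zeros((T, N))
psi   = np.zeros((T, N), dtype=int)          # back-pointers
delta[0] = pi * B[:, x[0]]
for t in range(1, T):
    scores   = delta[t-1][:, None] * A       # scores[i, j] = delta[t-1, i] * A[i, j]
    psi[t]   = scores.argmax(axis=0)         # best predecessor for each state j
    delta[t] = scores.max(axis=0) * B[:, x[t]]

# Termination and back-trace: the optimum state sequence q_1*, ..., q_T*
q_star = [int(delta[T-1].argmax())]
for t in range(T-1, 0, -1):
    q_star.insert(0, int(psi[t, q_star[0]]))
print("max alignment probability =", delta[T-1].max())
print("optimum state sequence    =", q_star)
```

Every sum of the forward algorithm has been replaced by a maximum, and the argmax indices are stored so the optimum sequence can be read back at the end.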
Continuous-Valued Variables
• Either the observed variables or the hidden variables can be continuous-valued.
• A continuous-valued variable has a probability density function (PDF) instead of a probability mass function (PMF).
• Observed variables:
  – Discrete: variable xi depends on some hidden variables qj, qk, so its PMF is given by the table p(xi|qj,qk)
  – Continuous: the PDF is written the same way, but now it refers to some function of a continuous-valued variable xi, for example the Gaussian
    p(xi|qj,qk) = exp(-(xi-μjk)²/(2σjk²)) / (σjk√(2π))
  – Computation of the Sum-Product algorithm proceeds without change (see the sketch below)
• Hidden variables:
  – Now, instead of the sum in the sum-product algorithm, it is necessary to use an integral.
  – Most such integrals have no analytic solution, and must be approximated numerically.
  – The only analytically available solution is the Kalman filter, for the case when ALL hidden variables are continuous with Gaussian distribution.
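The sketch below illustrates the point for continuous observations: the discrete emission lookup B[:, xt] is simply replaced by evaluating a Gaussian PDF per state, and the forward recursion (here in its conventional form) is otherwise unchanged. All parameter values are made up.

```python
import numpy as np

A  = np.array([[0.7, 0.3, 0.0],       # hypothetical transition matrix
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
mu    = np.array([0.0, 2.0, 4.0])     # hypothetical per-state means
sigma = np.array([1.0, 0.5, 1.5])     # hypothetical per-state standard deviations
x = [0.1, 1.8, 2.2, 3.9, 4.5]         # hypothetical continuous observations

def emission(xt):
    # p(x_t | q_t = i) as a Gaussian PDF, one value per state i
    return np.exp(-(xt - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

alpha = pi * emission(x[0])
for xt in x[1:]:
    alpha = (A.T @ alpha) * emission(xt)
print("p(x1..xT) =", alpha.sum())     # a density value, not a probability mass
```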
Example #3: Audiovisual Speech Recognition (AVSR)
• A viseme is a group of phonemes with identical lip and tongue blade features, e.g., vt=BILABIAL may correspond to qt in {p,b,m}.
• Mapping from phonemes to visemes may be probabilistic, e.g., /o/ may not always look ROUNDED_HIGH, thus p(vt|qt) is a PMF
• xt and yt may be discrete or continuous
[Figure: two-stream DBN. Audio phoneme states q1 … qt … qT form a Markov chain; each qt has an audio spectral observation xt and a viseme state vt as daughters; each viseme state vt has a video observation yt as its daughter]
AVSR: Max-Product Algorithm
• Propagate Up: maximization step, for the three daughters of qt (a one-frame sketch follows below)
  – p(xt | qt)
  – p(qt+1*, Dqt+1* | qt) = max_qt+1 p(Dqt+1* | qt+1) p(qt+1 | qt)
  – p(vt*, Dvt* | qt) = max_vt p(yt | vt) p(vt | qt)
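A minimal one-frame sketch of this propagate-up step for qt: the viseme message max_vt p(yt|vt) p(vt|qt), followed by the product over the three daughters of qt. The phoneme/viseme inventory sizes, all tables, and the message arriving from qt+1 are hypothetical.

```python
import numpy as np

Nq, Nv = 4, 2                              # hypothetical numbers of phoneme and viseme states
p_v_q = np.array([[0.9, 0.1],              # p(v_t | q_t), rows indexed by q_t
                  [0.8, 0.2],
                  [0.1, 0.9],
                  [0.3, 0.7]])
p_y_v = np.array([0.4, 0.05])              # p(y_t | v_t) for the observed video frame y_t
p_x_q = np.array([0.2, 0.5, 0.1, 0.3])     # p(x_t | q_t) for the observed audio frame x_t
msg_from_future = np.array([0.6, 0.3, 0.8, 0.1])   # p(q_{t+1}*, Dq_{t+1}* | q_t), assumed given

# Maximization step over the viseme daughter: for each q_t, pick the best v_t
scores    = p_y_v[None, :] * p_v_q         # scores[q, v] = p(y_t|v) p(v|q)
v_star    = scores.argmax(axis=1)          # optimum viseme for each value of q_t
msg_video = scores.max(axis=1)             # p(v_t*, Dv_t* | q_t)

# Product step: combine the three daughters of q_t
D_qt = p_x_q * msg_from_future * msg_video # p(Dq_t* | q_t)
print("optimum viseme per phoneme state:", v_star)
print("p(Dqt* | qt):", D_qt)
```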
AVSR: Max-Product Algorithm
• Propagate Up: product step, multiply together the three daughters of qt
  – p(Dqt* | qt) = p(vt*, Dvt* | qt) p(qt+1*, Dqt+1* | qt) p(xt | qt)
AVSR: Max-Product Algorithm
• Propagate Down, product step: multiply probabilities of the mother and sisters of qt+1:
– p(qt, (Nqt+1-qt)*) = p(qt, Nqt*) p(vt*, Dvt* | qt) p(xt | qt)
AVSR: Max-Product Algorithm
• Propagate Down, maximization step: choose the optimum value of the mother of qt+1:
  – p(qt+1*, Nqt+1*) = max_qt p(qt, (Nqt+1-qt)*) p(qt+1 | qt)
AVSR: Max-Product Algorithm
• Optimum Alignment Probability:
  max p(q1,…,qT,v1,…,vT,x1,…,xT,y1,…,yT) = max_qT p(qT,NqT*) p(DqT* | qT)
• Optimally Aligned Hidden-Variable Sequence:
  (q1*,…,qT*,v1*,…,vT*) = argmax p(qT,NqT*) p(DqT* | qT)
Summary
• Bayesian Network = Factored probability mass function
• Graphs, Trees, and Graph Semantics
• Belief Propagation = Progressive Marginalization
• Example: hidden Markov model
• Dynamic Bayesian Networks
• The Max-Product Algorithm
• Continuous-Valued Observations
• Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping