+ Overview of Speech Models
September 5, 2017
Professor Marie Meteer, Brandeis University
+ Timeline: speech recognition companies and research groups, 1970s–2010
- SRI: Decipher (1995); continues as research
- MIT: Voyager (1989), Jupiter; continues as research
- AT&T: speech work spun out to Interactions
- Cambridge University: HTK
- CMU: Hearsay & Harpy; Sphinx (1990); Sphinx goes open source
- BBN: HWIM; Byblos (continues as research); Hark (1993) → AVOKE; Hark not sold
- Kurzweil OCR → Xerox; Visioneer (1992) → ScanSoft
- Dragon (1982) → L&H
- Kurzweil AI (1982) → L&H; L&H bankrupt, assets → ScanSoft
- eScription → Nuance
- AlTech (1994), renamed SpeechWorks (1998), bought by ScanSoft
- Nuance (1994): goes public; “merger” with ScanSoft
- Philips (1994): SR for medical dictation → Nuance
- Voice Control Systems (1996) → Philips
- Voice Signal → Nuance
- IBM speech research: IP → Nuance (2009)
- Loquendo
- Microsoft, Google: in-house speech efforts
- Yap → Amazon
- JHU: Kaldi
Notes on the map:
- Nuance takeovers begin; industry consolidation
- Early research groups still researching
- The big boys come in: Microsoft, Google, Amazon
- BBN product in UFA
- Nuance powers Siri
- Kaldi, a new research recognizer, goes open source
- Caveat: map is not to scale ;-)
6/6/16 © MM Consulting 2016
+ Today
- The speech problem
- Units of speech
- Hidden Markov Models
- Phonetic HMMs
- Recognition architecture
  - Training
  - Decoding
+ Friday: next level of detail
- Where do the features come from?
- How are the transition and observation probabilities learned?
- What is the grammar/language model?
- How are the models applied in recognition (decoding), and how does the Viterbi algorithm help?
- What were the core improvements circa 1995?
- Topics that have been overtaken by time:
  - Effects of training and grammar; speaker-dependent vs. speaker-independent; real-time speech recognition
- Topics we’ll return to:
  - Adaptation and out-of-vocabulary (OOV) words
+ 1995: First significant breakthrough in speech recognition
- Hidden Markov Models
  - Mathematical framework
  - Ability to model time and spectral variability simultaneously
  - Ability to automatically estimate parameters given data
  - No longer need to hand-segment into phonemes
  - Segmentation and modeling done in one step
  - Data-driven → standard scientific procedures
  - Empirical!
+ The speech problem
- Continuous-time signal →
- Sequence of discrete entities

ih t s iy z iy t u r eh k o g n iy z s p iy ch
It’s easy to recognize speech

- Or did she say:
  - It’s easy to wreck a nice beach?
+ Challenge: variability
- Linguistic
  - Can say many different things
  - Phonetics, phonology, syntax, semantics, discourse
- Speaker
  - Physical characteristics of the speaker
  - Co-articulation (the mouth has to transition between sounds)
  - Native language/dialect
- Channel
  - Background noise
  - Transmission channel (microphone/telephone quality)
+ The Problem of Segmentation... or... Why Speech Recognition Is So Difficult

m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r

MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR

(user:Roberto (attribute:telephone-num value:7360474))
+ A look at the speech sounds
- Grey whales

Words:     Grey Whales
Phonemes:  g r ey w ey l z
Triphones: (-,g,r) (g,r,ey) (r,ey,w) (ey,w,ey) (w,ey,l) (ey,l,z) (l,z,-)
+ Contextual variability
- “ey” in grey vs. “ey” in whales
- “g” word-initially, “g” in “big gray”, “g” in “pink gray” (look at these live in WaveSurfer)
+ Back to HMMs
- Why a Markov model?
  - A Markov chain models sequences
    → Speech is sequential, so it can be modeled as a sequence of states specified by the dictionary
  - Transitions in the chain are probabilistic
    → Probabilities model uncertainty well
- Why a Hidden Markov model?
  - Output symbols are a probability distribution over all labels
  - The actual sequence of states for a particular output is “hidden”
  - There is one state sequence that is the most probable to have generated the output symbols
+ Hidden Markov Models formally
- States Q = q1, q2, …, qN
- Observations O = o1, o2, …, oT
  - Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities
  - Transition probability matrix A = {aij}, where
    aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
- Observation likelihoods
  - Output probability matrix B = {bi(k)}, where
    bi(k) = P(ot = vk | qt = i)
- Special initial probability vector π, where
    πi = P(q1 = i), 1 ≤ i ≤ N

9/5/17 Speech and Language Processing - Jurafsky and Martin
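These definitions can be made concrete with a small sketch. The numbers below are invented for a 2-state, 3-symbol HMM (they are not from the slides); the forward recursion then computes the probability of an observation sequence from π, A, and B:

```python
import numpy as np

# Hypothetical 2-state HMM over a 3-symbol vocabulary, in the notation
# of this slide: pi, A = {a_ij}, B = {b_i(k)}. All numbers are invented.
pi = np.array([0.8, 0.2])                 # pi_i = P(q1 = i)
A = np.array([[0.6, 0.4],                 # a_ij = P(q_t = j | q_{t-1} = i)
              [0.3, 0.7]])
B = np.array([[0.5, 0.4, 0.1],            # b_i(k) = P(o_t = v_k | q_t = i)
              [0.1, 0.3, 0.6]])

# Each distribution must sum to 1.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)

def forward(obs):
    """P(O) for a list of symbol indices, via the forward algorithm."""
    alpha = pi * B[:, obs[0]]             # prob. of starting in each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate, then emit
    return alpha.sum()
```

Because π and every row of A and B sum to one, the forward probabilities of all possible sequences of a given length also sum to one, which is a quick sanity check on the implementation.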
+ Different types of HMM structure
- Bakis = left-to-right (allowing skips)
- Ergodic = fully connected
Thanks to Dan Jurafsky for these slides
+ HMMs for speech
- Dictionary entry: SIX → S IH K S
- A state sequence for every word
- Each phone has 3 subphones
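As a sketch of this expansion (the `_b`/`_m`/`_e` begin/middle/end subphone naming here is made up for illustration; real systems use their own conventions):

```python
# Expand a dictionary entry into its HMM state sequence, one
# 3-subphone model per phone (begin / middle / end), as on the slide.
def word_to_states(word, pronunciation):
    states = []
    for phone in pronunciation:
        for part in ("b", "m", "e"):      # begin, middle, end subphones
            states.append(f"{phone}_{part}")
    return states

# SIX -> S IH K S -> 4 phones x 3 subphones = 12 states
six_states = word_to_states("SIX", ["S", "IH", "K", "S"])
```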
+ HMM for digit recognition task
+ Back to the noisy channel model
- Search through the space of all possible sentences
  - Defined by the HMM
- Pick the one that is most probable given the waveform
  - Based on the transition and output probabilities in the HMM
+ The noisy channel model
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations:
  O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words:
  W = w1, w2, w3, …, wn
+ “Output symbols”: the great hack
- Markov models are “generative” models: find the most likely sequence that generates the output symbols
- “Output symbols” = observations
  - What we normally think of as input
- But speech is continuous. Where are the symbols?
- Early systems hacked this by “quantizing” feature vectors, e.g. giving each one a unique integer value
  - (More on how this hack works later; for now, just believe it)
  - This is what is described in Makhoul and Schwartz
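A minimal sketch of the quantization hack, with an invented 3-entry codebook (systems of that era used, e.g., 256 spectral templates): each continuous feature vector is replaced by the index of its nearest codeword, so a discrete-output HMM can be used.

```python
import numpy as np

# Invented toy codebook of 2-dimensional "spectra"; index = output symbol.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 2.0]])

def quantize(vector):
    # Index of the codeword nearest in Euclidean distance.
    distances = np.linalg.norm(codebook - vector, axis=1)
    return int(np.argmin(distances))

# A continuous frame sequence becomes a discrete symbol sequence.
frames = np.array([[0.1, -0.2], [0.9, 1.2], [0.2, 1.8]])
symbols = [quantize(f) for f in frames]   # -> [0, 1, 2]
```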
+ Noisy channel model
- Probabilistic implication: pick the highest-probability word sequence Ŵ:

  Ŵ = argmax_{W ∈ L} P(W | O)

- We can use Bayes' rule to rewrite this:

  Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
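A toy decode over two candidate sentences shows the argmax in action; all probabilities are invented, and the scoring is done in log space, as real decoders do, to avoid underflow:

```python
import math

# P(O|W): how well each candidate sentence explains the acoustics
# (invented numbers).
acoustic = {
    "recognize speech": 1e-5,
    "wreck a nice beach": 2e-5,
}
# P(W): language-model prior for each sentence (invented numbers).
prior = {
    "recognize speech": 1e-3,
    "wreck a nice beach": 1e-6,
}

# W_hat = argmax_W P(O|W) * P(W), computed as a sum of logs.
best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))
```

Here "recognize speech" wins even though its acoustic score is lower: the much stronger language-model prior outweighs it, which is exactly the point of keeping both terms.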
+ Noisy channel model

  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
                      likelihood  prior
+ Speech architecture meets noisy channel

  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
                      likelihood  prior
+ M&S speech architecture (Colloquium Paper: Makhoul and Schwartz)
FIG. 3. A three-state HMM, with an output probability distribution b1(s), b2(s), b3(s) over the symbols A, B, C, D, E at each state.
one state to another is probabilistic, but the production of the output symbols is deterministic. Now, given a sequence of output symbols that were generated by a Markov chain, one can retrace the corresponding sequence of states completely and unambiguously (provided the output symbol for each state was unique). For example, the sample symbol sequence B A A C B B A C C C A is produced by transitioning into the following sequence of states: 2 1 1 3 2 2 1 3 3 3 1.

Hidden Markov Models. A hidden Markov model (HMM) is the same as a Markov chain, except for one important difference: the output symbols in an HMM are probabilistic. Instead of associating a single output symbol per state, in an HMM all symbols are possible at each state, each with its own probability. Thus, associated with each state is a probability distribution over all the output symbols. Furthermore, the number of output symbols can be arbitrary. The different states may then have different probability distributions defined on the set of output symbols. The probabilities associated with states are known as output probabilities.

Fig. 3 shows an example of a three-state HMM. It has the same transition probabilities as the Markov chain of Fig. 2. What is different is that we associate a probability distribution bi(s) with each state i, defined over the set of output symbols s (in this case we have five output symbols: A, B, C, D, and E). Now, when we transition from one state to another, the output symbol is chosen according to the probability distribution corresponding to that state. Compared to a Markov chain, the output sequences generated by an HMM are what is known as doubly stochastic: not only is the transitioning from one state to another stochastic (probabilistic), but so is the output symbol generated at each state.
FIG. 5. General system for training and recognition: Speech Input → Feature Extraction → Feature Vectors → Recognition Search → Sentence.
Now, given a sequence of symbols generated by a particular HMM, it is not possible to retrace the sequence of states unambiguously. Every sequence of states of the same length as the sequence of symbols is possible, each with a different probability. Given the sample output sequence C D A A B E D B A C C, there is no way to know for sure which sequence of states produced these output symbols. We say that the sequence of states is hidden in that it is hidden from the observer if all one sees is the output sequence, and that is why these models are known as hidden Markov models. Even though it is not possible to determine for sure what sequence of states produced a particular sequence of symbols, one might be interested in the sequence of states that has the highest probability of having generated the given sequence.
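Finding that highest-probability state sequence is exactly what the Viterbi algorithm does. A compact sketch, with invented parameters for a 2-state, 2-symbol HMM:

```python
import numpy as np

# Invented parameters for a 2-state HMM over 2 output symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])

def viterbi(obs):
    """Most probable state sequence for a list of symbol indices."""
    delta = pi * B[:, obs[0]]            # best path prob. ending in each state
    back = []                            # backpointers, one array per step
    for o in obs[1:]:
        scores = delta[:, None] * A      # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    state = int(delta.argmax())          # best final state
    path = [state]
    for ptr in reversed(back):           # trace backpointers to the start
        state = int(ptr[state])
        path.append(state)
    return path[::-1]
```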
Phonetic HMMs. We now explain how HMMs are used to model phonetic speech events. Fig. 4 shows an example of a three-state HMM for a single phoneme. The first stage in the continuous-to-discrete mapping that is required for recognition is performed by the analysis or feature extraction box shown in Fig. 5. Typically, the analysis consists of estimation of the short-term spectrum of the speech signal over a frame (window) of about 20 ms. The spectral computation is then updated about every 10 ms, which corresponds to a frame rate of 100 frames per second. This completes the initial discretization in time. However, the HMM, as depicted in this paper, also requires the definition of a discrete set of "output symbols." So, we need to discretize the spectrum into one of a finite set of spectra. Fig. 4 depicts a set of spectral templates (known as a codebook) that represent the space of possible spectra.
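The frame arithmetic in this paragraph is easy to check in code (the 16 kHz sample rate is an assumption; the paper specifies only the 20 ms window and 10 ms update interval):

```python
# A 20 ms analysis window advanced every 10 ms gives about
# 100 frames per second of speech.
def num_frames(num_samples, sample_rate=16000, window_ms=20, shift_ms=10):
    window = int(sample_rate * window_ms / 1000)   # 320 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)     # 160 samples at 16 kHz
    if num_samples < window:
        return 0                                   # not enough for one frame
    return 1 + (num_samples - window) // shift

# One second of 16 kHz audio -> 99 full frames, i.e. ~100 frames/second.
```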
FIG. 4. Basic structure of a phonetic HMM: a 3-state HMM whose transition probabilities and output probabilities comprise the "model" for one phone; the output probabilities are defined over codebook entries 0-255; the codebook of representative spectra is derived by a clustering process, and the model is learned by the forward-backward algorithm.

Proc. Natl. Acad. Sci. USA 92 (1995)
- Observations
- Language model: P(W)
- Acoustic model: P(O | W)
- Decoding: Ŵ = argmax P(O | W) P(W)
+ Acoustic Modeling Training

Words:     nine                                night
Phonemes:  N AY N                              N AY T
Triphones: (<s>,N,AY) (N,AY,N) (AY,N,<s>)      (<sil>,N,AY) (N,AY,T) (AY,T,<s>)

Training learns the relationship between the observed feature vectors and the triphones.

2/10/15 © MM Consulting 2015
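The triphone expansion above can be sketched as follows (the `left-center+right` notation and the `<s>` boundary marker are one common convention, not necessarily the one used on the slide):

```python
# Expand a phoneme string into its triphone sequence, one triphone per
# phone, each conditioned on its left and right neighbors; "<s>" marks
# the word boundary.
def triphones(phones):
    padded = ["<s>"] + list(phones) + ["<s>"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

nine = triphones(["N", "AY", "N"])    # ['<s>-N+AY', 'N-AY+N', 'AY-N+<s>']
night = triphones(["N", "AY", "T"])   # ['<s>-N+AY', 'N-AY+T', 'AY-T+<s>']
```

Note that "nine" and "night" share their first triphone but differ in the rest, which is exactly the context-sensitivity the triphone models are meant to capture.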
+ Really learning the observation and transition probabilities of the HMM

“Nine”:  n0 n1 n2 → ay0 ay1 ay2 → n0 n1 n2
“Night”: n0 n1 n2 → ay0 ay1 ay2 → t0 t1 t2
+ Embedded Training
9/5/17 CS 224S Winter 2007
+ Speech Recognition Architecture

  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
+ Front End
- Observations
- State Sequence