EE E6820: Speech & Audio Processing & Recognition
Lecture 9: Speech Recognition
Dan Ellis <[email protected]>, Michael Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
April 7, 2009
1 Recognizing speech
2 Feature calculation
3 Sequence recognition
4 Large vocabulary, continuous speech recognition (LVCSR)
E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 1 / 43
Outline
1 Recognizing speech
2 Feature calculation
3 Sequence recognition
4 Large vocabulary, continuous speech recognition (LVCSR)
Recognizing speech
[Spectrogram, 0–3 s, 0–4000 Hz: “So, I thought about that and I think it’s still possible”]
What kind of information might we want from the speech signal?
- words
- phrasing, ‘speech acts’ (prosody)
- mood / emotion
- speaker identity

What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features
Speech recognition as transcription

Transcription = “speech to text”
- find a word string to match the utterance

Gives a neat objective measure: word error rate (WER) %
- can be a sensitive measure of performance
Reference:   THE  CAT  SAT  ON  THE  MAT
Recognized:   –   CAT  SAT  AN  THE  MAT  A
          (Deletion)    (Substitution)    (Insertion)

Three kinds of errors:

WER = (S + D + I) / N
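The WER definition above can be computed by a Levenshtein alignment of the reference and recognized word strings, then counting each error type on the traceback. A minimal sketch; the function name is our own.

```python
# Word error rate via edit-distance alignment, counting
# substitutions (S), deletions (D), and insertions (I).

def word_error_rate(reference, recognized):
    """Return (WER, substitutions, deletions, insertions)."""
    ref, hyp = reference.split(), recognized.split()
    # D[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    D = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        D[i][0] = i
    for j in range(1, len(hyp) + 1):
        D[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          D[i - 1][j] + 1,   # deletion
                          D[i][j - 1] + 1)   # insertion
    # Trace back to count each error type separately
    i, j, S, De, I = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and D[i][j] == D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            De += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return (S + De + I) / len(ref), S, De, I

wer, S, De, I = word_error_rate("THE CAT SAT ON THE MAT",
                                "CAT SAT AN THE MAT A")
print(wer, S, De, I)  # 1 sub + 1 del + 1 ins over 6 words -> WER = 0.5
```

On the slide's example this recovers exactly one deletion (THE), one substitution (ON→AN), and one insertion (A).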
Problems: Within-speaker variability
Timing variation
- word duration varies enormously

[Spectrogram, 0–3 s, 0–4000 Hz, with aligned word labels (“SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE”) and phone labels]

- fast speech ‘reduces’ vowels

Speaking style variation
- careful/casual articulation
- soft/loud speech

Contextual effects
- speech sounds vary with context, role:
  “How do you do?”
Problems: Between-speaker variability
Accent variation
- regional / mother tongue

Voice quality variation
- gender, age, huskiness, nasality

Individual characteristics
- mannerisms, speed, prosody

[Two spectrograms, 0–2.5 s, 0–8000 Hz: the same utterance by speakers mbma0 and fjdm2]
Problems: Environment variability
Background noise
- fans, cars, doors, papers

Reverberation
- ‘boxiness’ in recordings

Microphone/channel
- huge effect on relative spectral gain

[Spectrograms, 0–1.4 s, 0–4000 Hz: close mic vs. tabletop mic]
How to recognize speech?
Cross-correlate templates?
- waveform?
- spectrogram?
- time-warp problems

Match short segments & handle time warp later
- model with slices of ∼10 ms
- pseudo-stationary model of words:

[Spectrogram, 0–0.45 s, 0–4000 Hz, with phone labels sil, g, w, eh, n, sil]

- other sources of variation. . .
Probabilistic formulation
Probability that a segment label is correct
- gives the standard form of speech recognizers

Feature calculation: s[n] → Xm (one feature vector per frame, taken at n = mH)
- transforms the signal into an easily-classified domain

Acoustic classifier: p(qi | X)
- calculates probabilities of each mutually-exclusive state qi

‘Finite state acceptor’ (i.e. HMM):

Q* = argmax_{q0,q1,...,qL} p(q0, q1, . . . , qL | X0, X1, . . . , XL)

- MAP match of an allowable sequence to the probabilities:

[Trellis diagram: states q0 = “ay”, q1, . . . against time frames 0, 1, 2, . . . , scored against features X]
Standard speech recognizer structure
Fundamental equation of speech recognition:

Q* = argmax_Q p(Q | X, Θ)
   = argmax_Q p(X | Q, Θ) p(Q | Θ)

- X = acoustic features
- p(X | Q, Θ) = acoustic model
- p(Q | Θ) = language model
- argmax_Q = search over sequences

Questions:
- what are the best features?
- how do we model them?
- how do we find/match the state sequence?
Outline
1 Recognizing speech
2 Feature calculation
3 Sequence recognition
4 Large vocabulary, continuous speech recognition (LVCSR)
Feature Calculation
Goal: find the representational space most suitable for classification
- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- . . . ?

Pattern recognition: the representation is an upper bound on performance
- maybe we should use the waveform. . .
- or, maybe the representation can do all the work

Feature calculation is intimately bound to the classifier
- pragmatic strengths and weaknesses

Features develop by slow evolution
- current choices more historical than principled
Features (1): Spectrogram
Plain STFT as features, e.g.

Xm[k] = S[mH, k] = Σn s[n + mH] w[n] e^(−j2πkn/N)

Consider examples:

[Spectrograms, 0–2.5 s, 0–8000 Hz, with one feature-vector slice highlighted]

Similarities between corresponding segments
- but still large differences
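The STFT feature calculation above can be sketched directly in numpy; the window choice, length N, and hop H here are our own assumptions.

```python
# Framewise STFT magnitudes as a feature matrix, following
# Xm[k] = sum_n s[n + mH] w[n] e^{-j 2 pi k n / N}.

import numpy as np

def stft_features(s, N=512, H=256):
    """Return |STFT| as an (n_frames, N//2 + 1) array of spectral slices."""
    w = np.hanning(N)                      # analysis window w[n]
    n_frames = 1 + (len(s) - N) // H
    X = np.empty((n_frames, N // 2 + 1))
    for m in range(n_frames):
        frame = s[m * H : m * H + N] * w   # s[n + mH] w[n]
        X[m] = np.abs(np.fft.rfft(frame))  # DFT magnitudes, k = 0 .. N/2
    return X

# Example: a 1 kHz tone at 8 kHz sampling peaks in bin k = 1000 / (8000/512)
fs = 8000
t = np.arange(fs) / fs
X = stft_features(np.sin(2 * np.pi * 1000 * t))
print(X.shape, X[0].argmax())  # (30, 257), peak at bin 64
```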
Features (2): Cepstrum
Idea: decorrelate and summarize spectral slices:

Xm[ℓ] = IDFT{ log |S[mH, k]| }

- good for Gaussian models
- greatly reduces feature dimension

[Spectrum and cepstrum panels, 0–2.5 s, for male and female speakers]
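The cepstrum of one slice is just the inverse DFT of the log magnitude spectrum, truncated to the low-quefrency coefficients. A minimal sketch; the truncation length of 13 and the log floor are our assumptions.

```python
# Real cepstrum of one frame: Xm[l] = IDFT{ log |S[mH, k]| }, truncated.

import numpy as np

def cepstrum(frame, n_ceps=13):
    spectrum = np.abs(np.fft.fft(frame))
    log_mag = np.log(spectrum + 1e-10)   # small floor avoids log(0)
    ceps = np.fft.ifft(log_mag).real     # real cepstrum
    return ceps[:n_ceps]                 # low quefrency = smooth envelope

x = np.random.default_rng(0).standard_normal(512)
c = cepstrum(x)
print(c.shape)  # (13,)
```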
Features (3): Frequency axis warp
A linear frequency axis gives equal ‘space’ to 0–1 kHz and 3–4 kHz
- but their perceptual importance is very different

Warp the frequency axis closer to a perceptual axis
- mel, Bark, constant-Q, . . .

X[c] = Σ_{k=ℓc}^{uc} |S[k]|²

[Spectrum and auditory-spectrum (audspec) panels, 0–2.5 s, for male and female speakers]
Features (4): Spectral smoothing
Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum

Truncated cepstrum is one way:
- MMSE approximation to log |S[k]|

LPC modeling is a little different:
- MMSE approximation to |S[k]| → prefers detail at peaks

[Panels: audspec vs. PLP-smoothed spectra, 0–2.5 s, and level/dB vs. frequency channel (0–18) for one slice]
Features (5): Normalization along time

Idea: capture feature variations, not absolute level

Hence: calculate the average level and subtract it:

Y[n, k] = X[n, k] − mean_n{ X[n, k] }

This factors out a fixed channel frequency response:

x[n] = hc ∗ s[n]
X[n, k] = log |X[n, k]| = log |Hc[k]| + log |S[n, k]|

[PLP and mean-normalized panels, 0–2.5 s, for male and female speakers]
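Since a fixed channel adds log |Hc[k]| to every frame of the log spectrum, subtracting the per-channel mean over time cancels it exactly. A minimal numpy sketch of this mean normalization:

```python
# Mean normalization: Y[n, k] = X[n, k] - mean_n X[n, k].

import numpy as np

def mean_normalize(X):
    """X: (n_frames, n_channels) log-spectral features."""
    return X - X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
S = rng.standard_normal((100, 15))       # 'clean' log features
channel = rng.standard_normal(15)        # fixed log channel response
Y_clean = mean_normalize(S)
Y_chan = mean_normalize(S + channel)     # same speech through a channel
print(np.allclose(Y_clean, Y_chan))      # True: channel factored out
```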
Delta features
Want each segment to have ‘static’ feature values
- but some segments are intrinsically dynamic!
→ calculate their derivatives, which may be steadier

Append dX/dt (and d²X/dt²) to the feature vectors

[Panels: PLP (mean/variance normalized), deltas, and double-deltas, 0–2.5 s]

Relates to onset sensitivity in humans?
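A common way to compute the dX/dt term is a least-squares local slope over a window of ±W frames; the regression form and W = 2 here are our assumptions, not specified on the slide.

```python
# Delta features: regression slope d_t = sum_w w (x_{t+w} - x_{t-w})
#                                        / (2 sum_w w^2), edges held.

import numpy as np

def deltas(X, W=2):
    """X: (n_frames, n_dims). Returns dX/dt per frame."""
    padded = np.pad(X, ((W, W), (0, 0)), mode="edge")
    num = sum(w * (padded[W + w : len(X) + W + w]
                   - padded[W - w : len(X) + W - w])
              for w in range(1, W + 1))
    return num / (2 * sum(w * w for w in range(1, W + 1)))

X = np.arange(20, dtype=float).reshape(10, 2)  # each column rises 2/frame
D = deltas(X)
full = np.hstack([X, D])                       # static + delta vector
print(D[5], full.shape)                        # slope [2. 2.], (10, 4)
```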
Overall feature calculation
MFCCs and/or RASTA-PLP

[Block diagram. MFCC path: Sound → FFT X[k] → mel-scale frequency warp → log |X[k]| → IFFT → truncate → subtract mean (CMN) → MFCC features. RASTA-PLP path: Sound → FFT X[k] → Bark-scale frequency warp → log |X[k]| → RASTA band-pass (smoothed onsets) → LPC smooth (LPC spectra) → cepstral recursion → RASTA-PLP cepstral features.]

Key attributes:
- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels
Features summary
[Summary panels for male and female utterances, 0–1.5 s: spectrum (0–8000 Hz), audspec, RASTA, deltas]

These features:
- normalize the same phones
- contrast different phones
Outline
1 Recognizing speech
2 Feature calculation
3 Sequence recognition
4 Large vocabulary, continuous speech recognition (LVCSR)
Sequence recognition: Dynamic Time Warp (DTW)
Framewise comparison with stored templates:

[Distance matrix: ∼50 test frames against concatenated reference templates ONE, TWO, THREE, FOUR, FIVE]

- distance metric?
- comparison across templates?
Dynamic Time Warp (2)
Find the lowest-cost constrained path:
- matrix d(i, j) of distances between input frame fi and reference frame rj
- allowable predecessors and transition costs Txy

D(i, j) = d(i, j) + min{ D(i−1, j) + T10,
                         D(i, j−1) + T01,
                         D(i−1, j−1) + T11 }

i.e. lowest cost to (i, j) = local match cost + best predecessor (including transition cost)

Best path via traceback from the final state
- store the best predecessor for each (i, j)
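The DTW recursion and traceback above can be sketched directly. Setting all transition costs to zero is a simplification of ours; the three allowed predecessors are as on the slide.

```python
# DTW: D(i,j) = d(i,j) + min over predecessors (i-1,j), (i,j-1), (i-1,j-1).

import numpy as np

def dtw(test, ref, T10=0.0, T01=0.0, T11=0.0):
    """test, ref: (n_frames, n_dims) arrays. Returns (cost, warp path)."""
    n, m = len(test), len(ref)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
    D = np.full((n, m), np.inf)
    pred = {}
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            steps = [(D[i-1, j] + T10 if i else np.inf, (i-1, j)),
                     (D[i, j-1] + T01 if j else np.inf, (i, j-1)),
                     (D[i-1, j-1] + T11 if i and j else np.inf, (i-1, j-1))]
            cost, pred[i, j] = min(steps)     # best predecessor
            D[i, j] = d[i, j] + cost
    # traceback of stored predecessors from the final state
    path, ij = [(n - 1, m - 1)], (n - 1, m - 1)
    while ij != (0, 0):
        ij = pred[ij]
        path.append(ij)
    return D[-1, -1], path[::-1]

# A template and a time-stretched version of it align with zero cost
ref = np.array([[0.], [1.], [2.], [3.]])
test = np.array([[0.], [0.], [1.], [2.], [2.], [3.]])
cost, path = dtw(test, ref)
print(cost, path[0], path[-1])  # 0.0 (0, 0) (5, 3)
```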
DTW-based recognition
Reference templates for each possible word

For isolated words:
- mark the endpoints of the input word
- calculate scores through each template (+ prune)

[Score matrix: input frames against reference templates ONE, TWO, THREE, FOUR]

- continuous speech: link together word ends

Successfully handles timing variation
- recognizes speech at reasonable cost
Statistical sequence recognition
DTW is limited because it’s hard to optimize
- learning from multiple observations
- interpretation of distance, transition costs?

Need a theoretical foundation: probability

Formulate recognition as a MAP choice among word sequences:

Q* = argmax_Q p(Q | X, Θ)

- X = observed features
- Q = word sequences
- Θ = all current parameters
State-based modeling
Assume a discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:

[Diagram: model Mj generates state sequence Qk = q1, q2, q3, . . . over time, which generates the N observed feature vectors X1 = x1, x2, x3, . . . ]

Probability of the observations given the model is:

p(X | Θ) = Σ_{all Q} p(X_1^N | Q, Θ) p(Q | Θ)

- sum over all possible state sequences Q

How do the observations X_1^N depend on the states Q?
How do the state sequences Q depend on the model Θ?
HMM review
An HMM is specified by parameters Θ:
- states q^i
- transition probabilities a_ij
- emission distributions b_i(x)
- (+ initial state probabilities π_i)

[Diagram: left-to-right model over states k, a, t with self-loops; an example transition matrix with rows 1.0 0.0 0.0 0.0 / 0.9 0.1 0.0 0.0 / 0.0 0.9 0.1 0.0 / 0.0 0.0 0.9 0.1, and emission distributions p(x | q)]

a_ij ≡ p(q_n^j | q_{n−1}^i)    b_i(x) ≡ p(x | q^i)    π_i ≡ p(q_1^i)
HMM summary (1)
HMMs are a generative model: recognition is inference of p(Q | X)

During generation, the behavior of the model depends only on the current state qn:
- transition probabilities p(q_{n+1} | q_n) = a_ij
- observation distributions p(x_n | q_n) = b_i(x)

Given states Q = {q1, q2, . . . , qN} and observations X = X_1^N = {x1, x2, . . . , xN}, the Markov assumption makes

p(X, Q | Θ) = Π_n p(x_n | q_n) p(q_n | q_{n−1})
HMM summary (2)
Calculate p(X | Θ) via the forward recursion:

p(X_1^n, q_n^j) = α_n(j) = [ Σ_{i=1}^S α_{n−1}(i) a_ij ] b_j(x_n)

Viterbi (best path) approximation:

α*_n(j) = [ max_i { α*_{n−1}(i) a_ij } ] b_j(x_n)

- then backtrace. . .

Q* = argmax_Q p(X, Q | Θ)

Pictorially:

[Diagram: model M (assumed) and state sequence Q = {q1, q2, . . . , qn} (hidden) generate X (observed); decoding infers M = M* and Q*]
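The forward and Viterbi recursions above can be sketched on a toy two-state HMM with discrete emissions; the particular parameter values are invented for illustration.

```python
# Forward: alpha_n(j) = [sum_i alpha_{n-1}(i) a_ij] b_j(x_n)
# Viterbi: alpha*_n(j) = [max_i alpha*_{n-1}(i) a_ij] b_j(x_n), + backtrace.

import numpy as np

A = np.array([[0.9, 0.1],     # a_ij = p(state j at n | state i at n-1)
              [0.0, 1.0]])
B = np.array([[0.8, 0.2],     # b_i(x) = p(observation x | state i)
              [0.3, 0.7]])
pi = np.array([1.0, 0.0])     # initial state probabilities

def forward(obs):
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
    return alpha.sum()                     # p(X | Theta)

def viterbi(obs):
    delta = pi * B[:, obs[0]]
    psi = []
    for x in obs[1:]:
        scores = delta[:, None] * A        # alpha*_{n-1}(i) a_ij
        psi.append(scores.argmax(axis=0))  # best predecessor per state
        delta = scores.max(axis=0) * B[:, x]
    q = [int(delta.argmax())]              # backtrace from best final state
    for back in reversed(psi):
        q.append(int(back[q[-1]]))
    return q[::-1], delta.max()            # Q*, p(X, Q* | Theta)

obs = [0, 0, 1, 1]
path, p_best = viterbi(obs)
print(path, forward(obs) >= p_best)        # [0, 0, 1, 1] True
```

Note that the forward sum over all paths is always at least the Viterbi best-path probability, since the best path is one term in the sum.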
Outline
1 Recognizing speech
2 Feature calculation
3 Sequence recognition
4 Large vocabulary, continuous speech recognition (LVCSR)
Recognition with HMMs
Isolated word
- choose the best p(M | X) ∝ p(X | M) p(M)

[Diagram: input scored against word models M1 (“w ah n”), M2 (“th r iy”), M3 (“t uw”); pick the model maximizing p(X | Mi) · p(Mi)]

Continuous speech
- Viterbi decoding of one large HMM gives the words

[Diagram: word models joined through a silence state “sil”, entered with priors p(M1), p(M2), p(M3)]
Training HMMs
The probabilistic foundation allows us to train HMMs to ‘fit’ training data
- i.e. estimate a_ij, b_i(x) given data
- better than DTW. . .

Algorithms to improve p(Θ | X) are key to the success of HMMs
- maximum likelihood of models. . .

State alignments Q for the training examples are generally unknown
- . . . else estimating parameters would be easy

Viterbi training
- ‘forced alignment’
- choose ‘best’ labels (heuristic)

EM training
- ‘fuzzy labels’ (guaranteed local convergence)
Overall training procedure
[Diagram: labelled training data (“two one”, “four three”, “five”) is matched against word models (one = “w ah n”, two = “t uw”, three = “th r iy”, four = “f ao . . . ”); fit models to data, re-estimate model parameters, repeat until convergence]
Language models
Recall the fundamental equation of speech recognition:

Q* = argmax_Q p(Q | X, Θ)
   = argmax_Q p(X | Q, ΘA) p(Q | ΘL)

So far we have looked at p(X | Q, ΘA)

What about p(Q | ΘL)?
- Q is a particular word sequence
- ΘL are parameters related to the language

Two components:
- link state sequences to words: p(Q | wi)
- priors on word sequences: p(wi | Mj)
HMM Hierarchy
HMMs support composition
- can handle time dilation, pronunciation, and grammar all within the same framework

[Diagram: subphone states (ae1, ae2, ae3) compose into phone models (k, ae, t), which compose into words (THE, CAT, DOG, SAT, ATE) in a grammar]

p(q | M) = p(q, φ, w | M)
         = p(q | φ) · p(φ | w) · p(w_n | w_1^{n−1}, M)
Pronunciation models
Define states within each word: p(Q | wi)

Can have unique states for each word (‘whole-word’ modeling), or . . .

Share (tie) subword units between words to reflect the underlying phonology
- more training examples for each unit
- generalizes to unseen words
- (or can do it automatically. . . )

Start e.g. from a pronunciation dictionary:

ZERO(0.5)  z iy r ow
ZERO(0.5)  z ih r ow
ONE(1.0)   w ah n
TWO(1.0)   tcl t uw
Learning pronunciations
A ‘phone recognizer’ transcribes the training data as phones
- align to ‘canonical’ pronunciations

Baseform phoneme string:  f ay v  y iy r  ow l d
Surface phone string:     f ah ay v  y uh r  ow l

- infer modification rules
- predict other pronunciation variants

e.g. ‘d deletion’:

d → ∅ | ℓ stop    p = 0.9

Generate pronunciation variants; use forced alignment to find weights
Grammar
Account for the different likelihoods of different words and word sequences: p(wi | Mj)

‘True’ probabilities are very complex for LVCSR
- need parses, but speech is often agrammatic

→ Use n-grams:

p(w_n | w_1^{n−1}) = p(w_n | w_{n−K}, . . . , w_{n−1})

e.g. n-gram models of Shakespeare:

n=1  To him swallowed confess hear both. Which. Of save on . . .
n=2  What means, sir. I confess she? then all sorts, he is trim, . . .
n=3  Sweet prince, Falstaff shall die. Harry of Monmouth’s grave. . .
n=4  King Henry. What! I will go seek the traitor Gloucester. . . .

Big win in recognizer WER
- raw recognition results are often highly ambiguous
- the grammar guides toward ‘reasonable’ solutions
Smoothing LVCSR grammars
n-grams (n = 3 or 4) are estimated from large text corpora
- 100M+ words
- but: not like spoken language

A 100,000-word vocabulary → 10^15 trigrams!
- never see enough examples
- unobserved trigrams should NOT have Pr = 0!

Back off to bigrams, unigrams
- p(wn) as an approximation to p(wn | wn−1), etc.
- interpolate 1-gram, 2-gram, 3-gram with learned weights?

Lots of ideas, e.g. category grammars
- p(PLACE | “went”, “to”) · p(wn | PLACE)
- how to define categories?
- how to tag words in the training corpus?
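The interpolation idea above can be sketched with a bigram/unigram mixture, so an unseen bigram gets a small but nonzero probability from the unigram term. The toy corpus and the weight λ are invented for illustration.

```python
# Interpolated bigram: p(w | prev) = lam * p_ML(w | prev) + (1 - lam) * p_ML(w)

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_interp(w, prev, lam=0.7):
    p_uni = unigrams[w] / len(corpus)
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("cat", "the"))   # seen bigram: dominated by the bigram term
print(p_interp("mat", "cat"))   # unseen bigram: falls back to the unigram
```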
Decoding
How do we find the MAP word sequence?

States, pronunciations, and words define one big HMM
- with 100,000+ individual states for LVCSR!

→ Exploit the hierarchic structure
- phone states independent of word
- next word (semi-)independent of word history

[Diagram: lexical tree sharing phone states; from the root, arcs d → {iy, ow, oy, uw, . . . } with continuations k, axr, z, s, b lead to DO, DECOY, DECODE, DECODER, DECODES]
Decoder pruning
Searching ‘all possible word sequences’?
- need to restrict the search to the most promising ones: beam search
- sort by estimates of total probability
  = Pr(so far) + lower-bound estimate of the remainder
- trade search errors for speed

Start-synchronous algorithm:
- extract the top hypothesis from the queue:
  [Pn, {w1, . . . , wk}, n]
  (probability so far, words, next time frame)
- find plausible words {w^i} starting at time n → new hypotheses:
  [Pn · p(X_n^{n+N−1} | w^i) · p(w^i | wk . . .), {w1, . . . , wk, w^i}, n + N]
- discard if too unlikely, or if the queue is too long
- else re-insert into the queue and repeat
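The pruning loop above can be sketched with hypotheses of the form (log probability so far, words, next frame), keeping only a small beam at each step. The toy word scores and fixed word length are invented stand-ins for real acoustic and language model scores.

```python
# Start-synchronous beam search sketch: extend hypotheses word by word,
# prune all but the BEAM best at each frontier.

import heapq

word_scores = {"one": -1.0, "two": -1.5, "three": -2.5}  # toy log p(X | w)
WORD_LEN = 3            # pretend every word spans 3 frames
N_FRAMES = 6
BEAM = 2                # keep at most 2 hypotheses per frontier

def beam_search():
    frontier = [(0.0, [], 0)]     # (log prob so far, words, next frame n)
    complete = []
    while frontier:
        new = []
        for logp, words, n in frontier:
            if n == N_FRAMES:
                complete.append((logp, words))
                continue
            for w, s in word_scores.items():   # plausible words at n
                new.append((logp + s, words + [w], n + WORD_LEN))
        # prune: keep the BEAM best hypotheses, discard the rest
        frontier = heapq.nlargest(BEAM, new, key=lambda h: h[0])
    return max(complete) if complete else None

print(beam_search())   # (-2.0, ['one', 'one'])
```

Pruning to a beam of 2 here never discards the eventual winner, but with real scores it can: that is the search-error / speed trade mentioned above.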
Summary
The speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features

Speech is modeled as a sequence of features
- need a temporal aspect to recognition
- best time alignment of templates = DTW

Hidden Markov models are the rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Language modeling captures larger structure
- pronunciation, word sequences
- fits directly into the HMM state structure
- need to ‘prune’ the search space in decoding

Parting thought:
Forward-backward trains to generate; can we train to discriminate?
References
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88, 2002.

Wendy Holmes. Speech Synthesis and Recognition. CRC, December 2001. ISBN 0748408576.

Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall PTR, April 1993. ISBN 0130151572.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, January 2000. ISBN 0130950696.

Frederick Jelinek. Statistical Methods for Speech Recognition (Language, Speech, and Communication). The MIT Press, January 1998. ISBN 0262100665.

Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, April 2001. ISBN 0130226165.