Automatic Speech Recognition: Introduction
Readings: Jurafsky & Martin 7.1-2
HLT Survey Chapter 1
The Human Dialogue System
Computer Dialogue Systems
[Figure: pipeline: signal → Audition / Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management / Planning → Natural Language Generation → words → Text-to-speech → signal]
Parameters of ASR Capabilities
• Different types of tasks with different difficulties:
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small: < 20 words / large: > 20,000 words)
– Language model (finite state / context sensitive)
– Perplexity (small: < 10 / large: > 100)
– Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model
[Figure: message → noisy channel → message; Message + Channel = Signal]
Decoding model: find Message* = argmax P(Message | Signal)
But how do we represent each of these things?
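To make the decoding rule concrete, here is a minimal Python sketch. The candidate messages and all of the probabilities are toy values invented for illustration; in a real system the channel model and the prior come from trained models.

```python
# Minimal sketch of noisy-channel decoding (toy numbers, not a real ASR system).
# Message* = argmax_Message P(Message | Signal)
#          = argmax_Message P(Signal | Message) * P(Message)   (Bayes; P(Signal) is constant)

# Hypothetical candidate messages with prior probabilities P(Message).
prior = {"go home": 0.6, "go to rome": 0.3, "gnome": 0.1}

# Hypothetical channel model P(Signal | Message) for one observed signal.
likelihood = {"go home": 0.2, "go to rome": 0.05, "gnome": 0.4}

# Score every candidate and keep the argmax.
best = max(prior, key=lambda m: likelihood[m] * prior[m])
print(best)  # "go home": 0.2 * 0.6 = 0.12 beats 0.4 * 0.1 = 0.04
```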
ASR using HMMs
• Try to solve P(Message | Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (phones)
– Assume that phones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
HMMs: The Traditional View
[Figure: "go home": phone backbone g o h o m aligned to acoustic observations x0 x1 x2 x3 x4 x5 x6 x7 x8 x9]
Markov model backbone composed of phones (hidden because we don't know the correspondences)
Acoustic observations
Each line represents a probability estimate (more later)
HMMs: The Traditional View
[Figure: the same "go home" model, phone backbone g o h o m over acoustic observations x0 … x9, with a different alignment]
Even with the same word hypothesis, we can have different alignments. Also, we have to search over all word hypotheses.
HMMs as Dynamic Bayesian Networks
[Figure: "go home" as a DBN: state sequence q0=g q1=o q2=o q3=o q4=h q5=o q6=o q7=o q8=m q9=m above acoustic observations x0 x1 x2 x3 x4 x5 x6 x7 x8 x9]
Markov model backbone composed of phones
Acoustic observations
HMMs as Dynamic Bayesian Networks
[Figure: the same DBN: q0=g q1=o … q9=m over x0 … x9]
ASR: What is the best assignment to q0…q9 given x0…x9?
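The standard answer is the Viterbi algorithm. Below is a minimal sketch, assuming a toy 3-state left-to-right model; the transition probabilities and per-frame emission scores are made up for illustration rather than taken from a trained model.

```python
import numpy as np

n_states, n_frames = 3, 10            # e.g. phones of a short word; 10 acoustic frames
log_A = np.log(np.array([             # left-to-right transitions with self-loops
    [0.7, 0.3, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]))
rng = np.random.default_rng(0)
log_B = np.log(rng.dirichlet(np.ones(n_states), size=n_frames)).T  # fake log P(x_t | q)

delta = np.full((n_states, n_frames), -np.inf)  # best log score ending in state j at t
psi = np.zeros((n_states, n_frames), dtype=int) # backpointers
delta[0, 0] = log_B[0, 0]                       # must start in the first state
for t in range(1, n_frames):
    for j in range(n_states):
        scores = delta[:, t - 1] + log_A[:, j]
        psi[j, t] = np.argmax(scores)
        delta[j, t] = scores[psi[j, t]] + log_B[j, t]

# Backtrace from the final state to recover the best q0..q9.
q = [n_states - 1]
for t in range(n_frames - 1, 0, -1):
    q.append(psi[q[-1], t])
print(list(reversed(q)))
```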
Hidden Markov Models & DBNs
[Figure: DBN representation vs. Markov model representation of the same HMM]
Parts of an ASR System
[Figure: block diagram: speech → Feature Calculation → Acoustic Modeling ("k @") → Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) → Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …) → SEARCH → "The cat chased the dog"]
Parts of an ASR System
[Figure: the same block diagram, annotated with each component's role]
Feature Calculation produces acoustics (x_t); Acoustic Modeling maps acoustics to phones; Pronunciation Modeling maps phones to words; Language Modeling strings words together.
Feature calculation
Feature calculation
[Figure: spectrogram, frequency vs. time]
Find energy at each time step in each frequency channel.
Feature calculation
[Figure: spectrogram, frequency vs. time]
Take the inverse Discrete Fourier Transform to decorrelate the frequencies.
Feature calculation
Input: [Figure: speech waveform]
Output: one feature vector per time step, e.g.
[-0.1 0.3 1.4 -1.2 2.3 2.6 …] [0.2 0.1 1.2 -1.2 4.4 2.2 …] [-6.1 -2.1 3.1 2.4 1.0 2.2 …] [0.2 0.0 1.2 -1.2 4.4 2.2 …] …
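A rough sketch of this pipeline follows. The frame length, hop size, and channel count are illustrative assumptions, the simple rectangular band pooling stands in for a real mel filterbank, and the DCT implements the decorrelation step the slide calls an inverse DFT.

```python
import numpy as np

def features(signal, frame_len=400, hop=160, n_channels=26):
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    feats = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame)) ** 2           # energy per frequency bin
        # Pool bins into a few channels (a crude stand-in for a mel filterbank).
        channels = np.array([band.sum() for band in
                             np.array_split(spectrum, n_channels)])
        log_energy = np.log(channels + 1e-10)                # compress dynamic range
        # DCT-II to decorrelate channels (the "inverse DFT" step on the slide).
        n = np.arange(n_channels)
        dct = np.cos(np.pi / n_channels * (n[:, None] + 0.5) * n[None, :])
        feats.append(log_energy @ dct)                       # one feature vector per frame
    return np.array(feats)

x = np.random.default_rng(0).standard_normal(16000)          # 1 s of fake 16 kHz audio
print(features(x).shape)                                     # (frames, channels)
```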
Robust Speech Recognition
• Different schemes have been developed for dealing with noise and reverberation
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction, as sketched below)
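A minimal sketch of cepstral mean subtraction, assuming the features arrive as a (frames × coefficients) array: a linear channel (microphone, telephone line) contributes a roughly constant offset in the cepstral domain, so subtracting the per-utterance mean removes it.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (n_frames, n_coeffs) array of cepstral feature vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

feats = np.random.default_rng(0).standard_normal((100, 13)) + 5.0  # fake channel offset
print(np.abs(cepstral_mean_subtraction(feats).mean(axis=0)).max())  # ~0 after CMS
```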
Now what?
[-0.1 0.3 1.4 -1.2 2.3 2.6 …] [0.2 0.1 1.2 -1.2 4.4 2.2 …] [-6.1 -2.1 3.1 2.4 1.0 2.2 …] [0.2 0.0 1.2 -1.2 4.4 2.2 …] → "That you …" ???
Machine Learning!
[-0.1 0.3 1.4 -1.2 2.3 2.6 …] [0.2 0.1 1.2 -1.2 4.4 2.2 …] [-6.1 -2.1 3.1 2.4 1.0 2.2 …] [0.2 0.0 1.2 -1.2 4.4 2.2 …] → "That you …"
Pattern recognition with HMMs
Hidden Markov Models (again!)
P(acoustics_t | state_t): Acoustic Model
P(state_{t+1} | state_t): Pronunciation/Language models
Acoustic Model
[Figure: the feature vectors above, now labeled with phones: dh a a t]
• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks)
N_a(μ, σ) models P(X | state = a)
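A minimal sketch of this per-phone Gaussian, assuming fake labeled data and a diagonal covariance; the phone labels, feature dimension, and data are placeholders.

```python
import numpy as np

# phone label -> stacked feature vectors that were labeled with that phone
labeled = {
    "dh": np.random.default_rng(0).standard_normal((50, 13)),
    "a":  np.random.default_rng(1).standard_normal((80, 13)) + 1.0,
}

# Fit one diagonal Gaussian N_a per phone: mean and variance of its examples.
models = {ph: (X.mean(axis=0), X.var(axis=0)) for ph, X in labeled.items()}

def log_p(x, phone):
    """log N_phone(x): the model's estimate of log P(x | state = phone)."""
    mu, var = models[phone]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x = np.ones(13)
print(log_p(x, "a") > log_p(x, "dh"))  # a vector near +1 looks more like "a"
```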
Building up the Markov Model
• Start with a model for each phone
• Typically, we use 3 states per phone to give a minimum duration constraint, but ignore that here…
[Figure: a one-phone model: state "a" with self-loop probability p and exit probability 1−p (the transition probabilities); the slide repeats this single-phone model several times]
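A small sketch of this transition structure, with a hypothetical phone_transitions helper: with a self-loop of probability p, the time spent in a state is geometric with mean 1/(1−p), and chaining 3 such states per phone enforces a 3-frame minimum duration.

```python
import numpy as np

def phone_transitions(n_states=3, p=0.6):
    """Left-to-right phone model; the extra last column is the exit arc."""
    A = np.zeros((n_states, n_states + 1))
    for s in range(n_states):
        A[s, s] = p          # self-loop: stay in the state
        A[s, s + 1] = 1 - p  # advance to the next state (or exit)
    return A

print(phone_transitions())
print("mean frames per state:", 1 / (1 - 0.6))  # geometric duration: 2.5
```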
Building up the Markov Model
• Pronunciation model gives connections between phones and words
• Multiple pronunciations:
[Figure: pronunciation network for "that": dh (self-loop p_dh, exit 1−p_dh) → a (p_a, 1−p_a) → t (p_t, 1−p_t); a second network with alternative phone paths (t, m, ah, ow, ey) illustrates multiple pronunciations of a word]
Building up the Markov Model
• Language model gives connections between words (e.g., bigram grammar)
[Figure: bigram connections between word models: "that" (dh a t) → "he" (h iy) with probability p(he | that), and "that" → "you" (y uw) with probability p(you | that)]
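A tiny sketch of estimating such bigram probabilities by counting, on an invented corpus; real language models are trained on far more data and smoothed to handle unseen bigrams.

```python
from collections import Counter

corpus = "that he said that you said that he left".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])              # counts of history words

def p(w2, w1):
    """Maximum-likelihood bigram estimate p(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("he", "that"))   # 2/3: "that he" occurs twice, "that you" once
print(p("you", "that"))  # 1/3
```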
ASR as Bayesian Inference
[Figure: trellis of states q1 q2 q3 (word w1) over acoustics x1 x2 x3, with word models "that" (th a t), "he" (h iy), "you" (y uw), "should" (sh uh d) connected by p(he | that) and p(you | that)]
argmax_W P(W|X)
  = argmax_W P(X|W) P(W) / P(X)
  = argmax_W P(X|W) P(W)
  = argmax_W Σ_Q P(X,Q|W) P(W)
  ≈ argmax_W max_Q P(X,Q|W) P(W)
  ≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)
ASR Probability Models
• Three probability models
– P(X|Q): acoustic model
– P(Q|W): duration/transition/pronunciation model
– P(W): language model
• Language and pronunciation models are inferred from prior knowledge
• Other models are learned from data (how?)
Parts of an ASR System
[Figure: the same block diagram: Feature Calculation → Acoustic Modeling → Pronunciation Modeling → Language Modeling → SEARCH → "The cat chased the dog", now annotated with the probability models P(X|Q), P(Q|W), and P(W)]
EM for ASR: The Forward-Backward Algorithm
• Determine "state occupancy" probabilities, i.e. assign each data vector (softly) to a state (see the sketch below)
• Calculate new transition probabilities and new means & standard deviations (emission probabilities) using those assignments
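A compact sketch of the state-occupancy computation, assuming a toy discrete-output HMM; real ASR systems use Gaussian emissions, but the α/β recursions have the same shape. The M-step would then re-estimate the transition and emission parameters from γ.

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy transitions P(q_{t+1} | q_t)
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy emissions  P(x_t | q_t)
pi = np.array([0.5, 0.5])                # initial state probabilities
obs = [0, 0, 1, 1, 0]                    # toy observation sequence

T, N = len(obs), len(pi)
alpha = np.zeros((T, N))
beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):                     # forward pass
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):            # backward pass
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)  # state occupancy P(q_t | x_1..x_T)
print(gamma.round(3))                      # soft assignment of each vector to a state
```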
Search
• When trying to find W* = argmax_W P(W|X), we need to look at (in theory)
– All possible word sequences W
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W
– Viterbi search: a dynamic programming approach that looks for the most likely path
– A* search: an alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important (see the beam-pruning sketch below)
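A schematic sketch of beam pruning: at each step only the best few partial hypotheses survive. The successor function and its scores are placeholders standing in for real acoustic and language model scores.

```python
import heapq

def expand(hyp):
    """Hypothetical successor function: extend a hypothesis by one word."""
    word_seq, score = hyp
    return [(word_seq + [w], score + s)
            for w, s in [("the", -1.0), ("a", -1.5), ("cat", -2.0)]]

beam_width = 2
beam = [([], 0.0)]                    # (word sequence, log probability)
for _ in range(3):                    # three expansion steps ("frames")
    candidates = [h for hyp in beam for h in expand(hyp)]
    beam = heapq.nlargest(beam_width, candidates, key=lambda h: h[1])

for words, score in beam:             # only beam_width hypotheses survive
    print(score, words)
```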
How to train an ASR system
• Have a speech corpus at hand
– It should have word (and preferably phone) transcriptions
– Divide it into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar
• Train acoustic models
– Possibly realigning the corpus phonetically
How to train an ASR system
• Test on your development data (baseline)
• **Think real hard
• Figure out some neat new modification
• Retrain the system component
• Test on your development data
• Lather, rinse, repeat**
• Then, at the end of the project, test on the test data
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate:
Error Rate = 100 × (Subs + Ins + Dels) / N_words

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I

100 × (1S + 1I + 1D) / 5 = 60%
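A small sketch of computing this word error rate with dynamic programming (edit distance over words), reproducing the example above.

```python
import numpy as np

def wer(ref, rec):
    """Word error rate: 100 * (subs + ins + dels) / len(ref)."""
    d = np.zeros((len(ref) + 1, len(rec) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deleting all of ref
    d[0, :] = np.arange(len(rec) + 1)   # inserting all of rec
    for i in range(1, len(ref) + 1):
        for j in range(1, len(rec) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != rec[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return 100 * d[-1, -1] / len(ref)

print(wer("I WANT TO GO HOME".split(),
          "WANT TWO GO HOME NOW".split()))  # 60.0: 1 sub + 1 ins + 1 del over 5 words
```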
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate
• This assumes that all errors are equal
– There is also a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate