
EE E6820: Speech & Audio Processing & Recognition

Lecture 9: Speech Recognition

Dan Ellis <[email protected]>Michael Mandel <[email protected]>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

April 7, 2009

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Outline

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Recognizing speech

[Figure: spectrogram (frequency 0-4 kHz vs. time 0-3 s) of the utterance “So, I thought about that and I think it’s still possible”]

What kind of information might we want from the speech signal?

- words
- phrasing, ‘speech acts’ (prosody)
- mood / emotion
- speaker identity

What kind of processing do we need to get at that information?

- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features


Speech recognition as Transcription

Transcription = “speech to text”

- find a word string to match the utterance

Gives neat objective measure: word error rate (WER) %

- can be a sensitive measure of performance

Reference:   THE  CAT  SAT  ON   THE  MAT
Recognized:       CAT  SAT  AN   THE  MAT  A
           (Deletion)  (Substitution)  (Insertion)

Three kinds of errors:

WER = (S + D + I) / N
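The counts S, D, and I come from an edit-distance alignment of the two word strings. As a minimal illustrative sketch (not part of the lecture), in Python:

```python
# WER via edit-distance alignment; total errors / reference length.
def wer(reference, recognized):
    ref, hyp = reference.split(), recognized.split()
    # D[i][j] = minimum edit cost to align ref[:i] with hyp[:j]
    D = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        D[i][0] = i                       # i deletions
    for j in range(1, len(hyp) + 1):
        D[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = D[i-1][j-1] + (ref[i-1] != hyp[j-1])  # substitution or match
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    return D[-1][-1] / len(ref)           # (S + D + I) / N

print(wer("THE CAT SAT ON THE MAT", "CAT SAT AN THE MAT A"))  # 3/6 = 0.5
```

On the example above this gives 3 errors (one deletion, one substitution, one insertion) over 6 reference words, i.e. 50% WER.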


Problems: Within-speaker variability

Timing variation

- word duration varies enormously

[Figure: spectrogram (0-4 kHz, 0-3 s) with aligned word labels (SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE) and phone labels]

- fast speech ‘reduces’ vowels

Speaking style variation

- careful/casual articulation
- soft/loud speech

Contextual effects

- speech sounds vary with context, role: “How do you do?”


Problems: Between-speaker variability

Accent variation

- regional / mother tongue

Voice quality variation

- gender, age, huskiness, nasality

Individual characteristics

- mannerisms, speed, prosody

[Figure: spectrograms (0-8 kHz) of the same sentence spoken by two speakers, mbma0 and fjdm2]


Problems: Environment variability

Background noise

- fans, cars, doors, papers

Reverberation

- ‘boxiness’ in recordings

Microphone/channel

- huge effect on relative spectral gain

[Figure: spectrograms (0-4 kHz) of the same speech captured by a close mic and a tabletop mic]


How to recognize speech?

Cross correlate templates?

- waveform?
- spectrogram?
- time-warp problems

Match short segments & handle time-warp later

- model with slices of ∼10 ms
- pseudo-stationary model of words:

[Figure: spectrogram (0-4 kHz) of a word with pseudo-stationary segments labeled sil g w eh n sil]

- other sources of variation...


Probabilistic formulation

Probability that segment label is correct

- gives standard form of speech recognizers

Feature calculation: s[n] → X_m (one feature vector per hop of H samples, frame index m = n/H)

- transforms signal into easily-classified domain

Acoustic classifier: p(q_i | X)

- calculates probabilities of each mutually-exclusive state q_i

‘Finite state acceptor’ (i.e. HMM):

Q* = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)

- MAP match of allowable sequence to probabilities:

[Figure: trellis of states (q_0 = “ay”, q_1, ...) against observation frames X over time]


Standard speech recognizer structure

Fundamental equation of speech recognition:

Q* = argmax_Q p(Q | X, Θ)
   = argmax_Q p(X | Q, Θ) p(Q | Θ)

- X = acoustic features
- p(X | Q, Θ) = acoustic model
- p(Q | Θ) = language model
- argmax_Q = search over sequences

Questions:

- what are the best features?
- how do we model them?
- how do we find/match the state sequence?


Outline

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Feature Calculation

Goal: find a representational space most suitable for classification

- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?

Pattern recognition: the representation sets an upper bound on performance

- maybe we should use the waveform...
- or, maybe the representation can do all the work

Feature calculation is intimately bound to the classifier

- pragmatic strengths and weaknesses

Features develop by slow evolution

- current choices more historical than principled


Features (1): Spectrogram

Plain STFT as features, e.g.

X_m[k] = S[mH, k] = Σ_n s[n + mH] w[n] e^(−j2πkn/N)
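A sketch of this framewise feature extraction in Python (numpy only; the window and hop sizes are illustrative assumptions, not the lecture's settings):

```python
import numpy as np

def stft_features(s, n_fft=512, hop=160):
    """Magnitude of X_m[k] = sum_n s[n + mH] w[n] e^{-j 2 pi k n / N}."""
    w = np.hanning(n_fft)                     # analysis window w[n]
    n_frames = 1 + (len(s) - n_fft) // hop
    X = np.empty((n_frames, n_fft // 2 + 1))
    for m in range(n_frames):
        frame = s[m * hop : m * hop + n_fft] * w
        X[m] = np.abs(np.fft.rfft(frame))     # one spectral slice per frame
    return X
```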

Consider examples:

[Figure: spectrograms (0-8 kHz) of the male and female examples, with one column marked as a feature vector slice]

Similarities between corresponding segments

- but still large differences


Features (2): Cepstrum

Idea: decorrelate, summarize spectral slices:

X_m[ℓ] = IDFT{log |S[mH, k]|}

- good for Gaussian models
- greatly reduce feature dimension
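A minimal cepstrum sketch along these lines, assuming a windowed frame such as one produced by the STFT code above (the small floor and the 13-coefficient truncation are illustrative assumptions):

```python
import numpy as np

def cepstrum(frame, n_coef=13):
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # floor avoids log(0)
    ceps = np.fft.irfft(np.log(spectrum))           # X_m[l] = IDFT{log|S[mH,k]|}
    return ceps[:n_coef]                            # keep low quefrencies only
```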

[Figure: spectra and cepstra for the male and female examples]


Features (3): Frequency axis warp

Linear frequency axis gives equal ‘space’ to 0-1 kHz and 3-4 kHz

- but perceptual importance very different

Warp frequency axis closer to perceptual axis

- mel, Bark, constant-Q ...

X[c] = Σ_{k=ℓ_c}^{u_c} |S[k]|²
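A sketch of this band-energy warp: sum |S[k]|² over bins ℓ_c..u_c per channel, with mel-spaced band edges (the mel edge formula is the standard approximation, assumed here rather than taken from the slide):

```python
import numpy as np

def mel_bands(power_spec, sr, n_bands=20):
    """power_spec: |S[k]|^2 for one frame; returns n_bands warped energies."""
    n_bins = len(power_spec)
    f_max = sr / 2
    mel = lambda f: 2595 * np.log10(1 + f / 700)     # Hz -> mel
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)    # mel -> Hz
    edges_hz = imel(np.linspace(0, mel(f_max), n_bands + 1))
    edges = (edges_hz / f_max * (n_bins - 1)).astype(int)
    return np.array([power_spec[edges[c]:edges[c + 1] + 1].sum()
                     for c in range(n_bands)])       # X[c] = sum over band c
```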

[Figure: spectrum vs. warped ‘audspec’ representation for the male and female examples]


Features (4): Spectral smoothing

Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum

Truncated cepstrum is one way (sketched below):

- MMSE approx to log |S[k]|

LPC modeling is a little different:

- MMSE approx to |S[k]| → prefers detail at peaks
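A sketch of the truncated-cepstrum route: keep only the low-quefrency coefficients, then return to the log-spectral domain to get a smoothed envelope (the cutoff of 13 is an illustrative assumption):

```python
import numpy as np

def smooth_spectrum(frame, n_keep=13):
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    ceps = np.fft.irfft(log_spec)
    ceps[n_keep:-n_keep] = 0          # zero the fine (high-quefrency) detail
    return np.fft.rfft(ceps).real     # smoothed log|S[k]| envelope
```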

[Figure: audspec vs. PLP-smoothed features for the male example, plus one frame shown as level (dB) against frequency channel]


Features (5): Normalization along time

Idea: feature variations, not absolute level

Hence: calculate average level and subtract it:

Y[n, k] = X[n, k] − mean_n{X[n, k]}

Factors out a fixed channel frequency response:

x[n] = h_c ∗ s[n]
⇒ log |X[n, k]| = log |H_c[k]| + log |S[n, k]|
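The normalization itself is essentially one line; a sketch with features arranged frames × dimensions:

```python
import numpy as np

def mean_normalize(X):
    """Y[n, k] = X[n, k] - mean_n X[n, k]: removes the fixed log|H_c[k]|."""
    return X - X.mean(axis=0, keepdims=True)
```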

[Figure: PLP features before and after mean normalization for the male and female examples]


Delta features

Want each segment to have ‘static’ feature values

- but some segments are intrinsically dynamic!
→ calculate their derivatives (maybe steadier?)

Append dX/dt (+ d²X/dt²) to feature vectors
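A delta-feature sketch using the common regression-filter form over ±2 frames (the window width is an assumption, not the lecture's choice); the same function applied twice gives the double-deltas:

```python
import numpy as np

def deltas(features, width=2):
    """features: (n_frames, n_dims); returns frame-wise time derivative."""
    pad = np.pad(features, ((width, width), (0, 0)), mode='edge')
    n = len(features)
    num = sum(t * (pad[width + t : n + width + t] -
                   pad[width - t : n + width - t])
              for t in range(1, width + 1))          # weighted slope estimate
    return num / (2 * sum(t * t for t in range(1, width + 1)))
```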

[Figure: mean/variance-normalized PLP features with their deltas and double-deltas, male example]

Relates to onset sensitivity in humans?


Overall feature calculation

MFCCs and/or RASTA-PLP

MFCC chain: sound → FFT |X[k]| → mel-scale frequency warp (audspec) → log|X[k]| → IFFT (cepstra) → truncate → subtract mean (CMN) → MFCC features

RASTA-PLP chain: sound → FFT |X[k]| → Bark-scale frequency warp → log|X[k]| → RASTA band-pass (smoothed onsets) → LPC smooth (LPC spectra) → cepstral recursion → RASTA-PLP cepstral features

Key attributes:

- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels
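For reference, librosa packages essentially this MFCC chain; a short sketch (librosa.feature.mfcc and librosa.feature.delta exist with these signatures, but the file name and parameter values are illustrative assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load('speech.wav', sr=16000)         # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
mfcc -= mfcc.mean(axis=1, keepdims=True)             # cepstral mean norm
d1 = librosa.feature.delta(mfcc)                     # dX/dt
d2 = librosa.feature.delta(mfcc, order=2)            # d2X/dt2
features = np.vstack([mfcc, d1, d2])                 # (39, n_frames)
```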


Features summary

[Figure: spectrum, audspec, RASTA, and delta features for the male and female examples]

- normalize same phones
- contrast different phones


Outline

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Sequence recognition: Dynamic Time Warp (DTW)

Framewise comparison with stored templates:

[Figure: framewise distance matrix between test frames and reference templates for ONE, TWO, THREE, FOUR, FIVE]

- distance metric?
- comparison across templates?


Dynamic Time Warp (2)

Find lowest-cost constrained path:

- matrix d(i, j) of distances between input frame f_i and reference frame r_j
- allowable predecessors and transition costs T_xy

D(i, j) = d(i, j) + min{ D(i−1, j) + T_10,  D(i, j−1) + T_01,  D(i−1, j−1) + T_11 }

- d(i, j) is the local match cost; D(i, j) is the lowest cost to reach (i, j)
- the min picks the best predecessor, including its transition cost

Best path via traceback from final state

- store predecessors for each (i, j) (see the sketch below)
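A DTW sketch implementing the recursion above with traceback (unit transition costs T_10 = T_01 = T_11 = 1 are assumed purely for illustration):

```python
import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=1.0):
    """test: (I, d) frames, ref: (J, d) frames. Returns cost and warp path."""
    I, J = len(test), len(ref)
    d = np.linalg.norm(test[:, None] - ref[None, :], axis=-1)  # d(i, j)
    D = np.full((I, J), np.inf)
    back = np.zeros((I, J, 2), dtype=int)          # predecessor of each (i, j)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == j == 0:
                continue
            steps = [
                (D[i-1, j] + T10, (i-1, j)) if i > 0 else (np.inf, None),
                (D[i, j-1] + T01, (i, j-1)) if j > 0 else (np.inf, None),
                (D[i-1, j-1] + T11, (i-1, j-1)) if i > 0 and j > 0 else (np.inf, None),
            ]
            cost, pred = min(steps, key=lambda s: s[0])   # best predecessor
            D[i, j] = d[i, j] + cost
            back[i, j] = pred
    path, (i, j) = [], (I - 1, J - 1)              # traceback from final state
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = back[i, j]
    return D[-1, -1], [(0, 0)] + path[::-1]
```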


DTW-based recognition

Reference templates for each possible word

For isolated words:

- mark endpoints of input word
- calculate scores through each template (+ prune)

[Figure: input frames scored against stacked reference templates for ONE, TWO, THREE, FOUR]

- continuous speech: link together word ends

Successfully handles timing variation

- recognize speech at reasonable cost


Statistical sequence recognition

DTW is limited because it’s hard to optimize

- learning from multiple observations
- interpretation of distance, transition costs?

Need a theoretical foundation: Probability

Formulate recognition as MAP choice among word sequences:

Q* = argmax_Q p(Q | X, Θ)

- X = observed features
- Q = word sequences
- Θ = all current parameters


State-based modeling

Assume a discrete-state model for the speech:

- observations are divided up into time frames
- model → states → observations:

[Figure: model M_j generates a state sequence Q_k: q_1, q_2, q_3, ... (one state per time frame), which in turn generates the N observed feature vectors X_1: x_1, x_2, x_3, ...]

Probability of observations given model is:

p(X | Θ) = Σ_{all Q} p(X_1^N | Q, Θ) p(Q | Θ)

- sum over all possible state sequences Q

How do observations X_1^N depend on states Q?

How do state sequences Q depend on model Θ?


HMM review

HMM is specified by parameters Θ:

[Figure: word model ‘kat’ with states k, a, t; each row of the transition matrix (e.g. 0.9 0.1 0.0 0.0) gives the probability of the next state given the current one, and each state has an emission distribution p(x|q)]

- states q_i
- transition probabilities a_ij
- emission distributions b_i(x)
- (+ initial state probabilities π_i)

a_ij ≡ p(q_n^j | q_{n−1}^i)    b_i(x) ≡ p(x | q^i)    π_i ≡ p(q_1^i)


HMM summary (1)

HMMs are a generative model: recognition is inference of p(Q | X)

During generation, behavior of the model depends only on the current state q_n:

- transition probabilities p(q_{n+1} | q_n) = a_ij
- observation distributions p(x_n | q_n) = b_i(x)

Given states Q = {q_1, q_2, ..., q_N} and observations X = X_1^N = {x_1, x_2, ..., x_N}, the Markov assumption makes

p(X, Q | Θ) = Π_n p(x_n | q_n) p(q_n | q_{n−1})


HMM summary (2)

Calculate p(X | Θ) via the forward recursion:

p(X_1^n, q_n^j) = α_n(j) = [ Σ_{i=1}^S α_{n−1}(i) a_ij ] b_j(x_n)

Viterbi (best path) approximation:

α*_n(j) = [ max_i {α*_{n−1}(i) a_ij} ] b_j(x_n)

- then backtrace...

Q* = argmax_Q p(X, Q | Θ)
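A numpy sketch of both recursions, assuming the frame likelihoods b_j(x_n) are precomputed in a matrix B (A is the transition matrix a_ij, pi the initial state probabilities):

```python
import numpy as np

def forward(A, B, pi):
    """A: (S, S), B: (S, N) frame likelihoods, pi: (S,). Returns p(X | Theta)."""
    S, N = B.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * B[:, 0]
    for n in range(1, N):
        alpha[n] = (alpha[n-1] @ A) * B[:, n]   # sum over predecessors i
    return alpha[-1].sum()

def viterbi(A, B, pi):
    """Best state sequence Q* by max instead of sum, then backtrace."""
    S, N = B.shape
    delta = np.zeros((N, S))
    psi = np.zeros((N, S), dtype=int)
    delta[0] = pi * B[:, 0]
    for n in range(1, N):
        scores = delta[n-1][:, None] * A        # alpha*_{n-1}(i) a_ij
        psi[n] = scores.argmax(axis=0)          # best predecessor per state
        delta[n] = scores.max(axis=0) * B[:, n]
    q = [delta[-1].argmax()]                    # backtrace from best final state
    for n in range(N - 1, 0, -1):
        q.append(psi[n][q[-1]])
    return q[::-1]
```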

Pictorially:

[Figure: Q = {q_1, q_2, ..., q_n} assumed and hidden; X observed; M = M* and Q* inferred]


Outline

1 Recognizing speech

2 Feature calculation

3 Sequence recognition

4 Large vocabulary, continuous speech recognition (LVCSR)


Recognition with HMMs

Isolated word

- choose best p(M | X) ∝ p(X | M) p(M)

[Figure: input scored against separate word models (phone chains ‘w ah n’, ‘th r iy’, ‘t uw’), each yielding p(X | M_j)·p(M_j)]

Continuous speech

- Viterbi decoding of one large HMM gives words

[Figure: word models (‘w ah n’, ‘th r iy’, ‘t uw’) connected in a loop through silence, with priors p(M_1), p(M_2), p(M_3) on the transitions]


Training HMMs

Probabilistic foundation allows us to train HMMs to ‘fit’ training data

- i.e. estimate a_ij, b_i(x) given data
- better than DTW...

Algorithms to improve p(Θ | X) are key to the success of HMMs

- maximum-likelihood of models...

State alignments Q for training examples are generally unknown

- ... else estimating parameters would be easy

Viterbi training

- ‘forced alignment’
- choose ‘best’ labels (heuristic)

EM training

- ‘fuzzy labels’ (guaranteed local convergence)


Overall training procedure

[Figure: labelled training data (“two one”, “four three”, “five”) aligned against concatenated word models (w ah n, t uw, th r iy, f ao, ...)]

Fit models to data; re-estimate model parameters; repeat until convergence.


Language models

Recall, the fundamental equation of speech recognition:

Q* = argmax_Q p(Q | X, Θ)
   = argmax_Q p(X | Q, Θ_A) p(Q | Θ_L)

So far, looked at p(X | Q, Θ_A)

What about p(Q | Θ_L)?

- Q is a particular word sequence
- Θ_L are parameters related to the language

Two components:

- link state sequences to words p(Q | w_i)
- priors on word sequences p(w_i | M_j)


HMM Hierarchy

HMMs support composition

- can handle time dilation, pronunciation, grammar all within the same framework

[Figure: hierarchy from grammar (THE CAT/DOG SAT/ATE) down to word pronunciations (k ae t) down to subphone states (ae1 ae2 ae3)]

p(q | M) = p(q, φ, w | M)
         = p(q | φ) · p(φ | w) · p(w_n | w_1^{n−1}, M)


Pronunciation models

Define states within each word p(Q | w_i)

Can have unique states for each word (‘whole-word’ modeling), or ...

Sharing (tying) subword units between words to reflect underlying phonology

- more training examples for each unit
- generalizes to unseen words
- (or can do it automatically...)

Start e.g. from a pronunciation dictionary:

ZERO(0.5)  z iy r ow
ZERO(0.5)  z ih r ow
ONE(1.0)   w ah n
TWO(1.0)   tcl t uw


Learning pronunciations

‘Phone recognizer’ transcribes training data as phones

- align to ‘canonical’ pronunciations

Baseform phoneme string:  f ay v y iy r ow l d
Surface phone string:     f ah ay v y uh r ow l

- infer modification rules
- predict other pronunciation variants

e.g. ‘d deletion’:

d → ∅ | l _ stop    p = 0.9

Generate pronunciation variants; use forced alignment to find weights


Grammar

Account for different likelihoods of different words and word sequences p(w_i | M_j)

‘True’ probabilities are very complex for LVCSR

- need parses, but speech is often agrammatic

→ Use n-grams:

p(w_n | w_1^{n−1}) = p(w_n | w_{n−K}, ..., w_{n−1})

e.g. n-gram models of Shakespeare:

n=1: To him swallowed confess hear both. Which. Of save on ...
n=2: What means, sir. I confess she? then all sorts, he is trim, ...
n=3: Sweet prince, Falstaff shall die. Harry of Monmouth’s grave...
n=4: King Henry. What! I will go seek the traitor Gloucester. ...

Big win in recognizer WER

- raw recognition results often highly ambiguous
- grammar guides to ‘reasonable’ solutions


Smoothing LVCSR grammars

n-grams (n = 3 or 4) are estimated from large text corpora

- 100M+ words
- but: not like spoken language

100,000 word vocabulary → 10^15 trigrams!

- never see enough examples
- unobserved trigrams should NOT have Pr = 0!

Backoff to bigrams, unigrams

- p(w_n) as an approx to p(w_n | w_{n−1}) etc.
- interpolate 1-gram, 2-gram, 3-gram with learned weights? (see the sketch below)
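A toy sketch of interpolation at the bigram level, so unseen bigrams never get probability zero (the weight λ = 0.7 is an arbitrary illustrative choice; real systems learn such weights on held-out data):

```python
from collections import Counter

def train(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    def p(w, prev, lam=0.7):
        p_uni = unigrams[w] / total                    # unigram estimate
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni          # interpolated estimate
    return p

p = train("the cat sat on the mat".split())
print(p("cat", "the"))   # interpolated p(cat | the) = 0.4
```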

Lots of ideas, e.g. category grammars

- p(PLACE | “went”, “to”) · p(w_n | PLACE)
- how to define categories?
- how to tag words in training corpus?


Decoding

How to find the MAP word sequence?

States, pronunciations, words define one big HMM

- with 100,000+ individual states for LVCSR!

→ Exploit hierarchic structure

- phone states independent of word
- next word (semi) independent of word history

[Figure: pronunciation prefix tree from root, sharing initial phones among DO, DECOY, DECODE, DECODES, DECODER]


Decoder pruning

Searching ‘all possible word sequences’?

- need to restrict search to most promising ones: beam search
- sort by estimates of total probability = Pr(so far) + lower bound estimate of the remainder
- trade search errors for speed

Start-synchronous algorithm (sketched below):

- extract top hypothesis from the queue: [P_n, {w_1, ..., w_k}, n] (probability so far, words, next time frame)
- find plausible words {w^i} starting at time n → new hypotheses: [P_n · p(X_n^{n+N−1} | w^i) · p(w^i | w_k, ...), {w_1, ..., w_k, w^i}, n + N]
- discard if too unlikely, or if queue is too long
- else re-insert into queue and repeat
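A start-synchronous decoder sketch with a priority queue; plausible_words, score_word, and lm_prob are hypothetical stand-ins for real acoustic and language models, all working in negative log probability:

```python
import heapq

def decode(n_frames, plausible_words, score_word, lm_prob, beam=100):
    # hypothesis = (-log P so far, word sequence, next time frame)
    queue = [(0.0, (), 0)]
    while queue:
        neg_logp, words, n = heapq.heappop(queue)   # extract top hypothesis
        if n >= n_frames:
            return words                            # complete hypothesis
        hyps = []
        for w in plausible_words(n):                # words starting at frame n
            dur, ac_cost = score_word(w, n)         # N, -log p(X_n^{n+N-1} | w)
            cost = neg_logp + ac_cost + lm_prob(w, words)  # + -log p(w | history)
            hyps.append((cost, words + (w,), n + dur))
        for h in sorted(hyps)[:beam]:               # prune unlikely extensions
            heapq.heappush(queue, h)                # re-insert and repeat
    return ()
```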


Summary

Speech signal is highly variable

- need models that absorb variability
- hide what we can with robust features

Speech is modeled as a sequence of features

- need temporal aspect to recognition
- best time-alignment of templates = DTW

Hidden Markov models are a rigorous solution

- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Language modeling captures larger structure

- pronunciation, word sequences
- fits directly into HMM state structure
- need to ‘prune’ search space in decoding

Parting thought

Forward-backward trains to generate; can we train to discriminate?

