Automatic Speech Recognition: Introduction. Readings: Jurafsky & Martin 7.1-2; HLT Survey, Chapter 1.


Dec 22, 2015



Martina Hensley
Transcript
Page 1: Automatic Speech Recognition Introduction Readings: Jurafsky & Martin 7.1-2 HLT Survey Chapter 1.

Automatic Speech Recognition: Introduction

Readings: Jurafsky & Martin 7.1-2

HLT Survey Chapter 1

Page 2:

The Human Dialogue System

Page 3:

The Human Dialogue System

Page 4:

Computer Dialogue Systems

[Figure: pipeline of a spoken dialogue system]

Audition → Automatic Speech Recognition → Natural Language Understanding → Dialogue Management → Planning → Natural Language Generation → Text-to-speech

(signal → words → logical form on the input side; words → signal on the output side)

Page 5:

Computer Dialogue Systems

[Figure: the same pipeline, abbreviated]

Audition → ASR → NLU → Dialogue Mgmt. → Planning → NLG → Text-to-speech

(signal → words → logical form; words → signal)

Page 6:

Parameters of ASR Capabilities

• Different types of tasks with different difficulties
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small: < 20 words / large: > 20,000 words)
– Language model (finite state / context sensitive)
– Perplexity (small: < 10 / large: > 100)
– Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
– Transducer (high-quality microphone / telephone)
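Perplexity, listed above as a difficulty parameter, is roughly the average branching factor the recognizer faces under its language model. A minimal sketch of the computation (the probability values are illustrative, not from the slides):

```python
import math

def perplexity(probs):
    """Perplexity of a sequence of per-word model probabilities:
    2 ** (average negative log2 probability of the sequence)."""
    h = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** h

# A model that always assigns each word probability 1/10 has perplexity 10,
# i.e. it is as uncertain as choosing among 10 equally likely words:
pp = perplexity([0.1] * 5)
```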

Page 7:

The Noisy Channel Model

message → [noisy channel] → message

Message + Channel = Signal

Decoding model: find Message* = argmax P(Message | Signal). But how do we represent each of these things?

Page 8:

ASR using HMMs

• Try to solve P(Message | Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (phones)
– Assume that phones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words

Page 9:

HMMs: The Traditional View

[Figure: HMM for "go home"]

go home
g o h o m — Markov model backbone composed of phones (hidden because we don't know the correspondences)
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 — acoustic observations

Each line represents a probability estimate (more later)

Page 10:

HMMs: The Traditional View

[Figure: HMM for "go home"]

go home
g o h o m — Markov model backbone composed of phones (hidden because we don't know the correspondences)
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 — acoustic observations

Even with the same word hypothesis, we can have different alignments. Also, we have to search over all word hypotheses.

Page 11:

HMMs as Dynamic Bayesian Networks

[Figure: DBN for "go home"]

go home
q0=g q1=o q2=o q3=o q4=h q5=o q6=o q7=o q8=m q9=m — Markov model backbone composed of phones
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 — acoustic observations

Page 12:

HMMs as Dynamic Bayesian Networks

[Figure: DBN for "go home"]

go home
q0=g q1=o q2=o q3=o q4=h q5=o q6=o q7=o q8=m q9=m — Markov model backbone composed of phones
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 — acoustic observations

ASR: What is the best assignment to q0…q9 given x0…x9?

Page 13:

Hidden Markov Models & DBNs

[Figure: DBN representation vs. Markov model representation]

Page 14:

Parts of an ASR System

[Figure: components of an ASR system]

Feature Calculation → Acoustic Modeling (k @) → Pronunciation Modeling → Language Modeling → SEARCH → "The cat chased the dog"

Pronunciation model: cat: k@t; dog: dog; mail: mAl; the: D&, DE; …
Language model: cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …

Page 15:

Parts of an ASR System

[Figure: components of an ASR system, annotated with their roles]

Feature Calculation — produces acoustics (x_t)
Acoustic Modeling (k @) — maps acoustics to phones
Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) — maps phones to words
Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …) — strings words together

Page 16:

Feature calculation

Page 17:

Feature calculation

[Figure: spectrogram, frequency vs. time]

Find the energy at each time step in each frequency channel

Page 18:

Feature calculation

[Figure: spectrogram, frequency vs. time]

Take the inverse Discrete Fourier Transform to decorrelate the frequencies
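The decorrelation step is commonly realized as a type-II DCT (the real, symmetric form of the inverse DFT) applied to the log channel energies, yielding cepstral coefficients. A minimal sketch; the channel values and coefficient count are illustrative, not from the slides:

```python
import math

def dct(log_energies, num_ceps):
    """Type-II DCT of log channel energies -> cepstral coefficients.
    Adjacent frequency channels are highly correlated; the DCT
    approximately decorrelates them, so the first few coefficients
    carry most of the information."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for k in range(num_ceps)]

# One frame of log energies in 6 frequency channels (toy values):
frame = [2.0, 2.1, 1.9, 0.5, 0.4, 0.6]
ceps = dct(frame, 3)   # keep the 3 lowest-order coefficients
```

Coefficient 0 is just the sum of the log energies (overall frame energy); higher coefficients capture progressively finer spectral shape.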

Page 19:

Feature calculation

Input: the digitized speech signal
Output: a sequence of feature vectors, e.g.
[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …] [0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …] [-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …] [0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]

Page 20:

Robust Speech Recognition

• Different schemes have been developed for dealing with noise and reverberation
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)
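Cepstral mean subtraction, mentioned above for convolutional noise, fits in a few lines: a fixed linear channel adds a constant vector to every cepstral frame, so subtracting the per-utterance mean removes it. A sketch on toy frames (values illustrative only):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance.
    A fixed linear filter shifts every cepstral frame by the same
    constant, so removing the mean removes the filter's effect."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# Toy 2-coefficient cepstral frames (not real speech):
clean = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]])
```

After subtraction, each coefficient averages to zero over the utterance, whatever constant offset the channel added.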

Page 21:

Now what?

[Feature vectors: [-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …] [0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …] [-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …] [0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]]

→ "That you …" ???

Page 22:

Machine Learning!

[Feature vectors: [-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …] [0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …] [-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …] [0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]]

→ "That you …" — pattern recognition with HMMs

Page 23:

Hidden Markov Models (again!)

P(acoustics_t | state_t): Acoustic Model

P(state_{t+1} | state_t): Pronunciation/Language models

Page 24:

Acoustic Model

[Feature vectors labelled with phones dh a a t]

• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks):

N_a(μ, Σ) ≈ P(X | state = a)
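The per-phone Gaussian idea above can be sketched directly. This is a minimal illustration with a diagonal covariance and toy 2-D "acoustic" vectors; all numbers are made up:

```python
import math

def fit_gaussian(vectors):
    """Mean and per-dimension variance from labelled examples of one phone."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(d)]
    return mean, var

def log_likelihood(x, mean, var):
    """log N(x; mean, diag(var)) -- stands in for log P(x | state = a)."""
    return sum(-0.5 * (math.log(2 * math.pi * var[i])
                       + (x[i] - mean[i]) ** 2 / var[i])
               for i in range(len(x)))

# Fit on toy examples of phone "a", then score a near and a far vector:
mean_a, var_a = fit_gaussian([[1.0, 0.0], [1.2, 0.2], [0.8, -0.2]])
near = log_likelihood([1.0, 0.0], mean_a, var_a)
far = log_likelihood([5.0, 5.0], mean_a, var_a)
```

A vector close to the training examples scores a higher log-likelihood than a distant one, which is exactly what the decoder needs from the acoustic model.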

Page 25:

Building up the Markov Model

• Start with a model for each phone

• Typically, we use 3 states per phone to give a minimum duration constraint, but ignore that here…

[Figure: single-phone models — a state "a" with self-loop probability p and transition (exit) probability 1−p; four such models shown]

Page 26:

Building up the Markov Model

• Pronunciation model gives connections between phones and words

• Multiple pronunciations:

[Figure: phone models chained into a word — dh (self-loop p_dh, exit 1−p_dh) → a (p_a, 1−p_a) → t (p_t, 1−p_t) — plus a multiple-pronunciation network branching over alternative vowels (ow/ey, ah) within one word]

Page 27:

Building up the Markov Model

• Language model gives connections between words (e.g., bigram grammar)

[Figure: bigram connections between words]

dh a t → h iy, with probability p(he | that)
dh a t → y uw, with probability p(you | that)
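The bigram probabilities p(he | that) and p(you | that) above come from relative counts in a training corpus. A minimal maximum-likelihood sketch (toy sentences, no smoothing; real systems smooth to handle unseen bigrams):

```python
from collections import Counter

def train_bigrams(sentences):
    """Maximum-likelihood bigram probabilities:
    P(w2 | w1) = count(w1 w2) / count(w1 as a history)."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            histories[w1] += 1
            bigrams[(w1, w2)] += 1
    return lambda w1, w2: bigrams[(w1, w2)] / histories[w1]

# Toy corpus: "that" is followed by "he" twice and "you" once.
p = train_bigrams([["that", "he", "left"],
                   ["that", "you", "left"],
                   ["that", "he", "won"]])
```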

Page 28:

ASR as Bayesian Inference

[Figure: states q1 q2 q3 of word w1 emitting observations x1 x2 x3; word network: "th a t" → "h iy" with p(he|that), "th a t" → "y uw" with p(you|that); "h iy" → "sh uh d"]

argmax_W P(W|X)
= argmax_W P(X|W) P(W) / P(X)
= argmax_W P(X|W) P(W)
= argmax_W Σ_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)

Page 29:

ASR Probability Models

• Three probability models
– P(X|Q): acoustic model
– P(Q|W): duration/transition/pronunciation model
– P(W): language model
• Language/pronunciation models inferred from prior knowledge
• Other models learned from data (how?)

Page 30:

Parts of an ASR System

[Figure: components of an ASR system with their probability models]

Feature Calculation → Acoustic Modeling (k @): P(X|Q) → Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …): P(Q|W) → Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …): P(W) → SEARCH → "The cat chased the dog"

Page 31:

EM for ASR: The Forward-Backward Algorithm

• Determine "state occupancy" probabilities
– I.e., assign each data vector to a state
• Calculate new transition probabilities and new means & standard deviations (emission probabilities) using the assignments
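The occupancy computation in the first step can be sketched for a small discrete HMM. All numbers below are toy values, and the emission likelihoods are given directly rather than computed from Gaussian models:

```python
def forward_backward(obs_ll, trans, init):
    """State-occupancy probabilities gamma[t][s] for an HMM.
    obs_ll[t][s] = P(x_t | state s); trans[r][s] = P(s | r)."""
    T, S = len(obs_ll), len(init)
    # Forward pass: alpha[t][s] = P(x_1..x_t, q_t = s)
    alpha = [[0.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = init[s] * obs_ll[0][s]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = obs_ll[t][s] * sum(
                alpha[t - 1][r] * trans[r][s] for r in range(S))
    # Backward pass: beta[t][s] = P(x_{t+1}..x_T | q_t = s)
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(trans[s][r] * obs_ll[t + 1][r] * beta[t + 1][r]
                             for r in range(S))
    # Combine and normalize: gamma[t][s] = P(q_t = s | x_1..x_T)
    gamma = []
    for t in range(T):
        z = sum(alpha[t][s] * beta[t][s] for s in range(S))
        gamma.append([alpha[t][s] * beta[t][s] / z for s in range(S)])
    return gamma

gamma = forward_backward(
    obs_ll=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],   # P(x_t | state)
    trans=[[0.7, 0.3], [0.3, 0.7]],                 # P(state' | state)
    init=[0.5, 0.5])
```

These soft assignments are then used as weights when re-estimating the transition probabilities and the Gaussian means and variances (the M-step of EM).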

Page 32:

ASR as Bayesian Inference

[Figure: states q1 q2 q3 of word w1 emitting observations x1 x2 x3; word network: "th a t" → "h iy" with p(he|that), "th a t" → "y uw" with p(you|that); "h iy" → "sh uh d"]

argmax_W P(W|X)
= argmax_W P(X|W) P(W) / P(X)
= argmax_W P(X|W) P(W)
= argmax_W Σ_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)

Page 33:

Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
– All possible word sequences W
– All possible segmentations/alignments of W & X
• Generally, this is done by searching the space of W
– Viterbi search: dynamic programming approach that looks for the most likely path
– A* search: alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
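The Viterbi search above can be sketched for a tiny discrete HMM. The probabilities are illustrative; real recognizers work in log space, search over word networks, and prune aggressively:

```python
def viterbi(obs_ll, trans, init):
    """Most likely state path via dynamic programming.
    obs_ll[t][s] = P(x_t | state s); trans[r][s] = P(s | r)."""
    T, S = len(obs_ll), len(init)
    delta = [init[s] * obs_ll[0][s] for s in range(S)]
    back = []                      # backpointers for path recovery
    for t in range(1, T):
        new_delta, ptr = [], []
        for s in range(S):
            # Best predecessor state for landing in s at time t:
            best_r = max(range(S), key=lambda r: delta[r] * trans[r][s])
            ptr.append(best_r)
            new_delta.append(delta[best_r] * trans[best_r][s] * obs_ll[t][s])
        back.append(ptr)
        delta = new_delta
    # Trace back from the best final state:
    path = [max(range(S), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Observations strongly suggest state 0, then 0, then 1:
path = viterbi(obs_ll=[[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]],
               trans=[[0.8, 0.2], [0.2, 0.8]],
               init=[0.5, 0.5])
```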

Page 34:

How to train an ASR system

• Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar
• Train acoustic models
– Possibly realigning the corpus phonetically

Page 35:

How to train an ASR system

• Test on your development data (baseline)
• **Think real hard
• Figure out some neat new modification
• Retrain the system component
• Test on your development data
• Lather, rinse, repeat **
• Then, at the end of the project, test on the test data.

Page 36:

Judging the quality of a system

• Usually, ASR performance is judged by the word error rate:
ErrorRate = 100 * (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I

100 * (1 Sub + 1 Ins + 1 Del) / 5 = 60%
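The error counts come from a minimum-edit-distance (Levenshtein) alignment between the reference and the recognized word strings. A standard dynamic-programming sketch that reproduces the slide's 60% example:

```python
def word_error_rate(ref, rec):
    """100 * (substitutions + insertions + deletions) / len(ref),
    with the error counts taken from a Levenshtein alignment."""
    R, H = len(ref), len(rec)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                    # i deletions to match empty hypothesis
    for j in range(H + 1):
        d[0][j] = j                    # j insertions to match empty reference
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != rec[j - 1])
            d[i][j] = min(sub,         # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[R][H] / R

# The slide's example: 1 deletion (I), 1 substitution (TO->TWO), 1 insertion (NOW)
wer = word_error_rate("I WANT TO GO HOME".split(),
                      "WANT TWO GO HOME NOW".split())
```

Note that WER can exceed 100% when the recognizer inserts many spurious words, since the denominator is the reference length.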

Page 37:

Judging the quality of a system

• Usually, ASR performance is judged by the word error rate
• This assumes that all errors are equal
– Also, a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate