CS 188: Artificial Intelligence
Spring 2009
Lecture 23: Speech Recognition
4/14/2009
John DeNero – UC Berkeley
Slides adapted from Dan Klein
Announcements
Written 3 due on Thursday in lecture
Please don't be late; no slip days allowed
Project 4 posted
Due next Wednesday 4/22 at 11:59pm
Use up to two slip days
Course contest update
You can qualify for the tournament starting tonight
First person to submit an agent will make me happy
Today
Particle filter recap
Speech recognition using HMMs
Particle Filtering Review
An approximate technique for filtering: P( Xt | e1 , … , et )
Idea: always keep N guesses (samples) of the value of Xt
Initial samples, or particles, are drawn from the prior P(X1)
Three operations:
1) Elapse time: draw a sample for Xt+1 from each particle using P(Xt+1 | xt)
2) Observe: weight all particles by the likelihood of the evidence et
3) Resample: sample new particles in proportion to those weights
Steps 2 and 3 are how new evidence is incorporated.
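As a concrete illustration, here is a minimal Python sketch of one particle-filter update. The helper names transition_sample and emission_likelihood are hypothetical stand-ins for a real model, not anything from the course code:

    import random

    def particle_filter_step(particles, evidence,
                             transition_sample, emission_likelihood):
        # transition_sample(x) draws a successor state from P(Xt+1 | x);
        # emission_likelihood(e, x) returns P(e | x). Both are assumed helpers.

        # 1) Elapse time: push each particle through the transition model.
        particles = [transition_sample(x) for x in particles]

        # 2) Observe: weight each particle by the likelihood of the evidence.
        weights = [emission_likelihood(evidence, x) for x in particles]

        # 3) Resample: draw N new particles in proportion to the weights,
        #    independently and with replacement.
        return random.choices(particles, weights=weights, k=len(particles))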
Resampling Step Details
Each particle is already weighted by the evidence likelihood: P( et | xt )
We randomly choose particles in proportion to those weights
Probability of choosing a particle is proportional to its weight
Each new particle is chosen independently, with replacement
The probability of selecting a set of new particles is the product of the probabilities for each one
[Figure: two-state transition diagram over rain and sun, with self-transition probability 0.7 and cross-transition probability 0.3 for each state]

Emission model P(E | X):
X     E            P
rain  umbrella     0.9
rain  no umbrella  0.1
sun   umbrella     0.2
sun   no umbrella  0.8
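For instance, one observe-and-resample step with the table above might look like this in Python (the particle set is made up for illustration):

    import random

    # Emission likelihoods P(E | X) from the table above.
    emission = {
        ('rain', 'umbrella'): 0.9, ('rain', 'no umbrella'): 0.1,
        ('sun', 'umbrella'): 0.2, ('sun', 'no umbrella'): 0.8,
    }

    particles = ['rain', 'rain', 'sun', 'sun', 'sun']  # hypothetical sample set
    evidence = 'umbrella'

    # Observe: weight each particle by P(et | xt).
    weights = [emission[(x, evidence)] for x in particles]

    # Resample: choose independently, with replacement, proportional to weight.
    particles = random.choices(particles, weights=weights, k=len(particles))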
SLAM
SLAM = Simultaneous Localization And Mapping
We do not know the map or our location
Our belief state is over maps and positions!
Main techniques: Kalman filtering (Gaussian HMMs) and particles
[DEMOS]
DP-SLAM, Ron Parr
Hidden Markov Models
An HMM is Initial distribution:
Transitions:
Emissions:
Query: most likely seq:
X5X2
E1
X1 X3 X4
E2 E3 E4 E5
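To ground these definitions, the rain/sun umbrella model from the earlier slides can be written out as plain Python dictionaries. The uniform initial distribution is an assumption (the slides don't give one), and the transition numbers follow the 0.7 self-transition / 0.3 cross-transition reading of the diagram:

    # Rain/sun umbrella HMM as plain dictionaries.
    initial = {'rain': 0.5, 'sun': 0.5}  # assumed uniform prior

    transitions = {
        'rain': {'rain': 0.7, 'sun': 0.3},
        'sun': {'rain': 0.3, 'sun': 0.7},
    }

    emissions = {
        'rain': {'umbrella': 0.9, 'no umbrella': 0.1},
        'sun': {'umbrella': 0.2, 'no umbrella': 0.8},
    }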
State Path Trellis
State trellis: graph of states and transitions over time
Each arc represents some transition
Each arc has a weight
Each path is a sequence of states
The product of weights on a path is the sequence's probability
Can think of the Forward (and now Viterbi) algorithms as computing sums of all paths (best paths) in this graph
[Figure: state trellis with states sun and rain at each of four time steps]
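As a sketch, the Forward algorithm is a short dynamic program over this trellis. This uses the initial/transitions/emissions dictionaries written out above and is illustrative, not the course's reference code:

    def forward(initial, transitions, emissions, observations):
        # alpha[x] = total probability of all trellis paths ending in
        # state x that are consistent with the observations so far.
        alpha = {x: initial[x] * emissions[x][observations[0]] for x in initial}
        for e in observations[1:]:
            alpha = {
                x2: emissions[x2][e] * sum(alpha[x1] * transitions[x1][x2]
                                           for x1 in alpha)
                for x2 in initial
            }
        return sum(alpha.values())  # probability of the whole evidence sequence

    print(forward(initial, transitions, emissions, ['umbrella', 'umbrella']))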
Viterbi Algorithm
[Figure: Viterbi computation on the same sun/rain trellis]
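A sketch of Viterbi in the same style: identical in shape to Forward, but with max in place of sum, tracking the best path rather than the total probability. Again this assumes the dictionaries from the HMM slide:

    def viterbi(initial, transitions, emissions, observations):
        # best[x] = (probability, path) of the single best path ending in x.
        best = {x: (initial[x] * emissions[x][observations[0]], [x])
                for x in initial}
        for e in observations[1:]:
            best = {
                x2: max(
                    ((p * transitions[x1][x2] * emissions[x2][e], path + [x2])
                     for x1, (p, path) in best.items()),
                    key=lambda pair: pair[0],
                )
                for x2 in initial
            }
        return max(best.values(), key=lambda pair: pair[0])

    # e.g. viterbi(initial, transitions, emissions,
    #              ['umbrella', 'umbrella', 'no umbrella'])
    # returns the probability and state sequence of the best path.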
Example
Digitizing Speech
Speech in an Hour
Speech input is an acoustic waveform
[Figure: waveform of the phrase "speech lab", segmented into phones: s p ee ch l a b]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield: