
Human Speech Communication

Jan 15, 2016

Transcript
Page 1: Human Speech Communication

message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message

Human Speech Communication

Page 2: Human Speech Communication

PCM (Pulse Code Modulation)

• Transmit value of each speech sample
– dynamic range of speech is about 50-60 dB
• 11 bits/sample
– maximum frequency in telephone speech is 3.4 kHz
• sampling frequency 8 kHz

8000 x 11 = 88 kb/s

Simple and universal but not very efficient
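The arithmetic on this slide can be checked with a short sketch. The rule of roughly 6.02 dB per bit for a uniform quantizer is an assumption brought in here (the slide does not state it), and the function names are illustrative only.

```python
import math

def bits_for_dynamic_range(dynamic_range_db: float) -> int:
    """Smallest bit count whose uniform quantizer spans the given range.

    Assumes each extra bit adds 20*log10(2) ~ 6.02 dB of range.
    """
    db_per_bit = 20 * math.log10(2)
    return math.ceil(dynamic_range_db / db_per_bit)

def pcm_bit_rate(sampling_hz: int, bits_per_sample: int) -> int:
    """Bit rate of plain PCM: one fixed-size code word per sample."""
    return sampling_hz * bits_per_sample

# Nyquist: sampling must exceed twice the highest frequency (3.4 kHz).
min_sampling_hz = 2 * 3400

bits_60_db = bits_for_dynamic_range(60)   # 10 under the 6.02 dB/bit rule
rate = pcm_bit_rate(8000, 11)             # 88000 b/s, as on the slide

print(min_sampling_hz, bits_60_db, rate)
```

Under this rule 11 bits cover about 66 dB, slightly more than the 50-60 dB quoted above, and 8 kHz sits comfortably above the 6.8 kHz Nyquist minimum.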

Page 3: Human Speech Communication

Better quantization ?

• Less quantization noise for weaker signals

[Figure: quantizer input-output characteristic; axes IN and OUT]

Page 4: Human Speech Communication

Logarithmic PCM (μ-law, A-law)

[Figure: μ-law and A-law compression curves]

• Finer quantization for each individual small amplitude sample
– how about small signal samples surrounded by large ones?
– it is the instantaneous signal energy which should determine the step
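A minimal sketch of logarithmic companding, to make "finer quantization for small amplitudes" concrete. It uses the continuous μ-law curve with μ = 255 and an 8-bit uniform quantizer; both values are assumptions (the slide gives no parameters), and this is not the segmented table used in real telephone codecs.

```python
import numpy as np

MU = 255.0  # assumed value, not taken from the slide

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] through the mu-law curve."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def quantize_uniform(v, bits):
    """Uniform mid-rise quantizer on [-1, 1]."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = np.clip(np.floor(v / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

def snr_db(clean, noisy):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - noisy) ** 2))

# A weak test tone (1 % of full scale), quantized with 8 bits both ways.
t = np.arange(800)
weak = 0.01 * np.sin(2 * np.pi * 0.013 * t + 0.3)

uniform_out = quantize_uniform(weak, 8)
mulaw_out = mu_expand(quantize_uniform(mu_compress(weak), 8))

snr_uniform = snr_db(weak, uniform_out)
snr_mulaw = snr_db(weak, mulaw_out)
print(f"uniform: {snr_uniform:.1f} dB, mu-law: {snr_mulaw:.1f} dB")
```

The step still depends only on each individual sample, which is exactly the limitation raised in the sub-bullets above.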

Page 5: Human Speech Communication

?

Page 6: Human Speech Communication

Differential coding

• For many natural signals, the difference between successive samples quantizes better than samples themselves

• Even better, predict the current sample from the past ones and transmit the error of the prediction

[Figure: waveform with the current sample marked; axis: time]
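The first bullet can be illustrated on a synthetic signal. The sketch below uses low-pass filtered noise as a stand-in for a "natural signal" (an assumption; no speech data is involved) and compares the variance of the samples with the variance of their successive differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated stand-in signal: white noise through a one-pole low-pass filter.
n = 8000
noise = rng.standard_normal(n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.9 * x[i - 1] + noise[i]

diff = np.diff(x)            # difference between successive samples

var_x = x.var()
var_d = diff.var()
gain_db = 10 * np.log10(var_x / var_d)
print(f"sample variance {var_x:.2f}, difference variance {var_d:.2f}, "
      f"gain {gain_db:.1f} dB")
```

A smaller variance means the same quantizer step count covers the difference signal with a finer step, which is the sense in which it "quantizes better". For an uncorrelated signal the difference would have a larger variance, so the benefit depends on sample-to-sample correlation.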

Page 7: Human Speech Communication

Differential predictive coding

• DPCM
– a single predictor reflecting global predictability of speech
– predictor order up to 4-5
– delta modulation - gross quantization of prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
• adaptive DPCM
– new predictor for every new speech block
– predictor needs to be transmitted together with the prediction error
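A minimal DPCM sketch with a fixed first-order predictor. The predictor coefficient 0.9, the step size 0.1, and the unbounded integer codes are all simplifying assumptions; a real coder would use a higher-order predictor and a limited code range.

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.1):
    """Quantize the prediction error; the predictor runs on decoded values."""
    codes = np.empty(len(x), dtype=int)
    prev = 0.0  # what the decoder will have reconstructed for the last sample
    for n, sample in enumerate(x):
        pred = a * prev
        codes[n] = int(np.round((sample - pred) / step))
        prev = pred + codes[n] * step
    return codes

def dpcm_decode(codes, a=0.9, step=0.1):
    """Rebuild the signal from the quantized prediction errors."""
    y = np.empty(len(codes))
    prev = 0.0
    for n, c in enumerate(codes):
        prev = a * prev + c * step
        y[n] = prev
    return y

# Example on a slowly varying test signal.
t = np.arange(400)
signal = np.sin(2 * np.pi * t / 100.0)
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
max_error = np.max(np.abs(signal - decoded))
print("largest code magnitude:", np.max(np.abs(codes)))
print("max reconstruction error:", max_error)
```

Because the encoder predicts from the decoder's reconstruction rather than from the clean input, quantization errors do not accumulate: the reconstruction error stays within half a step.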

Page 8: Human Speech Communication

Speech Coders

Page 9: Human Speech Communication

Linear model of speech production

Page 10: Human Speech Communication

A.G. Bell got it almost right

Page 11: Human Speech Communication

linear model of speech

[Figure: source → filter → speech; annotation: changes slowly]

Page 12: Human Speech Communication

[Figure: waveform with the current sample marked; short-term prediction and long-term prediction spans along the time axis]

short-term - resonance of vocal tract
long-term - periodicity of voiced speech (vocal cord vibration)
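Short-term prediction can be sketched with the autocorrelation method of linear prediction. The code below solves the normal equations directly with NumPy and is checked on a synthetic second-order autoregressive signal; the test signal and its coefficients are assumptions chosen for illustration, not speech.

```python
import numpy as np

def lpc_autocorr(x, order):
    """Coefficients a[0..p-1] such that x[n] ~ sum_k a[k-1] * x[n-k]."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz autocorrelation matrix of the normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def prediction_error(x, a):
    """Residual left after subtracting the short-term prediction."""
    x = np.asarray(x, dtype=float)
    pred = np.zeros_like(x)
    for k in range(1, len(a) + 1):
        pred[k:] += a[k - 1] * x[:-k]
    return x - pred

# Synthetic resonant signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n].
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros(len(e))
for n in range(2, len(e)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]

a_hat = lpc_autocorr(x, order=2)
residual = prediction_error(x, a_hat)
print("estimated coefficients:", a_hat)
print("variance before/after prediction:", x.var(), residual.var())
```

Long-term (pitch) prediction follows the same idea but predicts from a sample one pitch period back; it is not shown here.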

Page 13: Human Speech Communication

LPC vocoder

• The same principle as in H. Dudley’s Vocoder
• Used by US Government (LPC-10) - 2.4 kb/s

Page 14: Human Speech Communication

Residual Excited LPC (RELP)

• Transmitter:
– simplify prediction error (low-pass filter and down-sample)
• Receiver:
– re-introduce high frequencies in the simplified residual (nonlinear distortion)

Page 15: Human Speech Communication

Analysis-by-synthesis

• Identical synthesizer in coder and in decoder
– change parameters in coder
– use for synthesizing speech
– compare synthesized speech with real speech
– when “close enough”, send parameters to the receiver
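The loop above can be sketched as a codebook search. Everything concrete here is an assumption made for illustration: a random excitation codebook, a fixed two-coefficient all-pole synthesizer, a squared-error criterion, and an exhaustive search that keeps the best candidate instead of stopping at "close enough".

```python
import numpy as np

def synthesize(excitation, a):
    """All-pole synthesis: y[n] = e[n] + sum_k a[k-1] * y[n-k]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y[n] = acc
    return y

def search(target, codebook, a):
    """Try every candidate excitation; return (index, gain, error) of the best."""
    best = (None, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        synth = synthesize(code, a)
        gain = np.dot(target, synth) / np.dot(synth, synth)  # optimal scaling
        err = np.sum((target - gain * synth) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best

rng = np.random.default_rng(2)
codebook = rng.standard_normal((16, 40))   # 16 candidate excitations
a = [1.3, -0.6]                            # synthesizer shared by both ends

# "Real speech" stand-in: built from entry 5 so the right answer is known.
target = 0.5 * synthesize(codebook[5], a)

best_index, best_gain, best_error = search(target, codebook, a)
print(best_index, round(best_gain, 3))
```

Only the index and the gain need to be sent; the decoder runs the same `synthesize` with the same codebook to rebuild the frame.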

Page 16: Human Speech Communication

Future in speech coding?

• No need to transmit what we do not hear
– study human hearing, especially masking
• No need to transmit what is predictable
– speech production mechanism
– speaker characteristics
– linguistic code (recognition-synthesis)
– thought-to-speech

Page 17: Human Speech Communication

Automatic recognition of speech

reduce information = decrease entropy

[Figure: electric signal (more than 50 kb/s) → linguistic message, phoneme string (below 50 b/s); knowledge enters as prior knowledge (textbook) and acquired knowledge (data)]

• Automatic speech recognition (ASR)
– derive proper response from speech stimulus
• Auditory perception
– how do biological systems respond to acoustic stimuli
• Knowledge of auditory perception ?

Page 18: Human Speech Communication

Principle of stochastic ASR

• Using a model of speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages
• Compare all generated sequences with the unknown acoustic input x to find which one is the most similar

$$w = \arg\max_{i} P\big(M(w_i) \mid x\big)$$

1. What is the model M(w_i) ?
2. Form of the data x ?

Page 19: Human Speech Communication

One (simple) model

[Figure: "hello world" modeled as a chain of speech-sound states]

• Two dominant sources of variability in speech
1. people say the same thing with different speeds (temporal variability)
2. different people sound different, communication environment different (feature variability)
• “Doubly stochastic” process (Hidden Markov Model)
– Speech as a sequence of hidden states - recover the state sequence
1. never know for sure in which state we are
2. never know for sure which data can be generated from a given state

Page 20: Human Speech Communication

Hidden Markov Model

[Figure: ten utterances of "hi" with f0 = 160, 170, 160, 170, 200, 110, 140, 240, 170, 190 Hz]

sequence of male and female groups?

m f m m m m m f m m

[Figure: two-state model with states m and f; labels pm, pf, pm-f, pf-m, p1m; the model P(sound|gender) plotted over f0]
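The male/female example can be turned into a small two-state HMM and decoded with the Viterbi algorithm. Only the f0 values come from the slide; the Gaussian emission model, its means and spreads, and the transition and initial probabilities are all illustrative assumptions.

```python
import numpy as np

f0 = np.array([160, 170, 160, 170, 200, 110, 140, 240, 170, 190], dtype=float)

states = ["m", "f"]
# Assumed model parameters (not from the slide).
means = np.array([120.0, 210.0])           # typical f0 per state, Hz
stds = np.array([30.0, 30.0])
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.7, 0.3],            # rows: from m, from f
                    [0.3, 0.7]])

def log_gauss(x, mean, std):
    """Log density of a Gaussian, standing in for P(sound | gender)."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def viterbi(obs):
    """Most likely hidden state sequence for the observations."""
    T, S = len(obs), len(states)
    delta = np.full((T, S), -np.inf)       # best log score ending in each state
    back = np.zeros((T, S), dtype=int)     # best predecessor of each state
    delta[0] = log_init + log_gauss(obs[0], means, stds)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_gauss(obs[t], means, stds)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

decoded = viterbi(f0)
print(" ".join(decoded))
```

The decoded sequence depends on the assumed parameters, so it need not match the labels shown on the slide; the point is that both the state and the data generated from it are uncertain, and the decoder trades the two off.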

Page 21: Human Speech Communication

[Figure: state sequence m f m f m over the observations 160 170 160 170 200 110 140 240 170 190; for speech, the states are units of speech (phonemes) and the observations are x]

What should x be?

Page 22: Human Speech Communication

Speech signal ?

• Reflects changes in acoustic pressure
– its original purpose is reconstruction of speech
– does carry relevant information
• Always also carries some irrelevant information
– additional processing is necessary to alleviate it

Page 23: Human Speech Communication

[Figure: histogram and correlations of the speech signal]

Page 24: Human Speech Communication

Where Is The Message ?

• it is in the spectrum !!

[Images: beer; Isaac Newton]

[Figure: averaged FFT spectra of some vowels (/uw/ /ao/ /ah/ /eh/ /ih/ /iy/) from 3 hours of fluent speech]

Page 25: Human Speech Communication

Inertia in engineering

Steam Engine (1769)
Internal Combustion Engine (2003)

Page 26: Human Speech Communication

Short-term Spectrum

[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/; for each 10-20 ms segment of the signal, get spectral components]
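A minimal short-term spectrum sketch in NumPy. The 20 ms frame falls in the 10-20 ms range given on the slide; the 10 ms hop and the Hamming window are assumptions, and the input is a synthetic tone, not speech.

```python
import numpy as np

def short_term_spectrum(signal, fs, frame_ms=20.0, hop_ms=10.0):
    """Power spectrum of successive windowed frames (rows: time, cols: frequency)."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs * hop_ms / 1000))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)   # 1 s of a 1 kHz tone
spec = short_term_spectrum(tone, fs)

bin_hz = fs / 160                    # 160 samples per 20 ms frame -> 50 Hz bins
peak_bins = np.argmax(spec, axis=1)
print(spec.shape, peak_bins[0] * bin_hz)
```

Each row is one column of the spectrogram in the figure; stacking the rows over time gives the time-frequency picture.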

Page 27: Human Speech Communication

[Figure: histogram and correlations of the short-term speech spectral envelope]

Page 28: Human Speech Communication

[Figure: histogram and correlations of the logarithmic short-term speech spectral envelope]

Page 29: Human Speech Communication

[Figure: histogram and correlations of the cosine transform of the logarithmic short-term speech spectral envelope (cepstrum)]
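The cepstrum named on this slide is the cosine transform of the log spectrum. A minimal sketch follows; the unnormalized DCT-II and the small floor that guards against log(0) are implementation choices assumed here, not details from the slides.

```python
import numpy as np

def dct_ii(v):
    """Unnormalized DCT-II of a 1-D vector, written out with NumPy only."""
    n = len(v)
    k = np.arange(n)[:, None]          # output (cepstral) index
    m = np.arange(n)[None, :]          # input (frequency) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ v

def cepstrum(power_spectrum, floor=1e-12):
    """Cosine transform of the logarithmic spectrum."""
    log_spec = np.log(np.maximum(power_spectrum, floor))
    return dct_ii(log_spec)

# A flat spectrum has no spectral shape, so only coefficient 0 is non-zero.
flat = np.full(16, 2.0)
c_flat = cepstrum(flat)
print(np.round(c_flat, 6))
```

Coefficient 0 carries the overall level, low-order coefficients the smooth envelope, and higher ones the fine detail, which is why the envelope is often kept as a handful of low-order values.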

Page 30: Human Speech Communication

What Is Wrong With the Short-term Spectrum ?

1) inconsistent (same message, different representation)

[Figure: short-term spectrum → auditory-like modifications → “auditory-like” spectrum; axis: frequency]

Page 31: Human Speech Communication

Pitch of the tone (Mel scale)

• Frequency resolution of human ear decreases with frequency
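One widely used analytic approximation of the Mel scale is sketched below. The formula (2595 log10(1 + f/700)) is an assumption brought in for illustration; the slide shows the scale only as a plot and does not give a formula.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Common approximation of perceived pitch (mel) as a function of Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers fewer mels at high frequencies.
low_step = hz_to_mel(300) - hz_to_mel(200)
high_step = hz_to_mel(4100) - hz_to_mel(4000)
print(round(low_step, 1), round(high_step, 1))
```

The shrinking step is the numerical counterpart of the bullet above: equal distances in Hz are perceived as smaller pitch differences as frequency rises.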

Page 32: Human Speech Communication

Emulating frequency resolution of human ear with FFT

[Figure: FFT spectrum (f, t) → “critical-band energy”]
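A rough sketch of turning an FFT power spectrum into critical-band energies. It warps frequency with the Bark approximation 6·asinh(f/600), the form used in the PLP literature, and sums FFT bins inside rectangular bands equally wide in Bark. The rectangular band shape and the choice of 15 bands are simplifying assumptions.

```python
import numpy as np

def hz_to_bark(f):
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def bark_to_hz(z):
    return 600.0 * np.sinh(np.asarray(z, dtype=float) / 6.0)

def critical_band_energies(power_spectrum, fs, n_bands=15):
    """Sum FFT power bins into bands of equal width on the Bark axis."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0.0, fs / 2.0, n_bins)        # rfft bin frequencies
    edges_hz = bark_to_hz(np.linspace(0.0, hz_to_bark(fs / 2.0), n_bands + 1))
    # Band index of each FFT bin; the top bin is folded into the last band.
    band = np.clip(np.searchsorted(edges_hz, freqs, side="right") - 1,
                   0, n_bands - 1)
    energies = np.bincount(band, weights=power_spectrum, minlength=n_bands)
    return energies, edges_hz

flat_power = np.ones(129)                  # e.g. from a 256-point FFT
energies, edges_hz = critical_band_energies(flat_power, fs=8000)
band_widths = np.diff(edges_hz)
print(np.round(band_widths).astype(int))
```

For a flat input the band energies grow with frequency simply because the bands get wider in Hz, which mirrors the coarser frequency resolution of hearing at high frequencies.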

Page 33: Human Speech Communication

Equal Loudness Curves

Page 34: Human Speech Communication
Page 35: Human Speech Communication

Perceptual Linear Prediction (PLP) [Hermansky 1990]

• Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model
– critical-band spectral resolution
– equal-loudness sensitivity
– intensity-loudness nonlinearity
• Today applied in virtually all state-of-the-art experimental ASR systems

Page 36: Human Speech Communication
Page 37: Human Speech Communication
Page 38: Human Speech Communication

Spectral Basis from LDA

[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/]

LDA gives basis for projection of spectral space

Page 39: Human Speech Communication

LDA vectors from Fourier Spectrum

Spectral resolution of LDA-derived spectral basis is higher at low frequencies

Critical bands of human hearing are narrower at lower frequencies

63 % 16 %

12 % 2 %

Page 40: Human Speech Communication

Sensitivity to Spectral Change

(Malayath 1999)

[Figure panels: Cosine basis; LDA-derived bases; Critical-band filterbank]

Page 41: Human Speech Communication

Combination of channel and signal spectrum should be as flat (as random-like) as possible.
– Shannon, Communication in the Presence of Noise (1949)

if signal could be controlled (e.g. in communication)
– put more signal where there is less noise
– sensory signal optimized for a given communication channel

if the receiver could be controlled
– put more resources (introduce less noise) where there is more signal
– biological system optimized for information extraction from sensory signals

[Figure: two panels over resource space, each showing energy of the signal and level of noise in the channel]
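"Put more signal where there is less noise" is usually formalized as water-filling: pour a fixed signal power over the noise profile until signal plus noise reaches a common level. The sketch below is a generic version of that idea, with made-up noise levels; it is not taken from the slides.

```python
import numpy as np

def water_fill(noise, total_power):
    """Split total_power over channels so that signal + noise is flat
    wherever any signal is placed. Returns (power per channel, water level)."""
    noise = np.asarray(noise, dtype=float)
    order = np.sort(noise)
    # Try to use all channels, then drop the noisiest until the level is valid.
    for k in range(len(order), 0, -1):
        level = (total_power + order[:k].sum()) / k
        if level > order[k - 1]:
            break
    return np.maximum(level - noise, 0.0), level

noise_levels = [1.0, 2.0, 5.0]           # illustrative channel noise
power, level = water_fill(noise_levels, total_power=3.0)
print(power, level)
```

In this example the noisiest channel gets no signal at all, and the two quieter channels end up with the same combined signal-plus-noise level, which is the "as flat as possible" condition.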

Page 42: Human Speech Communication

What Is Wrong With the Short-term Spectral Envelope?

2) Fragile (easily corrupted by minor disturbances)

[Figure: spectrum versus frequency f under two disturbances]

• additive band-limited noise - ignore the noisy parts of the spectrum
• linear (high-pass) filtering - remove means from parts of the spectrum
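Why mean removal helps against linear filtering: a fixed filter multiplies every frame's spectrum by the same H(f), which becomes an additive constant per frequency in the log spectrum. Subtracting the time average at each frequency cancels it. The sketch below reads "remove means" as this per-frequency log-spectral mean over time, which is an interpretation, and uses random numbers in place of real spectra.

```python
import numpy as np

def remove_log_spectral_mean(log_spec):
    """log_spec: frames x frequency bins. Subtract each bin's mean over time."""
    return log_spec - log_spec.mean(axis=0, keepdims=True)

rng = np.random.default_rng(3)
clean = rng.standard_normal((50, 20))       # stand-in log spectra
channel = rng.standard_normal(20)           # log |H(f)|^2 of a fixed filter

filtered = clean + channel                  # filtering adds in the log domain
difference = np.max(np.abs(
    remove_log_spectral_mean(filtered) - remove_log_spectral_mean(clean)
))
print(difference)
```

Additive noise does not become a constant offset in the log domain, which is why the slide treats it differently (ignore the affected bands).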

Page 43: Human Speech Communication

Simultaneous Masking

[Figure: tone at f masked by band-pass filtered noise centered at f; threshold of perception of the tone plotted against noise bandwidth, with the critical bandwidth marked]

• Nonlinear frequency resolution of hearing
– Critical bands
• up to ~600 Hz constant bandwidth
• above 1 kHz constant Q

Page 44: Human Speech Communication

More Important Outcome of Masking Experiments

• What happens outside the critical band does not affect detection of events within the band !!!

• Independent processing of parts of the spectrum ?

[Figure: spectrum S(frequency) split into bands, each yielding posteriors pf1 pf2 pf3 pf4 pf5 pf6, collected as {p(f)}]

Replace spectral vector by a matrix of posterior probabilities of acoustic events

( Hermansky, Sharma and Pavel 1996, Bourlard and Dupont 1996 )

Page 45: Human Speech Communication

What Is Wrong With the Short-term Spectral Envelope?

3) Coarticulation (inertia of organs of speech production)

[Figure: "hello world" sound units overlapping through coarticulation; human auditory perception]

Page 46: Human Speech Communication

Masking in Time

• suggests ~200 ms buffer in auditory system
– also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
– what happens outside this buffer does not affect detection of signal within the buffer

[Figure: masker followed by signal along the time axis (0 to 200 ms); a stronger masker gives a larger increase in threshold]

Page 47: Human Speech Communication

Short-term Features?

[Figure: data x taken from ~10 ms of signal along time, then processing]

longer time span ? (~250 ms?)

Page 48: Human Speech Communication

Cortical Receptive Fields

Average of the first two principal components (83% of variance) along temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)

• time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron

Page 49: Human Speech Communication

Data for Deriving Posterior Probabilities of Speech Events

[Figure: time-frequency plane (TIME [s] vs. FREQUENCY) with a patch spanning 1-3 critical bands and 250-1000 ms]

Page 50: Human Speech Communication

How to Get Estimates of Temporal Evolution of Spectral Energy ?
- with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)

[Figure: data x along time; conventional 10-20 ms frames versus 200-1000 ms spans over 1-3 Bark; all-pole model of part of time-frequency plane]

Page 51: Human Speech Communication

All-pole Model of Temporal Trajectory of Spectral Energy

conventional LP: the signal → signal power spectrum → all-pole model of the power spectrum
spectral domain LP: DCT of the signal → Hilbert envelope of the signal → all-pole model of the Hilbert envelope
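The quantity being modeled in the spectral-domain branch is the Hilbert envelope, the squared magnitude of the analytic signal. The sketch below computes it directly with an FFT as a reference; it does not perform the all-pole fit itself, and the amplitude-modulated test tone is an assumption chosen so the true envelope is known.

```python
import numpy as np

def hilbert_envelope(x):
    """Squared magnitude of the analytic signal, computed via the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist), double positive frequencies, zero negative ones.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * weights)
    return np.abs(analytic) ** 2

fs = 2000
t = np.arange(fs) / fs                          # 1 s of signal
modulation = 1.0 + 0.5 * np.cos(2 * np.pi * 5 * t)
carrier = np.cos(2 * np.pi * 200 * t)
env = hilbert_envelope(modulation * carrier)
max_dev = np.max(np.abs(env - modulation ** 2))
print(max_dev)
```

The envelope tracks the slow 5 Hz energy contour and discards the 200 Hz fine structure, which is the kind of temporal trajectory the all-pole model on this slide is meant to approximate.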

Page 52: Human Speech Communication

All-pole Models of Sub-band Energy Contours

[Figure: signal → discrete cosine transform, split into low-frequency and high-frequency parts; prediction on each part gives an all-pole model of the low-frequency Hilbert envelope and an all-pole model of the high-frequency Hilbert envelope]

Page 53: Human Speech Communication

Critical-band Spectrum From FFT

[Figure: axes time vs. tonality]

Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands

[Figure: axes time vs. tonality]

Page 54: Human Speech Communication

Putting It All Together

• TRAP-TANDEM
– data-guided features based on frequency-independent processing of relatively long spans of signal
• with S. Sharma, P. Jain, S. Sivadas, ICSI Berkeley and TU Brno

[Figure: time-frequency data → processing (trained NN) in each band, giving class posteriors → further processing (trained NN) → some function of phoneme posteriors]