Context and Motivation · • A uniform linear quantizer is called Pulse Code Modulation(PCM). • Pulse code modulation (PCM): Encoding the quantized signals into a digital word

Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 1


Context and Motivation

• What : Find an efficient representation of speech so that it can be transmitted with a minimum bandwidth, depending on the desired quality.

• How : Exploit the redundancy of the speech waveform.

• Applications :

– Telephony, PBX

– Wireless/Cellular Telephony

– Internet Telephony

– Speech Storage (Automated call-centers)

– Text-to-speech (machine generated speech)


Types of Coders

Speech coders fall into two main families:

– Waveform coders
» Time domain: PCM, ADPCM
» Frequency domain: sub-band coders, adaptive transform coders
– Vocoders
» Linear predictive coders
» Formant coders

• Waveform-based coders: preserve the signal waveform, not the speech.
– Pulse Code Modulation (PCM)
– Differential PCM (DPCM)
– Adaptive DPCM (ADPCM)

• Model-based coders: preserve the speech, not the waveform.
– LPC-10(e), Federal Standard 1015
– Mixed Excitation Linear Prediction (MELP)

• Hybrid coders
– Code Excited Linear Prediction (CELP)
– Vector Sum Excited Linear Prediction (VSELP)




Quantization

• Amplitude quantizing: mapping samples of a continuous-amplitude waveform to a finite set of amplitudes.

[Figure: a continuous signal and the corresponding discrete signal, with the quantized values marked.]


Uniform Quantizer

• A uniform linear quantizer is called Pulse Code Modulation (PCM).

• Pulse code modulation (PCM): Encoding the quantized signals into a digital word (PCM word or codeword).

– Each quantized sample is digitally encoded into an $l$-bit codeword, where $L$ is the number of quantization levels and $l = \log_2 L$.


Quantization example

[Figure: amplitude x(t) versus time t, showing the sampling time Ts, the sampled values x(nTs), the quantized values xq(nTs), the decision boundaries and the quantization levels.]

PCM codeword   Quantization level
111             3.1867
110             2.2762
101             1.3657
100             0.4552
011            -0.4552
010            -1.3657
001            -2.2762
000            -3.1867

PCM sequence: 110 110 111 110 100 010 011 100 100 011
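The mapping from samples to PCM codewords can be sketched as follows. This is a minimal illustration, assuming nearest-level quantization; the function name `pcm_encode` and the sample values are hypothetical (the slide gives only the levels and the resulting codewords), chosen so that the output matches the PCM sequence of the example.

```python
def pcm_encode(samples, levels):
    """Map each sample to the nearest quantization level and its binary codeword."""
    bits = (len(levels) - 1).bit_length()  # l = log2(L) when L is a power of two
    codewords, quantized = [], []
    for x in samples:
        idx = min(range(len(levels)), key=lambda i: abs(levels[i] - x))
        codewords.append(format(idx, f"0{bits}b"))
        quantized.append(levels[idx])
    return codewords, quantized

# Quantization levels from the example; codeword 000 is the lowest level.
LEVELS = [-3.1867, -2.2762, -1.3657, -0.4552, 0.4552, 1.3657, 2.2762, 3.1867]

# Hypothetical sampled values x(nTs), not taken from the slide.
samples = [1.9, 2.6, 3.4, 2.1, 0.3, -1.2, -0.6, 0.1, 0.7, -0.2]

codewords, quantized = pcm_encode(samples, LEVELS)
print(" ".join(codewords))  # 110 110 111 110 100 010 011 100 100 011
```

Note that the sample 3.4 lies outside the outermost level and is simply clamped to codeword 111.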


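Quantization Error

• Quantizing error: the difference between the output and the input of a quantizer:

$$e(t) = \hat{x}(t) - x(t)$$

[Figure: process of quantizing noise. The input $x(t)$ passes through an AGC stage and the quantizer $y = q(x)$ to give $\hat{x}(t)$; equivalently, the error $e(t) = \hat{x}(t) - x(t)$ is added to $x(t)$.]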


Quantization error …

• Quantizing error:
– Granular or linear errors happen for inputs within the dynamic range of the quantizer.
– Saturation errors happen for inputs outside the dynamic range of the quantizer.
» Saturation errors are larger than linear errors.
» Saturation errors can be avoided by proper tuning of the AGC.

• Quantization noise variance:

$$\sigma_q^2 = E\{[x - q(x)]^2\} = \sigma_{Lin}^2 + \sigma_{Sat}^2$$

[Figure: quantizer transfer characteristic (value of output signal versus value of input signal) together with the quantizing error (output - input).]


Quantization error

[Figure: three plots of amplitude (in quantization levels) versus time (0 to 1 ms), with amplitude axes spanning about ±3, ±15 and ±250 levels.]


Quantization error

• The choice between a “mid-tread” and a “mid-riser” quantizer design is significant when large quantizing steps are used.

– Mid-tread has zero output unless the analog input exceeds half the voltage step size, so background noise is suppressed, but it produces worse quantizing error at low voice levels.

– Mid-riser produces worse idle-channel noise by amplifying the minuscule background room noise or circuit noise, but has less average quantizing noise at low signal levels.

• Quantizing error can be characterized as an equivalent additive quantizing “noise”.

[Figure: quantizer output code value versus analog voltage for the mid-tread and the mid-riser characteristics.]
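The difference between the two designs near zero input can be sketched in a few lines. This is only an illustration under the usual textbook definitions (mid-tread rounds to multiples of Δ, mid-riser rounds to odd multiples of Δ/2); the function names and test inputs are hypothetical.

```python
import math

def mid_tread(x, delta):
    """Output levels at 0, ±Δ, ±2Δ, ...; zero is a valid output level."""
    return delta * math.floor(x / delta + 0.5)

def mid_riser(x, delta):
    """Output levels at ±Δ/2, ±3Δ/2, ...; zero is never output."""
    return delta * (math.floor(x / delta) + 0.5)

delta = 1.0
# Small inputs stand in for idle-channel noise (hypothetical values).
for x in (-0.3, -0.05, 0.05, 0.3):
    print(x, mid_tread(x, delta), mid_riser(x, delta))
```

For these small inputs the mid-tread output stays at zero, while the mid-riser output toggles between -Δ/2 and +Δ/2, which is the idle-channel noise behavior described above.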


– The quantization noise is characterized as a realization of a stationary random process $q$ in which each of the random variables $q(n)$ has a uniform pdf:

$$-\frac{\Delta}{2} \le q(n) \le \frac{\Delta}{2}, \qquad \mathrm{pdf}(q) = \frac{1}{\Delta}$$

$$\sigma_q^2 = E\{[q(n)]^2\} = \int_{-\infty}^{\infty} q^2 \cdot \mathrm{pdf}(q)\, dq$$

» Where the step size of the quantizer is

$$\Delta = \frac{A_{max}}{2^B}, \qquad B: \text{number of bits}$$


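Quantization Error

$$\Delta = \frac{A_{max}}{2^B}$$

– $A_{max}$: maximum swing of the signal.

– The mean square value of the quantization error is:

$$E[q^2(n)] = \int_{-\Delta/2}^{\Delta/2} q^2(n) \cdot \frac{1}{\Delta}\, dq = \frac{1}{\Delta} \cdot \left.\frac{q^3(n)}{3}\right|_{-\Delta/2}^{\Delta/2} = \frac{\Delta^2}{12} = \frac{A_{max}^2}{12 \times 2^{2B}}$$

– For the case of $A_{max} = 1$, the mean square value of the quantization noise in dB is:

$$10\log_{10}\frac{\Delta^2}{12} = 10\log_{10}\frac{2^{-2B}}{12} = -6B - 10.8 \text{ dB}$$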


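Quantization SNR

When the quantized sample is expressed in binary form,

$$L = 2^B, \qquad B = \log_2 L,$$

where $B$ is the number of bits per sample. With step size

$$\Delta = \frac{A_{max}}{2^B} = \frac{2\, m_{max}}{2^B},$$

the quantization noise variance is

$$\sigma_Q^2 = \frac{1}{3}\, m_{max}^2\, 2^{-2B}.$$

Let $P$ denote the average power of $m(t)$. Then

$$SNR_o = \frac{P}{\sigma_Q^2} = \left(\frac{3P}{m_{max}^2}\right) 2^{2B}$$

$$SNR_{dB} = 10\log_{10}\left[\left(\frac{3P}{m_{max}^2}\right) 2^{2B}\right] = 6B + 10\log_{10}\left(\frac{3P}{m_{max}^2}\right)$$

Each additional bit therefore adds about 6 dB of SNR (6 dB per bit).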
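The SNR formula can be checked numerically. The sketch below is an illustration only: it assumes a mid-riser uniform quantizer spanning [-m_max, m_max] and a full-scale sinusoidal test tone (average power P = m_max²/2); the function names and the tone frequency are hypothetical choices, not part of the lecture.

```python
import math

def uniform_quantize(x, bits, m_max):
    """Mid-riser uniform quantizer with 2**bits levels over [-m_max, m_max]."""
    levels = 2 ** bits
    delta = 2.0 * m_max / levels
    idx = math.floor(x / delta)
    idx = max(-levels // 2, min(levels // 2 - 1, idx))  # clip to the valid range
    return (idx + 0.5) * delta

def measured_snr_db(bits, m_max=1.0, n=20000):
    """Quantize a full-scale test tone and measure signal power over noise power."""
    sig_pow = noise_pow = 0.0
    for k in range(n):
        x = m_max * math.sin(2 * math.pi * 0.1234567 * k)  # hypothetical test tone
        q = uniform_quantize(x, bits, m_max)
        sig_pow += x * x
        noise_pow += (q - x) ** 2
    return 10 * math.log10(sig_pow / noise_pow)

def predicted_snr_db(bits, power, m_max=1.0):
    """SNR_dB = 10 log10[(3P / m_max^2) * 2^(2B)]."""
    return 10 * math.log10(3 * power / m_max ** 2 * 2 ** (2 * bits))

for b in (8, 10, 12):
    print(b, measured_snr_db(b), predicted_snr_db(b, power=0.5))
```

The predicted values differ by about 6 dB per bit, and the measured values should follow them closely as long as the quantizer is not saturated.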


How many bits?

• 16-bit resolution is more than is needed for telephone purposes.

– The voice waveform has already been band-limited to ~3.5 kHz bandwidth.
– Filter imperfections add about -30 dB noise.
– The carbon microphone is not high-fidelity.
– Extra bits cost more in hardware, in precision of design and manufacture, and in transmission.

• Empirical listener testing indicates that about 12-13 bits of uniform resolution is adequate.

– No perception of degradation in telephone voice quality.

• Logarithmically compressed (“companded”) steps at low levels permit equivalent quality with even fewer bits.


Types of Quantization

– Uniform (linear) quantizing:
• No assumption about the amplitude statistics and correlation properties of the input.
• Robust to small changes in input statistics, because it is not finely tuned to a specific set of input parameters.
• Simple to implement.

– Non-uniform quantizing:
• Uses the input statistics to tune the quantizer parameters.
• Larger SNR than uniform quantizing with the same number of levels.
• Non-uniform intervals in the dynamic range with the same quantization noise variance.
• Commonly used for speech.


Statistics of Speech Signals

• In speech, weak signals are more frequent than strong ones.

• Using equal step sizes (uniform quantizer) gives low $(S/N)_q$ for weak signals and high $(S/N)_q$ for strong signals.

– Thus, adjusting the step size of the quantizer by taking into account the speech statistics improves the SNR over the input range.

[Figure: probability density function versus normalized magnitude of the speech signal (0 to 3.0).]


Non-Uniform Quantizer

[Figure: four panels plotted against the input signal: uniform transfer characteristic, non-uniform transfer characteristic, uniform error characteristic and non-uniform error characteristic.]


Uniform vs Non-Linear Quantizing


Non-uniform quantization

• It is done by uniformly quantizing the “compressed” signal.
• At the receiver, an inverse compression characteristic, called “expansion”, is employed to avoid signal distortion.

compression + expansion = companding

[Figure: companding chain. Transmitter: $x(t)$ → Compress, $y = C(x)$ → $y(t)$ → Quantize → $\hat{y}(t)$ → Channel. Receiver: Expand → $\hat{x}(t)$.]


μ Law/A Law

• The μ-law algorithm (μ-law) is a companding algorithm, primarily used in the digital telecommunication systems of North America and Japan. Its purpose is to reduce the dynamic range of an audio signal. In the analog domain, this can increase the signal to noise ratio achieved during transmission, and in the digital domain, it can reduce the quantization error (hence increasing signal to quantization noise ratio).

• The A-law algorithm is used in the rest of the world.

• The A-law algorithm provides a slightly larger dynamic range than the μ-law, at the cost of worse proportional distortion for small signals. By convention, A-law is used for an international connection if at least one country uses it.
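As an illustration of companding, the sketch below applies the continuous μ-law characteristic, $y = \mathrm{sgn}(x)\,\ln(1 + \mu|x|)/\ln(1 + \mu)$ with μ = 255, around a uniform quantizer. It is a simplified model under stated assumptions: inputs normalized to [-1, 1], a mid-riser uniform quantizer, and hypothetical function names. It is not the segmented bit-level encoding specified in G.711.

```python
import math

MU = 255.0  # value used in North American and Japanese telephony

def mu_compress(x, mu=MU):
    """Compress x in [-1, 1]: y = sgn(x) * ln(1 + mu|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress: x = sgn(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def quantize_uniform(y, bits):
    """Mid-riser uniform quantizer on [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    delta = 2.0 / levels
    idx = max(-levels // 2, min(levels // 2 - 1, math.floor(y / delta)))
    return (idx + 0.5) * delta

def compand(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))

x = 0.01  # a weak input sample (hypothetical)
print(abs(quantize_uniform(x, 8) - x))  # error with plain 8-bit uniform quantizing
print(abs(compand(x, 8) - x))           # error with mu-law companding
```

For weak inputs the companded error is noticeably smaller than the uniform one, which is the improvement in signal to quantization noise ratio mentioned above.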


μ Law


A Law


μ Law/A Law

[Figure: μ-law and A-law compression characteristics plotted against |x|.]


A Companding Law (Europe - ITU)

[Figure: piecewise-linear A-law characteristic mapping an 11-bit input to an 8-bit output.]


Companding in music recording

• Recall the human ear’s “masking” phenomenon.
– A noise signal is not perceived as objectionable unless it is sufficiently large in relation to a desired sound present simultaneously.
– Small noises are objectionable in a quiet library.
– The same small noise is imperceptible at a rock concert!

• This principle is the basis of noise reduction systems like the Dolby™ system for sound recording.

– The recording audio level is automatically increased for soft passages.
– The playback level is automatically reduced to match, via an auxiliary control signal, so the desired signal has the original loudness. In the Dolby system, this is typically a low-frequency control signal.

– Therefore, noise added by the recording medium (e.g., magnetic tape “hiss”) is not noticeable during “soft” music intervals.

– Dolby systems treat different audio frequency bands separately (high frequency is noisiest in magnetic tape), and use different types of auxiliary signals (Dolby B, C, etc.).


G.711

• The most commonplace codec
– Used in the circuit-switched telephone network
– PCM, Pulse-Code Modulation

• If uniform quantization
– 12 bits * 8 k samples/sec = 96 kbps

• Non-uniform quantization
– 64 kbps DS0 rate
– mu-law
» North America
– A-law
» Other countries, a little friendlier to lower signal levels
– An MOS of about 4.3


Differential PCM

• Basic idea
– Since speech signals are slowly varying, it is possible to eliminate the temporal redundancy by prediction.
– For many natural signals, the difference between successive samples quantizes better than the samples themselves.
– Even better, predict the current sample from the past one(s) and transmit the prediction error to the decoder on the other side.

• Linear prediction
– Fixed: the same predictor is used again and again.
– Adaptive: the predictor is adjusted on the fly.


First-order Prediction

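- Encoding: $x_1, x_2, \ldots, x_N$

$$e_1 = x_1$$

$$e_n = x_n - x_{n-1}, \qquad n = 2, \ldots, N$$

- Decoding: $e_1, e_2, \ldots, e_N$

$$x_1 = e_1$$

$$x_n = e_n + x_{n-1}, \qquad n = 2, \ldots, N$$

[Figure: DPCM loop. Encoder: $x_n$ minus its delayed copy $x_{n-1}$ (delay D) gives $e_n$. Decoder: $e_n$ plus the delayed output $x_{n-1}$ (delay D) gives $x_n$.]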
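The encoding and decoding recursions above translate directly into code. The sketch below is a minimal illustration without any quantizer, so the round trip is lossless; the function names and sample values are hypothetical.

```python
def dpcm_encode(x):
    """e1 = x1; en = xn - x(n-1) for n = 2..N."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]

def dpcm_decode(e):
    """x1 = e1; xn = en + x(n-1) for n = 2..N."""
    x = [e[0]]
    for n in range(1, len(e)):
        x.append(e[n] + x[n - 1])
    return x

samples = [90, 92, 91, 93, 93, 95]   # illustrative values
residual = dpcm_encode(samples)
print(residual)                       # [90, 2, -1, 2, 0, 2]
print(dpcm_decode(residual))          # [90, 92, 91, 93, 93, 95]
```

After the first sample the residual values are much smaller than the samples themselves, which is why the difference signal needs fewer quantization levels.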


Open-loop DPCM

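[Figure: open-loop DPCM. Encoder: $e_n = x_n - x_{n-1}$ (delay D), followed by the quantizer Q, which outputs $\hat{e}_n$. Decoder: $\hat{x}_n = \hat{e}_n + \hat{x}_{n-1}$ (delay D).]

Note:
• Prediction is based on the past unquantized sample.
• But quantization is located outside the DPCM loop.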


DPCM

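[Figure: closed-loop DPCM. Coder: $e(n) = x(n) - \hat{x}(n-1)$ is quantized to $\hat{e}(n)$ and sent over the communication channel; a predictor fed by $\hat{x}(n) = \hat{e}(n) + \hat{x}(n-1)$ supplies $\hat{x}(n-1)$. Decoder: the received $\tilde{e}(n)$ is added to the output of an identical predictor to give $\hat{x}(n)$.]

Bring the quantizer into the ‘prediction loop’.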


Numerical Example

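$x(n)$:        90  92  91  93  93  95  …
$e(n)$:        90   2  -2   3   0   2
$\hat{e}(n)$:  90   3  -3   3   0   3
$\hat{x}(n)$:  90  93  90  93  93  96  …

Quantizer: $Q(x) = \left[\frac{x}{3}\right] \cdot 3$, where $[\cdot]$ denotes rounding to the nearest integer.

[Figure: signal flow. The coder forms $e(n) = x(n) - \hat{x}(n-1)$ (a - b) and quantizes it with Q to get $\hat{e}(n)$; the reconstruction is $\hat{x}(n) = \hat{e}(n) + \hat{x}(n-1)$ (a + b).]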
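The numbers in this example can be reproduced with a short closed-loop DPCM sketch. It assumes the quantizer rounds to the nearest multiple of 3 and that the first sample is sent unquantized, as the table suggests; the function names are hypothetical.

```python
def quantize(e, step=3):
    """Q(x) = [x / 3] * 3, with [.] taken as rounding to the nearest integer."""
    # Integer inputs divided by 3 never land exactly on .5, so round() is unambiguous.
    return round(e / step) * step

def dpcm_closed_loop(x, step=3):
    """Return (e, e_hat, x_hat); prediction uses the previous reconstructed sample."""
    e, e_hat, x_hat = [x[0]], [x[0]], [x[0]]  # first sample is sent as-is
    for n in range(1, len(x)):
        err = x[n] - x_hat[n - 1]             # e(n) = x(n) - x_hat(n-1)
        q = quantize(err, step)               # e_hat(n) = Q[e(n)]
        e.append(err)
        e_hat.append(q)
        x_hat.append(x_hat[n - 1] + q)        # x_hat(n) = e_hat(n) + x_hat(n-1)
    return e, e_hat, x_hat

e, e_hat, x_hat = dpcm_closed_loop([90, 92, 91, 93, 93, 95])
print(e)      # [90, 2, -2, 3, 0, 2]
print(e_hat)  # [90, 3, -3, 3, 0, 3]
print(x_hat)  # [90, 93, 90, 93, 93, 96]
```

Because the prediction is formed from the reconstructed samples, the reconstruction error never exceeds the quantizer error of a single step and does not accumulate.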


DPCM

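A: $e_n = x_n - \hat{x}_{n-1}$

B: $\hat{x}_n = \hat{e}_n + \hat{x}_{n-1}$

Combining A and B: $x_n - \hat{x}_n = e_n - \hat{e}_n$

The distortion due to quantization of the prediction residue $e_n$ is identical to the distortion introduced to the original sample $x_n$.

[Figure: DPCM coder with the quantizer inside the prediction loop; point A is the prediction error $e(n)$ at the quantizer input and point B is the reconstructed sample $\hat{x}(n)$ at the predictor input.]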


Higher Order Prediction

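- Encoding

initialize: $e_n = x_n, \qquad n = 1, \ldots, k$

prediction: $e_n = x_n - \sum_{i=1}^{k} a_i x_{n-i}, \qquad n = k+1, \ldots, N$

- Decoding

initialize: $x_n = e_n, \qquad n = 1, \ldots, k$

prediction: $x_n = e_n + \sum_{i=1}^{k} a_i x_{n-i}, \qquad n = k+1, \ldots, N$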


DPCM

[Figure: DPCM coder and decoder with a predictor in the loop on each side of the communication channel; signals $x(n)$, $e(n)$, $\hat{e}(n)$ and $\tilde{x}(n)$.]

Prediction of the current sample from past estimated ones:

$$\tilde{x}_n = \sum_{i=1}^{k} a_i \tilde{x}_{n-i}$$

Prediction error:

$$e_n = x_n - \tilde{x}_n$$

Estimate of the current sample = predicted value + prediction error.


Linear predictor coefficients

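The coefficients minimize the mean squared prediction error

$$MSE = \sum_{n=1}^{N} e^2(n) = \sum_{n=1}^{N} \left[ x(n) - \sum_{k=1}^{K} a_k x(n-k) \right]^2$$

which leads to the normal equations, where $R(i)$ is the auto-correlation of $x(n)$ at lag $i$:

$$\begin{bmatrix} R(0) & R(1) & \cdots & R(K-1) \\ R(1) & R(0) & \cdots & R(K-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(K-1) & R(K-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(K) \end{bmatrix}$$

Note that in fixed prediction, the auto-correlation is calculated over the whole segment of speech (NOT short-time features).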
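One way to solve the normal equations is the Levinson-Durbin recursion, which exploits the Toeplitz structure of the auto-correlation matrix. The lecture does not say which solver is used, so the sketch below is an assumption for illustration; the function names and the auto-correlation values are hypothetical.

```python
def autocorrelation(x, max_lag):
    """R(i) = sum over n of x(n) * x(n - i), computed over the whole segment."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations; returns ([a_1..a_K], residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m, using the order m-1 solution.
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        err *= 1.0 - k * k
    return a[1:], err

# Hypothetical auto-correlation values R(0), R(1), R(2), for illustration only.
r = [1.0, 0.5, 0.2]
coeffs, residual_energy = levinson_durbin(r, order=2)
print(coeffs, residual_energy)
```

Substituting the returned coefficients back into the matrix equation reproduces the right-hand side R(1), R(2), which is a quick way to check the solver.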


Adaptive DPCM

• Forward adaptation

– The prediction parameters are estimated from the current speech data, which is available only at the transmitter. The quantized prediction coefficients are transmitted to the decoder as side information.

• Backward adaptation

– The parameters are estimated from past data, which is available at both transmitter and receiver, so there is no need for side information (no overhead), but the operation is susceptible to transmission errors.


Forward / Backward Adaptation

Forward adaptive prediction                          | Backward adaptive prediction
Asymmetric complexity allocation (encoder > decoder) | Symmetric complexity allocation (encoder = decoder)
Overhead non-negligible                              | No overhead
Robust to errors                                     | Sensitive to errors
More suitable for low-bit-rate coding                | More suitable for high-bit-rate coding


Adaptive DPCM with Forward Adaptation

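[Figure: ADPCM with forward adaptation. Encoder: the speech input minus the output of an adaptive predictor of order p feeds an adaptive quantizer; predictor adaptation and step size adaptation blocks send side info to the decoder. Decoder: an inverse quantizer $Q^{-1}$ and an adaptive predictor loop, both driven by the received side info, produce the speech output.]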


Adaptive DPCM with Backward Adaptation


Short and long-term ADPCM

If we wish to model both the short-term and the long-term predictability of speech, we can use a single predictor of order $P$:

$$P(z) = \sum_{k=1}^{P} a_k z^{-k}$$

But modeling the pitch periodicity of speech as well as the short-term redundancy would require a very large order, P = 50 to 100.

Instead, the predictor is split into two portions, one modeling the short-term redundancy of speech and one modeling the long-term redundancy due to pitch periodicity. The long-term predictor can be a single-coefficient filter of the form:

$$A_L(z) = \beta \cdot z^{-M}$$

β is a scaling factor that relates to the degree of periodicity of the waveform, and M is the estimated pitch period (in samples). The predictor time response is a single impulse delayed by M samples. The synthesis filter is of the form:

$$S_L(z) = \frac{1}{1 - A_L(z)} = \frac{1}{1 - \beta \cdot z^{-M}}$$

[Figure: frequency response of the synthesis filter, |H(f)| versus frequency (0 to 4000 Hz).]

Peaks are spaced by 1/M in normalized frequency (that is, Fs/M in Hz).

The width of the peaks is a function of β, which can be estimated as

$$\beta = \frac{E[x(n)\, x(n-M)]}{E[x^2(n-M)]}$$
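The estimate of β can be sketched by replacing the expectations with sums over a segment. The lag search shown here (picking the lag with the largest normalized correlation) is one common choice and is an assumption, since the lecture does not specify how M is found; the function names and the synthetic test signal are hypothetical.

```python
import math

def long_term_gain(x, lag):
    """beta = sum x(n) x(n-M) / sum x(n-M)^2, a sample estimate of the ratio of expectations."""
    num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
    den = sum(x[n - lag] ** 2 for n in range(lag, len(x)))
    return num / den if den > 0 else 0.0

def estimate_pitch(x, min_lag, max_lag):
    """Pick the lag M with the largest normalized correlation, then compute beta."""
    def score(lag):
        num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        e1 = sum(x[n] ** 2 for n in range(lag, len(x)))
        e2 = sum(x[n - lag] ** 2 for n in range(lag, len(x)))
        return num / math.sqrt(e1 * e2) if e1 > 0 and e2 > 0 else 0.0
    best = max(range(min_lag, max_lag + 1), key=score)
    return best, long_term_gain(x, best)

# Synthetic "voiced" signal: period of 40 samples, each period scaled by 0.9.
period = 40
x = [0.9 ** (n / period) * (math.sin(2 * math.pi * n / period)
                            + 0.5 * math.sin(4 * math.pi * n / period))
     for n in range(400)]

M, beta = estimate_pitch(x, 20, 60)
print(M, beta)
```

For this synthetic signal every sample is 0.9 times the sample one period earlier, so the search should return the true period and a gain close to 0.9.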



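[Figure: ADPCM with long-term prediction. Encoder: the outputs of the long-term predictor $A_L(z)$ and the short-term predictor $A(z)$ are subtracted from the input $s(n)$, and the residual is quantized by Q to give $e_q(n)$. Decoder: $Q^{-1}$ recovers $e_q(n)$, and the contributions of $A(z)$ and $A_L(z)$ are added back to produce the speech output.]

$$A_L(z) = \beta \cdot z^{-M}, \qquad A(z) = \sum_{k=1}^{P} a_k z^{-k}$$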


Higher-order LT predictor

The true pitch period is unlikely to be an exact multiple of 1/Fs. Thus, a multi-tap predictor can be used to better synthesize the pitch periods:

$$A_L(z) = \beta_1 \cdot z^{-M+1} + \beta_2 \cdot z^{-M} + \beta_3 \cdot z^{-M-1}$$

Another way to deal with the varying degree of voicing across the spectrum (the lower spectrum is more harmonically pronounced than the higher) is to treat separate bands separately. This allows the pitch predictor in different bands to have different β: high values, which give narrow peaks, for the lower frequencies, and lower values for the less periodic higher frequencies.

[Figure: the signal is split into a high band and a low band, each with its own LT prediction.]


ITU-T G.726 - Adaptive Differential Pulse Code Modulation (ADPCM)

[Figure: block diagrams of the G.726 encoder and decoder.]


ITU-T G.722 7 kHz Audio Coding within 64 kbit/s

Simultaneous speech and data transmission is possible, with data rate BD = 8 or 16 kbit/s and B + BD = 64 kbit/s.

Overall signal delay: 1.5 ms.

ADPCM (G.726-like) coding in both subbands, with w = 4, 5 or 6 bits (32, 40 or 48 kbit/s) in the lower subband and w = 2 bits (16 kbit/s) in the higher subband.


ITU waveform coders

– G.711 (64 kbps)
– G.726 (32 kbps)
– G.722 (48 kbps)

Demo: http://www-lns.tf.uni-kiel.de/demo/demo_speech.htm


Delta Modulation : (DM)

• Predictor: one-step delay function; the prediction of $u(n)$ is the previous reconstructed sample $\tilde{u}(n-1)$.

• Quantizer: 1-bit quantizer

$$e(n) = u(n) - \tilde{u}(n-1)$$

$$\tilde{e}(n) = Q_{1\text{-bit}}[e(n)]$$


Delta Modulation : (DM)

• Primary limitations of DM
– Slope overload: large jump region
» Max. slope = (step size) × (sampling freq.)
– Granular noise: almost constant region
– Instability to channel noise

[Figure: input $u(n)$ and its staircase approximation $\tilde{u}(n)$.]


DM:

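[Figure: DM coder and decoder. Coder: $e(n)$, the difference between $u(n)$ and the delayed reconstruction $\tilde{u}(n-1)$ (unit delay Ts), passes through the 1-bit quantizer to give $\tilde{e}(n)$; an integrator accumulates $\tilde{e}(n)$ into $\tilde{u}(n)$. Decoder: an identical integrator with unit delay Ts rebuilds $\tilde{u}(n)$ from $\tilde{e}(n)$.]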
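A delta modulator of this form can be simulated in a few lines. The sketch below is an illustration under assumptions: the integrator starts at zero, a bit of 1 means a step of +Δ and 0 a step of -Δ, and the test tone (1 kHz sampled at 60 kHz) and the function names are hypothetical.

```python
import math

def dm_encode(u, delta):
    """Return the 1-bit stream; the coder tracks the same staircase as the decoder."""
    bits, approx = [], 0.0
    for sample in u:
        bit = 1 if sample - approx >= 0 else 0  # 1-bit quantizer applied to e(n)
        approx += delta if bit else -delta      # integrator (accumulator)
        bits.append(bit)
    return bits

def dm_decode(bits, delta):
    """Rebuild the staircase approximation from the bit stream."""
    out, approx = [], 0.0
    for bit in bits:
        approx += delta if bit else -delta
        out.append(approx)
    return out

fs, f0, amp = 60000.0, 1000.0, 1.0
u = [amp * math.sin(2 * math.pi * f0 * n / fs) for n in range(240)]

def max_error(delta):
    rebuilt = dm_decode(dm_encode(u, delta), delta)
    return max(abs(a - b) for a, b in zip(u, rebuilt))

print(max_error(0.12))  # step above 2*pi*f0*A/fs (about 0.105): the staircase tracks the tone
print(max_error(0.02))  # step too small: slope overload
```

With the larger step the error stays within one step size (granular noise only), while the smaller step cannot follow the slope of the tone and the error grows to a large fraction of the amplitude.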


DM

Step size effect: for a given sampling frequency, the step size determines (i) slope overload and (ii) granular noise.


DM – step size conditions

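• The choice of step size is crucial to successful performance in DM. Since the output magnitude can change only by Δ in each sample interval T, Δ must be large enough to accommodate rapid changes:

$$\frac{\Delta}{T} \ge \max\left|\frac{dx(t)}{dt}\right| \approx \max\frac{|x(n) - x(n-1)|}{T}$$

For sinusoidal signals, $x(t) = A \cdot \cos(2\pi f_0 t)$:

$$\frac{\Delta}{T} \ge \max\left|\frac{dx(t)}{dt}\right| = 2\pi f_0 A$$

$$A \le \frac{\Delta}{2\pi f_0 T} = \frac{\Delta \cdot f_s}{2\pi f_0}$$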


DM – step size example

– Q. Consider a speech signal with a maximum frequency of 3.4 kHz and a maximum amplitude of 1 volt. This speech signal is applied to a delta modulator whose bit rate is set at 60 kbit/sec. What is an appropriate step size for the modulator?

– Bandwidth of the signal = 3.4 kHz
– Maximum amplitude = 1 volt
– Bit rate = 60 kbit/sec
– Sampling rate = 60 k samples/sec (one bit per sample)
– Step size = 0.356 volts, from

$$\Delta \ge 2\pi f_0 A T_s = \frac{2\pi \times 3400 \times 1}{60000} \approx 0.356$$
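The arithmetic of this example can be checked with a one-line helper. The function name `min_step_size` is hypothetical; it simply evaluates the slope-overload condition for a sinusoid at the highest signal frequency.

```python
import math

def min_step_size(amplitude, f0_hz, fs_hz):
    """Smallest step avoiding slope overload for a sinusoid: 2*pi*f0*A*Ts, with Ts = 1/fs."""
    return 2 * math.pi * f0_hz * amplitude / fs_hz

step = min_step_size(amplitude=1.0, f0_hz=3400.0, fs_hz=60000.0)
print(round(step, 3))  # 0.356
```

Doubling the sampling rate halves the required step size, which reduces granular noise at the cost of a higher bit rate.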


Adaptive DM:

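[Figure: adaptive delta modulator. The input $X_{k+1}$ and the unit-delayed $X_k$ give the error $E_{k+1}$ and the output $s_{k+1}$; an adaptive function uses the stored values $E_k$, $\Delta_k$ and $\Delta_{min}$ to set the next step size $\Delta_{k+1}$.]

Variable step size:
– Input signal is varying fast: step size is increased.
– Input signal is varying slowly: step size is reduced.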
