Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 1
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 2
Context and Motivation
• What : Find an efficient representation of speech so that it can be transmitted with a minimum bandwidth, depending on the desired quality.
• How : Exploit the redundancy of the speech waveform.
• Applications :
– Telephony, PBX
– Wireless/Cellular Telephony
– Internet Telephony
– Speech Storage (Automated call-centers)
– Text-to-speech (machine generated speech)
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 3
Types of Coders
Speech Coders
– Waveform Coders
» Time Domain: PCM, ADPCM
» Frequency Domain: Sub-band coders, Adaptive transform coder
– Vocoders
» Linear Predictive Coder
» Formant Coders
• Waveform-based coders: preserve the signal waveform, not the speech.
– Pulse Code Modulation (PCM)
– Differential PCM (DPCM)
– Adaptive DPCM (ADPCM)
• Model-based coders: preserve the speech, not the waveform.
– LPC-10(e), Federal Standard 1015
– Mixed Excitation Linear Prediction (MELP)
• Hybrid coders
– Code Excited Linear Prediction (CELP)
– Vector Sum Excited Linear Prediction (VSELP)
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 5
Quantization
• Amplitude quantizing: Mapping samples of a continuous amplitude waveform to a finite set of amplitudes.
[Figure: a continuous signal and the corresponding discrete signal, plotted against the quantized values.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 6
Uniform Quantizer
• A uniform linear quantizer is called Pulse Code Modulation (PCM).
• Pulse code modulation (PCM): Encoding the quantized signals into a digital word (PCM word or codeword).
– Each quantized sample is digitally encoded into an $l$-bit codeword, where $L$ is the number of quantization levels and $l = \log_2 L$.
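The mapping from a sample to a PCM codeword can be sketched as follows. This is a minimal illustration, not taken from the slides: the function names, the `peak` parameter (input range assumed to be -peak to +peak) and the choice of reconstructing at the cell midpoint are all assumptions.

```python
import math

def pcm_encode(x, levels, peak=1.0):
    """Return the binary codeword (as a string) for sample x.

    Assumes `levels` is a power of two and the input range is [-peak, peak].
    """
    bits = int(math.log2(levels))           # l = log2(L) bits per codeword
    step = 2.0 * peak / levels              # uniform step over the full range
    index = int(math.floor((x + peak) / step))
    index = max(0, min(levels - 1, index))  # clip inputs outside the range
    return format(index, "0{}b".format(bits))

def pcm_decode(codeword, levels, peak=1.0):
    """Return the reconstruction value (cell midpoint) for a codeword."""
    step = 2.0 * peak / levels
    index = int(codeword, 2)
    return -peak + (index + 0.5) * step

if __name__ == "__main__":
    for sample in (-1.0, -0.3, 0.0, 0.7):
        word = pcm_encode(sample, levels=8)
        print(sample, word, pcm_decode(word, levels=8))
```

With 8 levels each codeword is 3 bits long, so every extra bit doubles the number of levels and halves the step.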
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 7
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 8
Quantization Error
• Quantizing error: The difference between the output and input of a quantizer
$$e(t) = \hat{x}(t) - x(t)$$
[Figure: process of quantizing noise. The input $x(t)$ passes through an AGC and the quantizer $y = q(x)$ to give $\hat{x}(t)$; the error is $e(t) = \hat{x}(t) - x(t)$.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 9
Quantization error …
• Quantizing error:
– Granular or linear errors happen for inputs within the dynamic range of the quantizer
– Saturation errors happen for inputs outside the dynamic range of the quantizer
» Saturation errors are larger than linear errors
» Saturation errors can be avoided by proper tuning of the AGC
• Quantization noise variance:
$$\sigma_q^2 = E\{[x - q(x)]^2\} = \sigma_{\mathrm{Lin}}^2 + \sigma_{\mathrm{Sat}}^2$$
[Figure: quantizer staircase, value of output signal versus value of input signal, with the quantizing error (output-input) plotted underneath.]
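The split of the noise variance into a granular (linear) part and a saturation part can be checked numerically. The sketch below is an illustration under assumptions: the quantizer is a mid-riser type with `levels` outputs, and the helper names are hypothetical.

```python
import math

def quantize(x, step, levels):
    """Uniform mid-riser quantizer that saturates outside its range."""
    half = levels // 2
    k = math.floor(x / step)
    k = max(-half, min(half - 1, k))   # saturation: clamp to the outer cells
    return (k + 0.5) * step

def error_variances(samples, step, levels):
    """Return (granular, saturation) mean squared error contributions."""
    x_max = step * levels / 2.0        # edge of the quantizer dynamic range
    lin = sat = 0.0
    for x in samples:
        err2 = (x - quantize(x, step, levels)) ** 2
        if abs(x) <= x_max:
            lin += err2                # input inside the range: granular error
        else:
            sat += err2                # input outside the range: saturation error
    n = len(samples)
    return lin / n, sat / n

if __name__ == "__main__":
    lin, sat = error_variances([0.1, 0.3, 5.0], step=0.25, levels=8)
    print("granular:", lin, "saturation:", sat, "total:", lin + sat)
```

A single sample far outside the range dominates the total, which is why the AGC is tuned to keep the input inside the quantizer range.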
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 10
Quantization error
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 11
Quantization error
[Figure: three plots of amplitude (in quantization levels) versus time (0 to 1 ms), with vertical ranges of ±3, ±15 and ±250 levels.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 12
Quantization error
• The choice between a “mid-tread” and a “mid-riser” quantizer design is significant when large quantizing steps are used.
– Mid-tread has zero output unless the analog input magnitude exceeds half the voltage step size, so background noise is suppressed, but it produces worse quantizing error at low voice levels.
– Mid-riser produces worse idle channel noise by amplifying the minuscule background room noise or circuit noise, but has less average quantizing noise at low signal levels.
• Quantizing error can be characterized as an equivalent additive quantizing “noise”.
[Figure: quantizer output code value versus analog voltage for the mid-tread and the mid-riser designs.]
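The difference between the two designs shows up most clearly on a very small input. The following sketch uses the usual rounding and truncating definitions; the function names are hypothetical and the ranges are left unbounded for simplicity.

```python
import math

def mid_tread(x, step):
    """Mid-tread: zero is an output level (rounding to the nearest multiple)."""
    return step * math.floor(x / step + 0.5)

def mid_riser(x, step):
    """Mid-riser: zero is a decision threshold, outputs are odd half-steps."""
    return step * (math.floor(x / step) + 0.5)

if __name__ == "__main__":
    step = 1.0
    idle_noise = [0.1, -0.1, 0.05, -0.2]   # small background noise
    print([mid_tread(v, step) for v in idle_noise])   # all zero: noise suppressed
    print([mid_riser(v, step) for v in idle_noise])   # toggles between +/- step/2
```

In this sketch the mid-riser turns idle noise of any size into a full half-step swing, while the mid-tread output stays at zero until the input magnitude reaches half a step.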
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 13
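Quantization error
– The quantization noise is characterized as a realization of a stationary random process $q$ in which each of the random variables $q(n)$ has a uniform pdf:
$$\mathrm{pdf}(q) = \frac{1}{\Delta}, \qquad -\frac{\Delta}{2} \le q \le \frac{\Delta}{2}$$
$$\sigma_q^2 = E\{[q(n)]^2\} = \int_{-\infty}^{\infty} q^2 \cdot \mathrm{pdf}(q)\, dq$$
» where the step size of the quantizer is
$$\Delta = \frac{A_{\max}}{2^B}, \qquad B\text{: number of bits}$$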
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 14
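Quantization Error
– $A_{\max}$: maximum swing of the signal, with $\Delta = \frac{A_{\max}}{2^B}$.
– The mean square value of the quantization error is:
$$E[q^2(n)] = \int_{-\Delta/2}^{\Delta/2} q^2(n) \cdot \frac{1}{\Delta}\, dq = \frac{1}{\Delta} \cdot \frac{q^3(n)}{3}\Big|_{-\Delta/2}^{\Delta/2} = \frac{\Delta^2}{12} = \frac{A_{\max}^2}{12 \times 2^{2B}}$$
– For the case of $A_{\max} = 1$, the mean square value of the quantization noise in dB is:
$$10 \log_{10} \frac{\Delta^2}{12} = 10 \log_{10} \frac{2^{-2B}}{12} = -6B - 10.8 \ \text{dB}$$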
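As a quick numeric check of the result above, the noise power can be evaluated directly from the step size. The function name is hypothetical; the step size follows the slide definition of the maximum swing divided by 2 to the power B.

```python
import math

def noise_power_db(bits, a_max=1.0):
    """Quantization noise power in dB, using step = a_max / 2**bits."""
    step = a_max / 2 ** bits
    return 10.0 * math.log10(step ** 2 / 12.0)

if __name__ == "__main__":
    for bits in (8, 12, 16):
        approx = -6.0 * bits - 10.8      # rounded rule from the slide
        print(bits, round(noise_power_db(bits), 2), approx)
```

For 8 bits this gives about -59 dB, against -58.8 dB from the rounded rule; the small gap comes from using 6 instead of 6.02 dB per bit.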
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 15
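Quantization SNR
When the quantized sample is expressed in binary form,
$$L = 2^B, \qquad B = \log_2 L,$$
where $B$ is the number of bits per sample. With step size
$$\Delta = \frac{A_{\max}}{2^B} = \frac{2 m_{\max}}{2^B},$$
the quantization noise power is
$$\sigma_Q^2 = \frac{1}{3} m_{\max}^2\, 2^{-2B}.$$
Let $P$ denote the average power of $m(t)$. Then
$$(SNR)_o = \frac{P}{\sigma_Q^2} = \frac{3P}{m_{\max}^2}\, 2^{2B}$$
$$(SNR)_{dB} = 10 \log_{10}\left[\frac{3P}{m_{\max}^2}\, 2^{2B}\right] = 6B + 10 \log_{10} \frac{3P}{m_{\max}^2}$$
Each additional bit adds about 6 dB of SNR (6 dB per bit).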
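The 6 dB per bit rule can be compared against a measurement on a test signal. The sketch below quantizes a full-scale sine wave (average power of half the squared peak) and compares the measured SNR with the formula; the sine frequency, the sample count and the function names are arbitrary choices made for illustration.

```python
import math

def predicted_snr_db(bits, power, m_max=1.0):
    """SNR from the formula 3 * P / m_max**2 * 2**(2B), in dB."""
    return 10.0 * math.log10(3.0 * power / m_max ** 2 * 2.0 ** (2 * bits))

def measured_snr_db(bits, n=20000, m_max=1.0):
    """Quantize a full-scale sine with 2**bits levels and measure the SNR."""
    levels = 2 ** bits
    half = levels // 2
    step = 2.0 * m_max / levels
    signal_energy = 0.0
    noise_energy = 0.0
    for i in range(n):
        x = m_max * math.sin(2.0 * math.pi * 0.01234567 * i)
        k = max(-half, min(half - 1, math.floor(x / step)))
        xq = (k + 0.5) * step            # mid-riser reconstruction level
        signal_energy += x * x
        noise_energy += (x - xq) ** 2
    return 10.0 * math.log10(signal_energy / noise_energy)

if __name__ == "__main__":
    for bits in (6, 8, 10):
        print(bits, round(measured_snr_db(bits), 2),
              round(predicted_snr_db(bits, power=0.5), 2))
```

The measured values should track the prediction closely, with small deviations because the error of a sine wave is not exactly uniform near its peaks.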
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 16
How many bits?
• 16 bits resolution is more than is needed for telephone purposes.
– The voice waveform has already been band-limited to ~3.5 kHz bandwidth
– Filter imperfections add about -30 dB noise
– Carbon microphone is not high-fidelity
– Extra bits cost more in hardware and precision of design and manufacture, and in transmission cost.
• Empirical listener testing indicates about 12-13 bits of uniform resolution is adequate
– No perception of degradation in telephone voice quality
• Logarithmically compressed (“companded”) steps at low level permit equivalent quality with even fewer bits
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 17
Types of Quantization
– Uniform (linear) quantizing:
• No assumption about amplitude statistics and correlation properties of the input.
• Robust to small changes in input statistics by not being finely tuned to a specific set of input parameters
• Simply implemented
– Non-uniform quantizing:
• Uses the input statistics to tune quantizer parameters
• Larger SNR than uniform quantizing with the same number of levels
• Non-uniform intervals in the dynamic range with same quantization noise variance
• Commonly used for speech
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 18
Statistics of Speech Signals
• In speech, weak signals are more frequent than strong ones.
• Using equal step sizes (uniform quantizer) gives low $(S/N)_q$ for weak signals and high $(S/N)_q$ for strong signals.
– Thus, adjusting the step size of the quantizer by taking into account the speech statistics improves the SNR over the input range.
[Figure: probability density function versus normalized magnitude of the speech signal (0 to 3.0).]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 19
Non-Uniform Quantizer
[Figure: transfer characteristics (output signal versus input signal) and error characteristics (error versus input signal) for a uniform and a non-uniform quantizer.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 20
Uniform vs Non-Linear Quantizing
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 21
Non-uniform quantization
• It is done by uniformly quantizing the “compressed” signal.
• At the receiver, an inverse compression characteristic, called “expansion”, is employed to avoid signal distortion.
• compression + expansion = companding
[Figure: companding chain. Transmitter: $x(t)$ is compressed with $y = C(x)$ to give $y(t)$, then quantized to give $\hat{y}(t)$ and sent over the channel. Receiver: $\hat{y}(t)$ is expanded to give $\hat{x}(t)$.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 22
μ Law/A Law
• The μ-law algorithm (μ-law) is a companding algorithm, primarily used in the digital telecommunication systems of North America and Japan. Its purpose is to reduce the dynamic range of an audio signal. In the analog domain, this can increase the signal to noise ratio achieved during transmission, and in the digital domain, it can reduce the quantization error (hence increasing signal to quantization noise ratio).
• The A-law algorithm is used in the rest of the world.
• The A-law algorithm provides a slightly larger dynamic range than the μ-law, at the cost of worse proportional distortion for small signals. By convention, A-law is used for an international connection if at least one country uses it.
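To make the companding idea concrete, here is a sketch of the continuous μ-law curve and its inverse. The formula and the value μ = 255 are the standard textbook ones and are not quoted from these slides; the function names are hypothetical, and this is the smooth curve, not the segmented 8-bit encoding used in telephony hardware.

```python
import math

MU = 255.0   # assumed value, the usual choice for 8-bit telephony

def mu_compress(x, mu=MU):
    """Compress x in [-1, 1]: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress: sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 1.0):
        y = mu_compress(x)
        print(x, round(y, 4), round(mu_expand(y), 6))
```

Small inputs are stretched over a large part of the output range, so a uniform quantizer applied to the compressed signal spends more of its levels on weak signals.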
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 23
μ Law
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 24
A Law
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 25
μ Law/A Law
[Figure: μ-law and A-law compression curves plotted against |x|.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 26
A Companding Law (Europe - ITU)
[Figure: A-law companding characteristic mapping an 11 bit input to an 8 bit output; both axes are marked in steps of 16.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 27
Companding in music recording
• Recall the human ear’s “masking” phenomenon
– A noise signal is not perceived as objectionable unless it is sufficiently large in relation to a desired sound present simultaneously
– Small noises are objectionable in a quiet library
– The same small noise is imperceptible at a rock concert!
• This principle is the basis of noise reduction systems like the Dolby™ system for sound recording
– The recording audio level is automatically increased for soft passages
– The playback level is automatically reduced, to match, via an auxiliary control signal, so the desired signal has the original loudness. In the Dolby system, this is typically a low frequency control signal.
– Therefore, noise added by the recording medium (e.g., magnetic tape “hiss”) is not noticeable during “soft” music intervals
– Dolby systems treat different audio frequency bands separately (high frequency is noisiest in magnetic tape), and use different types of auxiliary signals (Dolby B, C, etc.)
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 28
G.711
• The most commonplace codec
– Used in circuit-switched telephone network
– PCM, Pulse-Code Modulation
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 40
Adaptive DPCM with Forward Adaptation
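[Figure: block diagram of forward-adaptive ADPCM. Encoder: the speech input minus the output of an adaptive predictor of order p feeds an adaptive quantizer; predictor adaptation and step size adaptation are computed from the input and sent as side info. Decoder: an inverse quantizer Q-1 and an adaptive predictor, both driven by the side info, produce the speech output.]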
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 41
Adaptive DPCM with Backward Adaptation
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 42
Short and long-term ADPCM
If we wish to model the short and long term prediction nature of speech, we can use a predictor of very large order P:
$$P(z) = \sum_{k=1}^{P} a_k z^{-k}$$
But modeling the pitch periodicity of speech as well as the short term redundancy would require a very large order, P = 50 to 100.
Instead, the predictor is split into 2 portions, one modeling the short term redundancy of speech, and one modeling the long term redundancy due to pitch periodicity. The long term predictor can be a single coefficient filter of the form:
$$A_L(z) = \beta \cdot z^{-M}$$
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 43
Short and long-term ADPCM
β is a scaling factor that relates to the degree of periodicity of the waveform and M is the estimated pitch period (in samples). The predictor time response is a single impulse delayed by M samples. The synthesis filter is of the form:
$$\frac{1}{1 - A_L(z)} = \frac{1}{1 - \beta \cdot z^{-M}}$$
The width of the peaks in its frequency response is a function of β, which can be estimated as
$$\beta = \frac{E[x(n)\, x(n-M)]}{E[x^2(n-M)]}$$
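The estimate of β can be computed from a block of samples by replacing the expectations with sums. The sketch below also picks M by searching a lag range for the largest normalized correlation; that search rule, the lag limits and the function name are assumptions made for illustration, since the slides only give the formula for β.

```python
import math

def estimate_long_term(x, m_min, m_max):
    """Return (M, beta) for the lag in [m_min, m_max] with the best match."""
    best_score, best_m, best_beta = 0.0, m_min, 0.0
    for m in range(m_min, m_max + 1):
        num = sum(x[n] * x[n - m] for n in range(m, len(x)))   # ~ E[x(n) x(n-M)]
        den = sum(x[n - m] ** 2 for n in range(m, len(x)))     # ~ E[x^2(n-M)]
        if den == 0.0 or num <= 0.0:
            continue                      # skip lags with no positive correlation
        score = num * num / den           # energy removed by a one-tap predictor
        if score > best_score:
            best_score, best_m, best_beta = score, m, num / den
    return best_m, best_beta

if __name__ == "__main__":
    period = 40
    x = [math.sin(2.0 * math.pi * n / period) for n in range(400)]
    print(estimate_long_term(x, 20, 60))
```

For a perfectly periodic test signal the search returns the true period with β close to 1; for real speech β is smaller and depends on how strongly voiced the frame is.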
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 44
Short and long-term ADPCM
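[Figure: ADPCM with short-term and long-term prediction. Encoder: the input $s(n)$ has the predictions of the long-term predictor $A_L(z)$ and the short-term predictor $A(z)$ subtracted, and the residual is quantized by Q to give $e_q(n)$. Decoder: $e_q(n)$ passes through Q-1 and the $A_L(z)$ and $A(z)$ prediction loops to give the speech output.]
$$A_L(z) = \beta \cdot z^{-M}, \qquad A(z) = \sum_{k=1}^{P} a_k z^{-k}$$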
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 45
Higher-order LT predictor
The true pitch period is unlikely to be an exact multiple of the sampling period $1/F_s$. Thus, a predictor with multiple taps can be used to better synthesize the pitch periods.
$$A_L(z) = \beta_1 \cdot z^{-M+1} + \beta_2 \cdot z^{-M} + \beta_3 \cdot z^{-M-1}$$
Another way to deal with the varying degree of voicing across the spectrum (the lower spectrum is more harmonically pronounced than the higher) is to treat separate bands separately. This allows the pitch predictor in different bands to have different β: high values, which lead to narrow bandwidth, for the lower frequencies, and lower values for the less periodic higher frequencies.
[Figure: the signal is split into a high band and a low band, each with its own LT prediction.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 46
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 50
Delta Modulation : (DM)
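• Predictor: one-step delay function, so the prediction of $u(n)$ is the previous reconstructed sample $\tilde{u}(n-1)$
• Quantizer: 1-bit quantizer
$$e(n) = u(n) - \tilde{u}(n-1), \qquad \tilde{e}(n) = Q_{1\text{-bit}}[e(n)]$$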
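A delta modulator following these equations fits in a few lines. In this sketch the reconstruction is accumulated as the previous value plus or minus one step (the integrator), the starting value of 0 and the function names are assumptions.

```python
def dm_encode(u, step):
    """Encode samples u into one bit per sample."""
    bits = []
    recon = 0.0                              # reconstructed sample u~(n-1)
    for sample in u:
        e = sample - recon                   # e(n) = u(n) - u~(n-1)
        bit = 1 if e >= 0 else 0             # 1-bit quantizer keeps only the sign
        recon += step if bit else -step      # integrator: u~(n) = u~(n-1) +/- step
        bits.append(bit)
    return bits

def dm_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    recon = 0.0
    for bit in bits:
        recon += step if bit else -step
        out.append(recon)
    return out

if __name__ == "__main__":
    signal = [0.0, 0.5, 1.0, 1.0]
    bits = dm_encode(signal, step=0.5)
    print(bits, dm_decode(bits, step=0.5))
```

The decoder repeats the encoder's integrator, so both sides hold the same staircase as long as no bit is corrupted on the channel.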
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 51
Delta Modulation : (DM)
• Primary limitations of DM
– Slope overload: large jump region
» Max. slope = (step size) × (sampling freq.)
– Granularity noise: almost constant region
– Instability to channel noise
[Figure: input $u(n)$ and its staircase approximation $\tilde{u}(n)$.]
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 52
DM:
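[Figure: DM block diagram. Coder: $u(n)$ minus the delayed reconstruction $\tilde{u}(n-1)$ gives $e(n)$, which the 1-bit quantizer turns into $\tilde{e}(n)$; an integrator with a unit delay ($T_s$) forms $\tilde{u}(n)$. Decoder: the same integrator with a unit delay ($T_s$) rebuilds $\tilde{u}(n)$ from $\tilde{e}(n)$.]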
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 53
DM
Step size effect: the step size (together with the sampling frequency) determines (i) slope overload and (ii) granular noise.
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 54
DM – step size conditions
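• The choice of step size is crucial to successful performance in DM. Since the output magnitude can change only by Δ each sample interval T, Δ must be large enough to accommodate rapid changes:
$$\frac{\Delta}{T} \ge \max\left|\frac{dx(t)}{dt}\right| \approx \max\frac{|x(n) - x(n-1)|}{T}$$
For sinusoidal signals, $x(t) = A \cdot \cos(2\pi f_0 t)$:
$$\frac{\Delta}{T} \ge \max\left|\frac{dx(t)}{dt}\right| = 2\pi f_0 A$$
$$A \le \frac{\Delta}{2\pi f_0 T} = \frac{\Delta \cdot f_s}{2\pi f_0}$$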
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 55
DM – step size example
– Q. Consider a speech signal with maximum frequency of 3.4 kHz and maximum amplitude of 1 volt. This speech signal is applied to a delta modulator whose bit rate is set at 60 kbit/sec. What is an appropriate step size for the modulator?
– Bandwidth of the signal = 3.4 kHz
– Maximum amplitude = 1 volt
– Bit rate = 60 kbit/sec
– Sampling rate = 60 k samples/sec (DM sends one bit per sample)
– Step size condition:
$$\Delta \ge 2\pi f_0 A T_s = \frac{2\pi \times 3400 \times 1}{60000} \approx 0.356 \text{ volts}$$
– STEP SIZE = 0.356 volts
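The worked example can be reproduced with a one-line function, which also makes it easy to try other bit rates. The function name is hypothetical.

```python
import math

def min_step_size(f0_hz, amplitude, fs_hz):
    """Smallest step that avoids slope overload for a sinusoid: 2*pi*f0*A*Ts."""
    return 2.0 * math.pi * f0_hz * amplitude / fs_hz

if __name__ == "__main__":
    # Values from the example: 3.4 kHz, 1 volt, 60 k samples/sec.
    print(round(min_step_size(3400.0, 1.0, 60000.0), 3))   # 0.356
```

Doubling the sampling rate halves the required step, which in turn reduces granular noise at the cost of a higher bit rate.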
Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 56
Adaptive DM:
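[Figure: adaptive DM block diagram with signals $X_{k+1}$, $E_{k+1}$ and $s_{k+1}$, a unit delay giving $X_k$, and an adaptive function that produces the step size $\Delta_{k+1}$ from the stored values $\Delta_k$, $E_k$ and $\Delta_{\min}$.]
Input signal is varying fast: step size is increased
Input signal is varying slowly: step size is reduced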
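One simple way to realize this behaviour is to grow the step when two consecutive bits agree (the input is running away from the staircase) and shrink it when they differ. The rule below, the factor `k`, the step limits and all names are assumptions chosen for illustration; the slides do not specify the adaptive function.

```python
def _next_step(step, bit, prev_bit, step_min, step_max, k):
    """Grow the step on repeated bits, shrink it on alternating bits."""
    if prev_bit is None:
        return step
    step = step * k if bit == prev_bit else step / k
    return max(step_min, min(step_max, step))

def adm_encode(u, step_min=0.05, step_max=1.0, k=1.5):
    """Return (bits, staircase) for the input samples u."""
    bits, track = [], []
    recon, step, prev = 0.0, step_min, None
    for sample in u:
        bit = 1 if sample >= recon else 0
        step = _next_step(step, bit, prev, step_min, step_max, k)
        recon += step if bit else -step
        bits.append(bit)
        track.append(recon)
        prev = bit
    return bits, track

def adm_decode(bits, step_min=0.05, step_max=1.0, k=1.5):
    """Rebuild the staircase; the step sizes are recovered from the bits alone."""
    out = []
    recon, step, prev = 0.0, step_min, None
    for bit in bits:
        step = _next_step(step, bit, prev, step_min, step_max, k)
        recon += step if bit else -step
        out.append(recon)
        prev = bit
    return out

if __name__ == "__main__":
    bits, track = adm_encode([1.0] * 12)
    print(bits)
    print([round(v, 3) for v in track])
```

With these settings a jump from 0 to 1 is reached in 6 samples, where a fixed step of 0.05 would need 20, and the step shrinks again once the bits start to alternate.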