Chapter 2
AUDIO COMPRESSION
Digital Audio, Lossy sound compression, μ-Law and A-Law Companding, DPCM and ADPCM audio compression, MPEG audio standard, frequency domain coding, format of compressed data.
1. Introduction:
Two important features of audio compression are (1) it can be lossy and (2) it requires fast decoding. Text compression must be lossless, but images and audio can lose much data without a noticeable degradation of quality. Thus, there are both lossless and lossy audio compression algorithms. Often, audio is stored in compressed form and has to be decompressed in real time when the user wants to listen to it. This is why most audio compression methods are asymmetric. The encoder can be slow, but the decoder has to be fast. This is also why audio compression methods are not dictionary based. A dictionary-based compression method may have many advantages, but fast decoding is not one of them.
We can define sound as:
(a) An intuitive definition: Sound is the sensation detected by our ears and interpreted by our brain in a certain way.
(b) A scientific definition: Sound is a physical disturbance in a medium. It propagates in the medium as a pressure wave by the movement of atoms or molecules.
Like any other wave, sound has three important attributes: its speed, amplitude, and period. The speed of sound depends mostly on the medium it passes through and on the temperature. The human ear is sensitive to a wide range of sound frequencies, normally from about 20 Hz to about 22,000 Hz, depending on a person's age and health. This is the range of audible frequencies. Some animals, most notably dogs and bats, can hear higher frequencies (ultrasound). Loudness is commonly measured in units of dB SPL (sound pressure level) instead of sound power. The definition is

SPL = 20 log10(p / p0) dB,

where p is the sound pressure and p0 = 0.0002 dyne/cm^2 (20 micropascals) is a reference pressure close to the threshold of hearing.
2. Digital Audio:
Sound can be digitized and broken up into numbers. Digitizing sound is done by measuring the voltage at many points in time, translating each measurement into a number, and writing the numbers on a file. This process is called sampling. The sound wave is sampled, and the samples become the digitized sound. The device used for sampling is called an analog-to-digital converter (ADC).
Since the audio samples are numbers, they are easy to edit. However, the main use of an audio file is to play it back. This is done by converting the numeric samples back into voltages that are continuously fed into a speaker. The device that does that is called a digital-to-analog converter (DAC). Intuitively, it is clear that a high sampling rate would result in better sound reproduction, but also in many more samples and therefore bigger files. Thus, the main problem in audio sampling is how often to sample a given sound.
Figure 1: Sampling of a Sound Wave
Figure 1a shows the effect of a low sampling rate. The sound wave in the figure is sampled four times, and all four samples happen to be identical. When these samples are used to play back the sound, the result is silence. Figure 1b shows seven samples, and they seem to follow the original wave fairly closely. Unfortunately, when they are used to reproduce the sound, they produce the dashed curve. There simply are not enough samples to reconstruct the original sound wave. The solution to the sampling problem is to sample sound at a little over the Nyquist frequency, which is twice the maximum frequency contained in the sound.
The sampling rate plays a central role in determining the quality of digital sound reproduction. One classic law in digital signal processing was published by Harry Nyquist. He determined that to accurately reproduce a signal of frequency f, the sampling rate has to be greater than 2f. This is commonly called the Nyquist rate, and it is used in many practical situations. The range of human hearing, for instance, is between 16 Hz and 22,000 Hz. When sound is digitized at high quality (such as music recorded on a CD), it is sampled at the rate of 44,100 Hz. Anything lower than that results in distortions.
Thus, if a sound contains frequencies of up to 2 kHz, it should be sampled at a little more than 4 kHz. Such a sampling rate guarantees true reproduction of the sound. This is illustrated in Figure 1c, which shows 10 equally spaced samples taken over four periods. Notice that the samples do not have to be taken from the maxima or minima of the wave; they can come from any point.
The range of human hearing is typically from 16-20 Hz to 20,000-22,000 Hz, depending on the person and on age. When sound is digitized at high fidelity, it should therefore be sampled at a little over the Nyquist rate of 2 × 22,000 = 44,000 Hz. This is why high-quality digital sound is based on a 44,100 Hz sampling rate. Anything lower than this rate results in distortions, while higher sampling rates do not produce any improvement in the reconstruction (playback) of the sound. We can consider the sampling rate of 44,100 Hz a low-pass filter, since it effectively removes all the frequencies above 22,000 Hz.
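To make the aliasing argument concrete, here is a short Python sketch (an illustration added for concreteness, not part of any standard). It samples a 2 kHz sine once above and once below its Nyquist rate; at a 3,000 Hz sampling rate the 2 kHz tone yields exactly the same samples as an inverted 1 kHz tone, which is the kind of distortion described above.

import numpy as np

def sample_sine(freq_hz, rate_hz, n):
    """Return n samples of a sine of freq_hz taken at rate_hz."""
    t = np.arange(n) / rate_hz
    return np.sin(2 * np.pi * freq_hz * t)

ok = sample_sine(2000, 4410, 32)    # rate > 2f: the tone is recoverable
bad = sample_sine(2000, 3000, 32)   # rate < 2f: the tone aliases
# At a 3 kHz rate, sin(2*pi*2000*t) equals -sin(2*pi*1000*t) at every
# sample instant, so 2 kHz is indistinguishable from an inverted 1 kHz.
alias = -sample_sine(1000, 3000, 32)
assert np.allclose(bad, alias)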
The telephone system, originally designed for conversations, not for digital communications, samples sound at only 8 kHz. Thus, any frequency higher than 4,000 Hz gets distorted when sent over the phone, which is why it is hard to distinguish, on the phone, between the sounds of f and s. The second problem in sound sampling is the sample size. Each sample becomes a number, but how large should this number be? In practice, samples are normally either 8 or 16 bits. Assuming that the highest voltage in a sound wave is 1 volt, an 8-bit sample can distinguish voltages as low as 1/256 ≈ 0.004 volt, or 4 millivolts (mV). A quiet sound, generating a wave lower than 4 mV, would be sampled as zero and played back as silence. In contrast, with a 16-bit sample it is possible to distinguish sounds as low as 1/65,536 ≈ 15 microvolts (μV). We can think of the sample size as a quantization of the original audio data.
Audio sampling is also called pulse code modulation (PCM). The term pulse modulation refers to techniques for converting a continuous wave to a stream of binary numbers (audio samples). Possible pulse modulation methods include pulse amplitude modulation (PAM), pulse position modulation (PPM), pulse width modulation (PWM), and pulse number modulation (PNM). In practice, however, PCM has proved the most effective form of converting sound waves to numbers. When stereo sound is digitized, the PCM encoder multiplexes the left and right sound samples. Thus, stereo sound sampled at 22,000 Hz with 16-bit samples generates 44,000 16-bit samples per second, for a total of 704,000 bits/sec, or 88,000 bytes/sec.
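The bit-rate arithmetic above is easy to verify. The following minimal Python sketch (illustrative only) reproduces the numbers of this paragraph, plus the corresponding figure for CD-quality stereo at 44,100 Hz:

def pcm_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

assert pcm_rate(22_000, 16, 2) == 704_000        # the stereo example above
assert pcm_rate(22_000, 16, 2) // 8 == 88_000    # the same rate in bytes/sec
assert pcm_rate(44_100, 16, 2) == 1_411_200      # CD-quality stereo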
2.1 Digital Audio and Laplace Distribution:
A large audio file with a long, complex piece of music tends to have all the possible values of audio samples. Consider the simple case of 8-bit audio samples, which have values in the interval [0, 255]. A large audio file, with millions of audio samples, will tend to have many audio samples concentrated around the center of this interval (around 128), fewer large samples (close to the maximum 255), and few small samples (although there may be many audio samples of 0, because many types of sound tend to have periods of silence). The distribution of the samples may have a maximum at its center and another spike at 0. Thus, the audio samples themselves do not normally have a simple distribution.
However, when we examine the differences of adjacent samples, we observe a completely different behavior. Consecutive audio samples tend to be correlated, which is why the differences of consecutive samples tend to be small numbers. Experiments with many types of sound indicate that the distribution of audio differences resembles the Laplace distribution. The differences of consecutive correlated values tend to have a narrow, peaked distribution, resembling the Laplace distribution. This is true for the differences of audio samples as well as for the differences of consecutive pixels of an image. A compression algorithm may take advantage of this fact and encode the differences with variable-size codes that have a Laplace distribution. A more sophisticated version may compute differences between actual values (audio samples or pixels) and their predicted values, and then encode the (Laplace-distributed) differences. Two such methods are image MLP and FLAC.
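As a quick illustration of this behavior, the following Python sketch builds a synthetic stand-in for correlated audio (a slow sine plus a little noise, not real music; an assumption made purely for the demonstration) and shows that the samples span most of [0, 255] while the differences of adjacent samples cluster tightly around zero:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for correlated audio, quantized to 8-bit samples.
t = np.arange(100_000)
samples = np.clip(128 + 100 * np.sin(2 * np.pi * t / 400)
                  + rng.normal(0, 2, t.size), 0, 255).astype(np.uint8)
diffs = np.diff(samples.astype(np.int16))   # differences of adjacent samples

print(samples.min(), samples.max())              # samples cover most of [0, 255]
print(np.abs(diffs).mean(), np.abs(diffs).max()) # differences stay near zero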
2.2 The Human Auditory System
The frequency range of the human ear is from about 20 Hz to about 20,000 Hz, but the ear's sensitivity to sound is not uniform. It depends on the frequency. It should also be noted that the range of the human voice is much more limited. It is only from about 500 Hz to about 2 kHz. The existence of the hearing threshold suggests an approach to lossy audio compression: just delete any audio samples that are below the threshold. Since the threshold depends on the frequency, the encoder needs to know the frequency spectrum of the sound being compressed at any time. If a signal for frequency f is smaller than the hearing threshold at f, it (the signal) should be deleted. In addition to this, two more properties of the human hearing system are used in audio compression. They are frequency masking and temporal masking.
2.2.1 Spectral Masking or Frequency Masking:
Frequency masking (also known as auditory masking or spectral masking) occurs when a sound that we can normally hear (because it is loud enough) is masked by another sound with a nearby frequency. The thick arrow in Figure 2 represents a strong sound source at 8 kHz. This source raises the normal threshold in its vicinity (the dashed curve), with the result that the nearby sound represented by the arrow at x, a sound that would normally be audible because it is above the threshold, is now masked, and is inaudible. A good lossy audio compression method should identify this case and delete the signals corresponding to sound x, because it cannot be heard anyway. This is one way to lossily compress sound.
Figure 2: Spectral or frequency masking
The frequency masking (the width of the dashed curve of Figure 2) depends on the frequency. It varies from about 100 Hz for the lowest audible frequencies to more than 4 kHz for the highest. The range of audible frequencies can therefore be partitioned into a number of critical bands that indicate the declining sensitivity of the ear (rather, its declining resolving power) for higher frequencies. We can think of the critical bands as a measure similar to frequency. However, in contrast to frequency, which is absolute and has nothing to do with human hearing, the critical bands are determined according to the sound perception of the ear. Thus, they constitute a perceptually uniform measure of frequency. Table 1 lists 27 approximate critical bands.
Table 1: Twenty-Seven Approximate Critical Bands.
This also points the way to designing a practical lossy compression algorithm. The audio signal should first be transformed into its frequency domain, and the resulting values (the frequency spectrum) should be divided into subbands that resemble the critical bands as much as possible. Once this is done, the signals in each subband should be quantized such that the quantization noise (the difference between the original sound sample and its quantized value) is inaudible.
2.2.2 Temporal Masking
Temporal masking may occur when a strong sound A of frequency f is preceded or followed in time by a weaker sound B at a nearby (or the same) frequency. If the time interval between the sounds is short, sound B may not be audible. Figure 3 illustrates an example of temporal masking. The threshold of temporal masking due to a loud sound at time 0 goes down, first sharply, then slowly. A weaker sound of 30 dB will not be audible if it occurs 10 ms before or after the loud sound, but will be audible if the time interval between the sounds is 20 ms.
Figure 3: Threshold and Masking of Sound.
If the masked sound occurs prior to the masking tone, this is called premasking or backward masking, and if the sound being masked occurs after the masking tone, this effect is called postmasking or forward masking. The forward masking remains in effect for a much longer time interval than the backward masking.
3. Lossy Sound Compression
It is possible to get better sound compression by developing lossy methods that take advantage of our perception of sound and discard data to which the human ear is not sensitive. We briefly describe two approaches, silence compression and companding.
The principle of silence compression is to treat small samples as if they were silence (i.e., as samples of 0). This generates run lengths of zero, so silence compression is actually a variant of RLE, suitable for sound compression. This method uses the fact that some people have less sensitive hearing than others, and will tolerate the loss of sound that is so quiet they may not hear it anyway. Audio files containing long periods of low-volume sound will respond to silence compression better than other files with high-volume sound. This method requires a user-controlled parameter that specifies the largest sample that should be suppressed.
Companding (short for compressing/expanding) uses the fact that the ear requires more precise samples at low amplitudes (soft sounds), but is more forgiving at higher amplitudes. A typical ADC used in sound cards for personal computers converts voltages to numbers linearly. If an amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. A compression method using companding examines every sample in the sound file, and employs a nonlinear formula to reduce the number of bits devoted to it. More sophisticated methods, such as μ-law and A-law, are commonly used.
4. μ-Law and A-Law Companding
The μ-law and A-law companding standards employ logarithm-based functions to encode audio samples for ISDN (integrated services digital network) digital telephony services, by means of nonlinear quantization. The ISDN hardware samples the voice signal from the telephone at 8 kHz and generates 14-bit samples (13 for A-law). The method of μ-law companding is used in North America and Japan, and A-law is used elsewhere.
Experiments indicate that the low amplitudes of speech signals contain more information than the high amplitudes. This is why nonlinear quantization makes sense. Imagine an audio signal sent on a telephone line and digitized to 14-bit samples. The louder the conversation, the higher the amplitude, and the bigger the value of the sample. Since high amplitudes are less important, they can be coarsely quantized. If the largest sample, which is 2^14 − 1 = 16,383, is quantized to 255 (the largest 8-bit number), then the compression factor is 14/8 = 1.75. When decoded, a code of 255 will become very different from the original 16,383. We say that because of the coarse quantization, large samples end up with high quantization noise. Smaller samples should be finely quantized, so they end up with low quantization noise. The μ-law encoder inputs 14-bit samples and outputs 8-bit codewords. The A-law inputs 13-bit samples and also outputs 8-bit codewords. The telephone signals are sampled at 8 kHz (8,000 times per second), so the μ-law encoder receives 8,000 × 14 = 112,000 bits/sec. At a compression factor of 1.75, the encoder outputs 64,000 bits/sec.
4.1 μ-Law Encoder:
The μ-law encoder receives a 14-bit signed input sample x. Thus, the input is in the range [−8192, +8191]. The sample is normalized to the interval [−1, +1], and the encoder uses the logarithmic expression

F(x) = sgn(x) · ln(1 + μ|x|) / ln(1 + μ),

where

sgn(x) = +1 for x > 0, 0 for x = 0, and −1 for x < 0

(and μ is a positive integer), to compute and output an 8-bit code in the same interval [−1, +1]. The output is then scaled to the range [−256, +255]. Figure 4 shows this output as a function of the input for the three μ values 25, 255, and 2555. It is clear that large values of μ cause coarser quantization for larger amplitudes. Such values allocate more bits to the smaller, more important, amplitudes. The G.711 standard recommends the use of μ = 255. The diagram shows only the nonnegative values of the input (i.e., from 0 to 8191). The negative side of the diagram has the same shape but with negative inputs and outputs.
Figure 4: The μ-Law for μ Values of 25, 255, and 2555.
The following simple examples illustrate the nonlinear nature of the μ-law. The two (normalized) input samples 0.15 and 0.16 are transformed by the μ-law to outputs 0.6618 and 0.6732. The difference between the outputs is 0.0114. On the other hand, the two input samples 0.95 and 0.96 (bigger inputs but with the same difference) are transformed to 0.9908 and 0.9927. The difference between these two outputs is 0.0019; much smaller. Bigger samples are decoded with more noise, and smaller samples are decoded with less noise. However, the signal-to-noise ratio is constant because both the μ-law and the SNR use logarithmic expressions.
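The numbers in this example follow directly from the μ-law expression given above. The short Python sketch below (an illustration, using μ = 255 as recommended by G.711) reproduces the two output differences:

import math

MU = 255  # the value recommended by G.711

def mu_law(x, mu=MU):
    """Continuous mu-law companding of a sample normalized to [-1, +1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

# Equal input spacing (0.01), very different output spacing.
print(mu_law(0.16) - mu_law(0.15))   # ~0.0114
print(mu_law(0.96) - mu_law(0.95))   # ~0.0019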
P S2 S1 S0 Q3 Q2 Q1 Q0
Figure 5: G.711 μ-Law Codeword.
Logarithms are slow to compute, so the μ-law encoder performs much simpler calculations that produce an approximation. The output specified by the G.711 standard is an 8-bit codeword whose format is shown in Figure 5. Bit P in Figure 5 is the sign bit of the output (same as the sign bit of the 14-bit signed input sample). Bits S2, S1, and S0 are the segment code, and bits Q3 through Q0 are the quantization code. The encoder determines the segment code by (1) adding a bias of 33 to the absolute value of the input sample, (2) determining the bit position of the most significant 1-bit among bits 5 through 12 of the biased value, and (3) subtracting 5 from that position. The 4-bit quantization code is set to the four bits following the bit position determined in step 2. The encoder ignores the remaining bits of the input sample, and it inverts (1's complements) the codeword before it is output.
Example of μ-Law Codeword:
(a) Encoding: We use the input sample −656 as an example. The sample is negative, so bit P becomes 1. Adding 33 to the absolute value of the input yields 689 = 0001010110001₂ (Figure 6).

             Q3 Q2 Q1 Q0
 0  0  0  1  0  1  0  1  1  0  0  0  1
12 11 10  9  8  7  6  5  4  3  2  1  0

Figure 6: Encoding Input Sample −656.
The most significant 1-bit in positions 5 through 12 is found at position 9. The segment code is thus 9 − 5 = 4. The quantization code is the four bits 0101 at positions 8-5, and the remaining five bits 10001 are ignored. The 8-bit codeword (which is later inverted) becomes

P S2 S1 S0 Q3 Q2 Q1 Q0
1  1  0  0  0  1  0  1
(b) Decoding: The μ-law decoder inputs an 8-bit codeword and inverts it. It then decodes it as follows:
1. Multiply the quantization code by 2 and add 33 (the bias) to the result.
2. Multiply the result by 2 raised to the power of the segment code.
3. Decrement the result by the bias.
4. Use bit P to determine the sign of the result.
Applying these steps to our example produces
1. The quantization code is 0101₂ = 5, so 5 × 2 + 33 = 43.
2. The segment code is 100₂ = 4, so 43 × 2⁴ = 688.
3. Decrement by the bias: 688 − 33 = 655.
4. Bit P is 1, so the final result is −655. Thus, the quantization error (the noise) is 1; very small.
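The encoding and decoding steps above are simple enough to transcribe directly into code. The following Python sketch is a reading of the steps as described in this section (not a reference G.711 implementation) and reproduces the −656 to −655 example:

def mulaw_encode(sample):
    """Mu-law segment encoder for a 14-bit signed sample, transcribing
    the steps described in the text above."""
    p = 1 if sample < 0 else 0
    biased = min(abs(sample) + 33, 0x1FFF)   # step 1: add the bias of 33 (clamp to 13 bits)
    msb = biased.bit_length() - 1            # step 2: position of the most significant 1-bit
    seg = max(msb - 5, 0)                    # step 3: subtract 5 to get the segment code
    q = (biased >> (seg + 1)) & 0x0F         # the four bits following the MSB position
    return ((p << 7) | (seg << 4) | q) ^ 0xFF  # 1's complement before output

def mulaw_decode(codeword):
    """Mu-law decoder following the four steps listed above."""
    cw = codeword ^ 0xFF                     # undo the inversion
    p, seg, q = cw >> 7, (cw >> 4) & 0x07, cw & 0x0F
    magnitude = ((2 * q + 33) << seg) - 33   # decoder steps 1-3
    return -magnitude if p else magnitude

cw = mulaw_encode(-656)
assert cw == 0b11000101 ^ 0xFF               # P=1, S=100, Q=0101, then inverted
assert mulaw_decode(cw) == -655              # quantization error of 1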
Figure 7 illustrates the nature of the μ-law midtread quantization. Zero is one of the valid output values, and the quantization steps are centered at the input value of 0. The steps are organized in eight segments of 16 steps each. The steps within each segment have the same width, but they double in width from one segment to the next.
Figure 7: μ-Law Midtread Quantization.
If we denote the segment number by i (where i = 0, 1, ..., 7) and the step within a segment by k (where k = 1, 2, ..., 16), then the middle of the tread of each step in Figure 7 (i.e., the points labeled x_j, with j = 16i + k) is given by

x_j = T(i) + (k − 1) · D(i),

where the constants T(i) and D(i) are the initial value and the step size for segment i, respectively. They are given by,
i     0    1     2     3     4      5      6      7
T(i)  1   35   103   239   511   1055   2143   4319
D(i)  2    4     8    16    32     64    128    256
4.2 The A-Law Encoder:
The A-law encoder uses the similar expression

F(x) = sgn(x) · A|x| / (1 + ln A),             for 0 ≤ |x| < 1/A,
F(x) = sgn(x) · (1 + ln(A|x|)) / (1 + ln A),   for 1/A ≤ |x| ≤ 1.

The G.711 standard recommends the use of A = 87.6.
Figure 8: A-Law Midriser Quantization.
The operation of the A-law encoder is similar, except that the quantization (Figure 8) is of the midriser variety. The breakpoints x_j are given by the same equation x_j = T(i) + (k − 1) · D(i), but the initial value T(i) and the step size D(i) for segment i are different from those used by the μ-law encoder and are given by,

i     0    1    2     3     4     5      6      7
T(i)  0   32   64   128   256   512   1024   2048
D(i)  2    2    4     8    16    32     64    128
The A-law encoder generates an 8-bit codeword with the same format as the μ-law encoder. It sets the P bit to the sign of the input sample. It then determines the segment code in the following steps:
1. Determine the bit position of the most significant 1-bit among the seven most significant bits of the input.
2. If such a 1-bit is found, the segment code becomes that position minus 4. Otherwise, the segment code becomes zero.
The 4-bit quantization code is set to the four bits following the bit position determined in step 1, or to half the input value if the segment code is zero. The encoder ignores the remaining bits of the input sample, and it inverts bit P and the even-numbered bits of the codeword before it is output.
The A-law decoder decodes an 8-bit codeword into a 13-bit audio sample as follows:
1. It inverts bit P and the even-numbered bits of the codeword.
2. If the segment code is nonzero, the decoder multiplies the quantization code by 2 and increments this by the bias (33). The result is then multiplied by 2 raised to the power of (the segment code minus 1). If the segment code is 0, the decoder outputs twice the quantization code, plus 1.
3. Bit P is then used to determine the sign of the output.
5. ADPCM Audio Compression:
Adjacent audio samples tend to be similar in much the same way that neighboring pixels in an image tend to have similar colors. The simplest way to exploit this redundancy is to subtract adjacent samples and code the differences, which tend to be small integers. Any audio compression method based on this principle is called DPCM (differential pulse code modulation). Such methods, however, are inefficient, because they do not adapt themselves to the varying magnitudes of the audio stream. Better results are achieved by an adaptive version, and any such version is called ADPCM.
ADPCM, short for adaptive differential pulse code modulation, is a form of pulse code modulation (PCM) that produces a digital signal with a lower bit rate than standard PCM. ADPCM produces a lower bit rate by recording only the difference between samples and adjusting the coding scale dynamically to accommodate large and small differences.
ADPCM employs linear prediction. It uses the previous sample (or several previous samples) to predict the current sample. It then computes the difference between the current sample and its prediction, and quantizes the difference. For each input sample X[n], the output C[n] of the encoder is simply a certain number of quantization levels. The decoder multiplies this number by the quantization step (and may add half the quantization step, for better precision) to obtain the reconstructed audio sample. The method is efficient because the quantization step is updated all the time, by both encoder and decoder, in response to the varying magnitudes of the input samples. It is also possible to modify the prediction algorithm adaptively. Various ADPCM methods differ in the way they predict the current audio sample and in the way they adapt to the input (by changing the quantization step size and/or the prediction method).
In addition to the quantized values, an ADPCM encoder can provide the decoder with side information. This information increases the size of the compressed stream, but this degradation is acceptable to the users, because it makes the compressed audio data more useful. Typical applications of side information are (1) to help the decoder recover from errors and (2) to signal an entry point into the compressed stream. An original audio stream may be recorded in compressed form on a medium such as a CD-ROM. If the user (listener) wants to listen to song 5, the decoder can use the side information to quickly find the start of that song.
Figure 9: (a) ADPCM Encoder and (b) Decoder.
Figure 9 shows the general organization of the ADPCM encoder and decoder. The adaptive quantizer receives the difference D[n] between the current input sample X[n] and the prediction Xp[n−1]. The quantizer computes and outputs the quantized code C[n] of X[n]. The same code is sent to the adaptive dequantizer (the same dequantizer used by the decoder), which produces the next dequantized difference value Dq[n]. This value is added to the previous predictor output Xp[n−1], and the sum Xp[n] is sent to the predictor to be used in the next step.
Better prediction would be obtained by feeding the actual input X[n] to the predictor. However, the decoder wouldn't be able to mimic that, since it does not have X[n]. We see that the basic ADPCM encoder is simple, and the decoder is even simpler. It inputs a code C[n], dequantizes it to a difference Dq[n], which is added to the preceding predictor output Xp[n−1] to form the next output Xp[n]. The next output is also fed into the predictor, to be used in the next step.
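The following toy Python sketch illustrates the encoder/decoder symmetry just described. It is a deliberately simplified first-order ADPCM with an assumed step-adaptation rule, not any standardized scheme such as IMA ADPCM or G.726; its only point is that the decoder can track the encoder because both update the step size from the transmitted codes C[n] alone.

import math

def adpcm_encode(x, levels=16):
    """Toy encoder: predict each sample by the previous reconstruction,
    quantize the difference D[n], adapt the step size."""
    codes, xp, step = [], 0.0, 8.0
    for s in x:
        d = s - xp                                   # prediction error D[n]
        c = max(-levels // 2, min(levels // 2 - 1, round(d / step)))
        codes.append(c)
        xp += c * step                               # decoder's reconstruction Xp[n]
        step = max(1.0, step * (1.5 if abs(c) >= levels // 4 else 0.75))
    return codes

def adpcm_decode(codes, levels=16):
    """Mirror of the encoder: the same step-size update, driven by C[n] only."""
    out, xp, step = [], 0.0, 8.0
    for c in codes:
        xp += c * step
        out.append(xp)
        step = max(1.0, step * (1.5 if abs(c) >= levels // 4 else 0.75))
    return out

wave = [100 * math.sin(n / 5) for n in range(200)]
recon = adpcm_decode(adpcm_encode(wave))
print(max(abs(a - b) for a, b in zip(wave, recon)))  # small tracking error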
6. Speech Compression:
Certain audio codecs are designed specifically to compress speech signals. Such signals are audio and are sampled like any other audio data, but because of the nature of human speech, they have properties that can be exploited for efficient compression.
6.1 Properties of Speech
We produce sound by forcing air from the lungs through the vocal cords into the vocal tract. The vocal cords can open and close, and the opening between them is called the glottis. The movements of the glottis and vocal tract give rise to different types of sound. The three main types are as follows:
1. Voiced sounds. These are the sounds we make when we talk. The vocal cords vibrate, which opens and closes the glottis, thereby sending pulses of air at varying pressures to the tract, where the air is shaped into sound waves. The frequencies of the human voice, on the other hand, are much more restricted and are generally in the range of 500 Hz to about 2 kHz. This is equivalent to time periods of 2 ms to 0.5 ms. Thus, voiced sounds have long-term periodicity.
2. Unvoiced sounds. These are sounds that are emitted and can be heard, but are not parts of speech. Such a sound is the result of holding the glottis open and forcing air through a constriction in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are random or close to random.
3. Plosive sounds. These result when the glottis closes, the lungs apply air pressure on it, and it suddenly opens, letting the air escape suddenly. The result is a popping sound.
6.2 Speech Codecs
There are three main types of speech codecs:
1. Waveform speech codecs: these produce good to excellent speech after compressing and decompressing it, but generate bit rates of 10-64 kbps.
2. Source codecs (also called vocoders): vocoders generally produce poor to fair speech but can compress it to very low bit rates (down to 2 kbps).
3. Hybrid codecs: these codecs are combinations of the former two types and produce speech that varies from fair to good, with bit rates between 2 and 16 kbps.
Figure 10 illustrates the speech quality versus bit rate of these three types.
Figure 10: Speech Quality versus Bitrate for Speech Codecs.
6.3 Waveform Codecs
A waveform codec does not attempt to predict how the original sound was generated. It only tries to produce, after decompression, audio samples that are as close to the original ones as possible. Thus, such codecs are not designed specifically for speech coding and can perform equally well on all kinds of audio data. As Figure 10 illustrates, when such a codec is forced to compress sound to less than 16 kbps, the quality of the reconstructed sound drops significantly.
The simplest waveform encoder is pulse code modulation (PCM). This encoder simply quantizes each audio sample. Speech is typically sampled at only 8 kHz. If each sample is quantized to 12 bits, the resulting bit rate is 8k × 12 = 96 kbps and the reproduced speech sounds almost natural. Better results are obtained with a logarithmic quantizer, such as the μ-law and A-law companding methods. They quantize audio samples to varying numbers of bits and may compress speech to 8 bits per sample on average, thereby resulting in a bit rate of 64 kbps, with very good quality of the reconstructed speech.
A differential PCM speech encoder uses the fact that the audio samples of voiced speech are correlated. This type of encoder computes the difference between the current sample and its predecessor and quantizes the difference. An adaptive version (ADPCM) may compress speech at good quality down to a bit rate of 32 kbps.
Waveform coders may also operate in the frequency domain. The subband coding algorithm (SBC) transforms the audio samples to the frequency domain, partitions the resulting coefficients into several critical bands (or frequency subbands), and codes each subband separately with ADPCM or a similar quantization method. The SBC decoder decodes the frequency coefficients, recombines them, and performs the inverse transformation to (lossily) reconstruct audio samples. The advantage of SBC is that the ear is sensitive to certain frequencies and less sensitive to others. Subbands of frequencies to which the ear is less sensitive can therefore be coarsely quantized without loss of sound quality. This type of coder typically produces good reconstructed speech quality at bit rates of 16-32 kbps. They are, however, more complex to implement than PCM codecs and may also be slower.
The adaptive transform coding (ATC) speech compression algorithm transforms audio samples to the frequency domain with the discrete cosine transform (DCT). The audio file is divided into blocks of audio samples and the DCT is applied to each block, resulting in a number of frequency coefficients. Each coefficient is quantized according to the frequency to which it corresponds. Good-quality reconstructed speech can be achieved at bit rates as low as 16 kbps.
6.4 Source Codecs
In general, a source encoder uses a mathematical model of the source of data. The model depends on certain parameters, and the encoder uses the input data to compute those parameters. Once the parameters are obtained, they are written (after being suitably encoded) on the compressed stream. The decoder inputs the parameters and employs the mathematical model to reconstruct the original data. If the original data is audio, the source coder is called a vocoder (from vocal coder).
6.4.1 Linear Predictive Coder (LPC):
Figure 11 shows a simplified model of speech production. Part (a) illustrates the process in a person, whereas part (b) shows the corresponding LPC mathematical model. In this model, the output is the sequence of speech samples s(n) coming out of the LPC filter (which corresponds to the vocal tract and lips). The input u(n) to the model (and to the filter) is either a train of pulses (when the sound is voiced speech) or white noise (when the sound is unvoiced speech).
Figure 11: Speech Production: (a) Real. (b) LPC Model.
The quantities u(n) are also termed innovation. The model illustrates how samples s(n) of speech can be generated by mixing innovations (a train of pulses and white noise). Thus, it represents mathematically the relation between speech samples and innovations. The task of the speech encoder is to input samples s(n) of actual speech, use the filter as a mathematical function to determine an equivalent sequence of innovations u(n), and output the innovations in compressed form. The correspondence between the model's parameters and the parts of real speech is as follows:
1. Parameter V (voiced) corresponds to the vibrations of the vocal cords. UV expresses the unvoiced sounds.
2. T is the period of the vocal cords' vibrations.
3. G (gain) corresponds to the loudness or the air volume sent from the lungs each second.
4. The innovations u(n) correspond to the air passing through the vocal tract.
5. The two remaining symbols in Figure 11b denote amplification and combination, respectively.
The main equation of the LPC model describes the output of the LPC filter as

H(z) = 1 / (1 + a_1 z^(−1) + a_2 z^(−2) + ... + a_10 z^(−10)),

where z is the input to the filter [the value of one of the u(n)]. An equivalent equation describes the relation between the innovations u(n) on the one hand and the 10 coefficients a_i and the speech audio samples s(n) on the other hand. The relation is

u(n) = s(n) + Σ_{i=1..10} a_i · s(n − i).

This relation implies that each number u(n) input to the LPC filter is the sum of the current audio sample s(n) and a weighted sum of the 10 preceding samples. The LPC model can be written as the 13-tuple

A = (a_1, a_2, ..., a_10, G, V/UV, T),

where V/UV is a single bit specifying the source (voiced or unvoiced) of the input samples. The model assumes that A stays stable for about 20 ms, then gets updated by the audio samples of the next 20 ms. At a sampling rate of 8 kHz, there are 160 audio samples s(n) every 20 ms. The model computes the 13 quantities in A from these 160 samples, writes A (as 13 numbers) on the compressed stream, then repeats for the next 20 ms. The resulting compression factor is therefore 13 numbers for each set of 160 audio samples.
It's important to distinguish the operation of the encoder from the diagram of the LPC's mathematical model depicted in Figure 11b. The figure shows how a sequence of innovations u(n) generates speech samples s(n). The encoder, however, starts with the speech samples. It inputs a 20 ms sequence of speech samples s(n), computes an equivalent sequence of innovations, compresses them to 13 numbers, and outputs the numbers after further encoding them. This repeats every 20 ms.
LPC encoding (or analysis) starts with 160 sound samples and computes the 10 LPC parameters a_i by minimizing the energy of the innovation u(n). The energy is the function

E(a_1, a_2, ..., a_10) = Σ_n u²(n),

and its minimum is computed by differentiating it 10 times, with respect to each of its 10 parameters a_i. The autocorrelation function of the samples s(n) is given by

R(k) = Σ_n s(n) · s(n − k),

which is used to obtain the 10 LPC parameters a_i (by solving the resulting system of 10 linear equations). The remaining three parameters, V/UV, G, and T, are determined from the 160 audio samples. If those samples exhibit periodicity, then T becomes that period and the 1-bit parameter V/UV is set to V. If the 160 samples do not feature any well-defined period, then T remains undefined and V/UV is set to UV. The value of G is determined by the largest sample.
LPC decoding (or synthesis) starts with a set of 13 LPC parameters and computes 160 audio samples as the output of the LPC filter by

s(n) = u(n) − Σ_{i=1..10} a_i · s(n − i).

These samples are played at 8,000 samples per second and result in 20 ms of (voiced or unvoiced) reconstructed speech.
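The analysis step can be illustrated with a short numpy sketch. Differentiating the energy with respect to each a_k yields the normal equations R(k) + Σ_i a_i R(|i − k|) = 0 for k = 1, ..., 10, which the code below solves for one 160-sample frame (a synthetic frame is used here purely for illustration):

import numpy as np

def lpc_coefficients(s, p=10):
    """Estimate the p LPC parameters a_i of one frame by the
    autocorrelation method: solve the normal equations that
    minimize the innovation energy."""
    s = np.asarray(s, dtype=float)
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    # Toeplitz system R a = -r[1..p], from differentiating the energy.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, -r[1:])

# One 20 ms frame at 8 kHz is 160 samples (noise keeps the system regular).
rng = np.random.default_rng(1)
frame = np.sin(2 * np.pi * 0.05 * np.arange(160)) + 0.01 * rng.normal(size=160)
a = lpc_coefficients(frame)   # the 10 a_i of the 13-tuple A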
Advantages of LPC:
1. LPC provides a good model of the speech signal.
2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable source-vocal tract separation.
3. LPC is an analytically tractable model. The model is mathematically precise, simple, and straightforward to implement in either software or hardware.
6.5 Hybrid Codecs
This type of speech codec combines features from both waveform and source codecs. The most popular hybrid codecs are Analysis-by-Synthesis (AbS) time-domain algorithms. Like the LPC vocoder, these codecs model the vocal tract by a linear prediction filter, but use an excitation signal instead of the simple, two-state voiced/unvoiced model to supply the u(n) (innovation) input to the filter. Thus, an AbS encoder starts with a set of speech samples (a frame), encodes them similarly to LPC, decodes them, and subtracts the decoded samples from the original ones. The differences are sent through an error minimization process that outputs improved encoded samples. These samples are again decoded, subtracted from the original samples, and new differences computed. This is repeated until the differences satisfy a termination condition. The encoder then proceeds to the next set of speech samples (next frame).
6.5.1 Code-Excited Linear Prediction (CELP):
One of the most important factors in generating natural-sounding speech is the excitation signal. As the human ear is especially sensitive to pitch errors, a great deal of effort has been devoted to the development of accurate pitch detection algorithms.
In CELP, instead of having a codebook of pulse patterns, we allow a variety of excitation signals. For each segment the encoder finds the excitation vector that generates synthesized speech that best matches the speech segment being encoded. This approach is closer in a strict sense to a waveform coding technique such as DPCM than to the analysis/synthesis schemes. The main components of the CELP coder include the LPC analysis, the excitation codebook, and the perceptual weighting filter. Besides CELP, the MPLPC algorithm had another descendant that has become a standard. Instead of using excitation vectors in which the nonzero values are separated by an arbitrary number of zero values, they forced the nonzero values to occur at regularly spaced intervals. Furthermore, MPLPC allowed the nonzero values to take on a number of different values. This scheme is called regular pulse excitation (RPE) coding. A variation of RPE, called regular pulse excitation with long-term prediction (RPE-LTP), was adopted as a standard for digital cellular telephony by the Group Speciale Mobile (GSM) subcommittee of the European Telecommunications Standards Institute at the rate of 13 kbps.
The vocal tract filter used by the CELP coder is given by

y_n = Σ_{i=1..10} b_i · y_{n−i} + β · y_{n−P} + ε_n,

where P is the pitch period and the term y_{n−P} is the contribution due to the pitch periodicity.
1. The input speech is sampled at 8,000 samples per second and divided into 30-millisecond frames containing 240 samples.
2. Each frame is divided into four subframes of length 7.5 milliseconds.
3. The coefficients for the 10th-order short-term filter are obtained using the autocorrelation method.
4. The pitch period P is calculated once every subframe. In order to reduce the computational load, the pitch value is assumed to lie between 20 and 147 every odd subframe.
5. In every even subframe, the pitch value is assumed to lie within 32 samples of the pitch value in the previous frame.
6. The algorithm uses two codebooks, a stochastic codebook and an adaptive codebook. An excitation sequence is generated for each subframe by adding one scaled element from the stochastic codebook and one scaled element from the adaptive codebook.
7. The stochastic codebook contains 512 entries. These entries are generated using a Gaussian random number generator, the output of which is quantized to −1, 0, or +1. The codebook entries are adjusted so that each entry differs from the preceding entry in only two places.
8. The adaptive codebook consists of the excitation vectors from the previous frame. Each time a new excitation vector is obtained, it is added to the codebook. In this manner, the codebook adapts to local statistics.
9. The coder has been shown to provide excellent reproductions in both quiet and noisy environments at rates of 4.8 kbps and above.
10. The quality of the reproduction of this coder at 4.8 kbps has been shown to be equivalent to a delta modulator operating at 32 kbps. The price for this quality is much higher complexity and a much longer coding delay.
CCITT G.728 CELP Speech Coding Standard:
By their nature, speech coding schemes have some coding delay built into them. By coding delay, we mean the time between when a speech sample is encoded to when it is decoded if the encoder and decoder were connected back to back (i.e., there were no transmission delays). In the schemes we have studied, a segment of speech is first stored in a buffer. We do not start extracting the various parameters until a complete segment of speech is available to us. Once the segment is completely available, it is processed. If the processing is real time, this means another segment's worth of delay. Finally, once the parameters have been obtained, coded, and transmitted, the receiver has to wait until at least a significant part of the information is available before it can start decoding the first sample. Therefore, if a segment contains 20 milliseconds' worth of data, the coding delay would be approximately somewhere between 40 and 60 milliseconds.
For such applications, CCITT approved recommendation G.728, a CELP coder with a coder delay of 2 milliseconds operating at 16 kbps. As the input speech is sampled at 8,000 samples per second, this rate corresponds to an average rate of 2 bits per sample. The G.728 recommendation uses a segment size of five samples. With five samples and a rate of 2 bits per sample, we only have 10 bits available to us. Using only 10 bits, it would be impossible to encode the parameters of the vocal tract filter as well as the excitation vector. Therefore, the algorithm obtains the vocal tract filter parameters in a backward adaptive manner; that is, the vocal tract filter coefficients to be used to synthesize the current segment are obtained by analyzing the previous decoded segments. The G.728 algorithm uses a 50th-order vocal tract filter. The order of the filter is large enough to model the pitch of most female speakers. Not being able to use pitch information for male speakers does not cause much degradation. A backward-adaptive filter, however, is sensitive to channel errors. Therefore, the vocal tract filter is updated every fourth frame, which is once every 20 samples or 2.5 milliseconds. The autocorrelation method is used to obtain the vocal tract parameters.
Figure 12: Encoder and decoder for the CCITT G.728 16 kbps CELP speech codec.
Ten bits would be able to index 1024 excitation sequences. However, to examine 1024 excitation sequences every 0.625 milliseconds is a rather large computational load. In order to reduce this load, the G.728 algorithm uses a product codebook where each excitation sequence is represented by a normalized sequence and a gain term. The final excitation sequence is a product of the normalized excitation sequence and the gain. Of the 10 bits, 3 bits are used to encode the gain using a predictive encoding scheme, while the remaining 7 bits form the index to a codebook containing 127 sequences.
Block diagrams of the encoder and decoder for the CCITT G.728 coder are shown in Figure 12. The low-delay CCITT G.728 CELP coder operating at 16 kbps provides reconstructed speech quality superior to the 32 kbps CCITT G.726 ADPCM algorithm. Various efforts are underway to reduce the bit rate for this algorithm without compromising too much on quality and delay.
6.5.3 Mixed Excitation Linear Prediction (MELP):
The mixed excitation linear prediction (MELP) coder was selected to be the new federal standard for speech coding at 2.4 kbps. It uses the same LPC filter to model the vocal tract. However, it uses a much more complex approach to the generation of the excitation signal. A block diagram of the decoder for the MELP system is shown in Figure 13. As evident from the figure, the excitation signal for the synthesis filter is no longer simply noise or a periodic pulse but a multiband mixed excitation. The mixed excitation contains both a filtered signal from a noise generator as well as a contribution that depends directly on the input signal.
The first step in constructing the excitation signal is pitch extraction. The MELP algorithm obtains the pitch period using a multistep approach. In the first step an integer pitch value P1 is obtained by
1. first filtering the input using a low-pass filter with a cutoff of 1 kHz, and
2. computing the normalized autocorrelation for lags between 40 and 160.
The normalized autocorrelation r(P) is defined as

r(P) = c(0, P) / [c(0, 0) · c(P, P)]^(1/2),

where c(i, j) is the correlation sum of the filtered speech at offsets i and j over the analysis window.
The first estimate of the pitch P1 is obtained as the value of the lag that maximizes the normalized autocorrelation function. This stage uses two values of P1, one from the current frame and one from the previous frame, as candidates. The normalized autocorrelation values are obtained for lags from five samples less to five samples more than the candidate P1 values. The lags that provide the maximum normalized autocorrelation value for each candidate are used for fractional pitch refinement.
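A minimal Python sketch of this first estimation stage is given below. The exact form of c(i, j) is an assumption here (a plain windowed correlation sum); the sketch simply searches lags 40 through 160 for the maximum of r(P):

import numpy as np

def normalized_autocorr(y, lag):
    """r(P) = c(0, P) / sqrt(c(0, 0) * c(P, P)) over one analysis window,
    with c(i, j) taken as a plain windowed correlation sum (an assumption)."""
    n = len(y) - lag
    c0p = np.dot(y[:n], y[lag:])
    return c0p / np.sqrt(np.dot(y[:n], y[:n]) * np.dot(y[lag:], y[lag:]))

def integer_pitch(y, lo=40, hi=160):
    """First-stage estimate P1: the lag that maximizes r(P)."""
    return max(range(lo, hi + 1), key=lambda p: normalized_autocorr(y, p))

y = np.sin(2 * np.pi * np.arange(800) / 57.0)   # synthetic 57-sample pitch
assert abs(integer_pitch(y) - 57) <= 1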
Figure 13: Block diagram of MELP decoder.
The final refinements of the pitch value are obtained using the linear prediction residuals. The residual sequence is generated by filtering the input speech signal with the filter obtained using the LPC analysis. For the purposes of pitch refinement the residual signal is filtered using a low-pass filter with a cutoff of 1 kHz. The normalized autocorrelation function is computed for this filtered residual signal for lags from five samples less to five samples more than the candidate P2 value, and a candidate value of P3 is obtained.
The input is also subjected to a multiband voicing analysis using five filters with passbands 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. The goal of the analysis is to obtain the voicing strengths Vbp_i for each band used in the shaping filters. If the value of Vbp_1 is small, this indicates a lack of low-frequency structure, which in turn indicates an unvoiced or transition input. Thus, if Vbp_1 > 0.6, the values of the other voicing strengths are quantized to 1 if their value is greater than 0.6, and to 0 otherwise. In this way signal energy in the different bands is turned on or off depending on the voicing strength.
In order to generate the pulse input, the algorithm measures the magnitude of the discrete Fourier transform coefficients corresponding to the first 10 harmonics of the pitch. The magnitudes of the harmonics are quantized using a vector quantizer with a codebook size of 256. The codebook is searched using a weighted Euclidean distance that emphasizes lower frequencies over higher frequencies.
At the decoder, using the magnitudes of the harmonics and information about the periodicity of the pulse train, the algorithm generates one excitation signal. Another signal is generated using a random number generator. Both are shaped by the multiband shaping filter before being combined. This mixture signal is then processed through an adaptive spectral enhancement filter, which is based on the LPC coefficients, to form the final excitation signal. Note that in order to preserve continuity from frame to frame, the parameters used for generating the excitation signal are adjusted based on their corresponding values in neighboring frames.
6.6 MPEG Audio Coding
The formal name of MPEG-1 is the international standard for moving picture video compression, IS 11172. It consists of five parts, of which part 3 [ISO/IEC 93] is the definition of the audio compression algorithm. The document describing MPEG-1 has normative and informative sections. A normative section is part of the standard specification. It is intended for implementers, it is written in a precise language, and it should be strictly followed in implementing the standard on actual computer platforms. An informative section, on the other hand, illustrates concepts that are discussed elsewhere, explains the reasons that led to certain choices and decisions, and contains background material. An example of a normative section is the tables of various parameters and of the Huffman codes used in MPEG audio. An example of an informative section is the algorithm used by MPEG audio to implement a psychoacoustic model. MPEG does not require any particular algorithm, and an MPEG encoder can use any method to implement the model. This informative section simply describes various alternatives.
The MPEG-1 and MPEG-2 (or in short, MPEG-1/2) audio standard specifies three compression methods called layers and designated I, II, and III. All three layers are part of the MPEG-1 standard. A movie compressed by MPEG-1 uses only one layer, and the layer number is specified in the compressed stream. Any of the layers can be used to compress an audio file without any video. An interesting aspect of the design of the standard is that the layers form a hierarchy in the sense that a layer-III decoder can also decode audio files compressed by layers I or II.
The result of having three layers was an increasing popularity of layer III. The encoder is extremely complex, but it produces excellent compression, and this, combined with the fact that the decoder is much simpler, has produced in the late 1990s an explosion of what is popularly known as mp3 sound files. It is easy to legally and freely obtain a layer-III decoder and much music that is already encoded in layer III. So far, this has been a big success of the audio part of the MPEG project.
The principle of MPEG audio compression is quantization. The values being quantized, however, are not the audio samples but numbers (called signals) taken from the frequency domain of the sound. The fact that the compression ratio (or equivalently, the bit rate) is known to the encoder means that the encoder knows at any time how many bits it can allocate to the quantized signals. Thus, the (adaptive) bit allocation algorithm is an important part of the encoder. This algorithm uses the known bit rate and the frequency spectrum of the most recent audio samples to determine the size of the quantized signals such that the quantization noise (the difference between an original signal and a quantized one) will be inaudible.
Figure 14: MPEG Audio: (a) Encoder and (b) Decoder.
The psychoacoustic models use the frequency of the sound that is being compressed, but the input stream consists of audio samples, not sound frequencies. The frequencies have to be computed from the samples. This is why the first step in MPEG audio encoding is a discrete Fourier transform, where a set of 512 consecutive audio samples is transformed to the frequency domain. Since the number of frequencies can be huge, they are grouped into 32 equal-width frequency subbands (layer III uses different numbers but the same principle). For each subband, a number is obtained that indicates the intensity of the sound at that subband's frequency range. These numbers (called signals) are then quantized. The coarseness of the quantization in each subband is determined by the masking threshold in the subband and by the number of bits still available to the encoder. The masking threshold is computed for each subband using a psychoacoustic model.
MPEG uses two psychoacoustic models to implement frequency masking and temporal masking. Each model describes how loud sound masks other sounds that happen to be close to it in frequency or in time. The model partitions the frequency range into 24 critical bands and specifies how masking effects apply within each band. The masking effects depend, of course, on the frequencies and amplitudes of the tones. When the sound is decompressed and played, the user (listener) may select any playback amplitude, which is why the psychoacoustic model has to be designed for the worst case. The masking effects also depend on the nature of the source of the sound being compressed. The source may be tone-like or noise-like. The two psychoacoustic models employed by MPEG are based on experimental work done by researchers over many years.
The decoder must be fast, since it may have to decode the entire movie (video and audio) in real time, so it must be simple. As a result it does not use any psychoacoustic model or bit allocation algorithm. The compressed stream must therefore contain all the information that the decoder needs for dequantizing the signals. This information (the size of the quantized signals) must be written by the encoder on the compressed stream, and it constitutes overhead that should be subtracted from the number of remaining available bits.
Figure 14 is a block diagram of the main components of the MPEG audio encoder and decoder. The ancillary data is user-definable and would normally consist of information related to specific applications. This data is optional.
6.7 Frequency Domain Coding
The first step in encoding the audio samples is to transform them from the time domain to the frequency domain. This is done by a bank of polyphase filters that transform the samples into 32 equal-width frequency subbands. The filters were designed to provide fast operation combined with good time and frequency resolutions. As a result, their design involved three compromises.
1. The first compromise is the equal widths of the 32 frequency bands. This simplifies the filters but is in contrast to the behavior of the human auditory system, whose sensitivity is frequency dependent. When several critical bands are covered by a subband X, the bit allocation algorithm selects the critical band with the least noise masking and uses that critical band to compute the number of bits allocated to the quantized signals in subband X.
2. The second compromise involves the inverse filter bank, the one used by the decoder. The original time-to-frequency transformation involves loss of information (even before any quantization). The inverse filter bank therefore receives data that is slightly bad, and uses it to perform the inverse frequency-to-time transformation, resulting in more distortions. Therefore, the design of the two filter banks (for direct and inverse transformations) had to use compromises to minimize this loss of information.
3. The third compromise has to do with the individual filters. Adjacent filters should ideally pass different frequency ranges. In practice, they have considerable frequency overlap. Sound of a single, pure, frequency can therefore penetrate through two filters and produce signals (that are later quantized) in two of the 32 subbands instead of in just one subband.
The polyphase filter bank uses (in addition to other intermediate data structures) a buffer X with room for 512 input samples. The buffer is a FIFO queue and always contains the most recent 512 samples input. Figure 15 shows the five main steps of the polyphase filtering algorithm.
Figure 15: Polyphase Filter Bank.
6.8 MPEG Layer I Coding
The Layer I coding scheme provides a 4:1 compression. In Layer I coding the time-frequency mapping is accomplished using a bank of 32 subband filters. The output of the subband filters is critically sampled. That is, the output of each filter is downsampled by 32. The samples are divided into groups of 12 samples each. Twelve samples from each of the 32 subband filters, or a total of 384 samples, make up one frame of the Layer I coder. Once the frequency components are obtained, the algorithm examines each group of 12 samples to determine a scalefactor. The scalefactor is used to make sure that the coefficients make use of the entire range of the quantizer. The subband output is divided by the scalefactor before being linearly quantized. There are a total of 63 scalefactors specified in the MPEG standard. Specification of each scalefactor requires 6 bits.
Figure 16: Frame structure for Layer I.
To determine the number of bits to be used for quantization, the coder makes use of the psychoacoustic model. The inputs to the model include the Fast Fourier Transform (FFT) of the audio data as well as the signal itself. The model calculates the masking thresholds in each subband, which in turn determine the amount of quantization noise that can be tolerated and hence the quantization step size. As the quantizers all cover the same range, selection of the quantization step size is the same as selection of the number of bits to be used for quantizing the output of each subband. In Layer I the encoder has a choice of 14 different quantizers for each band (plus the option of assigning 0 bits). The quantizers are all midtread quantizers ranging from 3 levels to 65,535 levels. Each subband gets assigned a variable number of bits. However, the total number of bits available to represent all the subband samples is fixed. Therefore, the bit allocation can be an iterative process. The objective is to keep the noise-to-mask ratio more or less constant across the subbands.
The outputs of the quantization and bit allocation steps are combined into a frame as shown in Figure 16. Because MPEG audio is a streaming format, each frame carries a header, rather than having a single header for the entire audio sequence.
1. The header is made up of 32 bits.
2. The first 12 bits comprise a sync pattern consisting of all 1s.
3. This is followed by a 1-bit version ID,
4. a 2-bit layer indicator, and
5. a 1-bit CRC protection bit. The CRC protection bit is set to 0 if there is no CRC protection and is set to 1 if there is CRC protection.
6. If the layer and protection information is known, all 16 bits can be used for providing frame synchronization.
7. The next 4 bits make up the bit rate index, which specifies the bit rate in kbits/sec. There are 14 specified bit rates to choose from.
8. This is followed by 2 bits that indicate the sampling frequency. The sampling frequencies for MPEG-1 and MPEG-2 are different (one of the few differences between the audio coding standards for MPEG-1 and MPEG-2) and are shown in Table 2.
9. These bits are followed by a single padding bit. If the bit is 1, the frame needs an additional bit to adjust the bit rate to the sampling frequency. The next two bits indicate the mode. The possible modes are stereo, joint stereo, dual channel, and single channel. The stereo mode consists of two channels that are encoded separately but intended to be played together. The joint stereo mode consists of two channels that are encoded together.
Table 2: Allowable sampling frequencies in MPEG-1 and MPEG-2.
The left and right channels are combined to form a mid and a side signal as follows:

M = (L + R) / 2,   S = (L − R) / 2.

The dual channel mode consists of two channels that are encoded separately and are not intended to be played together, such as a translation channel. These are followed by two mode extension bits that are used in the joint stereo mode. The next bit is a copyright bit (1 if the material is copyrighted, 0 if it is not). The next bit is set to 1 for original media and 0 for a copy. The final two bits indicate the type of de-emphasis to be used. If the CRC bit is set, the header is followed by a 16-bit CRC. This is followed by the bit allocations used by each subband and is in turn followed by the set of 6-bit scalefactors. The scalefactor data is followed by the quantized 384 samples.
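The header layout just listed can be expressed as a small bit-field parser. The Python sketch below follows the fields in the order given above (with the semantics of the CRC bit as described in this text) and adds the 1-bit private field of the MPEG-1 header that the list does not mention:

def parse_mpeg_header(word):
    """Split a 32-bit MPEG-1 audio frame header into the fields listed
    above (plus the 1-bit private field the text does not discuss)."""
    bits = lambda hi, lo: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
    assert bits(31, 20) == 0xFFF, "12-bit sync pattern of all 1s"
    return {
        "version_id":    bits(19, 19),
        "layer":         bits(18, 17),
        "crc_protected": bits(16, 16) == 1,   # as described in this text
        "bitrate_index": bits(15, 12),
        "sampling_freq": bits(11, 10),        # per Table 2
        "padding":       bits(9, 9),
        "private":       bits(8, 8),
        "mode":          bits(7, 6),          # stereo/joint/dual/single
        "mode_ext":      bits(5, 4),
        "copyright":     bits(3, 3),
        "original":      bits(2, 2),
        "emphasis":      bits(1, 0),
    }

fields = parse_mpeg_header(0xFFFB9064)   # a typical header word; sync passes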
6.9 Layer II Coding
The Layer II coder provides a higher compression rate by making some relatively minor modifications to the Layer I coding scheme. These modifications include how the samples are grouped together, the representation of the scalefactors, and the quantization strategy. Where the Layer I coder puts 12 samples from each subband into a frame, the Layer II coder groups three sets of 12 samples from each subband into a frame. The total number of samples per frame increases from 384 samples to 1152 samples. This reduces the amount of overhead per sample. In Layer I coding a separate scalefactor is selected for each block of 12 samples. In Layer II coding the encoder tries to share a scalefactor among two or all three groups of samples from each subband filter. The only time separate scalefactors are used for each group of 12 samples is when not doing so would result in a significant increase in distortion. The particular choice used in a frame is signaled through the scalefactor selection information field in the bitstream.
The major difference between the Layer I and Layer II coding schemes is in the quantization step. In the Layer I coding scheme the output of each subband is quantized using one of 14 possibilities; the same 14 possibilities for each of the subbands. In Layer II coding the quantizers used for each of the subbands can be selected from a different set of quantizers depending on the sampling rate and the bit rates. For some sampling rate and bit rate combinations, many of the higher subbands are assigned 0 bits. That is, the information from those subbands is simply discarded. Where the quantizer selected has 3, 5, or 9 levels, the Layer II coding scheme uses one more enhancement. Notice that in the case of 3 levels we have to use 2 bits per sample, which would have allowed us to represent 4 levels. The situation is even worse in the case of 5 levels, where we are forced to use 3 bits, wasting three codewords, and in the case of 9 levels, where we have to use 4 bits, thus wasting 7 codewords.
To avoid this situation, the Layer II coder groups 3 samples into a granule. If each sample can take on 3 levels, a granule can take on 27 levels. This can be accommodated using 5 bits. If each sample had been encoded separately, we would have needed 6 bits. Similarly, if each sample can take on 9 values, a granule can take on 729 values. We can represent 729 values using 10 bits. If each sample in the granule had been encoded separately, we would have needed 12 bits. Using all these savings, the compression ratio in Layer II coding can be increased from 4:1 to 8:1 or 6:1.
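The granule packing described here is just a change of base. The following Python sketch (illustrative only) packs three quantized samples into one base-3 or base-9 integer and unpacks them again:

def pack_granule(samples, levels):
    """Pack three quantized samples, each in range(levels), into one
    base-`levels` integer (27 combinations fit in 5 bits, 729 in 10)."""
    a, b, c = samples
    return (a * levels + b) * levels + c

def unpack_granule(code, levels):
    code, c = divmod(code, levels)
    a, b = divmod(code, levels)
    return (a, b, c)

assert pack_granule((2, 0, 1), 3) < 27             # fits in 5 bits
assert pack_granule((8, 8, 8), 9) == 728           # fits in 10 bits
assert unpack_granule(pack_granule((4, 7, 2), 9), 9) == (4, 7, 2)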
The frame structure for the Layer II coder can be seen in Figure 17. The only real difference between this frame structure and the frame structure of the Layer I coder is the scalefactor selection information field.
Figure 17: Frame structure for Layer II.
6.10 Layer III Coding: mp3
Layer III coding, which has become widely popular under the name mp3, is considerably more complex than the Layer I and Layer II coding schemes. One of the problems with the Layer I and Layer II coding schemes was that with the 32-band decomposition, the bandwidth of the subbands at lower frequencies is significantly larger than the critical bands. This makes it difficult to make an accurate judgment of the mask-to-signal ratio. If we get a high-amplitude tone within a subband and if the subband was narrow enough, we could assume that it masked other tones in the band. However, if the bandwidth of the subband is significantly higher than the critical bandwidth at that frequency, it becomes more difficult to determine whether other tones in the subband will be masked.
To satisfy the backward compatibility requirement, the spectral decomposition in the Layer III algorithm is performed in two stages. First the 32-band subband decomposition used in Layer I and Layer II is employed. The output of each subband is transformed using a modified discrete cosine transform (MDCT) with a 50% overlap. The Layer III algorithm specifies two sizes for the MDCT, 6 or 18. This means that the output of each subband can be decomposed into 18 frequency coefficients or 6 frequency coefficients.
The reason for having two sizes for the MDCT is that when we transform a sequence into the frequency domain, we lose time resolution even as we gain frequency resolution. The larger the block size, the more we lose in terms of time resolution. The problem with this is that any quantization noise introduced into the frequency coefficients will get spread over the entire block size of the transform. Backward temporal masking occurs for only a short duration prior to the masking sound (approximately 20 msec). Therefore, quantization noise will appear as a pre-echo.
For the long windows we end up with 18 frequencies per subband, resulting in a total of 576 frequencies. For the short windows we get 6 coefficients per subband for a total of 192 frequencies. The standard allows for a mixed block mode in which the two lowest subbands use long windows while the remaining subbands use short windows. Notice that while the number of frequencies may change depending on whether we are using long or short windows, the number of samples in a frame stays at 1152. That is 36 samples, or 3 groups of 12, from each of the 32 subband filters.
The coding and quantization of the output of the MDCT is conducted in an iterative fashion using two nested loops. There is an outer loop called the distortion control loop whose purpose is to ensure that the introduced quantization noise lies below the audibility threshold. The scalefactors are used to control the level of quantization noise. In Layer III scalefactors are assigned to groups or bands of coefficients in which the bands are approximately the size of critical bands. There are 21 scalefactor bands for long blocks and 12 scalefactor bands for short blocks.
Figure 19: Frames in Layer III.
The inner loop is called the rate control loop. The goal of this loop is to make sure that a target bit rate is not exceeded. This is done by iterating between different quantizers and Huffman codes. The quantizers used in mp3 are companded nonuniform quantizers. The scaled MDCT coefficients are first quantized and organized into regions. Coefficients at the higher end of the frequency scale are likely to be quantized to zero. These consecutive zero outputs are treated as a single region and the run length is Huffman encoded. Below this region of zero coefficients, the encoder identifies the set of coefficients that are quantized to 0 or ±1. These coefficients are grouped into groups of four. This set of quadruplets is the second region of coefficients. Each quadruplet is encoded using a single Huffman codeword.
The remaining coefficients are divided into two or three subregions. Each subregion is assigned a Huffman code based on its statistical characteristics. If the result of using this variable-length coding exceeds the bit budget, the quantizer is adjusted to increase the quantization step size. The process is repeated until the target rate is satisfied.
The psychoacoustic model is used to check whether the quantization noise in any band exceeds the allowed distortion. If it does, the scalefactor is adjusted to reduce the quantization noise. Once all scalefactors have been adjusted, control returns to the rate control loop. The iterations terminate either when the distortion and rate conditions are satisfied or when the scalefactors cannot be adjusted any further.
There will be frames in which the number of bits used by the Huffman coder is less than the amount allocated. These bits are saved in a conceptual bit reservoir. In practice what this means is that the start of a block of data does not necessarily coincide with the header of the frame. Consider the three frames shown in Figure 19. In this example, the main data for the first frame (which includes scalefactor information and the Huffman coded data) does not occupy the entire frame. Therefore, the main data for the second frame starts before the second frame actually begins. The same is true for the remaining data. The main data can begin in the previous frame. However, the main data for a particular frame cannot spill over into the following frame. All this complexity allows for a very efficient encoding of audio inputs. The typical mp3 audio file has a compression ratio of about 10:1. In spite of this high level of compression, most people cannot tell the difference between the original and the compressed representation.