ISO/IEC MPEG-4 High-Definition Scalable Advanced Audio Coding*

RALF GEIGER,1 RONGSHAN YU,2** JÜRGEN HERRE,1 SUSANTO RAHARDJA,2 SANG-WOOK KIM,3 XIAO LIN,2 and MARKUS SCHMIDT1
(ralf.geiger@iis.fraunhofer.de) (rzyu@dolby.com) (juergen.herre@iis.fraunhofer.de) (rsusanto@i2r.a-star.edu.sg) (sangwookkim@samsung.com) (linxiao@fortemedia.com.cn) (markus.schmidt@iis.fraunhofer.de)
1Fraunhofer IIS, Erlangen, Germany
2Institute for Infocomm Research, Singapore
3Samsung Electronics, Suwon, Korea
Recently the MPEG Audio standardization group has successfully concluded the standardization process on technology for lossless coding of audio signals. A summary of the scalable lossless coding (SLS) technology as one of the results of this standardization work is given. MPEG-4 scalable lossless coding provides a fine-grain scalable lossless extension of the well-known MPEG-4 AAC perceptual audio coder up to fully lossless reconstruction at word lengths and sampling rates typically used for high-resolution audio. The underlying innovative technology is described in detail and its performance is characterized for lossless and near-lossless representation, both in conjunction with an AAC coder and as a stand-alone compression engine. A number of application scenarios for the new technology are discussed.
0 INTRODUCTION
Perceptual coding of high-quality audio signals has experienced a tremendous evolution over the past two decades, both in terms of research progress and in worldwide deployment in products. Examples of successful applications include portable music players, Internet audio, audio for digital media (such as VCD, DVD), and digital broadcasting. Several phases of international standardization have been conducted successfully [1]–[3]. Many recent research and standardization efforts focus on achieving good sound quality at even lower bit rates [4]–[9] to accommodate storage and transmission channels with limited quality (such as terrestrial and satellite broadcasting and third-generation mobile telecommunication). Nevertheless, for other scenarios with higher transmission bandwidth available, there is a general trend toward providing the consumer with a media experience of extremely high fidelity [10], as it is frequently associated with the terms “high definition” and “high resolution” and dedicated media types, such as DVD-Audio [11], SACD [12], HD-DVD [11], or Blu-Ray Disc [13]. In the realm of audio, this is achieved by employing lossless formats with high resolution (word length) and/or high sampling rate.
There exist several proprietary lossless formats, the most prominent of which are MLP Lossless [14], [15], DTS-HD [16], and Apple Lossless [17]. Furthermore, several freeware coding systems have gained some prominence through the Internet. Among them are FLAC [18], Monkey’s Audio [19], and OptimFROG [20].
On the other end of the bit-rate scale, approaches for scalable perceptual audio coding have been developed in recent years. Within the International Telecommunication Union (ITU) scalability in wide-band audio coding has been adopted only recently [21]. Within MPEG-4 Audio [3], scalability was developed and adopted some time earlier [22]–[24]. This includes the realm of speech coding where scalability is provided both for harmonic vector excitation coding (HVXC) [25], [26] and code-excited linear prediction (CELP) [27], [28].
In this context the ISO/MPEG Audio standardization group decided to start a new work item to explore technology for lossless and near-lossless coding of audio signals by issuing a call for proposals for relevant technology in 2002 [29]. Three specifications emerged from this call as amendments to the MPEG-4 Audio standard. First, the standard on lossless coding of 1-bit oversampled signals [30] specifies the lossless compression of highly oversampled 1-bit sigma–delta-modulated audio signals as they are stored on the SACD media under the name Direct Stream Digital (DSD) [31]. Second, the audio lossless coding (ALS) specification [32] describes technology for the lossless coding of PCM-coded signals at sampling rates up to 192 kHz as well as floating-point audio.
*Manuscript received 2006 March 23; revised 2006 October 24 and November 27.
**Currently with Dolby Laboratories, San Francisco, CA.
J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February
This paper provides an overview of the third descendant, the scalable lossless coding (SLS) specification [33], which extends traditional methods for perceptual audio coding toward lossless coding of high-resolution audio signals in a scalable way. Specifically, it allows scaling up from a perceptually coded representation (MPEG-4 advanced audio coding, AAC) to a fully lossless representation with high definition, including a wide range of intermediate near-lossless representations. The paper starts with an explanation of the general concept, discusses the underlying novel technology components, and characterizes the codec in terms of compression performance and complexity. Finally a number of application scenarios are briefly outlined.
1 CONCEPT AND TECHNOLOGY OVERVIEW
This section provides an overview of the principle and the structure of the scalable lossless coding technology and its combination with the AAC perceptual audio coder. This combination will be referred to as high-definition advanced audio coding (HD-AAC) in the following.
1.1 General System Structure
Fig. 1 shows a very high-level view of the structure of the HD-AAC encoder. The input signal is first coded by an AAC (“core layer”) encoder. The SLS algorithm uses the output to enhance the system’s performance toward lossless coding, resulting in an enhancement layer. The two layers of information are subsequently multiplexed into one high-definition bit stream. Decoding of an HD-AAC bit stream is illustrated in Fig. 2. From the HD-AAC bit stream, decoders are able either to decode the perceptually coded AAC part only, or to use the additional SLS information to produce losslessly/near-losslessly coded audio for high-definition applications.
1.2 AAC Background
Since the MPEG-4 SLS coder has been designed to operate as an enhancement to MPEG-4 AAC [3], [34], its structure is closely related to that of the underlying AAC core coder [35]. This section sketches the architectural features of MPEG-4 AAC as they are relevant to the SLS enhancement technology.
The underlying AAC codec provides efficient perceptual audio coding with high quality and achieves broadcast quality at a bit rate of about 64 kbps per channel [36]. Fig. 3 gives a very concise view of the AAC encoder’s structure. The audio signal is processed in a blockwise spectral representation using the modified discrete cosine transform (MDCT) [37]. The resulting 1024 spectral values are quantized and coded considering the required accuracy as demanded by the perceptual model. This is done to minimize the perceptibility of the introduced quantization distortion by exploiting masking effects. Several neighboring spectral values are grouped into so-called scale factor bands, sharing the same scale factor for quantization. Prior to the quantization/coding (Q/C) tool, a number of processing tools operate on the spectral coefficients in order to improve coding performance for certain situations. The most important tools are the following.
• The temporal noise shaping (TNS) tool [38] carries out predictive filtering across frequency in order to achieve a temporal shaping of the quantization noise according to the signal envelope and in this way optimize temporal masking of the quantization distortion.
• The M/S stereo coding tool [39] provides sum/difference coding of channel pairs, exploits interchannel redundancy for near-monophonic signals, and avoids binaural unmasking.
1.3 HD-AAC/SLS Enhancement
The SLS scalable lossless enhancement works on top of this AAC architecture. The structures of an HD-AAC encoder and decoder are shown in Figs. 4 and 5.
In the encoder the audio signal is converted into a spectral representation using the integer modified discrete cosine transform (IntMDCT) [40], [41]. This transform represents an invertible integer approximation of the MDCT and is well-suited for lossless coding in the frequency domain. Other AAC coding tools, such as mid/side coding or temporal noise shaping, are also considered and performed on the IntMDCT spectral coefficients in an invertible integer fashion, thus maintaining the similarity between the spectral values used in the AAC coder and in the lossless enhancement.
Fig. 1. HD-AAC encoder structure.
Fig. 2. HD-AAC decoder structure.
Fig. 3. Structure of AAC encoder (simplified).
The link between the perceptual core and the scalable lossless enhancement layer of the coder [42] is provided by an error-mapping process. The error-mapping process removes the information that has already been coded in the perceptual (AAC) path from the IntMDCT spectral coefficients such that only the resulting (IntMDCT) residuals are coded in the enhancement encoder, and in this way the coding efficiency benefits from the underlying AAC layer. The error-mapping process also preserves the probability distribution skew of the original IntMDCT coefficients and thus permits an efficient encoding of the residual by means of two bit-plane coding processes, namely, bit-plane Golomb coding (BPGC) [43] and context-based arithmetic coding (CBAC) [44], and a so-called low-energy-mode encoder, all of which will be described later in further detail.
By using the bit-plane coding process the lossless enhancement is performed in a fine-grain scalable way. A lossless reconstruction is obtained if all the bit planes of the residual are coded, transmitted, and decoded completely. If only parts of the bit planes are decoded, a lossy reconstruction of the signal is obtained with a quality between the AAC layer’s and lossless reconstruction. In order to achieve optimal perceptual quality at intermediate bit rates, the bit-plane coding is started from the most significant bit (MSB) for all scale-factor bands, and progresses toward the least significant bit (LSB) for all bands (see Fig. 6). In this way the bit-plane coding process preserves the overall spectral shape of the quantization noise, as it results from the noise-shaping process of the AAC perceptual coder, and thus takes advantage of the AAC perceptual model.
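The scan order described above can be illustrated with a minimal Python sketch. It assumes, as a simplification, that each scale-factor band is described only by its number of residual bit planes; the names `bitplane_scan_order` and `truncate_scan` are hypothetical and do not come from the SLS specification.

```python
def bitplane_scan_order(planes_per_band):
    """Return (band, plane) pairs in the order they would be scanned.

    planes_per_band[b] is the number of residual bit planes in band b;
    plane index planes_per_band[b] - 1 is that band's MSB, plane 0 its LSB.
    """
    order = []
    deepest = max(planes_per_band, default=0)
    for depth in range(deepest):          # depth 0 = each band's own MSB
        for band, planes in enumerate(planes_per_band):
            plane = planes - 1 - depth
            if plane >= 0:                # band still has planes left
                order.append((band, plane))
    return order


def truncate_scan(order, kept_entries):
    """Model a truncated stream: only the first kept_entries pairs survive."""
    return order[:kept_entries]


scan = bitplane_scan_order([3, 2, 4])
near_lossless = truncate_scan(scan, 6)    # two planes of every band decoded
```

Because every pass removes one plane from each band, a truncated scan leaves roughly the same number of undecoded planes below each band's MSB, which is how the relative noise shape across bands is kept in this toy model.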
The following sections will describe the principles underlying the design of SLS in greater detail by focusing on its IntMDCT filter bank, error-mapping strategy, and entropy coding parts.
2 FILTER BANK
2.1 IntMDCT
The IntMDCT, as introduced in [40], is an invertible integer approximation of the MDCT, which is obtained by utilizing the “lifting scheme” [45] or “ladder network” [46]. It enables efficient lossless coding of audio signals [40] by means of entropy coding of integer spectral coefficients. Furthermore, as the IntMDCT closely approximates the behavior of the MDCT, it allows the strategies of perceptual and lossless audio coding in the frequency domain to be combined into a common framework [42].
Fig. 7 illustrates the close relationship between IntMDCT and MDCT for a small audio segment by displaying the respective magnitude spectra of both filter banks. The difference between MDCT and IntMDCT values is visible as a small noise floor that is typically much lower than the error introduced by perceptual coding. Thus the IntMDCT allows the quantization error of an MDCT-based perceptual codec to be coded efficiently in the frequency domain.
Fig. 4. Structure of HD-AAC encoder.
Fig. 5. Structure of HD-AAC decoder.
The following section provides some detail on the derivation of the IntMDCT from the MDCT.
2.2 Decomposition of MDCT
The MDCT, defined by
$$X(m) = \sqrt{\frac{2}{N}} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\frac{(2k + 1 + N)(2m + 1)\pi}{4N}, \qquad m = 0, \ldots, N - 1 \tag{1}$$
with the time-domain input x(k) and the windowing func- tion w(k) can be decomposed into two blocks, namely, windowing and time-domain aliasing (TDA) and discrete cosine transform of type IV (DCT-IV). This is illustrated in Fig. 8 for both forward MDCT and inverse MDCT.
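In the forward IntMDCT, the windowing/TDA block is calculated by 3N/2 so-called lifting steps:
$$\begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix} \mapsto \begin{pmatrix} 1 & -\dfrac{w(N-1-k) - 1}{w(k)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -w(k) & 1 \end{pmatrix} \begin{pmatrix} 1 & -\dfrac{w(N-1-k) - 1}{w(k)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix}, \qquad k = 0, \ldots, \frac{N}{2} - 1. \tag{2}$$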
After each lifting step a rounding operation is applied to stay in the integer domain. Every lifting step can be in- verted by simply adding the subtracted value.
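The effect of rounding after each lifting step can be seen in a small Python sketch. It is a sketch under stated assumptions rather than the normative procedure: the rotation is parameterized by a generic angle, the sign convention (adding in the forward direction, subtracting in the inverse) is chosen for illustration, and the function names are hypothetical. Python's built-in `round` is used as the rounding operation.

```python
import math


def lift_rotate(x0, x1, angle):
    """Integer approximation of a rotation by `angle` using three lifting steps."""
    a = (math.cos(angle) - 1.0) / math.sin(angle)
    b = math.sin(angle)
    x0 += round(a * x1)   # lifting step 1, rounded to stay in the integer domain
    x1 += round(b * x0)   # lifting step 2
    x0 += round(a * x1)   # lifting step 3
    return x0, x1


def unlift_rotate(y0, y1, angle):
    """Exact inverse: undo the three steps in reverse order."""
    a = (math.cos(angle) - 1.0) / math.sin(angle)
    b = math.sin(angle)
    y0 -= round(a * y1)
    y1 -= round(b * y0)
    y0 -= round(a * y1)
    return y0, y1


angle = math.pi * (2 * 5 + 1) / (4 * 16)   # sine-window angle for k = 5, N = 16
pair = (1234, -567)
rotated = lift_rotate(*pair, angle)
restored = unlift_rotate(*rotated, angle)   # equals pair
```

Each inverse step recomputes exactly the rounded value that the forward step used, so the integer input is recovered exactly even though the rotation itself is only approximated.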
2.3 Integer DCT-IV
For the IntMDCT, the DCT-IV is calculated in an invertible integer fashion, called the integer DCT-IV. The multidimensional lifting (MDL) scheme [41], [47] is applied in order to reduce the required rounding operations in the invertible integer approximation as much as possible and in this way minimize the approximation error noise floor (see Fig. 7). The following block matrix decomposition for an invertible matrix T and the identity matrix I shows the basic principle of the MDL scheme,
$$\begin{pmatrix} T & 0 \\ 0 & T^{-1} \end{pmatrix} = \begin{pmatrix} -I & 0 \\ T^{-1} & I \end{pmatrix} \begin{pmatrix} I & -T \\ 0 & I \end{pmatrix} \begin{pmatrix} 0 & I \\ I & T^{-1} \end{pmatrix}. \tag{3}$$
The three blocks in this decomposition are the so-called MDL steps. Similar to the conventional lifting steps, they can be transferred to invertible integer mappings by rounding the floating-point values after being processed by T or $T^{-1}$, and they can be inverted by subtracting the values that have been added.
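A minimal Python sketch of three MDL steps is given below. It assumes a small orthogonal DCT-IV matrix as T (which is its own inverse) and one particular ordering and sign convention of the steps that yields diag(T, T⁻¹); all function names are hypothetical, and pure Python lists stand in for an optimized transform.

```python
import math


def dct_iv_matrix(n):
    """Orthogonal DCT-IV matrix; it is symmetric and therefore its own inverse."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.cos(math.pi * (2 * m + 1) * (2 * k + 1) / (4 * n))
             for k in range(n)] for m in range(n)]


def rounded_product(matrix, vector):
    """Multiply and round every output value to an integer."""
    return [round(sum(a * v for a, v in zip(row, vector))) for row in matrix]


def mdl_forward(x, y, t, t_inv):
    """Map integer blocks (x, y) to integer approximations of (T x, T^-1 y)."""
    v = [xi + r for xi, r in zip(x, rounded_product(t_inv, y))]   # MDL step 1
    u = [yi - r for yi, r in zip(y, rounded_product(t, v))]       # MDL step 2
    w = [vi + r for vi, r in zip(v, rounded_product(t_inv, u))]   # MDL step 3
    return [-ui for ui in u], w


def mdl_inverse(p, q, t, t_inv):
    """Undo the three MDL steps in reverse order."""
    u = [-p_i for p_i in p]
    v = [qi - r for qi, r in zip(q, rounded_product(t_inv, u))]
    y = [ui + r for ui, r in zip(u, rounded_product(t, v))]
    x = [vi - r for vi, r in zip(v, rounded_product(t_inv, y))]
    return x, y


t = dct_iv_matrix(4)
x, y = [100, -20, 3, 7], [5, 60, -8, 0]
p, q = mdl_forward(x, y, t, t)
x_back, y_back = mdl_inverse(p, q, t, t)   # equals (x, y)
```

Only three vectors of length N are rounded for two blocks of N values, which matches the count of 3N rounding operations per pair of blocks quoted in the text for the stereo case.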
Fig. 6. Bit-plane scan process in SLS.
Fig. 7. IntMDCT and MDCT magnitude spectra.
Fig. 8. MDCT and inverse MDCT by windowing/TDA and DCT-IV.
When coding stereo signals, this decomposition is used to obtain an integrated calculation of the M/S matrix and the integer DCT-IV for the left and right channels. The number of required rounding operations is 3N per channel pair, or 3N/2 per channel, which is the same number as for the windowing/TDA stage. As a whole, this so-called stereo IntMDCT requires only three rounding operations per sample, including M/S processing. Concerning mono signals, the same structure can be used, but has to be extended by some additional lifting steps to obtain the integer DCT-IV of one block; see [47]. This mono IntMDCT requires four rounding operations per sample.
2.4 Noise Shaping
The lossless coding efficiency of the IntMDCT is further improved by utilizing a noise-shaping technique, introduced in [48]. In the lifting steps where time-domain signals are processed, the rounding operations include an error feedback mechanism to provide a spectral shaping of the approximation noise. This approximation noise affects the lossless coding efficiency mainly in the high-frequency region where audio signals usually carry only a small amount of energy, especially at sampling rates of 96 kHz and above. Hence the low-pass characteristics of the approximation noise improve the lossless coding efficiency. A first-order noise-shaping filter is employed in the three stages of lifting steps in the windowing/TDA processing and in the first rounding stage of the integer DCT-IV processing. Fig. 9 compares the resulting approximation error between the IntMDCT values and the MDCT values rounded to integer, when the IntMDCT operates both with and without noise shaping.
3 ERROR MAPPING AND RESIDUAL CALCULATION
The objective of the error-mapping/residual calculation stage is to produce an integer enhancement signal that enables lossless reconstruction of the audio signal while consuming as few bits as possible. Rather than encoding all IntMDCT coefficients c[k] directly in the lossless enhancement layer, it is more efficient to make use of the information that has already been coded by the AAC layer. This is achieved by the error-mapping process, which produces a deterministic residual signal between the IntMDCT spectral values and their counterparts from the AAC layer.
In order to produce a residual signal e[k] with the smallest possible entropy, given the core AAC quantized value i[k], it is sufficient in most cases to use the minimum mean square error (MMSE) residual obtained by subtracting from the IntMDCT coefficient c[k] its MMSE reconstruction $\hat{c}[k] \triangleq E\{c[k] \mid i[k]\}$, given i[k], that is,
$$e[k] = c[k] - \hat{c}[k]. \tag{4}$$
Here E{·} denotes the expectation operation. However, in SLS a somewhat different approach is adopted, where the residual signal is given by
$$e[k] = \begin{cases} c[k] - \mathrm{thr}(i[k]), & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases} \tag{5}$$
where thr(i[k]) is the next quantized value closer to zero with respect to i[k], and is calculated via table lookup and linear interpolation to ensure the deterministic behavior necessary for lossless coding. This error-mapping process is illustrated in Fig. 10, where two different cases are shown. In the first case the IntMDCT coefficients c[k] belong to a scale-factor band that has been quantized and coded at the AAC encoder (significant band). For these coefficients the residual coefficients are obtained by subtracting the quantization thresholds from their corresponding c[k], resulting in a residual spectrum with reduced amplitude. In the other case c[k] belongs to a band that is not coded or has been quantized to zero in the AAC encoder (insignificant band). In this case the residual spectrum is simply the IntMDCT spectrum itself.
Fig. 9. Mean-squared approximation error of stereo IntMDCT (including M/S) with and without noise shaping.
Fig. 10. Illustration of error-mapping process.
This error-mapping process offers many advantages that permit better coding efficiency in the enhancement layer. First, as illustrated in Fig. 11, we notice that if the amplitude of c[k] is distributed exponentially, the amplitude of e[k] will likewise be distributed approximately exponentially. Thus it can be coded very efficiently by using the BPGC/CBAC coding process described in the next section. Secondly, for a significant band we find the following property for the coefficients c[k]:
$$|\mathrm{thr}(i[k])| \le |c[k]| < |\mathrm{thr}(i[k])| + \Delta(i[k]) \tag{6}$$
where
$$\Delta(i[k]) = \mathrm{thr}(|i[k]| + 1) - \mathrm{thr}(|i[k]|) \tag{7}$$
is the quantization step size of i[k]. Clearly, the magnitude of e[k] is bounded by the value of Δ(i[k]), and its sign is also identical to that of i[k] for i[k] ≠ 0. As a result, no additional data transmission is needed for conveying the sign information. This is referred to as “implicit signaling” in the SLS specification.
This implicit signaling mechanism assumes that the input to the SLS layer and the AAC core quantization value are “well synchronized.” This may not be true in all cases, given that AAC encoders have all the freedom to optimize the encoding process and thus may produce an i[k] that does not satisfy Eq. (6). Thus SLS also includes an “explicit signaling” mechanism, which employs an MMSE error mapping as defined in Eq. (4), where the reconstruction ĉ[k] is approximated by the AAC inverse quantization value. In this case all the side information necessary for decoding e[k] is signaled explicitly from the encoder to the decoder.
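The implicit-signaling case can be sketched in Python as follows. This is a toy model under stated assumptions: the threshold table `THR` and the quantizer `toy_quantize` are hypothetical stand-ins for the AAC quantizer and the table lookup with interpolation used by the specification, chosen so that the interval property of Eq. (6) holds by construction.

```python
import bisect

# Hypothetical integer thresholds: THR[n] is the lower edge of the quantization
# interval with index magnitude n. These values are illustrative only.
THR = [0, 4, 10, 18, 27, 38, 50]


def sign(v):
    return (v > 0) - (v < 0)


def toy_quantize(c):
    """Stand-in for the AAC quantizer: index of the interval that contains |c|."""
    if abs(c) >= THR[-1]:
        raise ValueError("coefficient outside the range of the toy table")
    return sign(c) * (bisect.bisect_right(THR, abs(c)) - 1)


def thr(i):
    """Interval edge closer to zero for quantized value i, with the sign of i."""
    return sign(i) * THR[abs(i)]


def step(i):
    """Quantization step size of i, as in Eq. (7)."""
    return THR[abs(i) + 1] - THR[abs(i)]


def error_map(c, i):
    """Residual of Eq. (5): subtract the threshold in significant bands only."""
    return c - thr(i) if i != 0 else c


def inverse_error_map(e, i):
    """Decoder side: add the threshold back to recover c exactly."""
    return e + thr(i) if i != 0 else e


c = -12
i = toy_quantize(c)          # quantized value from the toy core layer
e = error_map(c, i)          # residual passed to the enhancement layer
```

In this model the residual of a significant coefficient never changes sign relative to i[k] and stays below the step size, which is the property that makes a separate sign symbol unnecessary.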
4 ENTROPY CODING
To achieve best efficiency in the compression of the enhancement layer information, several methods for entropy coding of the IntMDCT residual information are employed.
4.1 Combined BPGC/CBAC Coding
The bit-plane Golomb code (BPGC) coding process used in SLS is basically a bit-plane coding scheme where the bit-plane symbols are coded arithmetically with a structural frequency assignment rule. Considering an input data vector e = {e[0], …, e[N − 1]} for which N is the dimension of e, each element e[k] in e is first represented in a binary format as
$$e[k] = (2s[k] - 1) \sum_{j=0}^{M-1} b[k, j]\, 2^{j} \tag{8}$$
with the sign symbol
$$s[k] \triangleq \begin{cases} 1, & e[k] \ge 0 \\ 0, & e[k] < 0 \end{cases}, \qquad k = 0, \ldots, N - 1 \tag{9}$$
and bit-plane symbols b[k, j] ∈ {0, 1}, j = 0, …, M − 1, where M is the MSB for e that satisfies $2^{M-1} \le \max\{|e[k]|\} < 2^{M}$, k = 0, …, N − 1. The bit-plane symbols are then scanned from MSB to LSB over all the elements in e, and coded by using an arithmetic code with a structural frequency assignment $Q_j^L$ given by
$$Q_j^L = \begin{cases} \dfrac{1}{1 + 2^{2^{j-L}}}, & j \ge L \\ \dfrac{1}{2}, & j < L \end{cases} \tag{10}$$
where the “lazy-plane” parameter L can be selected using the adaptation rule
$$L = \min\{L' \in \mathbb{Z} \mid 2^{L'+1} N \ge A\} \tag{11}$$
and A is the absolute sum of the data vector e.
Although this BPGC coding process delivers excellent compression performance for data that stem from an independent and identically distributed (iid) source with Laplacian distribution [43], it lacks the capability to explore the statistical dependencies between data samples that may exist in certain sources to achieve better compression performance. These correlations can be captured very effectively by incorporating context modeling technology into the BPGC coding process, where the frequency assignment for arithmetic coding of bit-plane symbols is not only dependent on the distance of the current bit plane from the lazy-plane parameter as in the frequency assignment rule [Eq. (10)], but also on other possible elements that may affect the probability distribution of these bit-plane symbols.
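The adaptation rule for the lazy-plane parameter and the structural frequency assignment can be sketched in Python. The function names are hypothetical, and the sketch assumes that the assignment is read as the probability given to the symbol 1 in bit plane j, with probability 1/2 for planes below L; the arithmetic coder itself is omitted.

```python
def _rule_holds(lazy, n, a):
    """Check 2**(lazy + 1) * n >= a using integer arithmetic only."""
    shift = lazy + 1
    if shift >= 0:
        return (n << shift) >= a
    return n >= (a << -shift)


def lazy_plane(e):
    """Smallest integer L with 2**(L + 1) * N >= A, where A = sum of |e[k]|."""
    n = len(e)
    a = sum(abs(v) for v in e)
    if n == 0 or a == 0:
        raise ValueError("rule has no minimum for an empty or all-zero vector")
    lazy = 0
    while not _rule_holds(lazy, n, a):
        lazy += 1
    while _rule_holds(lazy - 1, n, a):
        lazy -= 1
    return lazy


def prob_one(j, lazy):
    """Probability assigned to symbol 1 in bit plane j (assumed reading of Q)."""
    d = j - lazy
    if d < 0:
        return 0.5                 # "lazy" planes are coded with probability 1/2
    if d >= 10:
        return 0.0                 # 2**(2**d) exceeds double range here
    return 1.0 / (1.0 + 2.0 ** (2 ** d))


residual = [3, -1, 0, 7, -2, 1, 0, 2]
lazy = lazy_plane(residual)
table = [prob_one(j, lazy) for j in range(lazy + 3, lazy - 2, -1)]
```

The table shrinks rapidly above the lazy plane, reflecting that a 1 becomes increasingly unlikely in the upper bit planes of a Laplacian-like source.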
In the context of lossless coding of IntMDCT spectral data for audio, elements that possibly affect the distribution of the bit-plane symbols include the frequency locations of the IntMDCT spectral data, the magnitude of the adjacent spectral lines, and the status of the AAC core quantizer. In order to capture these correlations, several contexts are used in the context-based arithmetic code (CBAC) of SLS. The guide in selecting these contexts is to try to find those contexts that are “most” correlated to the distribution of the bit-plane symbols.
In CBAC, three types of contexts are used, namely, the frequency band (FB) context, the distance to lazy (D2L) context, and the significant state (SS) context. The detailed context assignments are summarized in the following.
Fig. 11. Distribution of residual signal from error-mapping process.
Context 1: Frequency Band (FB) It is found in [44] that the probability distribution of bit-plane symbols of IntMDCT varies for different frequency bands. Therefore in CBAC the IntMDCT spectral data are classified into three different FB contexts, namely, low band (0–4 kHz), midband (4–11 kHz), and high band (above 11 kHz).
Context 2: Distance to Lazy (D2L) The D2L context is defined as the distance of the current bit plane j to the BPGC lazy-plane parameter L, as defined in the following equation:
$$\mathrm{D2L} = \begin{cases} 3 - j + L, & j - L \ge -2 \\ 6, & \text{otherwise.} \end{cases} \tag{12}$$
This is motivated by the BPGC frequency assignment rule, Eq. (10), which is based on the fact that the skew of the probability distribution of bit-plane symbols from a source with a (near) Laplacian distribution tends to decrease as the number of D2L decreases. To reduce the total number of D2L contexts, all the bit planes with j − L < −2 are grouped into one context where all the bit-plane symbols are coded with probability 0.5.
Context 3: Significant State (SS) The SS context attempts to capture the factors that correlate with the distribution of the magnitude of the IntMDCT residual in one place. These include the magnitude of the adjacent IntMDCT spectral lines and the quantization interval of the AAC core quantizer if it has been quantized previously in the core encoder. Further detail on the SS context can be found in [49].
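The first two context types can be written down directly from the definitions above. The following Python sketch uses hypothetical function names; the treatment of lines lying exactly on a band edge and the use of the line center frequency are assumptions, and the SS context is left out because its details are given in [49].

```python
def line_frequency(k, transform_length, sample_rate):
    """Center frequency in Hz of spectral line k (assumed half-line offset)."""
    return (k + 0.5) * sample_rate / (2.0 * transform_length)


def fb_context(freq_hz):
    """Frequency band context: 0 = low, 1 = mid, 2 = high band."""
    if freq_hz < 4000.0:
        return 0
    if freq_hz < 11000.0:
        return 1
    return 2


def d2l_context(j, lazy):
    """Distance-to-lazy context of Eq. (12) for bit plane j and lazy plane L."""
    return 3 - j + lazy if j - lazy >= -2 else 6


contexts = [(fb_context(line_frequency(k, 1024, 48000)), d2l_context(j, 5))
            for k, j in [(10, 7), (300, 5), (600, 2)]]
```

Bit planes more than two planes below the lazy plane all share the value 6, which corresponds to the single grouped context coded with probability 0.5.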
4.2 Low-Energy-Mode Coding
The BPGC/CBAC coding process described earlier works well for sources with (near) Laplacian distribution, which usually is the case for most audio signals [50]. However, it was also found that for some music items there are time/frequency (T/F) regions with very low energy levels where the IntMDCT spectral data are in fact dominated by the rounding errors of the IntMDCT (see Fig. 7) with a distribution that is substantially different from Laplacian. In order to encode those low-energy regions efficiently, the BPGC/CBAC coding process is replaced by low-energy-mode coding, as shown in the following.
The low-energy-mode coding is invoked for scale factor bands for which the BPGC parameter L is smaller than or equal to 0. Then the amplitude of the residual spectral data e[k] is first converted into a unary binary string b = {b[0], b[1], …, b[pos], …}, as illustrated in Table 1, with M being the maximum bit plane. It can be seen that the probability distribution of these symbols is a function of the position pos and the distribution of e[k]:
$$\Pr\{b[\mathrm{pos}] = 1\} = \Pr\{e[k] > \mathrm{pos} \mid e[k] \ge \mathrm{pos}\} \tag{13}$$
where 0 ≤ pos < $2^M$. b[pos] is then coded arithmetically depending on its position pos and the BPGC parameter L with a trained frequency table.
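The binarization of Table 1 can be sketched in Python as follows. The function names are hypothetical, and the arithmetic coding of the resulting symbols with the trained frequency table is not modeled.

```python
def binarize(amplitude, m):
    """Unary string for an amplitude in 0 .. 2**m - 1, following Table 1."""
    limit = (1 << m) - 1
    if not 0 <= amplitude <= limit:
        raise ValueError("amplitude does not fit into m bit planes")
    bits = [1] * amplitude
    if amplitude < limit:
        bits.append(0)        # terminating zero, omitted for the largest value
    return bits


def debinarize(bits, m):
    """Recover the amplitude by counting ones up to the first zero."""
    limit = (1 << m) - 1
    amplitude = 0
    for bit in bits:
        if bit == 0 or amplitude == limit:
            break
        amplitude += 1
    return amplitude


strings = [binarize(a, 2) for a in range(4)]
```

The symbol at position pos is 1 exactly when the amplitude exceeds pos, given that the string has reached that position, which is the conditional probability expressed in Eq. (13).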
4.3 Smart Decoding
Due to their fine-grain scalability, HD-AAC bit streams can be truncated at any bit rate lower than what would be needed for a fully lossless reconstruction to produce near-lossless representations of the audio signal. In the context of arithmetic decoding, the smart decoding method provides a way to optimally decode such a truncated bit stream. It decodes additional symbols in the absence of incoming bits when a decoding buffer still contains meaningful information for arithmetic decoding in the CBAC/BPGC mode and/or low-energy mode. Decoding continues up to the point where no ambiguity exists in determining a symbol [51].
5 OTHER CODING TOOLS
As counterparts to the underlying AAC perceptual audio coder, SLS provides a number of integerized versions of AAC coding tools.
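5.1 Integer M/S
In the AAC codec the M/S tool allows an individual choice between mid/side and left/right coding modes on a scale-factor-band basis. The stereo IntMDCT described in Section 2.3 can provide either a global left/right or a global mid/side spectral representation. In order to make the integer spectral values fit the spectral values from the AAC core on a scale-factor-band basis, an invertible integer version of the M/S mapping is used. It is based on a lifting decomposition of the normalized M/S matrix, that is, a rotation by π/4,
$$\begin{pmatrix} \cos\frac{\pi}{4} & -\sin\frac{\pi}{4} \\ \sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{pmatrix} = \begin{pmatrix} 1 & \dfrac{\cos(\pi/4) - 1}{\sin(\pi/4)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin\frac{\pi}{4} & 1 \end{pmatrix} \begin{pmatrix} 1 & \dfrac{\cos(\pi/4) - 1}{\sin(\pi/4)} \\ 0 & 1 \end{pmatrix}. \tag{14}$$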
5.2 Integer TNS
When the temporal noise-shaping (TNS) tool is used in the AAC core, the resulting MDCT spectral values deviate from the IntMDCT spectral values. In order to compensate for this, the same TNS filter as in the AAC core is applied to the integer spectral values in the lossless enhancement. To assure lossless operation, the TNS filter is converted to a deterministic invertible integer filter.
Table 1. Binarization of IntMDCT error spectrum at low-energy mode.
Amplitude of e[k] | Binary string {b[pos]}
0 | 0
1 | 1 0
2 | 1 1 0
… | …
2^M − 2 | 1 1 … 1 0
2^M − 1 | 1 1 … 1 1
pos | 0 1 2 3 …
6 BIT STREAM MULTIPLEXING
From a mechanical point of view, the SLS coded data, including the core layer AAC bit stream and the enhancement bit stream, can be carried in multiple elementary streams (ES) in an MPEG-4 system [52]. As shown in Fig. 12, the AAC bit stream is carried in the so-called base layer ES, and the enhancement bit stream is carried in one or more enhancement layer ESs. Each ES is thus composed of a sequence of access units (AUs), where one AU contains one audio frame from the AAC bit stream or the enhancement bit stream.
From an application point of view, such a bit stream structure provides great flexibility in constructing either a large-step scalable system or a fine-grain scalable system with SLS. For example, in a scalable audio streaming application, the server stores multiple SLS ESs at predefined bit rates, each assigned with a different stream priority. During the streaming, the ESs are transmitted in the order of their stream priority, and ESs with lower stream priority are dropped by either the streaming server or the network gateway whenever the transmission bandwidth is insufficient to stream a full-rate lossless bit stream. Alternatively it is also possible to implement a lightweight bit stream truncation algorithm in the streaming server or in the multimedia gateway to truncate the enhancement stream AUs directly according to the available bandwidth to achieve the fine granular bit rate scalability.
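The second, fine-grain option can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: access units are treated as opaque byte strings, any header fields that a real truncator would need to keep consistent are ignored, and the function names are hypothetical.

```python
def frame_budget_bytes(bit_rate, frame_length, sample_rate):
    """Whole bytes available per frame at the given bit rate (bits per second)."""
    return (bit_rate * frame_length) // (8 * sample_rate)


def truncate_frame(core_au, enhancement_au, budget):
    """Keep the core AU intact and cut the enhancement AU to the remaining room."""
    room = max(budget - len(core_au), 0)
    return core_au, enhancement_au[:room]


# One frame of 1024 samples at 48 kHz, streamed at 256 kbit/s.
budget = frame_budget_bytes(256_000, 1024, 48_000)
core, enhancement = truncate_frame(bytes(170), bytes(900), budget)
```

Since the enhancement data are ordered from most to least significant bit planes, cutting the tail of each AU in this way removes the least important information first.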
7 MODES OF OPERATION
So far the HD-AAC coder has been described as a combination of a regular AAC core layer coder (such as low complexity AAC) and an SLS enhancement layer. This section introduces further features offered by the SLS enhancement layer that concern its combination with the AAC coder, and its ability to be used in combination with other types of AAC-based codecs, such as AAC scalable or AAC/BSAC.
7.1 Oversampling Mode

In the context of high-definition audio applications it is frequently desirable to achieve lossless signal reconstruction at sampling rates of 96 kHz, or even 192 kHz. While the MPEG-4 AAC coder supports these sampling rates, it typically achieves best coding efficiency for high-quality perceptual coding at sampling rates between 32 and 48 kHz. Thus in order to allow both efficient core layer coding at common rates and lossless reconstruction at higher rates, MPEG-4 SLS includes an additional feature called “oversampling mode.” This refers to the possibility of letting the lossless enhancement operate at a sampling rate higher than that of the AAC core codec. The ratio between the SLS sampling rate and the AAC sampling rate is called “oversampling factor” and can be either 1, 2, or 4. For example, the lossless enhancement can operate at a rate of 192 kHz, whereas the AAC core operates at 48 kHz, see Table 2.

Table 2. Example combinations of sampling rates for AAC core and lossless enhancement.
[Table body not recoverable; row labels: AAC @ 48 kHz, AAC @ 96 kHz, AAC @ 192 kHz]

Fig. 12. Structure of MPEG-4 SLS bit stream.

The mapping between the two coding layers is achieved by using a time-aligned framing and a correspondingly longer IntMDCT in the lossless enhancement. For example, an IntMDCT of the size of 4096 spectral values is used in case of oversampling by a factor of 4, and the 1024 MDCT values from the AAC core are mapped to the lower 1024 IntMDCT values. In addition to allowing for optimum AAC performance, the lossless performance is improved when using longer IntMDCT transforms. (A transform length of 2048 or 4096 provides a better lossless performance for stationary signals than a transform length of 1024.)
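The relation between oversampling factor, IntMDCT length, and time-aligned framing can be checked with a few lines of arithmetic. The sketch below assumes the AAC long-frame size of 1024 spectral values mentioned in the text; the function names are hypothetical and chosen for illustration only.

```python
AAC_FRAME_LENGTH = 1024  # MDCT values per AAC frame, as stated in the text


def intmdct_length(oversampling_factor):
    """IntMDCT size of the lossless enhancement for a given factor."""
    if oversampling_factor not in (1, 2, 4):
        raise ValueError("oversampling factor must be 1, 2, or 4")
    return AAC_FRAME_LENGTH * oversampling_factor


def split_bins(oversampling_factor):
    """Return (bins shared with the AAC core, bins coded by SLS only)."""
    n = intmdct_length(oversampling_factor)
    return range(0, AAC_FRAME_LENGTH), range(AAC_FRAME_LENGTH, n)


def frame_duration_ms(sampling_rate_hz, frame_length):
    """Duration covered by one frame of new samples, in milliseconds."""
    return 1000.0 * frame_length / sampling_rate_hz


# AAC core at 48 kHz, lossless enhancement at 192 kHz (factor 4).
core_ms = frame_duration_ms(48_000, AAC_FRAME_LENGTH)
sls_ms = frame_duration_ms(192_000, intmdct_length(4))
core_bins, sls_only_bins = split_bins(4)
```

Both layers cover the same time span per frame, which is what makes the bin-wise mapping of the 1024 core values onto the lower 1024 IntMDCT values possible.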
7.2 Combination with MPEG-4 Scalable AAC
In order to account for varying or unknown transmission capacity, the MPEG-4 AAC codec also provides several scalable coding modes [3], [34]. MPEG-4 scalable AAC allows perceptual audio coding with one or more AAC mono or stereo layers and can be combined with an SLS enhancement layer. This results in an overall coder that offers fine-grain scalability in the range between lossless reconstruction and the AAC representation, and scalability in several steps within the perceptually coded AAC representation.
7.3 Combination with MPEG-4 AAC/BSAC

Similar to the combination with scalable AAC, the SLS enhancement can also be operated on top of MPEG-4 AAC/BSAC as a core layer coder. This provides a fine-grain scalable representation both between lossless and perceptually coded audio and within the perceptually coded range. The latter has a granularity of 1 kbps per channel.
7.4 Stand-Alone Operation

Finally, the SLS lossless enhancement can also operate as a straightforward stand-alone codec, without any underlying core codec. Even without a core layer, this operation mode offers both full lossless coding capability and fine-grain scalability.
8 PERFORMANCE
This section quantifies the performance of the HD-AAC codec in various operation modes. While the compression ratio can be seen as the sole measure of merit for lossless operation scenarios, an evaluation of performance for near-lossless operation requires audio quality measurements at various data rates and operating points.
8.1 Lossless Compression Performance

This section reports the performance of HD-AAC in terms of its ability for lossless compression of various audio material. As a figure of merit, the compression ratio is defined as

\[
\text{compression ratio} = \frac{\text{original file size}}{\text{compressed file size}}. \tag{15}
\]
Tables 3 and 4 show the lossless compression performance for two major sets of test material, that is, the MPEG-4 lossless audio coding test set (donated by Matsushita Corporation and containing in part recordings performed by the New York Symphonic Ensemble) and a set of commercial CDs. For both sets the compression results are given for an HD-AAC configuration with an AAC core layer running at 128 kbit/s stereo plus SLS enhancement, and for an SLS stand-alone configuration.
As could be expected from theory, it can be observed in Table 3 that an increase in word length reduces the average compression ratio (due to the fact that the least significant bits of the PCM codewords are more random and thus less compressible). On the other hand, increasing the sampling rate improves compression because of the increased correlation between adjacent samples (assuming sound material with typical high-frequency characteristics).
For the MPEG-4 lossless audio test set, an average compression ratio slightly above 2:1 is achieved at a sampling rate of 48 kHz and 16-bit word length (2.09 with the AAC core and 2.20 in stand-alone mode; see Table 3). This is competitive with the best of other known lossless compression systems [53]. It can also be observed that the AAC-based mode requires an additional bit rate of only 30–40 kbps compared to the stand-alone mode for lossless representation. This reduces the bit rate consumption by 90–100 kbps compared to simulcast solutions that transmit both a 128-kbps AAC bit stream and a stand-alone SLS lossless bit stream simultaneously.
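The figures quoted above can be reproduced from the 48 kHz/16 bit row of Table 3. The sketch below assumes two-channel (stereo) PCM and 1 kbps = 1000 bit/s; the function names are illustrative, and the bit rates 735 and 698 kbps are the Table 3 averages.

```python
def pcm_bit_rate_kbps(sampling_rate_hz, word_length_bits, channels=2):
    """Uncompressed PCM bit rate in kbps (stereo assumed by default)."""
    return sampling_rate_hz * word_length_bits * channels / 1000.0


def compression_ratio(original, compressed):
    """Eq. (15), applied to bit rates instead of file sizes."""
    return original / compressed


pcm = pcm_bit_rate_kbps(48_000, 16)        # uncompressed stereo PCM
hd_aac_kbps = 735                          # SLS + AAC core, Table 3
sls_kbps = 698                             # SLS stand-alone, Table 3
aac_kbps = 128                             # AAC core layer

ratio_hd_aac = compression_ratio(pcm, hd_aac_kbps)
ratio_sls = compression_ratio(pcm, sls_kbps)
overhead_kbps = hd_aac_kbps - sls_kbps               # cost of embedding AAC
simulcast_kbps = aac_kbps + sls_kbps                 # two separate streams
saving_kbps = simulcast_kbps - hd_aac_kbps           # gain over simulcast
```

For this row the embedded AAC layer costs 37 kbps, and the combined stream saves 91 kbps relative to simulcast, in line with the ranges given in the text.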
8.2 Near-Lossless Compression Performance

The bit-plane coding of residual spectral values (that is, of the AAC quantization error) allows the initial AAC quantization to be refined successively as more bits from the SLS enhancement layer are decoded. With each additional decoded bit plane the quantization error is reduced by about 6 dB. Consequently an increasing safety margin with respect to audibility is added as the bit rate increases.
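The 6-dB figure follows from the fact that each additional bit plane halves the quantization step size and hence the error amplitude. A short numerical check, with a function name chosen for this sketch only:

```python
import math


def error_reduction_db(extra_bit_planes):
    """Reduction of quantization error amplitude, in dB, when the step
    size is halved once per additional decoded bit plane."""
    return 20.0 * math.log10(2.0 ** extra_bit_planes)


one_plane_db = error_reduction_db(1)    # about 6.02 dB
four_planes_db = error_reduction_db(4)  # about 24.08 dB
```

This is an idealized relation; it assumes the error in each spectral band scales directly with the step size.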
Table 3. Lossless compression results for MPEG-4 lossless audio test set.

                  SLS + AAC @ 128 kbps/Stereo         SLS Stand-Alone
                  (AAC @ 48 kHz sampling rate)
                  Compression   Average Bit           Compression   Average Bit
                  Ratio         Rate (kbps)           Ratio         Rate (kbps)
48 kHz/16 bit     2.09           735                  2.20           698
48 kHz/24 bit     1.55          1490                  1.58          1454
96 kHz/24 bit     2.09          2201                  2.13          2160
192 kHz/24 bit    2.60          3543                  2.63          3509
Overall           2.08          1992                  2.12          1955
8.2.1 Evaluation of Near-Lossless Audio Quality

While it may seem sufficient for most purposes to provide perceptually transparent reproduction of audio signals by using conventional perceptual audio coders (such as AAC at a sufficient bit rate), there are applications that demand still higher audio quality. This is especially the case for professional audio production facilities, such as archiving and broadcasting, in which audio signals may undergo many cycles of encoding/decoding (tandem coding) before being delivered to the consumer. This leads to an accumulation of introduced coding distortion and may lead to unacceptable final audio quality [54], unless substantial headroom toward audibility is provided by each coding step, for example, by using coding algorithms with very high quality or bit rate.
ITU-R BS.1548-1 [55] defines the requirements for audio coding systems for digital broadcasting, assuming a codec chain consisting of so-called contribution, distribution, and emission codecs. According to this recommendation, and based on ITU-R BS.1116-1 [56], audio codecs for contribution and distribution should fulfill the following requirements:
The quality of sound reproduced after a reference contribution/distribution cascade [. . .] should be subjectively indistinguishable from the source for most types of audio programme material. Using the triple stimuli double blind with hidden reference test, described in Recommendation ITU-R BS.1116 [. . .], this requires mean scores generally higher than 4.5 in the impairment 5-grade scale, for listeners at the reference listening position. The worst rated item should not be graded lower than 4.
In accordance with these recommendations, tests were run on signals encoded and decoded with HD-AAC. Instead of running numerous listening tests for subjective quality assessment at individual operating points, the evaluation employed the PEAQ measurement (BS.1387-1) [57], which provides methods for objective measurements of perceived audio quality in scenarios that are normally assessed by ITU-R BS.1116 testing. The most essential results can be seen in Figs. 13–15; see also [58].
The graphs show the estimated subjective sound quality expressed as objective difference grade (ODG) values, which were computed by a PEAQ measurement. The evaluation procedure consists of multiple cycles of tandem coding/decoding with up to 16 cycles. The standard set of critical MPEG-4 audio items for perceptual audio coding evaluations was used. ODG values of 0, −1, −2, −3, −4 correspond to a subjective audio quality of “indistinguishable from original,” “perceptible but not annoying,” “slightly annoying,” “annoying,” and “very annoying,” respectively.
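For reading the tandem coding graphs, the ODG scale above can be expressed as a small lookup. The sketch below maps a measured ODG to the nearest of the five anchor points; rounding to the nearest anchor is a simplification made here for illustration, not something PEAQ itself defines, and the function name is hypothetical.

```python
ODG_LABELS = {
    0: "indistinguishable from original",
    -1: "perceptible but not annoying",
    -2: "slightly annoying",
    -3: "annoying",
    -4: "very annoying",
}


def describe_odg(odg):
    """Return the label of the anchor point nearest to a measured ODG.

    Values outside the range [-4, 0] are clamped to the scale ends.
    """
    nearest = int(round(odg))
    nearest = max(-4, min(0, nearest))
    return ODG_LABELS[nearest]


label_good = describe_odg(-0.2)
label_bad = describe_odg(-3.7)
```

A worst-case view over a tandem chain can then be obtained by applying `describe_odg` to the minimum ODG over all test items at a given cycle count.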
Fig. 13 shows the achieved ODG values as a function of tandem cycles for a traditional AAC coder running at a bit rate of 128 kbps/stereo. As expected, it can be observed that the audio quality degrades significantly with an increasing number of tandem cycles, depending on the test item. For this reason tandem coding is not a recommended practice for such coders.
Fig. 14 displays the corresponding tandem coding results for the HD-AAC combination running at 512 kbps/stereo (AAC at 128 kbps + SLS enhancement at 384 kbps). It can be noted that the audio quality remains consistently at a very high level, even after a total of 16 tandem cycles. This illustrates the high robustness of the HD-AAC representation against tandem coding. According to these measurements, the aforementioned BS.1548-1 audio quality requirement is fulfilled with a considerable safety margin. Furthermore, when placed in tandem with AAC (such as for final audio distribution over a narrow-band channel), the resulting audio quality is not degraded significantly by the preceding HD-AAC tandem cascade. Further details can be found in [59].

Table 4. Lossless compression results for commercial CD test set.

                                                                        Compression Ratio
CD Items (16 bit/44.1 kHz)                                              SLS + AAC   SLS Stand-Alone
ACDC—Highway to Hell (Sony 80206)                                       1.31        1.36
Avril Lavigne—Let Go (Arista 14740)                                     1.36        1.41
Backstreet Boys—Greatest Hits Chapter One (Jive 41779)                  1.39        1.45
Brian Setzer—The Dirty Boogie (Interscope 90183)                        1.43        1.49
Cowboy Junkies—Trinity Session (RCA-8568)                               1.93        2.04
Grieg—Peer Gynt, von Karajan (DG 439010)                                2.63        2.83
Jennifer Warnes—Famous Blue Raincoat (BMG 258418)                       2.07        2.20
Marlboro Music Festival—DISC A (Bridge 9108)                            2.23        2.35
Marlboro Music Festival—DISC B (Bridge 9108)                            2.20        2.33
Nirvana—Nirvana (Interscope 493523)                                     1.50        1.56
Philip Jones—40 Famous Marches, CD1 (Decca 416241)                      1.99        2.11
Philip Jones—40 Famous Marches, CD2 (Decca 416241)                      2.00        2.11
Pink Floyd—Dark Side of the Moon (Capitol 46001)                        1.76        1.85
Rebecca Pidgeon—The Raven (Chesky 115)                                  1.88        1.97
Ricky Martin (Sony 69891)                                               1.33        1.38
Schubert Piano Trio in E-flat (Sony 48088)                              2.74        2.90
Spaniels—The Very Best Of (Collectables 7243)                           2.41        2.62
Steeleye Span—Below the Salt (Shanachie 79039)                          1.85        1.95
Suzanne Vega—Solitude Standing (A&M 5136)                               1.74        1.83
Westminster Concert Bell Choir—Christmas Bells (Gothic Records 49055)   2.55        2.71
Overall                                                                 1.85        1.94
8.2.2 Stand-Alone SLS Operation

The SLS codec can also operate as a stand-alone lossless codec when the AAC core codec is not used, sometimes also referred to as the “noncore mode.” Despite its simple structure (only IntMDCT and BPGC/CBAC modules are used), this mode allows efficient lossless coding [59]. Furthermore, fine-grain scalability by truncated bit-plane coding is also possible in this mode. Given that the stand-alone SLS codec does not include any perceptual model to estimate masking thresholds, it is interesting to investigate the audio quality resulting from a truncation of the SLS bit stream.

Fig. 13. Test results: AAC tandem coding.

Fig. 14. Test results: HD-AAC tandem coding.
Due to the behavior of bit-plane coding in this mode, a constant signal-to-noise ratio is achieved in each scale-factor band. With each additional bit plane the signal-to-noise ratio improves by about 6 dB. While this behavior does not allow SLS to compete with efficient perceptual codecs at low bit rates (for example, AAC at 128 kbps/stereo), this simple approach works quite well at higher bit rates in the near-lossless range. Fig. 15 shows tandem coding results for the stand-alone SLS codec operating at 512 kbps/stereo. It reaches about the same near-lossless audio quality as the AAC-based HD-AAC mode discussed in the previous section.
At a constant bit rate of 768 kbps most test items still require the truncation of some coder frames. Nevertheless the corresponding PEAQ measurements indicate that no degradation of subjective audio quality occurs in this tandem coding scenario for both the AAC-based mode and the stand-alone mode; see [59]. This provides an interesting operating point for HD-AAC modes, corresponding to a guaranteed 2:1 compression. While other stand-alone lossless codecs can also provide an average compression of 2:1 for suitable test material, their peak compression performance can be much lower, depending on the audio material to be encoded. In contrast, HD-AAC is able to guarantee a certain compression ratio while providing lossless or near-lossless signal representation, depending on the input signal.
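The guaranteed-rate operating point can be illustrated by a per-frame byte budget. The sketch below assumes 48-kHz sampling with 1024 new samples per frame and no bit reservoir between frames, which the text does not specify; the function names and the example frame sizes are hypothetical.

```python
def frame_budget_bytes(bit_rate_bps, frame_length, sampling_rate_hz):
    """Bytes available per frame at a constant bit rate."""
    return int(bit_rate_bps * frame_length / sampling_rate_hz) // 8


def cap_frames(frame_sizes_bytes, budget_bytes):
    """Return (transmitted size, still lossless?) for each frame.

    Frames within the budget are sent in full and remain lossless;
    larger frames are truncated and become near-lossless.
    """
    return [(min(size, budget_bytes), size <= budget_bytes)
            for size in frame_sizes_bytes]


budget = frame_budget_bytes(768_000, 1024, 48_000)
capped = cap_frames([1500, 2048, 2600], budget)  # illustrative frame sizes
```

Relative to 16-bit stereo PCM at 48 kHz (1536 kbps), a 768-kbps cap corresponds to the 2:1 ratio mentioned above, regardless of how compressible an individual item is.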
9 DECODER COMPLEXITY
The computational complexity for SLS decoding can be evaluated by counting the total number of standard instructions (multiplications, additions, bit shifts, comparisons, memory transfers) required for performing the decoding process on a generic 32-bit fixed-point CPU.
The main components contributing to the computational complexity of SLS are:
1) IntMDCT filter bank
2) Bit-plane arithmetic decoder
3) AAC Huffman decoding
4) AAC + SLS inverse error mapping
5) Integer M/S stereo coding
6) Unpacking of tables.

Items 3) to 5) are only required in the AAC-based mode, item 6) only if the necessary tables are not precomputed.
9.1 Number of Instructions

Table 5 lists the number of instructions required for decoding in the AAC-based mode, with the AAC core operating at 64 kbps per channel. Table 6 shows the corresponding numbers for SLS in stand-alone mode (without the AAC core layer). In both tables, values are provided for implementations with tables unpacked in advance and with tables unpacked in place.
Fig. 15. Test results: stand-alone SLS tandem coding.
9.2 ROM Requirements

For an implementation of SLS in stand-alone mode, a ROM size of 4 kbytes is required. For the AAC-based mode, the ROM requirement is 45 kbytes. As can be seen from Tables 5 and 6, a tradeoff between ROM requirement and number of instructions can be made by precomputing the necessary table values. More details on SLS computational complexity can be found in [59]. The computational complexity of AAC decoding is analyzed in [60].
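To put the per-sample operation counts into perspective, they can be converted into an operation rate. The sketch below uses the "Combined" values for a frame length of 1024 from Tables 5 and 6 and assumes 48-kHz sampling, one cycle per operation, and that the counts apply per sample of a single channel; none of these assumptions is stated explicitly in the tables, and the function name is illustrative.

```python
def mega_ops_per_second(ops_per_sample, sampling_rate_hz, channels=1):
    """Operation rate in millions of INT32 operations per second."""
    return ops_per_sample * sampling_rate_hz * channels / 1e6


# "Combined" column, frame length 1024 or 128.
with_core_preunpacked = 250.83   # Table 5, tables preunpacked
with_core_in_place = 273.58      # Table 5, tables unpacked in place
stand_alone_preunpacked = 225.67 # Table 6, tables preunpacked

rate_with_core = mega_ops_per_second(with_core_preunpacked, 48_000)
rate_stand_alone = mega_ops_per_second(stand_alone_preunpacked, 48_000)

# Extra operations per sample paid for not precomputing the tables.
unpack_penalty = with_core_in_place - with_core_preunpacked
```

Under these assumptions the AAC-based mode needs roughly 12 million operations per second per channel, and unpacking the tables in place adds about 23 operations per sample in exchange for the smaller ROM footprint.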
10 APPLICATIONS
As the primary functionality of HD-AAC audio coding is lossless audio coding, it can be used in applications that require bit-exact reconstruction, such as studio operation, music disc delivery, or audio archiving. Due to its inherent scalability, HD-AAC audio coding technology in fact fits into virtually every application that requires audio compression. Several potential application scenarios are listed here.
Studio Operations: HD-AAC audio coding technology is useful for the storage of audio at various points in studio operations such as recording, editing, mixing, and premastering, as studio procedures are designed to preserve the highest levels of quality. The scalability of the SLS layer also provides a solution for situations in which the bandwidth is not sufficient to support fully lossless quality.
Archival Application: Archives of sound recordings are very common in studios, record labels, libraries, and so on. These archives are very large, so compression is essential. In addition, the scalability of the SLS technology makes it possible to extract lower bit-rate versions of the archive’s lossless audio items at any time, which allows applications such as remote data browsing.
Broadcast Contribution/Distribution Chain: In a broadcast environment HD-AAC audio coding technology could be used in all stages comprising archiving, contribution/distribution, and emission. In the broadcast chain one main feature of the technology can be used: In every stage where lower bit rates are required, the bit stream is merely truncated, and no reencoding is required.
Consumer Disc-Based Delivery: HD-AAC technology can also be used in consumer disc-based delivery of music content. It enables the music disc to deliver both lossless and lossy audio on the same medium.
Internet Delivery of Audio: In such an application scenario the available transmission bandwidth can vary dramatically across different access network technologies and over time. As a result, the same audio content at a variety of bit rates and qualities may need to be kept ready at the server side. HD-AAC technology provides a “one-file” solution for such a requirement.
Audio Streaming: HD-AAC technology delivers the vital bit-rate scalability for streaming applications on channels with variable quality of service (QoS) conditions. Examples of this kind of streaming application include Internet audio streaming and multicast streaming applications that feed several channels of differing capacity.
Digital Home: The idea of the digital home is to create an open and transparent home network platform that enables consumers to easily create, use, manage, and share digital content such as audio, video, or image. In a typical setup for audio, the user can download the HD-AAC coded bit streams in lossless quality from the service provider and archive them on the home music server. These bit streams are then streamed or downloaded to different audio terminals at differing quality for playback.

Table 5. Maximum number of INT32 operations per sample for SLS decoding with AAC core.

Frame Length    Muls    Adds/Subs   Ors     Shifts   Negs    Movs    Combined (all ops: 1 cycle)
Tables Preunpacked
4096 or 512     19.50    93.46      34.50   72.60    34.26   16.54   270.86
2048 or 256     20.25    91.22      31.50   73.54    32.25   16.29   265.05
1024 or 128     18.00    88.99      28.50   68.50    30.85   15.99   250.83
Tables Unpacked in Place
4096 or 512     19.50   123.21      34.50   88.01    34.26   16.54   316.10
2048 or 256     20.25   117.47      31.50   85.54    32.25   16.29   303.30
1024 or 128     18.00   105.24      28.50   75.00    30.85   15.99   273.58

Table 6. Maximum number of INT32 operations per sample for SLS stand-alone decoding.

Frame Length    Muls    Adds/Subs   Ors     Shifts   Negs    Movs    Combined (all ops: 1 cycle)
Tables Preunpacked
4096 or 512     18.00    86.63      34.50   67.10    29.93    9.54   245.70
2048 or 256     18.75    84.39      31.50   68.04    27.92    9.29   239.89
1024 or 128     16.50    82.16      28.50   63.00    26.52    8.99   225.67
Tables Unpacked in Place
4096 or 512     18.00   116.38      34.50   82.60    29.93    9.54   290.05
2048 or 256     18.75   110.64      31.50   80.04    27.92    9.29   278.14
1024 or 128     16.50    98.41      28.50   69.50    26.52    8.99   248.42
11 CONCLUSIONS
The new ISO/MPEG specification for scalable lossless coding extends the well-known perceptual coding scheme AAC toward lossless and near-lossless operation, and in this way enables its use in the context of high-definition applications. The HD-AAC scheme offers competitive lossless compression rates at all relevant operating points (word length and sampling rate). For distribution on bandwidth-limited channels a perceptually coded compatible AAC bit stream can simply be extracted from the composite HD-AAC stream. Alternatively, the SLS part can also be used as a simple and versatile stand-alone compression engine. In both cases the fidelity of the signal representation can be scaled with fine granularity within a wide range of near-lossless representations. This enables lossless or near-lossless transmission of high-definition audio with a guaranteed maximum rate. We anticipate that this flexibility will make HD-AAC the technology of choice for many applications that call for both very high audio quality and delivery over a wide range of transmission channels.
12 ACKNOWLEDGMENT
The authors would like to thank all their colleagues at the MPEG audio subgroup who supported the lossless standardization activity, especially Takehiro Moriya (NTT) for inspiring these standardization activities, Tilman Liebchen (Technical University of Berlin) for chairing the ad hoc group on lossless audio coding, and Yuriy Reznik (Real Networks/Qualcomm) for his thorough complexity evaluations.
13 REFERENCES
[1] ISO/IEC 11172-3, “Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1992).
[2] ISO/IEC 13818-3, “Information Technology—Generic Coding of Moving Pictures and Associated Audio—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1994).
[3] ISO/IEC 14496-3:2001, “Coding of Audio-Visual Objects—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (2001).
[4] ISO/IEC 14496-3:2001/Amd.1:2003, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 1: Bandwidth Extension,” International Standards Organization, Geneva, Switzerland (2003).
[5] M. Dietz, L. Liljeryd, K. Kjorling, and O. Kunz, “Spectral Band Replication—A Novel Approach in Audio Coding,” presented at the 112th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 50, pp. 509, 510 (2002 June), convention paper 5553.
[6] ISO/IEC 14496-3:2001/Amd.2:2004, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 2: Parametric Coding for High Quality Audio,” International Standards Organization, Geneva, Switzerland (2004).
[7] C. den Brinker, E. Schuijers, and W. Oomen, “Parametric Coding for High-Quality Audio,” presented at the 112th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 50, p. 510 (2002 June), convention paper 5554.
[8] ISO/IEC FCD 23003-1, “MPEG-D (MPEG Audio Technologies)—Part 1: MPEG Surround,” International Standards Organization, Geneva, Switzerland (2006).
[9] J. Breebaart, J. Herre, C. Faller, J. Roeden, F. Myburg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. Kjoerling, and W. Oomen, “MPEG Spatial Audio Coding/MPEG Surround: Overview and Current Status,” presented at the 119th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 53, p. 1228 (2005 Dec.), convention paper 6599.
[10] “Special Issue: High-Resolution Audio,” J. Audio Eng. Soc., vol. 52, pp. 116–260 (2004 Mar.).
[11] “DVD Forum,” http://www.dvdforum.org/forum.shtml (2004).
[12] Royal Philips Electronics, “Super Audio CD Systems,” http://www.licensing.philips.com/information/sacd/ (2006).
[13] Blu-Ray Disc Association, “Blu-Ray Disc,” http://www.blu-raydisc.com/ (2006).
[14] Dolby Laboratories Inc., “MLP Lossless,” http://www.dolby.com/consumer/technology/mlp_lossless.html (2006).
[15] M. A. Gerzon, P. G. Craven, J. R. Stuart, M. J. Law, and R. J. Wilson, “The MLP Lossless Compression System,” in Proc. 17th AES Conf. (Florence, Italy, 1999 Sept.), pp. 61–75.
[16] DTS Inc., “DTS HD,” http://www.dtsonline.com/consumer/dtshd.php (2006).
[17] Apple Computer Inc., “Apple Quicktime,” http://www.apple.com/downloads/macosx/apple/quicktime651.html (2006).
[18] J. Coalson, “FLAC—Free Lossless Audio Codec,” http://flac.sourceforge.net (2006).
[19] M. T. Ashland, “Monkey’s Audio—A Fast and Powerful Lossless Audio Compressor,” http://www.monkeysaudio.com (2004).
[20] F. Ghido, “OptimFROG,” http://www.losslessaudio.org (2006).
[21] ITU-T G.729.1, “G.729 Based Embedded Variable Bit-Rate Coder: An 8–32 kbit/s Scalable Wideband Coder Bitstream Interoperable with G.729,” International Telecommunication Union, Geneva, Switzerland (2006).
[22] B. Grill, “A Bit-Rate Scalable Perceptual Coder for MPEG-4 Audio,” presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. 1005 (1997 Nov.), preprint 4620.
[23] J. Herre, E. Allamanche, K. Brandenburg, M. Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N. Iwakami, T. Norimatsu, M. Tsushima, and T. Ishikawa, “The Integrated Filterbank-Based Scalable MPEG-4 Audio Coder,” presented at the 105th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 46, p. 1039 (1998 Nov.), preprint 4810.
[24] S. H. Park, Y. B. Kim, S. W. Kim, and Y. S. Seo, “Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding,” presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. 1005 (1997 Nov.), preprint 4520.
[25] M. Nishiguchi, “MPEG-4 Speech Coding,” in Proc. 17th AES Conf. (Florence, Italy, 1999 Sept.), pp. 139–146.
[26] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsumoto, “Parametric Speech Coding—HVXC at 2.0–4.0 kbps,” presented at the IEEE Workshop on Speech Coding, Porvoo, Finland (1999 June).
[27] M. S. Schroeder and B. S. Atal, “Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates,” in Proc. IEEE ICASSP (Tampa, FL, 1985 Mar.), pp. 937–940.
[28] T. Nomura, M. Iwadare, M. Serizawa, and K. Ozawa, “A Bitrate and Bandwidth Scalable CELP Coder,” in Proc. IEEE ICASSP (Seattle, WA, 1998 May), pp. 341–344.
[29] ISO/IEC JTC1/SC29/WG11, “Final Call for Proposals on MPEG-4 Lossless Audio Coding,” MPEG2002/N5208, Shanghai, China (2002 Oct.).
[30] ISO/IEC 14496-3:2001/Amd.6:2005, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 6: Lossless Coding of Oversampled Audio,” International Standards Organization, Geneva, Switzerland (2005).
[31] E. Knapen, D. Reefman, E. Janssen, and F. Bruekers, “Lossless Compression of 1-Bit Audio,” J. Audio Eng. Soc., vol. 52, pp. 190–199 (2004 Mar.).
[32] ISO/IEC 14496-3:2005/Amd.2:2006, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 2: Audio Lossless Coding (ALS), New Audio Profiles and BSAC Extensions,” International Standards Organization, Geneva, Switzerland (2006).
[33] ISO/IEC 14496-3:2005/Amd.3:2006, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 3: Scalable Lossless Coding (SLS),” International Standards Organization, Geneva, Switzerland (2006).
[34] J. Herre and H. Purnhagen, “General Audio Coding,” in The MPEG-4 Book, IMSC Multimedia Ser., F. Pereira and T. Ebrahimi, Eds. (Prentice-Hall, Englewood Cliffs, NJ, 2002).
[35] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa, “ISO/IEC MPEG-2 Advanced Audio Coding,” J. Audio Eng. Soc., vol. 45, pp. 789–814 (1997 Oct.).
[36] ISO/IEC JTC1/SC29/WG11, “Report on the MPEG-2 AAC Stereo Verification Tests,” MPEG1998/N2006, San Jose, CA (1998 Feb.).
[37] J. Princen, A. Johnson, and A. Bradley, “Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation,” in Proc. IEEE ICASSP (Dallas, TX, 1987), pp. 2161–2164.
[38] J. Herre and J. D. Johnston, “Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS),” presented at the 101st Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint 4384.
[39] J. D. Johnston, J. Herre, M. Davis, and U. Gbur, “MPEG-2 NBC Audio—Stereo and Multichannel Coding Methods,” presented at the 101st Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint 4383.
[40] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, “Audio Coding Based on Integer Transforms,” presented at the 111th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 49, p. 1230 (2001 Dec.), convention paper 5471.
[41] R. Geiger, Y. Yokotani, and G. Schuller, “Improved Integer Transforms for Lossless Audio Coding,” in Proc. 37th Asilomar Conf. on Signals, Systems and Computers (Pacific Grove, CA, 2003 Nov.).
[42] R. Geiger, J. Herre, J. Koller, and K. Brandenburg, “IntMDCT—A Link between Perceptual and Lossless Audio Coding,” in Proc. IEEE ICASSP (Orlando, FL, 2002 May).
[43] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “Bit-Plane Golomb Code for Sources with Laplacian Distributions,” in Proc. IEEE ICASSP (Hong Kong, China, 2003 Apr.), pp. 277–280.
[44] R. Yu, X. Lin, S. Rahardja, C. C. Ko, and H. Huang, “Improving Coding Efficiency for MPEG-4 Audio Scalable Lossless Coding,” in Proc. IEEE ICASSP (Philadelphia, PA, 2005 May).
[45] I. Daubechies and W. Sweldens, “Factoring Wavelet Transforms into Lifting Steps,” Tech. Rep. Bell Laboratories, Lucent Technologies (1996).
[46] F. Bruekers and A. Enden, “New Networks for Perfect Inversion and Perfect Reconstruction,” IEEE J. Selected Areas Comm., vol. 10, pp. 130–137 (1992 Jan.).
[47] R. Geiger, Y. Yokotani, G. Schuller, and J. Herre, “Improved Integer Transforms Using Multi-Dimensional Lifting,” in Proc. IEEE ICASSP (Montreal, Canada, 2004 May).
[48] Y. Yokotani, R. Geiger, G. Schuller, S. Oraintara, and K. R. Rao, “Improved Lossless Audio Coding Using the Noise-Shaped IntMDCT,” presented at the IEEE 11th DSP Workshop, Taos Ski Valley, NM (2004 Aug.).
[49] R. Yu, X. Lin, S. Rahardja, and H. Haibin, “Proposed Core Experiment for Improving Coding Efficiency in MPEG-4 Audio Scalable Coding (SLS),” ISO/IEC JTC1/SC29/WG11, M10683, Munich, Germany (2004 Mar.).
[50] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A Statistics Study of the MDCT Coefficient Distribution for Audio,” in Proc. ICME (Taipei, Taiwan, 2004 June).
[51] K. H. Choo, E. Oh, J. H. Kim, and C. Y. Son, “Enhanced Performance in the Functionality of Fine Grain Scalability,” presented at the 119th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 53, pp. 1227, 1228 (2005 Dec.), convention paper 6597.
[52] ISO/IEC 14496–1:2004, “Coding of Audio-Visual Objects—Part 1: Systems,” International Standards Orga- nization, Geneva, Switzerland (2004).
[53] M. Hans and R. W. Schafer, “Lossless Compres- sion of Digital Audio,” IEEE Signal Process Mag., vol. 18, pp. 21–32 (2001 July).
[54] AES Technical Committee of Coding of Audio Signals, “Perceptual Audio Coders: What to Listen For,” CD-ROM with tutorial information and audio examples, Audio Engineering Society, New York (2001).
[55] ITU-R BS.1548-1, “User Requirements for Audio Coding Systems for Digital Broadcasting,” International Telecommunication Union, Geneva, Switzerland (2001–2002).
[56] ITU-R BS.1116-1, “Methods for the Subjective Assessment of Small Impairments in Audio Systems In-
cluding Multichannel Sound Systems,” International Tele- communication Union, Geneva, Switzerland (1994).
[57] ITU-R BS.1387-1, “Method for Objective Mea- surements of Perceived Audio Quality,” International Telecommunication Union, Geneva, Switzerland (1998).
[58] R. Geiger, M. Schmidt, J. Herre, and R. Yu, “MPEG-4 SLS—Lossless and Near-Lossless Audio Cod- ing Based on MPEG-4 AAC,” presented at the Interna- tional Symposium on Communications, Control and Sig- nal Processing, Marrakech, Morocco (2006 Mar.)
[59] ISO/IEC JTC1/SC29/WG11, “Verification Report on MPEG-4 SLS,” MPEG2005/N7687, Nice, France (2005 Oct.).
[60] ISO/IEC JTC1/SC29/WG11, “Revised Report on Complexity of MPEG-2 AAC Tools,” MPEG1999/N2957 (Melbourne, Australia, 1999 Oct.), http://www.chiariglione.org/mpeg/working_documents/mpeg-02/audio/AAC_tool_complexity(rev).zip
THE AUTHORS
Ralf Geiger received a diploma degree in mathematics from the University of Regensburg, Regensburg, Germany, in 1997.
In 1998 he joined the Audio/Multimedia Department at the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany. From 2000 to 2004 he was with the Fraunhofer Institute for Digital Media Technology (IDMT), Ilmenau, Germany. In 2005 he returned to Fraunhofer IIS.
Mr. Geiger is working on the development and standardization of perceptual and lossless audio coding
Rongshan Yu received a B.Eng. degree from Shanghai Jiaotong University, Shanghai, P. R. China, in 1995, and M.Eng. and Ph.D. degrees from the National University of Singapore in 2000 and 2005, respectively.
He was with the Centre for Signal Processing, School of Electrical and Electronics, Nanyang Technological University, Singapore, from 1999 to 2001, and with the Institute for Infocomm Research (I2R), A*STAR, Singapore, from 2001 to 2005. He is currently with Dolby Laboratories, San Francisco, CA, USA. His research interests include audio coding, data compression, and digital signal processing.
Jürgen Herre joined the Fraunhofer Institute for Integrated Circuits (IIS) in Erlangen, Germany, in 1989. Since then he has been involved in the development of perceptual coding algorithms for high-quality audio, including the well-known ISO/MPEG-Audio Layer III coder (aka MP3). In 1995 he joined Bell Laboratories for a postdoctoral term, working on the development of MPEG-2 advanced audio coding (AAC). Since the end of 1996 he has been back at Fraunhofer, working on the development of advanced multimedia technology, including MPEG-4, MPEG-7, and secure delivery of audiovisual content, currently as the chief scientist for the audio/multimedia activities at Fraunhofer IIS, Erlangen.
Susanto Rahardja received a Ph.D. degree in electrical and electronic engineering from the Nanyang Technological University, Singapore.
He joined the Centre for Signal Processing, Nanyang Technological University, in 1996 and has been a faculty member at its School of Electrical and Electronic Engineering since 2001. In 2002 he joined the Institute for Infocomm Research (I2R) and is currently the director of its Media Division. He oversees research in signal processing (audio coding, video/image processing), media analysis (text/speech, image, video), media security (biometrics, computer vision, and surveillance), and sensor networks.
Dr. Rahardja has published in more than 150 international journals and at conferences in the areas of digital
Xiao Lin received a Ph.D. degree from the Electronics and Computer Science Department of the University of Southampton, Southampton, UK, in 1993.
He worked at the Centre for Signal Processing, Nanyang Technological University, Singapore, as a research fellow and senior research fellow for about five years. Subsequently he joined DeSOC Technology as a technical director and then the Institute for Infocomm Research in 2002. There he was a member of the technical staff, lead scientist, and principal scientist and managed the Media Processing Department until 2006. He is now with Fortemedia Inc. as a senior director. He actively participated in international standards such as MPEG-4, JPEG2000, and JVT.
Dr. Lin is a senior member of the IEEE.
Markus Schmidt received a Dipl.-Ing. degree in media technology from the Technical University of Ilmenau, Germany, in 2004. During his studies he spent a year at the University of Strathclyde in Glasgow, Scotland.
After completing an internship at the Fraunhofer Institute for Digital Media Technology (IDMT) in Ilmenau, he joined the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany, in 2005. There his research interests include low-delay and lossless audio coding schemes and their implementation in real-time environments.