ISO/IEC MPEG-4 High-Definition Scalable Advanced Audio Coding*
RALF GEIGER,1 RONGSHAN YU,2** JÜRGEN HERRE,1 SUSANTO RAHARDJA,2
(ralf.geiger@iis.fraunhofer.de) (rzyu@dolby.com)
(juergen.herre@iis.fraunhofer.de) (rsusanto@i2r.a-star.edu.sg)
SANG-WOOK KIM,3 XIAO LIN,2 and MARKUS SCHMIDT1
(sangwookkim@samsung.com) (linxiao@fortemedia.com.cn)
(markus.schmidt@iis.fraunhofer.de)
1Fraunhofer IIS, Erlangen, Germany
2Institute for Infocomm Research, Singapore
3Samsung Electronics, Suwon, Korea
*Manuscript received 2006 March 23; revised 2006 October 24 and
November 27.
**Currently with Dolby Laboratories, San Francisco, CA.
J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February
Recently the MPEG Audio standardization group has successfully
concluded the standardization process on technology for lossless
coding of audio signals. A summary of the scalable lossless coding
(SLS) technology as one of the results of this standardization work
is given. MPEG-4 scalable lossless coding provides a fine-grain
scalable lossless extension of the well-known MPEG-4 AAC perceptual
audio coder up to fully lossless reconstruction at word lengths and
sampling rates typically used for high-resolution audio. The
underlying innovative technology is described in detail and its
performance is characterized for lossless and near-lossless
representation, both in conjunction with an AAC coder and as a
stand-alone compression engine. A number of application scenarios
for the new technology are discussed.
0 INTRODUCTION
Perceptual coding of high-quality audio signals has experienced a
tremendous evolution over the past two decades, both in terms of
research progress and in worldwide deployment in products. Examples
of successful applications include portable music players,
Internet audio, audio for digital media (such as VCD, DVD), and
digital broadcasting. Several phases of international
standardization have been conducted successfully [1]–[3]. Many
recent research and standardization efforts focus on achieving good
sound quality at even lower bit rates [4]–[9] to accommodate
storage and transmission channels with limited quality (such as
terrestrial and satellite broadcasting and third-generation mobile
telecommunication). Nevertheless, for other scenarios with higher
transmission bandwidth available, there is a general trend toward
providing the consumer with a media experience of extremely high
fidelity [10], as it is frequently associated with the terms “high
definition” and “high resolution” and dedicated media types, such
as DVD-Audio [11], SACD [12], HD-DVD [11], or Blu-Ray Disc [13]. In
the realm of audio, this is achieved by employing lossless formats
with high resolution (word length) and/or high sampling rate.
There exist several proprietary lossless formats, the most
prominent of which are MLP Lossless [14], [15], DTS-HD [16], and
Apple Lossless [17]. Furthermore, several freeware coding systems
have gained some prominence through the Internet. Among them are
FLAC [18], Monkey’s Audio [19], and OptimFROG [20].
On the other end of the bit-rate scale, approaches for scalable
perceptual audio coding have been developed in recent years.
Within the International Telecommunication Union (ITU),
scalability in wide-band audio coding has been adopted only
recently [21]. Within MPEG-4 Audio [3], scalability was
developed and adopted some time earlier [22]–[24]. This includes
the realm of speech coding, where scalability is provided both for
harmonic vector excitation coding (HVXC) [25], [26] and
code-excited linear prediction (CELP) [27], [28].
In this context the ISO/MPEG Audio standardization group decided to
start a new work item to explore technology for lossless and
near-lossless coding of audio signals by issuing a call for
proposals for relevant technology in 2002 [29]. Three
specifications emerged from this call
as amendments to the MPEG-4 Audio standard. First, the standard on
lossless coding of 1-bit oversampled signals [30] specifies the
lossless compression of highly oversampled 1-bit
sigma–delta-modulated audio signals as they are stored on SACD
media under the name Direct Stream Digital (DSD) [31]. Second, the
audio lossless coding (ALS) specification [32] describes
technology for the lossless coding of PCM-coded signals at sampling
rates up to 192 kHz as well as floating-point audio.
This paper provides an overview of the third descendant, the
scalable lossless coding (SLS) specification [33], which extends
traditional methods for perceptual audio coding toward lossless
coding of high-resolution audio signals in a scalable way.
Specifically, it allows scaling up from a perceptually coded
representation (MPEG-4 advanced audio coding, AAC) to a fully
lossless representation with high definition, including a wide
range of intermediate near-lossless representations. The paper
starts with an explanation of the general concept, discusses the
underlying novel technology components, and characterizes the
codec in terms of compression performance and complexity. Finally a
number of application scenarios are briefly outlined.
1 CONCEPT AND TECHNOLOGY OVERVIEW
This section provides an overview of the principle and the
structure of the scalable lossless coding technology and its
combination with the AAC perceptual audio coder. This combination
will be referred to as high-definition advanced audio coding
(HD-AAC) in the following.
1.1 General System Structure
Fig. 1 shows a very high-level view of the structure of
the HD-AAC encoder. The input signal is first coded by an AAC
(“core layer”) encoder. The SLS algorithm uses the output to
enhance the system’s performance toward lossless coding,
resulting in an enhancement layer. The two layers of information
are subsequently multiplexed into one high-definition bit stream.
Decoding of an HD-AAC bit stream is illustrated in Fig. 2. From the
HD-AAC bit stream, decoders are able either to decode the
perceptually coded AAC part only, or to use the additional SLS
information to produce losslessly/near-losslessly coded audio for
high-definition applications.
1.2 AAC Background
Since the MPEG-4 SLS coder has been designed to
operate as an enhancement to MPEG-4 AAC [3], [34], its structure is
closely related to that of the underlying AAC core coder [35]. This
section sketches the architectural features of MPEG-4 AAC as they
are relevant to the SLS enhancement technology.
The underlying AAC codec provides efficient perceptual audio
coding with high quality and achieves broadcast quality at a bit
rate of about 64 kbps per channel [36]. Fig. 3 gives a very concise
view of the AAC encoder’s structure. The audio signal is
processed in a blockwise spectral representation using the modified
discrete cosine transform (MDCT) [37]. The resulting 1024
spectral values are quantized and coded considering the required
accuracy as demanded by the perceptual model. This is done to
minimize the perceptibility of the introduced quantization
distortion by exploiting masking effects. Several neighboring
spectral values are grouped into so-called scale factor bands,
sharing the same scale factor for quantization. Prior to the
quantization/coding (Q/C) tool, a number of processing tools
operate on the spectral coefficients in order to improve coding
performance for certain situations. The most important tools are
the following.
• The temporal noise shaping (TNS) tool [38] carries out predictive
filtering across frequency in order to achieve a temporal shaping
of the quantization noise according to the signal envelope and in
this way optimize temporal masking of the quantization
distortion.
• The M/S stereo coding tool [39] provides sum/difference coding
of channel pairs, exploits interchannel redundancy for
near-monophonic signals, and avoids binaural unmasking.
1.3 HD-AAC/SLS Enhancement
The SLS scalable lossless enhancement works on top of
this AAC architecture. The structures of an HD-AAC encoder and
decoder are shown in Figs. 4 and 5.
In the encoder the audio signal is converted into a spectral
representation using the integer modified discrete cosine transform
(IntMDCT) [40], [41]. This transform represents an invertible
integer approximation of the MDCT and is well-suited for lossless
coding in the frequency domain. Other AAC coding tools, such as
mid/side coding or temporal noise shaping, are also considered
and performed on the IntMDCT spectral coefficients in an invertible
integer fashion, thus maintaining the similarity between the
spectral values used in the AAC coder and in the lossless
enhancement.
Fig. 1. HD-AAC encoder structure.
Fig. 2. HD-AAC decoder structure.
Fig. 3. Structure of AAC encoder (simplified).
The link between the perceptual core and the scalable lossless
enhancement layer of the coder [42] is provided by an error-mapping
process. The error-mapping process removes the information that has
already been coded in the perceptual (AAC) path from the IntMDCT
spectral coefficients such that only the resulting (IntMDCT)
residuals are coded in the enhancement encoder, and in this way the
coding efficiency benefits from the underlying AAC layer. The
error-mapping process also preserves the probability distribution
skew of the original IntMDCT coefficients and thus permits an
efficient encoding of the residual by means of two bit-plane coding
processes, namely, bit-plane Golomb coding (BPGC) [43] and
context-based arithmetic coding (CBAC) [44], and a so-called
low-energy-mode encoder, all of which will be described later in
further detail.
By using the bit-plane coding process the lossless enhancement is
performed in a fine-grain scalable way. A lossless reconstruction
is obtained if all the bit planes of the residual are coded,
transmitted, and decoded completely. If only parts of the bit
planes are decoded, a lossy reconstruction of the signal is
obtained with a quality between that of the AAC layer and that of
the lossless reconstruction. In order to achieve optimal perceptual
quality at intermediate bit rates, the bit-plane coding is started
from the most significant bit (MSB) for all scale-factor bands, and
progresses toward the least significant bit (LSB) for all bands
(see Fig. 6). In this way the bit-plane coding process preserves
the overall spectral shape of the quantization noise,
as it results from the noise-shaping process of the AAC perceptual
coder, and thus takes advantage of the AAC perceptual model.
The following sections will describe the principles underlying
the design of SLS in greater detail by focusing on its IntMDCT
filter bank, error-mapping strategy, and entropy coding
parts.
2 FILTER BANK
2.1 IntMDCT
The IntMDCT, as introduced in [40], is an invertible
integer approximation of the MDCT, which is obtained by utilizing
the “lifting scheme” [45] or “ladder network” [46]. It enables
efficient lossless coding of audio signals [40] by means of entropy
coding of integer spectral coefficients. Furthermore, as the
IntMDCT closely approximates the behavior of the MDCT, it allows
combining the strategies of perceptual and lossless audio coding
in the frequency domain into a common framework [42].
Fig. 7 illustrates the close relationship between IntMDCT and MDCT
for a small audio segment by displaying the respective magnitude
spectra of both filter banks. The difference between MDCT and
IntMDCT values is visible as a small noise floor that is typically
much lower than the error introduced by perceptual coding. Thus the
IntMDCT allows the quantization error of an MDCT-based perceptual
codec to be coded efficiently in the frequency domain.
Fig. 4. Structure of HD-AAC encoder.
Fig. 5. Structure of HD-AAC decoder.
The following section provides some detail on the derivation of
the IntMDCT from the MDCT.
2.2 Decomposition of MDCT
The MDCT, defined by
$$X(m) = \sqrt{\frac{2}{N}} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\frac{(2k + 1 + N)(2m + 1)\pi}{4N}, \quad m = 0, \ldots, N - 1 \tag{1}$$
with the time-domain input x(k) and the windowing function w(k),
can be decomposed into two blocks, namely, windowing and
time-domain aliasing (TDA) and discrete cosine transform of type IV
(DCT-IV). This is illustrated in Fig. 8 for both forward MDCT and
inverse MDCT.
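In the forward IntMDCT, the windowing/TDA block is calculated by
3N/2 so-called lifting steps:
$$\begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix} \mapsto \begin{pmatrix} 1 & \frac{w(N-1-k) - 1}{w(k)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ w(k) & 1 \end{pmatrix} \begin{pmatrix} 1 & \frac{w(N-1-k) - 1}{w(k)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix}, \quad k = 0, \ldots, \frac{N}{2} - 1. \tag{2}$$
After each lifting step a rounding operation is applied to stay in
the integer domain. Every lifting step can be inverted by simply
adding the subtracted value.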
2.3 Integer DCT-IV
For the IntMDCT, the DCT-IV is calculated in an invertible
integer fashion, called the integer DCT-IV. The
multidimensional lifting (MDL) scheme [41], [47] is applied in
order to reduce the required rounding operations in the invertible
integer approximation as much as possible and in this way minimize
the approximation error noise floor (see Fig. 7). The following
block matrix decomposition for an invertible matrix T and the
identity matrix I shows the basic principle of the MDL
scheme,
$$\begin{pmatrix} T & 0 \\ 0 & T^{-1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ -T^{-1} & I \end{pmatrix} \begin{pmatrix} -I & T \\ 0 & I \end{pmatrix} \begin{pmatrix} 0 & I \\ I & T^{-1} \end{pmatrix}. \tag{3}$$
The three blocks in this decomposition are the so-called MDL steps.
Similar to the conventional lifting steps, they can be transferred
to invertible integer mappings by rounding the floating-point
values after being processed by T or T^{-1}, and they can be
inverted by subtracting the values that have been added.
Fig. 6. Bit-plane scan process in SLS.
Fig. 7. IntMDCT and MDCT magnitude spectra.
Fig. 8. MDCT and inverse MDCT by windowing/TDA and DCT-IV.
When coding stereo signals, this decomposition is used to obtain an
integrated calculation of the M/S matrix and the integer DCT-IV for
the left and right channels. The number of required rounding
operations is 3N per channel pair, or 3N/2 per channel, which is
the same number as for the windowing/TDA stage. As a whole, this
so-called stereo IntMDCT requires only three rounding operations
per sample, including M/S processing. For mono signals,
the same structure can be used, but has to be extended by some
additional lifting steps to obtain the integer DCT-IV of one
block; see [47]. This mono IntMDCT requires four rounding
operations per sample.
2.4 Noise Shaping
The lossless coding efficiency of the IntMDCT is further
improved by utilizing a noise-shaping technique, introduced
in [48]. In the lifting steps where time-domain
signals are processed, the rounding operations include an error
feedback mechanism to provide a spectral shaping of the
approximation noise. This approximation noise affects the lossless
coding efficiency mainly in the high-frequency region where audio
signals usually carry only a small amount of energy, especially at
sampling rates of 96 kHz and above. Hence shaping the approximation
noise toward a low-pass characteristic improves the lossless coding
efficiency. A first-order noise-shaping filter is employed in the
three stages of lifting steps in the windowing/TDA processing and
in the first rounding stage of the integer DCT-IV processing. Fig.
9 compares the resulting approximation error between the IntMDCT
values and the MDCT values rounded to integer, when the IntMDCT
operates both with and without noise shaping.
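The error feedback idea can be sketched as follows. This is a minimal illustration and assumes a first-order feedback with coefficient 1, so that the rounding noise becomes e[n] + e[n-1], which has a spectral zero at half the sampling rate. The actual filter coefficient used in SLS is not stated here, and the function name is hypothetical.

```python
import math

def round_noise_shaped(values):
    """Round a sequence with first-order error feedback.

    Assumption of this sketch: the previous rounding error is added to the
    next value before rounding, so the total noise is e[n] + e[n-1].
    """
    out, e_prev = [], 0.0
    for v in values:
        target = v + e_prev                 # feed the previous error back in
        q = math.floor(target + 0.5)        # ordinary rounding to nearest integer
        e_prev = q - target                 # current rounding error, |e| <= 0.5
        out.append(q)
    return out

signal = [0.3 * n * math.sin(n) for n in range(64)]
shaped = round_noise_shaped(signal)
noise = [q - v for q, v in zip(shaped, signal)]
# Noise component at half the sampling rate (alternating-sign sum) stays tiny.
nyquist = sum((-1) ** n * e for n, e in enumerate(noise))
assert abs(nyquist) <= 0.5 + 1e-9
```

Because the feedback depends only on values that the decoder can recompute, a lifting step that uses such a rounding rule remains invertible as long as the decoder runs the same recursion.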
3 ERROR MAPPING AND RESIDUAL CALCULATION
The objective of the error-mapping/residual calculation stage is to
produce an integer enhancement signal that enables lossless
reconstruction of the audio signal while
consuming as few bits as possible. Rather than encoding all IntMDCT
coefficients c[k] directly in the lossless enhancement layer, it
is more efficient to make use of the information that has already
been coded by the AAC layer. This is achieved by the error-mapping
process, which produces a deterministic residual signal between the
IntMDCT spectral values and their counterparts from the AAC
layer.
In order to produce a residual signal e[k] with the smallest
possible entropy, given the core AAC quantized value i[k], it is
sufficient in most cases to use the minimum mean square error
(MMSE) residual, obtained by subtracting from the IntMDCT
coefficient c[k] its MMSE reconstruction ĉ[k] = E{c[k] | i[k]},
given i[k], that is,
$$e[k] = c[k] - \hat{c}[k]. \tag{4}$$
Here E{·} denotes the expectation operation. However, in SLS a
somewhat different approach is adopted, where the residual signal
is given by
$$e[k] = \begin{cases} c[k] - \mathrm{thr}(i[k]), & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases} \tag{5}$$
where thr(i[k]) is the next quantized value closer to zero with
respect to i[k], and is calculated via table lookup and linear
interpolation to ensure the deterministic behavior necessary for
lossless coding. This error-mapping process is illustrated in Fig.
10, where two different cases are shown. In the first case the
IntMDCT coefficients c[k] belong to a scale-factor band that has
been quantized and coded at the AAC encoder (significant band). For
these coefficients the residual coefficients are obtained by
subtracting the quantization thresholds from their corresponding
c[k], resulting in a residual spectrum with reduced amplitude. In
the other case c[k] belongs to a band that is not coded or has been
quantized to zero in the AAC encoder (insignificant band). In
this case the residual spectrum is simply the IntMDCT spectrum
itself.
Fig. 9. Mean-squared approximation error of stereo IntMDCT
(including M/S) with and without noise shaping.
Fig. 10. Illustration of error-mapping process.
This error-mapping process offers many advantages that permit
better coding efficiency in the enhancement layer. First, as
illustrated in Fig. 11, if the amplitude of c[k]
is distributed exponentially, the amplitude of
e[k] will likewise be distributed approximately exponentially.
Thus it can be coded very efficiently by using the
BPGC/CBAC coding process described in the next section. Second,
for a significant band we find the following property for the
coefficients c[k]:
$$|\mathrm{thr}(i[k])| \le |c[k]| < |\mathrm{thr}(i[k])| + \Delta(i[k]) \tag{6}$$
where
$$\Delta(i[k]) = \mathrm{thr}(|i[k]| + 1) - \mathrm{thr}(|i[k]|) \tag{7}$$
is the quantization step size of i[k]. Clearly, the magnitude of
e[k] is bounded by the value of Δ(i[k]), and its sign is also
identical to that of i[k] for i[k] ≠ 0. As a result, no additional
data transmission is needed for conveying the sign information.
This is referred to as “implicit signaling” in the SLS
specification.
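The threshold-based error mapping and its bound can be illustrated with a small sketch. As a loudly stated assumption, a dead-zone uniform quantizer with integer step size stands in for the nonuniform AAC quantizer, so thr(i) is simply i times the step. All function names are hypothetical.

```python
def quantize(c, step):
    """Hypothetical dead-zone uniform quantizer standing in for the AAC quantizer."""
    mag = abs(c) // step
    return mag if c >= 0 else -mag

def thr(i, step):
    """Boundary of the quantization interval of index i that lies closer to zero."""
    return i * step

def error_map(c, i, step):
    """Residual: subtract the threshold where i != 0, pass c through otherwise."""
    return c - thr(i, step) if i != 0 else c

def error_unmap(e, i, step):
    """Decoder side: add the same threshold back to recover c exactly."""
    return e + thr(i, step) if i != 0 else e

step = 8
for c in (37, -37, 5, -64):
    i = quantize(c, step)
    e = error_map(c, i, step)
    assert error_unmap(e, i, step) == c
    if i != 0:
        # Residual magnitude stays below the step size and shares the sign of i,
        # so the sign does not have to be transmitted again.
        assert abs(e) < step and e * i >= 0
```

With a real AAC core the threshold would instead follow from the core quantizer and its scale factors; the structure of the mapping and of its inverse stays the same.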
This implicit signaling mechanism assumes that the input to the
SLS layer and the AAC core quantization value are “well
synchronized.” This may not be true in all cases, given that AAC
encoders have all the freedom to optimize the encoding process and
thus may produce an i[k] that does not satisfy Eq. (6). Thus SLS
also includes an “explicit signaling” mechanism, which employs an
MMSE error mapping as defined in Eq. (4), where ĉ[k] is
approximated by the AAC inverse quantization value. In this case all the
side information necessary for decoding e[k] is signaled explicitly
from the encoder to the decoder.
4 ENTROPY CODING
To achieve best efficiency in the compression of the enhancement
layer information, several methods for entropy coding
of the IntMDCT residual information are employed.
4.1 Combined BPGC/CBAC Coding
The bit-plane Golomb coding (BPGC) process
used in SLS is basically a bit-plane coding scheme where
the bit-plane symbols are coded arithmetically with a structural
frequency assignment rule. Considering an input data vector
e = {e[0], ..., e[N − 1]}, where N is the dimension of e, each
element e[k] in e is first represented in a binary format as
$$e[k] = (2s[k] - 1) \sum_{j=0}^{M-1} b[k, j]\, 2^{j}, \quad k = 0, \ldots, N - 1 \tag{8}$$
with the sign symbol
$$s[k] = \begin{cases} 1, & e[k] \ge 0 \\ 0, & e[k] < 0 \end{cases}, \quad k = 0, \ldots, N - 1 \tag{9}$$
and bit-plane symbols b[k, j] ∈ {0, 1}, j = 0, ..., M − 1, where M
is the MSB for e that satisfies 2^(M−1) ≤ max{|e[k]|} < 2^M,
k = 0, ..., N − 1. The bit-plane symbols are then scanned
from MSB to LSB over all the elements in e, and coded by using an
arithmetic code with a structural frequency assignment Q_j^L
given by
$$Q_j^L = \begin{cases} \dfrac{1}{1 + 2^{2^{j-L}}}, & j \ge L \\ \dfrac{1}{2}, & j < L \end{cases} \tag{10}$$
where the “lazy-plane” parameter L can be selected using the
adaptation rule
$$L = \min\{L' \in \mathbb{Z} \mid 2^{L'+1} N \ge A\} \tag{11}$$
and A is the absolute sum of the data vector e.
Although this BPGC coding process delivers excellent
compression performance for data that stem from an independent
and identically distributed (iid) source with
Laplacian distribution [43], it lacks the capability to exploit
the statistical dependencies between data samples that may exist in
certain sources to achieve better compression performance. These
correlations can be captured very effectively by incorporating
context modeling technology into the BPGC coding process, where
the frequency assignment for arithmetic coding of bit-plane
symbols is not only dependent on the distance of the current bit plane
from the lazy-plane parameter as in the frequency assignment rule
[Eq. (10)], but also on other possible elements that may affect
the probability distribution of these bit-plane symbols.
In the context of lossless coding of IntMDCT spectral data for
audio, elements that possibly affect the distribution of the
bit-plane symbols include the frequency locations of the IntMDCT
spectral data, the magnitude of the adjacent spectral lines, and
the status of the AAC core quantizer. In order to capture these
correlations, several contexts are used in the context-based
arithmetic code (CBAC) of SLS. The guiding principle in selecting
these contexts is to find those that are most strongly
correlated with the distribution of the bit-plane symbols.
In CBAC, three types of contexts are used, namely, the frequency
band (FB) context, the distance to lazy (D2L)
context, and the significant state (SS) context. The detailed
context assignments are summarized in the following.
Fig. 11. Distribution of residual signal from error-mapping
process.
Context 1: Frequency Band (FB)
It is found in [44] that the
probability distribution of bit-plane symbols of IntMDCT varies for
different frequency bands. Therefore in CBAC the IntMDCT spectral
data are classified into three different FB contexts, namely, low
band (0–4 kHz), midband (4–11 kHz), and high band (above 11
kHz).
Context 2: Distance to Lazy (D2L)
The D2L context is defined as the
distance of the current bit plane j to the BPGC lazy-plane
parameter L, as defined in the following equation:
$$\mathrm{D2L} = \begin{cases} 3 - j + L, & j - L \ge -2 \\ 6, & \text{otherwise.} \end{cases} \tag{12}$$
This is motivated by the BPGC frequency assignment rule, Eq. (10),
which is based on the fact that the skew of the probability
distribution of bit-plane symbols from a source with a (near)
Laplacian distribution tends to decrease as the current bit plane
approaches the lazy plane, that is, as j − L decreases. To reduce
the total number of D2L contexts, all the
bit planes with j − L < −2 are grouped into one context where
all the bit-plane symbols are coded with probability 0.5.
Context 3: Significant State (SS)
The SS context attempts to
capture the factors that correlate with the distribution of the
magnitude of the IntMDCT residual in one
place. These include the magnitude of the adjacent IntMDCT
spectral lines and the quantization interval of the AAC core
quantizer if it has been quantized previously in the core encoder.
Further detail on the SS
context can be found in [49].
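The FB and D2L context assignments can be written as two small lookup functions. This is only a sketch: the function names are hypothetical, the handling of the exact band edges (4 kHz and 11 kHz) is an assumption, and the SS context is omitted because its details are given in [49] rather than here.

```python
def fb_context(freq_hz):
    """Frequency band context: 0 = low, 1 = mid, 2 = high band.

    Assumption of this sketch: a band edge belongs to the upper band.
    """
    if freq_hz < 4000.0:
        return 0
    if freq_hz < 11000.0:
        return 1
    return 2

def d2l_context(j, lazy):
    """Distance-to-lazy context for bit plane j and lazy-plane parameter L."""
    distance = j - lazy
    if distance >= -2:
        return 3 - distance
    return 6                     # all deeper planes share one context (p = 0.5)

# Bit planes at, one below, and far below the lazy plane L = 2.
contexts = [d2l_context(j, 2) for j in (2, 1, -1)]
```

An arithmetic coder would use the pair of context indices, together with the SS context, to select the frequency table for the current bit-plane symbol.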
4.2 Low-Energy-Mode Coding
The BPGC/CBAC coding process described earlier
works well for sources with (near) Laplacian distribution, which
usually is the case for most audio signals [50]. However, it was
also found that for some music items there are time/frequency (T/F)
regions with very low energy levels where the IntMDCT spectral
data are in fact dominated by the rounding errors of the IntMDCT
(see Fig. 7) with a distribution that is substantially different
from Laplacian. In order to encode those low-energy regions
efficiently, the BPGC/CBAC coding process is replaced by
low-energy-mode coding, as described in the
following.
The low-energy-mode coding is invoked for scale factor bands for
which the BPGC parameter L is smaller than or equal to 0. Then the
amplitude of the residual spectral data e[k] is first converted
into a unary binary string b = {b[0], b[1], ..., b[pos], ...},
as illustrated in Table 1, with M being the maximum bit plane.
It can be seen that the probability distribution of these symbols
is a function of the position pos and the distribution of
e[k]:
$$\Pr\{b[pos] = 1\} = \Pr\{e[k] > pos \mid e[k] \ge pos\} \tag{13}$$
where 0 ≤ pos < 2^M. b[pos] is then coded arithmetically depending
on its position pos and the BPGC parameter L with a trained
frequency table.
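The binarization used in the low-energy mode can be sketched as a truncated unary code. The sketch follows one reading of Table 1, in which the largest amplitude, 2^M − 1, is written as all ones without a terminating zero; that reading and the function names are assumptions of this example.

```python
def binarize(amplitude, m):
    """Truncated unary string for an amplitude in the range 0 .. 2^M - 1."""
    limit = (1 << m) - 1
    if not 0 <= amplitude <= limit:
        raise ValueError("amplitude out of range for M bit planes")
    bits = [1] * amplitude
    if amplitude < limit:
        bits.append(0)           # terminating zero, omitted for the largest value
    return bits

def debinarize(bits, m):
    """Count leading ones until a zero or the maximum amplitude is reached."""
    limit = (1 << m) - 1
    amplitude = 0
    for b in bits:
        if b == 0 or amplitude == limit:
            break
        amplitude += 1
    return amplitude

# All strings for M = 2.
strings = [binarize(a, 2) for a in range(4)]
```

Each symbol b[pos] equals 1 exactly when the amplitude exceeds pos, given that position pos was reached at all, which mirrors the conditional probability in Eq. (13).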
4.3 Smart Decoding
Due to their fine-grain scalability, HD-AAC bit streams
can be truncated at any bit rate lower than what would be needed
for a fully lossless reconstruction to produce near-lossless
representations of the audio signal. In the context
of arithmetic decoding, the smart decoding method provides
a way to optimally decode such a truncated bit stream.
It decodes additional symbols
in the absence of incoming bits when a decoding buffer still
contains meaningful information for arithmetic decoding in the
CBAC/BPGC mode and/or low-energy mode. Decoding continues up to
the point where no ambiguity exists in determining a symbol
[51].
5 OTHER CODING TOOLS
As counterparts to the underlying AAC perceptual audio coder, SLS
provides a number of integerized versions of AAC coding
tools.
5.1 Integer M/S
In the AAC codec the M/S tool allows choosing individually
between mid/side and left/right coding modes on a
scale-factor-band basis. As shown in Section 2.3, the stereo
IntMDCT can provide either a global left/right or a global mid/side
spectral representation. In order to make the integer spectral
values fit the spectral values from the AAC core on a
scale-factor-band basis,
an invertible integer version of the M/S mapping is used. It is
based on a lifting decomposition of the normalized M/S matrix,
that is, a rotation by π/4,
$$\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & \frac{\cos(\pi/4) - 1}{\sin(\pi/4)} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin(\pi/4) & 1 \end{pmatrix} \begin{pmatrix} 1 & \frac{\cos(\pi/4) - 1}{\sin(\pi/4)} \\ 0 & 1 \end{pmatrix}. \tag{14}$$
5.2 Integer TNS
When the temporal noise-shaping (TNS) tool is used in
the AAC core, the resulting MDCT spectral values deviate from the
IntMDCT spectral values. In order to compensate for this, the same
TNS filter as in the AAC core is applied to the integer spectral
values in the lossless enhancement. To assure lossless operation,
the TNS filter is converted to a deterministic invertible integer
filter.
Table 1. Binarization of IntMDCT error spectrum at low-energy
mode.
Amplitude of e[k]    Binary string {b[pos]}
0                    0
1                    1 0
2                    1 1 0
...                  ...
2^M − 2              1 1 ... 1 0
2^M − 1              1 1 ... 1 1
pos                  0 1 2 3 ...
6 BIT STREAM MULTIPLEXING
From a mechanical point of view, the SLS coded data, including the
core layer AAC bit stream and the enhance- ment bit stream, can be
carried in multiple elementary streams (ES) in an MPEG-4 system
[52]. As shown in Fig. 12, the AAC bit stream is carried in the
so-called base layer ES, and the enhancement bit stream is carried
in one or more enhancement layer ESs. Each ES is thus com- posed of
a sequence of access units (AUs), where one AU contains one audio
frame from the AAC bit stream or the enhancement bit stream.
From an application point of view, such a bit stream structure
provides great flexibility in constructing either a large-step
scalable system or a fine-grain scalable system with SLS. For
example, in a scalable audio streaming application, the server
stores multiple SLS ESs at predefined bit rates, each assigned
a different stream priority. During streaming, the ESs are
transmitted in the order of their stream priority, and ESs with
lower stream priority are dropped by either the streaming server or
the network gateway whenever the transmission bandwidth is
insufficient to stream a full-rate lossless bit stream. Alternatively,
it is also possible to implement a lightweight bit stream truncation
algorithm in the streaming server or in the multimedia gateway that
truncates the enhancement stream AUs directly according to the
available bandwidth, thereby achieving fine-granular bit-rate
scalability.
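The truncation approach described above can be sketched as follows. This is a minimal illustration, assuming a fixed per-frame byte budget derived from the available bandwidth, 1024-sample frames at 48 kHz, and AUs that can be cut at any byte position. The function names are hypothetical, and a real server may additionally have to respect AU headers or use a bit reservoir.

```python
def frame_budget_bytes(target_bps, frame_len=1024, sample_rate=48000):
    """Bytes available per frame at a given total bit rate."""
    return (target_bps * frame_len) // (8 * sample_rate)


def truncate_enhancement(core_sizes, enh_sizes, target_bps,
                         frame_len=1024, sample_rate=48000):
    """Return the truncated enhancement AU size (bytes) for each frame.

    The core layer AU is always kept in full; the enhancement AU gets
    whatever remains of the per-frame budget.
    """
    budget = frame_budget_bytes(target_bps, frame_len, sample_rate)
    truncated = []
    for core, enh in zip(core_sizes, enh_sizes):
        room = max(0, budget - core)
        truncated.append(min(enh, room))
    return truncated


if __name__ == "__main__":
    core = [341, 341, 341]    # AAC AU sizes in bytes (made-up values)
    enh = [2000, 500, 683]    # enhancement AU sizes in bytes (made-up)
    print(truncate_enhancement(core, enh, target_bps=384000))
```

Frames whose enhancement AU already fits the budget pass through unchanged, so quiet or easily coded passages can remain lossless even at a reduced rate.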
7 MODES OF OPERATION
So far the HD-AAC coder has been described as a combination of a
regular AAC core layer coder (such as low complexity AAC) and an
SLS enhancement layer. This section introduces further features
offered by the SLS enhancement layer that concern its combination
with the AAC coder, and its ability to be used in combination with
other types of AAC-based codecs, such as AAC scalable or AAC/BSAC.
7.1 Oversampling Mode
In the context of high-definition audio applications it is
frequently desirable to achieve lossless signal reconstruction at
sampling rates of 96 kHz, or even 192 kHz. While the MPEG-4 AAC
coder supports these sampling rates, it typically achieves best
coding efficiency for high-quality perceptual coding at sampling
rates between 32 and 48 kHz. Thus in order to allow both efficient
core layer coding at common rates and lossless reconstruction at
higher rates, MPEG-4 SLS includes an additional feature called
“oversampling mode.” This refers to the possibility of letting
the lossless enhancement operate at a sampling rate higher than
that of the AAC core codec. The ratio between the SLS sampling rate
and the AAC sampling rate is called “oversampling factor” and can
be either 1, 2, or 4. For example, the lossless enhancement can
operate at a rate of 192 kHz, whereas the AAC core operates at 48
kHz; see Table 2.
The mapping between the two coding layers is achieved by using a
time-aligned framing and a correspondingly longer IntMDCT in the
lossless enhancement. For example, an IntMDCT of the size of 4096
spectral values is
Table 2. Example combinations of sampling rates for AAC core and
lossless enhancement.
AAC @ 48 kHz
AAC @ 96 kHz
AAC @ 192 kHz
Fig. 12. Structure of MPEG-4 SLS bit stream.
used in case of oversampling by a factor of 4, and the 1024 MDCT
values from the AAC core are mapped to the lower 1024 IntMDCT
values. In addition to allowing for optimum AAC performance, the
lossless performance is improved when using longer IntMDCT
transforms. (A transform length of 2048 or 4096 provides a better
lossless performance for stationary signals than a transform length
of 1024.)
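The relationship between the oversampling factor, the IntMDCT length, and the placement of the AAC core spectrum can be sketched as follows. The constant and function names are hypothetical, and filling the upper bins with zeros merely marks that no core information exists there; it is an assumption of this example, not a statement about the SLS bit stream.

```python
AAC_FRAME_LENGTH = 1024       # MDCT values per AAC long frame
ALLOWED_FACTORS = (1, 2, 4)   # oversampling factors named in the text


def intmdct_length(factor):
    """IntMDCT size for a time-aligned frame at the given factor."""
    if factor not in ALLOWED_FACTORS:
        raise ValueError("oversampling factor must be 1, 2, or 4")
    return AAC_FRAME_LENGTH * factor


def place_core_spectrum(aac_values, factor):
    """Map the AAC core values onto the lower IntMDCT bins.

    Bins above the AAC range carry no core contribution (zeros here).
    """
    if len(aac_values) != AAC_FRAME_LENGTH:
        raise ValueError("expected 1024 AAC spectral values")
    padding = intmdct_length(factor) - AAC_FRAME_LENGTH
    return list(aac_values) + [0] * padding


if __name__ == "__main__":
    aac_rate = 48000
    for f in ALLOWED_FACTORS:
        print(f, aac_rate * f, intmdct_length(f))
```

For a factor of 4 this gives an SLS rate of 192 kHz and an IntMDCT length of 4096, in line with the example in the text.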
7.2 Combination with MPEG-4 Scalable AAC
In order to account for varying or unknown transmission capacity,
the MPEG-4 AAC codec also provides several scalable coding modes
[3], [34]. MPEG-4 scalable AAC allows perceptual audio coding with
one or more AAC mono or stereo layers and can be combined with an
SLS enhancement layer. This results in an overall coder that offers
fine-grain scalability in the range between lossless
reconstruction and the AAC representation, and scalability in
several steps within the perceptually coded AAC
representation.
7.3 Combination with MPEG-4 AAC/BSAC
Similar to the combination with scalable AAC, the SLS
enhancement can also be operated on top of MPEG-4 AAC/BSAC as a
core layer coder. This provides a fine-grain scalable
representation both between lossless and perceptually coded audio
and within the perceptually coded range. The latter has a
granularity of 1 kbps per channel.
7.4 Stand-Alone Operation
Finally the SLS lossless enhancement can also operate
as a straightforward stand-alone codec, without any underlying
core codec. Nonetheless, this operation mode offers both full
lossless coding capability and fine-grain scalability.
8 PERFORMANCE
This section quantifies the performance of the HD-AAC codec in
various operation modes. While the compression ratio can be seen as
the sole measure of merit for lossless operation scenarios, an
evaluation of performance for near-lossless operation requires
audio quality measurements at various data rates and operating
points.
8.1 Lossless Compression Performance
This section reports the performance of HD-AAC in
terms of its ability for lossless compression of various audio
material. As a figure of merit, the compression ratio is defined
as
$$
\text{compression ratio} = \frac{\text{original file size}}{\text{compressed file size}}. \tag{15}
$$
Tables 3 and 4 show the lossless compression performance for two
major sets of test material, that is, the MPEG-4 lossless audio
coding test set (donated by Matsushita Corporation and containing
in part recordings performed by the New York Symphonic Ensemble)
and a set of commercial CDs (16 bit/44.1 kHz).
For both sets the compression results are given for an HD-AAC
configuration with an AAC core layer running at 128 kbps stereo
plus SLS enhancement, and for an SLS stand-alone
configuration.
As could be expected from theory, it can be observed in Table 3
that an increase in word length reduces the average compression
ratio (due to the fact that the least significant bits of the PCM
codewords are more random and thus less compressible). On the other
hand, increasing the sampling rate improves compression because of
the increased correlation between adjacent samples (assuming
sound material with typical high-frequency characteristics).
For the MPEG-4 lossless audio test set, an average compression
ratio of 2:1 can be achieved easily at a sampling rate of 48 kHz
and 16-bit word length. This is competitive with the best of other
known lossless compression systems [53]. It can also be observed
that for the AAC-based mode an additional bit rate of only 30–40
kbps is required compared to the stand-alone mode for lossless
representation. This reduces the bit rate consumption by 90–100
kbps compared to simulcast solutions that transmit both a
128-kbps AAC bit stream and a stand-alone SLS lossless bit stream
simultaneously.
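The figures quoted above can be checked against the 48 kHz/16-bit row of Table 3 with a few lines of arithmetic. In the sketch below the variable and function names are hypothetical, and the PCM rate assumes a two-channel (stereo) source, which is consistent with the tabulated ratios but is an assumption of this example.

```python
AAC_KBPS = 128          # AAC core layer, stereo
HD_AAC_KBPS = 735       # SLS + AAC, 48 kHz/16 bit (Table 3)
STANDALONE_KBPS = 698   # SLS stand-alone, 48 kHz/16 bit (Table 3)


def pcm_kbps(sample_rate_hz, bits, channels=2):
    """Bit rate of uncompressed PCM in kbps."""
    return sample_rate_hz * bits * channels / 1000.0


def compression_ratio(original_kbps, compressed_kbps):
    """Eq. (15) expressed with bit rates instead of file sizes."""
    return original_kbps / compressed_kbps


# Extra rate needed by the AAC-based mode compared to stand-alone SLS.
overhead_kbps = HD_AAC_KBPS - STANDALONE_KBPS

# Simulcast sends the AAC stream and a stand-alone lossless stream.
simulcast_kbps = AAC_KBPS + STANDALONE_KBPS
saving_kbps = simulcast_kbps - HD_AAC_KBPS

if __name__ == "__main__":
    source = pcm_kbps(48000, 16)
    print(round(compression_ratio(source, HD_AAC_KBPS), 2))
    print(round(compression_ratio(source, STANDALONE_KBPS), 2))
    print(overhead_kbps, saving_kbps)
```

For this row the overhead is 37 kbps and the saving over simulcast is 91 kbps, which falls within the ranges stated in the text.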
8.2 Near-Lossless Compression Performance
The bit-plane coding of residual spectral values (that is,
of the AAC quantization error) allows the initial AAC
quantization to be refined successively as more bits from the SLS
enhancement layer are decoded. With each additional decoded bit
plane the quantization error is reduced by about 6 dB.
Consequently an increasing safety margin with respect to
audibility is added as the bit rate increases.
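The 6-dB figure follows from a standard quantization argument, sketched here under the usual assumption (not stated in the text) that the quantization error is uniformly distributed over one step of size Δ. Decoding one more bit plane halves the step size, so the error power drops by a factor of four:

```latex
\sigma_e^2(\Delta) \approx \frac{\Delta^2}{12},
\qquad
10 \log_{10} \frac{\sigma_e^2(\Delta)}{\sigma_e^2(\Delta/2)}
  = 10 \log_{10} 4 \approx 6.02\ \text{dB}.
```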
Table 3. Lossless compression results for MPEG-4 lossless audio
test set.

                  SLS + AAC @ 128 kbps/Stereo      SLS Stand-Alone
                  (AAC @ 48 kHz sampling rate)
                  Compression   Average Bit        Compression   Average Bit
                  Ratio         Rate (kbps)        Ratio         Rate (kbps)
48 kHz/16 bit     2.09          735                2.20          698
48 kHz/24 bit     1.55          1490               1.58          1454
96 kHz/24 bit     2.09          2201               2.13          2160
192 kHz/24 bit    2.60          3543               2.63          3509
Overall           2.08          1992               2.12          1955
8.2.1 Evaluation of Near-Lossless Audio Quality
While it may seem sufficient for most purposes to provide
perceptually transparent reproduction of audio signals by
using conventional perceptual audio coders (such as AAC at a
sufficient bit rate), there are applications that demand still
higher audio quality. This is especially the case for professional
audio production facilities, such as archiving and broadcasting, in
which audio signals may undergo many cycles of encoding/decoding
(tandem coding) before being delivered to the consumer. This
leads to an accumulation of introduced coding distortion and may
lead to unacceptable final audio quality [54], unless substantial
headroom toward audibility is provided by each coding step, for
example, by using coding algorithms with very high quality or bit
rate.
ITU-R BS.1548-1 [55] defines the requirements for audio coding
systems for digital broadcasting, assuming a codec chain consisting
of so-called contribution, distribution, and emission codecs.
According to this recommendation, and based on ITU-R BS.1116-1
[56], audio codecs for contribution and distribution should fulfill
the following requirements:
The quality of sound reproduced after a reference contribution/
distribution cascade [. . .] should be subjectively
indistinguishable from the source for most types of audio programme
material. Using the triple stimuli double blind with hidden reference
test, described in Recommendation ITU-R BS.1116 [. . .], this requires
mean scores generally higher than 4.5 in the impairment 5-grade
scale, for listeners at the reference listening position. The
worst rated item should not be graded lower than 4.
In accordance with these recommendations, tests were run on signals
encoded and decoded with HD-AAC. Instead
of running numerous listening tests for subjective quality
assessment at individual operating points, the evaluation employed
the PEAQ measurement (BS.1387-1) [57], which provides methods for
objective measurements of perceived audio quality in scenarios that
are normally assessed by ITU-R BS.1116 testing. The most
essential results can be seen in Figs. 13–15; see also
[58].
The graphs show the estimated subjective sound quality expressed as
objective difference grade (ODG) values, which were computed by a
PEAQ measurement. The evaluation procedure consists of multiple
cycles of tandem coding/decoding with up to 16 cycles. The standard
set of critical MPEG-4 audio items for perceptual audio coding
evaluations was used. ODG values of 0, −1, −2, −3, −4 correspond to
a subjective audio quality of “indistinguishable from original,”
“perceptible but not annoying,” “slightly annoying,” “annoying,”
and “very annoying,” respectively.
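When reading the tandem coding graphs it can help to translate an ODG value into the verbal category listed above. The helper below is a small convenience sketch and is not part of PEAQ; the function name, the clamping to the range −4 to 0, and the choice of the nearest integer grade are assumptions of this example.

```python
ODG_LABELS = {
    0: "indistinguishable from original",
    -1: "perceptible but not annoying",
    -2: "slightly annoying",
    -3: "annoying",
    -4: "very annoying",
}


def describe_odg(odg):
    """Return the verbal category of the integer grade closest to `odg`.

    Values outside [-4, 0] are clamped; exact ties (for example -0.5)
    resolve toward the better grade because of the dictionary order.
    """
    clamped = max(-4.0, min(0.0, float(odg)))
    nearest = min(ODG_LABELS, key=lambda grade: abs(grade - clamped))
    return ODG_LABELS[nearest]


if __name__ == "__main__":
    for value in (-0.2, -1.0, -2.6, -3.7):
        print(value, describe_odg(value))
```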
Fig. 13 shows the achieved ODG values as a function of tandem
cycles for a traditional AAC coder running at a bit rate of 128
kbps/stereo. As expected, it can be observed that the audio quality
degrades significantly with an increasing number of tandem
cycles, depending on the test item. For this reason tandem coding
is not a recommended practice for such coders.
Fig. 14 displays the corresponding tandem coding results for the
HD-AAC combination running at 512 kbps/stereo (AAC at 128 kbps +
SLS enhancement at 384 kbps). It can be noted that the audio
quality remains consistently at a very high level, even after a
total of 16 tandem cycles. This illustrates the high robustness of
the HD-AAC representation against tandem coding. According to
these measurements, the aforementioned BS.1548-1 audio quality
requirement is fulfilled with a considerable safety margin.
Furthermore, when placed in tandem with AAC (such
Table 4. Lossless compression results for commercial CD test set.

                                                      Compression Ratio
CD Items (16 bit/44.1 kHz)                            SLS + AAC @       SLS
                                                      128 kbps/Stereo   Stand-Alone
ACDC - Highway to Hell (Sony 80206)                   1.31              1.36
Avril Lavigne - Let Go (Arista 14740)                 1.36              1.41
Backstreet Boys - Greatest Hits Chapter One
  (Jive 41779)                                        1.39              1.45
Brian Setzer - The Dirty Boogie (Interscope 90183)    1.43              1.49
Cowboy Junkies - Trinity Session (RCA-8568)           1.93              2.04
Grieg - Peer Gynt, von Karajan (DG 439010)            2.63              2.83
Jennifer Warnes - Famous Blue Raincoat (BMG 258418)   2.07              2.20
Marlboro Music Festival - DISC A (Bridge 9108)        2.23              2.35
Marlboro Music Festival - DISC B (Bridge 9108)        2.20              2.33
Nirvana - Nirvana (Interscope 493523)                 1.50              1.56
Philip Jones - 40 Famous Marches, CD1 (Decca 416241)  1.99              2.11
Philip Jones - 40 Famous Marches, CD2 (Decca 416241)  2.00              2.11
Pink Floyd - Dark Side of the Moon (Capitol 46001)    1.76              1.85
Rebecca Pidgeon - The Raven (Chesky 115)              1.88              1.97
Ricky Martin (Sony 69891)                             1.33              1.38
Schubert Piano Trio in E-flat (Sony 48088)            2.74              2.90
Spaniels - The Very Best Of (Collectables 7243)       2.41              2.62
Steeleye Span - Below the Salt (Shanachie 79039)      1.85              1.95
Suzanne Vega - Solitude Standing (A&M 5136)           1.74              1.83
Westminster Concert Bell Choir - Christmas Bells
  (Gothic Records 49055)                              2.55              2.71
Overall                                               1.85              1.94
as for final audio distribution over a narrow-band channel), the
resulting audio quality is not degraded significantly by the
preceding HD-AAC tandem cascade. Further details can be found in
[59].
8.2.2 Stand-Alone SLS Operation
The SLS codec can also operate as a stand-alone lossless codec
when the AAC core codec is not used, sometimes also referred to
as the “noncore mode.” Despite its simple structure (only
IntMDCT and BPGC/CBAC modules are used), this
mode allows efficient lossless coding [59]. Furthermore, fine-grain
scalability by truncated bit-plane coding is also possible in this
mode. Given that the stand-alone SLS codec does not include any
perceptual model to estimate masking thresholds, it is interesting
to investigate the audio quality resulting from a truncation of the
SLS bit stream.
Fig. 13. Test results: AAC tandem coding.
Fig. 14. Test results: HD-AAC tandem coding.
Due to the behavior of bit-plane coding in this mode, a
constant signal-to-noise ratio is achieved in each scale-factor
band. With each additional bit plane the signal-to-noise ratio
improves by 6 dB. While this behavior does not allow SLS to
compete with efficient perceptual codecs at low bit rates (for
example, AAC at 128 kbps/stereo), this simple approach works quite
well at higher bit rates in the near-lossless range. Fig. 15 shows
tandem coding results for the stand-alone SLS codec operating at
512 kbps/stereo. It reaches about the same near-lossless audio
quality as the AAC-based HD-AAC mode discussed in the previous
section.
At a constant bit rate of 768 kbps most test items still require
the truncation of some coder frames. Nevertheless the corresponding
PEAQ measurements indicate that no degradation of subjective audio
quality occurs in this tandem coding scenario for both the
AAC-based mode and the stand-alone mode; see [59]. This provides an
interesting operating point for HD-AAC modes, corresponding to a
guaranteed 2:1 compression. While other stand-alone lossless codecs
can also provide an average compression of 2:1 for suitable test
material, their peak compression
performance can be much lower, depending on the audio material to
be encoded. In contrast, HD-AAC is able to guarantee a certain
compression ratio while providing lossless or near-lossless signal
representation, depending on the input signal.
9 DECODER COMPLEXITY
The computational complexity for SLS decoding can be evaluated by
counting the total number of standard instructions
(multiplications, additions, bit shifts, comparisons, memory
transfers) required for performing the decoding process on a
generic 32-bit fixed-point CPU.
The main components contributing to the computational complexity of
SLS are:
1) IntMDCT filter bank
2) Bit-plane arithmetic decoder
3) AAC Huffman decoding
4) AAC + SLS inverse error mapping
5) Integer M/S stereo coding
6) Unpacking of tables.
Items 3) to 5) are only required in the AAC-based mode,
item 6) only if the necessary tables are not precomputed.
9.1 Number of Instructions
Table 5 lists the number of instructions required for
decoding in the AAC-based mode, with the AAC core operating at 64
kbps per channel. Table 6 shows the corresponding numbers for
SLS in stand-alone mode (without the AAC core layer). In both
tables, values are provided for implementations with preunpacked
tables and for implementations that unpack the tables in place.
Fig. 15. Test results: stand-alone SLS tandem coding.
9.2 ROM Requirements
For an implementation of SLS in stand-alone mode, a
ROM size of 4 kbytes is required. For the AAC-based mode, the ROM
requirement is 45 kbytes. As can be seen from Tables 5 and 6, a
tradeoff between ROM requirement and number of instructions can be
made by precomputing the necessary table values. More details on
SLS computational complexity can be found in [59]. The
computational complexity of AAC decoding is analyzed in [60].
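The per-sample counts in Tables 5 and 6 can be turned into a rough processing load. The sketch below assumes that the counts apply to each output sample of one channel and that every operation costs one cycle (the "Combined" column). The text does not spell out the first assumption, so the results should be read as order-of-magnitude estimates; the function name is hypothetical.

```python
def decoder_mops(ops_per_sample, sample_rate_hz, channels=1):
    """Million operations per second for a given per-sample count."""
    return ops_per_sample * sample_rate_hz * channels / 1e6


if __name__ == "__main__":
    # "Combined" values for frame length 1024 or 128, tables preunpacked.
    aac_based = 250.83    # Table 5
    stand_alone = 225.67  # Table 6
    print(round(decoder_mops(aac_based, 48000), 2))
    print(round(decoder_mops(stand_alone, 48000), 2))
```

Under these assumptions, decoding one channel at 48 kHz needs roughly 12 million operations per second in the AAC-based mode and roughly 11 million in stand-alone mode.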
10 APPLICATIONS
As the primary functionality of HD-AAC audio coding is lossless
audio coding, it can be used in applications that require bit-exact
reconstruction, such as studio operation, music disc delivery, or
audio archiving. Due to its inherent scalability, HD-AAC audio
coding technology in fact fits into virtually every application
that requires audio compression. Several potential application
scenarios are listed here.
Studio Operations. HD-AAC audio coding technology is useful for the
storage of audio at various points in studio operations such as
recording, editing, mixing, and premastering, as studio procedures
are designed to preserve the highest levels of quality. The
scalability of the SLS layer also provides a solution for
situations in which the bandwidth is not sufficient to support
fully lossless quality.
Archival Application. Archives of sound recordings are very common
in studios, record labels, libraries, and so on. These archives are
often very large, so compression is essential. In
addition, the scalability of the
SLS technology makes it possible to extract lower bit-rate
versions of the archive’s lossless audio items at
any time, which allows applications such as remote data browsing.
Broadcast Contribution/Distribution Chain. In a broadcast
environment HD-AAC audio coding technology could be used in all
stages comprising archiving, contribution/distribution, and
emission. In the broadcast chain one main feature of the technology
can be used: In every stage where lower bit rates are required, the
bit stream is merely truncated, and no reencoding is therefore
required.
Consumer Disc-Based Delivery. HD-AAC technology can also be used in
consumer disc-based delivery of music content. It enables the music
disc to deliver both lossless and lossy audio on the same
medium.
Internet Delivery of Audio. In such an application scenario the
available transmission bandwidth can vary dramatically across
different access network technologies and over time. As a result,
the same audio content at a variety of bit rates and qualities may
need to be kept ready at the server side. HD-AAC technology
provides a “one-file” solution for such a requirement.
Audio Streaming. HD-AAC technology delivers the vital bit-rate
scalability for streaming applications on channels with variable
quality of service (QoS) conditions. Examples of this kind of
streaming application include Internet audio streaming and
multicast streaming applications that feed several channels of
differing capacity.
Digital Home. The idea of the digital home is to create an open and
transparent home network platform that enables consumers to
easily create, use, manage, and share digital content such as
audio, video, or images. In a typical
Table 5. Maximum number of INT32 operations per sample for SLS
decoding with AAC core.

Frame Length   Muls    Adds/Subs   Ors     Shifts   Negs    Movs    Combined
                                                                    (All = 1 Cycle)
Tables Preunpacked
4096 or 512    19.50   93.46       34.50   72.60    34.26   16.54   270.86
2048 or 256    20.25   91.22       31.50   73.54    32.25   16.29   265.05
1024 or 128    18.00   88.99       28.50   68.50    30.85   15.99   250.83
Tables Unpacked in Place
4096 or 512    19.50   123.21      34.50   88.01    34.26   16.54   316.10
2048 or 256    20.25   117.47      31.50   85.54    32.25   16.29   303.30
1024 or 128    18.00   105.24      28.50   75.00    30.85   15.99   273.58

Table 6. Maximum number of INT32 operations per sample for SLS
stand-alone decoding.

Frame Length   Muls    Adds/Subs   Ors     Shifts   Negs    Movs    Combined
                                                                    (All = 1 Cycle)
Tables Preunpacked
4096 or 512    18.00   86.63       34.50   67.10    29.93   9.54    245.70
2048 or 256    18.75   84.39       31.50   68.04    27.92   9.29    239.89
1024 or 128    16.50   82.16       28.50   63.00    26.52   8.99    225.67
Tables Unpacked in Place
4096 or 512    18.00   116.38      34.50   82.60    29.93   9.54    290.05
2048 or 256    18.75   110.64      31.50   80.04    27.92   9.29    278.14
1024 or 128    16.50   98.41       28.50   69.50    26.52   8.99    248.42
setup for audio, the user can download the HD-AAC coded bit streams
in lossless quality from the service provider and archive them on
the home music server. These bit streams are then streamed or
downloaded to different audio terminals at differing quality for
playback.
11 CONCLUSIONS
The new ISO/MPEG specification for scalable lossless coding extends
the well-known perceptual coding scheme AAC toward lossless and
near-lossless operation, and in this way enables its use in the
context of high-definition applications. The HD-AAC scheme offers
competitive lossless compression rates at all relevant operating
points (word length and sampling rate). For distribution on
bandwidth-limited channels a perceptually coded compatible AAC bit
stream can simply be extracted from the composite HD-AAC stream.
Alternatively, the SLS part can also be used as a simple and
versatile stand-alone compression engine. In both cases the
fidelity of the signal representation can be scaled with fine
granularity within a wide range of near-lossless representations.
This enables lossless or near-lossless transmission of
high-definition audio with a guaranteed maximum rate. We anticipate
that this flexibility will make HD-AAC the technology of choice for
many applications that call for both very high audio quality and
delivery over a wide range of transmission channels.
12 ACKNOWLEDGMENT
The authors would like to thank all their colleagues at the MPEG
audio subgroup who supported the lossless standardization activity,
especially Takehiro Moriya (NTT) for inspiring these
standardization activities, Tilman Liebchen (Technical University
of Berlin) for chairing the ad hoc group on lossless audio
coding, and Yuriy Reznik (Real Networks/Qualcomm) for his thorough
complexity evaluations.
13 REFERENCES
[1] ISO/IEC 11172-3, “Coding of Moving Pictures and Associated
Audio for Digital Storage Media at up to about 1.5 Mbit/s - Part 3:
Audio,” International Standards Organization, Geneva, Switzerland
(1992).
[2] ISO/IEC 13818-3, “Information Technology - Generic Coding of
Moving Pictures and Associated Audio - Part 3: Audio,”
International Standards Organization, Geneva, Switzerland
(1994).
[3] ISO/IEC 14496-3:2001, “Coding of Audio-Visual Objects - Part 3:
Audio,” International Standards Organization, Geneva, Switzerland
(2001).
[4] ISO/IEC 14496-3:2001/Amd.1:2003, “Coding of Audio-Visual
Objects - Part 3: Audio, Amendment 1: Bandwidth Extension,”
International Standards Organization, Geneva, Switzerland
(2003).
[5] M. Dietz, L. Liljeryd, K. Kjorling, and O. Kunz, “Spectral Band
Replication - A Novel Approach in Audio Coding,” presented at the
112th Convention of the Audio Engineering Society, J. Audio Eng.
Soc. (Abstracts), vol. 50, pp. 509, 510 (2002 June), convention
paper 5553.
[6] ISO/IEC 14496-3:2001/Amd.2:2004, “Coding of Audio-Visual
Objects - Part 3: Audio, Amendment 2: Parametric Coding for High
Quality Audio,” International Standards Organization, Geneva,
Switzerland (2004).
[7] C. den Brinker, E. Schuijers, and W. Oomen, “Parametric
Coding for High-Quality Audio,” presented at the 112th Convention
of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts),
vol. 50, p. 510 (2002 June), convention paper 5554.
[8] ISO/IEC FCD 23003-1, “MPEG-D (MPEG Audio Technologies) - Part 1:
MPEG Surround,” International Standards Organization, Geneva,
Switzerland (2006).
[9] J. Breebaart, J. Herre, C. Faller, J. Roeden, F. Myburg, S.
Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. Kjoerling, and W.
Oomen, “MPEG Spatial Audio Coding/MPEG Surround: Overview and
Current Status,” presented at the 119th Convention of the Audio
Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 53, p.
1228 (2005 Dec.), convention paper 6599.
[10] “Special Issue: High-Resolution Audio,” J. Audio Eng. Soc.,
vol. 52, pp. 116–260 (2004 Mar.).
[11] “DVD Forum,” http://www.dvdforum.org/forum.shtml
(2004).
[12] Royal Philips Electronics, “Super Audio CD Systems,”
http://www.licensing.philips.com/information/sacd/ (2006).
[13] Blu-Ray Disc Association, “Blu-Ray Disc,”
http://www.blu-raydisc.com/ (2006).
[14] Dolby Laboratories Inc., “MLP Lossless,”
http://www.dolby.com/consumer/technology/mlp_lossless.html (2006).
[15] M. A. Gerzon, P. G. Craven, J. R. Stuart, M. J. Law, and R. J.
Wilson, “The MLP Lossless Compression System,” in Proc. 17th AES
Conf. (Florence, Italy, 1999 Sept.), pp. 61–75.
[16] DTS Inc., “DTS HD,”
http://www.dtsonline.com/consumer/dtshd.php (2006).
[17] Apple Computer Inc., “Apple Quicktime,”
http://www.apple.com/downloads/macosx/apple/quicktime651.html
(2006).
[18] J. Coalson, “FLAC - Free Lossless Audio Codec,”
http://flac.sourceforge.net (2006).
[19] M. T. Ashland, “Monkey’s Audio - A Fast and Powerful Lossless
Audio Compressor,” http://www.monkeysaudio.com (2004).
[20] F. Ghido, “OptimFROG,” http://www.losslessaudio.org
(2006).
[21] ITU-T G.729.1, “G.729 Based Embedded Variable Bit-Rate Coder:
An 8–32 kbit/s Scalable Wideband Coder Bitstream Interoperable with
G.729,” International Telecommunication Union, Geneva,
Switzerland (2006).
[22] B. Grill, “A Bit-Rate Scalable Perceptual Coder for MPEG-4
Audio,” presented at the 103rd Convention of the Audio Engineering
Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. 1005 (1997
Nov.), preprint 4620.
[23] J. Herre, E. Allamanche, K. Brandenburg, M.
Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N. Iwakami, T.
Norimatsu, M. Tsushima, and T. Ishikawa, “The Integrated
Filterbank-Based Scalable MPEG-4 Audio Coder,” presented at the
105th Convention of the Audio Engineering Society, J. Audio Eng.
Soc. (Abstracts), vol. 46, p. 1039 (1998 Nov.), preprint
4810.
[24] S. H. Park, Y. B. Kim, S. W. Kim, and Y. S. Seo, “Multi-Layer
Bit-Sliced Bit-Rate Scalable Audio Coding,” presented at the 103rd
Convention of the Audio Engineering Society, J. Audio Eng. Soc.
(Abstracts), vol. 45, p. 1005 (1997 Nov.), preprint 4520.
[25] M. Nishiguchi, “MPEG-4 Speech Coding,” in Proc. 17th AES Conf.
(Florence, Italy, 1999 Sept.), pp. 139–146.
[26] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsumoto,
“Parametric Speech Coding - HVXC at 2.0–4.0 kbps,” presented at the
IEEE Workshop on Speech Coding, Porvoo, Finland (1999
June).
[27] M. R. Schroeder and B. S. Atal, “Code-Excited Linear
Prediction (CELP): High-Quality Speech at Very Low Bit Rates,” in
Proc. IEEE ICASSP (Tampa, FL, 1985 Mar.), pp. 937–940.
[28] T. Nomura, M. Iwadare, M. Serizawa, and K. Ozawa, “A Bitrate
and Bandwidth Scalable CELP Coder,” in Proc. IEEE ICASSP (Seattle,
WA, 1998 May), pp. 341–344.
[29] ISO/IEC JTC1/SC29/WG11, “Final Call for Proposals on MPEG-4
Lossless Audio Coding,” MPEG2002/N5208, Shanghai, China (2002
Oct.).
[30] ISO/IEC 14496-3:2001/Amd.6:2005, “Coding of
Audio-Visual Objects - Part 3: Audio, Amendment 6: Lossless Coding
of Oversampled Audio,” International Standards Organization,
Geneva, Switzerland (2005).
[31] E. Knapen, D. Reefman, E. Janssen, and F. Bruekers,
“Lossless Compression of 1-Bit Audio,” J. Audio Eng. Soc., vol. 52,
pp. 190–199 (2004 Mar.).
[32] ISO/IEC 14496-3:2005/Amd.2:2006, “Coding of Audio-Visual
Objects - Part 3: Audio, Amendment 2: Audio Lossless Coding (ALS),
New Audio Profiles and BSAC Extensions,” International Standards
Organization, Geneva, Switzerland (2006).
[33] ISO/IEC 14496-3:2005/Amd.3:2006, “Coding of Audio-Visual
Objects - Part 3: Audio, Amendment 3: Scalable Lossless Coding
(SLS),” International Standards Organization, Geneva, Switzerland
(2006).
[34] J. Herre and H. Purnhagen, “General Audio Coding,” in The
MPEG-4 Book, IMSC Multimedia Ser., F. Pereira and T. Ebrahimi, Eds.
(Prentice-Hall, Englewood Cliffs, NJ, 2002).
[35] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K.
Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y.
Oikawa, “ISO/IEC MPEG-2 Advanced Audio Coding,” J. Audio Eng.
Soc., vol. 45, pp. 789–814 (1997 Oct.).
[36] ISO/IEC JTC1/SC29/WG11, “Report on the MPEG-2 AAC Stereo
Verification Tests,” MPEG1998/N2006, San Jose, CA (1998
Feb.).
[37] J. Princen, A. Johnson, and A. Bradley, “Subband/Transform
Coding Using Filter Bank Designs Based on
Time Domain Aliasing Cancellation,” in Proc. IEEE ICASSP (Dallas,
TX, 1987), pp. 2161–2164.
[38] J. Herre and J. D. Johnston, “Enhancing the Performance of
Perceptual Audio Coders by Using Temporal Noise Shaping (TNS),”
presented at the 101st Convention of the Audio Engineering Society,
J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.),
preprint 4384.
[39] J. D. Johnston, J. Herre, M. Davis, and U. Gbur, “MPEG-2 NBC
Audio - Stereo and Multichannel Coding Methods,” presented at the
101st Convention of the Audio Engineering Society, J. Audio Eng.
Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint
4383.
[40] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, “Audio
Coding Based on Integer Transforms,” presented at the 111th
Convention of the Audio Engineering Society, J. Audio Eng. Soc.
(Abstracts), vol. 49, p. 1230 (2001 Dec.), convention paper
5471.
[41] R. Geiger, Y. Yokotani, and G. Schuller, “Improved Integer
Transforms for Lossless Audio Coding,” in Proc. 37th Asilomar Conf.
on Signals, Systems and Computers (Pacific Grove, CA, 2003
Nov.).
[42] R. Geiger, J. Herre, J. Koller, and K. Brandenburg,
“IntMDCT - A Link between Perceptual and Lossless Audio Coding,”
in Proc. IEEE ICASSP (Orlando, FL, 2002 May).
[43] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “Bit-Plane Golomb
Code for Sources with Laplacian Distributions,” in Proc. IEEE
ICASSP (Hong Kong, China, 2003 Apr.), pp. 277–280.
[44] R. Yu, X. Lin, S. Rahardja, C. C. Ko, and H. Huang, “Improving
Coding Efficiency for MPEG-4 Audio Scalable Lossless Coding,” in
Proc. IEEE ICASSP (Philadelphia, PA, 2005 May).
[45] I. Daubechies and W. Sweldens, “Factoring Wavelet
Transforms into Lifting Steps,” Tech. Rep., Bell Laboratories,
Lucent Technologies (1996).
[46] F. Bruekers and A. Enden, “New Networks for Perfect Inversion
and Perfect Reconstruction,” IEEE J. Selected Areas Comm., vol. 10,
pp. 130–137 (1992 Jan.).
[47] R. Geiger, Y. Yokotani, G. Schuller, and J. Herre, “Improved
Integer Transforms Using Multi-Dimensional Lifting,” in Proc. IEEE
ICASSP (Montreal, Canada, 2004 May).
[48] Y. Yokotani, R. Geiger, G. Schuller, S. Oraintara, and K. R.
Rao, “Improved Lossless Audio Coding Using the Noise-Shaped
IntMDCT,” presented at the IEEE 11th DSP Workshop, Taos Ski Valley,
NM (2004 Aug.).
[49] R. Yu, X. Lin, S. Rahardja, and H. Haibin, “Proposed Core
Experiment for Improving Coding Efficiency in MPEG-4 Audio Scalable
Coding (SLS),” ISO/IEC JTC1/SC29/WG11, M10683, Munich, Germany
(2004 Mar.).
[50] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A Statistics Study
of the MDCT Coefficient Distribution for Audio,” in Proc. ICME
(Taipei, Taiwan, 2004 June).
[51] K. H. Choo, E. Oh, J. H. Kim, and C. Y. Son, “Enhanced
Performance in the Functionality of Fine Grain Scalability,”
presented at the 119th Convention of the Au- dio Engineering
Society, J. Audio Eng. Soc. (Abstracts),
PAPERS ISO/IEC MPEG-4 SCALABLE AAC
J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February
41
vol. 53, pp. 1227, 1228 (2005 Dec.), convention paper 6597.
[52] ISO/IEC 14496–1:2004, “Coding of Audio-Visual Objects—Part 1:
Systems,” International Standards Orga- nization, Geneva,
Switzerland (2004).
[53] M. Hans and R. W. Schafer, “Lossless Compres- sion of Digital
Audio,” IEEE Signal Process Mag., vol. 18, pp. 21–32 (2001
July).
[54] AES Technical Committee of Coding of Audio Signals,
“Perceptual Audio Coders: What to Listen For,” CD-ROM with tutorial
information and audio examples, Audio Engineering Society, New York
(2001).
[55] ITU-R BS.1548-1, “User Requirements for Audio Coding Systems
for Digital Broadcasting,” International Telecommunication Union,
Geneva, Switzerland (2001–2002).
[56] ITU-R BS.1116-1, “Methods for the Subjective Assessment of
Small Impairments in Audio Systems In-
cluding Multichannel Sound Systems,” International Tele-
communication Union, Geneva, Switzerland (1994).
[57] ITU-R BS.1387-1, “Method for Objective Measurements of
Perceived Audio Quality,” International Telecommunication Union,
Geneva, Switzerland (1998).
[58] R. Geiger, M. Schmidt, J. Herre, and R. Yu, “MPEG-4
SLS—Lossless and Near-Lossless Audio Coding Based on MPEG-4 AAC,”
presented at the International Symposium on Communications,
Control and Signal Processing, Marrakech, Morocco (2006
Mar.).
[59] ISO/IEC JTC1/SC29/WG11, “Verification Report on MPEG-4 SLS,”
MPEG2005/N7687, Nice, France (2005 Oct.).
[60] ISO/IEC JTC1/SC29/WG11, “Revised Report on Complexity of
MPEG-2 AAC Tools,” MPEG1999/N2957 (Melbourne, Australia, 1999
Oct.),
http://www.chiariglione.org/mpeg/working_documents/mpeg-02/audio/AAC_tool_complexity(rev).zip
THE AUTHORS
Ralf Geiger received a diploma degree in mathematics from the
University of Regensburg, Regensburg, Germany, in 1997.
In 1998 he joined the Audio/Multimedia Department at the Fraunhofer
Institute for Integrated Circuits (IIS), Erlangen, Germany. From
2000 to 2004 he was with the Fraunhofer Institute for Digital Media
Technology (IDMT), Ilmenau, Germany. In 2005 he returned to
Fraunhofer IIS.
Mr. Geiger is working on the development and standardization of
perceptual and lossless audio coding.
Rongshan Yu received a B.Eng. degree from Shanghai Jiaotong
University, Shanghai, P. R. China, in 1995, and M.Eng. and Ph.D.
degrees from the National University of Singapore in 2000 and 2005,
respectively.
He was with the Centre for Signal Processing, School of Electrical
and Electronics, Nanyang Technological University, Singapore,
from 1999 to 2001, and with the Institute for Infocomm Research
(I2R), A*STAR, Singapore, from 2001 to 2005. He is currently with
Dolby Laboratories, San Francisco, CA, USA. His research interests
include audio coding, data compression, and digital signal
processing.
Jürgen Herre joined the Fraunhofer Institute for Integrated
Circuits (IIS) in Erlangen, Germany, in 1989. Since then he has
been involved in the development of perceptual coding algorithms
for high-quality audio, including the well-known ISO/MPEG-Audio
Layer III coder (aka MP3). In 1995 he joined Bell Laboratories for
a postdoctoral term, working on the development of MPEG-2
advanced audio coding (AAC). Since the end of 1996 he has been back
at Fraunhofer, working on the development of advanced multimedia
technology, including MPEG-4, MPEG-7, and secure delivery of
audiovisual content, currently as the chief scientist for the
audio/multimedia activities at Fraunhofer IIS, Erlangen.
Susanto Rahardja received a Ph.D. degree in electrical and
electronic engineering from the Nanyang Technological University,
Singapore.
He joined the Centre for Signal Processing, Nanyang Technological
University, in 1996 and he has been a faculty member at its
School of Electrical and Electronic Engineering since 2001. In 2002
he joined the Institute for Infocomm Research (I2R) and is
currently the director of its Media Division. He is overseeing
research areas on signal processing (audio coding, video/image
processing), media analysis (text/speech, image, video), media
security (biometrics, computer vision, and surveillance), and
sensor network.
Dr. Rahardja has published in more than 150 international
journals and at conferences in the areas of digital
Xiao Lin received a Ph.D. degree from the Electronics and Computer
Science Department of the University of Southampton, Southampton,
UK, in 1993.
He worked at the Centre for Signal Processing, Nanyang
Technological University, Singapore, as a research fellow and
senior research fellow for about five years. Subsequently he joined
DeSOC Technology as a technical director and then the Institute for
Infocomm Research in 2002. There he was a member of technical
staff, lead scientist, and principal scientist and managed the
Media Processing Department until 2006. He is now with Fortemedia
Inc. as a senior director. He actively participated in
international standards such as MPEG-4, JPEG2000, and JVT.
Dr. Lin is a senior member of the IEEE.
Markus Schmidt received a Dipl.-Ing. degree in media technology
from the Technical University of Ilmenau, Germany, in 2004. During
his studies he spent a year at the University of Strathclyde in
Glasgow, Scotland.
After completing an internship at the Fraunhofer Institute for
Digital Media Technology (IDMT) in Ilmenau, he joined the
Fraunhofer Institute for Integrated Circuits (IIS), Erlangen,
Germany, in 2005. There his research interests include low-delay
and lossless audio coding schemes and their implementation in
real-time environments.