Page 1: Media Processing – Audio Part


Media Processing – Audio Part

Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

[email protected]

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Page 2: Media Processing – Audio Part


Approximate outline

Week 6: Fundamentals of audio

Week 7: Audio acquiring, recording, and standards

Week 8: Audio processing, coding, and standards

Week 9: Audio production and reproduction

Week 10: Audio perception and audio quality assessment

Page 3: Media Processing – Audio Part


Audio coding and standards

Concepts and topics to be covered:

Spectral analysis of audio

DFT, DCT, and MDCT

Subband analysis of audio

PQMF

Audio coding methods

Lossless and lossy coding

Coding standards

MPEG-1, -2, and -4

Coding principles of MPEG-1

Page 4: Media Processing – Audio Part

Fourier transform

Motivation for using the Fourier transform (FT)

The waveform (a general time-domain representation) provides some indication of the dynamics and periodicity of audio, but beyond this it reveals little; for example, it gives no clear picture of the signal's frequency distribution.

The Fourier transform provides an alternative representation of the signal, suitable for displaying other characteristics of speech, such as its frequency content, harmonics, etc.

Definition

The Fourier transform of a continuous signal $x(t)$ is computed as:

$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$$

Inverse Fourier transform:

$$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega$$

where $\omega$ is the angular frequency: $\omega = 2\pi f$

Page 5: Media Processing – Audio Part

Discrete-time Fourier transform (DTFT)

The Fourier transform of a discrete-time signal $x[n] = x(nT)$, where $T$ is the sampling period, is computed as:

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$

Inverse discrete-time Fourier transform:

$$x[n] = \frac{1}{2\pi} \int_{2\pi} X(\omega)\, e^{j\omega n}\, d\omega$$

where $\omega$ is the angular frequency: $\omega = 2\pi f$

Page 6: Media Processing – Audio Part

Discrete Fourier transform (DFT)

The Fourier transform of a digital signal is computed as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega_k n}$$

Inverse discrete Fourier transform:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j\omega_k n}, \quad n = 0, 1, \ldots, N-1$$

where $\omega_k$ is the angular frequency: $\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1$

Page 7: Media Processing – Audio Part

Power spectral density (PSD)

PSD is defined as the magnitude squared of the DFT of the signal:

$$P[k] = |X[k]|^2$$

Examples:

PSD of a vowel spoken by a male speaker

PSD of a fricative spoken by a male speaker
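A minimal sketch (my own illustration; the sampling rate and test tone are assumptions) of computing the PSD of one windowed frame:

```python
import numpy as np

fs = 16000                                  # assumed sampling rate (Hz)
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 200 * t)         # stand-in for a speech frame

X = np.fft.rfft(frame * np.hanning(len(frame)))
psd = np.abs(X) ** 2                        # P[k] = |X[k]|^2
freqs = np.fft.rfftfreq(len(frame), 1 / fs)
print(freqs[np.argmax(psd)])                # peak lands near 200 Hz
```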

Page 8: Media Processing – Audio Part

Fast Fourier transform (FFT)

The FFT is a fast computation of the DFT. The typical FFT algorithm consists of three conceptual parts:

Shuffling (bit reversal): shuffling the N-dimensional input into N one-point signals

Performing N one-point DFTs

Merging the N one-point DFTs into one N-point DFT using "butterfly" merging equations (requiring N to be an integral power of 2)

The computational complexities of the FFT and DFT are, respectively:

FFT: $O(N \log N)$

DFT: $O(N^2)$
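A rough timing sketch (illustrative; N is an arbitrary choice) that makes the O(N²) vs O(N log N) gap visible:

```python
import timeit

import numpy as np

N = 1024
x = np.random.randn(N)
n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # precomputed DFT matrix

t_dft = timeit.timeit(lambda: W @ x, number=100)          # O(N^2) per call
t_fft = timeit.timeit(lambda: np.fft.fft(x), number=100)  # O(N log N) per call
print(f"direct DFT: {t_dft:.4f} s, FFT: {t_fft:.4f} s")
```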

Page 9: Media Processing – Audio Part

Short-time Fourier transform (STFT)

The STFT (sometimes called the short-term FT) can be computed as an N-point windowed DFT as follows (note that we only consider the discrete form here; in practice, the FFT is usually used to compute the DFT in each frame):

$$x[k, m] = \sum_{n=0}^{N-1} x[n + mR]\, w[n]\, e^{-j\omega_k n}$$

where

$\omega_k$ - the discrete angular frequency: $\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1$

$m$ - the time-frame index

$R$ - the hop size

$w[n]$ - a window function, such as the rectangular or Hann window
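A minimal STFT sketch following the definition above (my own illustration; N, the hop size R, and the Hann window are free choices):

```python
import numpy as np

def stft(x, N=512, R=128):
    """Windowed DFT per frame: x[k, m] with rows = bins k, columns = frames m."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // R
    frames = np.stack([x[m * R : m * R + N] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T    # shape: (N // 2 + 1, n_frames)
```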

Page 10: Media Processing – Audio Part

Spectrogram

The spectrogram of a speech signal can be computed as the magnitude squared STFT:

$$\mathrm{spectrogram}\{x[n]\} = |x[k, m]|^2$$

An example: spectrogram of a female speaker uttering "warm cloak".
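An illustrative sketch using SciPy (the signal here is a random stand-in; window and hop sizes are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

fs = 16000
x = np.random.randn(fs)                      # stand-in for one second of speech
f, t, Sxx = signal.spectrogram(x, fs, window="hann",
                               nperseg=512, noverlap=512 - 128)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))   # power in dB
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
plt.show()
```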

Page 11: Media Processing – Audio Part


Spectrogram (cont.)

Each vertical line of the spectrogram describes the frequency-dependent power distribution of the signal over a short segment (or window) of the speech signal, i.e. the PSD of that segment.

The width of the window is N, and the gap between consecutive windows is the hop size R.

Each horizontal line of the spectrogram represents the power distribution within a particular frequency band as a function of time.

The spectrogram shows the time-frequency spectral distribution of power within the signal.

The spectrogram is much better suited than the waveform to displaying speech structures, e.g. harmonics, the energy balance of frequency components, formants, etc.

The time and frequency resolution of the spectrogram are inversely proportional.

Page 12: Media Processing – Audio Part


Spectrogram – resolution issues

The STFT has a fixed resolution that depends on the selection of the window size.

A wider window gives better frequency resolution (frequency components close together can be separated) but poorer time resolution (the time at which frequencies change), and vice versa.

We use an example from Wikipedia to demonstrate this: a signal composed of four sinusoidal components, with frequencies of 10, 25, 50, and 100 Hz respectively, each lasting 5 seconds. The sampling frequency of the signal is 400 Hz. A sketch reproducing this example is given below.

Multi-resolution analysis tools exist that do not suffer from this problem, such as the wavelet transform.
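A sketch reproducing that example (the two window sizes compared here are my own choices):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

fs = 400
x = np.concatenate([np.sin(2 * np.pi * f0 * np.arange(5 * fs) / fs)
                    for f0 in (10, 25, 50, 100)])   # four 5-second sinusoids

fig, axes = plt.subplots(1, 2, sharey=True)
for ax, N in zip(axes, (32, 256)):                  # short vs long window
    f, t, Sxx = signal.spectrogram(x, fs, nperseg=N, noverlap=N // 2)
    ax.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))
    ax.set_title(f"window = {N} samples")
    ax.set_xlabel("Time (s)")
axes[0].set_ylabel("Frequency (Hz)")
plt.show()
```

The short window localises the transitions in time but smears the frequencies; the long window resolves the four frequencies but blurs the moments at which they change.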

Page 13: Media Processing – Audio Part


Spectrogram resolution issues (cont.)

Different time-frequency resolutions for the same signal, resulting from the different window sizes used in generating the STFT. (Source: Wikipedia.)

Page 14: Media Processing – Audio Part


Spectrogram resolution issues (cont.)

Although a long window gives higher frequency resolution, using too long a window would be misleading, as the spectral characteristics would change over the duration of the windowed segment.

How long a window should we choose such that the spectral characteristics do not change (dramatically)? This question relates to the concept of "stationarity".

In practice, a speech segment with a length of around 20-30 ms is usually regarded as "quasi-stationary" (very little change in spectral characteristics). This is because speech units (phonemes) occur at a rate of 4-5 per second in average speech, although more rapid changes can occur from one steady state to another.

To ensure smooth transitions of the energy distribution from frame to frame, the windows are usually chosen to be overlapping, with a typical hop size of 5ms.

Page 15: Media Processing – Audio Part


Windowing and overlapping in spectrogram

By choosing the window length appropriately, the assumption of stationarity (quasi-stationarity) within the windowed speech approximately holds.

However, when appending copies of the segment one after another, there may still be sharp discontinuities in the waveform at the boundaries (see the figure below).

The discontinuity results in high-frequency noise spread across the spectrum, known as spectral leakage.

Spectral leakage: (a) a sinusoidal audio segment; (b) its periodic extension.

Page 16: Media Processing – Audio Part


Windowing and overlapping in spectrogram (cont.)

To reduce spectral leakage, we can multiply the segment by a window function that approaches zero at its ends, such as the Hann window shown below. This effectively attenuates the discontinuities at the two boundaries of the window, and therefore reduces the leakage.

The waveform and amplitude spectrum of the Hann window function

In practice, short segments can be appended with zeros to reach the required length, known as zero-padding.
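An illustrative sketch (the tone frequency, lengths, and comparison region are my own choices) showing how the Hann window lowers the leakage floor, with zero-padding to a fixed FFT length:

```python
import numpy as np

fs, N, nfft = 1000, 240, 1024
t = np.arange(N) / fs
seg = np.sin(2 * np.pi * 123.4 * t)          # tone not centred on a DFT bin

rect = np.abs(np.fft.rfft(seg, nfft))        # rfft zero-pads the segment to nfft
hann = np.abs(np.fft.rfft(seg * np.hanning(N), nfft))

# Leakage far from the tone (bins 400+ here): the Hann floor is much lower.
print(20 * np.log10(rect[400:].max() / rect.max()))
print(20 * np.log10(hann[400:].max() / hann.max()))
```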

Page 17: Media Processing – Audio Part


Various window functions

Source: Kondoz (2001)

Page 18: Media Processing – Audio Part


Time plots of various window functions

Source: Kondoz (2001)

Page 19: Media Processing – Audio Part


Frequency response of various window functions

Source: Kondoz (2001)

Page 20: Media Processing – Audio Part


Short-time spectral analysis using DFT

Effect of window type on voiced speech with a 220-sample window length. (a) and (b) are time and frequency plots of the speech using a rectangular window; (c) and (d) are time and frequency plots using a Hamming window.

Source: Kondoz (2001)

Page 21: Media Processing – Audio Part


Short-time spectral analysis using DFT

Effect of window type on unvoiced speech with a 220-sample window length. (a) and (b) are time and frequency plots of the speech using a rectangular window; (c) and (d) are time and frequency plots using a Hamming window.

Source: Kondoz (2001)

Page 22: Media Processing – Audio Part


Short-time spectral analysis using DFT

Effect of window type on voiced speech with a 40-sample window length. (a) and (b) are time and frequency plots of the speech using a rectangular window; (c) and (d) are time and frequency plots using a Hamming window.

Source: Kondoz (2001)

Page 23: Media Processing – Audio Part


Discrete Cosine Transform (DCT)

A definition of the DCT (DCT-II) is shown below:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad k = 0, 1, \ldots, N-1$$

Source: Wikipedia
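A quick numerical check of this formula against SciPy (illustrative; SciPy's unnormalised DCT-II carries an extra factor of 2):

```python
import numpy as np
from scipy.fft import dct

x = np.random.randn(16)
N = len(x)
n = np.arange(N)
X = np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k)) for k in range(N)])
assert np.allclose(2 * X, dct(x, type=2))   # matches scipy's DCT-II convention
```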

Page 24: Media Processing – Audio Part


MDCT

An advantage of the modified DCT (MDCT) is that it allows for a 50% overlap between blocks without increasing the data rate.

The MDCT is an example of a class of transforms called Time Domain Aliasing Cancellation (TDAC). In particular, MDCT is sometimes referred to as oddly-stacked TDAC (OTDAC).

Unlike the DFT, inverting a single MDCT block does not recover the original signal; it recovers a signal with the adjacent blocks' data mixed into it, the so-called "time-domain aliasing". When the overlapped inverse blocks are added together, this aliasing is removed, and as a result the input signal is perfectly reconstructed.

Page 25: Media Processing – Audio Part


MDCT (cont.)

Analysis: from time to frequency

Synthesis: from frequency to time

For the signal to be perfectly reconstructed after the synthesis process, the windows should satisfy the following condition:

where i is the index of blocks (or short-time frames), the subscript a means analysis, s means synthesis, and n0 = (N/2 + 1)/2.
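Since the analysis/synthesis formulas and the window condition appear as images on the slides, here is a hedged NumPy sketch of an MDCT/IMDCT pair using a sine window (one common choice satisfying the perfect-reconstruction condition) and the n0 = (N/2 + 1)/2 offset given above; the 2/M normalisation is one common convention, not necessarily the book's:

```python
import numpy as np

N = 64                  # block length; M = N // 2 new samples per hop (50% overlap)
M = N // 2
n0 = (M + 1) / 2        # = (N/2 + 1)/2, as on the slide
C = np.cos(np.pi / M * np.outer(np.arange(M) + 0.5, np.arange(N) + n0))
w = np.sin(np.pi / N * (np.arange(N) + 0.5))          # sine window

def mdct(block):        # analysis: N windowed samples -> M coefficients
    return C @ (w * block)

def imdct(X):           # synthesis: M coefficients -> N time-aliased samples
    return w * ((2.0 / M) * (C.T @ X))

# Overlap-add of adjacent inverse blocks cancels the time-domain aliasing (TDAC).
x = np.random.randn(3 * M)
y1 = imdct(mdct(x[0:N]))
y2 = imdct(mdct(x[M:M + N]))
recon = y1[M:] + y2[:M]                  # overlapping halves of the two blocks
assert np.allclose(recon, x[M:N])        # the shared M samples are reconstructed
```

Each inverse block on its own is aliased; only the overlap-add removes the aliasing, which is exactly the TDAC behaviour described above.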

Page 26: Media Processing – Audio Part


MDCT (cont.)

Responses of the MDCT filter bank (cosine window function):

Source: Bosi & Goldberg (2002)

Page 27: Media Processing – Audio Part


Subband analysis of audio signals

General subband analysis framework for audio coding:

Source: Bosi & Goldberg (2002)

Page 28: Media Processing – Audio Part


Pseudo quadrature mirror filter (PQMF) as a subband analysis tool

The PQMF filter bank employs analysis filters h and synthesis filters g respectively, where k is the frequency index and n is the time index; see its general form below (a prototype lowpass filter modulated by cosine functions).
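The modulation formula itself appears as an image on the slide. As a hedged stand-in, here is one standard pseudo-QMF construction: a prototype lowpass filter h (a crude windowed sinc here, not an optimised codec prototype) cosine-modulated into K analysis/synthesis filters, with the common ±π/4 phase terms for adjacent-band alias cancellation; K and the filter length are my own choices:

```python
import numpy as np

K = 8                          # number of subbands (assumed for illustration)
L = 16 * K                     # prototype filter length (assumed)
t = np.arange(L) - (L - 1) / 2

# Prototype lowpass with cutoff pi/(2K), shaped by a Hamming window.
h = np.sinc(t / (2 * K)) / (2 * K) * np.hamming(L)

k = np.arange(K)[:, None]
phase = ((-1) ** np.arange(K))[:, None] * np.pi / 4
analysis = 2 * h * np.cos(np.pi / (2 * K) * (2 * k + 1) * t + phase)   # h_k[n]
synthesis = 2 * h * np.cos(np.pi / (2 * K) * (2 * k + 1) * t - phase)  # g_k[n]
print(analysis.shape)          # (K, L): one bandpass filter per subband
```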

Page 29: Media Processing – Audio Part


PQMF filter bank

An example of subband analysis of speech:

Source: Kondoz (2001)

Page 30: Media Processing – Audio Part


PQMF filter bank (cont.)

The PQMF filter bank in MPEG audio standards employs the following analysis and synthesis filters respectively, where k is the frequency index and n is the time index.

Page 31: Media Processing – Audio Part


PQMF filter bank (cont.)

MPEG Audio PQMF prototype filter impulse response h[n] and h_k[n], for k = 0 and k = 1.

Source: Bosi & Goldberg (2002)

Page 32: Media Processing – Audio Part


PQMF filter bank (cont.)

Frequency response of the prototype filter (unit: Fs/64) used in MPEG audio standards.

Source: Bosi & Goldberg (2002)

Page 33: Media Processing – Audio Part


PQMF filter bank (cont.)

Frequency response of the first four bands of the MPEG audio coding standards (unit: Fs/64) is shown in the figure below.

Source: Bosi & Goldberg (2002)

Page 34: Media Processing – Audio Part


Audio coding methods

Lossless coding

Based on statistical relations between symbols within the data.

Entropy coding, such as Huffman coding, arithmetic coding, etc.

The original signals can be perfectly reconstructed.

Lossy coding

Based on perceptual modelling of audio signals (such as psychoacoustic models of hearing); some redundant information within audio signals can be removed without affecting their perceptual quality.

Usually done in the transform domain, followed by quantisation.

The original signals cannot be perfectly reconstructed.

Page 35: Media Processing – Audio Part


Lossless coding: Huffman coding

Huffman coding is a variable-length coding method for coding symbols based on the probabilities of each symbol's occurrence.

Consider a 2-bit quantised signal with the codes [00], [01], [10], [11], and suppose we have a signal to be encoded where the probabilities of the symbols' occurrence are 70%, 15%, 10%, and 5% respectively. The original bit rate is 2 bits per sample; after entropy coding, the average bit rate becomes 1.45 (= 0.70×1 + 0.15×2 + 0.10×3 + 0.05×3) bits per sample.

Source: Bosi & Goldberg (2002)
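A small sketch (my own illustration) that derives the code lengths for these four symbols with the classic merge-two-least-probable procedure and reproduces the 1.45 bits/sample figure:

```python
import heapq

def huffman_lengths(probs):
    """Return the Huffman code length per symbol."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]  # (prob, tiebreak, members)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    count = len(probs)
    while len(heap) > 1:
        p1, _, m1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, m2 = heapq.heappop(heap)
        for s in m1 + m2:
            lengths[s] += 1               # each merge adds one bit to members
        count += 1
        heapq.heappush(heap, (p1 + p2, count, m1 + m2))
    return lengths

probs = [0.70, 0.15, 0.10, 0.05]
lengths = huffman_lengths(probs)          # -> [1, 2, 3, 3]
print(sum(p * l for p, l in zip(probs, lengths)))  # 1.45 bits per sample
```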

Page 36: Media Processing – Audio Part


Lossy coding: psychoacoustic model

Psychoacoustic principles and models, in particular, frequency and temporal masking, have been used as a basis in producing perceptually lossless audio quality in lossy audio coding algorithms.

Frequency masking

Source: Bosi & Goldberg (2002)

Page 37: Media Processing – Audio Part


Lossy coding: psychoacoustic model

Psychoacoustic principles and models, in particular, frequency and temporal masking, have been used as a basis in producing perceptually lossless audio quality in lossy audio coding algorithms.

Temporal masking

Source: Bosi & Goldberg (2002)

Page 38: Media Processing – Audio Part


MPEG-1 audio coding standard

MPEG: the Moving Picture Experts Group (MPEG), within the Joint Technical Committee on information technology (JTC 1) sponsored by the International Organisation for Standardisation (ISO) and the International Electrotechnical Commission (IEC), was established at the end of the 1980s with the aim of developing standards for the coded representation of moving pictures, associated audio, and their combination.

MPEG-1 Audio was the first international standard specifying a digital format for high-quality audio, with the aim of reducing the data rate while maintaining CD-like quality. Prior to this standard, standardisation efforts had targeted either speech-only applications or only medium-quality audio performance. The adoption of MPEG-1 enabled the use of compressed high-quality audio in a wide range of applications, from digital broadcasting to internet applications.

Page 39: Media Processing – Audio Part


MPEG standards – a brief history

The MPEG standardisation effort started in 1988.

The MPEG-1 standard [ISO/IEC 11172], for coding of synchronised video and audio at a total data rate of about 1.5 Mb/s, was finalised in 1992.

The MPEG-2 standard [ISO/IEC 13818], for coding of synchronised video and audio at a total data rate of about 10 Mb/s, was finalised in 1994.

The effort for an MPEG-3 standard, i.e. coding of synchronised video and audio at a total data rate of about 40 Mb/s, was dropped in 1993 after being considered redundant, as its attributes were already incorporated in the MPEG-2 standard.

The MPEG-4 standard [ISO/IEC 14496], addressing audio-visual coding at very low data rates with additional functionalities such as scalability, 3-D, and synthetic/natural hybrid coding, was finalised in 1998.

MPEG-7 [ISO/IEC 15938], addressing the description of multimedia content for multimedia database search, was finalised in 2001.

Page 40: Media Processing – Audio Part


MPEG audio standards

MPEG Audio is usually used as a stand-alone standard; however, it is part of a multi-part standard, where "part 1" describes the system structure, "part 2" describes video coding, and "part 3" the audio coding.

MPEG-1 Audio

Defines coding/decoding of high-quality audio signals for storage media.

Standardises only the bitstream and decoder specifications, but not the encoder. This allows interoperability between different implementations and lets manufacturers retain control of the core intellectual property of their coding systems.

Aims to support one or two main channels.

Inputs and outputs are compatible with existing PCM standards such as the CD and digital audio tape formats.

Sampling rates of 32 kHz, 44.1 kHz, and 48 kHz.

Source: Bosi & Goldberg (2002)

Page 41: Media Processing – Audio Part


MPEG audio standards (cont.)

MPEG-2 Audio

Extension of MPEG-1 Audio to multiple channels.

Lower sampling rates than MPEG-1, at 16 kHz, 22.05 kHz, and 24 kHz.

Motivated by the emerging internet applications.

Defines higher-quality multichannel audio than is achievable with the MPEG-1 extensions.

MPEG-2 AAC (NBC, non backward-compatible) shows comparable or better audio quality than MPEG-2 Layer II BC (backward-compatible).

Page 42: Media Processing – Audio Part


MPEG audio standards (cont.)

MPEG-4 Audio

Aims to provide high coding efficiency, with data rates (200 b/s to 64 kb/s) lower than those defined in MPEG-1/2.

Accommodates speech coding technology and error protection.

Includes content-based interactivity, such as flexible access and manipulation, for example pitch modification.

Allows universal access, such as access to a subset of the data, or scalability.

Supports synthetic audio and speech, e.g. structured audio and text-to-speech interfaces.

Accommodates additional post-processing effects, such as reverberation, 3D, etc.

Page 43: Media Processing – Audio Part


MPEG-1 Audio

The MPEG-1 standard part 3, i.e. [ISO/IEC 11172-3], specifies the audio part of the MPEG-1 standard.

It includes the syntax of the coded audio bitstream and a description of the decoding process, which ensures interoperability between different systems.

It also provides reference software modules and a set of test vectors for assessing the compliance of a decoder.

It does not define the encoder, which is left to the system designer.

It describes perceptual audio coding algorithms for general audio signals, unlike speech codecs, where a specific source model is applied.

Page 44: Media Processing – Audio Part


MPEG-1 Audio – main features

It supports sampling rates of 32, 44.1, and 48 kHz, and one or two channels (including a dual monophonic mode for two independent channels and a stereo mode for stereophonic channels).

Data rates vary between 32 and 224 kb/s per channel, allowing for compression ratios between 2.7:1 and 24:1, depending on the sampling rate.

It specifies three different layers, Layer I, Layer II, and Layer III, which offer increasingly higher audio quality at slightly increased complexity.

Layer I is the simplest layer and operates at data rates between 32 and 224 kb/s per channel (preferably above 128 kb/s). It finds applications in e.g. the digital compact cassette (DCC) at 192 kb/s per channel.

Layer II is of medium complexity and operates preferably between 32 and 192 kb/s per channel (providing very good quality at 128 kb/s). Applications of Layer II include digital audio broadcasting (DAB).

Layer III has the highest quality, with increased complexity. Its data rates vary between 32 and 160 kb/s per channel. Applications include ISDN and internet transmission. A modified MPEG Layer III format at lower sampling frequencies became the well-known MP3 format.

Page 45: Media Processing – Audio Part


MPEG-1 Audio coding: main building blocks of encoder

The encoder includes a time to frequency mapping stage followed by a bit (or noise) allocation stage. The psychoacoustic model is used to determine the precision of the allocation stage. The bitstream formatting stage interleaves the representation of the quantised data with side information and optional ancillary data.

Source: Bosi & Goldberg (2002)

Page 46: Media Processing – Audio Part


MPEG-1 Audio coding: main building blocks of decoder

The decoder interprets the bitstream, restores the quantised spectral components of the signal and reconstructs the time domain representation of the audio signal by frequency to time mapping.

Source: Bosi & Goldberg (2002)

Page 47: Media Processing – Audio Part


MPEG-1 Audio: coding options of layer I and II

In both Layer I and II, T-F mapping is performed by a 32-band PQMF, whose output is scaled and quantised with a uniform midtread quantiser whose precision is determined by the output of the psychoacoustic model, based on a 512-point (Layer I) or 1024-point (Layer II) FFT analysis. To reduce the data rate, group coding of consecutive quantised samples is applied in Layer II.

Source: Bosi & Goldberg (2002)

Page 48: Media Processing – Audio Part


MPEG-1 Audio: coding options of layer III

In Layer III, the output of the PQMF is fed to an MDCT stage. The filter bank is adaptive, rather than static (as in Layers I & II), and its output is scaled and non-uniformly quantised with a midtread quantiser. Noiseless coding, such as Huffman coding, is also employed. Side information includes bit allocation and control parameters.

Source: Bosi & Goldberg (2002)

Page 49: Media Processing – Audio Part

Time-frequency mapping in layer III: analysis filterbank

After the 32-band PQMF filter, the subband samples are overlapped by 50%, multiplied by a sine window, and then processed by the MDCT.

The MDCT output is multiplied by coefficients to reduce the aliasing effects caused by the PQMF and its overlapping bands.

Source: Bosi & Goldberg (2002)

Page 50: Media Processing – Audio Part


Time-frequency mapping in layer III: synthesis filterbank

In the decoder, the inverse aliasing reduction process is applied prior to the IMDCT (inverse MDCT). Without aliasing reduction, a pure sine wave passed through the PQMF/MDCT filter bank can present a spurious component as high as -12 dB with respect to the original signal.

Source: Bosi & Goldberg (2002)

Page 51: Media Processing – Audio Part


Time-frequency mapping in layer III: block switching

The block size processed by the Layer III filter bank is 32 × 36 time samples = 1152, which leads to a frequency resolution of about 41.66 Hz at a 48 kHz sampling rate, and hence is good for performing bit allocation based on the psychoacoustic model.

However, for transient signals, such a long block size can result in unmasked temporal noise, such as pre-echo. Hence, a shorter block size of 32 × 12 = 384 time samples is used to improve the time resolution and thereby reduce the temporal spreading of quantisation noise for sharp attacks (a quick numeric check is given below).

Two transition blocks, long-to-short and short-to-long, having the same size as the long block, are employed.
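A quick check of the numbers quoted above:

```python
fs = 48000
long_block = 32 * 36            # 1152 samples
short_block = 32 * 12           # 384 samples
print(fs / long_block)          # ~41.67 Hz frequency resolution
print(1000 * long_block / fs)   # 24 ms long-block duration
print(1000 * short_block / fs)  # 8 ms short-block duration
```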

Page 52: Media Processing – Audio Part


Time-frequency mapping in layer III: window sequence in block switching

Source: Bosi & Goldberg (2002)

Page 53: Media Processing – Audio Part


Time-frequency mapping in layer III: window sequence in block switching

The mixed block mode ensures high frequency resolution at low frequencies and high time resolution at high frequencies.

Source: Bosi & Goldberg (2002)

Page 54: Media Processing – Audio Part


Time-frequency mapping in layer III as compared with that in layer I & II

The hybrid filter bank in Layer III has advantages such as high frequency resolution; a dynamic, adaptive tradeoff between time and frequency resolution; and full compatibility with Layers I & II.

The disadvantages include the potential aliasing effects introduced by the MDCT and long impulse response filters.

The complexity of the Layer III filter bank is increased with respect to that of Layers I & II.

Page 55: Media Processing – Audio Part


Psychoacoustic models in MPEG audio

Source: Bosi & Goldberg (2002)

Page 56: Media Processing – Audio Part


References

Marina Bosi and Richard E. Goldberg, “Introduction to Digital Audio Coding and Standards”, Springer, 2002.

Ahmet Kondoz, “Digital Speech Coding for Low Bit Rate Communication Systems”, Wiley, 2001.