11/16/11

Audio Henning Schulzrinne Dept. of Computer Science Columbia University Fall 2011

Human speech

Mark Handley

Human speech

• voiced sounds: vocal cords vibrate (e.g., A4 [above middle C] = 440 Hz)
  - vowels (a, e, i, o, u, …)
  - determines pitch
• unvoiced sounds:
  - fricatives (f, s)
  - plosives (p, d)
• filtered by vocal tract
• changes slowly (10 to 100 ms)
• air volume → loudness (dB)

Human hearing

Human hearing

Human hearing & age

Digital sound

Analog-to-digital conversion

• Sample value of analog signal at fs (8 – 96 kHz)
• Digitize into 2^B discrete values (B = 8-24 bits)
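
The two steps above can be sketched as follows. This is a minimal illustration, assuming an input that stays within [-1, 1]; the function name, parameters and the 440 Hz test tone are hypothetical choices, not taken from the slides.

```python
import math

def sample_and_quantize(signal, fs, bits, duration):
    """Sample signal(t) at fs Hz for `duration` seconds and map each
    sample in [-1, 1] to one of 2**bits integer codes."""
    levels = 2 ** bits
    codes = []
    for n in range(int(fs * duration)):
        x = signal(n / fs)                  # sample at t = n / fs
        x = max(-1.0, min(1.0, x))          # clip to the quantizer range
        code = min(int((x + 1.0) / 2.0 * levels), levels - 1)
        codes.append(code)
    return codes

# 440 Hz tone, 8 kHz sampling, 8 bits per sample, 10 ms of signal
tone = lambda t: math.sin(2 * math.pi * 440 * t)
codes = sample_and_quantize(tone, fs=8000, bits=8, duration=0.01)
```

Each code occupies B bits, so raising B by one doubles the number of levels and halves the step size.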

Mark Handley

Sample & hold quantization

noise

Mark Handley

Direct-Stream Digital

Delta-Sigma coding

How fast to sample?

• Harry Nyquist (1928) & Claude Shannon (1949)
  - no loss of information → sampling frequency ≥ 2 × maximum signal frequency
• More recent: compressed sensing
  - works for sparse signals in some space
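
To see what goes wrong when the sampling condition is violated, the sketch below computes the frequency at which a pure tone shows up after sampling. It is an illustration only; `alias_frequency` is a hypothetical helper, and it assumes an ideal sampler with no anti-aliasing filter in front.

```python
def alias_frequency(f, fs):
    """Frequency (Hz) at which a tone of f Hz appears after sampling at fs Hz."""
    f = f % fs                        # sampling cannot tell f from f + k*fs
    return fs - f if f > fs / 2 else f

print(alias_frequency(3000, 8000))    # below fs/2: stays at 3000
print(alias_frequency(5000, 8000))    # above fs/2: folds back to 3000
```

A 5 kHz tone sampled at 8 kHz is therefore indistinguishable from a 3 kHz tone, which is why the input is low-pass filtered before sampling.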

Audio coding

application    frequency        sampling    quantization (bits)
telephone      300-3,400 Hz     8 kHz       12-13
wide-band      50-7,000 Hz      16 kHz      14-15
high quality   30-15,000 Hz     32 kHz      16
CD             20-20,000 Hz     44.1 kHz    16
DAT            10-22,000 Hz     48 kHz      ≤ 24

24 bit, 44.1/48 kHz
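
The uncompressed (linear PCM) bit rate for any row of this table is the sampling rate times the bits per sample times the number of channels. A small sketch, with the function name chosen here for illustration and the two-channel CD case added as an assumption about stereo playback:

```python
def pcm_bit_rate(sampling_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in bits per second."""
    return sampling_hz * bits_per_sample * channels

print(pcm_bit_rate(8000, 13))        # telephone, 13-bit linear: 104000
print(pcm_bit_rate(44100, 16, 2))    # stereo CD: 1411200
```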

Complete A/D

Mark Handley

Aliasing distortion

Mark Handley

Mark Handley

Quantization

• CDs: 16 bit → lots of bits
• Professional audio: 24 bits (or more)
• 8-bit linear has poor quality (noise)
• Ear has logarithmic sensitivity → “companding”
  - used for Dolby tape decks
  - quantization noise ~ signal level
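
Companding can be sketched with the continuous µ-law curve. This is an illustration under stated assumptions: µ = 255 is assumed, inputs are normalized to [-1, 1], and the code uses the smooth formula rather than the segmented 8-bit format that telephone codecs actually transmit. All function names are hypothetical.

```python
import math

MU = 255.0  # assumed companding parameter

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with logarithmic spacing."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def compand_8bit(x):
    """Compress, quantize uniformly to 8 bits, then expand."""
    y = mu_compress(x)
    q = round(y * 127) / 127
    return mu_expand(q)

quiet = 0.001
linear_error = abs(round(quiet * 127) / 127 - quiet)   # plain 8-bit linear
companded_error = abs(compand_8bit(quiet) - quiet)     # 8-bit after companding
```

For quiet samples the companded error is much smaller than the linear one, because the quantizer steps are finer near zero and coarser near full scale.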

Quantization noise

Mark Handley

Fourier series

• Express periodic function as sum of sines and cosines of different amplitudes
  - iff band-limited, finite sum
• Time domain → frequency domain
  - no information loss
  - and no compression
  - but for periodic (or time limited) signals
• http://www.westga.edu/~jhasbun/osp/Fourier.htm

Fourier series

continuous time, discrete frequencies
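
As an illustration of a periodic function built from sines, the sketch below sums the odd harmonics of a unit square wave with period 1. The square wave is an example chosen here, not one from the slides; because it is not band-limited, the sum never terminates and only approaches the target as more terms are added.

```python
import math

def square_wave_partial_sum(t, terms):
    """Sum of the first `terms` odd harmonics of a unit square wave (period 1)."""
    total = 0.0
    for m in range(terms):
        k = 2 * m + 1                              # odd harmonics only
        total += math.sin(2 * math.pi * k * t) / k
    return 4.0 / math.pi * total

# The value at t = 0.25 moves toward +1 as harmonics are added
for terms in (1, 5, 50):
    print(terms, square_wave_partial_sum(0.25, terms))
```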

Fourier series

Fourier transform

inverse transform

forward transform

continuous time, continuous frequencies
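
The equations on this slide did not survive extraction. For reference, one common convention for the transform pair is shown below; the normalization and the use of f (Hz) rather than angular frequency are assumptions, and the original slide may have used a different form.

```latex
X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt
\qquad \text{(forward transform)}

x(t) = \int_{-\infty}^{\infty} X(f)\, e^{\,j 2\pi f t}\, df
\qquad \text{(inverse transform)}
```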

Fourier transform

• Fourier transform: time series → series of frequencies
  - complex frequencies: amplitude & phase
• Inverse Fourier transform: frequencies (amplitude & phase) → time series
• Note: also works for other basis functions

Discrete Fourier transform

• For sampled functions, continuous FT not very useful → DFT

complex numbers → complex coefficients

DFT example

• Interpreting a DFT can be slightly difficult, because the DFT of real data includes complex numbers.
  - The magnitude of the complex number for a DFT component gives the amplitude at that frequency (its square is the power).
  - The phase θ of the waveform can be determined from the relative values of the real and imaginary coefficients.
• Also both positive and “negative” frequencies show up.
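
These points can be checked on a small example. The sketch below computes a DFT directly from its definition for a cosine with a known amplitude and phase; the 16-sample window, the 2-cycle tone and the 60-degree phase are arbitrary choices made for illustration.

```python
import cmath
import math

def dft(x):
    """Direct DFT from the definition: X[k] = sum_t x[t] * exp(-2*pi*j*k*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

N = 16
# Cosine with 2 cycles per window, amplitude 1, phase of 60 degrees
x = [math.cos(2 * math.pi * 2 * t / N + math.pi / 3) for t in range(N)]
X = dft(x)

amplitude = 2 * abs(X[2]) / N     # energy is split between bins 2 and N-2
phase = cmath.phase(X[2])         # atan2(imaginary, real), in radians
mirror = X[N - 2]                 # the "negative" frequency bin
```

For real input, bin N-2 is the complex conjugate of bin 2, which is how the negative frequency appears in the output.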

Mark Handley

DFT example

Mark Handley

DFT example

Mark Handley

Fast Fourier Transform (FFT)

• Discrete Fourier Transform would normally require O(n²) time to process for n samples:
• Don’t usually calculate it this way in practice.
  - Fast Fourier Transform takes O(n log(n)) time.
  - Most common algorithm is the Cooley-Tukey Algorithm.
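
A minimal sketch of the radix-2 form of the Cooley-Tukey idea follows: split the input into even and odd samples, transform each half recursively, and combine them with "twiddle" factors. It assumes the input length is a power of two and favors clarity over speed; production FFT libraries are considerably more elaborate.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])               # DFT of even-indexed samples
    odd = fft(x[1::2])                # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

spectrum = fft([1, 2, 3, 4, 0, -1, -2, -3])
```

Each level of recursion does O(n) work and there are log2(n) levels, which gives the O(n log(n)) total.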

Fourier Cosine Transform

• Split function into odd and even parts:
• Re-express FT:
• Only real numbers from an even function → DFT becomes DCT

DCT (for JPEG)

other versions exist (e.g., for MP3, with overlap)
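
The DCT formula on this slide did not survive extraction. As a stand-in, here is a sketch of the one-dimensional type-II DCT, the variant JPEG applies along rows and columns of 8×8 blocks. The orthonormal scaling used here is an assumption; other texts scale the coefficients differently.

```python
import math

def dct(x):
    """Orthonormal type-II DCT of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() (an orthonormal type-III DCT)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out

coefficients = dct([8, 16, 24, 32, 40, 48, 56, 64])
```

All coefficients are real, and for smooth inputs most of the energy lands in the first few of them, which is what makes coarse quantization of the rest tolerable.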

Why do we use DCT for multimedia?

• For audio:
  - Human ear has different dynamic range for different frequencies.
  - Transform from time domain to frequency domain, and quantize different frequencies differently.
• For images and video:
  - Human eye is less sensitive to fine detail.
  - Transform from spatial domain to frequency domain, and quantize high frequencies more coarsely (or not at all)
  - Has the effect of slightly blurring the image - may not be perceptible if done right.

Mark Handley

Why use DCT/DFT?

• Some tasks easier in frequency domain
  - e.g., graphic equalizer
• Human hearing is logarithmic in frequency (→ octaves)
• Masking effects (see MP3)

Example: DCT for image

µ-law encoding

Mark Handley

µ-law encoding

Mark Handley

Companding

Wikipedia

µ-law & A-law

Mark Handley

Differential codec

(Adaptive) Differential Pulse Code Modulation

ADPCM

• Makes a simple prediction of the next sample, based on weighted previous n samples.
  - For G.721, previous 8 weighted samples are added to make the prediction.
• Lossy coding of the difference between the actual sample and the prediction.
  - Difference is quantized into 4 bits ⇒ 32 kb/s sent.
  - Quantization levels are adaptive, based on the content of the audio.
• Receiver runs same prediction algorithm and adaptive quantization levels to reconstruct speech.
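
The structure described above can be sketched in a few lines. This is a toy illustration, not G.721: it assumes a first-order predictor (the previous reconstructed sample) instead of 8 weighted samples, and an ad hoc step-size rule. All names and constants are hypothetical. The key property it shows is that encoder and decoder update the prediction and step size from the transmitted codes alone, so they stay in step.

```python
def _adapt(step, code):
    """Grow the step after large codes, shrink it after small ones."""
    if abs(code) >= 6:
        step *= 1.5
    elif abs(code) <= 1:
        step *= 0.8
    return min(max(step, 1e-4), 1.0)

def adpcm_encode(samples, step=0.02):
    """Return one 4-bit code (-8..7) per input sample."""
    pred, codes = 0.0, []
    for x in samples:
        code = max(-8, min(7, round((x - pred) / step)))  # quantized difference
        codes.append(code)
        pred += code * step        # track what the decoder will reconstruct
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes, step=0.02):
    """Rebuild the signal by running the same predictor and step adaptation."""
    pred, out = 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
        step = _adapt(step, code)
    return out
```

At 8,000 samples per second and 4 bits per code, the stream carries 32,000 b/s, matching the rate on the slide.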

Model-based coding

• PCM, DPCM and ADPCM directly code the received audio signal.
• An alternative approach is to build a parameterized model of the sound source (i.e., human voice).
• For each time slice (e.g., 20 ms):
  - Analyze the audio signal to determine how the signal was produced.
  - Determine the model parameters that fit.
  - Send the model parameters.
  - At the receiver, synthesize the voice from the model and received parameters.

Speech formation

Linear predictive codec

• Earliest low-rate codec (1960s)
• LPC10 at 2.4 kb/s
  - sampling rate 8 kHz
  - frame length 180 samples (22.5 ms)
  - linear predictive filter (10 coefficients = 42 bits)
  - pitch and voicing (7 bits)
  - gain information (5 bits)
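
The figures above are mutually consistent, as this short calculation shows. The variable names are chosen here for illustration; the numbers are the ones listed on the slide.

```python
SAMPLING_HZ = 8000
FRAME_SAMPLES = 180
BITS = {"filter coefficients": 42, "pitch and voicing": 7, "gain": 5}

frame_ms = 1000 * FRAME_SAMPLES / SAMPLING_HZ             # frame duration in ms
bits_per_frame = sum(BITS.values())                        # bits sent per frame
bit_rate = bits_per_frame * SAMPLING_HZ / FRAME_SAMPLES    # bits per second

print(frame_ms, bits_per_frame, bit_rate)                  # 22.5 54 2400.0
```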

Linear predictive codec

Code Excited Linear Prediction (CELP)

• Goal is to efficiently encode the residue signal, improving speech quality over LPC, but without increasing the bit rate too much.
• CELP codecs use a codebook of typical residue values. (→ vector quantization)
• Analyzer compares residue to codebook values.
  - Chooses value which is closest.
  - Sends the code for that value.
• Receiver looks up the code in its codebook, retrieves the residue, and uses this to excite the LPC formant filter.
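
The codebook lookup can be sketched as a nearest-vector search. This is a simplification: the tiny codebook below is made up for illustration, and the search minimizes plain squared error on the residue, whereas real CELP coders compare synthesized speech with perceptual weighting.

```python
def nearest_code(residue, codebook):
    """Index of the codebook vector with the smallest squared error."""
    def sq_error(candidate):
        return sum((r - c) ** 2 for r, c in zip(residue, candidate))
    return min(range(len(codebook)), key=lambda i: sq_error(codebook[i]))

# Hypothetical 2-bit codebook of 4-sample residue shapes
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, -1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]

residue = [0.1, 0.9, -0.8, 0.1]
code = nearest_code(residue, CODEBOOK)   # the encoder sends only this index
reconstructed = CODEBOOK[code]           # the receiver looks it up
```

Four residue samples are replaced by a 2-bit index here; the savings grow with longer vectors, but so does the search cost.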

CELP (2)

• Problem is that codebook would require different residue values for every possible voice pitch.
  - Codebook search would be slow, and code would require a lot of bits to send.
• One solution is to have two codebooks.
  - One fixed by codec designers, just large enough to represent one pitch period of residue.
  - One dynamically filled in with copies of the previous residue delayed by various amounts (delay provides the pitch)
• CELP algorithm using these techniques can provide pretty good quality at 4.8 kb/s.

Enhanced LPC usage

• GSM (Groupe Spécial Mobile)
  - Residual Pulse Excited LPC
  - 13 kb/s
• LD-CELP
  - Low-delay Code-Excited Linear Prediction (G.728)
  - 16 kb/s
• CS-ACELP
  - Conjugate Structure Algebraic CELP (G.729)
  - 8 kb/s
• MP-MLQ
  - Multi-Pulse Maximum Likelihood Quantization (G.723.1)
  - 6.3 kb/s

Distortion metrics

• error (noise) r(n) = x(n) - y(n)
• variances σx², σy², σr²
• power for signal with pdf p(x) and range −V ... +V
• SNR = 6.02N − 1.73 dB for uniform quantizer with N bits
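
The roughly 6 dB-per-bit slope in the last formula can be checked numerically. The sketch below quantizes a full-scale sine wave with a uniform quantizer and measures the SNR. The additive constant depends on the signal's amplitude distribution relative to the range V, so the sketch checks only the slope and does not try to reproduce the −1.73 term. The test signal and function names are assumptions.

```python
import math

def quantize_uniform(x, bits, v=1.0):
    """Uniform quantizer over [-v, +v] with 2**bits levels (mid-rise)."""
    levels = 2 ** bits
    step = 2.0 * v / levels
    idx = math.floor((x + v) / step)
    idx = min(max(idx, 0), levels - 1)
    return -v + (idx + 0.5) * step

def snr_db(signal, bits):
    """10*log10(signal power / quantization noise power)."""
    noise = [x - quantize_uniform(x, bits) for x in signal]
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(r * r for r in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Full-scale sine at a frequency that does not divide the window evenly
signal = [math.sin(2 * math.pi * 0.01234567 * n) for n in range(20000)]
snr_8 = snr_db(signal, 8)
snr_9 = snr_db(signal, 9)
gain_per_bit = snr_9 - snr_8      # expected to be close to 6 dB
```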

Distortion measures

• SNR not a good measure of perceptual quality
  - ➠ segmental SNR: time-averaged blocks (say, 16 ms)
• frequency weighting
• subjective measures:
  - A-B preference
  - subjective SNR: comparison with additive noise
  - MOS (mean opinion score of 1-5), DRT, DAM, . . .
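
Segmental SNR can be sketched as the average of per-block SNR values in dB. This minimal version assumes non-overlapping blocks, skips blocks with no signal or no noise, and expects at least one usable block; the function name is hypothetical.

```python
import math

def segmental_snr_db(reference, decoded, block):
    """Mean of per-block SNRs in dB over non-overlapping blocks."""
    snrs = []
    for start in range(0, len(reference) - block + 1, block):
        x = reference[start:start + block]
        y = decoded[start:start + block]
        p_signal = sum(a * a for a in x)
        p_noise = sum((a - b) ** 2 for a, b in zip(x, y))
        if p_signal > 0 and p_noise > 0:
            snrs.append(10 * math.log10(p_signal / p_noise))
    return sum(snrs) / len(snrs)

# 16 ms blocks at 8 kHz are 128 samples: one loud block, one quiet block,
# both with the same absolute error of 0.01
reference = [1.0] * 128 + [0.1] * 128
decoded = [v - 0.01 for v in reference]
seg = segmental_snr_db(reference, decoded, block=128)
```

The loud block scores 40 dB and the quiet one 20 dB, so the segmental value is 30 dB. A plain SNR over the whole signal would be dominated by the loud block and hide the poor quality of the quiet one.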

Quality metrics

• speech vs. music
• communication vs. toll quality

score  MOS                  DMOS                    understanding
5      excellent            inaudible               no effort
4      good, toll quality   audible, not annoying   no appreciable effort
3      fair                 slightly annoying       moderate effort
2      poor                 annoying                considerable effort
1      bad                  very annoying           no meaning

Subjective quality metrics

• Test phrases (ITU P.800)
  - You will have to be very quiet.
  - There was nothing to be seen.
  - They worshipped wooden idols.
  - I want a minute with the inspector.
  - Did he need any money?
• Diagnostic rhyme test (DRT)
  - 96 pairs like dune vs. tune
  - 90% right → toll quality

Objective quality metrics

• approximate human perception of noise and other distortions
• distortion due to encoding and packet loss (gaps, interpolation of decoder)
• examples: PSQM (P.861), PESQ (P.862), MNB, EMBSD – compare reference signal to distorted signal
• either generate MOS scores or distance metrics
• much cheaper than subjective tests
• only for telephone-quality audio so far

Objective vs. subjective quality

Common narrowband audio codecs

Codec    rate (kb/s)  delay (ms)   multi-rate  embedded  VBR  bit-robust/PLC  remarks
iLBC     15.2, 13.3   20, 30                                  --/X            quality higher than G.729A; no licensing
Speex    2.15-24.6    30           X           X         X    --/X            no licensing
AMR-NB   4.75-12.2    20           X                          X/X             3G wireless
G.729    8            15                                      X/X             TDMA wireless
GSM-FR   13           20                                                      GSM wireless (Cingular)
GSM-EFR  12.2         20                                      X/X             2.5G
G.728    16, 12.8     2.5                                     X/X             H.320 (ISDN videoconferencing)
G.723.1  5.3, 6.3     37.5, 37.5                              X/--            H.323, videoconferences

Common wideband audio codecs

Codec    rate (kb/s)  delay (ms)   multi-rate  embedded  VBR  bit-robust/PLC  remarks
Speex    4-44.4       34           X           X         X    --/X            no licensing
AMR-WB   6.6-23.85    20           X                          X/X             3G wireless
G.722    48, 56, 64   0.125 (1.5)                             X/--            2 sub-bands; now dated

http://www.voiceage.com/listeningroom.php

MOS vs. packet loss

iLBC – MOS behavior with packet loss

Recent audio codecs

• iLBC: optimized for high packet loss rates (frames encoded independently)
• AMR-NB
  - 3G wireless codec
  - 4.75-12.2 kb/s
  - 20 ms coding delay

Speex

• Open-source patent-free speech codec
• CELP (code-excited linear prediction) codec
• operating modes:
  - narrowband (8 kHz sampling rate): 2.15-24.6 kb/s, delay of 30 ms
  - wideband (16 kHz sampling rate): 4-44.2 kb/s, delay of 34 ms
  - ultra-wideband (32 kHz sampling rate)
• intensity stereo encoding
• variable bit rate (VBR) possible
• voice activity detection (VAD)

Opus

• interactive speech & music
• 6 kb/s … 510 kb/s (music)
• delay: 5 ms … 65.2 ms
• Linear prediction + MDCT

Comparison

Ogg Vorbis

• Similar in application to AAC, MP3, VQF, …, but claims to be free of patents
• Ogg = container format file (also for Speex, FLAC)
• Vorbis = music speech codec
• near CD quality = 160 kb/s
• forward-adaptive modified DCT (discrete cosine transform)
• overlapping windows
• floor: carries frequency representation as piecewise linear interpolated representation on a dB amplitude scale and linear frequency scale
• residue: subtract out floor → cascaded (multi-pass) vector quantization
• entropy (Huffman) coding
• carries codec parameters in header

Audio traffic models

• talkspurt: typically, constant bit rate:
  - one packet every 20 ... 100 ms ➠ mean duration: 1.67 s
• silence period: usually none
  - (maybe transmit background noise value) ➠ mean duration: 1.34 s
• ➠ for telephone conversation, both roughly exponentially distributed
• double talk for “hand-off”
• may vary between conversations
  - ➠ only in aggregate
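
The on/off behavior described above can be sketched as an alternating sequence of exponentially distributed talkspurts and silences. The two means are the values on the slide; the 20 ms packet interval, the fixed seed and the function names are assumptions made for illustration.

```python
import random

MEAN_TALK_S = 1.67       # mean talkspurt duration, from the slide
MEAN_SILENCE_S = 1.34    # mean silence duration, from the slide

def activity_factor(mean_talk, mean_silence):
    """Long-run fraction of time the source is sending packets."""
    return mean_talk / (mean_talk + mean_silence)

def simulate(duration_s, packet_interval_s=0.02, seed=1):
    """Count packets sent while alternating talkspurts and silences."""
    rng = random.Random(seed)
    t, packets = 0.0, 0
    while t < duration_s:
        talk = rng.expovariate(1.0 / MEAN_TALK_S)
        packets += int(talk / packet_interval_s)   # one packet per interval
        t += talk + rng.expovariate(1.0 / MEAN_SILENCE_S)
    return packets

print(activity_factor(MEAN_TALK_S, MEAN_SILENCE_S))
print(simulate(600.0))
```

With these means the source is active a little over half the time, which is the saving silence suppression offers over sending packets continuously.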

Sound localization

• Human ear uses 3 metrics for stereo localization:
  - intensity
  - time of arrival (TOA) – 7 µs
  - direction filtering and spectral shaping by outer ear
• For shorter wavelengths (4 – 20 kHz), head casts an acoustical shadow giving rise to a lower sound level at the ear farthest from the sound sources
• At long wavelengths (20 Hz - 1 kHz), the head is very small compared to wavelengths
  - In this case localization is based on perceived Interaural Time Differences (ITD)

UCSC CMPE250 Fall 2002

Audio samples

• http://www.cs.columbia.edu/~hgs/audio/codecs.html
• Speex: http://www.speex.org/audio/samples/
  - both narrowband and wideband