Speech Technology - Kishore Prahallad ([email protected])
Speech Technology: A Practical Introduction
Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis
Kishore Prahallad
Email: [email protected]
Carnegie Mellon University &
International Institute of Information Technology Hyderabad
Topics
• Spectrogram
• Cepstrum
• Mel-Frequency Analysis
• Mel-Frequency Cepstral Coefficients
Spectrogram
Speech signal represented as a sequence of spectral vectors
[Figure: an FFT is applied to each short frame of the speech signal, giving one spectrum per frame]
[Figure axes: frequency (Hz) vs. amplitude]
Rotate each spectrum by 90 degrees, so that frequency runs along the vertical axis
• Map the spectral amplitude to a grey level (0-255), where 0 represents black and 255 represents white.
• The higher the amplitude, the darker the corresponding region.
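The grey-level mapping can be sketched as below. This is only an illustration under stated assumptions: the function name `amplitude_to_grey` is hypothetical, and linear min-max scaling is assumed because the slides do not say how amplitudes are scaled (practical tools often scale log amplitudes instead).

```python
import numpy as np

def amplitude_to_grey(amplitude):
    """Map spectral amplitudes to grey levels 0..255 (0 = black, 255 = white).

    Assumption: linear min-max scaling. Higher amplitude gives a darker pixel.
    """
    amplitude = np.asarray(amplitude, dtype=float)
    lo, hi = amplitude.min(), amplitude.max()
    span = hi - lo
    # A flat input has no contrast, so every value is treated as the minimum.
    normalized = (amplitude - lo) / span if span > 0 else np.zeros_like(amplitude)
    # Invert so that the largest amplitude maps to 0 (black).
    grey = np.round(255.0 * (1.0 - normalized))
    return grey.astype(np.uint8)
```

Applying this to every spectral vector and stacking the results side by side would give the grey-scale image described on the slide.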
[Figure axes: frequency (Hz) vs. time]
A time vs. frequency representation of a speech signal is referred to as a spectrogram
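The construction of a spectrogram as a sequence of spectral vectors can be sketched as follows. The frame length, hop size, Hamming window and the function name `spectrogram` are assumptions chosen for illustration (25 ms frames with a 10 ms hop at 16 kHz); the slides do not fix any of these. The signal is assumed to be at least one frame long.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Return magnitude spectra, one row per frame.

    Output shape: (num_frames, frame_len // 2 + 1).
    Assumptions: Hamming window; trailing samples that do not fill a frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and window each one.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    # One FFT per frame gives one spectral vector per frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# Usage: a 1 kHz tone sampled at 16 kHz for one second.
fs = 16000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_hz = spec[0].argmax() * fs / 400  # frequency of the strongest bin in frame 0
```

Rows of `spec` are the spectral vectors; plotting its transpose as an image puts time on one axis and frequency on the other.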
Some Real Spectrograms
Dark regions indicate peaks (formants) in the spectrum
Why we care about spectrograms
Phones and their properties are better observed in a spectrogram
Sounds can be identified much better by their formants and by the transitions of those formants
Hidden Markov Models implicitly model these spectrograms to perform speech recognition
Usefulness of Spectrogram
• Time-frequency representation of the speech signal
• A spectrogram is a tool to study speech sounds (phones)
• Phones and their properties are visually studied by phoneticians
• Hidden Markov Models implicitly model spectrograms for speech-to-text systems
• Useful for evaluation of text-to-speech systems
  – A high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of the natural sentences.
Cepstral Analysis
A Sample Speech Spectrum
[Figure: speech spectrum, magnitude (dB) vs. frequency (Hz)]
• Peaks denote dominant frequency components in the speech signal
• Peaks are referred to as formants
• Formants carry the identity of the sound
What do we want to extract? The spectral envelope
• Formants and a smooth curve connecting them
• This smooth curve is referred to as the spectral envelope
[Figure: spectrum with its spectral envelope, magnitude (dB) vs. frequency (Hz)]
Spectral Envelope
[Figure: the spectrum, log X[k], shown as the sum of the spectral envelope, log H[k], and the spectral details, log E[k]]
log X[k] = log H[k] + log E[k]
1. Our goal: separate the spectral envelope and the spectral details from the spectrum.
2. That is, given log X[k], obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]
How to achieve this separation?
Play a Mathematical Trick
[Figure: spectrum with its spectral envelope and spectral details]
• Trick: take the FFT of the spectrum!
• An FFT applied to a spectrum is referred to as an inverse FFT (IFFT).
• Note: we are dealing with the spectrum in the log domain (part of the trick)
• The IFFT of the log spectrum represents the signal on a pseudo-frequency axis
[Figure: the spectral envelope and the spectral details are each passed through an IFFT onto a pseudo-frequency axis, which has a low-frequency region and a high-frequency region]
Treat the spectral envelope as a sine wave with 4 cycles per second.
Its IFFT gives a peak at 4 Hz on the pseudo-frequency axis, i.e. in the low-frequency region.
Treat the spectral details as a sine wave with 100 cycles per second.
Its IFFT gives a peak at 100 Hz on the pseudo-frequency axis, i.e. in the high-frequency region.
Because the IFFT is linear, the sum in the log spectrum carries over to the pseudo-frequency axis:
log X[k] = log H[k] + log E[k]
Taking the IFFT of both sides gives
x[k] = h[k] + e[k]
In practice, all you have access to is log X[k], and hence you can obtain only x[k]
Once you have x[k], keep only its low-frequency region to get h[k]
• x[k] is referred to as the cepstrum
• h[k] is obtained by considering the low-frequency region of x[k].
• h[k] represents the spectral envelope and is widely used as a feature for speech recognition
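The separation of x[k] into h[k] and e[k] can be sketched as below. This is a minimal illustration, not the author's implementation: the function name `cepstral_envelope`, the cut-off `num_coeffs=30` for the "low region", and the small constant added before the log are all assumptions.

```python
import numpy as np

def cepstral_envelope(frame, num_coeffs=30):
    """Split the log spectrum of one frame into envelope and details.

    Returns (log_X, log_H, log_E) with log_X = log_H + log_E.
    Assumption: the first `num_coeffs` cepstral values form the low region.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Log magnitude spectrum; the tiny offset avoids log(0).
    log_X = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    # Cepstrum x[k]: IFFT of the log spectrum (real because log_X is symmetric).
    x = np.fft.ifft(log_X).real
    # Keep the low region; the cepstrum is symmetric, so keep the mirrored end too.
    keep = np.zeros(n)
    keep[:num_coeffs] = 1.0
    keep[n - num_coeffs + 1:] = 1.0
    h = x * keep           # h[k], the envelope part
    e = x - h              # e[k], the details part
    log_H = np.fft.fft(h).real
    log_E = np.fft.fft(e).real
    return log_X, log_H, log_E

# Usage on a synthetic frame (random noise stands in for speech here).
rng = np.random.default_rng(0)
frame = rng.standard_normal(512) * np.hamming(512)
log_X, log_H, log_E = cepstral_envelope(frame)
```

Plotting `log_H` over `log_X` for a real speech frame should show a smooth curve following the formant peaks, while `log_E` carries the fast ripple.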
Cepstral Analysis
$$X[k] = H[k]\,E[k]$$
$$\|X[k]\| = \|H[k]\|\,\|E[k]\| \qquad (\|\cdot\| \text{ denotes magnitude})$$
Take the log of both sides:
$$\log \|X[k]\| = \log \|H[k]\| + \log \|E[k]\|$$
Taking the inverse FFT of both sides:
$$x[k] = h[k] + e[k]$$
Mel-Frequency Analysis
Review: What we did
• We captured the spectral envelope (the curve connecting all formants)
• BUT: perceptual experiments indicate that the human ear concentrates on certain frequency regions rather than using the whole spectral envelope
[Figure: spectral envelope, magnitude (dB) vs. frequency (Hz)]
Mel-Frequency Analysis
• Mel-frequency analysis of speech is based on human perception experiments
• It is observed that the human ear acts as a filter
  – It concentrates on only certain frequency components
• These filters are non-uniformly spaced on the frequency axis
  – More filters in the low-frequency regions
  – Fewer filters in the high-frequency regions
Mel-Frequency Filters
[Figure: Mel-frequency filter bank, with more filters in the low-frequency region and fewer filters in the high-frequency region]
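A filter bank with this non-uniform spacing can be sketched as follows. The slides do not give a formula, so everything specific here is an assumption: the commonly used mapping mel = 2595 log10(1 + f/700), triangular filter shapes, 26 filters, a 512-point FFT and a 16 kHz sampling rate. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    # Assumption: a commonly used Hz-to-mel formula, not taken from the slides.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(num_filters=26, n_fft=512, sample_rate=16000):
    """Return (bank, centers_hz); bank has shape (num_filters, n_fft // 2 + 1)."""
    # Edges are equally spaced in mel, hence unevenly spaced in Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            num_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((num_filters, len(bin_freqs)))
    for i in range(num_filters):
        left, center, right = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls to right, zero elsewhere.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, hz_edges[1:-1]

# Usage: count filter centers in the lower and upper half of the band.
bank, centers = mel_filterbank()
num_low = int(np.sum(centers < 4000.0))
num_high = int(np.sum(centers >= 4000.0))
```

With these assumed settings most filter centers fall below 4 kHz, which matches the "more filters at low frequencies" picture on the slide.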
Mel-Frequency Cepstral Coefficients (MFCC)
• Spectrum → Mel-Filters → Mel-Spectrum
• Say log X[k] = log (Mel-Spectrum)
• NOW perform cepstral analysis on log X[k]
  – log X[k] = log H[k] + log E[k]
  – Taking the IFFT:
  – x[k] = h[k] + e[k]
• The cepstral coefficients h[k] obtained for the Mel-spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by MFCC
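The steps in this list can be sketched for a single frame as below. One substitution is made and should be read as an assumption: many implementations replace the IFFT with a discrete cosine transform (DCT-II), and this sketch does the same, whereas the slides describe the step as an IFFT. The function name `mfcc_from_spectrum`, the choice of 13 coefficients and the small offset inside the log are also hypothetical.

```python
import numpy as np

def mfcc_from_spectrum(power_spectrum, filterbank, num_ceps=13):
    """Compute low-order cepstral coefficients of the Mel-spectrum.

    power_spectrum: shape (num_bins,)
    filterbank:     shape (num_filters, num_bins), e.g. a Mel filter bank
    """
    # Spectrum -> Mel-Filters -> Mel-Spectrum
    mel_spectrum = np.asarray(filterbank, dtype=float) @ np.asarray(power_spectrum, dtype=float)
    # log X[k]; the tiny offset avoids log(0).
    log_mel = np.log(mel_spectrum + 1e-10)
    num_filters = len(log_mel)
    # Assumption: DCT-II used in place of the IFFT.
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), n + 0.5) / num_filters)
    # Keeping only the first num_ceps coefficients keeps the low region, i.e. h[k].
    return basis @ log_mel
```

Calling this once per frame of a spectrogram would turn the sequence of spectral vectors into a sequence of cepstral vectors.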
[Figure: each frame's FFT spectrum is passed through the Mel-filters and then through cepstral analysis]
Speech signal represented as a sequence of cepstral vectors
[Figure: each frame's spectrum is converted into a cepstral vector]
Why we use MFCC
• Speech synthesis
  – Used for joining two speech segments S1 and S2
  – Represent S1 as a sequence of MFCC
  – Represent S2 as a sequence of MFCC
  – Join at the point where the MFCCs of S1 and S2 have minimal Euclidean distance
• Speech recognition
  – MFCCs are the most widely used features in state-of-the-art speech recognition systems
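The join-point search for speech synthesis can be sketched as follows. The slides do not say which frames are candidates, so searching over every pair of frames is an assumption, as is the function name `best_join_point`.

```python
import numpy as np

def best_join_point(mfcc_s1, mfcc_s2):
    """Return (i, j, distance) for the closest pair of frames.

    mfcc_s1: shape (frames_1, num_ceps), MFCC sequence of segment S1
    mfcc_s2: shape (frames_2, num_ceps), MFCC sequence of segment S2
    Assumption: any frame of S1 may be joined to any frame of S2.
    """
    s1 = np.asarray(mfcc_s1, dtype=float)
    s2 = np.asarray(mfcc_s2, dtype=float)
    # Pairwise Euclidean distances, shape (frames_1, frames_2).
    dists = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
    i, j = np.unravel_index(dists.argmin(), dists.shape)
    return int(i), int(j), float(dists[i, j])
```

A practical system would likely restrict the search to frames near the segment boundaries; that restriction is left out here to keep the sketch short.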
Summary: Process of Feature Extraction
• Speech is analyzed over short analysis windows
• For each short analysis window, a spectrum is obtained using the FFT
• The spectrum is passed through Mel-filters to obtain the Mel-spectrum
• Cepstral analysis is performed on the Mel-spectrum to obtain Mel-Frequency Cepstral Coefficients
• Thus speech is represented as a sequence of cepstral vectors
• It is these cepstral vectors that are given to pattern classifiers for speech recognition
Additional Reading
• Chapter 6
  – Pg: 273 – 281
  – Pg: 304 – 311
  – Pg: 314 – 316