Department of Precision and Microsystems Engineering
Improvements on Time-Frequency Analysis
using Time-Warping and Timbre Techniques
Name: Maarten van der Seijs
Report no: EM 11.018
Coach: dr. ir. D. de Klerk
Professor: prof. dr. D.J. Rixen
Specialisation: Engineering Mechanics
Type of report: Masters Thesis
Date: Delft, June 6, 2011
Abstract
Spectral analysis of non-stationary signals is known to be a challenging task. Classical methods
like the discrete Fourier transform are often inadequate to capture and track periodic content with
rapidly changing frequencies. There are two fundamental reasons for this. On the one hand, the Fourier transform
is intended for expressing frequency content in terms of constant-frequency contributions. On the
other hand, the simultaneous accuracy of temporal and spectral localisation is limited by the time-
frequency uncertainty principle. This thesis lays out the findings of an explorative study towards
potential improvements on time-frequency analysis.
To address the first issue, the concept of time-warping has been explored. By stretching and
contracting pieces of the signal, frequency changes may be "flattened out", resulting in improved
detection of non-stationary frequencies and much sharper spectra than possible with traditional
Fourier analysis. Both linear and non-linear time warping approaches were investigated, together
with the required non-uniform interpolation techniques.
Application of linear time-warping prior to a Fourier transformation leads to the definition of the Fan
chirp transform. This transformation is in essence closely related to the popular short-time Fourier
transform, but provides time-frequency basis functions in a fan-geometry rather than a rectangularly-
tiled grid. The skewed basis functions match the harmonic structure of a non-stationary component
with linearly increasing frequency.
The second issue is addressed by considering periodic signals in their entirety rather than by their
individual partials (or harmonics, overtones). A novel concept is proposed: timbre analysis. The
timbre representation provides means to classify a tonal signal, similar to the way the human ear
(which is in fact a remarkably sophisticated Fourier analyser) perceives and identifies sound. It is
shown that the instantaneous timbre, obtained by normalisation of the harmonic phases, tends to
remain stationary throughout a non-stationary signal.
The timbre representation is used to identify components in polyphonic problems, where the signal is
a mixture of multiple crossing tonal components. In addition, a pitch tracking technique is proposed
that tracks a periodic component based on its timbre. The component can then be isolated and
extracted using Vold-Kalman filtering.
Preface
This thesis is the result of a Master of Science project carried out from October 2010 onwards. It was
fulfilled in the group of Engineering Dynamics, which is part of the Precision and Microsystems
Engineering department at Delft University of Technology.
First, I would like to thank dr. ir. Dennis de Klerk for his enthusiastic and dedicated supervision. As
a true expert in experimental dynamics, he confronted me with a diversity of challenges and never
failed to inspire me.
Second, I greatly acknowledge prof. dr. Daniel Rixen for his support throughout my entire Masters
studies. His readiness to help and ever-constructive suggestions are exemplary. I frankly believe that
due to his involvement with students, many will eventually find the path to Engineering Dynamics.
Finally, I would like to thank my family, friends and house mates for their love, support and reflection
specially written for efficient time-frequency analysis. Only a few functions are included in appendix
A and B. The complete toolbox including the code for the examples is found on a CD-ROM.
Part I
Basic Concepts
Chapter 1
Time-Domain Concepts
This chapter introduces some basic concepts related to signals and signal processing in the time-
domain. First, the discretisation of continuous signals into digital signals is discussed. In section 1.2,
the concepts of periodicity and harmonicity are considered. Section 1.3 concludes with a discussion
of signal modelling.
A thorough discussion of the concepts is found in textbooks on signal processing, for instance [14, 13].
This chapter offers a brief recap of time-domain signal theory that is relevant for this thesis.
1.1 Signals
In a most general formulation, a time-domain signal can be any real-valued quantity that varies in
time. Mathematically, a signal may be written as y = f(t), where the quantity y and the time domain
t are not necessarily bounded. Due to the complexity of most signals, the function f can rarely be
expressed as a simple closed form, but may in some cases be approximated.
1.1.1 Continuous-time signals
By nature, all signals we encounter in the real world are continuous. Nature treats everything with
infinite smoothness; it does not require any type of discretisation, nor does it limit accuracy. Therefore both
quantity and time have an uncountable domain of real values: y, t ∈ R. These signals are called
continuous-time signals and are often referred to as analogue, in contrast to digital.
1.1.2 Discrete-time signals
Discrete-time signals on the other hand have a discretised time domain, meaning that the signal is
sampled at a finite number (N ) of instances tn: t0, t1, . . . , tN−1. The sampling is usually performed
at a constant interval (i.e. tn+1 − tn = Δt), yielding the so-called sample rate fs = 1/Δt in Hz or
ωs = 2π/Δt in rad/s. The discrete-time signal may be represented as a vector y[n] with the signal
values denoted by y[0], y[1], . . . , y[N−1]. Note that these values can still be continuous; y[n] ∈ R.
After sampling, the signal values are merely known at the specified instances tn, which would imply
that everything that happened in between of two adjacent samples is lost. What this really means
for signals in the context of frequency content is discussed later on in chapter 2. An illustration of
discrete-time sampling is shown in figure 1.1.
(a) Continuous-time signal (b) Discrete-time signal
Figure 1.1: Discrete-time signals use a finite number of values at a fixed sample rate.
1.1.3 Digital signals
Digital signals take both a discretised time and value domain. Digital devices like computers use
finite sets of values to approximate and store the values of every sample into a binary format. For this
purpose, the values are rounded or quantised to a fixed set of equally spaced values a1, a2, . . . , aM
(see figure 1.2). The amount M of possible values depends on the chosen resolution. The resolution
is usually specified in terms of a bit depth or word-length Q, related to the total amount of values by
M = 2^Q.
The bit depth is directly related to the dynamic range. Dynamic range is defined as the range
between the smallest and largest possible value in a set. For instance: compact disc audio is
formatted in 16-bit resolution, which has M = 2^16 = 65536 values, providing a dynamic range of
20 log10(65536) = 96.33 ≈ 96 dB. As an approximation, it is often said that every bit increases the
dynamic range by 6 dB.
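The relation between bit depth and dynamic range can be checked with a short computation (a minimal sketch; the function name is illustrative):

```python
import math

def dynamic_range_db(bit_depth: int) -> float:
    """Dynamic range of a Q-bit quantiser with M = 2**Q levels: 20*log10(M) dB."""
    return 20 * math.log10(2 ** bit_depth)

print(round(dynamic_range_db(16), 2))        # 96.33 -> the CD-audio figure above
print(round(dynamic_range_db(16) / 16, 2))   # about 6.02 dB gained per bit
```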
Typical classes of quantised values encountered in signal processing are 8, 16, 32 or 64-bit integers,
single precision (32-bit) or double precision (64-bit) floating point. The latter is used most often in
numerical computing environments.
Figure 1.2: The values of digital signals are quantised to a fixed set of values.
1.2 Periodic signals
A particular interest lies in signals that exhibit a certain amount of periodicity. Periodicity implies
that events occur repeatedly in time, with a constant period T in seconds. That may be the extension
of an oscillating spring, the water level of the ocean due to tidal change or the sound pressure created
by the vibration of a guitar string. Periodic signals often originate from clearly deterministic systems
and are therefore of particular interest to engineers. Noise, in contrast, is typically generated by more
or less stochastic processes. Many signals encountered in practice exhibit both noise and periodic
components.
(a) Periodic signal (b) Noise
Figure 1.3: Periodic signals have repetitive content with period T . Noise is fully stochastic.
1.2.1 Periodicity
Mathematically, a signal or function is said to be periodic if there is a period T that satisfies:

y(t) = y(t + kT)    k ∈ ℕ    (1.1a)

or for discrete-time signals:

y[n] = y[n + kT fs]    k ∈ ℕ, T fs ∈ ℕ    (1.1b)

which already reveals a periodicity problem for T fs ∉ ℕ, see section 2.4.2. However, equations (1.1a)
and (1.1b) are very strict definitions of periodicity and only hold for a few simple functions. In real
life, a signal consists of several components that only partly satisfy the definition.
1.2.2 Frequency and phase
Periodic components can be characterised by a frequency, f = 1/T. As a convention, frequencies expressed
in Hertz take the symbol f, while angular frequencies in radians per second are notated with ω = 2πf.
After one period T, the phase ϕ(t) of the component has advanced 2π radians. The phase refers to the
instantaneous position of a component y at time t and indicates the fraction of the period, in radians,
that has elapsed. For a sine wave, it is simply given by y(t) = sin(ϕ(t)).
For stationary signals with constant frequency f , the phase continues to increase linearly with ϕ(t) =
2πft or ϕ(t) = ωt. The phase may be biased by a constant phase shift θ in radians that determines
the phase for t = 0:
ϕ(t) = 2πft + θ    (1.2)
The concept of frequency is generalised for non-stationary signals by defining the time-dependent
instantaneous frequency (IF) f(t) as the derivative of the phase ϕ(t) with respect to time:
f(t) = (1/2π) · dϕ(t)/dt    (1.3)

Likewise, the phase of a component with time-varying frequency is found by

ϕ(t) = 2π ∫_0^t f(τ) dτ + θ    (1.4a)

or in the discrete case

ϕ[n] = 2π Σ_{m=1}^{n} f[m] Δt + θ    (1.4b)
although it is better to replace the latter summation by a proper numerical integration method. A
complete discussion of instantaneous frequency and phase is found in [3].
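As a numerical illustration of equations (1.3) and (1.4a), the phase of a linear chirp can be obtained by trapezoidal integration of the instantaneous frequency, and the IF recovered by differentiating the phase (a sketch with illustrative parameters, using the proper numerical integration suggested above):

```python
import numpy as np

fs = 1000.0                       # sample rate [Hz]
t = np.arange(0, 1.0, 1 / fs)     # one second of samples
f_inst = 50 + 100 * t             # instantaneous frequency: 50 -> 150 Hz

# phase by trapezoidal integration of the IF (eq. 1.4a with theta = 0)
phi = 2 * np.pi * np.concatenate(
    ([0.0], np.cumsum((f_inst[1:] + f_inst[:-1]) / 2) / fs))
y = np.sin(phi)                   # the resulting non-stationary component

# the IF is recovered as the derivative of the phase (eq. 1.3)
f_rec = np.gradient(phi, t) / (2 * np.pi)
print(np.allclose(f_rec[1:-1], f_inst[1:-1]))  # True (exact for a linear IF)
```

The endpoints are excluded from the comparison because `np.gradient` falls back to one-sided differences there.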
1.2.3 Periodic functions
The most basic but very important periodic functions are the sine and cosine functions (or sinusoids)
with constant frequency f :
x(t) = sin(ϕ(t)) = sin(2πft)    (1.5a)

x(t) = cos(ϕ(t)) = cos(2πft)    (1.5b)
These functions fully satisfy the definition of periodicity. A complex exponential equation is based on
Euler's formula and describes a complex-valued¹ wave x(t) with constant frequency f:

x(t) = e^{i2πft} = cos(2πft) + i sin(2πft)    (1.6a)
Multiplying with the complex amplitude scalar c = a + bi and keeping only the real part, the wave
can be given an amplitude and phase shift:

x(t) = Re(c e^{i2πft}) = a cos(2πft) − b sin(2πft) = A cos(2πft + θ)    (1.6b)
with A = ‖c‖ the absolute amplitude and θ = ∠c the phase shift in radians. An alternative notation
uses the identities cos(ϕ) = ½ (e^{iϕ} + e^{−iϕ}) and sin(ϕ) = 1/(2i) (e^{iϕ} − e^{−iϕ}):

x(t) = c₊ e^{i2πft} + c₋ e^{−i2πft}    (1.6c)

If c₋ is the complex conjugate of c₊, the resulting signal x(t) is real. If so, it follows that c₊ = ½ c
and c₋ = ½ c̄, and consequently A = 2‖c₊‖ = 2‖c₋‖ and θ = ∠c₊ = −∠c₋.
Equations (1.6a)–(1.6c) lie at the very essence of the frequency-domain representation, as discussed in
chapter 2.
1Throughout this text, the quantity i is reserved for the imaginary unit, defined by i2 = −1.
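The equivalence of the three forms in equation (1.6b) is easily verified numerically (illustrative values for c):

```python
import numpy as np

c = 3 + 4j                                   # example complex amplitude c = a + bi
A, theta = abs(c), np.angle(c)               # A = ||c||, theta = phase angle of c

f = 2.0
t = np.linspace(0, 1, 1000, endpoint=False)
x1 = np.real(c * np.exp(1j * 2 * np.pi * f * t))                    # Re(c e^{i2pi f t})
x2 = 3 * np.cos(2 * np.pi * f * t) - 4 * np.sin(2 * np.pi * f * t)  # a cos - b sin
x3 = A * np.cos(2 * np.pi * f * t + theta)                          # A cos(2pi f t + theta)
print(np.allclose(x1, x2) and np.allclose(x1, x3))                  # True
```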
1.2.4 Orthogonality of harmonic waves
An important property of the sinusoids and complex waves is the orthogonality between waves
completing full periods over a common interval. Consider the complex exponential wave s0(t) with base
frequency f0 and non-zero complex amplitude c0, given by

s0(t) = c0 e^{i2πf0t}    (1.7)

Also consider the kth harmonic wave with derived frequency fk = kf0 and complex amplitude ck:

sk(t) = ck e^{i2πkf0t}    k ∈ ℕ    (1.8)

Then the projection of the wave sk(t) onto sl(t) over a full period T0 = 1/f0 writes:

(1/T0) ∫_0^{T0} s̄k(t) sl(t) dt = (1/T0) ∫_0^{T0} c̄k cl e^{i2π(l−k)f0t} dt = (c̄k cl) δkl    k, l ∈ ℕ    (1.9)

using the Kronecker delta notation:

δkl = 1 for k = l,  0 for k ≠ l

The harmonic waves sk and sl for k ≠ l are thus orthogonal over a full period of the base frequency,
regardless of their amplitude and phase shift.
In discrete notation, vector sk is given by:

sk[n] = ck e^{i2πkn/N}    n = 0, . . . , N−1    (1.10)

and sk and sl are orthogonal as well:

(1/N) sk^H sl = (1/N) Σ_{n=0}^{N−1} s̄k[n] sl[n] = (c̄k cl) δkl    k, l ∈ ℕ    (1.11)
where (·)H denotes the complex conjugate or Hermitian vector transpose.
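The orthogonality relation (1.11) can be verified directly (hypothetical amplitudes; note that `np.vdot` conjugates its first argument, matching the Hermitian transpose):

```python
import numpy as np

N = 64
n = np.arange(N)
c1, c3 = 0.8 + 0.3j, -0.5 + 1.0j               # hypothetical complex amplitudes
s1 = c1 * np.exp(1j * 2 * np.pi * 1 * n / N)   # s_1[n]: one full period
s3 = c3 * np.exp(1j * 2 * np.pi * 3 * n / N)   # s_3[n]: three full periods

ip = lambda a, b: np.vdot(a, b) / N            # (1/N) a^H b, as in eq. (1.11)
print(np.allclose(ip(s1, s3), 0))                    # True: orthogonal for k != l
print(np.allclose(ip(s1, s1), np.conj(c1) * c1))     # True: (c1_bar c1) for k = l
```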
1.3 Signal modelling
The sinusoid wave by itself is the most pure periodic signal and is (in the audible frequency range)
perceived by the human ear as a tone with a most “mellow” quality. In reality, such signals are rarely
seen, except for the test signal on some audio devices or the swing of a pendulum clock. Still, to a
good approximation, periodic signals can be reduced to a combination of many sinusoids:

y(t) = Σ_{k=1}^{K} A_k(t) cos(ϕ_k(t))    0 ≤ t < T    (1.12)
This observation is actually the basis for the Fourier series as introduced in section 2.2.
In addition to periodic components, signals may also contain a certain amount of uncorrelated
noise. The following sections discuss means to model an arbitrary signal comprising both periodic
components and noise.
Figure 1.4: A signal can be modelled as a combination of periodic components and noise.
1.3.1 Sinusoids plus noise model
Let us consider an arbitrary signal y(t) for 0 ≤ t < T. Then the signal can be modelled as a
summation of a deterministic periodic part consisting of K sinusoids characterised by A_k(t) and
ϕ_k(t), plus a stochastic part of uncorrelated noise η(t):

y(t) = Σ_{k=1}^{K} A_k(t) cos(ϕ_k(t))  [deterministic]  +  η(t)  [stochastic]    0 ≤ t < T    (1.13)
This way of representing a signal is often referred to as the sinusoids plus noise model [17]. In case
of a stationary signal ys(t), frequencies and amplitudes remain constant and ys(t) can be written as:

ys(t) = Σ_{k=1}^{K} A_k cos(2π f_k t + θ_k) + η(t)    0 ≤ t < T    (1.14)
An example is shown in figure 1.4.
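A stationary sinusoids-plus-noise signal as in equation (1.14) can be synthesised in a few lines (illustrative amplitudes, frequencies and phase shifts):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, T = 1000.0, 1.0
t = np.arange(0, T, 1 / fs)

A = [1.0, 0.5, 0.25]          # amplitudes A_k
fk = [50.0, 120.0, 310.0]     # frequencies f_k [Hz]
th = [0.0, 1.0, -0.5]         # phase shifts theta_k [rad]

# deterministic part: sum of K sinusoids; stochastic part: white noise eta(t)
y = sum(a * np.cos(2 * np.pi * f * t + p) for a, f, p in zip(A, fk, th))
y = y + 0.1 * rng.standard_normal(t.size)
```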
1.3.2 Tonal components plus noise model
If some frequencies f_h^(1) ⊆ f_k can be related to a fundamental frequency f_0^(1) by f_h^(1) = h f_0^(1),
then the waves for h = 1, . . . , H are considered to be harmonic partials of a tonal component with
fundamental frequency f_0^(1). The stationary tonal component y^(1)(t) is then modelled as:

y_s^(1)(t) = Σ_{h=1}^{H} A_h^(1) cos(2π h f_0^(1) t + θ_h^(1))    0 ≤ t < T    (1.15)

A non-stationary tonal component may be described by:

y^(1)(t) = Σ_{h=1}^{H} A_h^(1)(t) cos(ϕ_h^(1)(t))    0 ≤ t < T    (1.16)

Defining the fundamental phase shift θ_0^(1) ≜ 0, it follows that ϕ_0^(1)(0) = 0 and the phase functions of
the partials are given by:

ϕ_h^(1)(t) = h ϕ_0^(1)(t) + θ_h^(1)    0 ≤ t < T    (1.17)
This concept will be used extensively in chapter 4.
A monophonic signal y(t) comprises only one tonal component y^(1)(t) in addition to noise. A
polyphonic signal contains multiple tonal components y^(g)(t) with different fundamental phase
functions ϕ_0^(g)(t). The model for a non-stationary signal comprising g = 1, . . . , G tonal components
plus noise finally writes:

y(t) = Σ_{g=1}^{G} y^(g)(t) + η(t)    0 ≤ t < T    (1.18)
This model is referred to as the tonal components plus noise model.
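The tonal components plus noise model of equations (1.16)–(1.18) can be sketched as follows (illustrative fundamental phases and partial amplitudes):

```python
import numpy as np

fs = 8000.0
t = np.arange(0, 1.0, 1 / fs)

def tonal(phi0, amps, thetas):
    """Tonal component: partials with phases h*phi0(t) + theta_h (eq. 1.17)."""
    return sum(A * np.cos((h + 1) * phi0 + th)
               for h, (A, th) in enumerate(zip(amps, thetas)))

phi0a = 2 * np.pi * (100 * t + 30 * t ** 2)    # fundamental sweeping 100 -> 160 Hz
phi0b = 2 * np.pi * 140 * t                    # stationary fundamental at 140 Hz

y = (tonal(phi0a, [1.0, 0.5, 0.3], [0.0, 0.2, 0.4])              # y^(1)(t)
     + tonal(phi0b, [0.8, 0.4], [0.0, -0.3])                     # y^(2)(t)
     + 0.05 * np.random.default_rng(1).standard_normal(t.size))  # eta(t)
```

The two components have crossing instantaneous frequencies at t = 2/3 s, the polyphonic situation addressed later by the timbre techniques.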
1.4 Summary
Time-domain signals are formulated in either the continuous-time or the discrete-time domain. The
value of the continuous-time signal is known at every instance in time. The values of a discrete-time
signal are solely known at a finite number of instances in time and can be obtained by sampling. A
digital signal additionally requires the values to be quantised to a finite set of values.
Signals usually comprise both periodic components and noise. Periodic components can be characterised
by a frequency, which is the inverse of the period. A stationary component has a constant frequency
and consequently a linearly increasing phase. For a non-stationary component, the instantaneous
frequency is found as the derivative of the phase with respect to time. Essential periodic functions
are the sine and cosine waves, or sinusoids, which may also be formulated using complex exponential
notation. Sinusoids exhibit orthogonality for waves completing full periods over a common interval.
In a good approximation, any time-domain signal may be considered as a deterministic part consisting
of sinusoids plus a stochastic part of uncorrelated noise. If some sinusoids can be related to a
fundamental frequency, the sinusoids are considered to be harmonic partials of a tonal component. A
monophonic signal comprises only one tonal component, while a polyphonic signal contains multiple
tonal components.
Chapter 2
Frequency-Domain Concepts
For many analysis purposes, a time-domain representation of a signal does not offer enough
information. More insight is gained from its frequency domain representation or frequency spectrum,
which can be obtained by means of a Fourier transformation.
This chapter explores the fundamentals of the frequency domain and Fourier analysis. First, the
difference between the time and frequency domain is discussed and illustrated by a basis vector
transformation. In section 2.2, the Fourier series and Fourier transformations are introduced. The
discrete Fourier transform and its properties are discussed in section 2.4. Section 2.5 addresses the
theory and application of windowing. The chapter concludes with a study of the uncertainty principle
in section 2.6 that brings up the fundamental trade-off between time and frequency localisation.
2.1 Time domain versus frequency domain
The frequency domain is the domain in which a function or signal is expressed in terms of frequency
content rather than time content. Formally, a frequency domain representation shows the spectral
distribution of a signal, whereas a time-domain representation shows its temporal distribution.
(a) Time domain (b) Frequency domain
Figure 2.1: The difference between the time and frequency domain representation of a signal. The time domain offers perfect time localisation but no frequency localisation, while the frequency domain offers excellent frequency localisation but lacks time localisation.
The concept is illustrated in figure 2.1 with time on the horizontal axis and frequency on the vertical
axis. Let us assume that all time and frequency content of a signal is contained within the square.
Then the time-domain only offers temporal localisation, while the frequency-domain merely provides
spectral localisation. Nevertheless, both domains can represent exactly the same signal as long as
some conditions are satisfied, as will be discussed in this chapter.
The frequency-domain representation or Fourier transform of a signal y(t) is denoted by ŷ(f), with
f the frequency in Hertz. The transformation y(t) ⇒ ŷ(f) is called a Fourier transform (FT) and is
discussed in section 2.2. For discrete signals, the time-domain sequence y[n] can be transformed to a
frequency-domain spectrum ŷ[k] of equal length by means of a discrete Fourier transform (DFT), as
discussed in section 2.4. First, the difference between the two domains is illustrated by a basis vector
transformation.
2.1.1 Example 1: Basis vector transformation
Let us consider the simple time-domain sequence y[n] as shown in figure 2.2 with only N = 8 points.
In a time-domain representation, the 8 × 1 vector y holds the 8 values for the 8 different instances
n = 0, 1, . . . , 7:

y = [1  2  0  −1  −1.5  −0.5  1.5  1]^T

All 8 values are independent of each other: there exists no other combination of these 8 entries in
y that represents the sampled signal. Mathematically speaking, the space of the ℝ^8 time-domain can
exactly be spanned by K = 8 orthogonal basis vectors e_k that form the basis vector matrix E:
    ⎡ e_0 ⎤   ⎡ 1 0 0 0 0 0 0 0 ⎤
    ⎢ e_1 ⎥   ⎢ 0 1 0 0 0 0 0 0 ⎥
E = ⎢  ⋮  ⎥ = ⎢        ⋱        ⎥ = I
    ⎣ e_7 ⎦   ⎣ 0 0 0 0 0 0 0 1 ⎦
Since E equals the identity matrix, the vectors e_k are also orthonormal and it simply follows that
y = Ey. It is observed that the vectors of E are perfectly independent in terms of time localisation,
but do not offer any information about the frequency content.
Figure 2.2: A simple time-domain sequence of 8 samples.
In the frequency-domain representation, the same sequence y is expressed in another set of 8 basis
vectors ẽ_k, k = 0, . . . , 7. The vectors correspond to the 8 orthogonal complex exponential waves
Figure 2.3: The orthogonal basis vectors of the 8-point frequency domain. The real part is coloured blue, the imaginary part is red. The arrows indicate the conjugate pairs: waves with similar frequencies but opposing imaginary part.
with full periods (see equation (1.10)), counting from k = 0 to 7:

ẽ_k[n] = e^{i2πkn/N}    k, n = 0, 1, . . . , 7    N = 8
The vector values and corresponding waves are shown in figure 2.3 with the real part in blue and the
imaginary part in red.
The first vector ẽ_0 corresponds to a constant level, also referred to as the DC component. Vectors 1 to
4 contain respectively 1 to 4 full periods. Note that vectors 0 and 4 are real-valued.
For vectors 5 to 7, the numbers of periods are expected to be 5 to 7, but their vector values only show
3 to 1 periods. This is in accordance with the Nyquist–Shannon sampling theorem, which states that a
sampled signal can only contain frequency content up to half the sampling frequency. As a result,
waves with frequencies > ½fs will appear as aliased waves with a lower frequency and an opposing
complex part, which yields in case of this example:

ẽ_5 = ẽ_3*,  ẽ_6 = ẽ_2*,  ẽ_7 = ẽ_1*

where * denotes the complex conjugate.
The conjugate pairs are indicated by arrows in figure 2.3. In fact, the aliased vectors correspond to
the negative frequencies of the spectrum, as will be discussed in section 2.4.
Getting back to the basis vectors, a square matrix Ẽ can be constructed from the 8 vectors ẽ_k, similar
to the time domain procedure. Just as the basis vectors e_k, the vectors ẽ_k span an orthogonal space
(see equation (1.11)). In contrast to E, Ẽ contains vectors that are independent in terms of frequency
Figure 2.4: Frequency-domain representation ŷ of the sequence y of figure 2.2.
but are evenly spread over time. With ω = e^{i2π/8} = (1 + i)/√2, the matrix writes:

    ⎡ 1  1   1   1   1   1   1   1  ⎤
    ⎢ 1  ω   ω²  ω³  ω⁴  ω⁵  ω⁶  ω⁷ ⎥
    ⎢ 1  ω²  ω⁴  ω⁶  1   ω²  ω⁴  ω⁶ ⎥
Ẽ = ⎢ 1  ω³  ω⁶  ω   ω⁴  ω⁷  ω²  ω⁵ ⎥
    ⎢ 1  ω⁴  1   ω⁴  1   ω⁴  1   ω⁴ ⎥
    ⎢ 1  ω⁵  ω²  ω⁷  ω⁴  ω   ω⁶  ω³ ⎥
    ⎢ 1  ω⁶  ω⁴  ω²  1   ω⁶  ω⁴  ω² ⎥
    ⎣ 1  ω⁷  ω⁶  ω⁵  ω⁴  ω³  ω²  ω  ⎦

with ω² = i, ω⁴ = −1 and ω⁶ = −i.
While E^H E = I, it appears that Ẽ^H Ẽ = N I with N = 8, meaning that the vectors are orthogonal
but not orthonormal. Nevertheless, Ẽ can be regarded as an orthogonal transformation matrix
that can be used to transform the representation in the frequency-domain by the basis ẽ_k to the
representation in the time-domain by the basis e_k:

y = (1/√N) Ẽ ŷ_n    (2.1a)

ŷ_n = (1/√N) Ẽ^H y    (2.1b)
Vector ŷ_n denotes the normalised Fourier transform of y. However, a more common transformation
writes:

y = Ẽ ŷ    (2.1c)

ŷ = (1/N) Ẽ^H y    (2.1d)

As such, vector ŷ corresponds to the amplitudes of the complex basis waves as present in the signal
y[n].
The values for ŷ are found:

ŷ = [0.31  0.71  −0.25  −0.09  −0.06  −0.09  −0.25  0.71]^T
  + [0.00  0.14  −0.19  −0.23  0.00  0.23  0.19  −0.14]^T i
It can be observed that ŷ[0] is real and that ŷ[1], ŷ[2], ŷ[3] form complex conjugate pairs with ŷ[7], ŷ[6], ŷ[5].
The absolute values |ŷ| are shown in figure 2.4.
It is verified that ŷ^H ŷ = (1/N) y^H y and ŷ_n^H ŷ_n = y^H y, which shows that energy is conserved
throughout the transformation. This property is known as Parseval's identity (see section 2.4.3).
The obtained vector ŷ is exactly the Fourier transform of y, as will be discussed in the following
sections.
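The 8-point example can be reproduced numerically: the basis matrix, its orthogonality and the transformations (2.1c)–(2.1d) all follow in a few lines, and the spectrum matches NumPy's FFT up to the 1/N scaling:

```python
import numpy as np

N = 8
y = np.array([1, 2, 0, -1, -1.5, -0.5, 1.5, 1], dtype=float)

n = np.arange(N)
E = np.exp(1j * 2 * np.pi * np.outer(n, n) / N)   # basis waves e^{i 2pi k n / N}

print(np.allclose(E.conj().T @ E, N * np.eye(N))) # True: orthogonal, not orthonormal
y_hat = E.conj().T @ y / N                        # forward transform, eq. (2.1d)
print(np.allclose(E @ y_hat, y))                  # True: eq. (2.1c) recovers y
print(np.allclose(y_hat, np.fft.fft(y) / N))      # True: matches the scaled FFT
print(np.round(y_hat[0].real, 2))                 # 0.31 -> the DC value listed above
```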
2.2 Fourier series
A decomposition of a periodic signal into its harmonic sinusoidal partials is called a Fourier series.
The concept is named after Joseph Fourier (1768–1830), a French mathematician who discovered
that stationary periodic signals can be expressed as a superposition of sinusoids:

y(t) = Σ_{k=1}^{K} A_k cos(2π f_k t + θ_k)    −∞ < t < ∞    (2.2)
In theory, a Fourier series can describe any periodic signal exactly as long as an infinite number of
partials is allowed.
2.2.1 Trigonometric series
Following the definition of equation (1.6b), the harmonic partials k = 1, . . . , K of a periodic signal
y(t) with fundamental frequency f0 can be found in terms of a_k and b_k by:

a_k = (2/T0) ∫_0^{T0} y(t) cos(2π k f0 t) dt    (2.3a)

b_k = (2/T0) ∫_0^{T0} y(t) sin(2π k f0 t) dt    (2.3b)

The DC offset a0 is determined by:

a0 = (1/T0) ∫_0^{T0} y(t) dt    (2.3c)
The integration is performed over 0 ≤ t < T0, although any full period can be used. The
trigonometric series is easy to interpret, but mathematically inferior to its complex equivalent.
2.2.2 Complex exponential series
The complex exponential Fourier series provides a mathematically more elegant alternative to the
trigonometric series. Considering a signal y(t) with fundamental frequency f0, the complex partials
sections only discuss a few important window functions and ways to quantify their properties. For a
complete study on windowing, refer to [12].
2.5.1 Window application
Let y[n] be an N-point signal and w[n] the window function of equal length. Then the windowed
signal is obtained by element-wise multiplication:

y^(w)[n] = w[n] y[n]    n = 0, . . . , N−1    (2.16)
The time instances t[n] are:

t[n] = (n − N/2)/fs = n/fs − T/2    n = 0, . . . , N−1    (2.17)

Hence, the window length is T = N/fs, the time domain −T/2 ≤ t < T/2 and the window is
symmetric about t[N/2] = 0 s. The DFT of the window is denoted by ŵ[k].
In general, windows have their maximum value at n = N/2 and reduce smoothly to zero towards the
boundaries. The obtained spectrum after windowing, ŷ^(w)[k], is predictable, since it can be seen as a
convolution of the DFTs of y[n] and w[n].
Since one is especially interested in the response of the window to incomplete waves, k should not be
limited to integers (k ∈ ℕ). Instead, the Fourier transform of w[n] is considered as a function of the
continuous frequency f ∈ ℝ:

ŵ(f) = (1/N) Σ_{n=0}^{N−1} w[n] e^{−i2πf n/N}    −½fs < f < ½fs    (2.18)
Equation (2.18) represents the discrete-time Fourier transform (DTFT) of w[n] for the domain of
feasible frequencies (see section 2.3.2). The frequency f corresponds to the number of frequency
bins relative to DC, as will be illustrated below.
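Equation (2.18) can be evaluated directly on any fractional frequency (f counted in bins); for the rectangular window this reproduces the unit coherent gain at DC and the zeros at nonzero integer bins:

```python
import numpy as np

def dtft(w, f):
    """w_hat(f) of eq. (2.18); f counts frequency bins relative to DC."""
    N = len(w)
    n = np.arange(N)
    return np.sum(w * np.exp(-1j * 2 * np.pi * f * n / N)) / N

w = np.ones(64)                        # rectangular window, N = 64
print(abs(dtft(w, 0.0)))               # 1.0 -> coherent gain at DC
print(round(abs(dtft(w, 1.0)), 12))    # 0.0 -> a complete period sums to zero
print(round(abs(dtft(w, 0.5)), 3))     # response exactly between two bins
```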
2.5.2 Window properties
The window properties are introduced using the rectangular window as an example. Figure 2.9 shows
(from left to right) the window w[n] on linear scale, ŵ(f) on linear scale and ŵ(f) on logarithmic
(dB) scale. The performance indicators of the window will be discussed separately.
Figure 2.9: Time and frequency domain representation of the rectangular window.
Coherent gain
Let us have a look at the frequency spectrum of figure 2.9. As already mentioned, the spectrum
of a windowed signal can be seen as the spectrum of the signal convolved with the spectrum of the
window. Let the signal be composed of broad-band noise plus a single sinusoidal component. Then
the coherent gain (CG) is defined as the DC component of the window (f = 0) and determines the
gain of the sinusoid at its true frequency in the spectrum:

CG = (1/N) Σ_{n=0}^{N−1} w[n]    (2.19)
For the rectangular window, CG = 1. However, for most windows CG < 1, meaning that the
amplitude of the principal component is reduced. Often, window functions are normalised to a
processing gain of 0 dB to make sure that the spectral amplitudes correspond to the true amplitudes,
apart from the contributions of the noise.
Equivalent noise power & bandwidth
Unfortunately, the amplitude determined by the coherent gain is biased by the neighbouring
frequency content that is accumulated according to the response of the filter for f ≠ 0. The total noise
power is defined as the integral of the square of the frequency response over the complete frequency
domain −fs/2 < f < fs/2:

NP = ∫_{−fs/2}^{fs/2} |ŵ(f)|² df = (1/N) Σ_{n=0}^{N−1} |w[n]|²    (2.20)
where Parseval’s theorem is used for the latter expression. The equivalent noise bandwidth (ENBW) is
a measure for the width of a hypothetical rectangular “filter” with coherent power CG2, that would
accumulate the same amount of noise power. In other words, it represents the width of a rectangle of
height CG² that has the same area as the area under |ŵ(f)|². It is indicated by a dashed box
in the centre plot in figure 2.9. Using the definitions of CG and NP, it is given by:

ENBW = NP / CG²    (2.21)
The rectangular window has an ENBW of 1. For most windows however, ENBW > 1. Consequently,
the ENBW quantifies the reduction of the achievable spectral resolution compared to the rectangular
window.
Main lobe width & -6dB bandwidth
The main lobe width (MLW) is the width of the centre lobe between the first points where ŵ(f) = 0.
It is a measure for the sharpness or spectral resolution of the DFT: lower values correlate with sharp
spectra, while higher values produce more blurry spectra, making it difficult to distinguish closely
spaced frequencies. The -6dB bandwidth (BW) is a similar measure, but corresponds to the bandwidth
between the points where |ŵ(f)| = 0.5 (−6 dB). Note that both bandwidths are implications of the
window itself and are not related to leakage due to incomplete periods in the signal.
Side lobe level & roll-off rate
The side lobe level (SLL) and the side lobe roll-off (SLR) quantify the amount of leakage. The side lobe
level is the maximum level of the contributions of frequencies that are not part of the main lobe and
should therefore be minimised. The side lobe roll-off is a measure for the asymptotic rate of side lobe
level decrease per frequency bin, usually specified in dB per octave. It is a direct result of the order of
discontinuity on the boundaries:

  0th order:  1/f        −6 dB/oct
  1st order:  1/f²       −12 dB/oct
  2nd order:  1/f³       −18 dB/oct
  pth order:  1/f^(p+1)  −6(p+1) dB/oct
The above relation is a result of the differentiation property of the Fourier transform: every additional
order of continuity adds a factor 1/f to the roll-off rate [18].
2.5.3 Rectangular window
The rectangular or Dirichlet window (figure 2.9) is the most trivial window, as it is often explained as
applying no windowing at all:
w[n] = 1 n = 0, . . . , N−1 (2.22)
The DTFT of the rectangular window is given in closed form by:

ŵ(f) = sin(πf)/(πf) = sinc(f)    (2.23)
The coherent gain and equivalent noise bandwidth are both 1. The main lobe width is 2 bins: except
for f = 0, all integer frequency bins yield zero amplitude. The -6dB bandwidth is only 1.21 bins. The
side lobe level is −13.3 dB and the roll-off rate is of course −6 dB/oct, since the window exhibits a 0th
order discontinuity.
The rectangular window has the lowest possible ENBW, MLW and BW, meaning that it is able to
produce a very sharp DFT. However, due to the high side lobe level and slow roll-off rate, the window
suffers from severe leakage.
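The quoted values can be checked numerically. The following Python/NumPy sketch (the window length N = 64 and the zero-padding factor are arbitrary choices, not from the text) samples the DTFT of the rectangular window on a dense grid of bin-valued frequencies:

```python
import numpy as np

# Dense sampling of the DTFT of a length-N rectangular window by zero-padding.
N = 64
w = np.ones(N)
Npad = 64 * N
W = np.abs(np.fft.fft(w, Npad)) / N          # normalised so that W(0) = 1
f_bins = np.arange(Npad) * N / Npad          # frequency axis in bins

# Integer bins other than 0 fall exactly on the zeros of the spectrum
print(W[f_bins == 1.0])                      # ~0

# Highest side lobe: search outside the main lobe (f > 1 bin)
mask = (f_bins > 1) & (f_bins < N / 2)
sll_db = 20 * np.log10(W[mask].max())
print(round(sll_db, 1))                      # close to the quoted -13.3 dB
```

The main lobe ends at f = 1 bin, so everything beyond it belongs to the side lobes; the largest of these is the first side lobe near f ≈ 1.4 bins.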
2.5.4 Hanning window
The Hanning window (or Hann, named after the Austrian meteorologist Julius von Hann) is perhaps
the most frequently applied window since it offers excellent leakage suppression and has a very
predictable response. The window is given by:
w[n] = 1/2 − 1/2 cos(2πn/N) = cos²(π(n/N − 1/2))    n = 0, . . . , N−1 (2.24)
The window is shown in figure 2.10. The red points represent the window values at the DFT points,
i.e. integer values of f. Clearly, the CG is 0.5 due to the first term in equation (2.24). The cosine term
appears as two points with amplitude 1/4 at f = ±1. The ENBW is 1.5, meaning that the spectrum
is 1.5 times less sharp than that of the rectangular window. The SLL is −31.5 dB and the SLR is −18 dB/oct,
since both the window value and its first derivative are continuous at the boundaries. This comes at
the cost of a higher bandwidth: the MLW is 4 bins and the BW is 2 bins.
The Hanning window offers much better leakage suppression than the rectangular window. In
addition, it has perfect temporal coverage when adjacent windows are observed with a spacing of T/2
in time, as will be discussed in chapter 5.
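A short numerical check (Python/NumPy sketch; N = 32 is an arbitrary choice) confirms the three-bin spectrum, the coherent gain of 0.5 and the ENBW of 1.5:

```python
import numpy as np

# The DFT of the (periodic) Hanning window of eq. (2.24) has only three
# non-zero bins: 0.5 at bin 0 and 0.25 at bins +-1.
N = 32
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)

W = np.fft.fft(w) / N
print(np.round(np.abs(W[:3]), 4))    # bins 0, 1, 2

cg = w.mean()                        # coherent gain
enbw = N * np.sum(w**2) / np.sum(w)**2
print(cg, round(enbw, 4))
```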
Figure 2.10: The Hanning or Hann window, shown in the time domain (linear) and the frequency domain (linear and dB scale).
Figure 2.11: The Gaussian window for σ = 0.2 (blue), σ = 0.1 (green) and σ = 0.05 (red), shown in the time domain (linear) and the frequency domain (linear and dB scale).
2.5.5 Gaussian window
The Gaussian window implements the Gaussian function with standard deviation σ:
wσ[n] = e^(−½ (t[n]/σ)²)    n = 0, . . . , N−1 (2.25)
Time t[n] is defined by (2.17). Due to the absence of the normalisation term 1/(σ√2π), the function
is not the normalised Gaussian or normal distribution with unity area. Instead, it has a peak value
w[N/2] = 1, just like the other windows. The Gaussian function is the only function known in closed
form that transforms to itself:
wσ(f) = σ√(2π) e^(−½ (2πσf)²)    −fs/2 < f < fs/2 (2.26)
The proof is not trivial [1, equation 7.4.6, page 302]. The Gaussian window can thus be “tuned” by the
time-domain standard deviation σt, while the frequency-domain σf follows according to:
σf = 1/(2πσt) (2.27)
Figure 2.11 shows the Gaussian window for σ = 0.2 (blue), σ = 0.1 (green) and σ = 0.05 (red). It can
be verified that both the time-domain and frequency-domain window have the shape of a Gaussian
function. Note that the axes of the logarithmic frequency domain plot extend to ±20 bins.
The Gaussian function only reaches zero at infinity. Therefore the transformation of (2.26) only holds
for a time window of infinite length. In equation (2.25), the function is truncated at ±T/2. The
standard deviation σt determines the amount of discontinuity on the boundaries. Looking at the
logarithmic frequency domain plot, it is observed that w(f) is perfectly quadratic near the centre,
as was expected from equation (2.26). From a certain level, the response starts to show side lobes.
It is observed that the side lobe level increases with the size of the boundary discontinuity w[0] of the
time-domain window, although there is no explicit relation.
Since w(f) is quadratic near the centre on a logarithmic scale, frequency estimates can be obtained by
quadratic interpolation. Also, the Gaussian window minimises the time-frequency product, as will be
discussed in the following section.
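The quadratic interpolation mentioned above can be sketched as follows (Python/NumPy; the sample rate, the window width σ = 0.5 s and the off-bin test frequency are assumed values for illustration, not taken from the text):

```python
import numpy as np

# Because the log-magnitude of a Gaussian-windowed tone is quadratic near its
# peak, an off-bin frequency can be estimated by fitting a parabola through
# the three log-spectrum samples around the peak bin.
fs, N = 100.0, 1000
t = np.arange(N) / fs - N / (2 * fs)     # centred time axis
sigma = 0.5                              # time-domain std dev [s] (assumed)
w = np.exp(-0.5 * (t / sigma) ** 2)

f_true = 10.37                           # deliberately between DFT bins
y = np.cos(2 * np.pi * f_true * t) * w

Y = np.abs(np.fft.rfft(y))
k = np.argmax(Y)
a, b, c = np.log(Y[k - 1]), np.log(Y[k]), np.log(Y[k + 1])
delta = 0.5 * (a - c) / (a - 2 * b + c)  # parabola vertex offset in bins
f_est = (k + delta) * fs / N
print(round(f_est, 3))                   # close to the true 10.37 Hz
```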
2.5.6 Cosine and cosine-sigma window
Recall that the Hanning window can be written as a cosine function to the power of 2 (equation
(2.24)). The Hanning window is therefore part of the family of cos^α windows. The family for non-
negative α can be characterised by a MLW of exactly 2 + α bins and a SLR of −6(α + 1) dB/oct, since
every additional cosine power adds one order of derivative continuity. Furthermore, the following
dependencies were found for α ∈ [0, 10]:

SLL = −13.3 − 7.47α dB
BW = √(1.45 + 1.35α) bins
The window for α = 0 is obviously the rectangular window. It is observed that for increasing α, the
window tends to converge to a Gaussian window. This observation is justified by the mathematical
limit that shows the convergence of a power cosine function to an exponential function:
lim_(N→∞) cos(t/N)^(N²) = e^(−t²/2) (2.28)
In contrast to the Gaussian window, the boundaries of a cosine window are always zero, meaning
that the window has a much higher roll-off rate. The cosine-sigma window is therefore suggested,
combining the properties of the cosα window and the Gaussian window:
wσ,α[n] = { cos(t[n]/(σ√α))^α    for −½πσ√α < t[n] < ½πσ√α
          { 0                    elsewhere
(2.29)
The parameter σ determines the theoretical Gaussian standard deviation for the case α = ∞. The
parameter α controls the exponent of the cosine function and thereby the validity of equation (2.28).
Note that the window for α = 2 and σ = 1/(π√2) ≈ 0.225 is exactly the Hanning window.
Figure 2.12 shows a Gaussian window for σ = 0.1 (blue) and the cosine-sigma window with σ = 0.1
and α = 16 (green). On linear scale, the windows appear almost the same. On logarithmic scale, it
can be observed that the cosine-sigma window is slightly more concentrated. The measured standard
deviation is 0.097, which is close to the expected σ = 0.1. It is verified that the cosine-sigma window
converges to the Gaussian window for α ≫ 100.
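As a check of the stated Hanning equivalence, the following Python/NumPy sketch implements equation (2.29) and compares it with cos²(πt) for α = 2 and σ = 1/(π√2):

```python
import numpy as np

def cosine_sigma(t, sigma, alpha):
    # Cosine-sigma window of eq. (2.29); support |t| < pi*sigma*sqrt(alpha)/2.
    half = 0.5 * np.pi * sigma * np.sqrt(alpha)
    w = np.zeros_like(t)
    m = np.abs(t) < half
    w[m] = np.cos(t[m] / (sigma * np.sqrt(alpha))) ** alpha
    return w

N = 200
t = (np.arange(N) - N / 2) / N               # window time axis on [-1/2, 1/2)
w_cs = cosine_sigma(t, 1 / (np.pi * np.sqrt(2)), 2)
w_hann = np.cos(np.pi * t) ** 2              # Hanning written as cos^2
print(np.max(np.abs(w_cs - w_hann)))         # ~0: identical windows
```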
Figure 2.12: The Gaussian window for σ = 0.1 (blue) and the cosine-sigma window for σ = 0.1 and α = 16 (green). The cosine-sigma window has a better roll-off rate than the Gaussian window.
2.5.7 Example 4: Windowing of a simple signal
To illustrate the difference between window functions, a simple signal is considered:
y[n] = cos(2πf1 t[n]) + cos(2πf2 t[n]) + 0.01 cos(2πf3 t[n]) + 0.01 η[n]    n = 0, . . . , N−1 (2.30)
The time instances are t[n] = n/fs with fs = 100Hz. The length of the sequence is N = 100, which
corresponds to T = 1 s.
The first periodic component at f1 = 10Hz is fully periodic in the window. The second component
at f2 = 25.5Hz exhibits a discontinuity and will cause leakage. A third component at f3 = 40Hz is
periodic in the window but has a 100× smaller amplitude. Furthermore, the signal is corrupted by
white noise represented by a random vector 0.01η[n] ∈ [−0.01, 0.01].
The signal is multiplied with four different window functions and the DFT is computed. The
amplitudes of the single-sided spectra are shown in figure 2.13. The four window functions are:
1. Rectangular window. The 10 Hz component is represented by a single peak. The 25.5 Hz com-
ponent however causes a severe amount of leakage, completely masking the third component
at 40 Hz. Clearly, the rectangular window is a bad choice for signals with a so-called high dynamic
range.
2. Hanning window. The Hanning window is able to reveal all periodic components, although it
may be difficult to determine the exact frequencies from the spectrum. The amplitudes of the
peaks are a factor 0.5 (−6 dB) lower than the true amplitudes; the remaining energy is spread
over the neighbouring frequency bins. Also, the window offers enough leakage suppression to
reveal the noise floor at approximately −80 dB.
3. Gaussian window with σ = 0.2. This window yields results similar to the Hanning window.
All periodic components appear as peaks with a quadratic shape on the dB scale, regardless of
being periodic in the window or not.
Figure 2.13: Four different window functions applied to a simple signal: (a) rectangular window, (b) Hanning window, (c) Gaussian window σ = 0.2, (d) Gaussian window σ = 0.1. Single-sided amplitude spectra in dB over 0–50 Hz.
4. Gaussian window with σ = 0.1. The window offers a very poor frequency resolution, but
excellent leakage and noise suppression. If the aim is to estimate the spectral location of
the periodic components, this window may still be a good choice, since the frequencies can be
estimated accurately using quadratic interpolation.
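The masking effect can be reproduced in a few lines (Python/NumPy sketch of example 4; the noise term is omitted here for clarity):

```python
import numpy as np

# The weak 40 Hz tone is masked by leakage from the non-periodic 25.5 Hz tone
# under a rectangular window, but recovered by a Hanning window. Spectra are
# normalised to single-sided amplitudes.
fs = N = 100
n = np.arange(N)
t = n / fs
y = (np.cos(2*np.pi*10*t) + np.cos(2*np.pi*25.5*t)
     + 0.01*np.cos(2*np.pi*40*t))

hann = 0.5 - 0.5*np.cos(2*np.pi*n/N)
Y_rect = np.abs(np.fft.rfft(y)) * 2 / N
Y_hann = np.abs(np.fft.rfft(y*hann)) * 2 / np.sum(hann)

# Here bin k corresponds to k*fs/N = k Hz
print(Y_rect[40], Y_hann[40])    # leakage-dominated vs ~0.01
print(Y_rect[33], Y_hann[33])    # pure leakage bins
```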
2.6 Time-frequency uncertainty principle
Throughout the chapter, it has become clear that temporal accuracy is inversely related to spectral
accuracy. Recall from example 1 that the time-domain basis vectors have excellent time localisation
but give no frequency information, while the Fourier basis vectors offer perfect frequency localisation
without time information. Frequency localisation means the ability to clearly identify periodic
components that are concentrated at particular frequencies [16].
In effect, by applying a non-rectangular window to a signal, one centres the observation at t = 0
and accepts that events “further away” from this centre are attenuated more than events close to the
centre. Thereby, windowing introduces a certain amount of temporal localisation.
To conclude the chapter, this fundamental trade-off is formalised as the time-frequency uncertainty
principle.
2.6.1 Temporal and spectral localisation
Let us once again consider a window w[n] of length N with time t[n] given by (2.17). Using the
definition for the total noise power (2.20), the temporal and spectral centres can be found by:
μt = 1/(NP) Σ_{n=0}^{N−1} t[n] |w[n]|² (2.31a)

μf = 1/(NP) ∫_{−fs/2}^{fs/2} f |w(f)|² df (2.31b)
Next, define the second central moments of the temporal and spectral distribution:
σt² = 1/(NP) Σ_{n=0}^{N−1} (t[n] − μt)² |w[n]|² (2.32a)

σf² = 1/(NP) ∫_{−fs/2}^{fs/2} (f − μf)² |w(f)|² df (2.32b)
Note that σt² and σf² are the variances of the squared magnitudes of respectively w[n] and w(f).
If the window is well-localised in time, then the signal is concentrated at time instance μt and σt² is
small. Similarly, if the window is well-localised in frequency, then the spectrum is centred around μf
and σf² is small.
It can be verified that σt² and σf² are invariant under time and frequency shifts, by the definition of the
temporal and spectral centres (2.31a) and (2.31b). The relation for time scaling by a factor a reads:
σt(a t[n]) = 1/|a| · σt(t[n])    a ∈ ℝ (2.33a)

σf(a t[n]) = |a| · σf(t[n])    a ∈ ℝ (2.33b)
This means that a decrease of the window length in time increases the spectral width proportionally.
2.6.2 Time-frequency product
The dimensionless time-frequency product is given by:
U = σtσω = 2π σtσf (2.34)
Obviously, U is invariant under time and frequency shifts. By equation (2.33a) and (2.33b), the product
is also invariant under time scaling. The product can therefore be interpreted as a measure of how
well the window is localised in both time and frequency: a low value of U correlates with a good
localisation.
2.6.3 Uncertainty principle
The time-frequency uncertainty principle states that no window or waveform can have a time-frequency
product less than 1/2, frequency being expressed in radians:

U = 2π σt σf ≥ 1/2 (2.35)
The proof follows from the Cauchy–Schwarz inequality. It is related to Heisenberg's uncertainty
principle in quantum physics, which states that one cannot simultaneously determine the position
and momentum of a particle to an arbitrary degree of accuracy. Similarly, for time-frequency analysis,
the simultaneous accuracy of time and frequency localisation is limited by (2.35).
2.6.4 Example 5: Time-frequency product of four windows
The time-frequency products of the four windows of example 4 are determined. As stated above, the
lower limit of U is 0.5. The windows are shown in figures 2.9, 2.10 and 2.11.
1. Rectangular window. The second central moment of the window is exactly √(1/12) = 0.29 s.
The second central moment of the frequency spectrum is 8.38 Hz. The time-frequency product
is 15.2, which is extremely high compared to the lower limit.
2. Hanning window. σt = 0.14 s and σf = 0.57 Hz. The time-frequency product is 0.513, which
is close to the minimum.
3. Gaussian window with σt = 0.2. The second central moments are σt = 0.14 s and σf =
0.83Hz. The time-frequency product is 0.742. Although the window is a Gaussian, the lower
limit of 0.5 is not reached since the Gaussian is truncated, which causes a large discontinuity.
4. Gaussian window with σt = 0.1. The second central moments are σt = 0.07 s and
σf = 1.13Hz. The time-frequency product is 0.500. The window achieves the lower limit
for uncertainty. Compared to the Gaussian window with σt = 0.2, the truncation causes a
negligible discontinuity.
Concluding, the Hanning window appears to be a good all-round window. For specific purposes, the
Gaussian window or cosine-sigma may be preferred, especially if one needs control over temporal
and spectral localisation.
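The listed products can be reproduced numerically (Python/NumPy sketch; the sample rate and the FFT padding factor are arbitrary choices). The temporal moments follow (2.31a) and (2.32a) directly; the spectral moment is approximated by sampling |w(f)|² on a dense zero-padded grid:

```python
import numpy as np

def tf_product(w, fs):
    # Time-frequency product U = 2*pi*sigma_t*sigma_f of eq. (2.34).
    N = len(w)
    t = (np.arange(N) - N / 2) / fs
    P = np.sum(np.abs(w) ** 2)
    mu_t = np.sum(t * np.abs(w) ** 2) / P
    var_t = np.sum((t - mu_t) ** 2 * np.abs(w) ** 2) / P

    pad = 64 * N                              # dense sampling of |w(f)|^2
    W2 = np.abs(np.fft.fftshift(np.fft.fft(w, pad))) ** 2
    f = np.fft.fftshift(np.fft.fftfreq(pad, 1 / fs))
    mu_f = np.sum(f * W2) / np.sum(W2)
    var_f = np.sum((f - mu_f) ** 2 * W2) / np.sum(W2)
    return 2 * np.pi * np.sqrt(var_t * var_f)

fs = N = 1000                                 # window on T = 1 s
t = (np.arange(N) - N / 2) / fs
U_hann = tf_product(np.cos(np.pi * t) ** 2, fs)
U_gauss = tf_product(np.exp(-0.5 * (t / 0.1) ** 2), fs)
print(round(U_hann, 3), round(U_gauss, 3))    # ~0.513 and ~0.500
```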
2.7 Summary
A frequency-domain representation provides spectral information of a signal or function, in contrast
to the time-domain representation that only provides temporal information. The study of signals in
terms of frequency content is called a Fourier analysis. The fundamental concept is the observation
that periodic signals can be decomposed into harmonic sinusoids, the so-called Fourier series.
The continuous-time and discrete-time Fourier transform generalise this concept to non-periodic
signals and provide a continuous frequency domain. The discrete Fourier transform only examines
the frequencies that have complete periods over the signal length. As these waves are orthogonal, the
DFT functions as a linear transformation.
The relationship between the four Fourier transformations and their domains is depicted schematically
in figure 2.14. The continuous / discrete time domain is distinguished horizontally; the
continuous / discrete frequency domain vertically. Note that the Fourier series only fits in this scheme
if the window length T equals the fundamental period T0. In that case, the complex series ck are
directly related to y[k].
Figure 2.14: The four Fourier transformations and their domains: the CTFT (ℝ ⇒ ℝ) relating y(t) to y(f), the DTFT (ℕ ⇒ ℝ) relating y[n] to y(f), the Fourier series FS (ℝ ⇒ ℤ) relating y(t) to y[k], and the DFT (ℕ ⇒ ℕ) relating y[n] to y[k].
Finite-length signals usually exhibit waves that are not fully periodic over the signal length. This
causes discontinuities on the boundaries, resulting in spectral leakage. Leakage can be reduced by
applying a window function. Generally, a window function suppresses the spectral leakage, at the
cost of a reduced spectral resolution. The window function should be chosen with care, depending on
the analysis purpose.
For every finite-length signal, one can determine the localisation accuracy in time and frequency.
By choosing the window function and length, both properties can be controlled. However, the
simultaneous temporal and spectral accuracy is limited by the time-frequency uncertainty principle.
The Gaussian window is the only window that potentially reaches the minimum time-frequency
product. The trade-off between time and frequency accuracy is a fundamental issue for time-
frequency analysis.
Part II
Advanced concepts
Chapter 3
Time Warping
Time warping is a technique that applies stretching and contraction of a signal in the time domain. It
involves a modification using a non-linear time function that realises frequency scaling. The concept
of a non-linear time may sound odd, but has some well-known equivalents in real life:
– A disk-jockey manually controlling the speed of a vinyl disc on a turn-table
– The observed frequency shift of the siren signal of a passing emergency vehicle (Doppler effect)
– The vibrato or chorus effect of a guitar, realised by a series of electronic delay circuits
In all cases, frequency scaling occurs due to a change in the way time is conceived. This conceived
time or warped time is denoted by t̃ and can be expressed as t̃ = φα(t) or t̃ = ψα(t), some non-linear
functions of t. The time-warped signal then reads ỹ(t̃) = y(t).
Note that the term frequency scaling is used rather than frequency modulation or shift. Frequency
modulation is achieved by multiplication with a complex wave. If this wave has frequency fm, the
frequency content of the modulated signal is shifted entirely by −fm or +fm, depending on the
sign in the exponent. This modulation is the basis of the Fourier transformations, as discussed in
chapter 2. In contrast, time warping scales all frequencies by a constant rate, keeping the ratios
between harmonics intact.
The concept of time warping is often applied for analysis of non-stationary chirp signals1, i.e. signals
with increasing or decreasing frequency. It will be shown that time warping can be used effectively
as pre-processing operation for analysis of a more general class of non-stationary signals. It thereby
concentrates the energy of the partials on centre frequencies while preserving the harmonic structure.
3.1 Linear time warping
As an introduction to the time warping topic, the following sections discuss the theory of linear
time warping in the continuous time domain. The formulation of the linear time warping function is
adopted from [21], where it was mentioned as part of the fan Chirp transform (see chapter 6). This
section discusses linear time warping as an independent operation.
1The term chirp is adopted from the sound of chirping birds.
3.1.1 Warp function
Following the notation of [21] and [5], the linear time warping function2 is defined as:
φα(t) = (1 + ½αt) t (3.1)
and the time derivative:
φ′α(t) = 1 + αt (3.2)
Note that the time warp function is a quadratic function of t for α ≠ 0, and simply the linear time
function for α = 0.
3.1.2 Chirp wave
Let us consider a linear chirp wave y(t) with mean frequency fc, subject to the warping function of
equation (3.1). The instantaneous frequency is defined as:
f(t) = (1 + αt) fc = φ′α(t) fc (3.3a)

and the wave with constant amplitude A reads:

y(t) = A cos(2π ∫(1 + αt) fc dt) = A cos(2πfc φα(t)) (3.3b)
fc is the instantaneous frequency at t = 0. Note that f(−1/α) = 0, meaning that the instantaneous
frequency has a focal point at t = −1/α, regardless of the value of fc. Beyond the focal point,
frequencies become negative, which in this case implies reversal of time. Since this is not the aim of
warping, the time domain is limited to −1/α < t < 1/α. In this interval, fc is the mean frequency.
3.1.3 Chirp rate
The chirp rate α is defined as the frequency increase relative to the mean frequency fc and reads:

α = f′(t)/fc (3.4)

This definition yields a constant chirp rate only if f′(t) is constant. For linear chirps this is the case,
as is easily verified from equation (3.3a):

α = (φ′α(t) fc)′ / fc = φ′′α(t) = α

For higher order chirp waves, the time derivative writes:

α(t) = φ′′(t) = α + O(t) (3.5)
indicating that the chirp rate is not constant. For linear time warping, the higher order terms must be
neglected. Non-linear time warping is discussed in section 3.2.
²Note that the warp function denoted by φα(t) is not related to the phase function ϕ(t), although the Greek letter (“phi”) may appear the same.
3.1.4 Inverse warp function
Linear chirp signals that satisfy f(t) = φ′α(t) fc can be transformed back to stationary signals by
applying inverse time warping. In order to do so, the inverse expression φα⁻¹(t) of the warp function
must be found. This is not trivial, since φα(t) is a quadratic expression that may have two solutions.
For the time domain −1/α < t < 1/α, it is verified that:

0 < φ′α(t) < 2    ∀α (3.6)

Since φ′α(t) has no sign change, φα(t) is a strictly monotonically increasing function. Consequently,
the inverse has only one solution:

φα⁻¹(t) = (−1 + √(1 + 2αt))/α    for α ≠ 0
φα⁻¹(t) = t    for α = 0
(3.7)
From here on, the inverse warp function is denoted by the symbol ψα(t) = φα⁻¹(t).
Note that √(1 + 2αt) becomes imaginary for t < −1/2α. Also, the time derivative or warped-time
rate reads:

ψ′α(t) = 1 / √(1 + 2αt) (3.8)

The warped-time rate reaches infinity at t = −1/2α, implying that the warped time would have to
run infinitely fast and then becomes imaginary. The applicable time domain is therefore reduced to:

−1/2α < t < 1/2α (3.9)
which is referred to as the time support property of the warp function [21]. Figure 3.1 shows the
warp function (green) and inverse warp function (red) for −1 < t < 1 and α = 0.5, the maximum
allowable warp rate for this interval. It can be observed that the derivative of the inverse warp
function reaches infinity at t = −1 = −1/2α.
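That ψα is indeed the unique inverse can be verified numerically (Python/NumPy sketch; α = 0.5 and the test grid are arbitrary choices):

```python
import numpy as np

# Check that psi_alpha of eq. (3.7) inverts the warp function phi_alpha of
# eq. (3.1) inside the time support -1/(2*alpha) < t < 1/(2*alpha).
def phi(t, a):
    return (1 + 0.5 * a * t) * t                  # eq. (3.1)

def psi(t, a):
    return (-1 + np.sqrt(1 + 2 * a * t)) / a      # eq. (3.7), alpha != 0

a = 0.5
t = np.linspace(-0.99 / (2 * a), 0.99 / (2 * a), 1001)
err = np.max(np.abs(psi(phi(t, a), a) - t))
print(err < 1e-10)                                # True: psi inverts phi
```

Note that 1 + 2α φα(t) = (1 + αt)², so the square root in ψα(φα(t)) stays real on the whole monotonic domain.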
3.1.5 Inverse time warping
A time-warped version ỹ(t) of the linear chirp signal y(t) can now be obtained using the inverse warp
function:

ỹ(t) = y(ψα(t)) = A cos(2πfc φα(ψα(t))) (3.10)
Development of the interior time-dependent term yields:

φα(ψα(t)) = −1/α + √(1 + 2αt)/α + (α/2)·[1/α² + (1 + 2αt)/α² − 2√(1 + 2αt)/α²] = t (3.11)
which shows that the linear chirp signal y(t) after time warping has become a stationary signal
with constant frequency fc:

ỹ(t) = A cos(2πfc t)    −1/2α < t < 1/2α (3.12)
Figure 3.1: The warp function φα(t) and the inverse warp function ψα(t).
It can be concluded that any linear chirp signal can be transformed into a constant-frequency signal
by applying linear time-warping, as long as the time support −1/2α < t < 1/2α is satisfied. This is
illustrated by the following example.
3.1.6 Example 6: Linear warp
An example is shown in figure 3.2 for both the time and frequency domain. The chirp signal is
composed from a cosine wave plus a sine wave with half the amplitude and twice the frequency:

f(t) = 4 + t
ϕ(t) = 2π (4t + ½t²)
y(t) = ½ cos(ϕ(t)) + ¼ sin(2ϕ(t))

The mean frequency is fc = 4 Hz; the chirp rate is α = 0.25. The time interval of interest is
t = (−1, 1), which lies safely in the supported time domain. The warped time interval is found from
(3.7):

t̃ = ψα(t) = (−4 + 2√2, −4 + 2√6) = (−1.1716, 0.8990)
The original signal y(t) is shown in green. The warped signal ỹ(t̃) = y(t) is shown in red.
Clearly, ỹ now has a constant frequency and will appear as two distinct pulses in a Fourier
representation (provided that leakage is absent by proper choice of the window length, see section
2.5). This will be illustrated in example 5.
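For the first component of the example, the de-chirping can be verified numerically (Python/NumPy sketch; the sample rate fs = 256 Hz is an arbitrary choice). No interpolation is needed here because the analytical y(t) can be evaluated at arbitrary times:

```python
import numpy as np

# Evaluating the chirp y(t) = cos(phi(t)) at the inversely warped instances
# psi_alpha(t) must give exactly the stationary tone of eq. (3.12).
fs, a, fc = 256.0, 0.25, 4.0
t = np.arange(-1, 1, 1 / fs)
y = lambda tt: np.cos(2 * np.pi * (4 * tt + 0.5 * tt**2))   # cos(phi(t))

psi = (-1 + np.sqrt(1 + 2 * a * t)) / a      # inverse warp, eq. (3.7)
y_warp = y(psi)

err = np.max(np.abs(y_warp - np.cos(2 * np.pi * fc * t)))
print(err < 1e-10)                           # True: exactly de-chirped

# The de-chirped tone is fully periodic over T = 2 s, so its DFT is a single
# sharp line at 4 Hz (bin 8).
Y = np.abs(np.fft.rfft(y_warp))
print(np.argmax(Y) * fs / len(t))            # 4.0
```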
Figure 3.2: A warped signal in time and frequency domain.
3.2 Non-linear time warping
The previous discussion merely applied to linear time warping functions with constant chirp rate α.
It was seen that the inverse function of φα(t) only exists if φα(t) is strictly monotonically increasing,
i.e. φ′α(t) > 0. However, this requirement does not limit the applicability to linear warp functions.
Consider the instantaneous frequency f(t) > 0 on the time domain −T < t < T. Let us define
fc = f(0). According to (3.3a), the time derivative of the warp function writes:

φ′(t) = f(t)/fc (3.13)

such that the non-linear warp function can be found by integration:

φ(t) = ∫ φ′(t) dt (3.14)
A “quick and dirty” inverse warp function can be found by time-integration of the inverse of the warp
rate (3.13):

ψ(t) ≈ ∫ 1/φ′(t) dt = ∫ f(0)/f(t) dt (3.15)
The basic idea is that the inverse warp rate ψ′(t) at some point t should be roughly 1/φ′(t) to stretch
out frequency modulations in f(t). An approximation of the inverse warp function is then found by
time-integration, which can be performed either numerically or symbolically. Note that it may be
necessary to bias ψ(t) by some value, to make sure that ψ(0) = 0.
Equation (3.15) applied to the linear chirp function yields:

ψ(t) ≈ ∫ 1/(1 + αt) dt = log(1 + αt)/α
For small chirp rates, (3.15) yields a good approximation of the exact inverse warp function (3.7).
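The approximation quality can be illustrated numerically (Python/NumPy sketch; the chirp rate α = 0.05 is an assumed small value):

```python
import numpy as np

# Approximate the inverse warp by numerically integrating the inverse warp
# rate f(0)/f(t) for a linear chirp, and compare with the closed-form
# log(1 + a*t)/a and the exact inverse of eq. (3.7).
a = 0.05
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]

g = 1.0 / (1.0 + a * t)                       # inverse warp rate f(0)/f(t)
psi_num = np.concatenate(([0.0], np.cumsum((g[1:] + g[:-1]) / 2) * dt))
psi_log = np.log(1.0 + a * t) / a             # symbolic integration
psi_exact = (-1.0 + np.sqrt(1.0 + 2.0 * a * t)) / a   # exact, eq. (3.7)

print(np.max(np.abs(psi_num - psi_log)))      # tiny: quadrature error only
print(np.max(np.abs(psi_log - psi_exact)))    # small for small chirp rates
```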
3.3 Discrete implementation & interpolation
The example above describes an analytical signal in a rather ideal situation, since the signal value at
every time point ψα(t) is available from the function. As discrete signals are sampled at a finite number
of instances t[n], n = 1, . . . , N, the instances ψα(t[n]) generally do not correspond to existing samples.
Hence, non-uniform interpolation is required to obtain the warped signal at the instances t[n]. The
quality of the interpolation process is crucial: a badly performed interpolation increases the spectral
leakage dramatically. The issue of interpolation is addressed in the following sections.
3.3.1 Interpolation approaches
Two interpolation approaches are distinguished:
1. Upsampling followed by cheap interpolation
2. Direct, expensive interpolation
Method 1 first upsamples the signal by a factor r to increase the number of time points. Next, a
numerically cheap interpolation algorithm is employed to obtain the samples for t[n]. Cheap methods
include nearest neighbour (0th order) and linear (1st order) interpolation, or polynomial methods of
low degree, e.g. cubic spline interpolation. Since resampling can be performed quickly, this method
does not necessarily take more time than the second approach. Reference [5] suggests upsampling by
a factor 2 and linear interpolation.
Method 2 skips the resampling step and performs the interpolation right away. To obtain similar
or better results than the first approach, a numerically more expensive interpolation method should
be employed. Theoretically, sinc interpolation achieves the best result: according to the Nyquist-
Shannon sampling theorem, a band-limited signal can be recovered exactly from its samples by:
y(t) = Σ_{n=1}^{N} y[n] sinc(fs(t − t[n])) (3.16)
The sinc function is defined as:

sinc x = sin(πx) / (πx)
For every point t, the complete sequence is evaluated and weighted by the sinc function. This
implementation is computationally far too expensive and is thus seldom used. Luckily, similar results
can be obtained using higher order spline interpolation [15].
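Equation (3.16) translates directly into code (Python/NumPy sketch; NumPy's `sinc` already follows the normalised definition sin(πx)/(πx)):

```python
import numpy as np

# Whittaker-Shannon sinc interpolation at one arbitrary (non-uniform) query
# point. Every sample contributes, so the cost per point is O(N).
fs = 64.0
n = np.arange(256)
t_n = n / fs
y_n = np.sin(2 * np.pi * 5.0 * t_n)          # band-limited (5 Hz << fs/2)

def sinc_interp(t, t_n, y_n, fs):
    return np.sum(y_n * np.sinc(fs * (t - t_n)))

# At an existing sample, the sinc weights reduce to a Kronecker delta:
print(sinc_interp(t_n[100], t_n, y_n, fs) - y_n[100])   # ~0

# At an interior off-grid point, the truncated series is accurate:
t_q = 1.2345
print(sinc_interp(t_q, t_n, y_n, fs), np.sin(2 * np.pi * 5.0 * t_q))
```

For query points near the signal boundaries, the truncation of the infinite sum degrades the accuracy; the error is smallest in the interior of the record.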
3.3.2 Spline interpolation
The B-spline or basis spline interpolation method generalises a large class of interpolation techniques.
The interpolation algorithm for order q writes:
y(t) = Σ_{n=1}^{N} c[n] βq(d[n]) (3.17a)

d[n] = fs(t − t[n]) (3.17b)
Figure 3.3: Spline basis functions of order 0 to 8, obtained by convolution of the 0th order function.
d[n] is the sample deviation of t to every point t[n]; for instance, a value of 0.5 corresponds to a time
instance exactly in the middle of two existing time points in t[n]. βq(d) is the basis function of order
q as a function of the deviation d. The basis functions for 0th to 8th order are shown in figure 3.3. They
are the result of repeated convolution of the 0th order basis function:
βq(d) = ∫_{−∞}^{∞} βq−1(δ) β0(d − δ) dδ (3.18a)

with

β0(d) = { 1 for |d| ≤ 0.5
        { 0 for |d| > 0.5
(3.18b)
The coefficients c[n] are found by solving the linear system of N equations:

y[n] = Σ_{m=1}^{N} c[m] βq(fs(t[n] − t[m]))    n = 1, . . . , N (3.19)
Looking at the basis functions, it can be observed that the support width of the function βq is limited
to q + 1, which is a direct result of the convolution. Therefore both (3.17a) and (3.19) can be performed
with sparse linear arithmetic [15].
It may be observed that spline interpolation of order q = 0, 1, 2, 3 is equivalent to respectively
nearest neighbour, linear, quadratic and cubic interpolation.
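The construction of figure 3.3 can be reproduced by discretising the convolution (3.18a) (Python/NumPy sketch; the resolution is an arbitrary choice):

```python
import numpy as np

# Build the B-spline basis functions beta_q by repeated numerical convolution
# of the 0th-order box function. Each convolution widens the support by one
# unit of deviation and preserves the unit area.
res = 1000                                    # samples per unit of deviation d
beta = [np.ones(res)]                         # beta_0: box of height 1, |d| <= 0.5
for q in range(1, 4):
    beta.append(np.convolve(beta[-1], beta[0]) / res)   # eq. (3.18a)

for q, b in enumerate(beta):
    width = len(b) / res                      # support width -> q + 1
    area = b.sum() / res                      # -> 1
    print(q, round(width, 2), round(area, 6))
```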
3.3.3 Interpolation performance
To test the accuracy of the interpolation methods, four different test signals yref[n], −2 < t[n] < 2,
are considered at a sample rate of fs = 1024 Hz:
Figure 3.4: Interpolation errors for (a) a cosine wave with f0 = 10 Hz, (b) a cosine wave with f0 = 100 Hz, (c) a block wave with f0 = 10 Hz and (d) white noise band-limited to fs/4 = 256 Hz. Blue, green and red correspond to respectively nearest neighbour, linear and spline interpolation. The x-axis represents the resampling ratio r for the nearest neighbour and linear interpolation methods and the interpolation order q for the spline interpolation.
1. Cosine wave with f0 = 10Hz
2. Cosine wave with f0 = 100Hz
3. Block wave with f0 = 10Hz
4. White noise band-limited to fs/4 = 256Hz
All signals are first warped using equation (3.1) to yw1[n] and then warped back to yw2[n] according
to (3.7). The chirp rate is chosen as α = 0.25 for all signals. For evaluation of yw2[n], the time domain
−1 ≤ t < 1 s is considered. The error is determined as the norm of the difference between the original
signal yref[n] and the twice-warped signal yw2[n]:
εt = ‖yw2[n] − yref[n]‖ / ‖yref[n]‖

A similar error εf is determined for the spectra; the latter are obtained from the DFT of the time signals.
The results are shown in figure 3.4. Blue, green and red correspond to respectively nearest neighbour,
linear and spline interpolation. For the nearest neighbour and linear interpolation methods, the x-axis
indicates the ratio of resampling r. For the spline interpolation, the x-axis corresponds to the order
of the spline algorithm q.
Some observations:
– The nearest neighbour interpolation performs worst, followed by the linear interpolation. The
spline interpolation with q > 3 outperforms both methods. However, the spline method takes
considerably more time to compute, since it first needs to solve the system of equations (3.19)
and then compute the desired points by (3.17a).
– The nearest neighbour interpolation errors decrease with approximately 1/r. The linear
interpolation errors decrease in some cases with approximately 1/r2 and in some cases remain
equal.
– The spline interpolation errors for q = 1 equal the linear interpolation errors. This confirms
that the methods are the same for q = 1 without resampling. The quadratic spline interpolation
with q = 2 appears to be a bad choice.
– It was observed that the interpolation methods that use resampling (nearest neighbour and
linear interpolation) suffer from a small time shift. This might explain the high error compared
to spline interpolation.
– All methods have difficulties with warping of the block wave. The block wave can be considered
a worst-case signal, as it comprises large discontinuities that are difficult to fit with any of the
interpolation methods. Also, since the frequency content of the block wave extends to fs/2,
aliasing occurs due to warping. Neither resampling nor the use of higher order spline methods
yields any improvement.
– The band-limited noise signal can be warped to reasonable accuracy. If the noise were not band-
limited, the results would be almost as bad as for the block wave.
Generally, it can be concluded that spline interpolation is a good choice for a wide range of signals. A
basic implementation of multi-order spline interpolation is included in MATLAB's Curve Fitting Toolbox™.
It is observed that the normalised phase functions reduce to the constant normalised phase shifts:
ϕ_k^n(t) = θ_k^n.
4.1.2 Complex normalisation
The normalisation steps can be performed efficiently using the complex notation of (4.1). Multiplication
of ck with a complex scalar r^p is equivalent to scaling with ‖r‖^p and rotation by p∠r. If r has
unit length, only the rotation is effective. This property can be used for normalisation of the partials:

c_k^n = (ck/‖c1‖) (c1/‖c1‖)^(−k) (4.6)
Figure 4.1: Three phase functions φ1(t), φ2(t), φ3(t) (continuous lines) together with their normalised phase functions φ1n(t), φ2n(t), φ3n(t) (dashed lines). [Plot: phase in degrees versus time in seconds.]
The first term normalises the amplitude, the second term rotates with angle −k∠c1. Depending on
the purpose, one can choose to leave out the amplitude normalisation and only normalise the phases:
c_k^\theta = c_k \left( \frac{c_1}{\|c_1\|} \right)^{-k} \qquad (4.7)
4.1.3 Timbre vector
The normalised partials from equation (4.6) can be combined into a K-dimensional complex timbre
vector:
c^n = \begin{pmatrix} c_1^n \\ c_2^n \\ \vdots \\ c_K^n \end{pmatrix} \quad \text{or} \quad c^\theta = \begin{pmatrix} c_1^\theta \\ c_2^\theta \\ \vdots \\ c_K^\theta \end{pmatrix} \qquad (4.8)
In the ideal case of stationary partials, this vector remains constant and uniquely defines the timbre
of the tonal component.
4.2 Instantaneous timbre
The definition of timbre (4.5) applies to harmonically stationary signals. In other words, (4.5) only
holds when the frequencies of the partials are kept at an exact ratio throughout the time interval. It
will not come as a surprise that this concerns a rather hypothetical situation.
Consider for example the harmonics of a vibrating piano string. If the dynamics of the string
were fully governed by linear dynamics, one would find the harmonics at exact multiples of the
fundamental frequency. Due to their mutual orthogonality, there would be no interaction between
the harmonics, hence the timbre would remain stationary throughout a free vibration.
However, few dynamic systems behave perfectly linearly. The string, for example, may exhibit a
certain amount of inharmonicity: a small discrepancy between the actual harmonic frequencies and
their ideally expected values. This can for instance be caused by bending stiffness or inelasticity of
the material; effects that cannot be fully described by linear dynamics. As a result, the timbre is no
longer exactly stationary.
Thus, for real-life signals, it can be interesting to see to what extent the timbre remains stationary
over time. This can be studied on the basis of the instantaneous timbre (IT). For an arbitrary
non-stationary signal, the instantaneous timbre at time t can be interpreted as a “cross-section” of the
analytical signal for a certain instant in time. That is: the instantaneous amplitude and phase per
partial, normalised to the fundamental partial.
4.2.1 Definition
Consider a non-stationary signal y(t) consisting of a tonal component with instantaneous fundamental frequency f_0(t). Then the instantaneous amplitude and phase of partial k are determined by:
c_k(t) = \frac{2}{T} \int_{-T/2}^{T/2} w(\tau)\, y(t + \tau)\, e^{-i 2\pi k f_0 \tau}\, d\tau \qquad k = 1, \dots, K \qquad (4.9)
Equation (4.9) is in fact a Fourier transform for a single frequency component f_k = k f_0, centred
around time instance t. The factor 2 compensates for the absence of the negative frequencies. T is the
length of the window and w(τ) is a window function that is symmetric about τ = 0 (see section 2.5). The
complex exponential is the modulator that shifts all frequency content of y(t) by −k f_0.
Using equation (4.6) and (4.8), the instantaneous timbre reads:
c^n(t) = \begin{pmatrix} c_1^n(t) \\ c_2^n(t) \\ \vdots \\ c_K^n(t) \end{pmatrix} \quad \text{or} \quad c^\theta(t) = \begin{pmatrix} c_1^\theta(t) \\ c_2^\theta(t) \\ \vdots \\ c_K^\theta(t) \end{pmatrix} \qquad (4.10)
4.2.2 Discrete implementation
The discrete-time implementation of (4.9) uses the DTFT to compute the partials. Consider the signal
y[n] of length N with corresponding time vector t[n] = n/f_s, n = 0, …, N−1. In order to obtain the
instantaneous timbre for a certain instant in time t_b, a smaller block y_b[m] of size M < N is analysed.
Block y_b[m] has length T_b = M/f_s and is obtained from y[n] by:

y_b[m] = y[m - M/2 + n_b] \qquad m = 0, \dots, M-1
n_b = t_b f_s, \quad n_b \in n \qquad (4.11)
with nb the sample index corresponding to the time instance tb of interest.
Figure 4.2: Timbre of a stationary trumpet note, normalised to the phase of the fundamental. [Two panels: magnitude and phase in degrees of partials f1–f6 against time in seconds.]
The K partials for f_0 at the particular point in time t_b are then determined by:

c_k(t_b) = \frac{2}{M} \sum_{m=0}^{M-1} w[m]\, y_b[m]\, e^{-i 2\pi k f_0 m / f_s} \qquad k = 1, \dots, K \qquad (4.12)
and the timbre vector c^n(t_b) or c^θ(t_b) is constructed using (4.10). Equation (4.12) can be evaluated
efficiently by a small piece of C code, which can be found in appendix A.1.
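As a concrete illustration of (4.12) and the phase normalisation (4.7), the following Python sketch estimates three partials of a synthetic harmonic signal and normalises their phases to the fundamental. This is an illustrative reimplementation with invented parameters, not the C code of appendix A.1; a Hanning window is used here.

```python
import numpy as np

def instantaneous_timbre(y, fs, f0, K, nb, M):
    """Single-frequency DTFTs, eq. (4.12), for K partials of a block of
    size M centred at sample index nb, with a Hanning window."""
    yb = y[nb - M // 2 : nb - M // 2 + M]      # block extraction, eq. (4.11)
    w = np.hanning(M)
    m = np.arange(M)
    c = np.array([2.0 / M * np.sum(w * yb * np.exp(-2j * np.pi * k * f0 * m / fs))
                  for k in range(1, K + 1)])
    # phase normalisation, eq. (4.7): rotate partial k by -k * angle(c1)
    c_theta = c * (c[0] / abs(c[0])) ** -np.arange(1, K + 1)
    return c, c_theta

# synthetic tonal component: 3 exactly harmonic partials with known phases
fs, f0 = 8000, 200.0
t = np.arange(4000) / fs
phases = [0.7, 1.7, 2.4]                       # phase of partial k = 1, 2, 3
y = sum(np.cos(2 * np.pi * (k + 1) * f0 * t + phases[k]) / (k + 1)
        for k in range(3))

c, c_theta = instantaneous_timbre(y, fs, f0, K=3, nb=2000, M=2000)
# Hanning coherent gain 0.5: |c_k| is roughly 0.5/k; the normalised phase
# of partial 1 is zero and that of partial k is phi_k - k*phi_1
```

Note that the normalised phases recover the constant shifts θ_k = φ_k − kφ_1 (here 0.3 rad for partials 2 and 3), independently of the fundamental's own phase.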
4.2.3 Example 8: Trumpet timbre
The timbre of a stationary trumpet note is determined from an f_s = 16000 Hz recording of a B♭4 with
fundamental frequency f_0 = 466 Hz. Six partials are estimated using (4.12). A Gaussian window is
used with length M = 4000, equal to T_b = 0.25 s.
Figure 4.2 shows the timbre cθ(t) of the six partials. The normalised phase of the first partial is zero
throughout the interval, as expected from the definition. The other partials are quite stable throughout
the stationary part of the signal. From t = 1.8 s, the amplitudes drop and the phases change. This is
understandable, since the signal is no longer stationary.
4.2.4 Bandwidth considerations
Example 8 concerns a monophonic signal, i.e. a signal with only one tonal component. For polyphonic
signals or signals with disturbances, the bandwidth of the timbre estimation by equation (4.12) can be
an important issue, as will be discussed next.
Figure 4.3: Timbre of a disturbed signal for M = f_s/4. The bandwidth is too high to neglect the disturbance. [Two panels: magnitude and phase in degrees of partials f1–f4 against time in seconds.]
Consider the following signal sampled at f_s = 1024 Hz with a tonal component at f_0 = 10 Hz, built
up from 4 partials:

y[n] = \sum_{k=1}^{4} \frac{1}{k} \cos\!\left( 2\pi k f_0 \frac{n}{f_s} \right) \qquad n = 0, \dots, N-1
Hence, the frequencies present in the signal are 10, 20, 30, 40 Hz. Let the signal be disturbed by an
enharmonic wave at f_d = 22 Hz:

y_d[n] = y[n] + \cos\!\left( 2\pi f_d \frac{n}{f_s} \right) \qquad n = 0, \dots, N-1
First, the timbre of the tonal component at f_0 is estimated using a window size of M = 256,
T_b = 0.25 s and a Hanning window. The results are shown in figure 4.3; note that the amplitudes
are attenuated by a factor 1/2 due to the coherent gain of the Hanning window (see section 2.5.4).
Clearly, the disturbance has influence on the second partial. It follows from DFT theory that the
frequency spacing between two orthogonal waves is f_s/M = 1/T_b = 4 Hz. Hence, the spectral
resolution or bandwidth is 4 Hz. In addition, the Hanning window has a −6 dB bandwidth of 2 bins,
thereby increasing the effective bandwidth to 8 Hz.
In fact, the algorithm “feels” the two waves at 20 Hz and 22 Hz as one wave at 20 Hz with some
amplitude modulation. From a continuous-time approximation, partial c_2(t) reads:

c_2(t) = \int_{T_b} \left( \tfrac{1}{2} \cos(2\pi\, 20\, t) + \cos(2\pi\, 22\, t) \right) e^{-i 2\pi\, 20 t}\, dt = a + b \cos(2\pi\, 2\, t)

The scalars a and b follow from the window characteristics. Indeed, the wave at 22 Hz is enharmonic
to the fundamental, which explains the mismatch in the normalised phase (equation 4.5). The
Figure 4.4: Timbre of a disturbed signal for M = f_s. The bandwidth is small enough to suppress the disturbance. [Two panels: magnitude and phase in degrees of partials f1–f4 against time in seconds.]
normalised phase seems to travel 720 degrees per second upwards, suggesting that the frequency
is +2Hz off. The amplitude modulation confirms this observation.
A complete suppression of the disturbance requires an effective bandwidth of 2 Hz. This is achieved
by increasing the block size toM = 1024 samples, or Tb = 1 s. The newly obtained timbre is shown in
figure 4.4. The disturbance is completely suppressed and the partials appear with the correct amplitude
and phase.
Summarising: the disturbance of figure 4.3 was removed by increasing the window size and thereby
decreasing the effective bandwidth. In this perspective, the single-frequency Fourier transform of (4.9)
can be considered as a band-pass filter around f = k f_0 with a certain effective bandwidth, realised
by the spectral resolution of the window 1/T_b and the −6 dB bandwidth. Note that an increase of
the spectral resolution always compromises the temporal localisation, by the uncertainty principle
(section 2.6).
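This bandwidth reasoning can be checked numerically. The sketch below (illustrative, mirroring the 22 Hz example above with the same made-up parameters) measures the leakage of the disturbance into the estimate of the second partial by comparing the single-frequency DTFT of the clean and the disturbed signal, for both window sizes:

```python
import numpy as np

fs, f0, fd = 1024, 10.0, 22.0
n = np.arange(4 * fs)
y = sum(np.cos(2 * np.pi * k * f0 * n / fs) / k for k in range(1, 5))
yd = y + np.cos(2 * np.pi * fd * n / fs)           # enharmonic disturbance

def partial2(x, nb, M):
    """Hanning-windowed single-frequency DTFT, eq. (4.12), of partial 2."""
    xb = x[nb - M // 2 : nb - M // 2 + M]
    m = np.arange(M)
    return 2.0 / M * np.sum(np.hanning(M) * xb
                            * np.exp(-2j * np.pi * 2 * f0 * m / fs))

nb = 2 * fs                                        # estimate at tb = 2 s
err = {M: abs(partial2(yd, nb, M) - partial2(y, nb, M)) for M in (256, 1024)}
# M = 256  (Tb = 0.25 s, 8 Hz effective bandwidth): large leakage from 22 Hz
# M = 1024 (Tb = 1 s,    2 Hz effective bandwidth): disturbance suppressed
```

Because the difference of the two estimates isolates the disturbance exactly, `err` is a direct measure of its leakage into c_2.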
4.2.5 Example 9: Helicopter timbre
A 5 second microphone recording of a distant helicopter is analysed. The signal is sampled at
fs = 4096Hz. It is known that the main rotor has fundamental frequency f0 = 29.4Hz. The
timbre is first analysed using window size M = 512, Tb = 0.125 s. Again, the Hanning window is
applied, which causes the effective bandwidth to be 16Hz.
The result is shown in figure 4.5. The third partial, estimated at 3f0 = 88.2Hz exhibits amplitude
modulation, suggesting the presence of another wave. From the figure, the amplitude modulation is
approximated at 8.5Hz. Furthermore, the phase seems to be running downwards. This suggests the
Figure 4.5: Timbre of a helicopter for M = 512. The amplitude modulation on the third partial suggests the presence of another wave with similar frequency. [Two panels: magnitude and phase in degrees of partials f1–f4 against time in seconds.]
Figure 4.6: Timbre of a helicopter for M = 1024. The disturbing source at 80 Hz is suppressed. [Two panels: magnitude and phase in degrees of partials f1–f4 against time in seconds.]
presence of a disturbing wave at approximately 80Hz. This is verified from a DFT; the 80Hz wave is
in fact the fundamental frequency of the tail rotor.
To suppress the disturbing wave from the timbre, the effective bandwidth (including the 2 bins
window bandwidth) is decreased to 8Hz by increasing the window size to M = 1024, Tb = 0.25 s.
The new timbre plot is shown in figure 4.6. The amplitude modulation is suppressed and the phase
has become stable.
4.3 Warped timbre
Let us get back to the discrete-time definition of instantaneous timbre (4.12). The equation accepts
blocks y_b[m] of a time signal y[n] and returns the instantaneous timbre centred around time instances
t_b. However, any block of correct size M may be inserted into the equation, including blocks obtained
by time-warping. Let y{t} denote a continuous-time interpolant function that was obtained by
interpolation of y[n]. Then the warped signal block can be found by:

y_b[m] = y\{\, \tilde{t}[m] + t_b \,\} \qquad (4.13)

\tilde{t}[m] represents the warped local time vector centred around 0, as defined in section 3.3; t_b is the
block centre time instance. The required time-warp, interpolation, windowing and DTFT operations
are combined in an implementation which can be found in appendix B.4. The function returns the
instantaneous timbre for a given set of time instances, frequencies and linear chirp rates.
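The block extraction of (4.13) can be sketched as follows. This is an illustrative fragment, not the appendix B.4 code: it uses linear interpolation (`np.interp`) as the continuous-time interpolant y{t}, assumes a linear warp with unit slope at τ = 0, and the quadratic test signal and all parameters are invented.

```python
import numpy as np

fs, M = 1000, 400
n = np.arange(4 * fs)
y = (n / fs) ** 2                          # known analytic signal y(t) = t^2

tb = 2.0                                   # block centre time
alpha = 0.4                                # linear chirp rate
tau = (np.arange(M) - M // 2) / fs         # local time, centred around 0
# warped local time vector: inverse of the warp phi_a(t) = t (1 + a t / 2)
t_warp = (np.sqrt(1 + 2 * alpha * tau) - 1) / alpha
# eq. (4.13): sample the interpolant at the warped instants around tb
yb = np.interp(t_warp + tb, n / fs, y)
# the warped block should match the analytic value (t~[m] + tb)^2
err = np.max(np.abs(yb - (t_warp + tb) ** 2))
```

The resulting block `yb` can be fed directly into the timbre estimate (4.12), after which windowing and the single-frequency DTFTs proceed as before.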
4.4 Summary
The timbre of a tonal component is in this work defined as the amplitude and phase of the harmonic
partials, relative to the fundamental partial. It is observed that the normalised phase function of
a partial reduces to a constant phase shift, as long as the partials remain exactly harmonic. The
timbre, formed by the normalised DTFT coefficients of the partials, characterises a tonal component
independently from fundamental frequency and total amplitude.
The assumption of exact harmonicity of the partials is grounded in theory from linear dynamics.
As most real-life dynamic systems are not perfectly linear (or even highly non-linear), it can be
interesting to observe the development of the timbre over time. The instantaneous timbre makes
use of the DTFT, which brings up time-frequency considerations as discussed in chapter 2. For a
good estimate of the timbre, one must make sure that the effective bandwidth (the result of spectral
resolution and window bandwidth) is small enough to suppress contributions of enharmonic waves.
The timbre may also be determined from time-warped blocks. An implementation of the required
time-warp, interpolation, windowing and DTFT operations is found in appendix B.4.
Timbre provides a useful and intuitive approach to representing a periodic signal that bears a much
closer resemblance to the human perception of sound.
Part III
Short-time Spectral Analysis
Chapter 5
Short-Time Fourier Transform
The Fourier analysis techniques discussed in the previous chapters were predominantly applied to
signals in their entirety. This method is useful if the signal is short or if the signal is reasonably
stationary throughout the time domain. In practice, most signals obtained from experiments are
non-stationary and can be minutes long. It would be impractical to analyse these signals as a whole.
Besides, one is often interested in how the signal changes over time.
Short-time analysis divides a longer signal into many shorter blocks that usually have some overlap
in time. Every block is centred at a certain point in time and can be analysed using standard Fourier
techniques as discussed in chapter 2.
The short-time Fourier transform (STFT) (or formally: short-time discrete Fourier transform) is
perhaps the most popular and generally applicable method for short-time spectral analysis. This
chapter will discuss the basic aspects and time-frequency considerations.
5.1 Short-time blocks
Consider an arbitrary N-point signal y[n] sampled at f_s. The time domain of the signal is 0 ≤ t < T
with total duration T = N/f_s. It was already seen that an N-point DFT offers excellent frequency
information, but does not give any temporal information.
Therefore, signal y[n] is subdivided into B blocks y_b[m]. Similar to n, m is the sample index of the
signal blocks: m = 0, …, M−1. The blocks are counted b = 1, …, B and have a smaller size
M < N, which corresponds to length T_b = M/f_s.

y_b[m] = y[m - M/2 + n_b] \qquad m = 0, \dots, M-1
n_b = t_b f_s, \quad n_b \in n \qquad (5.1)
The blocks are centred in time at the instances t_b. The shift between two adjacent blocks, i.e. t_{b+1} − t_b,
is called the shift length T_l. Similarly, the shift size is L = T_l f_s. To make sure that no samples of y[n]
are skipped, L < M and consequently T_l < T_b.

The first block b_1 lies at t_1 = \tfrac{1}{2} T_b. The time instances for the other blocks b are given by:

t_b = \tfrac{1}{2} T_b + (b - 1) T_l \qquad b = 1, \dots, B \qquad (5.2a)
Figure 5.1: The short-time Fourier transform divides a signal sequence into many shorter blocks of equal length. [Schematic: signal y[n] of length T, blocks y_b[n] of length T_b centred at t_b, shifted by T_l.]
The block centre index is given by n_b:

n_b = \tfrac{1}{2} M + (b - 1) L \qquad b = 1, \dots, B \qquad (5.2b)
The short-time blocks are shown schematically in figure 5.1. The symbols involved in equations (5.1),
(5.2a) and (5.2b) are listed in table 5.1.
symbol   domain        name
y[n]     R             signal vector
t[n]     [0, T⟩        time vector
f_s      R             sample rate
n        0, …, N−1     signal sample index
N        ℕ             signal size
T        R             signal length
y_b[m]   R             block vector
m        0, …, M−1     block sample index
M        ℕ, < N        block size
T_b      R, < T        block length
b        1, …, B       block number
B        ℕ             total number of blocks
t_b      [0, T⟩        block centre time
n_b      0, …, N−1     block centre sample index
L        ℕ, < M        shift size
T_l      R, < T_b      shift length
5.1.1 Shift size & overlap
Figure 5.1 shows a certain amount of overlap between two blocks. The overlap is determined by the
block size and the shift size: M/L. Theoretically, the shift size can be anything from 1 sample to the
block size. A too low number for L results in a large number of blocks and consequently many DFT
computations. For instance, if L is much smaller than M , the DFTs of successive blocks will not be
much different. If however L is too large, some events in the signal may not be detected.
When a non-rectangular window is applied, an overlap of at least 2× is necessary to make sure every
sample of y[n] is contained in the spectrum. In particular, for the Hanning window an overlap of
exactly 2× ensures that every sample in y[n] is covered equally over the successive blocks.
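The bookkeeping of equations (5.1), (5.2a) and (5.2b) can be sketched directly. This is an illustrative Python fragment; the helper name and the parameters are made up:

```python
import numpy as np

def short_time_blocks(y, fs, M, L):
    """Divide y[n] into B blocks of size M, shifted by L samples
    (eqs. 5.1, 5.2a, 5.2b)."""
    N = len(y)
    B = (N - M) // L + 1                    # number of full blocks that fit
    nb = M // 2 + np.arange(B) * L          # block centre indices, eq. (5.2b)
    tb = nb / fs                            # block centre times,   eq. (5.2a)
    blocks = np.stack([y[n - M // 2 : n - M // 2 + M] for n in nb])
    return blocks, tb

fs, M, L = 1024, 256, 128                   # 2x overlap: M / L = 2
y = np.arange(4 * fs, dtype=float)
blocks, tb = short_time_blocks(y, fs, M, L)
# first block centred at tb[0] = (M/2)/fs = Tb/2, adjacent centres Tl = L/fs apart
```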
5.2 Short-time DFT
The distribution of y[n] over short-time blocks yields B blocks y_b[m] of size M. The DFTs of the
blocks can be obtained by the fast Fourier transform (section 2.4.4), which performs the following
transformation:

y_b[k] = \frac{1}{M} \sum_{m=0}^{M-1} y_b^{(w)}[m]\, e^{-i 2\pi k m / M} \qquad k = 0, \dots, M-1, \quad b = 1, \dots, B \qquad (5.3)

The blocks y_b^{(w)}[m] are windowed by a window function w[m] according to (2.16).
After FFT computation, an array y_b[k] is obtained that consists of B × M complex elements. B is the
number of DFT blocks for the time instances t_b. M is the number of frequencies f_k that follow from
(2.10b). The frequency resolution is determined by the block size and the sample rate:

\Delta f = \frac{f_s}{M} \qquad (5.4)
A popular representation of y_b[k] is the waterfall plot or spectrogram, which shows the amplitudes of
the single-sided spectrum as colours on a 2D time-frequency grid. Examples of the spectrogram will be
shown in the following sections.
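Putting (5.1) and (5.3) together gives a minimal STFT. The sketch below is illustrative (NumPy's FFT, a Hanning window, invented test parameters); it checks that a stationary tone appears in the same frequency bin of every block:

```python
import numpy as np

def stft(y, fs, M, L):
    """Windowed DFTs of overlapping blocks, eq. (5.3); rows are blocks b."""
    w = np.hanning(M)
    B = (len(y) - M) // L + 1
    Y = np.stack([np.fft.fft(w * y[b * L : b * L + M]) / M for b in range(B)])
    fk = np.arange(M) * fs / M              # DFT frequencies of each block
    return Y, fk

fs, M, L = 1024, 256, 128
t = np.arange(4 * fs) / fs
y = np.sin(2 * np.pi * 64.0 * t)            # stationary 64 Hz test tone
Y, fk = stft(y, fs, M, L)
# every block should show its single-sided spectral peak in the 64 Hz bin
peaks = fk[np.argmax(np.abs(Y[:, : M // 2]), axis=1)]
```

Plotting `np.abs(Y)` against `fk` and the block centre times would give exactly the spectrogram representation described above.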
5.3 Time-frequency considerations
Although (5.1) — (5.3) involve many different symbols, only three independent choices are left for the
spectral representation:
1. Block size M. The block length follows from T_b = M/f_s.
2. Shift size L. The shift length follows from T_l = L/f_s.
3. Window type.
The meaning of these properties for the spectrogram is discussed on the basis of an STFT of a fly-by
helicopter, sampled at f_s = 4096 Hz.
5.3.1 Spectral/temporal resolution
Figure 5.2(a) shows the STFT for block size M = f_s/1 = 4096 and shift size L = M/2 = 2048. The
overlap ratio is M/L = 2×. Figure 5.2(b) has a 4× smaller block size: M = f_s/4 = 1024. The shift
size is reduced to L = M/2 = 512, keeping the same overlap ratio of 2×. Both STFTs use a Hanning
window.
Clearly, the first STFT yields more spectral detail, whereas the second STFT has better temporal
localisation.
[Figure 5.2, six spectrogram panels (frequency [Hz] versus time [s], magnitude in dB):
(a) STFT for Δf = 1 Hz and T_l = 0.5 s
(b) STFT for Δf = 4 Hz and T_l = 0.125 s
(c) STFT for Δf = 1 Hz and T_l = 0.125 s
(d) STFT for (b) with 3× zero-padding
(e) STFT for (c) using a rectangular window
(f) STFT for (c) using a Gaussian window, σ = 0.1 s]
Figure 5.2: Six different STFTs for the same signal. Figures (a) and (b) show the effect of a different spectral/temporal resolution. Figure (c) increases the overlap ratio. Figure (d) applies 3× zero-padding to (b). Figures (e) and (f) use different window functions for the settings of (c).
5.3.2 Overlap
An overlap ratio of 2× ensures that every point of the original signal is contained in the STFT.
Following this reasoning, a higher ratio will undoubtedly bring up some redundancy in the STFT.
Nevertheless, a higher ratio can yield a better time-frequency localisation. Figure 5.2(c) shows the
STFT for M = f_s/1 = 4096 and L = M/8 = 512; the overlap ratio is 8×. Compared to figure
5.2(a), the STFT is better localised in time.
5.3.3 Zero-padding
Zero-padding is an elegant trick to reach a somewhat higher frequency resolution while keeping
the same block size. By padding a block of M samples with (for instance) 3M zeros before it is
transformed by the FFT, the spectral resolution becomes 4 times higher. Although no new information
is added to the block, the increased resolution can make it easier to recognise closely spaced frequency
peaks. The effect of 3× zero-padding is shown in figure 5.2(d).
5.3.4 Windowing
The quality of an STFT heavily depends on the chosen window function (section 2.5). Two STFTs are
shown as a comparison with figure 5.2(c). Figure 5.2(e) uses a rectangular window, resulting in sharp
frequency lines but a high level of leakage. Figure 5.2(f) uses a Gaussian window with σ = 0.1 s,
resulting in more “blurry” frequency lines.
Time and frequency localisation was formalised in section 2.6.1 by the temporal and spectral second
moments: σt and σf . These values objectively quantify the performance of the window:
figure    window        σ_t       σ_f
5.2(c)    Hanning       0.14 s    0.58 Hz
5.2(e)    Rectangular   0.29 s    8.38 Hz
5.2(f)    Gaussian      0.07 s    1.13 Hz
Looking at figures 5.2(c), 5.2(e) and 5.2(f), it is verified that the Gaussian window has the best temporal
localisation of the three windows. Also, the Hanning window yields the best spectral localisation.
5.4 Summary
The short-time Fourier transform is a popular method for short-time spectral analysis. The STFT
divides a signal into shorter segments and applies the DFT to the individual blocks. The obtained
spectrum can be visualised by a waterfall diagram, with time on one axis and frequency on the other.
The temporal and spectral localisation can be controlled by proper choice of the block size, shift size
and window type. As a last resort, zero-padding can be applied to increase the number of frequency
points. Yet, the STFT is subject to the time-frequency uncertainty principle, limiting the simultaneous
temporal and spectral localisation.
Chapter 6
Fan Chirp Transform
Many real-life signals are non-stationary by nature. Examples are countless and include recordings
or measurements of speech, music, passing vehicles, engines, etc. The short-time Fourier transform
(STFT) as discussed in chapter 5 can be applied to any signal. However, if the signal comprises rapidly
changing frequency content, the results can be cumbersome: the DFT tries to “project” the changing
frequencies on a rectangular-tiled time-frequency grid, resulting in undesired frequency spreading.
For highly instationary signals, the question arises whether the rectangular grid provides the best basis
for analysis. The answer is “not really” and an ingenious alternative is provided by the Fan Chirp
Transform (FChT). The fan chirp transform was proposed in 2007 by Luis Weruaga and Marián
Képesi [21]. It effectively provides basis vectors in a fan geometry by pre-processing the signal with
the time-warping technique as discussed in chapter 3. The Short-Time Fan Chirp Transform (STFChT)
implements this time-warping and operates per block, similar to the STFT.
Figure 6.1 illustrates the difference between the DFT and the FChT bases schematically. Since the
harmonic structure of a tonal component keeps the partials at constant ratios to the fundamental
frequency, the skewness of the grid lines must increase proportionally with frequency.
[Figure 6.1, two panels (time versus frequency): (a) DFT basis grid, (b) FChT basis grid.]
Figure 6.1: Schematic representation of a non-stationary tonal component against a rectangular-tiled and a fan-tiled basis grid.
6.1 Formulation of the Fan Chirp Transform
The Fan Chirp Transform¹ for the continuous-time domain as formulated in [21] reads:

y_\alpha(f) = \int_{-\infty}^{\infty} y(t)\, \sqrt{|\phi'_\alpha(t)|}\, e^{-i 2\pi f \phi_\alpha(t)}\, dt \qquad -\infty < f < \infty \qquad (6.1)

The linear time warp function \phi_\alpha(t) is defined by equation (3.1). The term \sqrt{|\phi'_\alpha(t)|} is a normalisation
function that preserves the unitarity of the transformation². Equation (6.1) can be interpreted as a
projection of y(t) onto a set of chirping basis functions e^{-i 2\pi f \phi_\alpha(t)}.
By applying the change of variable \tau = \phi_\alpha(t) and inversely t = \phi_\alpha^{-1}(\tau) = \psi_\alpha(\tau), equation (6.1) is
placed on the warped-time axis and becomes:

y_\alpha(f) = \int_{-\infty}^{\infty} y(\psi_\alpha(\tau))\, \sqrt{\left| \phi'_\alpha(\psi_\alpha(\tau)) \right|}\, e^{-i 2\pi f \tau}\, d\tau
           = \int_{-\infty}^{\infty} \tilde{y}(\tau)\, \rho_\alpha(\tau)\, e^{-i 2\pi f \tau}\, d\tau \qquad -\infty < f < \infty \qquad (6.2)
The latter equation uses two substitutions:
1. The warped signal \tilde{y}(\tau) obtained by the procedure of linear time-warping, as discussed in
section 3.1.
2. A normalisation function \rho_\alpha(\tau) which can be shown to be [21, 9]:

\rho_\alpha(\tau) = \frac{1}{\sqrt[4]{|1 + 2\alpha\tau|}} \qquad (6.3)
Equation (6.2) is easily recognised as a Fourier transform of the product of y(τ) and ρα(τ). It can
therefore be computed efficiently by the fast Fourier transform.
6.2 Short-time Fan Chirp Transform
The short-time fan chirp transform (STFChT) combines the STFT and FChT and computes the
following transformation:
y_b[k] = \frac{1}{M} \sum_{m=0}^{M-1} y_b[m]\, \rho_{\alpha b}[m]\, w[m]\, e^{-i 2\pi k m / M} \qquad k = 0, \dots, M-1, \quad b = 1, \dots, B \qquad (6.4)
Vector y_b[m] denotes the warped signal block b centred at time instance t_b and obtained by linear
time-warping with chirp rate α_b. Note that the warped block is obtained by non-uniform interpolation
(see section 3.3) and as a consequence, the time instances do not fully correspond with those of the
STFT blocks. This was illustrated by example 6.
The vector ραb[m] is the normalisation term of (6.3) for block b. w[m] represents a symmetric window
function. Note that windowing is applied on the warped-time axis, while it can also be applied on the
linear-time axis, i.e. prior to the time-warping. [5] suggests the first method, motivated by the fact that
the main function of the window is improving the spectral representation, rather than distributing the
signal evenly over time. For both cases however, the peak of the window will always correspond to
the centre time instance tb.
¹ The term “fan” points out that all frequencies focus in the same focal point, see section 3.1.2.
² It can be shown that due to this term, the transformation is still unitary and therefore Parseval's theorem also applies to the FChT [21].
6.2.1 Block chirp rate
In contrast to the STFT, the frequencies of the STFChT are instationary, as was illustrated by the
skewed grid of figure 6.1(b):
f_{kb}(t) = f_k \left( 1 + \alpha_b (t - t_b) \right) \qquad -\tfrac{1}{2} T_b \leq (t - t_b) < \tfrac{1}{2} T_b \qquad (6.5)
The chirp rates can be set individually for each block, within the time support limit of (3.9):

-\frac{1}{2\alpha_b} < (t - t_b) < \frac{1}{2\alpha_b}

The block chirp rate is therefore limited by the block length T_b:

\alpha_b < \frac{1}{T_b} \qquad (6.6)
Considering a tonal component with instantaneous fundamental frequency f_0(t), the ideal linear
chirp rate for block b is simply estimated by:

\alpha_b = \frac{f'_0(t_b)}{f_0(t_b)} \qquad b = 1, \dots, B \qquad (6.7)
As the fundamental frequency is often given or approximated as a discrete-time vector f_0[n], the chirp
rate can be found by finite differences, for instance:

\alpha_b = \frac{f_0[n_b + 1] - f_0[n_b - 1]}{2\, \Delta t\, f_0[n_b]} \qquad b = 1, \dots, B \qquad (6.8)
If the fundamental frequency is not (precisely) known, one can vary the chirp rate and choose the rate
that yields the sharpest spectrum / highest peaks. This approach is used in [5].
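The finite-difference estimate (6.8) can be sketched in a few lines. This is an illustrative fragment using an invented quadratic frequency track f_0(t) = 10 + 10t², for which the exact rate α(t) = 20t/(10 + 10t²) is known:

```python
import numpy as np

fs = 1024
t = np.arange(5 * fs) / fs
f0 = 10 + 10 * t ** 2                       # known fundamental frequency track

def block_chirp_rates(f0, fs, nb):
    """Central finite differences, eq. (6.8), at block centre indices nb."""
    dt = 1.0 / fs
    return (f0[nb + 1] - f0[nb - 1]) / (2 * dt * f0[nb])

nb = np.arange(1, 10) * 512                 # block centres every 0.5 s
alpha = block_chirp_rates(f0, fs, nb)
alpha_exact = 20 * t[nb] / (10 + 10 * t[nb] ** 2)
err = np.max(np.abs(alpha - alpha_exact))
```

For a quadratic frequency track, the central difference of f_0 is exact, so the estimate agrees with the analytic chirp rate up to floating-point precision.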
6.2.2 Example 10: Short-time fan chirp transform of a chirp wave
Consider a T = 5 s chirp signal y[n] sampled at f_s = 1024 Hz. The instantaneous fundamental
frequency is given by:

f_0(t) = 10 + 10 t^2 \qquad 0 \leq t < T

The signal consists of two partials with frequencies f_0(t) and 2 f_0(t). Using equation (6.7), the chirp
rate for block b at time t_b reads:

\alpha_b = \frac{20 t_b}{10 + 10 t_b^2} \qquad b = 1, \dots, B
The maximum of αb is 1 at tb = 1 s. Therefore, the block length Tb is chosen to be 1 s to satisfy the
time-support of (6.6).
An STFT and STFChT are computed for y[n] and shown in figure 6.2. Both transformations use block
length Tb = 1 s, shift length Tl = 1/32 s and a Hanning window. The time-warping for the FChT is
performed by 8-point spline interpolation.
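The effect of the pre-warp on a single block can be reproduced in a few lines. This is an illustrative sketch with invented parameters: it assumes the linear warp φ_α(t) = t(1 + ½αt) with inverse ψ_α(τ) = (√(1 + 2ατ) − 1)/α, consistent with the normalisation (6.3), and uses linear interpolation instead of the 8-point spline.

```python
import numpy as np

fs, M, f0, alpha = 1024, 1024, 100.0, 0.5
t_full = np.arange(-M, M) / fs                      # extended support for interpolation
y_full = np.cos(2 * np.pi * f0 * (t_full + 0.5 * alpha * t_full ** 2))

tau = (np.arange(M) - M // 2) / fs                  # local (warped) time axis
psi = (np.sqrt(1 + 2 * alpha * tau) - 1) / alpha    # inverse warp psi_a(tau)
y_block = y_full[M - M // 2 : M - M // 2 + M]       # plain block around t = 0
y_warp = np.interp(psi, t_full, y_full)             # warped block, eq. (6.2)
rho = np.abs(1 + 2 * alpha * tau) ** -0.25          # normalisation, eq. (6.3)

w = np.hanning(M)
Y_fft = np.abs(np.fft.rfft(w * y_block)) / M        # plain DFT of the chirp
Y_fcht = np.abs(np.fft.rfft(w * rho * y_warp)) / M  # fan chirp transform
# the chirp (75..125 Hz over the block) smears in Y_fft but collapses
# onto a single sharp line near f0 = 100 Hz in Y_fcht
```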
[Figure 6.2, two spectrogram panels (frequency [Hz] versus time [s], magnitude in dB): (a) STFT, (b) STFChT.]
Figure 6.2: An instationary wave represented by an STFT and an STFChT.
As expected, the chirp waves appear as widely spread frequency bands in the STFT of figure 6.2(a).
This can be regarded as a very bad analysis, since the chirp signal covers a considerable part of the
spectrum.
The STFChT of figure 6.2(b) shows much more concentrated lines. However, the warping operation
introduces some leakage. Most leakage is found around t = 1 and t = 4.5. The leakage around t = 1
can be explained by the fact that the chirp reaches the time-support limit as defined by (6.6). The
leakage around t = 4.5 is due to aliasing: the frequencies at t = 4.5 are warped into the vicinity of the
Nyquist sampling limit \tfrac{1}{2} f_s = 512 Hz. It was verified that both types of leakage are absent for a signal
with fundamental frequency f_0(t) = 10 + 5 t^2, with the maximum chirp rate 0.707 and frequencies
extending to 270 Hz.
6.2.3 Example 11: Short-time fan chirp transform of an engine run-up
The possibilities of the STFChT are illustrated on the basis of a typical signal from dynamic
experiments: a microphone measurement of a car engine during a run-up. The signal and
transformation are characterised by:
– Sample rate: fs = 8192Hz.
– Signal length: T = 30 s.
– Block length: Tb = 1 s.
– Shift length: Tl = 1/8 s.
– Window: Hanning.
The chirp rate was estimated from the tacho vector (the engine speed in RPM) by finite differences
using equation (6.8). Since the engine speed increases slowly, the chirp rate is rather low: α stays
below 0.05. Still, the STFChT offers better spectral information than the STFT, as can be observed from
figure 6.3.
[Figure 6.3, two spectrogram panels of the engine run-up (frequency 0–1500 Hz versus time 10–20 s, magnitude in dB): (a) STFT, (b) STFChT.]
Figure 6.3: An STFT and STFChT of an engine run-up.
[Figure 6.4, two spectrogram panels of the engine run-up (frequency 600–750 Hz versus time 20–24 s): (a) STFT, (b) STFChT.]
Figure 6.4: Detail of the STFT and STFChT.
Figure 6.4 shows a detail of the STFT and STFChT. The linear warp operation successfully transforms
the instationary signal blocks into stationary blocks.
6.3 Summary
A well-known shortcoming of the DFT is its inability to detect and localise instationary signals. As
the basis vectors of the DFT are constant (rectangular), an instationary wave always projects on a
group of frequencies. The fan chirp transform (FChT) effectively provides basis functions in a fan-
geometry, matching the harmonic structure of an instationary tonal component. This is realised by
applying time-warping to the original signal, prior to being processed by the DFT.
The short-time fan chirp transform (STFChT) is in essence the STFT of time-warped signal blocks.
The improvement depends on the quality of the time-warp process. If the fundamental frequency of a
dominant tonal component is known, the required linear block chirp rates can be estimated by finite
differences. If however no fundamental frequency information is available, one can vary the chirp
rate per block and choose the chirp rate that yields the sharpest spectrum.
Part IV
Pitch Tracking & Order Extraction
Chapter 7
Pitch Tracking Techniques
Pitch tracking can be an important task in the analysis of dynamic measurements. Let us for example
consider a measurement of the vibration of the exhaust pipe of a combustion engine during a run-up
from 1000 RPM to 3000 RPM. Such measurements are performed to characterise the response of the
mechanical parts to their rotational inputs. After the measurement, the acquired data is stored in
time-domain signals. Short-time spectral analysis can then be performed right away, for example with
the techniques described in part III. However, one often wants to replace the time axis by a linear
scale of RPM and the frequencies by equivalent engine orders. The first implementation is called a
Campbell diagram, the latter an order plot.
There are several ways to achieve this [2]. Regardless of the method, one will need exact information
of the speed (RPM) as a function of time. In most cases this information is obtained during the
measurement by so-called tacho pulses: a pulse is released for every full or partial revolution of
the engine shaft. From these pulses, a vector can be constructed that accurately describes the
instantaneous speed, which is related to the fundamental frequency of the spectrum by some ratio
depending on the engine configuration (number of cylinders, 2- or 4-stroke).
Additionally, one may want to extract orders from the time-domain signal in order to analyse them
separately. As long as a tacho vector is supplied with the measurements, the diagrams and orders
are obtained rather easily. However, the availability of such data is not obvious, for example in the
following cases:
– Acoustic (microphone) measurements of a car while driving
– Asynchronous components in measurements
– Acoustic measurements on drive-by or fly-by vehicles
– Any other dynamic system for which no a-priori fundamental frequency data is available
For these situations, a pitch tracking algorithm can be employed to determine the instantaneous
frequency of instationary periodic components in a signal.
Note:
In most research, the term pitch is used instead of fundamental frequency. Although strongly related,
the fundamental frequency is an objective property, while the pitch is subjective to a listener and
should actually be reserved for a perceptual context, as stated in a.o. [4, 11]. Nevertheless, both terms
are used interchangeably in this chapter.
7.1 Pitch detection
Pitch detection is somewhat different from pitch tracking. Pitch detection (or fundamental frequency
estimation) is the procedure by which the fundamental frequency of a stationary periodic component
is sought within a signal. This is usually done without any knowledge of the expected location of
the fundamental. Pitch detection algorithms (PDAs) exist for both the time and the frequency domain.
Most applications are the detection of pitch and automatic transcription of musical or speech signals.
Often, PDAs are limited to monophonic signals, i.e. signals with only one tonal component.
Popular time-domain techniques are often limited to detection of monophonic signals and include: