
Parzen Filters for Spectral Decomposition of Signals

Dino Oglic
Department of Informatics
King's College London
dino.oglic@kcl.ac.uk

Zoran Cvetkovic
Department of Informatics
King's College London
zoran.cvetkovic@kcl.ac.uk

Peter Sollich
Department of Mathematics
King's College London
peter.sollich@kcl.ac.uk

Abstract

We propose a novel family of band-pass filters for efficient spectral decomposition of signals. Previous work has already established the effectiveness of representations based on static band-pass filtering of speech signals (e.g., mel-frequency cepstral coefficients and deep scattering spectrum). A potential shortcoming of these approaches is the fact that the parameters specifying such a representation are fixed a priori and not learned using the available data. To address this limitation, we propose a family of filters defined via cosine modulations of Parzen windows, where the modulation frequency models the center of a spectral band-pass filter and the length of a Parzen window is inversely proportional to the filter width in the spectral domain. We propose to learn such a representation using stochastic variational Bayesian inference based on Gaussian dropout posteriors and sparsity inducing priors. Such a prior leads to an intractable integral defining the Kullback–Leibler divergence term, for which we propose an effective approximation based on the Gauss–Hermite quadrature. Our empirical results demonstrate that the proposed approach is competitive with state-of-the-art models on speech recognition tasks.

1 Introduction

We consider the problem of learning an effective representation of signal data for supervised learning tasks such as classification of phonemes and/or sounds. The effectiveness of any supervised learning algorithm depends crucially on the effectiveness of a data representation. The latter is typically evaluated using the ability of a learning algorithm to generalize from training examples to unseen data points. A desirable property of an effective representation is invariance to nuisance transformations such as translations [31], and stability to the actions of small diffeomorphisms that distort/warp signals [23, 38]. For instance, the empirical effectiveness of state-of-the-art convolutional neural networks can be to a large extent attributed to their ability to encode invariance to local translations via convolutional weight sharing and pooling operators [20, 31]. This type of inductive bias is especially effective in supervised learning tasks with speech signals and images [e.g., see 21, 33, 41].

Previous work [e.g., see 3, 11, 27] has established the effectiveness of speech signal representations (e.g., mel-frequency cepstral coefficients and scattering representations) based on static band-pass filtering of signals. A potential shortcoming of these approaches (reviewed in Section 2.1) is the fact that the parameters specifying such a representation of signal data are fixed a priori and not learned using the available data. As a result, the hypothesis space of a supervised learning algorithm is selected beforehand and does not necessarily provide an ideal inductive bias for the learning process. To overcome this shortcoming and allow more flexible feature extraction, we propose a novel family of band-pass filters for efficient spectral decomposition of signals (Section 2.2). The filters are defined via cosine modulations of Parzen windows, typically encountered in kernel density estimation [e.g., see 26]. While the modulation frequency models the center of a spectral band-pass filter, the length of the Parzen window is inversely proportional to its width in the spectral domain. The optimization of the two parameters allows a flexible choice of band-pass filters and can be done in combination with learning of more abstract features (e.g., via convolutions).

Preprint. Under review.

arXiv:1906.09526v1 [stat.ML] 23 Jun 2019


We propose to learn these filters and other parameters of the representation using stochastic variational Bayesian inference based on Gaussian dropout posterior distributions [19, 25]. Section 3.1 provides a brief review of this inference technique and covers different prior functions known for promoting sparse solutions. Such prior functions typically lead to intractable integrals defining the Kullback–Leibler divergence term that is responsible for regularization in variational inference. To side-step this issue, sampling based approximations [5] or formulas based on empirical estimates of the divergence are typically employed [19, 25]. We propose to address the approximation of this divergence term more rigorously by making a theoretical contribution (Propositions 2 and 3) which shows that the divergence integral can be approximated effectively using the Gauss–Hermite quadrature (Section 3.2).

In Section 4, we evaluate the proposed approach empirically on a standard benchmark dataset for speech recognition (TIMIT), relative to previously reported results of state-of-the-art baselines for context dependent models. In contrast to some of the baselines, we opt for a simple model and do not employ delta features or speaker normalization techniques known for improving the accuracy of acoustic models [11]. We also perform training with a single log-likelihood term, whereas some of the approaches combine both context-dependent (tri-phone) and -independent (monophone) likelihoods into a multi-objective loss function. Our empirical results demonstrate that simple models based on Parzen filters and variational inference are competitive with state-of-the-art acoustic models.

2 Spectral Decomposition of Signals

In this section, we introduce a novel and highly flexible family of differentiable band-pass filters for spectral decomposition of signals. Each filter is defined with no more than three parameters and can be easily integrated (as a pre-processing step) into standard classification and regression models such as artificial neural networks and/or kernel methods. The ultimate goal is to find an operator which maps the space of signals into a Hilbert space such that the representation of the signal data is stable to the action of a small diffeomorphism and invariant to nuisance transformations such as translations. In the remainder of the section, we formalize these concepts and provide a brief review of feature extraction procedures based on band-pass filtering of speech signals. Following this, we introduce our family of filters and discuss some advantages over the static band-pass filtering of signals.

Let L²(ℝ^d) denote the space of square integrable functions defined on ℝ^d and assume a continuous signal x ∈ L²(ℝ^d). In this paper, we restrict ourselves to the case d = 1, which corresponds to the representation of speech signals (an image, for instance, would be represented by a two dimensional signal). An operator Φ: L²(ℝ^d) → H is a mapping of a signal into a Hilbert space H. Let T_c x(t) = x(t − c) denote the translation of a signal x by some constant c ∈ ℝ^d. An operator Φ is called translation invariant if Φ(T_c x) = Φ(x) for all x ∈ L²(ℝ^d) and c ∈ ℝ^d. To preserve stability in the space of continuous signals L²(ℝ^d), Φ should be a non-expansive operator [23], i.e.,
\[
(\forall x, y \in L^2(\mathbb{R}^d)) \qquad \|\Phi(x) - \Phi(y)\|_{\mathcal{H}} \leq \|x - y\| .
\]
Let D_τ x(t) = x(t − τ(t)) denote a diffeomorphism of a signal given by a displacement field τ(t) ∈ C²(ℝ^d). For example, one can take τ(t) = εt with ε ∈ ℝ and ε → 0. To preserve stability relative to a small diffeomorphism of a signal, it is sufficient to ensure that the operator Φ is Lipschitz continuous [3, 23]. An operator Φ is Lipschitz continuous with respect to actions of C²-diffeomorphisms if for any compact Ω ⊂ ℝ^d there exists a constant L such that for all signals x ∈ L²(ℝ^d) supported on Ω and all τ ∈ C²(ℝ^d) it holds that [for more details see, e.g., 23]
\[
\|\Phi(x) - \Phi(D_\tau x)\|_{\mathcal{H}} \leq L \,\|I - D_\tau\|_{\infty} \|x\| := L \left( \sup_{t \in \Omega} \|\nabla \tau(t)\|_{\infty} + \sup_{t \in \Omega} \|\nabla\nabla \tau(t)\|_{\infty} \right) \|x\| .
\]

The spectrogram of a signal x is given by
\[
|x(t, \omega)| = \left| \int x(u)\, \zeta_t(u) \exp(-i\omega u)\, du \right| ,
\]
where |·| denotes the modulus of a complex number and ζ_t is a window of duration T centered at some time-index t with ∫ ζ_t(u) du = 1. The spectrogram is an operator that can provide an approximately locally time-translation invariant representation over durations limited by a window [3]. In processing of speech signals, the signal data is typically represented with a set of overlapping windows of fixed duration (e.g., 25 ms windows with 15 ms overlaps) so that an approximate locally time-translation invariant operator can be constructed. While the spectrogram of a signal can provide local time-translation invariance, Mallat [23] has demonstrated that the operator is not Lipschitz continuous and, thus, it does not necessarily provide stability to the action of a small diffeomorphism. The operators presented in the following section aim at providing this property in addition to translation invariance.
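As an illustration of the windowed decomposition above, the following NumPy sketch computes a magnitude spectrogram with 25 ms frames and a 10 ms stride; the 16 kHz sampling rate and the Hann window are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def spectrogram(x, fs=16000, frame_ms=25, stride_ms=10):
    """Magnitude spectrogram |x(t, omega)| over overlapping windows.

    A minimal sketch: a Hann window plays the role of zeta_t and the
    discrete Fourier transform replaces the continuous integral.
    """
    frame = int(fs * frame_ms / 1000)       # 400 samples at 16 kHz
    stride = int(fs * stride_ms / 1000)     # 160 samples at 16 kHz
    window = np.hanning(frame)
    window /= window.sum()                  # window integrates to one
    frames = [x[s:s + frame] * window
              for s in range(0, len(x) - frame + 1, stride)]
    # rows: time indices t, columns: frequencies omega
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# toy usage: 200 ms of a 440 Hz tone
t = np.arange(3200) / 16000.0
S = spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)   # (number of frames, frame // 2 + 1)
```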

2.1 Scattering Operators

We start with a brief review of mel-frequency cepstral coefficients [8, 11] and the motivation behind this signal mapping operator frequently used in speech recognition. The main idea behind the approach is to average the spectrogram energy with band-pass filters and in this way obtain the (approximate) Lipschitz continuity of the operator mapping. More formally, the mel-frequency spectrogram of a signal is given by
\[
(Mx)(t, \eta, \alpha, \beta) = \frac{1}{2\pi} \int |x(t, \omega)|^2\, \psi(\omega \mid \eta, \alpha, \beta)^2\, d\omega ,
\]
where ψ(ω | η, α, β) is the square root of a triangular probability distribution with mode η and support on the interval [α, β]. Mel-frequency spectrograms are typically defined with a family of triangular distributions (e.g., 50 band-pass filters). The modes of these distributions are selected so that they are equidistant in the log-space of the spectrum (the mel-scale characteristic to this family of filters corresponds to the natural logarithm) and the support of each distribution is defined by the modes of the neighboring filters. As a result, mel-frequency spectrograms average over high frequencies with larger frequency bandwidths compared to low frequencies and are typically stable to the action of a small diffeomorphism. Moreover, in [3] it is argued that mel-frequency spectrograms typically define a translation invariant Lipschitz continuous operator.
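A minimal NumPy sketch of this averaging step is given below: triangular filters whose modes are equidistant on a mel-like scale are applied to the power spectrogram computed earlier. The specific mel mapping and the number of filters are illustrative assumptions, not the exact recipe of [8, 11].

```python
import numpy as np

def triangular_filterbank(n_filters, n_bins, fs):
    """Triangular band-pass responses psi(omega | eta, alpha, beta)**2.

    Filter modes are equidistant on an assumed mel scale; the support of
    each filter is bounded by the modes of its neighbours.
    """
    mel = lambda f: 1127.0 * np.log(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (np.exp(m / 1127.0) - 1.0)
    edges = inv_mel(np.linspace(mel(50.0), mel(fs / 2), n_filters + 2))
    freqs = np.linspace(0.0, fs / 2, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        a, eta, b = edges[k], edges[k + 1], edges[k + 2]
        up = (freqs - a) / (eta - a)          # rising edge
        down = (b - freqs) / (b - eta)        # falling edge
        bank[k] = np.maximum(0.0, np.minimum(up, down))
    return bank

# usage with the magnitude spectrogram S from the previous sketch
# M = (S ** 2) @ triangular_filterbank(50, S.shape[1], 16000).T
```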

An alternative band-pass filtering approach for spectral decomposition of signals has been outlined in [23]. The approach is motivated by the fact that, as a result of spectrogram averaging with respect to short windows, mel-cepstral coefficients can contribute to information loss, which can have a negative impact on the performance of a supervised learning algorithm. The main idea is to re-arrange the terms in the integral defining the mel-spectrogram of a signal and in this way introduce an operator capable of filtering the whole signal instead of filtering only windows of fixed length determined by ζ_t. More specifically, the mel-frequency spectrogram operator can be written as
\[
(Mx)(t, \theta) = \frac{1}{2\pi} \int |x(t, \omega)|^2\, \big|\hat{\psi}(\omega \mid \theta)\big|^2\, d\omega = \int \left| \int x(u)\, \zeta_t(u)\, \psi(v - u \mid \theta)\, du \right|^2 dv ,
\]
where ψ̂(· | θ) is the Fourier transform of some filter ψ(· | θ) (abbreviated ψ_θ) defined with a hyperparameter vector θ. The second equality is a consequence of Plancherel's theorem (and the convolution theorem), which states that the integral of the square of the Fourier transform of a function is equal to the integral of the square of the function itself [e.g., see 28, 30]. Now, the main idea in [23] is to re-arrange the terms appearing in the integral defining the mel-spectrogram and filter the whole signal instead of filtering it by parts determined by a window of fixed length T. The resulting operator is called the (squared) first order scattering operator and it is defined by [3, 23]

\[
S_1^2 x\,(t, \theta) = \int \left| \int x(u)\, \psi(v - u \mid \theta)\, du \right|^2 |\zeta(t - v)|^2\, dv = \left( |x \ast \psi_\theta|^2 \ast |\zeta|^2 \right)(t) ,
\]
where ∗ denotes the convolution of one dimensional signals. As the square operator can amplify large coefficients, in [3, 23] it is proposed to replace it with the modulus operator and, thus, define a more stable signal representation S_1 x(t, θ) = (|x ∗ ψ_θ| ∗ ζ)(t).

In the latter operator, the windowing function ζ acts as a low-pass filter and performs weighted ℓ1-average pooling of the previously filtered signal. The scattering operator can be extended to a higher order signal decomposition by applying the scattering operation to already filtered signals [3]. In [23] it is argued that the scattering operator is a contraction and that it can provide stability to the action of a small diffeomorphism. The application of the modulus operator to the filtered signal can be seen as an activation function and, thus, the output of the scattering operator resembles the output of a convolution layer in artificial neural networks. Similar to mel-frequency coefficients, scattering operators rely on static filters defined by wavelets [e.g., see 23, for more details].
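To make the construction concrete, here is a minimal NumPy sketch of the first order scattering coefficient S_1 x(t, θ) = (|x ∗ ψ_θ| ∗ ζ)(t); the Gabor-style band-pass filter and the Gaussian low-pass window are illustrative stand-ins for the wavelet filters used in [3, 23], and the parameter values are arbitrary.

```python
import numpy as np

def first_order_scattering(x, fs=16000, center_hz=1000.0, bandwidth_hz=200.0):
    """S1 x(t) = (|x * psi| * zeta)(t) for a single band-pass filter.

    psi is a complex-exponential-modulated Gaussian (a Gabor-style filter,
    assumed here for illustration); zeta is a Gaussian low-pass window.
    """
    t = np.arange(-0.01, 0.01, 1.0 / fs)                  # 20 ms filter support
    envelope = np.exp(-0.5 * (t * bandwidth_hz) ** 2)
    psi = envelope * np.exp(2j * np.pi * center_hz * t)   # band-pass filter
    zeta = np.exp(-0.5 * (t * 100.0) ** 2)                # low-pass window
    zeta /= zeta.sum()
    modulus = np.abs(np.convolve(x, psi, mode="same"))    # |x * psi|
    return np.convolve(modulus, zeta, mode="same")        # (|x * psi| * zeta)(t)

# toy usage: a tone at the filter's center frequency
x = np.sin(2 * np.pi * 1000.0 * np.arange(3200) / 16000.0)
s1 = first_order_scattering(x)
```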

2.2 Parzen Filters

A potential shortcoming of mel-frequency spectrograms and scattering operators is the fact that the parameters defining these signal representations are selected a priori without relying on data. As a result, the hypothesis space of a supervised learning algorithm is selected beforehand and does not necessarily provide an ideal inductive bias for all learning tasks based on signals. To address this limitation, we propose to replace them with band-pass filters based on Parzen windows, which are differentiable with respect to their hyperparameters. As the triangular probability distribution is an example of a Parzen window typically encountered in kernel density estimation, one could also perform the spectrogram averaging using smoother windows. Examples of Parzen windows with such properties are the (squared) Epanechnikov and Gaussian window functions. In Table 1, we provide a formal definition of these window functions and specify their hyperparameters. The Epanechnikov window is not differentiable at zero and behaves similarly to the hinge loss in primal optimization of support vector machines. This can create an issue in hyperparameter optimization, which can be resolved by squaring the max operator (hence, the squared Epanechnikov window is introduced).

PARZEN WINDOW        FREQUENCY-DOMAIN FILTER                  TIME-DOMAIN FILTER
EPANECHNIKOV         α · max{0, 1 − γ‖ω − η‖²}                α · cos(2πηt) · max{0, 1 − γ|t|²}
SQ-EPANECHNIKOV      α · max{0, 1 − γ‖ω − η‖²}²               α · cos(2πηt) · max{0, 1 − γ|t|²}²
GAUSSIAN             α · exp(−γ‖ω − η‖²)                      α · cos(2πηt) · exp(−γ|t|²)

Table 1: The table lists three differentiable Parzen filters for spectral decomposition of signals based on band-pass filtering. The symbols t and ω denote time- and frequency-domain inputs to the filters, γ is the parameter controlling the filter bandwidth, η is the parameter controlling the center frequency, and α is a scaling parameter.
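For concreteness, a minimal NumPy sketch of the three time-domain filters in Table 1 is given below; the parameter values in the usage lines are arbitrary and only illustrate the roles of α, γ, and η (the 16 kHz sampling rate is an assumption).

```python
import numpy as np

def epanechnikov_filter(t, alpha, gamma, eta):
    """alpha * cos(2*pi*eta*t) * max(0, 1 - gamma*|t|^2)."""
    return alpha * np.cos(2 * np.pi * eta * t) * np.maximum(0.0, 1.0 - gamma * t ** 2)

def sq_epanechnikov_filter(t, alpha, gamma, eta):
    """alpha * cos(2*pi*eta*t) * max(0, 1 - gamma*|t|^2)**2."""
    return alpha * np.cos(2 * np.pi * eta * t) * np.maximum(0.0, 1.0 - gamma * t ** 2) ** 2

def gaussian_filter(t, alpha, gamma, eta):
    """alpha * cos(2*pi*eta*t) * exp(-gamma*|t|^2)."""
    return alpha * np.cos(2 * np.pi * eta * t) * np.exp(-gamma * t ** 2)

# usage: a filter centered at 1 kHz evaluated on a 25 ms support at 16 kHz
t = np.arange(-200, 200) / 16000.0
h = gaussian_filter(t, alpha=1.0, gamma=1e5, eta=1000.0)
```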

Now, there are two possible directions for performing spectral decomposition of signals: frequency- and time-domain filtering. The operators [3, 11] reviewed in Section 2.1 are based on spectrogram averaging and, thus, filters are defined in the frequency domain. An alternative to this would be to realize the band-pass filtering via time-domain convolutions of a signal with filters. In particular, the convolution theorem implies that for two signals x_1, x_2 ∈ L²(ℝ) the Fourier transform of their convolution is equal to the product of the corresponding Fourier coefficients, i.e.,
\[
\mathcal{F}(x_1 \ast x_2)(\omega) = \mathcal{F}(x_1)(\omega) \cdot \mathcal{F}(x_2)(\omega) ,
\]
where F denotes the Fourier transform of a signal and ∗ is the convolution operator. Thus, a mel-spectrogram coefficient corresponding to a triangular filter could equivalently be obtained by first performing the time-domain convolution with an appropriate filter and then summing the squared amplitudes of the resulting signal. Here, it is important to note that while the first and higher order scattering operators are introduced via time-domain convolutions, they are always implemented using frequency-domain filters. This is mainly because of the parametrization of the filters, which is based on wavelets and allows explicit specification of band-pass parameters [e.g., see 23].

While frequency-domain filtering using Parzen windows (along the lines of the operators introduced in Section 2.1) is rather straightforward, time-domain filtering is slightly more complicated. This is mainly because of the parameterization of the center frequency of a filter, which typically corresponds to the filter mode. In particular, plain Parzen windows with center and bandwidth parameters only provide smooth band-pass filters in the frequency domain and are good approximations of the triangular distribution used in mel-frequency cepstral coefficients. To adapt these filters to time-domain band-pass filtering, one needs to be able to parametrically change the center frequency of a filter as well as its bandwidth. It is well known that the width of a window centered at the origin of the time domain is inversely proportional to the frequency bandwidth of a filter and, thus, what remains is the parameterization of the center frequency. To address this, we rely on an identity which states that cosine modulation of a filter shifts its frequency band by the modulation frequency,
\[
\mathcal{F}(\psi_{\omega_0})(\omega) = \frac{1}{2}\left( \mathcal{F}(\psi)(\omega + \omega_0) + \mathcal{F}(\psi)(\omega - \omega_0) \right) \quad \text{with} \quad \psi_{\omega_0}(t) = \psi(t) \cdot \cos(2\pi\omega_0 t) ,
\]
where ω_0 is the modulation frequency and a parameter in time-domain Parzen filters (see Table 1). Thus, the cosine modulation frequency parameter allows positioning of the band-pass filter at any point in the frequency domain, and the width of its Parzen window controls the bandwidth of the filter.
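The shift property can be checked numerically: the sketch below compares the spectrum of a Gaussian Parzen window with that of its cosine-modulated version and verifies that the spectral peak moves to the modulation frequency (sampling rate and parameter values are illustrative assumptions).

```python
import numpy as np

fs = 16000.0
t = np.arange(-200, 200) / fs                     # 25 ms support
psi = np.exp(-1e5 * t ** 2)                       # Gaussian Parzen window (low-pass)
psi_mod = psi * np.cos(2 * np.pi * 1000.0 * t)    # cosine modulation at 1 kHz

freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
peak_lp = freqs[np.argmax(np.abs(np.fft.rfft(psi)))]
peak_bp = freqs[np.argmax(np.abs(np.fft.rfft(psi_mod)))]
print(peak_lp, peak_bp)   # ~0 Hz for the window, ~1000 Hz after modulation
```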

Parzen filters generate a spectral decomposition of a signal which can provide a much more structured input to a supervised learning algorithm compared to raw signal data. Scattering operators also work on spectral decompositions of signals via averaging of the resulting spectrograms (obtained via static band-pass filters). As these feature extraction methods enjoy nice theoretical properties (Section 2.1), we aim at mimicking that process with a more fine grained feature construction via convolution layers. In particular, instead of averaging over the long windows of speech characteristic to both the deep scattering spectrum [3, 23] and mel-frequency coefficients [8, 11], we rely on several convolution layers and in this way aim at incorporating local time-translation invariance and fostering stability to the action of a small diffeomorphism. We investigate: i) one dimensional convolutions that extract features by combining all the filters over a small number of samples (Appendix B, Table 6), and ii) two dimensional convolutions which work on blocks of filters (with potentially overlapping bandwidths) and several samples of filtered signals (Appendix B, Table 7). The structure of our models was in part motivated by considerations in [33], where only one dimensional convolutions have been considered. However, even in that case we tend to keep our network structure simpler (essentially identical convolution and pooling operators across layers in the 1D case) than the one proposed in [33, github].

3 Learning Parzen Filters using Stochastic Variational Inference

In this section, we first provide a brief review of stochastic variational inference, which will be used for filter-learning in Section 4. The approach is based on optimizing a lower bound on the log-marginal likelihood of the model. The optimization objective involves an analytically intractable integral that defines the Kullback–Leibler divergence term acting as a regularizer. Previous work has proposed approximations based on Monte Carlo sampling [5] and an empirically estimated formula [25]. We follow this line of research and propose a theoretically motivated approximation based on the Gauss–Hermite quadrature, illustrated with scale-mixture and log-scale uniform priors (Section 3.2).

Let X ⊂ L²(ℝ) be an instance space containing signals in its interior and Y the space of categorical labels. Suppose that a set of labeled examples {(x_i, y_i)}_{i=1}^{n} has been drawn independently from a latent Borel probability measure defined on X × Y. We assume that the conditional probability of a label y ∈ Y given an instance x ∈ X can be approximated with an exponential family model [2, 15]
\[
p(y \mid x, \theta, W) = \frac{\exp\!\left( \theta^{\top} \phi(x, y \mid W) \right)}{\sum_{y' \in \mathcal{Y}} \exp\!\left( \theta^{\top} \phi(x, y' \mid W) \right)} ,
\]
where θ ∈ Θ is a parameter vector and φ(x, y | W) is a sufficient statistic of y | x, defined with some set of hyperparameters W. Typically, the sufficient statistic of the model is selected such that φ(x, y | W) = vec(e_y φ(x | W)ᵀ), where e_y is the so called one-hot vector with one at the position of the categorical label y and zero elsewhere, and φ(x | W) is a sufficient statistic of x ∈ X.

We can now take some prior distribution on the parameters p(θ, W) and derive the posterior distribution of ∆ = (θ, W) given a sample of labeled examples
\[
\log p(\Delta \mid X_n, Y_n) = \log p(\Delta) + \sum_{i=1}^{n} \log p(y_i \mid x_i, \Delta) \quad \text{with} \quad X_n = \{x_i\}_{i=1}^{n} \ \wedge \ Y_n = \{y_i\}_{i=1}^{n} .
\]
The mode of the posterior distribution is known as the maximum a posteriori estimator and we denote it with ∆*_n = arg max_∆ log p(∆ | X_n, Y_n). The maximum a posteriori estimator is most frequently used as an empirical estimator of the conditional probability of a label given an instance. Typically, the posterior distribution p(∆ | X_n, Y_n) is not analytically tractable. The log-marginal likelihood of a model is frequently used for (hyper)parameter optimization [e.g., see 32] and it is given by
\[
p(Y_n \mid X_n) = \int p(Y_n \mid X_n, \Delta)\, p(\Delta)\, d\Delta .
\]

3.1 Stochastic Variational Inference

Bayesian variational inference [5, 6, 18, 19, 25, 40] is a popular technique for the approximation of posterior distributions involving analytically intractable integrals. The main idea is to introduce a variational probability density function q(∆) in order to approximate the actual posterior p(∆ | X_n, Y_n). In particular, the Jensen inequality implies that the log-marginal likelihood can be lower bounded by
\[
\log p(Y_n \mid X_n) = \log \int p(Y_n \mid X_n, \Delta)\, \frac{p(\Delta)}{q(\Delta)}\, q(\Delta)\, d\Delta \geq \mathbb{E}_q\!\left[ \log p(Y_n \mid X_n, \Delta) \right] - \mathrm{KL}\!\left( q \,\|\, p \right) ,
\]
where KL(q || p) denotes the Kullback–Leibler divergence [22] between distributions p and q. In variational inference, the parameters of the probability density function q are selected by maximizing the lower bound on the log-marginal likelihood of the model, or equivalently
\[
q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \ \mathrm{KL}\!\left( q \,\|\, p \right) - \sum_{i=1}^{n} \mathbb{E}_q\!\left[ \log p(y_i \mid x_i, \Delta) \right] , \tag{1}
\]
where Q is some pre-defined family of variational distributions. Typically, the variational distribution is assumed to be the product of univariate Gaussian distributions q(∆) = ∏_{i=1}^{p} N(∆_i | μ_i, σ_i²), where p is the total number of parameters in the model and N(∆_i | μ_i, σ_i²) denotes the fact that the random variable ∆_i follows the univariate Gaussian distribution with mean μ_i and variance σ_i².

The log-likelihood term L_n(q) = Σ_{i=1}^{n} E_q[log p(y_i | x_i, ∆)] is not analytically tractable and the main idea in stochastic Bayesian variational inference is to approximate it using minibatches [19],
\[
L_n(q) \approx L_m(q) = \frac{n}{m} \sum_{i=1}^{m} \log p(y_i \mid x_i, \gamma) \quad \text{with} \quad \gamma_j = \mu_j + \varepsilon_j \sigma_j , \ \varepsilon_j \sim \mathcal{N}(\varepsilon_j \mid 0, 1) \ (1 \leq j \leq p) ,
\]
where (x_i, y_i)_{i=1}^{m} is a minibatch with m random examples. The estimator is differentiable with respect to the variational parameters υ = (μ_i, σ_i)_{i=1}^{p} and unbiased. As a result, its gradient is also unbiased and ∇_υ L_n(q) ≈ n/m Σ_{i=1}^{m} ∇_υ log p(y_i | x_i, γ) with γ_j ∼ N(γ_j | μ_j, σ_j²) (1 ≤ j ≤ p).

In this paper, we follow [19, 25, 36] and rely on a special type of variational distribution known as the dropout posterior. The main idea is to parameterize the Gaussian variational distribution such that it has mean μ_j and variance σ_j² = α_j μ_j², where α_j > 0 is a scaling parameter and 1 ≤ j ≤ p. The dropout posteriors provide a generalization of the dropout regularization technique frequently used in artificial neural networks. In particular, [13] has proposed a regularization technique that in each step of stochastic gradient descent draws a set of active units in each layer of the network by sampling from the Bernoulli distribution with probability 1 − p, where p is called the dropout probability. The concept was generalized to continuous Gaussian distributions in [36], which then motivated the introduction of dropout posteriors in variational Bayesian inference [19, Appendix B].
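The following NumPy sketch illustrates the Gaussian dropout reparameterization used above: each parameter is sampled as γ_j = μ_j (1 + √α_j ε_j) with ε_j ∼ N(0, 1), which is equivalent to drawing from a Gaussian with mean μ_j and variance α_j μ_j². The array shapes and values are arbitrary and only serve the illustration.

```python
import numpy as np

def sample_dropout_posterior(mu, log_alpha, rng):
    """Draw gamma ~ N(mu, alpha * mu**2) via the reparameterization trick.

    gamma_j = mu_j * (1 + sqrt(alpha_j) * eps_j) with eps_j ~ N(0, 1), so
    gradients with respect to (mu, log_alpha) can flow through the sample.
    """
    eps = rng.standard_normal(mu.shape)
    return mu * (1.0 + np.exp(0.5 * log_alpha) * eps)

rng = np.random.default_rng(0)
mu = rng.standard_normal((128, 3))          # e.g., per-filter Parzen parameters
log_alpha = np.full_like(mu, -3.0)          # alpha = exp(-3), std ~ 0.22 |mu|
gamma = sample_dropout_posterior(mu, log_alpha, rng)
```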

3.2 Approximation of Kullback–Leibler Divergence for Dropout Posteriors

The Kullback–Leibler divergence term in Eq. (1) acts as a regularization term in the variational objective function. It takes as an argument the prior distribution p(∆), which determines the generalization properties of variational hypotheses given by q*. The integral defining this divergence measure is analytically intractable and needs to be approximated. The choice of the approximation technique depends on the choice of the variational and prior distributions. As mentioned in Section 3.1, we rely on Gaussian dropout posteriors as variational distributions, which implies that the Kullback–Leibler divergence can be expressed as a sum of one dimensional integrals with respect to a Gaussian measure. Such integrals can be effectively approximated using the Gauss–Hermite quadrature. The following theorem provides a means to approximate the divergence term for any prior distribution.

Theorem 1. [Abramowitz and Stegun, 1] The Gauss–Hermite quadrature is a quadrature over the interval (−∞, ∞) with weighting function exp(−u²). For a univariate function h and an integral
\[
J = \int_{-\infty}^{\infty} h(u) \exp(-u^2)\, du ,
\]
the Gauss–Hermite approximation of order s satisfies J ≈ Σ_{i=1}^{s} w_i h(u_i), where {u_i}_{i=1}^{s} are the roots of the physicist's version of the Hermite polynomial
\[
H_s(u) = (-1)^s \exp(u^2)\, \frac{d^s}{du^s} \exp(-u^2)
\]
and the corresponding weights {w_i}_{i=1}^{s} are given by
\[
w_i = \frac{2^{s-1} s!\, \sqrt{\pi}}{s^2\, H_{s-1}(u_i)^2} .
\]
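As a sanity check of Theorem 1, the sketch below approximates ∫ cos(u) exp(−u²) du, whose exact value is √π · exp(−1/4), using NumPy's Gauss–Hermite nodes and weights.

```python
import numpy as np

s = 16                                     # quadrature order used in the paper
nodes, weights = np.polynomial.hermite.hermgauss(s)

approx = np.sum(weights * np.cos(nodes))   # sum_i w_i h(u_i) with h = cos
exact = np.sqrt(np.pi) * np.exp(-0.25)     # int cos(u) exp(-u^2) du
print(approx, exact)                       # the two values agree closely
```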

The scale-mixture prior is a prior distribution known for promoting sparsity of hypotheses and it was first proposed in [5] in the context of Bayesian variational inference. It resembles the so called spike and slab prior [7, 12, 24] and is given by
\[
p_{\mathrm{sm}}(\Delta_i) = \lambda \cdot \mathcal{N}\!\left( \Delta_i \mid \xi, \sigma_1^2 \right) + (1 - \lambda) \cdot \mathcal{N}\!\left( \Delta_i \mid \xi, \sigma_2^2 \right) ,
\]
where ∆_i is a parameter of the model (see Eq. 1), σ_1² and σ_2² are variance parameters with σ_1 ≪ σ_2, ξ is the prior mean, and 0 ≤ λ ≤ 1. The first mixture component is chosen such that σ_1 ≪ 1, which forces many of the parameters to concentrate tightly around the mean ξ (e.g., around zero for ξ = 0). The second mixture component has higher variance and heavier tails, allowing parameters to move further away from the mean. We re-parameterize the variance parameters of the scale-mixture prior with ξ ≠ 0 such that σ_1² = α_1 ξ² and σ_2² = α_2 ξ², where α_1, α_2 > 0 are some scaling parameters. The variance parameters are shared between all variational parameters and this is an important difference compared to approaches based on the spike and slab prior [7, 12, 24], where each model parameter has a different variance parameter. In contrast to [5], we rely on the dropout parametrization of variance parameters and do not approximate the Kullback–Leibler divergence term using samples from the posterior distribution q, but rely instead on the Gauss–Hermite quadrature with approximation order s ≈ 16. The latter is formalized with the following proposition, for which a proof is given in Appendix A.

Proposition 2. The Kullback–Leibler divergence term defined with a dropout Gaussian posterior and a (dropout) scale-mixture prior distribution can be approximated by
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{sm}} \right) \approx -\log \sqrt{2\pi\alpha\mu^2} - \frac{1}{\sqrt{\pi}} \sum_{i=1}^{s} w_i \log p_{\mathrm{sm}}(v_i) - \frac{1}{2} \quad \text{with} \quad v_i = \left( \sqrt{2\alpha}\, u_i + 1 \right) \mu ,
\]
where {u_i}_{i=1}^{s} are the roots of the Hermite polynomial with corresponding quadrature weights {w_i}_{i=1}^{s}, α and μ are variational parameters, and p_sm is some scale-mixture prior distribution.

An alternative choice of prior distribution, also known as the improper log-scale uniform prior, was considered in [19, 25]. The distribution is motivated by a result in [19] which shows that variational inference with this prior is equivalent to learning with Gaussian dropout [36]. The log-scale uniform prior distribution of a parameter ∆_i is given by [19, 25]
\[
p_{\mathrm{lsu}}(\log |\Delta_i|) \propto \text{const.} \ \Leftrightarrow \ p(|\Delta_i|) = \frac{1}{|\Delta_i|} .
\]

Two different approximations of the Kullback–Leibler divergence between this prior distribution and Gaussian variational dropout posteriors have been provided in [19] and [25]. The latter approximation is based on an empirically synthesized formula (based on millions of samples from the posteriors) and is considered the state-of-the-art. We propose here to approximate this regularization term using the Gauss–Hermite quadrature and formalize this in the following proposition (see Appendix A).

Proposition 3. The Kullback–Leibler divergence term defined with a dropout Gaussian posterior and the log-scale uniform prior distribution can be approximated by
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{lsu}} \right) \approx -\frac{1}{2} \log \alpha + \frac{1}{\sqrt{\pi}} \sum_{i=1}^{s} w_i \log |v_i| + \text{const.} \quad \text{with} \quad v_i = \sqrt{2\alpha}\, u_i + 1 .
\]

As the roots of the Hermite polynomial and the corresponding quadrature coefficients are symmetric around zero, for zero-mean priors it is possible to get an estimate of order s with only s/2 roots/weights.
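A minimal NumPy sketch of the two approximations is given below; it implements Propositions 2 and 3 directly with NumPy's Gauss–Hermite nodes, drops the additive constant of Proposition 3, and the prior hyperparameters in the usage line are illustrative (λ, σ_1, σ_2 are taken from Appendix C, ξ is an arbitrary choice).

```python
import numpy as np

NODES, WEIGHTS = np.polynomial.hermite.hermgauss(16)   # order s = 16

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_scale_mixture(mu, alpha, lam, xi, s1, s2):
    """Proposition 2: KL(q || p_sm), up to the quadrature error."""
    v = (np.sqrt(2.0 * alpha) * NODES + 1.0) * mu        # quadrature abscissas
    log_prior = np.logaddexp(np.log(lam) + log_normal(v, xi, s1 ** 2),
                             np.log(1.0 - lam) + log_normal(v, xi, s2 ** 2))
    return (-np.log(np.sqrt(2.0 * np.pi * alpha * mu ** 2))
            - np.sum(WEIGHTS * log_prior) / np.sqrt(np.pi) - 0.5)

def kl_log_scale_uniform(alpha):
    """Proposition 3: KL(q || p_lsu), up to an additive constant."""
    v = np.sqrt(2.0 * alpha) * NODES + 1.0
    return -0.5 * np.log(alpha) + np.sum(WEIGHTS * np.log(np.abs(v))) / np.sqrt(np.pi)

print(kl_scale_mixture(mu=0.1, alpha=0.05, lam=0.25, xi=0.1, s1=0.005, s2=1.0),
      kl_log_scale_uniform(alpha=0.05))
```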

4 Experiments

We perform experiments on a standard benchmark dataset for automatic speech recognition, TIMIT [10]. The data splits (training/development/validation) originate from the Kaldi framework [29]. In all the experiments, we train a context dependent model based on frame labels generated using the DNN TRI-PHONE model from Kaldi with 25 ms frames and a 10 ms stride between successive frames. In the pre-processing step, we assign a Kaldi frame label to a 200 ms long segment of raw speech centered at the original Kaldi frame (keeping the 10 ms stride between successive frames of raw speech). A similar choice of input raw-speech frame length has been reported in [3, 33]. After completion of training, we take the resulting log-posterior probabilities (scaled by log-class priors) and pass them to Kaldi decoding to obtain the reported phoneme error rates. We configure the decoder just as in the DNN decoding, which equipped us with the frame labels (in total, 1936 HMM state ids).

We train our models using the approach described in Section 3. In all the experiments, the minibatch size was set to 128 samples. For the log-scale uniform prior, the feature extraction part involving Parzen filters and convolution layers that synthesize features across filtered signals is trained using the RMSPROP algorithm [37] with initial learning rate 0.0008. The model part involving fully connected blocks has been trained using standard stochastic gradient descent with initial learning rate 0.08. This combination of optimization algorithms has been found to be the most effective for log-scale uniform priors, confirming the findings in [33]. Alternative algorithms that were tried and found to be too aggressive (low training error but not as good generalization) were ADAM [17], NADAM [9], and SGD with momentum. For scale-mixture priors, on the other hand, we rely on NADAM with initial learning rate 0.001. The learning rates were decreased by a factor of 1/2 if at the end of an epoch the relative improvement in validation error was below 0.1%. Moreover, if the validation error degraded, the training would continue using the model from the previous epoch (learning rates would again be decreased by 1/2). We terminate the training process after at most 30 epochs or upon observing no improvement in the validation error for 3 successive epochs. The training procedure, as well as the required Bayesian variational components, have been implemented using the MXNET package (PYTHON). A more detailed description of the employed models can be found in Appendix B.

PARZEN WINDOW        CNN 1D PER (%)    CNN 2D PER (%)
EPANECHNIKOV         18.9              18.6
SQ. EPANECHNIKOV     18.7              18.5
GAUSSIAN             18.8              18.6

Table 2: The table reports the phoneme error rates of context dependent models based on Parzen filters and standard convolutions as feature extraction blocks (combined with max pooling and log-scale uniform priors). The outputs of such a block are passed to an MLP with RELU activations.

BASELINE                                              PER (%)
SINCNET [multi-objective training, 33]                18.0
MFCC MLP [multi-objective training, 34]               18.2
RAW SPEECH CNN [multi-objective training, 33, 34]     18.3
TD-FILTERBANK CNN-MLP [41]                            18.0
DSS-CNN [27]                                          18.7
WAVENET [39]                                          18.8

Table 3: The table provides the phoneme error rates of different context dependent models from relevant related work (none of which is based on stochastic variational Bayesian inference).

In Tables 2 and 3, we summarize our empirical results and compare them to the numbers reported in relevant related work for context dependent models. Generally, we found that stochastic variational inference promotes sparse solutions and tends to trim too many parameters in the early stages of training. This is a known issue characteristic of stochastic Bayesian variational learning that has also been observed in [35] and [25]. To address this issue, [35] has proposed to rescale the Kullback–Leibler regularization term with a hyperparameter ρ_t such that ρ_{t+1} = min{1, ρ_t + c} with ρ_0 = 0 and some constant 0 < c < 1 (e.g., c = 0.2), and where t denotes the epoch number (starting from t = 0). We have followed this heuristic and observed an improvement in accuracy. The most similar related approach to our work is SINCNET [33], where filters defined via the difference between two sinc functions have been pursued. In contrast to this work, we offer more flexibility by providing a whole family of filters with the ability to directly influence the center frequency as well as the bandwidth and scale of a filter. Moreover, our filters have different supports in the time domain, whereas all the filters in [33] are of fixed length. Besides the parametric difference, we rely on variational inference instead of standard backpropagation and do not perform multi-objective training to improve generalization. In particular, the latter refers to training the model by optimizing a combination of log-likelihood loss functions, one for the context dependent (i.e., tri-phone) and the other for the context independent (i.e., monophone) model. The approach in [41] is also motivated by scattering operators and learns a family of filters for filtering short frames of raw speech signal (25 ms), just as in mel-frequency coefficients. Several neighboring frames are then concatenated into a representation that also involves delta and delta-delta features [e.g., see 11]. While possible, we have not considered adding delta features, which are known for improving the accuracy by 1-2%. Important differences compared to this work are that we filter much longer signals, rely on variational inference for filter-learning, and use only 3 or 4 convolution blocks on top of our Parzen filters (compared to 7 in [41]). Another closely related approach is [27], where the authors have used several static wavelet resolutions from [3] in combination with convolutions and fully connected multi layer perceptrons. Overall, the empirical results indicate a competitive performance of our representation and variational inference. We also note that, to the best of our knowledge, this is the first time that a variational approach has proved to be competitive with state-of-the-art artificial neural networks on speech recognition.
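A short sketch of the Kullback–Leibler warm-up schedule of [35] used here, with c = 0.2 as in our experiments; the loss terms in the usage comment are placeholders, not names from our implementation.

```python
def kl_scale(epoch, c=0.2):
    """Warm-up factor rho_t with rho_0 = 0 and rho_{t+1} = min(1, rho_t + c)."""
    return min(1.0, epoch * c)

# usage inside a training loop (log_likelihood and kl_term are placeholders)
# loss = -log_likelihood + kl_scale(epoch) * kl_term
```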

POOLING TYPE            PER (%)
MAX POOLING             18.7
L1 AVERAGE POOLING      19.3
L2 AVERAGE POOLING      19.2

Table 4: The table reports the phoneme error rates of context dependent models for different pooling operators combined with squared Epanechnikov Parzen filters, 1D convolutions, and log-scale uniform priors (with the Gauss–Hermite quadrature).

PRIOR DISTRIBUTION                      PER (%)
SCALE-MIXTURE (KL FROM [5])             18.9
SCALE-MIXTURE (GAUSS–HERMITE)           18.8
LOG-SCALE UNIFORM (KL FROM [25])        18.8
LOG-SCALE UNIFORM (GAUSS–HERMITE)       18.7

Table 5: The table reports the phoneme error rates obtained using context dependent models based on different priors and approximations of the Kullback–Leibler divergence integrals (squared Epanechnikov filters, max-pooling, and 1D convolutions).

In addition to the described experiments, we also evaluate the influence of different pooling operators and the effectiveness of the proposed approximation of the Kullback–Leibler divergence term. Tables 4 and 5 summarize the results of these experiments and indicate that the proposed approximation of the Kullback–Leibler divergence based on the Gauss–Hermite quadrature is effective across different priors. As for the considered pooling operators, our findings indicate that max-pooling seems to be the most effective in combination with Parzen filters and one dimensional convolutions.

We conclude with a reference to another recent related work based on scattering operators, where wavelet filters were optimized jointly with neural networks [16]. In contrast to our empirical study, the approach considers context independent (i.e., monophone) models for automatic speech recognition and the reported results are, thus, not directly comparable. Also, the parameterization of filters (i.e., center frequency and bandwidth) is slightly more complicated than the one pursued in this work.


Acknowledgment: The authors were supported in part by EPSRC grant EP/R012067/1.

References

[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover, 9th edition, 1972.
[2] Y. Altun, A. J. Smola, and T. Hofmann. Exponential families for conditional random fields. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 2–9. AUAI Press, 2004.
[3] J. Andén and S. Mallat. Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16):4114–4128, 2014.
[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. arXiv pre-print arXiv:1607.06450, 2016.
[5] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622. PMLR, 2015.
[6] W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5:603–643, 1991.
[7] H. Chipman. Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1):17–36, 1996.
[8] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.
[9] T. Dozat. Incorporating Nesterov momentum into Adam. 2015.
[10] W. Fisher, G. Doddington, and K. Goudie-Marshall. The DARPA speech recognition research database: specifications and status. In Proceedings of DARPA Workshop on Speech Recognition, pages 93–99, 1986.
[11] M. Gales and S. Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304, 2007.
[12] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[15] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
[16] H. Khan and B. Yener. Learning filter widths of spectral decompositions with wavelets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4606–4617. Curran Associates Inc., 2018.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
[18] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, 2014.
[19] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.
[20] R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2747–2755. PMLR, 2018.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[22] S. Kullback. Information Theory and Statistics. Wiley, 1959.
[23] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[24] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
[25] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2498–2507. PMLR, 2017.
[26] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[27] V. Peddinti, T. Sainath, S. Maymon, B. Ramabhadran, D. Nahamoo, and V. Goel. Deep scattering spectrum with deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 210–214, 2014.
[28] M. Plancherel. Contribution à l'étude de la représentation d'une fonction arbitraire par les intégrales définies. In Rendiconti del Circolo Matematico di Palermo, volume 30, pages 289–335, 1910.
[29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE ASRU, 2011.
[30] P. Prandoni and M. Vetterli. Signal Processing for Communications. New York: Taylor & Francis Group, 1st edition, 2008.
[31] A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1225–1235. PMLR, 2017.
[32] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2005.
[33] M. Ravanelli and Y. Bengio. Speech and speaker recognition from raw waveform with SincNet. arXiv pre-print arXiv:1812.05920, 2018.
[34] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Batch-normalized joint training for DNN-based distant speech recognition. 2016 IEEE Spoken Language Technology Workshop (SLT), pages 28–34, 2016.
[35] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3745–3753. Curran Associates Inc., 2016.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[37] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[38] A. Trouvé and L. Younes. Local geometry of deformable templates. SIAM Journal on Mathematical Analysis, 37(1):17–59, 2005.
[39] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv pre-print arXiv:1609.03499, 2016.
[40] S. Wang and C. Manning. Fast dropout training. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 118–126. PMLR, 2013.
[41] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux. Learning filterbanks from raw speech for phone recognition. In ICASSP, pages 5509–5513. IEEE, 2018.


A Proofs

Proposition 2. The Kullback–Leibler divergence term defined with a dropout Gaussian posterior and a (dropout) scale-mixture prior distribution can be approximated by
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{sm}} \right) \approx -\log \sqrt{2\pi\alpha\mu^2} - \frac{1}{\sqrt{\pi}} \sum_{i=1}^{s} w_i \log p_{\mathrm{sm}}(v_i) - \frac{1}{2} \quad \text{with} \quad v_i = \left( \sqrt{2\alpha}\, u_i + 1 \right) \mu ,
\]
where {u_i}_{i=1}^{s} are the roots of the Hermite polynomial with corresponding quadrature weights {w_i}_{i=1}^{s}, α and μ are variational parameters, and p_sm is some scale-mixture prior distribution.

Proof. We can re-write the Kullback–Leibler divergence term as
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{sm}} \right) = \int q(u) \log q(u)\, du - \int q(u) \log p_{\mathrm{sm}}(u)\, du = -H(q) - \mathbb{E}_q\!\left[ \log p_{\mathrm{sm}}(u) \right] ,
\]
where H(q) is the entropy of the univariate dropout Gaussian distribution given by
\[
q(u) = \frac{1}{\sqrt{2\pi\alpha\mu^2}} \exp\!\left( -\frac{(u - \mu)^2}{2\alpha\mu^2} \right) .
\]
As the entropy of a Gaussian distribution defines an analytically tractable integral [e.g., see 22, 32], we have that the entropy of q is given by
\[
H(q) = \log \sqrt{2\pi\alpha\mu^2} + \frac{1}{2} .
\]
On the other hand, the expected log-likelihood of the scale-mixture prior can be approximated using the Gauss–Hermite quadrature by observing that
\[
\mathbb{E}_q\!\left[ \log p_{\mathrm{sm}}(u) \right] = \frac{1}{\sqrt{2\pi\alpha\mu^2}} \int \exp\!\left( -\frac{(u - \mu)^2}{2\alpha\mu^2} \right) \log p_{\mathrm{sm}}(u)\, du = \frac{1}{\sqrt{\pi}} \int \log p_{\mathrm{sm}}\!\left( \sqrt{2\alpha\mu^2}\, t + \mu \right) \exp(-t^2)\, dt .
\]
The result now follows from Theorem 1 by taking h(t) = log p_sm(√(2αμ²) t + μ).

Proposition 3. The Kullback–Leibler divergence term defined with a dropout Gaussian posterior and the log-scale uniform prior distribution can be approximated by
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{lsu}} \right) \approx -\frac{1}{2} \log \alpha + \frac{1}{\sqrt{\pi}} \sum_{i=1}^{s} w_i \log |v_i| + \text{const.} \quad \text{with} \quad v_i = \sqrt{2\alpha}\, u_i + 1 .
\]

Proof. From [19, Appendix C], we know that the Kullback–Leibler divergence is given by
\[
\mathrm{KL}\!\left( q \,\|\, p_{\mathrm{lsu}} \right) = \mathbb{E}_{\mathcal{N}(\varepsilon \mid 1, \alpha)}\!\left[ \log |\varepsilon| \right] - \frac{1}{2} \log \alpha + \text{const.}
\]
The expectation with respect to the Gaussian random variable ε can be re-written as
\[
\mathbb{E}_{\mathcal{N}(\varepsilon \mid 1, \alpha)}\!\left[ \log |\varepsilon| \right] = \frac{1}{\sqrt{2\pi\alpha}} \int \exp\!\left( -\frac{(\varepsilon - 1)^2}{2\alpha} \right) \log |\varepsilon|\, d\varepsilon = \frac{1}{\sqrt{\pi}} \int \log \left| \sqrt{2\alpha}\, t + 1 \right| \exp(-t^2)\, dt .
\]
The result now follows from Theorem 1 by taking h(t) = log|√(2α) t + 1|.


B Models

BLOCK                         | PARAMETERS         | INPUT (NUM. SAMPLES) | OUTPUT (NUM. SAMPLES)
MEAN-VARIANCE NORMALIZATION   | –                  | 3200 (200 ms)        | 3200
LAYER NORMALIZATION           | 2 × 3200           | 3200                 | 3200
PARZEN FILTERS (1-25 ms)      | 2 × 128 × 3        | 3200                 | 128 × 2802
MAX POOLING 1D                | –                  | 128 × 2802           | 128 × 934
LAYER NORMALIZATION + RELU    | 2 × 934            | 128 × 934            | 128 × 934
CONV 1D                       | 2 × 64 × 128 × 5   | 128 × 934            | 64 × 930
MAX POOLING 1D                | –                  | 64 × 930             | 64 × 310
LAYER NORMALIZATION + RELU    | 2 × 310            | 64 × 310             | 64 × 310
CONV 1D                       | 2 × 64 × 64 × 5    | 64 × 310             | 64 × 306
MAX POOLING 1D                | –                  | 64 × 306             | 64 × 102
LAYER NORMALIZATION + RELU    | 2 × 102            | 64 × 102             | 64 × 102
CONV 1D                       | 2 × 64 × 64 × 5    | 64 × 102             | 64 × 98
MAX POOLING 1D                | –                  | 64 × 98              | 64 × 33
LAYER NORMALIZATION + RELU    | 2 × 33             | 64 × 33              | 64 × 33
LINEAR BLOCK                  | 2 × 2113 × 1024    | 2112                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK + SOFTMAX        | 1025 × 1936        | 1024                 | 1936

Table 6: The table describes the structure of our one dimensional convolution model based on Parzen filters.

BLOCK                         | PARAMETERS         | INPUT (NUM. SAMPLES) | OUTPUT (NUM. SAMPLES)
MEAN-VARIANCE NORMALIZATION   | –                  | 3200 (200 ms)        | 3200
LAYER NORMALIZATION           | 2 × 3200           | 3200                 | 3200
PARZEN FILTERS (1-25 ms)      | 2 × 128 × 3        | 3200                 | 128 × 2802
MAX POOLING 2D (1 × 3)        | –                  | 128 × 2802           | 128 × 934
LAYER NORMALIZATION + RELU    | 2 × 934            | 128 × 934            | 128 × 934
CONV 2D                       | 2 × 32 × 10 × 5    | 128 × 934            | 32 × 25 × 930
MAX POOLING 2D (1 × 3)        | –                  | 32 × 25 × 930        | 32 × 25 × 310
LAYER NORMALIZATION + RELU    | 2 × 310            | 32 × 25 × 310        | 32 × 25 × 310
CONV 2D                       | 2 × 32 × 5 × 5     | 32 × 25 × 310        | 32 × 21 × 306
MAX POOLING 2D (2 × 3)        | –                  | 32 × 21 × 306        | 32 × 11 × 102
LAYER NORMALIZATION + RELU    | 2 × 102            | 32 × 11 × 102        | 32 × 11 × 102
CONV 2D                       | 2 × 32 × 3 × 5     | 32 × 11 × 102        | 32 × 9 × 98
MAX POOLING 2D (2 × 3)        | –                  | 32 × 9 × 98          | 32 × 5 × 33
LAYER NORMALIZATION + RELU    | 2 × 33             | 32 × 5 × 33          | 32 × 5 × 33
CONV 2D                       | 2 × 32 × 3 × 5     | 32 × 5 × 33          | 32 × 3 × 29
MAX POOLING 2D (1 × 3)        | –                  | 32 × 3 × 29          | 32 × 3 × 10
LAYER NORMALIZATION + RELU    | 2 × 10             | 32 × 3 × 10          | 32 × 3 × 10
LINEAR BLOCK                  | 2 × 961 × 1024     | 960                  | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK                  | 2 × 1025 × 1024    | 1024                 | 1024
BATCH NORMALIZATION + RELU    | 2 × 1024           | 1024                 | 1024
LINEAR BLOCK + SOFTMAX        | 1025 × 1936        | 1024                 | 1936

Table 7: The table describes the structure of our two dimensional convolution model based on Parzen filters.


C Initialization Scheme

We initialize the centers of Parzen filters by keeping them equidistant in the mel-scale and use the heuristic from [41] for the filter bandwidths. We limit the filter lengths in the time domain such that the width of a Parzen window is at least 1 ms and at most 25 ms (note that for Epanechnikov filters the time-domain filter has finite support, whereas the Gaussian filter has infinite support). The center frequency of a Parzen filter was bounded/clipped so that the minimal possible frequency is 50 Hz and the maximal one is 7950 Hz. The Kullback–Leibler divergence term was re-scaled as in [35] with c = 0.2 (see also the definition of the scaling hyperparameter ρ_t in Section 4). The dropout parameter α is stored in the log-form and the initial value across blocks (apart from normalization and softmax blocks) was set to −3.0, which corresponds to a variational standard deviation of ≈ 0.22 |μ|, where μ denotes the variational mean of a network parameter. The parameter α was also bounded/clipped as in [25] so that the minimal value is 0.0001 and the maximal one is set to 16. Moreover, for normalization layers and the final softmax block the prior parameter α was set to a value close to zero because dropout is rarely applied to these network blocks (we are unaware of such models). The fully connected layers were initialized by sampling uniformly at random from the interval (−0.01/√(p+q), 0.01/√(p+q)), where p and q denote the number of inputs and outputs corresponding to such a block. The bias parameters corresponding to fully connected blocks are initialized with zero-vectors. The mean and scale parameters in normalization layers [4, 14] were initialized to zero and one, respectively. The convolution parameters are initialized by sampling uniformly at random from the interval (−1/√r, 1/√r), where r denotes the total number of parameters in a convolution filter (the same interval and sampling strategy was used to initialize the convolution bias parameters).

We have observed that the optimization of a two dimensional convolution model can encounter numerical problems with FLOAT32 precision (the moving loss goes to −inf). In such cases, it appears that a minor change to the loss function resolves the issue without a significant impact on the accuracy. In particular, we transform the log-softmax probabilities as follows
\[
\log p \longrightarrow \log\!\left( (1 - 2\kappa)\, p + \kappa \right) \quad \text{with} \quad \kappa \to 0 \ \text{(a jitter constant close to zero)} .
\]
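A one-line NumPy sketch of this stabilization; the jitter value is an arbitrary illustration.

```python
import numpy as np

def stable_log_softmax_probs(p, kappa=1e-7):
    """log p -> log((1 - 2*kappa) * p + kappa), keeping the argument away from 0 and 1."""
    return np.log((1.0 - 2.0 * kappa) * p + kappa)
```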

We have placed zero-mean priors on the weights of fully connected blocks, convolution filters, as well as on the parameters for means and scales (parameterized as 1 − scale) of normalization blocks. For the weights of Parzen filters, on the other hand, we have opted for the prior mean to be equal to the initial values of the filter parameters (the variances are scaled means, just as in dropout posteriors).

The best results with scale-mixture priors were obtained using the following combination of parameters: λ = 0.25, σ_1 = 0.005, and σ_2 = 1.0 or σ_2 = 0.5.
