DATA-DRIVEN NEURAL NETWORK BASED FEATURE
FRONT-ENDS FOR AUTOMATIC SPEECH RECOGNITION

by

Samuel Thomas

A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland
December, 2012

© Samuel Thomas 2012
All rights reserved
List of Tables

3.1 Word Recognition Accuracies (%) using different Tandem features derived using only 1 hour of English data
3.2 Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features
3.4 Word Recognition Accuracies (%) using two languages - Spanish and English
3.5 Word Recognition Accuracies (%) using three languages - Spanish, German and English
4.1 Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features
4.2 Word Recognition Accuracies (%) with semi-supervised pre-training
4.3 Word Recognition Accuracies (%) at different word confidence thresholds
4.4 Word Recognition Accuracies (%) with semi-supervised pre-training
4.5 Word Recognition Accuracies (%) with semi-supervised acoustic model training
5.2 Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions
5.3 Integrating MLP based event detectors with ASR
6.1 Performances in a low-resource setting using different data-driven front-ends proposed in the thesis.
List of Figures

1.1 Broad Classification of Feature Transforms for ASR.
1.2 Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - Eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, first 6 KL spectral basis functions derived by PCA analysis.
1.3 LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
1.4 (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) Frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
2.1 Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.
2.2 PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3 Schematic of the joint spectral envelope, modulation features for posterior based ASR
3.1 Schematic of the proposed training technique with multiple output layers
3.2 Deriving cross-lingual and multi-stream posterior features for low resource LVCSR systems
3.3 Tandem and bottleneck features for low-resource LVCSR systems.
4.1 (a) Wide and (b) Deep neural network topologies for data-driven features
4.2 Data driven front-end built using data from the same language but from a different genre.
4.3 A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions.
4.4 LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends
5.2 Average precision for different configurations of the wide topology front-ends
Chapter 1
Introduction
This chapter introduces the automatic speech recognition problem and machinery.
The theme of the thesis - developing data-driven feature extractors for speech recognition - is
motivated, along with a discussion of techniques that have been developed in the past. The
chapter also outlines the thesis and its contributions.
1.1 Overview of Automatic Speech Recognition
Automatic speech recognition is the process of transcribing speech into text. Cur-
rent speech recognition systems solve this task in a probabilistic setting using four key
components: a feature extraction module, an acoustic model, a pronunciation dictionary
and a language model. In a word recognition task, given an acoustic signal corresponding
to a sequence of words X = x_1 x_2 . . . x_n, the feature extraction module first generates a
compact representation of the input as a sequence of feature vectors Y = y_1 y_2 . . . y_t. The
acoustic model, pronunciation dictionary and a language model are then used to find the
most probable word sequence X given these feature vectors. This is done by expressing the
desired probability p(X|Y) using Bayes' theorem as
X̂ = arg max_X p(X|Y) = arg max_X  p(Y|X) p(X) / p(Y)    (1.1)
p(X) is the a priori probability of observing a sequence of words in the language, inde-
pendent of any acoustic evidence and is modeled using the language model component.
p(Y |X) corresponds to the likelihood of the acoustic features Y being generated given the
word sequence X.
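As a toy illustration of Eq. (1.1), decoding can be written as a log-domain search over candidate word sequences: since p(Y) is the same for every hypothesis, it drops out of the arg max. The two scoring functions below are assumed placeholders for the acoustic and language model components.

```python
def decode(hypotheses, acoustic_loglik, lm_logprob):
    """Return the word sequence X maximizing log p(Y|X) + log p(X).

    p(Y) is constant across hypotheses and is dropped (Eq. 1.1).
    """
    return max(hypotheses, key=lambda X: acoustic_loglik(X) + lm_logprob(X))

# Hypothetical usage, with am_score and lm_score supplied by the models:
# best = decode(["the cat sat", "the cats at"], am_score, lm_score)
```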
In current ASR systems, both the language model and the acoustic model are
stochastic models trained using large amounts of training data [1, 2]. Hidden Markov Models
(HMMs) or a hybrid combination of neural networks and HMMs [3] are typically used as
acoustic models.
For large vocabulary speech recognition, not all words have an adequate number of
acoustic examples in the training data. The acoustic data also covers only a limited vocabulary of words. Instead of estimating unreliable probability distributions for entire words
or utterances from limited examples, acoustic models are built for basic speech sounds.
Using these basic units, recognizers can also recognize words without acoustic
training examples.
To compute the likelihood p(Y|X), each word in the hypothesized word sequence X
is first broken down into its constituent phones using the pronunciation dictionary. A single
composite model for the hypothesis is then constructed by combining individual phone
HMMs. In practice, to account for the large variability of basic speech sounds, HMMs
of context dependent speech units with continuous density output distributions are used.
There exist efficient algorithms like the Baum-Welch algorithm to learn the parameters of
these acoustic models from training data [4].
N-grams, typically bi-grams or tri-grams, are used as language models to generate
the a priori probability p(X) [2]. Although p(X) is the probability of a sequence of words,
N-grams model this probability assuming that the probability of any word x_i depends only
on the N-1 preceding words. These probability distributions are estimated from simple frequency
counts that can be directly obtained from large amounts of text. To account for the inability
to estimate counts for all possible N-gram sequences, techniques like discounting and back-off
are used [5].
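A minimal sketch of such a count-based bigram model is shown below. The stupid-backoff-style fallback and the weight alpha are illustrative simplifications, not the discounting and back-off schemes of [5].

```python
from collections import Counter

def train_bigram_lm(sentences, alpha=0.4):
    """Bigram scores from simple frequency counts; unseen bigrams
    back off to a scaled unigram estimate (illustrative only)."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:
        words = ['<s>'] + sent.split() + ['</s>']
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)

    def score(word, prev):
        if bigrams[(prev, word)] > 0:
            return bigrams[(prev, word)] / unigrams[prev]
        return alpha * unigrams[word] / total  # back off to unigram
    return score

# lm = train_bigram_lm(["the cat sat", "the dog sat"])
# lm('sat', 'cat') -> 1.0; lm('dog', 'cat') backs off to the unigram
```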
1.2 Conventional Feature Extraction Techniques for ASR
Front-ends for ASR, which have traditionally evolved from coding techniques like
linear predictive coding (LPC) [6], start by performing a short-term analysis of the speech
signal. Based on the assumption that speech is stationary in sufficiently short-time intervals,
the power spectrum (squared magnitude of the short-time Fourier spectrum) of the signal
is computed every 10 ms in overlapping Hamming analysis windows of 25 ms duration
[7, 8]. This spectral representation of speech is then transformed into an auditory-like
representation by warping the frequency axis to the Mel or Bark scale and applying a non-
linear cubic root or logarithmic compression. Mel-frequency Cepstral Coefficients (MFCC)
[9] or Perceptual Linear Prediction (PLP) [10] features for speech recognition are cepstral
coefficients derived by projecting the auditory-like representation onto a set of discrete cosine
transform (DCT) basis functions. Since these techniques analyze the speech signal only in
short analysis windows, information about local dynamics of the underlying speech signal is
often provided by augmenting these features with derivatives of the cepstral trajectories at
each instant [11]. In speech recognition applications, the first 13 cepstral coefficients along
with their deltas and double-deltas are typically used.
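The short-term analysis just described can be sketched as follows; the triangular Mel filter bank here is a simplified, untuned illustration, and delta computation is omitted.

```python
import numpy as np
from scipy.fft import dct

def mel_fbank(n_mels, n_bins, sr):
    """Simple triangular Mel filter bank (illustrative, not tuned)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_bins - 1) * pts / (sr / 2.0)).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:c + 1] = np.linspace(0.0, 1.0, max(c - lo + 1, 1))
        fb[i, c:hi + 1] = np.linspace(1.0, 0.0, max(hi - c + 1, 1))
    return fb

def mfcc_like(signal, sr=16000, n_mels=26, n_ceps=13):
    """25 ms Hamming windows every 10 ms: power spectrum, Mel warping,
    log compression, DCT to cepstra; deltas would be appended per [11]."""
    wlen, hop = int(0.025 * sr), int(0.010 * sr)
    window = np.hamming(wlen)
    fbank = mel_fbank(n_mels, wlen // 2 + 1, sr)
    feats = []
    for start in range(0, len(signal) - wlen + 1, hop):
        frame = signal[start:start + wlen] * window
        power = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum
        logmel = np.log(fbank @ power + 1e-10)       # warp + compress
        feats.append(dct(logmel, norm='ortho')[:n_ceps])
    return np.array(feats)                           # (frames, n_ceps)
```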
1.3 Integrating Training Data with Feature Extraction
In practical classification settings, the goal of a classifier is to assign one of J class
labels to an entity given an N-dimensional feature vector x. One approach to this problem
involves inferring posterior class probabilities p(Cj |x) of each class given the features. The
entity is then assigned to the class that gives the highest class posterior probability [12].
The posterior probability p(Cj |x) of each class can be estimated in multiple ways.
In a Bayesian formulation, p(C_j|x) can be expanded as p(x|C_j)p(C_j)/p(x). The quantities
p(x|C_j) and p(C_j) are then separately estimated using generative models trained to capture
these distributions from data. The probability p(Cj |x) can also be estimated directly from
a parametric model, whose parameters have also been optimized using the training data.
A non-probabilistic approach to the classification problem involves discriminant
functions that predict the class label of the input [13]. In this framework, classification
is viewed as partitioning of the input feature space into different classes using decision
boundaries or surfaces. For a simple two class problem, a linear discriminant function can
be constructed as the linear combination of the input feature vector with a weight vector
w as
f(x, w) = w^T x + w_0.    (1.2)
In the N-dimensional input space, the function f(x, w) = w^T x + w_0 forms an (N-1)-dimensional
hyperplane that assigns x to class C_1 if f(x, w) ≥ 0 and to class C_2 otherwise.
Discriminant functions can be further extended as generalized linear discriminant functions,
of the form
f(x, w) = w^T φ(x) + w_0,    (1.3)

where φ(·) is a fixed linear or non-linear vector function of the original input vector x. Using
these functions, we can, for example, design a J-class discriminant for the J-class problem
with J linear functions of the form

f_j(x, w_j) = w_j^T φ(x) + w_0.    (1.4)

x is assigned to class C_k if f_k > f_j for all j ≠ k.
From a feature extraction perspective, discriminant functions provide an interest-
ing avenue for integrating information from the data through the data dependent transfor-
mations of the input features. An example of a linear discriminant function is Fisher’s linear
discriminant. In this method, instead of using the linear combination of the input vector
to form a hyperplane for class assignment, the linear combination is used as a dimension-
ality reduction technique. The weight vector w is designed as a set of basis functions that
projects the feature vector x to a lower dimension such that there is maximal separation
between class means and the variance within each class is minimized. A common criterion used
for this objective is defined as
F(w) = trace(S_w^{-1} S_b),    (1.5)
where S_w and S_b are the within-class and between-class covariance matrices of the data. If
the dimensionality of the new projection space is M, the weight vectors can be shown to
be the basis functions corresponding to the M eigenvectors of S_w^{-1} S_b with the largest
eigenvalues [12].
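A minimal sketch of this procedure, assuming S_w is invertible and ignoring the numerical regularization a practical implementation would need:

```python
import numpy as np

def fisher_lda_basis(X, labels, m):
    """Return the M leading eigenvectors of Sw^{-1} Sb (Eq. 1.5).

    X: (n_samples, n_dims) features; labels: class index per sample.
    """
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)        # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)      # between-class scatter
    eigval, eigvec = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigval.real)[::-1][:m]
    return eigvec[:, order].real             # (d, m) projection basis
```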
More powerful discriminant functions can be designed by using non-linear basis
functions. In feed-forward neural networks, which are classic examples of these models, the
generalized linear discriminant function is modified as
f(x, w) = g( ∑_{k=1}^{K} w_k φ_k(x) ),    (1.6)

where g(·) is a non-linear activation function and φ_k is now a non-linear basis function. During the training
phase, the basis functions and the weights are adjusted using the training data [13].
In a two-layer neural network, for example, the processing starts by creating linear
combinations of the N dimensional feature vector at each of the K hidden layer units. With
each of the hidden nodes being connected to every input node through a set of weights, an
activation input of the form
a_k = ∑_{n=1}^{N} w_{nk} x_n + w_{k0},    (1.7)
is first produced at each node. Each node activation then passes through a differentiable,
nonlinear activation function ψ(·) to produce output activations b_k = ψ(a_k). Commonly
used activation functions are nonlinear sigmoidal functions like the logistic sigmoid or the
‘tanh’ function. The weight w_{nk} is a trainable parameter connecting input node n and hidden
node k. w_{k0} is a fixed bias term of the hidden node. Activation outputs of the hidden layer
are then linearly combined to form output unit activations. Each of the M output
nodes receives an activation input
a_m = ∑_{k=1}^{K} w_{km} b_k + w_{m0}    (1.8)
to produce an output of the form c_m = σ(a_m), where σ(·) is the ‘softmax’ activation function
defined as
σ(a_m) = exp(a_m) / ∑_{m'} exp(a_{m'}),    (1.9)
for multi-class classification problems. Using (1.7)-(1.9), the overall network function can
be written as
h_m(x, w) = σ( ∑_{k=0}^{K} w_{km} ψ( ∑_{n=0}^{N} w_{nk} x_n ) ).    (1.10)
Comparing (1.6) with (1.10) shows how the non-linear basis functions ψ(.) are also now
learnt like the weight parameters. There are different training algorithms to learn these
parameters. In commonly used training methods, model parameters are optimized using
a cross-entropy error criterion and techniques like error back-propagation. For speech
applications, multilayer perceptrons (MLP) can be used to estimate posterior probabilities
of speech classes like phonemes, conditioned on the input features [3, 14].
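A sketch of the forward pass of Eqs. (1.7)-(1.10), with ‘tanh’ hidden units, explicit bias terms and a numerically stable softmax:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Two-layer MLP producing class posteriors for one input frame.

    x: (N,) features; W1: (K, N), b1: (K,); W2: (M, K), b2: (M,).
    """
    a = W1 @ x + b1                 # hidden activations, Eq. (1.7)
    h = np.tanh(a)                  # nonlinearity psi(.)
    c = W2 @ h + b2                 # output activations, Eq. (1.8)
    e = np.exp(c - c.max())         # stable softmax, Eq. (1.9)
    return e / e.sum()              # posteriors over M classes
```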
1.4 Review of Data-driven Feature Transforms for ASR
Both the transforms reviewed above - transforms with linear basis functions and
transforms with non-linear basis functions - form starting points for the development of more
complex data-driven feature transforms and acoustic model backends in speech recognition.

Figure 1.1: Broad Classification of Feature Transforms for ASR.
Although transforms like the discrete Fourier transform and the discrete cosine transform
have been used, neither of these transforms is data driven. There has hence been considerable interest in improving these front-ends with more powerful data-driven techniques.
Figure 1.1 is a schematic of how data-driven feature extraction or transformation techniques
for ASR can be broadly classified. There are clearly two distinct sets of transformation
classes - while one set of transforms is strongly tied to the feature extraction module,
the second set is strongly coupled with the acoustic model and its training criteria. We call
the first class front-end feature transforms and the second class back-end feature transforms.
1.4.1 Front-end feature transforms
Data-driven feature extractors at the front-end operate directly on time-frequency
representations of speech. As shown in Figure 1.1, these transforms can be further cate-
gorized into two broad groups - data independent projections and data-driven projections.
Examples of data independent projections are the DCT transforms discussed earlier. Al-
though these are a set of fixed cosine basis functions, they are very similar to basis functions
that can be derived from a direct principal component analysis (PCA) [15] on the auditory
spectrum of speech. Principal component analysis or the Karhunen-Loeve transform (KLT)
is a mathematical procedure that transforms a set of observations from possibly correlated
variables into a new set of values corresponding to linearly uncorrelated variables or prin-
cipal components. Figure 1.2 (reproduced from [16]) shows a set of spectral basis functions derived
using the data-dependent Karhunen-Loeve transform (KLT) on filter bank outputs using 2
hours of speech from the OGI Stories database [17]. The basis functions are very similar to
the cosine functions used in conventional features. The flatness of the first basis function
shows that the variation in the average energy is what contributes the most to the variance
of auditory representations.
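Such a KLT basis can be estimated with a few lines of linear algebra; the sketch below assumes frames of log critical-band energies as input.

```python
import numpy as np

def klt_basis(spectra, n_basis=6):
    """Estimate spectral basis functions from auditory-spectrum frames.

    spectra: (n_frames, n_bands) array, e.g. log Bark-band energies.
    """
    centered = spectra - spectra.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # total covariance
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_basis]
    return eigvec[:, order]   # columns: KLT spectral basis functions
```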
LDA using the Fisher discriminant criterion described earlier has been a
useful tool in the development of many techniques in the second class of projections - data-
dependent projections. This class is sub-divided further into two groups - a set of transforms
that use linear basis derived by solving a generalized eigenvalue decomposition problem
and those which use neural network based techniques with non-linear basis functions.

Figure 1.2: Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - Eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, first 6 KL spectral basis functions derived by PCA analysis.

In
early work, Brown [18] and Hunt [19] have used LDA on features in speech recognition.
Hunt and his colleagues integrated LDA with Mel-auditory representations of speech in
a framework they called IMELDA - integrated Mel-scale representation with LDA [19, 20]. A
host of techniques have since been developed based on using LDA with HMM based speech
recognizers to improve recognition performances. These techniques have focused on the use
of different types of output classes like phones, subphones or HMM states, and on addressing
the limitations of LDA - class-conditional distributions are assumed to be normal with equal
covariance matrices. Apart from improving recognition performances, a series of works
by Malayath, van Vuuren, Valente and Hermansky [21–23] have analyzed the usefulness
of LDA with phonemes as output classes. Table 1.1 summarizes their key observations of
using LDA with different time-frequency representations of speech. All these techniques
while decorrelating the input feature vectors also maximize the class separability of the
desired output classes, leading to improvements in the recognition performances of ASR
systems.
Short-time Fourier spectrum - LDA is applied to the log-spectra of speech:
Discriminant vectors have a non-uniform analysis resolution with frequency - low frequency parts of the spectrum are analyzed with higher resolution than high frequency parts. This is consistent with the properties of the Mel/Bark filter-bank analysis used in conventional feature extraction techniques. Consistent with the properties of hearing, the sensitivity of features derived using these functions is inversely related to formant frequencies.

Critical-band spectrum - LDA is applied to critical band spectral features:
Unlike the first cosine function, the total energy of the spectrum is not used. The second and third discriminants capture spectral ripples in the central portion of the critical-band spectrum. The fourth basis uses information above 5 Bark. Figure 1.3 (reproduced from [22]) shows the important basis functions.

Trajectories of critical-band energies - LDA is applied to long segments of time trajectories:
Discriminant vectors form a set of FIR filters. The frequency responses of the first three discriminant vectors are consistent with the RASTA, delta and double-delta features used in ASR. Figure 1.4 (reproduced from [21]) compares the basis functions with the RASTA, delta and double-delta filters proposed by Furui [24].

Table 1.1: LDA with different representations of speech.
While the PCA and LDA techniques described above are useful in describing
transforms in the Euclidean space, manifold based techniques characterize data as being
embedded in a manifold space [25–27]. Several generic manifold learning techniques have
been adapted for application to speech data. While learning the manifold structure, several
of these techniques also model both global and local relationships between data points in
the manifold space as constraints. These learning problems are usually solved as optimization
problems or as generalized eigenvector problems.

Figure 1.3: LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
The second important class of front-end transforms uses neural networks. For acoustic
modeling, multilayer perceptron (MLP) based systems are trained on different kinds of
feature representations of speech to estimate posterior probabilities of output classes like
phonemes, conditioned on the input features [14]. Neural network based acoustic models
provide several key advantages -
Training criteria - Neural networks are trained to discriminate between output classes
using non-linear basis functions and a cross-entropy training criterion. This training
can also be scaled efficiently to work on large amounts of training data.
Figure 1.4: (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) Frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
Input feature assumptions - These networks can model high dimensional input features
without any strong assumptions about the probability distribution of these features.
Several different kinds of correlated feature streams can also be integrated together
since there are also no strong assumptions on statistical independence.
Output representations - MLPs trained on large amounts of data from a diverse collection of speakers and environments can achieve invariance to these unwanted variabilities. Since posterior probabilities are produced by these networks, outputs from
several networks trained on different feature representations can be combined in a
multi-stream fashion to improve the final posterior estimations.
In hybrid HMM/MLP systems [3], these posterior probabilities are used directly
as the scaled likelihoods of sound classes in HMM states instead of conventional state-
emission probabilities from GMM models (discussed in detail in Chapter 2). Alternatively,
these posteriors can be converted to features that replace conventional acoustic features
in HMM/GMM based systems via the Tandem technique [28] (also discussed in detail in
Chapter 2). Features from intermediate layers of neural networks have also been shown to
be useful for speech recognition [29,30].
Pinto et al. [31, 32] use a Volterra series based analysis to understand the behavior of
the non-linear transforms that are learned by MLPs trained to estimate phoneme posterior
probabilities. The linear Volterra kernels used to analyze MLPs trained on Mel-filter bank
features reveal interesting spectro-temporal patterns learnt by the trained system for each
phoneme class. An extended study on a hierarchy of MLPs using the same framework
shows that when a second MLP classifier is trained on posteriors estimated by an initial
MLP, it learns phonetic temporal patterns in the posterior features. These patterns include
phonetic confusions at the output of the first MLP as well as phonotactics of the language
learnt from the training data.
1.4.2 Back-end feature transforms
As shown in Figure 1.1, acoustic features after front-end level transforms are used
to train acoustic models. The distributions of basic speech sounds like phones are typically
represented by a Hidden Markov Model (HMM). Phone HMMs are constructed as finite state
machines with typically five states - a start state, three emitting states and an end state,
connected in a simple left-to-right topology. In each of the emitting states, multivariate
continuous density Gaussian mixture models are used to model the emission probability
distribution of feature vectors. To cover the large phonetic variability, separate HMMs
are trained for every basic speech unit, typically a phone, in context with a left and right
neighboring phone. Individual Gaussian parameters along with the mixing coefficients of the
Gaussian mixture models are estimated in a maximum likelihood framework [2]. However,
since the number of trainable tri-phone parameters is huge, additional techniques like state-
tying with phonetic decision trees are used. In a second stage of training, the acoustic
models are then discriminatively trained using objective functions such as maximum mutual
information (MMI) [33, 34], minimum phone error (MPE) [35] or minimum classification
error (MCE) [35]. To improve the performance of each of these two passes of acoustic
model training, separate feature transforms which adapt features to each training phase
have been proposed. This set of transforms forms the second major class of feature
transforms, called back-end feature transforms.
In the past, linear discriminant analysis has been investigated in several different
settings - to process feature vectors [18], as a transform to improve the discrimination between HMM states [36] and also as a feature rotation and reduction technique in a maximum
likelihood setting [37]. Kumar and Andreou generalized LDA with Heteroscedastic linear
discriminant analysis (HLDA) [38] by relaxing the assumption of sharing the same covari-
ance matrix among all output classes. Also developed in a maximum likelihood setting, the
Maximum Likelihood Linear Transform (MLLT) [39] has been shown to be a special case
of HLDA when there is no dimensionality reduction.
Feature space transforms like fMMI [40] and fMPE [41] on the other hand, are
linear transforms also applied on feature vectors but in a discriminative framework to opti-
mize the MMI/MPE objective functions. Similar to the early work in [42], region dependent
linear transforms (RDLT) [43] extend fMMI/fMPE by first partitioning the feature space
into different regions using a GMM. Each feature vector is then transformed by a linear
transform corresponding to the region that the vector belongs to, determined via posterior
probabilities from the pre-trained GMM.
State-of-the-art systems use a combination of both the front-end and back-end
transforms. Studies like [44] have shown that although these transforms are separately
applied at the feature and model level, they can be combined to significantly improve ASR
performances.
1.5 Focus of the thesis
The feature extraction module plays a very crucial “gate-keeper” role in any pat-
tern recognition task. If useful information for classification is discarded from the signal
by a poorly designed feature extractor, it cannot be recovered and the classification
task suffers. On the other hand, if the feature extraction module allows irrelevant and re-
dundant detail to remain in the features, the classification module has to be additionally
developed to cope with this. In speech recognition, a similar setting exists - a feature
extraction front-end first produces features for a pattern recognition back-end to recognize
words. To improve the performances in this setting, this thesis focuses on developing better
features for ASR through an efficiently designed front-end.
The review presented above describes one avenue of improvement for current
speech recognition feature front-ends - the development of better data-driven features. Fig-
ure 1.5 reiterates this point. The primary goal of speech recognition is to extract the message that the human communicator produced using an inventory of basic speech units. However,
the message is embedded among several constituent components of the speech signal as it
passes through a communication channel influenced by the human speaker, the transmission mechanism and the environment before it is captured by a machine using a microphone.
It is the goal of the feature extractor module to remove these irrelevant variabilities while
extracting useful features for the speech recognition back-end to recover the message.
Current speech recognition front-ends largely rely on information in the short-term spectrum
of speech. This representation is however very fragile and easily corruptible by channel
artifacts. It is hence necessary to extend the scope of information extraction to other
sources of knowledge. The best source of information is the data itself. This thesis hence
focuses on data-driven techniques to improve features for ASR.
In earlier sections, several techniques that allow data integration into feature ex-
traction were reviewed. Neural networks provide very interesting mechanisms of integrating
information not only because they are discriminatively trained and use non-linear basis func-
tions to transform the data but also because they have been shown to have several other
key advantages.

Figure 1.5: Thesis contributions to developing better data-driven neural network features for the ASR pipeline.

For example, they can accommodate large feature dimensions and do not
place strong assumptions on the distributions of these features. A very significant advan-
tage is that they can also directly produce posterior probabilities of speech classes, making
the posteriogram representation of speech - the evolution of the posteriors of speech classes
like phonemes over time - a useful source of information for speech recognition (see Figure
1.5). As can be seen, this representation is devoid of speaker and channel variabilities and is
linked more closely to the underlying speech message encoded using basic speech units like
phonemes.
The performance of these data-driven feature extractors is however linked to sev-
eral factors. The MLP estimates posterior probabilities of phoneme classes ci conditioned
on the input acoustic features x and the model parameters w as p(ci|x,w). The factors
that hence determine the goodness of the posteriogram representation are -
(a) The input acoustic features: Robust acoustic features which capture information from
the rich spectro-temporal modulations of speech need to be designed.
(b) The amount of training data: Significant amounts of task dependent data need to be
used to train the parameters of neural network models.
(c) Network architectures: Suitable network architectures have to be used to learn the
data-driven transforms.
1.6 Outline of Contributions
The thesis contributes to improvements in each of the above mentioned factors.
Through a number of studies it has been shown that speech perception is sensitive
to relatively slow modulations of the temporal envelope of speech [59, 60]. Most of the
energy in the modulation spectrum peaks around 4 Hz, which also corresponds to the syllabic
rate of speech. Although these components are affected in the presence of noise [61, 62],
modifying modulation components in the 1-16 Hz range results in significant degradation
of speech intelligibility [59, 60].
Information from the modulation spectrum can be derived from a spectral analysis
of temporal trajectories of spectral envelopes of speech [63]. However, in order to achieve
sufficient spectral resolution at the low modulation frequencies described above,
relatively long segments of the speech signal need to be analyzed. For example, to capture
modulation spectrum components around 4 Hz, an analysis window of at least 250 ms is
necessary. Analysis windows of this length are also consistent with the time intervals of
co-articulation (a speech production phenomenon), forward masking (an auditory perception
phenomenon) and the linguistic concept of the syllable [64]. By deriving features for ASR using
these kinds of analysis windows, information about the dynamics of spectral components
is explicitly captured.
In [65, 66], 1 second long temporal trajectories of individual critical sub-band en-
ergies were used for phoneme recognition experiments. In this multi-stream framework,
separate neural network classifiers were trained on long-term features from each sub-band
before being combined together by a second level neural network. Since features from
each sub-band were used independently, the comparable performance of this feature ex-
traction technique with conventional short-term spectral features demonstrates that there
is significant information in the local temporal dynamics being captured. These temporal
pattern features (TRAPS) have been extended in different configurations (for example [67])
as modulation features after applying a cosine transform [68] or filtering using modulation
filters [47].
2.2.2 Parametric models of temporal envelopes
The modulation features discussed above are extracted from sub-band energies
of speech using long analysis windows. The sub-band energies are not directly modeled
but are instead produced with an inherently limited resolution as outputs of Bark/Mel scale
integrators on the power spectrum in short-analysis windows every 10 ms (see Section
2.1.1). For more effective features that capture the evolution of the temporal envelopes, it
is necessary to directly model the temporal envelopes.
As described in Section 2.1.1, conventional feature extraction techniques use LPC
in time to effectively capture spectral resonances. Based on duality properties, LPC can
similarly be performed in the frequency domain to directly model and capture important
temporal events. This framework is based on the notion that speech can be considered to
be composed of several amplitude modulated signals at different carrier frequencies. The
AM component of each of these signals is the squared magnitude of their corresponding
analytical signals. The squared magnitude of the analytical signal is also called the Hilbert
envelope and is a description of temporal energy. Instead of computing the analytic signal
directly, an auto-regressive modeling approach can be used. This modeling approach also
called Frequency Domain Linear Prediction (FDLP) is the dual of conventional time domain
linear prediction used to model the power spectrum of speech [69,70]. Instead of modeling
the power spectrum, FDLP models the evolution of signal energy in the time domain
by the application of linear prediction in the frequency domain using the discrete cosine
transform of the signal. This parametric model can be used as an alternate technique to
directly model sub-band envelopes of speech [71,72].
2.2.3 Neural network features
The modulation features described in the earlier sections are typically high dimen-
sional, correlated features. Both these properties prevent them from being used directly
with conventional ASR systems. These features have hence been used in conjunction with
neural networks, which have much more relaxed assumptions on feature distributions. As described
in the previous chapter neural networks can be trained to estimate posterior probabilities
of speech classes. These probabilities can then be used directly as scaled likelihoods in the
hybrid HMM-ANN ASR framework.
Another approach to using neural network posterior outputs, is to convert the
posteriors to features similar to traditional acoustic features for ASR systems. In the Tan-
dem processing approach [28], posterior features from neural networks are post-processed
to be decorrelated and approximately have a normal distribution. This is done in a two
step procedure - a log transform is first applied to the posteriors to Gaussianize the vectors,
followed by decorrelation and dimensionality reduction using the KL transform. Several other approaches
have been proposed to derive features from the outputs of neural networks. In the HATS
technique [73] non-linear outputs from the penultimate layer of a network have been used.
This has been further extended to deriving features from an intermediate bottleneck layer
which reduces the feature dimension as well [30].
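A minimal sketch of the two-step Tandem post-processing of [28], with the KLT basis estimated from the posteriors themselves:

```python
import numpy as np

def tandem_features(posteriors, n_keep=25):
    """Log transform to Gaussianize MLP posteriors, then a KLT (PCA)
    for decorrelation and dimensionality reduction.

    posteriors: (n_frames, n_classes) array of phoneme posteriors.
    """
    logp = np.log(posteriors + 1e-10)           # Gaussianize
    logp -= logp.mean(axis=0)                   # center before KLT
    eigval, eigvec = np.linalg.eigh(np.cov(logp, rowvar=False))
    basis = eigvec[:, np.argsort(eigval)[::-1][:n_keep]]
    return logp @ basis                         # decorrelated features
```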
2.2.4 Combination of information from multiple streams
A key benefit from the development of long-term features is the significant
LVCSR gain obtained from combining these features with conventional short-term fea-
tures [73]. The best combination of features is obtained by first training neural networks
using both the long-term modulation features and short-term spectral energy based features
separately. The outputs of the neural networks are then combined using a merger neural
network or using different combination rules before being used as data-driven features for
LVCSR tasks [74]. As discussed in [75], this approach is useful for several reasons -
• The MLP features derived from neural networks trained on conventional short-term
spectral features and long-term modulation features capture complementary information about phone classes.
• Although the MLPs are trained on different inputs, since they have the same target
classes, the complementary outputs can be effectively combined.
• During the training phase the neural networks are able to discriminatively learn
class boundaries and produce data-driven features that are useful for classification
of sounds. These features are also relatively speaker invariant.
• After the application of post-processing techniques like Tandem, the data-driven neu-
ral network features can easily be modeled by HMM-GMM based LVCSR systems.
2.3 Novel Short-term and Long-term Features for Speech Recognition
We propose a novel feature extraction scheme along the lines of the techniques
described above, to derive two kinds of features - short-term spectral features and long-
term modulation features for ASR. The technique starts by creating a two-dimensional
auditory spectrogram representation of the input signal. This is formed by stacking sub-
band temporal envelopes in frequency instead of stacking short-term spectral estimates in
time.
The sub-band temporal envelopes are obtained by analyzing speech using Fre-
quency Domain Linear Prediction (FDLP). The FDLP technique, as described earlier, fits
an all pole model to the Hilbert envelope of the signal (See Figure 2.1). These representa-
tions of the speech signal are able to capture fine temporal events associated with transient
events like stop bursts while at the same time summarizing the signal's gross temporal
evolution [76].

Figure 2.1: Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.

Short-term features are derived by integrating the auditory spectrogram in
short analysis windows. Long-term modulation frequency components are obtained after
the application of the cosine transform on compressed (static and adaptive compression)
sub-band temporal envelopes.
2.3.1 FDLP based time-frequency representation
The FDLP time-frequency representation is created through the following steps [72] -
(a) Change of processing domain - The FDLP spectrogram is a two-dimensional time-frequency
representation of speech constructed by stacking sub-band temporal envelopes of a
speech signal across frequencies. Each of these temporal envelopes corresponds to a
sub-band frequency signal. To facilitate this, the speech signal is first projected into
the frequency domain via the DCT transform.
(b) Analysis of speech into sub-band frequency signals - Sub-band frequency signals are
obtained by windowing the DCT transform using a set of overlapping Gaussian windows
usually placed on a Bark or Mel scale.
(c) Computation of auto-correlation coefficients via a series of dual operations of time do-
main linear prediction (TDLP) - Among the many approaches, one way of applying
TDLP is using the auto-correlation of the time signal. The auto-correlation coeffi-
cients are in turn derived from the power spectrum since the power spectrum and
auto-correlation of the time signal form Fourier transform pairs. In the FDLP case, the
Hilbert envelope and the auto-correlation of the DCT signal form Fourier transform
pairs.
Since the sub-band DCT signals have already been derived in the previous step, their
auto-correlation coefficients can be computed. We start by computing the squared
magnitude of the inverse discrete Fourier transform (IDFT) of the DCT signal. The
application of a second Fourier transform produces the desired auto-correlation coeffi-
cients.
(d) Application of linear prediction - By solving a system of linear equations, the auto-
regressive model of each sub-band Hilbert envelope is finally derived from the auto-
correlation coefficients. Using the set of prediction coefficients {a_i}, the estimated
Hilbert envelope in each sub-band, HE_s, can be represented as
HE_s(n) = G / |∑_{i=0}^{p} a_i e^{-j2πin}|^2    (2.4)
The parameter G is called the gain of the model. In [77], by normalizing the gain G, the
estimated sub-band envelopes have been shown to become robust to convolutive distor-
tions like reverberations and telephone channel artifacts. Additional robustness to additive
distortions by short-term subtraction of an estimate of noise has also been shown in [78].
There are several parameters that control the temporal resolution of the estimated envelopes
as well as the type and extent of analysis windows for different applications. These have
been elaborated in [72].
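A compact sketch of steps (a)-(d) for a single signal (no Gaussian sub-band windowing or gain normalization; parameter values are illustrative):

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, order=40, n_out=None):
    """All-pole (FDLP) model of a signal's Hilbert envelope."""
    n_out = n_out or len(x)

    # (a) Project the signal into the frequency domain with a DCT.
    c = dct(np.asarray(x, dtype=float), type=2, norm='ortho')

    # (c) The Hilbert envelope and the auto-correlation of the DCT signal
    # form a Fourier transform pair: the squared magnitude of the inverse
    # DFT gives the envelope, whose FFT yields auto-correlations.
    env = np.abs(np.fft.ifft(c, n=2 * len(c))) ** 2
    r = np.fft.fft(env).real[: order + 1] / len(env)

    # (d) Solve the Toeplitz normal equations R a = r for the prediction
    # coefficients (classically via Levinson-Durbin).
    a = solve_toeplitz(r[:order], r[1: order + 1])
    lpc = np.concatenate(([1.0], -a))          # A(z) = 1 - sum a_i z^-i
    gain = r[0] - a @ r[1: order + 1]          # model gain G

    # Evaluate Eq. (2.4): envelope estimate G / |A|^2 at n_out points.
    spec = np.fft.fft(lpc, 2 * n_out)[:n_out]
    return gain / np.abs(spec) ** 2
```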
Figure 2.2 shows the PLP and FDLP spectrograms for a portion of speech. Criti-
cally spaced sub-bands energies of speech are derived in short-analysis windows in the PLP
case. The representation is hence smooth across frequencies in each analysis window. Individual sub-bands of speech are directly modeled in the FDLP technique, resulting in a better
temporal resolution - for example the transient regions are well captured in this representa-
tion. Two kinds of features are derived from the two-dimensional time-frequency representation
of speech formed by sub-band temporal envelopes derived using FDLP.
Figure 2.2: PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3.2 Short-term Features
In conventional feature extraction techniques like PLP, the power spectrum is
first integrated using Mel/Bark integrators in short analysis windows to create sub-band
trajectories of spectral energy. In the FDLP time-frequency representation, instead of the
sub-band trajectories of spectral energy, identical distributions of energy in the time domain
(sub-band Hilbert envelopes) are estimated. Short-term cepstral features can be derived
from these representations.
This is done by first integrating the envelopes in short term analysis Hamming
windows (of the order of 25 ms with a shift of 10 ms). The integrated sub-band energies
are then converted to cepstral coefficients by applying the log transform and taking the
DCT transform across the spectral bands in each of the frames. For most applications we
use 13 cepstral coefficients. First and second derivatives of these cepstral coefficients are
also appended to form a 39 dimensional feature vector [79,80], similar to conventional PLP
features. In [72, 81], a set of FDLP modeling parameters that improve the performance
of these short-term features for ASR in noisy environments has been identified. These
parameters and their effects are summarized in Table 2.1. In all these experiments, both
clean and noisy reverberant test data is evaluated on models trained with clean speech.
Gain Normalization: Gain normalization significantly improves feature robustness in reverberant environments [77, 82]. Using rectangular analysis windows on a Mel scale for sub-band decomposition also contributes to robustness by reducing the mismatch between the clean and noisy reverberant data.

Number of Sub-bands: Increasing the spectral resolution improves robustness in reverberant conditions. The assumptions made for gain normalization are more valid with an increased number of sub-bands. In reverberant conditions, using up to 96 linear bands has been shown to be useful [77].

Model Order: The model order relates to the model's ability to capture sufficient detail of the envelopes. In clean conditions, a higher model order is useful. A lower model order is however better in reverberant conditions [72, 81].

Envelope Expansion: Envelope expansion relates to how the all-pole model models the peaks and valleys of the Hilbert envelope. While envelope expansion is useful in noisy environments to capture dominant reliable peaks, no significant gains are observed in clean conditions [72, 81].

Table 2.1: FDLP model parameters that improve the robustness of short-term spectral features.
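As an illustration of the cepstral computation of this section, the sketch below assumes a (bands x samples) array of sub-band envelopes, e.g. produced band-by-band with the FDLP sketch above; delta computation is omitted.

```python
import numpy as np
from scipy.fft import dct

def fdlp_cepstra(envelopes, sr=8000, n_ceps=13):
    """Short-term cepstra from sub-band FDLP envelopes: integrate each
    envelope in 25 ms Hamming windows every 10 ms, then apply a log
    transform and a DCT across the spectral bands in each frame."""
    wlen, hop = int(0.025 * sr), int(0.010 * sr)
    w = np.hamming(wlen)
    feats = []
    for start in range(0, envelopes.shape[1] - wlen + 1, hop):
        energies = envelopes[:, start:start + wlen] @ w  # per-band energy
        feats.append(dct(np.log(energies + 1e-10), norm='ortho')[:n_ceps])
    return np.array(feats)   # (n_frames, n_ceps); deltas appended later
```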
2.3.3 Long-term Features
In techniques like TRAPS and MRASTA described earlier, modulation frequency
features are derived by analyzing temporal trajectories of spectral energy estimates in indi-
vidual sub-bands using long analysis windows. As described earlier, since FDLP estimates
the temporal envelope in sub-bands, modulation features can be derived from these en-
velopes as well [79].
Before we derive the long-term features, we compress the sub-band temporal en-
velopes both statically and dynamically. The envelopes are compressed statically using the
logarithmic function. Dynamic compression of the envelopes is achieved using an adapta-
tion circuit which consists of five consecutive nonlinear adaptation loops proposed in [83].
These loops are designed so that sudden transitions in the sub-band envelope that are fast
compared to the time constants of the adaptation loops are amplified linearly at the out-
put, while the steady state regions of the input signal are compressed logarithmically. The
compressed temporal envelopes are then transformed using the Discrete Cosine Transform
(DCT) in long term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding modulation spectrum in the 0-
35 Hz range with a resolution of 2.5 Hz [84]. The static and dynamic modulation frequency
components of each critical band are then stacked together before being used as features.
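A sketch of the static (log-compressed) modulation stream for one critical band follows; the adaptive-compression stream built from the adaptation loops of [83] is omitted for brevity.

```python
import numpy as np
from scipy.fft import dct

def modulation_features(envelope, sr=8000, win=0.200, hop=0.010, n_mod=14):
    """DCT over 200 ms windows of the log-compressed sub-band envelope;
    the first 14 coefficients cover roughly the 0-35 Hz modulation range."""
    compressed = np.log(envelope + 1e-10)        # static compression
    wlen, step = int(win * sr), int(hop * sr)
    feats = []
    for start in range(0, len(compressed) - wlen + 1, step):
        seg = compressed[start:start + wlen]
        feats.append(dct(seg, norm='ortho')[:n_mod])
    return np.array(feats)      # (n_frames, n_mod) for one critical band
```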
In [85], the proposed modulation features have been compared with other similar
modulation feature extraction approaches, including Fepstrum [87]. In these experiments,
FDLP based modulations are significantly better than features derived from the other
approaches. An additional set of FDLP modeling parameters that improve the performance
of these long-term features for ASR has also been identified
based on a set of phoneme recognition experiments. These parameters and their effects are
summarized in Table 2.2.
Modulation analysis window: The analysis window used to derive the modulation coefficients can be varied. The best recognition performance was obtained using a window of 200 ms, which also corresponds to the syllabic rate of speech.

Extent of modulations: The number of DCT coefficients can be varied to change the extent of the modulation spectrum. The best range was found to be 14 DCT coefficients covering the 0-35 Hz range.

Type of modulation spectrum: As described earlier, two kinds of compression schemes are used for the modulation features. While the static log modulation features improve the phoneme recognition performances on fricatives and nasals, the dynamic adaptive loops based features help in better recognition of plosives and affricates [85]. A combination of both these features provides significant improvements for all classes [88].

Table 2.2: FDLP model parameters that improve the performance of long-term modulation features.
Figure 2.3: Schematic of the joint spectral envelope, modulation features for posterior based ASR.
2.3.4 Data-driven Features
These acoustic features are converted into data-driven features by first using
them to train two separate 3-layer multilayer perceptrons to estimate posterior probabilities of phoneme classes. Each frame of the short-term spectral envelope features is
used with a context of 9 frames during training. As described earlier, static and dynamic
modulation frequency features of each critical band are stacked together and used to train
a separate MLP network. The spectral envelope and modulation frequency features are
then combined at the phoneme posterior level using the Dempster Shafer (DS) theory of
evidence [74]. These phoneme posteriors are first Gaussianized by using the log function
and then decorrelated using the Karhunen-Loeve Transform (KLT) [28]. This reduces the
dimensionality of the feature vectors by retaining only the feature components which con-
tribute most to the variance of the data. We use 25 dimensional features in our Tandem
representations similar to [75]. Figure 2.3 shows the schematic of the proposed feature
extraction technique.
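A minimal sketch of this combination step is shown below. For simplicity the Dempster-Shafer merger of [74] is replaced here by a renormalized product of the two posterior streams; the Tandem steps reuse the tandem_features function sketched in Section 2.2.3.

```python
import numpy as np

def merged_tandem(post_spec, post_mod, n_keep=25):
    """Merge spectral-envelope and modulation MLP posteriors, then apply
    Tandem processing (log Gaussianization + KLT) to the merged stream.

    post_spec, post_mod: (n_frames, n_classes) posterior arrays.
    """
    merged = post_spec * post_mod                  # simple product merger
    merged /= merged.sum(axis=1, keepdims=True)    # renormalize each frame
    return tandem_features(merged, n_keep=n_keep)  # from earlier sketch
```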
2.4 Speech Recognition Experiments and Results
We perform a set of experiments using Tandem representations of the proposed
spectral envelope and modulation frequency features along with other state-of-the-art fea-
tures for ASR. These include a phoneme recognition task, a small vocabulary continuous
digit recognition task and a large vocabulary continuous speech recognition (LVCSR) task.
For each of these experiments, we train three layered MLPs to estimate phoneme posterior
probabilities using these features. The proposed features are compared with three other
feature extraction techniques - PLP features [10] with a 9 frame context which are similar
to spectral envelope features derived using FDLP (FDLP-S), M-RASTA features [47] and
Modulation Spectro-Gram (MSG) features [86] with a 9 frame context, which are both
similar to modulation frequency features (FDLP-M). We combine FDLP-S features with
FDLP-M features using the DS theory of evidence to obtain a joint spectro-temporal fea-
ture set (FDLP-S+FDLP-M). Similarly, we derive two more feature sets by combining PLP
features with M-RASTA features (PLP+M-RASTA) and MSG features (PLP+MSG). 25
dimensional Tandem representations of these features are used for our experiments. We also
experiment with 39 dimensional PLP features without any Tandem processing (PLP-D).
2.4.1 Phoneme Recognition
Our first experiment is to validate the usefulness of Tandem representation of our
features for a phoneme recognition task using HMMs. We perform experiments on the
TIMIT database, excluding ‘sa’ dialect sentences. All speech files are sampled at 16 kHz.
The training data consists of 3000 utterances from 375 speakers, cross validation data set
consists of 696 utterances from 87 speakers and the test data set consists of 1344 utterances
from 168 speakers. The TIMIT database, which is hand-labeled using 61 labels, is mapped to
the standard set of 39 phonemes [89]. A three layered MLP is used to estimate the phoneme
posterior probabilities. The network, consisting of 1000 hidden neurons and 39 output
neurons (with softmax nonlinearity) representing the phoneme classes, is trained using the
standard back-propagation algorithm with a cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
The Tandem representation of each feature set is used along with a decision tree
clustered triphone HMM with 3 states per triphone, trained using standard HTK maximum
likelihood training procedures. The emission probability density in each HMM state is mod-
eled with 11 diagonal covariance Gaussians. We use a simple word-loop grammar model
using the same standard set of 39 phonemes. Table 2.3 shows the results for phoneme recog-
nition accuracies across all individual phoneme classes for these techniques. The proposed
features (FDLP-S+FDLP-M) significantly improve the recognition accuracy compared to
the baseline PLP-D feature set.
Table 2.3: Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database
Features Phoneme Rec. Acc. (%)
PLP-D 68.3
PLP 70.1
FDLP-S 70.1
M-RASTA 66.8
MSG 65.1
FDLP-M 70.6
PLP+M-RASTA 71.2
PLP+MSG 71.4
FDLP-S+FDLP-M 72.5
2.4.2 Small Vocabulary Digit Recognition
In our second experiment, we use these features on a small vocabulary continuous
digit recognition task (OGI Digits database) to recognize eleven digits (0-9 and 'oh') with 28
pronunciation variants [47]. MLPs are trained using these features to estimate posterior
probabilities of 29 English phonemes using the whole Stories database plus the training
part of Numbers95 database with approximately 10% of data for cross-validation. Tandem
representation of the features are used along with a phoneme-based HMM system with
22 context-independent three-state phoneme HMMs, each model distribution represented
by 32 Gaussian mixture components [47]. Table 2.4 shows the results for word recognition
accuracies. For this task, the proposed spectral envelope features (FDLP-S) and modulation
Table 2.4: Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques
Features Word Recog. Acc. (%)
PLP-D 95.9
PLP 96.2
FDLP-S 96.6
M-RASTA 96.3
MSG 96.0
FDLP-M 96.8
PLP+M-RASTA 97.1
PLP+MSG 97.0
FDLP-S+FDLP-M 97.1
frequency features (FDLP-M) improve word recognition accuracies compared to PLP and
M-RASTA features, respectively.
2.4.3 Large Vocabulary Continuous Speech Recognition
In our third experiment, we use these features on an LVCSR task using the AMI
LVCSR system for meeting transcription [90]. The training data for this system uses indi-
vidual headset microphone (IHM) data from four meeting corpora: NIST (13 hours), ISL
(10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). MLPs
are trained on the whole training set in order to obtain estimates of phoneme posteriors for
each of the feature sets. Acoustic models are phonetically state tied triphone models trained
using standard HTK maximum likelihood training procedures. The recognition experiments
Table 2.5: Word Recognition Accuracies (%) on RT05 Meeting data for different feature extraction techniques. TOT - total word recognition accuracy (%) for all test sets; AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
Features TOT AMI CMU ICSI NIST VT
PLP-D 58.1 57.6 60.6 68.7 49.1 53.6
PLP 53.6 59.1 56.3 70.0 45.3 34.9
FDLP-S 57.5 58.4 58.5 66.9 48.4 54.5
M-RASTA 54.6 53.3 58.4 63.2 46.6 51.0
MSG 55.6 56.1 59.3 65.5 47.9 47.7
FDLP-M 60.5 62.3 66.3 60.6 54.6 58.3
PLP+M-RASTA 59.5 59.5 62.2 71.5 51.1 52.1
PLP+MSG 60.4 61.2 60.7 72.7 53.4 52.4
FDLP-S+FDLP-M 64.1 63.8 65.8 72.2 57.1 61.0
are conducted on the NIST RT05 [91] evaluation data. The AMI-Juicer large vocabulary
decoder is used for recognition with a pruned trigram language model [92]. This is used
along with the reference speech segments provided by NIST for decoding and the pronunciation
dictionary used in the AMI NIST RT05s system. Table 2.5 shows the word recognition
accuracies for these techniques on the RT05 meeting corpus. The proposed features
(FDLP-S+FDLP-M) obtain significant relative improvements for the LVCSR task
compared to the other feature representations.
Table 2.6: Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis
Class PLP FDLP-S M-RASTA FDLP-M PLP+M-RASTA FDLP-S+FDLP-M
Vowel 85.3 84.9 82.4 85.7 86.1 87.3
Diphthong 78.2 79.1 74.2 76.8 78.4 79.8
Plosive 83.8 82.8 81.6 84.1 84.6 85.4
Affricative 73.5 74.4 68.6 75.6 72.9 78.0
Fricative 85.8 85.9 83.5 86.8 86.4 88.0
Semi Vowel 76.2 74.9 72.9 77.1 77.8 79.0
Nasal 84.2 82.8 80.4 84.9 85.8 86.6
Avg. 81.0 80.7 77.7 81.6 81.7 83.4
2.5 Conclusions
In this chapter, we proposed a framework for deriving data-driven features for
ASR. The framework uses four key elements -
• A linear prediction technique that models sub-band temporal envelopes of speech -
We outlined the steps involved in building these auto-regressive models. We also
showed that this technique based on FDLP can capture important details in speech
that conventional techniques do not capture.
• Two kinds of acoustic features - a short-term spectral feature and a long-term modulation
feature. Table 2.6 shows recognition accuracies for the broad phonetic classes,
obtained from a confusion matrix analysis of the proposed techniques on the TIMIT database.
The FDLP-S features provide results comparable to the PLP features. The modulation
features (FDLP-M) improve broad-class recognition rates over the other modulation
features for all the broad phonetic classes.
• A combination of the feature streams at the phoneme posterior level - From Table
2.6, the joint spectral envelope and modulation features yield improved broad class
recognition in all cases compared to the baseline systems.
• Data-driven processing of these features with neural networks followed by Tandem
post-processing allows these features to be used for ASR systems. In all our experi-
ments, Tandem representations of the proposed features improve ASR accuracies over
other features.
In the following chapters we will use this data-driven framework in many other
scenarios. The key scenario is a low-resource setting where the amount of training data
is limited, unlike the ASR settings assumed in this chapter where the amount of training
data is not restricted. We devise techniques to improve the effectiveness of the proposed
front-ends in those settings.
Chapter 3
Data-driven Features for
Low-resource Scenarios
This chapter presents two novel techniques for building data-driven front-ends in
low-resource settings with very limited amounts of transcribed data for acoustic model train-
ing. Both techniques improve performance in low-resource settings using data from
multiple languages, circumventing issues with the different phone sets used in each language.
3.1 Overview
In LVCSR systems, an important factor that impacts performance is the amount
of available transcribed training data. When LVCSR systems are built for new languages
or domains with only a few hours of transcribed data, the performance is lower. To improve
performance, unlabeled data from this new language or domain has been used to increase
the size of the training set [93]. This is done by first recognizing the unlabeled data and
incrementally adding reliable portions to the original training set. For these self-training
techniques to be effective, a low error rate recognizer is required to annotate the unlabeled
data. However in several scenarios like ASR systems for new languages, recognizers built
using limited amounts of training data have very high error rates. Additional improvements
are hence not easily achieved via these techniques.
Another potential solution to this problem is to use transcribed data available from
other languages to build acoustic models which can be shared with the low-resource language
[94,95]. However training such systems requires all the multilingual data to be transcribed
using a common phone set across the different languages. This common phone set can
be derived either in a data driven fashion or using phonetic sets such as the International
Phonetic Alphabet (IPA) [96]. More recently, cross-lingual training with Subspace Gaussian
Mixture Models (SGMMs) [97,98] has also been proposed for this task.
An alternative approach to this problem moves the focus from using the shared
data to build acoustic models, to training data-driven front-ends. The key element in
this data-driven approach is a multi-layer perceptron (MLP) which is trained on large
amounts of task independent data. In [99, 100], a task independent approach has been
used to first train MLPs with large amounts of data. Features derived from these nets
are then shown to reduce the requirement of task specific data to train subsequent HMM
stages. In these experiments, although the task specific data comes from the same language
as the task independent data, the data sources are collected in different domains. More
recently this approach has been shown useful also in cross-domain and cross-lingual LVCSR
tasks [75,101]. In [101], Tandem features trained on English CTS data are shown to improve
performance when used in other domains (meeting data) within the language and even
in other languages (Mandarin and Arabic). Even though MLPs are trained on different
phone sets in different languages, Tandem features are able to capture common phonetic
distinctions among languages and improve performance of conventional acoustic features.
In this chapter, we investigate two approaches to building neural network based
data-driven front-ends in low-resource settings. We assume the availability of only 1 hour
of transcribed task specific data to train the acoustic models. To improve over the poor
performance of acoustic models using conventional features in these settings, we use data-
driven feature front-ends that integrate the following additional sources of information -
(a) Multilingual task independent data - Transcribed data from languages other than
the target language is first used to train initial neural network models. These task-
independent models are then adapted using limited amounts of task-specific data.
(b) Multiple feature representations - Significant gains were demonstrated in the previous
chapter using different feature representations. We show how these features can also be
effective in low-resource settings.
One of the key problems in training neural network systems using data from multiple
domains is differences in how the data sources are transcribed. Although there are
phoneme sets like the IPA which can be used to uniformly label data across languages, only
a few data sources are labeled using such sets. This chapter proposes techniques that can be
used to train neural networks in such scenarios.
In low-resource settings, the performance of other modules of the ASR pipeline -
for example the language model or the pronunciation dictionary - is also affected. We however
focus our attention only on the feature extraction module and the acoustic models.
3.2 Training Using a Combined Phone Set
In this section we describe a training approach using two data sets - H and L. H
is a task independent data set with significantly more training data than the
low-resource data set L. Both H and L are transcribed using different phoneme sets H
and L. We train a neural network system using the following steps -
(a) Train an initial network using data set H - We start by training a multilayer perceptron
(MLP) on the high resource task independent data set. After it has been trained, this
network estimates posterior probabilities of speech sounds in H, conditioned on the
input feature vectors.
(b) Find a mapping between phoneme sets H and L - If the two phoneme sets share the
same phonetic transcription scheme, for example the IPA, it is relatively easy to find
such a mapping. However, this is often not the case.
In the proposed training scheme we investigate the use of a data-driven technique based
on an analysis of confusion matrices to find such a mapping. Confusion matrices have
been used in the past to measure the reliability of human speech recognition [102]. More
recently they have also been used to study the performance of ASR systems [103,104].
We start by forward passing the low-resource task specific data L through the MLP
trained on task independent data in step (a) to obtain phoneme posteriors. To understand the relationship between phonemes, we treat the phoneme recognition system
as a discrete, memory-less, noisy communication channel with the phonemes in L as
source symbols to the system. Using the recognized phonemes belonging to H at the
output of the recognizer as received symbols, confusion matrices that characterize the
data sets are then built.
Each time a feature vector corresponding to phoneme li is passed through the trained
MLP, posterior probabilities corresponding to all phonemes in set H are obtained at
the output of the MLP. We treat each of these posterior probabilities as soft-counts
to populate a phoneme confusion matrix. From the fully populated confusion matrix c, the following
counts can be derived. Entry (i, j) of the confusion matrix corresponds to the soft count
aggregate c(i, j) of the total number of times task-specific phoneme li was recognized
as task-independent phoneme hj . Marginal count c(i) of each row is the total number
of times phoneme li occurred in the task-specific data. Similarly count c(j) of each
column is the total number of times phoneme hj of the task-independent data set was
recognized. C is the total number of counts in the confusion matrix.
Given such a confusion matrix, we would like to find the best map for every phoneme l_i among the
phones of H based on these counts. A useful information theoretic quantity that can
be used is the empirical pointwise mutual information [105]. In [104], the use of this
quantity in conjunction with confusion matrices has been demonstrated. For an input alphabet
A and output alphabet B, using the count based confusion matrix, the empirical point
wise mutual information between two symbols ai from A and bj from B is expressed as
I_{AB}(a_i, b_j) = \log \frac{N_{ij} \cdot N}{N_i \cdot N_j},    (3.1)
where N_{ij} is the number of times the joint event (A = a_i, B = b_j) occurs, N is the total count, and N_i = \sum_j N_{ij} and N_j = \sum_i N_{ij} are the marginal counts.
Using our soft count based confusion matrix between two phone sets H and L, we
similarly define the empirical pointwise mutual information between phoneme pairs
(li, hj) as
I(l_i, h_j) = \log \frac{c(i, j) \cdot C}{c(i) \cdot c(j)},    (3.2)
using the quantities defined earlier. For a given task specific phoneme l_i we compare
I(l_i, h_j) for all h_j \in H. Since the total count C and the monotonically
increasing log function are common to all comparisons, the simplified count-based measure
J(l_i, h_j) = \frac{c(i, j)}{c(i) \cdot c(j)}    (3.3)
is instead used.
Using this measure, for each label l_i, the more frequently a particular label h_j is recognized
for it relative to the overall frequencies of both, the higher the value of J(l_i, h_j). We hence map each phoneme l_i in the task specific phoneme
set to the phoneme h_j in the task independent set which has the highest J(l_i, h_j).
If we can assume that a one-to-one mapping exists between the phoneme sets and
that the cardinality of H is greater than that of L, multiple assignments to the same phoneme
can be avoided. This is done by removing an assigned phoneme from the list of
available phonemes once it has been mapped (a sketch of this mapping follows the list).
(c) Re-transcribe L using a new mapped phone set H - Using the mapping derived using
confusion matrices from above, the task specific data L can now be re-transcribed into
the phone set used to train the initial network.
(d) Adapt the network using data set L - The initial task independent neural network can
now be adapted using the task specific data since it has been mapped to the same phone
set. The neural network is adapted by retraining it using the new data after initializing
it with its original weights.
(e) Extract data-driven features - Posterior features are derived for ASR after Tandem
processing the phoneme posterior outputs of these networks.
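A minimal sketch of the confusion-matrix mapping of step (b) (NumPy; the epsilon guard against unseen phonemes and the frequency-ordered greedy assignment are assumptions not spelled out in the text, and the cardinality of H is assumed to be at least that of L):

import numpy as np

def map_phone_sets(posteriors, labels, n_L, one_to_one=True):
    """posteriors: (T, h) MLP outputs for the task-specific data L;
    labels: (T,) index of the task-specific phoneme l_i at each frame.
    Returns a map from each l_i to the h_j maximizing J(l_i, h_j)."""
    T, h = posteriors.shape
    C = np.zeros((n_L, h))
    for t in range(T):                        # soft counts populate the CM
        C[labels[t]] += posteriors[t]
    c_i = C.sum(axis=1, keepdims=True)        # row marginals c(i)
    c_j = C.sum(axis=0, keepdims=True)        # column marginals c(j)
    J = C / (c_i * c_j + 1e-12)               # count-based measure, Eq. (3.3)
    mapping, taken = {}, set()
    for i in np.argsort(-c_i.ravel()):        # assign frequent phonemes first
        for j in np.argsort(-J[i]):           # best J(l_i, h_j) still available
            if not one_to_one or int(j) not in taken:
                mapping[int(i)] = int(j)
                taken.add(int(j))
                break
    return mapping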
3.3 Training Using Multiple Output Layers
In this section we propose a second training technique for training neural network
systems across different data sets without having to map all the data using a common
phoneme set. As before, we describe the training approach using two data sets - H and
L. H is a task independent data set with significantly more training data than the
low-resource data set L. Both H and L are transcribed using different phoneme sets H
and L, with cardinalities h and l respectively. The network is trained using an acoustic
representation with dimension d in the following steps -
(a) Train the MLP on the task independent set H - We start by training a 4 layer MLP
of size d×m1×m2×h on the high resource language with randomly initialized weights.
While the input and output nodes are linear, the hidden nodes are non-linear. While
the dimension of m1 is high, m2 is low dimensional and is known as the ‘bottleneck’
layer. We are motivated to introduce the bottleneck layer to allow the network to learn
a common low dimensional representation among the languages.
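A minimal sketch of this bottleneck topology (PyTorch; the sizes m1 = 1000, m2 = 25 and h = 52 are illustrative, borrowed from experiments reported later in the thesis):

import torch.nn as nn

class BottleneckMLP(nn.Module):
    """4-layer d x m1 x m2 x h net: wide non-linear hidden layer m1,
    narrow non-linear 'bottleneck' layer m2, and a linear output of size h
    (one node per phoneme of the high-resource set)."""
    def __init__(self, d, m1=1000, m2=25, h=52):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(d, m1), nn.Sigmoid(),
                                    nn.Linear(m1, m2), nn.Sigmoid())
        self.out = nn.Linear(m2, h)       # softmax is applied inside the loss

    def forward(self, x):
        return self.out(self.hidden(x))

    def bottleneck_features(self, x):
        return self.hidden(x)             # features tapped at the m2 layer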
Figure 3.3: Tandem and bottleneck features for low-resource LVCSR systems.
Table 3.4 shows the results of using the proposed MLP based features. We train
the 1 hour HMM-GMM system on 39 dimensional PLP features (13 cepstral + Δ + ΔΔ
features) as our baseline system.
Training with 3 languages
We extend our training on 2 languages to train a multilingual MLP system on 3
languages - Spanish, German and English. The training procedure starts as outlined earlier
with 15 hours of Spanish. The networks are then initialized to train with the German data
in two stages - with weights from the Spanish system up to the bottleneck layer, and with
weights from a single-layer network trained on the German data. After the net has been
trained on the German data, we do a re-training using the 1 hour of English data. Figure
3.3 is a schematic of the training and feature extraction procedure. Table 3.5 shows the
results of using the proposed MLP based features.
Table 3.5: Word Recognition Accuracies (%) using three languages - Spanish, German and English
Features Word Recog. Acc. (%)
Tandem features 35.8
Bottleneck features 37.2
The above results show the advantage of the proposed approach to training MLPs
on multilingual data. Unlike earlier approaches, we are able to train on multiple languages
without using a common phone set among the languages.
3.5 Conclusions
In this chapter we have demonstrated the usefulness of data-driven feature front-
ends over conventional features in low-resource settings. In these settings, data-driven fea-
tures are built using task independent data. However in most cases, this data is transcribed
using different phoneme sets. We have addressed this issue using two methods. Features
extracted using these techniques are used to train LVCSR systems in the low-resource lan-
guage. In our experiments, the proposed features provide a relative improvement of about
30% in a low-resource LVCSR setting with only one hour of training data. In the next
chapter we investigate more complex front-ends for these scenarios.
Chapter 4
Wide and Deep MLP Architectures
in Low-resource Settings
Significant improvements in ASR performance have been observed when additional
processing layers have been added to neural network front-ends. To train these additional
parameters, large amounts of training data are also required. This chapter explores how
these additional layers can be incorporated in low-resource settings with only a few hours of
task specific training data.
4.1 Overview
In the previous chapter, improvements were observed in low-resource settings by
using multiple feature representations of the acoustic signal. To allow these parallel streams
of information to be trained, task independent data from different languages were used in
Figure 4.1: (a) Wide and (b) Deep neural network topologies for data-driven features
conjunction with simple neural network topologies. In this chapter, in addition to these
parallel feature streams, we explore whether more complex neural network architectures, which are
currently used in state-of-the-art ASR systems, can also be trained in low-resource
settings.
In [113], these complex neural network architectures have been broadly classified
into two categories - wide networks and deep networks. In wide networks, several parallel
neural network modules that interact with each other are used. In deep network topologies,
on the other hand, several interacting neural network layers are stacked one after the other
in a serial fashion. Figure 4.1 illustrates these topologies.
Several wide network topologies have been used in processing long-term modulation
features, for example the architectures used in the TRAPS [66] or HATS [73] frameworks.
In a more recent approach [114], modulation features are first divided into two separate
streams as shown in Figure 4.1. The phoneme posterior outputs of a neural network trained
on high modulations (> 10Hz) are then combined with low modulation features to train
a second network. Tandem processed features from the second network are then used for
ASR.
Hierarchical networks where the outputs of one neural network processing stage are
further processed by a second neural network have been used in [100, 115]. More recently,
Deep Belief Networks with several layers (5-6 hidden layers) have been used in acoustic
modeling. In this approach individual layers of the deep network are usually pre-trained
before being assembled together and trained together [116–118].
In this chapter we discuss techniques to train both these classes of complex net-
works in low-resource settings. Faced with limited amounts of task specific data in these
scenarios, we demonstrate the use of task independent data to build these networks.
4.2 Wide Network Topologies
4.2.1 Building the Data-driven Front-ends
We use two kinds of task independent data sources in building the proposed front-
end with wide network topologies -
(a) Up to 20 hours of data from the same language collected for a different task. Although
this data has a different genre, it has similar acoustic channel conditions as the
Figure 4.2: Data driven front-end built using data from the same language but from a different genre.
low-resource data.
(b) 200 hours of data from a different language but with similar acoustic channel conditions.
We build two kinds of front-ends on varying amounts of these task independent training
data.
1. A monolingual front-end trained on varying amounts of data from the same language as
the low-resource task. As shown in Figure 4.2, we train different configurations of this
front-end on 1 to 20 hrs of data (N hours). The primary advantage of this kind of a
front-end is that even though the genre is different, the MLP learns useful information
that characterizes the acoustics of the language. This improves as the amount of training
data increases. For our current experiments we also choose task independent data from
similar acoustic conditions as the low resource setting. Features generated using this
front-end are hence enhanced with knowledge about the language and have unwanted
variabilities from the channel and speaker removed. We use conventional short-term
acoustic features to train these nets.
2. A cross-lingual front-end that uses large amounts of data from a different language.
In most low-resource settings, it is unlikely that sufficient transcribed data will be available in the
Figure 4.3: A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions.
same language to train a monolingual front-end. However considerable resources in other
languages might be available. Figure 4.3 outlines the components of the cross-lingual
front-end that we train to include additional data from a different language. This front-
end has two parts. The first part is similar to the monolingual front-end described above
and consists of an MLP trained on various amounts of data from same language but
different genre (N hours). The second part includes a set of MLPs trained on large
amounts of data from a different language (M hours). Outputs from these MLPs are
used to enhance the input acoustic features for the former part.
Although languages have common attributes between them, data from these languages
is transcribed using different phone sets and needs to be combined before it can be used.
In the previous chapter, we used two different approaches to deal with this - a count-based
data driven approach to find a common phone set and an MLP training scheme with
intermediate language specific layers. Both these approaches finally involve adaptation
of multilingual MLPs to the low-resource language. In this chapter, we do not adapt
any MLPs; instead we keep the front-end fixed by using the multilingual MLP to derive
posterior features.
When MLPs trained on a particular language are used to derive phoneme posteriors from
a different language, the language mismatch results in less sharp posteriors than from an
MLP trained on the same language. However an association can still be seen between
similar speech sounds from the different languages. We use this information to enhance
acoustic features of the task specific language. Phoneme posteriors from two complementary
acoustic streams are combined to improve the quality of the posteriors before
they are converted to features using the Tandem technique. The multilingual posterior
features are finally appended to short-term acoustic features to train a second level of
MLPs on varying amounts of data from the same language as the low-resource task. This
procedure is hence similar to the approaches described earlier with modulation features
and the TRAPS/HATS configurations used to build wide neural network topologies (see
Figure 4.1).
4.2.2 Experiments and Evaluations
We train two data-driven front-ends for the low-resource LVCSR task as described
in Sec. 4.2.1. We train the monolingual front-end on a separate task independent training set
of 20 hours from the Switchboard corpus. Although this training set has similar telephone
channel conditions as the low-resource task used for our experiments, it has a different
genre. The phone labels for this set are obtained by force aligning word transcripts to
previously trained HMM/GMM models using a set of 45 phones. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames. We
train separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours to understand how the
amount of task independent data affects performance on these features.
In addition to the Switchboard corpus, we train Spanish MLPs on 200 hours of tele-
phone speech from the LDC Spanish Switchboard and Callhome corpora for the cross-lingual
front-end. Phone labels for this database are obtained by force aligning word transcripts
with BBN's Byblos recognition system and a set of 27 phones. We use two acoustic features
- short-term 39 dimensional PLP features with 9 frames of context and 476 dimensional
long-term modulation features (FDLPM). When networks are trained on multiple feature
representations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. We use the Dempster-Shafer
rule of combination for our experiments. Posteriors from multiple streams are combined to
reduce the effects of language mismatch and improve posteriors. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and decor-
relating them by using the Karhunen-Loeve transform (KLT). A dimensionality reduction
is also performed by retaining only the top 20 feature components which contribute most
to the variance of the data.
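A sketch of this Tandem post-processing chain (NumPy; the small epsilon floor is an assumption added to keep the logarithm finite):

import numpy as np

def tandem_postprocess(posteriors, kl_basis=None, n_keep=20, eps=1e-8):
    """Gaussianize phoneme posteriors with the log, decorrelate them with
    the Karhunen-Loeve transform (computed here as PCA on the training
    posteriors), and keep the top n_keep components by variance."""
    logp = np.log(posteriors + eps)
    logp = logp - logp.mean(axis=0, keepdims=True)
    if kl_basis is None:                  # estimate the KLT on training data
        cov = np.cov(logp, rowvar=False)
        eigval, eigvec = np.linalg.eigh(cov)
        kl_basis = eigvec[:, np.argsort(-eigval)[:n_keep]]
    return logp @ kl_basis, kl_basis

At test time the basis estimated on the training posteriors would be reused rather than re-estimated.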
The English MLPs in the cross-lingual setting are trained on enhanced acoustic
features. These features are created by appending posterior features derived from the
Table 4.1: Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features
1hr 2hr 5hr 10hr 15hr
PLP features 28.8 33.6 39.7 43.8 46.5
Spanish MLPs to the PLP features used in monolingual training. We similarly train
separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours of task independent data.
In our first experiment we use 39 dimensional PLP features directly for the 1
hour Callhome LVCSR task. The acoustic models have a low word accuracy of 28.8%.
These features are then replaced by 25 dimensional posterior features using the monolingual
and cross-lingual front-ends, each trained on varying amounts of task independent data
from the Switchboard corpus. Figure 4.4 shows how the performance changes for both the
monolingual and cross-lingual systems. Using the data-driven front-ends, the word accuracy
improves from 28.8% to 30.1% and 37.1% with just 1 hour of task independent training
data using the monolingual and cross-lingual front-ends respectively. These improvements
continue to 37.2% and 41.5% with the same 1 hour of Callhome LVCSR training data as
the amount of task-independent data is increased for both the front-ends. We draw the
following conclusions from these experiments -
1. With very few hours of task specific training data, posterior features can provide
significant gains over conventional acoustic features. Table 4.1 shows the word accuracies
when different amounts of Callhome data are used to train the LVCSR system.
By using the cross-lingual front-end, features from only 1 hour of data perform close
to 5-10 hours of the Callhome data with conventional features. This demonstrates
Figure 4.4: LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends (curves: 1 hr of acoustic features only; 1 hr of posterior features using the monolingual front-end; 1 hr of posterior features using the cross-lingual front-end).
the usefulness of our approach where we use task independent data in low-resource
settings to generate better features.
2. When data from a different language is used, additional gains of 4-7% absolute are
achieved over just using task independent data from the same language. It is interest-
ing to observe that the performance with the cross-lingual front-end starts improving
from the best performance achieved with the monolingual front-end.
4.3 Deep Network Topologies
A deep neural network (DNN) is a multilayer perceptron with more layers than
traditionally used networks. The layers of a DNN are often initialized using a pretraining
algorithm before the network is trained to completion using the error back-propagation
algorithm [119]. In this section we discuss the development of a DNN for low-resource
scenarios.
4.3.1 DNN Pretraining and Initialization
The purpose of the pretraining step is to initialize a DNN network with a better set
of weights than a randomly selected set. Networks trained from these kinds of initial weights
are observed to be well regularized and converge to a better local optimum than randomly
initialized networks [120, 121]. As with traditional ANNs, deep neural networks have been
used both as acoustic models that directly model context-dependent states of HMMs [117]
and also to derive data-driven features [122, 123]. In both cases, the performances of these
networks are better than traditional shallow networks [117,118].
In the deep belief network (DBN) pretraining procedure [124], by treating layers
of the MLP as restricted Boltzmann machines (RBM), the parameters of the network are
trained in an unsupervised fashion with an approximate contrastive divergence algorithm
[124]. However, various approximations in the training algorithm introduce modeling errors,
which in turn decrease the effectiveness of this approach as the number of layers is
increased [119].
A different algorithm that has been shown to be equally effective for pretraining
DNNs is called discriminative pretraining [119, 125]. This pretraining procedure starts by
training an MLP with 1 hidden layer. After this MLP has been trained discriminatively
with the error back-propagation algorithm, a new randomly initialized hidden layer and
softmax layer are introduced to replace the initial soft-max layer of the first network. The
deeper network is then trained again discriminatively. This procedure is repeated until the
desired number of hidden layers is in place.
Although pretraining algorithms are effective in initializing DNNs, the key constraint
in low-resource settings is often an insufficient amount of data to train these networks.
We show that in these scenarios, task independent data can instead be used to
pretrain and initialize a DNN before it is finally adapted and used with limited amounts of
task specific data in a low resource setting.
We outline the training of a 5 layer DNN of size - d×m1×m2×m3×h. The training
algorithm is however general and can be extended to more hidden layers. The MLP has a
linear input layer with a size d corresponding to the dimension of the input feature vector,
followed by three non-linear layers m1, m2, m3 and a final linear layer with a size h corresponding
to the phone set of the task independent data on which the DNN is being trained. While
the dimensions of m1 and m2 are quite high, m3 is a low-dimensional bottleneck layer. Similar
to data driven networks described in the previous chapter, both posterior and bottleneck
features can be derived from the DNN. We use the following steps to pretrain a DNN -
1. Initializing the network - We begin the training procedure by initializing a simple
network with 1 hidden layer - d×m1×h. Starting with randomly initialized weights
connecting all the layers of the network, we train this network with one pass of the
entire data, similar to [119].
2. Growing the network - The d×m1×h network is now grown by inserting a new layer
m2 and a set of random weights connecting m1 −m2 and m2 − h. The new network
is again trained with one pass of the entire data using the standard back-propagation
algorithm. The weights d −m1 are copied from the initialization step and are kept
fixed.
The desired network d×m1×m2×m3×h is finally created by adding the bottleneck
layer m3. While weights d − m1, m1 − m2 are copied from the previous step, new
random weights are used to connect m2 −m3 and m3 − h.
3. Final training - With all the layers of the network in place, the complete network is
trained to full convergence.
We use task independent data in all these steps. The DNN is next adapted to the
low-resource setting using limited amounts of task specific data.
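The three pretraining steps can be sketched as follows (PyTorch; one_pass_train is a hypothetical helper standing in for a single back-propagation sweep over the task independent data, and the final training to full convergence is left to the caller):

import torch.nn as nn

def grow_and_pretrain(d, m1, m2, m3, h, data_loader, one_pass_train):
    # step 1: d x m1 x h with random weights, one pass of the data
    net = nn.Sequential(nn.Linear(d, m1), nn.Sigmoid(), nn.Linear(m1, h))
    one_pass_train(net, data_loader)
    # step 2: insert m2 with random m1-m2 and m2-h weights;
    # the copied d-m1 weights are kept fixed during this pass
    first = net[0]
    first.weight.requires_grad_(False)
    first.bias.requires_grad_(False)
    net = nn.Sequential(first, nn.Sigmoid(),
                        nn.Linear(m1, m2), nn.Sigmoid(), nn.Linear(m2, h))
    one_pass_train(net, data_loader)
    # add the bottleneck m3: copy d-m1 and m1-m2, randomize m2-m3 and m3-h
    net = nn.Sequential(net[0], nn.Sigmoid(), net[2], nn.Sigmoid(),
                        nn.Linear(m2, m3), nn.Sigmoid(), nn.Linear(m3, h))
    # step 3: unfreeze everything; the caller trains to full convergence
    for p in net.parameters():
        p.requires_grad_(True)
    return net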
4.3.2 DNN Adaptation with task specific data
As described in the previous chapter, one limitation while adapting between domains
is differences in the phoneme sets. We have proposed a neural network based technique
for this in the previous chapter that replaces the last, language specific layer. We use
this technique in the following steps for adapting the DNN -
1. Initialize the network to train on task specific set - To continue training on the task
specific set which has a different phoneme set size l, we create a new 5 layer DNN of
size d×m1×m2×m3×l. The first 4 layer weights of this new network are initialized
using weights from the DNN trained on the task independent data set. Instead of
using random weights between the last two layers, we initialize these weights from a
separately trained single layer perceptron. To train the single layer perceptron, non-
linear representations of the low-resource training data are derived by forward passing
the data through the first 4 layers of the MLP. The data is then used to train a single
layer network of size m3×l.
2. Train the MLP on the task specific set - Once the 5 layer MLP of size d×m1×m2×m3×l
has been initialized, we re-train the MLP on the low-resource language. By sharing
weights across languages the MLP is now able to train better on limited amounts of
task specific data.
We derive the features for ASR from the bottleneck hidden layer of the final DNN.
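A sketch of these two adaptation steps (PyTorch; the model is assumed to expose its layers up to the bottleneck as `hidden` and its output layer as `out`, matching the bottleneck sketch in Chapter 3 - both names and the small training loop are illustrative):

import torch
import torch.nn as nn

def adapt_to_low_resource(dnn, feats_L, labels_L, m3=25, l=47, epochs=10):
    """Replace the language specific output layer of the cross-lingual DNN
    and initialize it from a separately trained single-layer perceptron."""
    with torch.no_grad():
        reps = dnn.hidden(feats_L)        # bottleneck representations of the
                                          # 1 hour of task-specific data
    new_out = nn.Linear(m3, l)            # the m3 x l single-layer network
    opt = torch.optim.SGD(new_out.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):               # a few sweeps are illustrative
        opt.zero_grad()
        loss_fn(new_out(reps), labels_L).backward()
        opt.step()
    dnn.out = new_out                     # bolt the new layer on; the whole
    return dnn                            # stack is then re-trained on L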
4.3.3 Experiments and Evaluations
Similar to low-resource experiments in the previous chapter, we build a cross-
lingual DNN front-end using data from 3 different languages - Spanish, German and English.
Separate DNNs are trained on two different feature representations - PLP and FDLPM.
Bottleneck features from these front-ends are then combined and used for ASR experiments.
DNN pretraining with cross-lingual data
32 hours of cross-lingual data from Spanish (16 hours), German (15 hours) and
English (1 hour) are used to train a 5 layer DNN with 3 hidden layers. The cross-
lingual data uses a combined phoneme set size of 52 derived from a count-based mapping
scheme (Chapter 3, Section 3.4.3).
Separate DNNs are trained on two feature representations. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to
train the first network with architecture - 351×1000×1000×25×52. A second system is
trained on modulation features derived using FDLP. These features (FDLPM) correspond
to 28 static and dynamic modulation frequency components extracted from 17 bark spaced
bands. A reduced feature set from only 9 alternate odd bands is used to train a system
Table 4.2: Word Recognition Accuracies (%) with semi-supervised pre-training
System Word Rec. Acc. (%)
Conventional acoustic features (PLP) using 1 hour of English training data 28.8
Data-driven features using the data-driven map, 31 hours of multilingual data (German + Spanish) and 1 hour of English (Chapter 3) 36.5
Data-driven features using an adaptable last layer for MLP training, 31 hours of multilingual data (German + Spanish) and 1 hour of English (Chapter 3) 37.2
Data-driven features using a deep neural network pre-trained using 31 hours of multilingual data (German + Spanish) and 1 hour of English 41.0
with an architecture of 252×1000×1000×25×52. Both systems are trained with the
standard back-propagation algorithm and the cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in frame-based phoneme classification
on the cross-validation data.
The DNN networks are built in stages as described in the previous section. For the
DNN trained using PLP features, a three layer MLP (351×1000×52) initialized with random
weights, is first trained using one pass of the cross-lingual data. In the next step, a four
layer MLP (351×1000×1000×52) is trained starting with copied weights from the 351×1000
section of the earlier network and random weights for the 1000×1000×52 section. A single
pass of the cross-lingual data is used to train this network keeping the copied weights fixed.
The final 5 layer network (351×1000×1000×25×52) is constructed with copied weights for
the 351×1000×1000 section and random weights for the 1000×25×52 part. The network
is then trained to full convergence. A similar 252×1000×1000×25×52 network is trained using the
FDLPM features.
DNN adaptation to low-resource settings
Each of the DNN networks trained on task independent data is then adapted
to the low-resource setting with the task-specific 1 hour of English data. The networks are
adapted after the task dependent output layer of the cross-lingual DNN has been replaced.
This is done in two steps.
In the first step, all weights except the weights between the bottleneck layer and
the output layer are initialized directly from the cross-lingual network. The second set
of weights are initialized from a single layer network trained on non-linear representations
of the 1 hour of English data, derived by forward passing the English data through the
cross-lingual network up to the bottleneck layer. This network has an architecture of 25×47
corresponding to the dimensionality of the non-linear representations from the bottleneck
layer of the cross-lingual network and the size of the English phoneme set.
Once the networks have been initialized, PLP and FDLPM features derived from the 1
hour of English data are used to train the new low-resource networks. The networks trained on
PLP and FDLPM features now have architectures of 351×1000×1000×25×47 and 252×1000×1000×25×47
respectively. These networks are then used to derive bottleneck features. The two sets of 25
dimensional bottleneck features from the networks are appended together before applying a dimensionality reduction to form a final 25 dimensional bottleneck feature vector
for ASR.
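A sketch of this combination step (NumPy; a PCA projection stands in for the dimensionality reduction, whose exact form is not specified here):

import numpy as np

def combine_bottlenecks(bn_plp, bn_fdlpm, n_keep=25):
    """Append the two 25-dimensional bottleneck streams frame by frame,
    then reduce the 50-dimensional result back to n_keep dimensions with a
    KLT/PCA basis estimated on the training data."""
    joint = np.hstack([bn_plp, bn_fdlpm])             # (T, 50)
    joint = joint - joint.mean(axis=0, keepdims=True)
    cov = np.cov(joint, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    basis = eigvec[:, np.argsort(-eigval)[:n_keep]]   # top variance directions
    return joint @ basis                              # (T, n_keep)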
ASR Experiments using DNN features
We use the same ASR setup on Callhome English as described earlier. The baseline
HMM-GMM system is trained on 1 hour of data using 39 dimensional PLP features. Table
4.2 shows the recognition accuracies on this task using different approaches. The DNN
features significantly improve ASR accuracies when compared with equivalent systems built
using features from simpler 3 layer MLPs.
4.4 Semi-supervised training in Low-resource Settings
4.4.1 Overview
Semi-supervised training has been effectively used to train acoustic models in
several languages and conditions [93,126–128]. In this section we describe the development
of a semi-supervised approach to improve speech recognition performances in low-resource
settings.
We start by using the best acoustic models trained in the low-resource setting to
decode the available untranscribed data. The decoded data is then used along with the
limited amounts of transcribed training data to train acoustic models in a semi-supervised
fashion.
4.4.2 Selecting Reliable Data
In low-resource settings, since the recognition performance of the recognizer is low,
the quality of the decoded untranscribed data is also poor. It is hence useful to select
reliable portions of the untranscribed data for semi-supervised training. This selection is
done using confidence scores computed for each decoded utterance. Confidence scores are
computed using two techniques -
1. LVCSR based word confidences - LVCSR lattice outputs can be treated as directed
graphs with arcs representing hypothesized words. Each arc spans a time interval
(t_s, t_f) during which the word is hypothesized to be present in the speech signal, and is
also associated with acoustic and language model scores. Using these scores, word
posteriors can be computed with the standard forward-backward algorithm [129].
For any given hypothesized word wi, at a given time frame t, several instances of
the word can be present on different lattice arcs simultaneously. A frame-based word
posterior of wi can be computed as
p(w_i|t) = \sum_j p(w_i^j|t)    (4.1)
where j corresponds to all the different instances of wi that are present at time frame
t [130]. In our proposed selection technique we use a word confidence measure Cmax
based on these frame level word posteriors [130], given as the maximum word confidence
of the word in its hypothesized time interval (t_s, t_f); both confidence measures are sketched in code after this list:
C_{max}(w_i, t_s, t_f) = \max_{t \in (t_s, t_f)} p(w_i|t)    (4.2)
Figure 4.5: MLP posteriogram based phoneme occurrence count. (The original figure shows the binarized presence of a word W's constituent phonemes p1-p4 over frames t_s to t_f, and the path along which occurrences are counted.)
2. MLP posteriogram based phoneme occurrence confidence - Similar to the above men-
tioned confidence from the LVCSR classifier, we also derive confidence scores from
phoneme posterior outputs of a neural network classifier. This confidence measure
uses the posteriogram representation of an utterance, derived by forward passing
acoustic features corresponding to the utterance through the trained MLP classifier.
For each hypothesized word wi in the LVCSR transcripts, we first look up its set of
constituent phonemes {p1, p2 . . . pn} from a pronunciation lexicon. Phoneme posteri-
ors corresponding to each phoneme are then selected from the utterance's posteriogram
representation and binarized to indicate the phoneme's presence or absence using a
set threshold. The average number of times the constituent phonemes appear in the
hypothesized time span (t_s, t_f) along a Viterbi search path is then used as the confidence
measure. The selected path is designed to produce the occurrence count while visiting
all constituent phonemes in sequence. The rationale behind this measure is that
if a word is hypothesized correctly, it is likely that all its constituent phonemes will
be present in the posteriogram, hence resulting in a high average occurrence count.
Figure 4.5 is a schematic of the proposed count based measure computed as -
C_{occ}(w_i, t_s, t_f) = \frac{c}{N}    (4.3)
where c is the total number of phoneme occurrences and N is the total number
of frames in the hypothesized interval (t_s, t_f).
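The two measures above can be transcribed fairly directly. Below is a sketch (NumPy; array shapes are assumptions, and a simple dynamic program stands in for the Viterbi search over the binarized posteriogram, with the monotonic left-to-right path as the only constraint assumed):

import numpy as np

def word_posterior_per_frame(instance_posteriors):
    """Eq. (4.1): sum the posteriors of all lattice instances j of word w_i
    that are alive at frame t."""
    return float(np.sum(instance_posteriors))

def c_max(frame_posteriors):
    """Eq. (4.2): maximum of the frame-level word posteriors p(w_i|t)
    over the hypothesized interval (t_s, t_f)."""
    return float(np.max(frame_posteriors))

def c_occ(posteriogram, phoneme_ids, threshold=0.5):
    """Eq. (4.3): binarize the posteriors of the word's constituent
    phonemes over the interval, count 'present' frames along the best
    monotonic path that visits the phonemes in sequence, return c / N."""
    B = (posteriogram[:, phoneme_ids] > threshold).astype(int)  # (N, n)
    N, n = B.shape
    dp = np.full((N, n), -1)                   # -1 marks unreachable states
    dp[0, 0] = B[0, 0]
    for t in range(1, N):
        for k in range(n):
            stay = dp[t - 1, k]                       # remain in phoneme k
            move = dp[t - 1, k - 1] if k > 0 else -1  # advance from k-1
            best = max(stay, move)
            if best >= 0:
                dp[t, k] = best + B[t, k]
    c = dp[N - 1, n - 1]           # the path must end in the last phoneme
    return max(c, 0) / N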
The two confidence measures are finally combined using logistic regression. The
regressor is trained to predict a combined confidence from the word confidence and phoneme
occurrence confidence scores on a held-out data set.
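A sketch of this fusion (scikit-learn; the array names and the use of word-level correctness labels from the held-out set are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_confidences(cmax_held, cocc_held, correct_held, cmax_new, cocc_new):
    """Train a logistic regressor on held-out words (correct_held is 1 when
    the hypothesized word was right) and return combined confidences for
    new words.  Utterance-level scores are then the mean of the combined
    word scores, as in the data selection step described below."""
    reg = LogisticRegression()
    reg.fit(np.column_stack([cmax_held, cocc_held]), correct_held)
    return reg.predict_proba(np.column_stack([cmax_new, cocc_new]))[:, 1]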
4.4.3 Experiments and Results
For our experiments in low-resource settings, we use a randomly selected 1 hour
of transcribed data from the complete 15 hour Callhome English data set. In our semi-
supervised training experiments we consider the remaining 14 hours as untranscribed data
and attempt to use it.
Data selection
Using the ASR system trained with features from the cross-lingual DNN front-
end, the 14 hour data set is first decoded. Word lattices also produced during the decoding
process are used to generate word-confidences for each hypothesized word as described
above. The cross-lingual DNN front-end is also used to produce phoneme posterior outputs
from which phoneme occurrence based confidences are derived. Combination weights for
these confidence scores are then estimated by training a logistic regressor on a 45 minute
held-out data set with the set’s ground truth transcriptions.
After every hypothesized word in the decoded output has been given a score using
the trained logistic regression module, each utterance is assigned an utterance-level score.
This utterance level score is the average of all word-level scores in the utterance.
Table 4.3: Word Recognition Accuracies (%) at different word confidence thresholds
Threshold Word Rec. Acc. (%)
None 38.75
-0.1 39.5
+0.0 41.7
+0.1 42.7
+0.2 44.0
+0.3 45.5
+0.4 45.4
+0.5 44.6
To evaluate the usefulness of the proposed confidence selection scheme we generate
utterance level scores for the held out data. The word recognition accuracy (%) is then
evaluated on selected sentences at different threshold levels. Table 4.3 shows the word
recognition accuracies at different thresholds. As the threshold increases, fewer but more
reliable sentences are selected.
Semi-supervised training of DNNs
The initial cross-lingual DNN training experiments described earlier were based on
only 1 hour of transcribed data. For semi-supervised training of DNNs we include additional
data with noisy transcripts. These utterances are selected from the untranscribed data based
on their utterance level confidences.
To avoid detrimental effects from noisy semi-supervised data during discriminative training
of neural networks, we make the following design choices -
(a) During back-propagation training, the semi-supervised data is de-weighted. This is
done by multiplying its cross-entropy error with a small multiplicative factor during
training (see the sketch after this list),
Table 4.4: Word Recognition Accuracies (%) with semi-supervised pre-training
System Word Rec. Acc. (%)
Cross-lingual pre-training 41.0
Cross-lingual pre-training with semi-supervised data 42.7
(b) The semi-supervised data is used only in the final pre-training stage after all the layers
of the DNN have been created,
(c) Only a limited amount of semi-supervised data is added.
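A sketch of design choice (a) as a weighted cross-entropy (PyTorch; applying the weight per frame is an assumption - it could equally be applied per utterance):

import torch
import torch.nn as nn

def deweighted_loss(logits, targets, is_semi, semi_weight=0.3):
    """De-weight semi-supervised frames during back-propagation by scaling
    their cross-entropy with a small multiplicative factor (0.3 in the
    experiments below); fully supervised frames keep weight 1."""
    per_frame = nn.functional.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_semi,
                          torch.full_like(per_frame, semi_weight),
                          torch.ones_like(per_frame))
    return (weights * per_frame).mean()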
For our experiments we select about 4.5 hours of data using utterances with a
score of 0.3 and greater. This data is then combined with the cross-lingual pre-training
data set of 15 hours of German, 16 hours of Spanish and 1 hour of English. During the
DNN training, we use a multiplicative factor of 0.3 to de-weight the cross-entropy error
from the semi-supervised data.
The semi-supervised data is used in the final pre-training stage (Section 4.3.1,
step 3) to train both the DNN networks using PLP (351×1000×1000×25×52 network) and
FDLPM (252×1000×1000×25×52 network) features (Section 4.3.3). After pre-training,
both the networks are adapted with 1 hour of English as before. Bottleneck features from
both the networks are combined and used to train the low-resource ASR system with 1 hour
of data as before. Table 4.4 shows the performance of the system after using semi-supervised
data for pre-training.
Table 4.5: Word Recognition Accuracies (%) with semi-supervised acoustic model training
Hours of semi-supervised data added Word Rec. Acc. (%)
0 42.7
2 43.3
4 44.0
8 44.3
14 44.8
Semi-supervised training of Acoustic Models
Data-driven features extracted using the DNN front-end pre-trained with semi-supervised
data are used for semi-supervised training of the ASR system. Similar to the weighting
of semi-supervised data during DNN training, we also use a simple corpus weighting while
training the ASR systems. This is done by adding the 1 hour of fully supervised data with
accurate transcripts twice.
To understand the effect of the semi-supervised data, we evaluate the recognition
performance using different amounts of semi-supervised data. From Table 4.5 we observe
that as we double the amount of semi-supervised data, there is roughly a 0.5% absolute
increase in performance.
4.5 Conclusions
In this chapter we have shown how complex neural network architectures can be
built in low-resource settings. Using large amounts of multilingual data, we have shown
that task independent data can significantly improve performance in low-resource settings.
Training using task independent data compensates for the limited amounts of
transcribed task specific data available in these settings. Both the deep and wide networks trained
in this fashion improve word recognition accuracies significantly.
Chapter 5
Applications of Data-driven
Front-end Outputs
In the previous chapters, the outputs of data-driven front-ends were used as features for
automatic speech recognition. In this chapter, we describe how these front-ends can be used
in other applications - to derive features for speech activity detection, combination weights
in neural network based speaker recognition models, feature representations for zero resource
speech applications and event detectors for speech recognition.
5.1 Application 1 - Speech Activity Detection
5.1.1 Overview
Speech activity detection (SAD) is the first step in most speech processing ap-
plications like speech recognition, speech coding and speaker verification. This module is
an important component that helps subsequent processing blocks focus resources on the
speech parts of the signal. In each of these applications, several approaches have been
used to build reliable SAD modules. These techniques are usually variants of decision rules
based on features from the audio signal like signal energy [131], pitch [132], zero crossing
rate [133] or higher order statistics in the LPC residual domain [134]. Acoustic features
have also been used to train multi-layer perceptrons (MLPs) [135] and hidden Markov
models (HMMs) [136] to differentiate between speech and non-speech classes. All these
approaches in essence focus on characteristic attributes of speech which differentiate it from
other acoustic events that can appear in the signal.
5.1.2 Data-driven Features for SAD
Traditionally acoustic features derived from the spectrum of speech have been
used to differentiate between speech and other acoustic events. In a different approach, we
train MLPs on large amounts of data to differentiate between two classes - speech versus
non-speech (S/NS). Instead of using these models to directly produce S/NS decisions, the models
are used as data-driven front-ends to derive features for SAD.
The proposed front-end has a multi-stream architecture with several levels of MLPs
[137]. The motivation behind this multi-stream front-end is to use parallel streams of data
that carry complementary or redundant information while at the same time degrading
differently in noisy environments [138]. We form 3 feature streams by dividing the sub-
band trajectories derived using FDLP on a mel-scale with 45 filters equally into 3 groups.
Similar to deriving short-term spectral features, we then integrate the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). We also use a context of
about 1 second by appending 50 frames from the right and left with each sub-band feature
vector to form TRAP like features [65]. The two other streams are formed by dividing the
14 modulation features into 2 groups - the first 5 DCT coefficients corresponding to slow
modulations and the remaining 5 coefficients corresponding to fast modulations.
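A sketch of the context-appending step for a single sub-band trajectory (NumPy; repeating the boundary frames at the edges is an assumption about padding):

import numpy as np

def trap_features(subband_energies, context=50):
    """Build TRAP-like vectors: for each frame, append `context` frames
    from the left and right of one sub-band energy trajectory (~1 s of
    context at a 10 ms frame shift).  subband_energies is (T,) for one
    band; edges are padded by repeating the boundary frames."""
    padded = np.pad(subband_energies, context, mode="edge")
    T = len(subband_energies)
    return np.stack([padded[t:t + 2 * context + 1] for t in range(T)])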
5.1.3 Experiments and Results
Speech activity detection is carried out on the proposed features in three main
steps. In the first step, the input frame-level features are projected to a lower-dimensional
space. The reduced features are then used to compute per-frame log likelihood scores with
respect to speech and non-speech classes, each class being represented separately by a GMM.
The frame level log likelihood scores are mapped to S/NS classification decisions to produce
final segmentation outputs in the last step. Figure 5.1 is a brief schematic of the proposed
approach and the processing pipeline for SAD. Each of these steps is described in detail
in [139].
The proposed features are evaluated in terms of speech activity detection (SAD)
accuracy on noisy radio communications audio provided by the Linguistic Data Consortium
(LDC) for the DARPA RATS program [140, 141]. The audio data for the DARPA RATS
program is collected under both controlled and uncontrolled field conditions over highly
degraded, weak and/or noisy communication channels making the SAD task very challeng-
ing [140]. Most of the RATS data released for SAD were obtained by retransmitting existing
audio collections - such as the DARPA EARS Levantine/English Fisher conversational tele-
phone speech (CTS) corpus - over eight radio channels, labeled A through H covering a
wide range of radio channel transmission effects.
Figure 5.1: Schematic of (a) features and (b) the processing pipeline for speech activity detection.
The development corpus used in our SAD experiments consists of 11 hours of
audio from the Arabic Levantine and English Fisher CTS corpus, retransmitted over the
eight channels. The training corpus consists of 73 hours of audio (62 hours from the Fisher
collection, and 11 from new RATS collection). Although the entire data was also retrans-
mitted over eight channels, since some data from channel F was unusable, all data from
that channel was excluded from both training and development.
The MLPs used for extracting data-driven features are trained on close to 660
hours of audio from the RATS development corpus using LDC provided S/NS annotations.
Outputs from these 5 sub-systems are then fused by a merger MLP at the second level to
derive the final S/NS posterior features. These features are derived from the pre-softmax
Dimensionality Equal Error Rate (%) on different channels