Feature Selection, Acoustic Modeling and Adaptation SDSG REVIEW of recent WORK Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos
Dec 20, 2015
Feature Selection, Acoustic Modeling and Adaptation
SDSG REVIEW of recent WORK
Technical University of Crete
Speech Processing and
Dialog Systems Group
Presenter: Alex Potamianos
TUC - SDSG
Outline
• Prior Work – Adaptation– Acoustic Modeling– Robust Feature Selection
• Bridge over to HIWIRE work-plan– Robust Features, Acoustic Modeling, Adaptation– New areas: audio-visual, microphone arrays
TUC - SDSG
Adaptation
• Transformation-based adaptation• MAP Adaptation (Bayesian learning
approximation)• Speaker Clustering / Speaker space
models.• Robust Feature Selection• Combinations
TUC - SDSG
Acoustic Model Adaptation: SDSG Selected Work
• Constrained Estimation Adaptation• Maximum Likelihood Stochastic
Transformations• Combined Transformation-MAP adaptation• MLST Basis Vectors• Incremental Adaptation• Dependency modeling of biases• Vocal Tract Norm. with Linear Transformation
TUC - SDSG
Constrained Estimation Adaptation (Digalakis 1995)
• Hypothesize a sequence of feature-space linear transformations:
• Adapted models (A) are then:
• diagonal.
• Adaptation is equivalent to estimating the state dependent
stst byAx
ss bA ,
))(,;()|()|(1
Tsisssisst
N
iitA ASAbmAxNspsxP
sA
TUC - SDSG
Compared to MLLR (Leggeter 1996)
• Both published at the same time.
• MLLR is only model adaptation.
• MLLR transforms only the model means
• in MLLR is block diagonal.
• Constrained estimation is more generic.
),;()|()|(1
ississt
N
iitA SbmAxNspsxP
sA
TUC - SDSG
Limitations of the Linear Assumption
• Linear assumption may be too restrictive in modeling the training testing dependency.
• Goal: Try a more complex transformation.
• All Gaussians in a class are restricted to be transformed identically using the same transformation.
• Goal: Let each Gaussian in a class to decide for its own transformation.
• Which transformation transforms each Gaussian is predefined.
• Goal: Let the system to automatically choose the transformation-Gaussian couples.
TUC - SDSG
ML Stochastic Transformations (MLST) (Diakoloukas Digalakis 1997)
• Hypothesize a sequence of feature-space stochastic transformations of the form:
NiNsNtsN
ists
ists
t
lspbyA
lspbyA
lspbyA
x
),|(y probabilit with,
),|(y probabilit with,
),|(y probabilit with,
2222
1111
TUC - SDSG
MLST: model-space
• Use a set of MLSTs instead of linear transformations. • Adapted observation densities:
– MLST-Method I
• is diagonal
– MLST-Method II
• is block diagonal
)ASA,bmA;N(xs)|p()s,|p(s)|(xP Tjsisjsjsisjst
N
1i
N
1=jiijtA
sjA
)S,bmA;N(xs)|p()s,|p(s)|(xP isjsissjt
N
1i
N
1=jiijtA
sjA
TUC - SDSG
MLST: Reduce the number of mixture components
• The adapted mixture densities consist of Gaussians.
• Reduce the Gaussians back to their SI number:– HPT: Apply the component transformation with the highest
probability to each Gaussian.
– LCT: Linear combination of all component transforms.
– MTG: Merge the transformed Gaussians.
NN
N
jsjijs
N
jsjijs bspbAspA
11
),|( and ),|(
2)(
1
2)(2)(2)(
1
)()(
~),|(~
),|(~
di
N
j
dji
djiij
di
N
j
djiij
di
sp
sp
TUC - SDSG
MLST properties
• Asj, bsj are shared at a state or state-cluster level
• Transformation weights lj are estimated at a Gaussian level
• MLST combines transformed Gaussians • MLST is flexible on how to select a transformation for
each Gaussian.• MLST chooses arbitrary number of transformations
per class.
TUC - SDSG
MLST compared to ML Linear Transforms
• Hard versus Soft decision: – Choose the linear component based on the
training samples.
• Adaptation Resolution: – Linear components are common to a
transformation class– Choose the transformation at a Gaussian level– Increased adaptation resolution - robust
estimation
TUC - SDSG
MLST basis transforms (Boulis Diakoloukas Digalakis 2000)
• Algorithm steps:– Cluster the training speaker space into classes– Train MLST component transforms using data from
each training speaker class – Adaptation data is used to estimate the
transformation weight
• It is like having a-priori knowledge to the estimation process
• Results in rapid speaker adaptation• Significant gains for medium and small data
sets
TUC - SDSG
Combined Transformation Bayesian (Digalakis Neumeyer 1996)
• MAP estimation can be expressed as:
• Retain the asymptotic properties of MAP
• Retain fast adaptation rates of transformations.
adaptpriorMAP ww )1(
TUC - SDSG
Rapid Speech Recognizer Adaptation (Digalakis et.al 2000)
• Dependence models of the bias components of cascaded transforms. Techniques:– Gaussian multiscale process– Hierarchical tree-structured prior– Explicit correlation models– Markov Random Fields
TUC - SDSG
VTN with Linear Transformation(Potamianos and Rose 1997, Potamianos and Narayanan 1998)
• Vocal Tract Normalization:
Select optimal warping factor according to
= arg max P(Xª|a, , H)
where H is the transcription, and Xª frequency
warped observation vector by factor a.• VTN with linear transformation
{, } = arg max P(Xª|a, , , H)
where h() is a parametric linear transformation
with parameter
TUC - SDSG
Acoustic Modeling:SDSG Selected Work
• Genones: Generalized Gaussian mixture
tying scheme
• Stochastic Segment Models (SSMs)
TUC - SDSG
Genones: Generalized Mixture Tying (Digalakis Monaco Murveit 1996)
• Algorithm Steps:– Clustering of HMM states based on the similarity of
their distributions– Splitting: Construct seed codebooks for each state
cluster• Either identify the most likely mixture component subset • Or cluster down the original codebook
– Reestimation of the parameters using Baum-Welch
• Better trade-off between modelling resolution and robustness
• Genones are used in Decipher and Nuance
TUC - SDSG
Segment Models
• HMM limitations:– Weak duration modelling– Conditional independence of observations assumption– Restrictions on feature extraction imposed by frame-based
observations
• Segment models motivation:– Larger number of degrees of freedom in the model– Use segmental features– Model correlation of frame-based features – Powerful modelling of transitions and longer-range speech
dynamics– Less distortion for segmental coding segmental recognition
more efficient
TUC - SDSG
General Stochastic Segment Models
• A segment s in an utterance of N frames iss = {(τa , τb): 1≤ τa ≤ τb ≤ N}
• Segment model density:
• Segment models generate a variable-length sequence of frames
)|()()|(),|,...,()|,,...,( 1,11 alpybalpalyypalyyp llall
TUC - SDSG
Stochastic Segment Model (Ostendorf Digalakis 1992)
• Problem: Model time correlation within a segment
• Solution: Gaussian model variations based on assumptions about the form of statistical dependency– Gauss-Markov model– Dynamical System model– Target State model.
TUC - SDSG
SSM Viterbi Decoding (Ostendorf Digalakis Kimball 1996)
• HMM Viterbi recognition:• State to Word sequence mapping:
• SSM analogous solution:
• Map the segment label sequence to the appropriate word sequence:
1 1 11arg max ( | ) ( )ˆ
TT T Tp pys s s
1 1( )ˆ ˆ
T Thw s
11
ˆ
1 1 1 1 1 11( , ) { ( | , ) ( | ) ( )}
,
ˆ arg max maxˆN N
TN N N N N Np p
N a lp ya l a l a aN
1 1( )ˆ ˆ
K Nfw a
TUC - SDSG
From HMMs to Segment Models(Ostendorf Digalakis 1996)
• Unified view of stochastic modeling
• General stochastic model that encompasses most SM type models
• Similarities in terms of correlation and parameter tying assumptions
• Analogies between segment models and HMMs
TUC - SDSG
Robust Feature Selection
• Time-Frequency Representation for ASR(Potamianos and Maragos 1999)
• Confidence Measure Estimation for ASR Features
sent over wireless channels (“missing features”)(Potamianos and Weerackody 2001)
• AM-FM Model Based Features(Dimitriadis et al 2002)
TUC - SDSG
Other Work
• Multiple source separation using microphone
arrays (Sidiropoulos et al. 2001)
TUC - SDSG
Prior Work Overview
MLST.Constr. Est. Adapt.
MAP (Bayes) Adapt.
GenonesSegment Models
VTLN
Combinations
Robust Features
TUC - SDSG
HIWIRE Work Proposal
AdaptationBayes optimal class.
Audio Visual ASRBaseline experiments
Microphone ArraysSpeech/Noise Separation
Feature SelectionAM-FM Features
Acoustic ModelingSegment Models
TUC - SDSG
Bayes optimal classification (HIWIRE proposal)
• Classifier decision for a test data vector xtest:
• Choose the class that results in the highest value:
),...,,|(maxarg)( 21jN
jjtest
jtest xxxxpcxc
dxxxpxpxxxxp jN
jjtest
jN
jjtest ),...,,|()|(),...,,|( 2121
TUC - SDSG
Bayes optimal versus MAP
• Assumption: the posterior is sufficiently peaked around the most probable point
• MAP approximation:
• θMAP is the set of parameters that maximize:
)|(),...,,|( 21 MAPtestjN
jjtest xpxxxxp
)}()|,...,,({maxarg),...,,|(maxarg 2121
pxxxpxxxp NNMAP
TUC - SDSG
Why Bayes optimal classification
• Optimal classification criterion• The prediction of all the parameter hypotheses is
combined• Better discrimination• Less training data • Faster asymptotic convergence to the ML estimate
• However: – Computationally more expensive– Difficult to find analytical solutions– ....hence some approximations should still be
considered
TUC - SDSG
Segment Models
• Phone Transition modeling
– New features
• Combine with HMMs
• Parametric modeling of feature trajectories