Articulatory Feature-based Speech Recognition:
A Proposal for the 2006 JHU Summer Workshop on Language Engineering

Potential team members to date: Karen Livescu (presenter), Simon King,
Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko

November 12, 2005

[Figure: dynamic Bayesian network over articulatory feature streams
LIP-OP, TT-OPEN, TT-LOC, TB-OPEN, VELUM, GLOTTIS]
Motivations
• Why articulatory feature-based ASR?
  – Improved modeling of co-articulatory pronunciation phenomena
  – Takes advantage of knowledge of human perception and production
  – Application to audio-visual modeling
  – Application to multilingual ASR
  – Evidence of improved ASR performance with feature-based models
    * In noise [Kirchhoff et al. 2002]
    * For hyperarticulated speech [Soltau et al. 2002]
  – Potential savings in training data
• Why this workshop project?
  – Growing number of sites investigating complementary aspects of this idea; a non-exhaustive list:
    * U. Edinburgh (King et al.)
    * UIUC (Hasegawa-Johnson et al.)
    * MIT (Livescu, Glass, Saenko)
– Recently developed tools (e.g. graphical models) for systematic exploration of the model space
The challenge of pronunciation variation
Observed pronunciation variants (counts in parentheses):

probably:  (2) p r aa b iy; (1) p r ay; (1) p r aw l uh; (1) p r ah b iy;
           (1) p r aa l iy; (1) p r aa b uw; (1) p ow ih; (1) p aa iy;
           (1) p aa b uh b l iy; (1) p aa ah iy
sense:     (1) s eh n t s; (1) s ih t s
everybody: (1) eh v r ax b ax d iy; (1) eh v er b ah d iy; (1) eh ux b ax iy;
           (1) eh r uw ay; (1) eh b ah iy
don't:     (37) d ow n; (16) d ow; (6) ow n; (4) d ow n t; (3) d ow t;
           (3) d ah n; (3) ow; (3) n ax; (2) d ax n; (2) ax; (1) n uw;
           (1) n; (1) t ow; (1) d ow ax n; ...
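Counts like those above can be tallied directly from phone-level transcriptions. A minimal Python sketch; the transcription strings here are illustrative stand-ins, not actual Switchboard data:

```python
from collections import Counter

# Hypothetical phone-level transcriptions of one word, one observed
# pronunciation per entry, written as a space-separated phone string.
transcriptions = ["s eh n t s", "s ih t s", "s eh n t s"]

# Tally how often each surface pronunciation occurs.
variants = Counter(transcriptions)

# Print variants in the slide's "(count) phones" style, most frequent first.
for phones, count in variants.most_common():
    print(f"({count}) {phones}")
```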
• Noted as an obstacle for recognition of conversational speech [McAllaster et al. ‘98, Saraçlar et al. ‘00]
– Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘98]
– Recognizer errors are correlated with reduced pronunciations [Fosler-Lussier ’99]
• Phonetic transcription of conversational pronunciations [Greenberg et al. ‘96]
• Product observation models combining phones and features, p(obs|s) = p(obs|ph_s) · Π_i p(obs|f_s,i), improve ASR in some conditions
– [Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002]
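In the log domain the product model above is just a sum of per-stream scores. A minimal sketch; the function name and the likelihood values are invented for illustration:

```python
import math

def product_obs_logprob(phone_logprob, feature_logprobs):
    """log p(obs|s) = log p(obs|ph_s) + sum_i log p(obs|f_s,i),
    i.e. the product observation model computed in the log domain."""
    return phone_logprob + sum(feature_logprobs)

# Invented per-stream likelihoods for one state s.
lp_phone = math.log(0.5)                   # p(obs | ph_s)
lp_feats = [math.log(0.8), math.log(0.9)]  # p(obs | f_s,i), one per feature

score = product_obs_logprob(lp_phone, lp_feats)
print(math.exp(score))  # ≈ 0.36 = 0.5 * 0.8 * 0.9
```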
• Lexical access from manual transcriptions of Switchboard words using the DBN model above [Livescu & Glass 2004, 2005]
  – Improves over phone-based pronunciation models (~50% → ~25% error)
  – Preliminary result: Articulatory phonology features preferable to IPA-style (place/manner) features
• JHU WS'04 project [Hasegawa-Johnson et al. 2004]
  – Can combine landmarks + IPA-style features at the acoustic level with articulatory phonology features at the pronunciation level
• Articulatory recognition using DBN and ANN/DBN models [Wester et al. 2004, Frankel et al. 2005]
  – Modeling inter-feature dependencies is useful; asynchrony may also be useful
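One way to make the asynchrony notion concrete: measure, frame by frame, how far apart two feature streams' sub-word state indices are, and bound or penalize the maximum. A toy sketch; the stream values are invented:

```python
def max_async(stream_a, stream_b):
    """Largest per-frame lag between two feature streams' sub-word
    state indices; a DBN can hard-bound or softly penalize this."""
    assert len(stream_a) == len(stream_b)
    return max(abs(a - b) for a, b in zip(stream_a, stream_b))

# Invented alignment: the lip stream runs one position ahead of the tongue.
lips   = [0, 1, 1, 2, 2, 3]
tongue = [0, 0, 1, 1, 2, 3]
print(max_async(lips, tongue))  # 1
```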
• Lipreading using multistream DBN model + SVM feature detectors
  – Improves over viseme-based models in medium-vocabulary word ranking and a realistic small-vocabulary task [Saenko et al. 2005]
Ongoing work: Audio-visual ASR
[Figure: two audio-visual model structures. Left, phoneme-viseme based:
an audio state (phoneme) stream over audio observations A and a visual
state (viseme) stream over visual observations V. Right, articulatory
feature-based: lip (L), tongue (T), and glottis/velum (G) feature streams
with asynchrony variables (asyncLT, asyncTG) and synchrony checks
(checkSyncLT, checkSyncTG).]

Sample alignment from a prototype feature-based system:
[Figure: spectrogram and mouth images aligned against L-phone, T-phone,
and G-phone streams]
A partial taxonomy of design issues
[Figure: decision tree over the design space. First split: factored state
(multistream structure)? If no, split on factored observation model, with
a choice of observation model (GM, SVM, NN) [Metze '02, Kirchhoff '02,
Juneja '04] vs. [Deng '97, Richardson '00]. If yes, split on state
asynchrony: coupled state transitions (CHMMs [Kirchhoff '96, Wester et
al. '04]), free within unit (FHMMs [Livescu '05]), soft asynchrony within
word [Livescu '04, WS04], or cross-word soft asynchrony; each branch
splits further on factored observation model (Y/N) and context dependence
(CD), with many combinations as yet unexplored (???).]
(Not to mention choice of feature sets... same in hidden structure and observation model?)
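The taxonomy's axes can be enumerated mechanically to gauge how large the space is. The axis names and values below are a rough paraphrase of the tree, not an exhaustive or official list:

```python
from itertools import product

# Rough paraphrase of the design axes sketched in the taxonomy above.
axes = {
    "factored_state": [False, True],
    "asynchrony": ["coupled transitions", "free within unit",
                   "soft within word", "cross-word soft"],
    "factored_obs": [False, True],
    "context_dependent": [False, True],
}

# Naive cross product (it ignores that the asynchrony choice only applies
# to factored-state models), just to see how many combinations there are.
configs = [dict(zip(axes, vals)) for vals in product(*axes.values())]
print(len(configs))  # 2 * 4 * 2 * 2 = 32
```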
Goals for 2006 workshop
• To build complete articulatory feature-based ASR systems
  – Using multistream DBN structures
  – For both audio-only and audio-visual ASR
• To develop a thorough understanding of the design issues involved
– Asynchrony modeling
– Context modeling
– Speaker dependency
– Generative observation modeling vs. discriminative feature classification
Potential participants and contributors
• Local participants:
  – Karen Livescu, MIT
    * Feature-based ASR structures, graphical models, GMTK
  – Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign
    * Discriminative feature classification, JHU WS'04
  – Simon King, U. Edinburgh