18.0 Some Recent Developments in NTU
Reference:
1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, May 2005, pp. 399-411.
2. “Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, pp. 197-200.
3. “Extension and Further Analysis of Higher Order Cepstral Moment Normalization (HOCMN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
4. “Powered Cepstral Normalization (P-CN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
5. “Improved Spontaneous Mandarin Speech Recognition by Disfluency Interruption Point (IP) Detection Using Prosodic Features”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp. 1621-1624.
6. “Prosodic Modeling in Large Vocabulary Mandarin Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
7. “Latent Prosodic Modeling (LPM) for Speech with Applications in Recognizing Spontaneous Mandarin Speech with Disfluencies”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
8. “Entropy-based Feature Parameter Weighting for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
9. “A New Framework for System Combination Based on Integrated Hypothesis Space”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
10. “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
11. “Analytical Comparison between Position Specific Posterior Lattices and Confusion Networks Based on Words and Subword Units for Spoken Document Indexing”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
12. “A Multi-Modal Dialogue System for Information Navigation and Retrieval across Spoken Document Archives with Topic Hierarchies”, IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Nov.-Dec. 2005.
13. “Efficient Interactive Retrieval of Spoken Documents with Key Terms Ranked by Reinforcement Learning”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept. 2006.
14. “Type-Ⅱ Dialogue Systems for Information Access from Unstructured Knowledge Sources”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
15. “Histogram-Based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp. 957-960.
16. “Joint Uncertainty Decoding (JUD) with Histogram-Based Quantization (HQ) for Robust and/or Distributed Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
Role of Spoken Language Processing under Network Environment
[Figure: user ↔ user interface ↔ Internet ↔ content analysis, with user–content interaction across the network]
• User Interface — when keyboards/mice are inadequate
• Content Analysis — helps in browsing/retrieval of multimedia content
• User-Content Interaction — all text-based interaction can be accomplished by spoken language
Hierarchy of Research Areas
[Figure: hierarchy of research areas, with the numbers 1–15 marking where the references above fit]
• Applications: Multimedia Technologies; Spoken Dialogue; Speech-based Information Retrieval; Dictation & Transcription
• Integrated Technologies: Distributed Speech Recognition and Wireless Environment; Multilingual Speech Processing; Information Indexing & Retrieval; Spoken Document Understanding and Organization; Text-to-Speech Synthesis; Speech/Language Understanding
• Applied Technologies: Decoding & Search Algorithms; Linguistic Processing & Language Modeling; Wireless Transmission & Network Environment; Keyword Spotting; Speech Recognition Core
• Basic Technologies: Robustness (noise/channel, feature/model); Hands-free Interaction (acoustic reception, microphone array, etc.); Speaker Adaptation & Recognition; Acoustic Processing (features, modeling, etc.); Prosodic Modeling; Spontaneous Speech Processing (pronunciation modeling, disfluencies, etc.)
Segmental Eigenvoice
– Decompose the supervectors into sub-supervectors, from which sub-eigenspaces can be constructed; better performance can therefore be obtained with more adaptation data
[Figures: Segmental Eigenvoice (1/3)–(3/3)]
Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition
— to reduce the mismatch between the statistical characteristics of training and testing corpora by normalizing the cepstral moments
Cepstral Moment Normalization
• Moment Estimation — time average as the N-th moment of the MFCC parameters about the origin:
  E[X(n)^N] ≈ (1/T) Σ_{k=0}^{T−1} X(k)^N
• Cepstral Normalization:
  – For odd order L: E[X(n)^L] = 0
  – For even order N: E[X(n)^N] = M_N
• Example: CMS for L = 1
• Example: CMVN for N = 1 and 2
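As a rough sketch of the idea (not the exact update rule from the paper — the iterative shift-and-scale scheme and all names here are illustrative), odd-order moments can be driven toward 0 and even-order moments toward those of a standard Gaussian:

```python
import math
import numpy as np

def hocmn(cepstra, odd_orders=(1, 3), even_orders=(2,), n_iter=5):
    """Approximate higher-order cepstral moment normalization.

    Each odd-order moment of every feature dimension is shifted toward 0,
    and each even-order moment is scaled toward the corresponding moment
    of N(0,1); iterating is needed because each correction slightly
    perturbs the other moments.
    cepstra: (T, D) array of MFCC frames.
    """
    x = np.asarray(cepstra, dtype=float).copy()
    for _ in range(n_iter):
        for L in odd_orders:
            m = (x ** L).mean(axis=0)              # E[X^L], per dimension
            x -= np.sign(m) * np.abs(m) ** (1.0 / L)
        for N in even_orders:
            # N-th moment of N(0,1): (N-1)!! = N! / (2^(N/2) * (N/2)!)
            target = math.factorial(N) / (2 ** (N // 2) * math.factorial(N // 2))
            m = (x ** N).mean(axis=0)              # E[X^N], per dimension
            x /= (m / target) ** (1.0 / N)
    return x
```

With `odd_orders=(1,)` and `even_orders=(2,)` this reduces exactly to CMS followed by variance normalization, i.e. CMVN, matching the two examples above.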
Higher Order Cepstral Moment Normalization (HOCMN)
• Aurora 2, clean-condition training; word accuracy averaged over 0–20 dB and all types of noise (sets A, B, C)
[Figure: accuracy comparison of CN and CMVN (each also with l = 86) against CTN = HOCMN[1,2,3] and CTN = HOCMN[1,3,2]]
Skewness and Kurtosis (1)
• Skewness
  – Third moment about the mean, normalized to the standard deviation
  – Departure of the pdf from symmetry
    • Positive/negative indicates skew to right/left
    • Zero indicates symmetric
• Kurtosis
  – Fourth moment about the mean, normalized to the standard deviation
  – Peaked, or “flat with tails of large size”, as compared to a standard Gaussian
    • “3” is the fourth moment of N(0,1)
    • Above/below 3 indicates more peaked with heavy tails / flatter
Skewness and Kurtosis (2)
• Define: Generalized Skewness of Odd Order L
  S_L = E[(X − μ)^L] / σ^L,  L an odd integer
  – L not necessarily 3
  – Similar meaning as skewness (skew to right or left), except in the sense of the L-th moment
• Define: Generalized Kurtosis of Even Order N
  K_N = E[(X − μ)^N] / σ^N,  N an even integer
  – N not necessarily 4
  – Similar meaning as kurtosis (peaked or flat), except in the sense of the N-th moment
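A minimal sketch of these two statistics (the function names are mine):

```python
import numpy as np

def generalized_skewness(x, L=3):
    """L-th central moment normalized by sigma**L; L odd.

    Reduces to ordinary skewness for L = 3.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** L).mean() / sigma ** L

def generalized_kurtosis(x, N=4):
    """N-th central moment normalized by sigma**N; N even.

    Reduces to ordinary kurtosis for N = 4.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** N).mean() / sigma ** N
```

For a standard Gaussian sample, `generalized_skewness` is near 0 for any odd L, and `generalized_kurtosis(x, 4)` is near 3, the reference value quoted above.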
Skewness and Kurtosis (3)
• Normalizing an odd-order moment constrains the pdf to be symmetric about the origin
  – In the sense of the L-th moment
• Normalizing an even-order moment constrains the pdf to be “equally flat, with tails of equal size”, as compared to a standard Gaussian
  – In the sense of the N-th moment
• The orders of the normalized moments are not necessarily integers
Generalized Moments with Non-integer Orders
• Generalized Moments
  – Type 1: reduces to the odd-order moment when u is an odd integer L (example: L = 1 or 3)
  – Type 2: reduces to the even-order moment when u is an even integer N (example: N = 2 or 4)
  – HOCMN with non-integer moment orders
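The Type 1/Type 2 formulas did not survive extraction; a natural reconstruction consistent with the reduction properties stated above uses sign(x)·|x|^u and |x|^u respectively — treat these definitions as my assumption:

```python
import numpy as np

def generalized_moment_type1(x, u):
    """Type 1 (assumed form): E[sign(X) * |X|**u].

    Equals the ordinary moment E[X**u] whenever u is an odd integer.
    """
    x = np.asarray(x, dtype=float)
    return (np.sign(x) * np.abs(x) ** u).mean()

def generalized_moment_type2(x, u):
    """Type 2 (assumed form): E[|X|**u].

    Equals the ordinary moment E[X**u] whenever u is an even integer.
    """
    x = np.asarray(x, dtype=float)
    return (np.abs(x) ** u).mean()
```

Both are well defined for any real u > 0, which is what allows HOCMN to use non-integer moment orders.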
PDF Analysis
• HEQ
  – Over-fitted to Gaussian; original statistics lost
• HOCMN
  – Fits the generalized skewness and kurtosis of a few orders only
  – Retains more of the original characteristics
[Figure: pdfs of the original C0 & C1 features, after HEQ, and after HOCMN]
Use of Prosody in Recognition and Handling Disfluencies in Spontaneous Speech
— prosody may be useful in recognition, and in particular in handling disfluencies in spontaneous speech
Prosodic Features (Ⅰ) — Pitch-related Features
• The average pitch value within the syllable
• The maximum difference of pitch values within the syllable
• The average of the absolute values of pitch variations within the syllable
• The magnitude of pitch reset at boundaries
• The differences of such feature values across adjacent syllable boundaries (P1−P2, d1−d2, etc.)
• A total of 54 pitch-related features were obtained

Prosodic Features (Ⅱ) — Duration-related Features
[Figure: syllables A, B, C, D, E around two syllable boundaries, with pause durations a and b, between the beginning and end of the utterance]
• Pause duration: b
• Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
• Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
• Combinations of pause & syllable features (ratio or product): C·b, D·b, C/b, D/b
• Lengthening: C/((A+B)/2)
• Standard deviation of the feature values
• A total of 38 duration-related features were obtained
Recognition Framework with Prosodic Modeling
• Rescoring Formula:
  S(W) = log P(X|W) + λl·log P(W) + λp·log P(F|W)
  – λl, λp: weighting coefficients; P(F|W) is given by the prosodic model
• Two-pass Recognition
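In the second pass, this rescoring formula re-ranks the first-pass hypotheses; a sketch with made-up scores and weights:

```python
def rescore(acoustic_lp, lm_lp, prosody_lp, lam_l=0.8, lam_p=0.3):
    """S(W) = log P(X|W) + lam_l * log P(W) + lam_p * log P(F|W)."""
    return acoustic_lp + lam_l * lm_lp + lam_p * prosody_lp

# N-best list: (hypothesis, log P(X|W), log P(W), log P(F|W)) -- illustrative numbers
hyps = [
    ("hyp A", -120.0, -30.0, -12.0),
    ("hyp B", -118.0, -35.0, -15.0),
]
best = max(hyps, key=lambda h: rescore(h[1], h[2], h[3]))
```

Here "hyp B" wins on the acoustic score alone, but the language-model and prosodic terms flip the decision to "hyp A"; tuning λl and λp on held-out data is what makes the combination useful.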
Prosodic Feature Extraction from Paths in the Word Graph
• PSPL: locate a word in a segment according to the order of the word in a path
• CN: cluster several words into a segment according to similar time spans and word pronunciations
OOV/Rare Word Problem
• OOV word W = w1w2w3w4 and a lattice L of document D
  – wi: subword units
• W never appears in L
  – D can never be found under word-based PSPL
• But W = w1w2w3w4 is hidden in L at the subword level
• Subword-based PSPL (S-PSPL)
[Figure: word lattice L over a time index; arcs such as “a·w1w2”, “w1w2”, “w2w3”, “w3w4·b”, “w3w4·bcd”, “w3w4·e” contain w1w2w3w4 only as a subword sequence spanning several arcs]
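A toy sketch of this idea (one lattice path per entry, no posterior weighting; all names are illustrative): index every subword unit by its position along a path, so that a query absent at the word level can still be matched as a run of consecutive subword positions.

```python
from collections import defaultdict

def subword_pspl_index(paths):
    """Index: subword unit -> set of positions along a path.

    paths: list of lattice paths; each path is a list of words, and each
    word is a list of its subword units. A stand-in for real
    posterior-weighted S-PSPL position bins.
    """
    index = defaultdict(set)
    for path in paths:
        pos = 0
        for word in path:
            for unit in word:
                index[unit].add(pos)
                pos += 1
    return index

def contains_query(index, query_units):
    """True if the query's subword units occur at consecutive positions."""
    if not all(u in index for u in query_units):
        return False
    return any(all(s + i in index[u] for i, u in enumerate(query_units))
               for s in index[query_units[0]])
```

For a path whose words are ["a"], ["w1","w2"], ["w3","w4","b"], the word w1w2w3w4 never appears as a lattice arc, yet `contains_query` finds it at the subword level — exactly the OOV case above.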
Subword-based PSPL and CN
[Figure: a lattice represented by subword arcs (w1_1, w1_2, …, w10_2) over a time index; the resulting S-PSPL structure lists subword units with probabilities per position bin (cluster 1, cluster 2, …, cluster 8), and the S-CN structure lists subword units with probabilities per cluster]
Performance Comparison
[Figure: MAP (0.54–0.89) versus index size (0–20 MB) for PSPL and CN indexing based on words, characters, and syllables]
Interactive Retrieval of Spoken Documents by Topic Hierarchy
• Interactive Process between User and Content for Spoken Document Retrieval
• Given the User’s Initial Query, the Extracted Key Terms can be Many, even in a Hierarchy
  – Ranking the key terms will be helpful for efficient retrieval
[Figure: the user interacts through multi-modal dialogue (query/instruction) with a retrieval system over a spoken document archive; retrieved documents are organized by a topic hierarchy]
[Figure: key terms ti, tj, tk, tl in the key-term space define dialogue states s1 = [ti], s2 = [ti, tj], s3 = [ti, tk], …, sn = [ti, tj, tl], which map to document groups G1 = C(ti), G2 = C(ti + tj), G3 = C(ti + tk), …, Gn = C(ti + tj + tl) in the archive space]
Query Term Suggestions and Improved Interaction by Dialogue Modeling
• The mapping from key-term states to document groups is defined by some IR function (e.g., PLSA)
• States: s1, s2, s3, …; actions: ti, tj, tk, … (state s1 plus action tj leads to state s2)
[Figure: key-term space, archive space and document space connected by this mapping]
• A State Transition Diagram Generated for Each User Given the Initial Query s1
• User Assumed Satisfied (Double Circles) when Recall Rate = L/|D| > τ0
  – L: number of relevant documents appearing in the top K retrieved documents
  – D: desired document set
  – m(s) = Minimum Number of Steps or Queries to Arrive at the Final State
Learning User’s Behavior in Retrieval by a Large Number of Simulated Users
[Figure: example state transition diagram over states s1–s15, annotated with minimum step counts such as m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5]
Goal: to minimize the number of steps to arrive at the final state
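On such a diagram, m(s) is simply a shortest-path quantity; the ranking itself is learned by reinforcement learning, but m(s) can be computed by breadth-first search, as in this sketch over a made-up toy transition graph:

```python
from collections import deque

def min_steps_to_final(graph, start, finals):
    """m(s): minimum number of key-term queries from state `start` to any
    satisfied (final) state, by BFS over the state-transition diagram."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s in finals:
            return dist[s]
        for t in graph.get(s, []):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return None  # no final state reachable

# toy state-transition diagram (illustrative only)
graph = {"s1": ["s2", "s3"], "s2": ["s4"], "s3": ["s4", "s7"], "s4": ["s7"]}
```

Here `min_steps_to_final(graph, "s1", {"s7"})` is 2 (s1 → s3 → s7); averaging such counts over many simulated users gives the quantity the key-term ranking is trained to minimize.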
Type-Ⅰ and Type-Ⅱ Dialogue Systems
• Type-Ⅰ: dialogue over a well-organized database
[Figure: the input speech utterance U goes through ASR (words, lattices) and spoken language understanding (dialogue act classification, semantic frame); the dialogue manager tracks the dialogue state, interprets the user act, and selects a system action over the well-organized database; an output generator returns speech, graphs, and tables]
• Type-Ⅱ: spoken-language-based information access over unstructured knowledge sources
[Figure: an input spoken query q goes through ASR into word/phone lattices (one-best, N-best); a retrieval engine searches an inverted index file built by indexing the ASR output (word/phone lattices, one-best, N-best) of the spoken/multimedia document archive; dialogue modeling maintains an internal state, and the dialogue manager presents related documents and the information obtained through a multi-modal user interface]
Improved Performance by Dialogue Modeling
[Figure: task success rate and average number of key terms needed for successful trials, versus ASR character accuracy in % for queries (74/88/92/100), comparing dialogue modeling, wpq, and tf-idf]
Histogram-based Quantization (HQ) for Robust Distributed Speech Recognition
– quantization dynamically determined by local statistics, thus automatically absorbing the various disturbances
Distributed Speech Recognition (DSR) and Wireless Environment
• An example partition of the speech recognition processes between client and server
  – encoded feature parameters are transmitted in packets
[Figure: front-end signal processing on the client produces feature vectors, which are sent over the network to the server; there, linguistic decoding and search algorithms use acoustic models (trained from speech corpora), a lexicon (from a lexical knowledge base), and a language model (constructed from text corpora with a grammar) to produce the output sentence]
• Client/server structure
[Figure: multiple clients connected to servers through the network]
Problems with Conventional Vector Quantization (VQ)
• Conventional VQ (e.g., SVQ) is popularly used in DSR
• Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ
  – Noise moves clean speech to another partition cell (X to Y)
  – Mismatch between the fixed VQ codebook and the test data increases distortion
  – Quantization increases the difference between clean and noisy features
Histogram-based Quantization (HQ) (Ⅰ)
– Decision boundaries yi, i = 1, …, N, are dynamically defined by C(y), the cumulative histogram (CDF) of the local data
– Representative values zi, i = 1, …, N, are fixed, obtained through a standard Gaussian: D = {(zi, bi) on the vertical scale, i = 1, …, N} is determined by Lloyd-Max quantization and a standard Gaussian distribution
Histogram-based Quantization (HQ) (Ⅱ)
– With a new histogram C′(y′), the decision boundaries automatically change to (y′i−1, y′i)
– Quantization rule: xt → zi if bi−1 < C′(xt) ≤ bi, or equivalently y′i−1 < xt ≤ y′i, i = 1, 2, …, N
– Decision boundaries are adjusted according to the local statistics, so there is no codebook mismatch problem
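A one-dimensional sketch of this rule (assumptions: uniform-in-probability boundaries bi on the vertical scale, representatives taken from standard-Gaussian quantiles as a stand-in for the Lloyd-Max codebook, and C′ estimated as the empirical CDF of the current utterance):

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gaussian_ppf(p):
    """Invert the standard-normal CDF by bisection (self-contained, no SciPy)."""
    lo, hi = -8.0, 8.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gaussian_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def hq_quantize(frames, n_levels=8):
    """Histogram-based quantization: x_t -> z_i if b_{i-1} < C'(x_t) <= b_i."""
    x = np.asarray(frames, dtype=float)
    b = np.linspace(0.0, 1.0, n_levels + 1)              # fixed vertical boundaries b_i
    z = np.array([gaussian_ppf((b[i] + b[i + 1]) / 2.0)  # fixed representatives z_i
                  for i in range(n_levels)])
    ranks = x.argsort().argsort()                        # empirical CDF C'(x_t)
    cdf = (ranks + 0.5) / len(x)
    idx = np.searchsorted(b, cdf, side="left") - 1
    return z[np.clip(idx, 0, n_levels - 1)]
```

Because only the ranks of the samples matter, `hq_quantize(x)` and `hq_quantize(2*x + 5)` give identical outputs: any monotonic shift or scaling of the feature axis (e.g. a channel effect) is absorbed, which is the codebook-mismatch robustness argued above.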
Histogram-based Quantization (HQ) (Ⅲ)
• Based on the CDF on the vertical scale and the histogram, hence less sensitive to noise on the horizontal scale
• Disturbances are automatically absorbed into the HQ blocks
[Figure: the dynamic nature of HQ — the hidden codebook {zi}, fixed on the vertical scale, is transformed by the dynamic CDF C(y) into boundaries {yi} that move on the horizontal scale; compared with histogram-based VQ (HVQ)]
[Figure: performance under different types of noise, averaged over all SNR values]