The TUM Cumulative DTW Approach for the Spoken Web Search Task. Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller. Institute for Human-Machine Communication, Technische Universität München
Dec 18, 2014
The TUM Cumulative DTW Approach for the Spoken Web Search Task
Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller
Institute for Human-Machine Communication
Technische Universität München
Summary
• Not a “system”
• Low-level features only
• No ASR
• Little “engineering”
• Method of integrating discriminative training into DTW
Mediaeval 2012 Workshop 2
Cumulative DTW (CDTW)
• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning
• Idea:
  – Use different local cost functions for each step
  – Automatic learning of these functions as a combination of general features
Mediaeval 2012 Workshop 3
From DTW to CDTW
• Local cost function:
[Figure: step diagram showing the transitions into (i, j) from (i, j−1), (i−1, j), and (i−1, j−1), with step weights α1, α2, α3]
• Dynamic Programming:
From DTW to CDTW
• Local step function:
[Figure: step diagram showing the transitions into (i, j) from (i, j−1), (i−1, j), and (i−1, j−1), with step functions s1(i, j), s2(i, j), s3(i, j)]
• Dynamic Programming:
Softmax?
+ Differentiable
  – Allow for an optimization of the
+ Combine several alignment paths
  – More robust to local changes
- Only give a score (not the optimal path)
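The trade-off above can be illustrated with a small sketch of a cumulative recursion in which the hard minimum over the three steps is replaced by a smooth one. This is only an illustration under stated assumptions, not the authors' formulation: the exact recursion is not given in the slides, the smoothing is written here as a soft minimum over costs (the slides may equally use a soft maximum over scores), and `softmin`, `cumulative_score`, `step_costs`, and `gamma` are hypothetical names.

```python
import math


def softmin(values, gamma=1.0):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))).

    Approaches min(values) as gamma -> 0 and is differentiable in every input.
    """
    m = min(values)
    if math.isinf(m):
        return m
    # Subtract the minimum before exponentiating for numerical stability.
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def cumulative_score(n, m, step_costs, gamma=1.0, soft=True):
    """Accumulate step-dependent costs over an n x m alignment grid.

    step_costs(k, i, j) is a hypothetical callback giving the cost of entering
    cell (i, j) through step k:
      k = 0: from (i, j-1), k = 1: from (i-1, j), k = 2: from (i-1, j-1).
    Returns the accumulated value S at the last cell (n-1, m-1).
    """
    S = [[math.inf] * m for _ in range(n)]
    S[0][0] = 0.0
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:
                candidates.append(S[i][j - 1] + step_costs(0, i, j))
            if i > 0:
                candidates.append(S[i - 1][j] + step_costs(1, i, j))
            if i > 0 and j > 0:
                candidates.append(S[i - 1][j - 1] + step_costs(2, i, j))
            # Hard minimum keeps one best path; soft minimum blends all paths.
            S[i][j] = softmin(candidates, gamma) if soft else min(candidates)
    return S[n - 1][m - 1]
```

With `soft=False` the function reduces to an ordinary DTW-style recursion. With `soft=True` every alignment path contributes to the result, which is why only a score, and no single optimal path, comes out.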
Features
• Acoustic descriptors: MFCC++ (D=36)
  – HTK, 25 ms, CMN, global normalization
• Features (k=1…D):
  – Local distance
  – “Local self-similarity”
  – Distance / product of the self-similarities
Decision
• Are the two sequences instances of the same word/expression?
• Learning of the parameters
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of dev set
Decision: S(I, J) / (I + J)
Search Procedure
Given query and utterance
1) Feature extraction
2) Candidate search in
3) CDTW comparison
4) Score post-processing
Candidate Search
• Align query with entire utterance
  – CDTW with backtracking
  – “Scores” for each point
• Extract potential starts and ends
  – Peak-picking of scores
• Filter by duration
  – Only allow warping factors < 2
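The peak-picking and duration-filter steps above could be sketched as follows. This is a minimal illustration rather than the authors' implementation: `pick_peaks` and `filter_candidates` are hypothetical helpers, peaks are taken to be simple local maxima, and the warping factor is assumed to be the ratio between candidate duration and query duration, taken in whichever direction is larger.

```python
def pick_peaks(scores):
    """Return indices of local maxima in a list of per-frame scores.

    Assumption: a peak is a frame strictly higher than its left neighbour
    and at least as high as its right neighbour.
    """
    return [
        t
        for t in range(1, len(scores) - 1)
        if scores[t - 1] < scores[t] >= scores[t + 1]
    ]


def filter_candidates(starts, ends, query_len, max_warp=2.0):
    """Pair candidate starts and ends, keeping plausible durations only.

    A (start, end) pair is kept when the segment is non-empty and its
    warping factor relative to the query is below max_warp.
    """
    kept = []
    for s in starts:
        for e in ends:
            duration = e - s
            if duration <= 0:
                continue
            # Assumed definition: stretch or compression, whichever is larger.
            warp = max(duration / query_len, query_len / duration)
            if warp < max_warp:
                kept.append((s, e))
    return kept
```

For a query of 10 frames, for instance, this keeps only candidate segments longer than 5 and shorter than 20 frames, which then go on to the CDTW comparison.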
CDTW Score Post-Processing
• Same decision function as for learning
  – Many false positives
  – Bias toward some queries
• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
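The per-query threshold described above might look like the following sketch. It assumes that scores are grouped by query in a dictionary and that the percentile is computed with linear interpolation between order statistics; the slides specify neither, and `percentile` and `normalize_scores` are hypothetical names.

```python
def percentile(values, q):
    """q-th percentile (0..100) with linear interpolation between ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def normalize_scores(scores_by_query, q=90.0):
    """Subtract each query's own q-th percentile from its CDTW scores.

    scores_by_query maps a query identifier to the list of scores that
    query obtained over all candidates.
    """
    normalized = {}
    for query, scores in scores_by_query.items():
        threshold = percentile(scores, q)
        normalized[query] = [s - threshold for s in scores]
    return normalized
```

After this step roughly the top 10% of detections for each query have a positive score, so a single global decision threshold no longer favours queries that score high everywhere.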
Results
run       devQ-devC   evalQ-devC   devQ-evalC   evalQ-evalC
P(miss)   55.6%       59.5%        60.2%        54.5%
P(FA)     1.18%       1.13%        1.17%        1.13%
ATWV      0.263       0.333        0.164        0.290
• Great improvement over naive DTW
  – ATWV = 0.065 on devQ-devC
• ATWV scores depend on the run
Results
• DET curves similar
• CDTW seems to generalize well
• Decision function has to be improved
Conclusion
• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization
• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate “hard” path constraints into search