The TUM Cumulative DTW Approach for the Mediaeval 2012 Spoken Web Search Task


Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller

Institute for Human-Machine Communication, Technische Universität München

Summary

• Not a “system”
• Low-level features only
• No ASR
• Little “engineering”
• Method of integrating discriminative training into DTW

Mediaeval 2012 Workshop

Cumulative DTW (CDTW)

• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning
• Idea:
  – Use different local cost functions for each step
  – Automatic learning of these functions as a combination of general features


From DTW to CDTW

• Local cost function:

[Figure: step pattern into cell (i, j) from its predecessors (i, j−1), (i−1, j) and (i−1, j−1), with step weights α1, α2, α3]

• Dynamic Programming:
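The dynamic-programming recursion of classic DTW with weighted steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dtw_cost`, the Euclidean local cost, and the assignment of the three weights to the horizontal, vertical and diagonal steps are all assumptions made for the example.

```python
import numpy as np

def dtw_cost(x, y, alphas=(1.0, 1.0, 1.0)):
    """Accumulated DTW cost between two feature sequences.

    x, y   : sequences of frames (length I and J), each frame a scalar or vector.
    alphas : weights for the steps from (i, j-1), (i-1, j), (i-1, j-1);
             this mapping of weights to steps is an assumption of the sketch.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    I, J = len(x), len(y)
    # Local cost: Euclidean distance between every pair of frames (assumed).
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    a_h, a_v, a_d = alphas
    C = np.full((I, J), np.inf)
    C[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if j > 0:                      # step from (i, j-1)
                best = min(best, C[i, j - 1] + a_h * d[i, j])
            if i > 0:                      # step from (i-1, j)
                best = min(best, C[i - 1, j] + a_v * d[i, j])
            if i > 0 and j > 0:            # step from (i-1, j-1)
                best = min(best, C[i - 1, j - 1] + a_d * d[i, j])
            C[i, j] = best
    return float(C[-1, -1])
```

In this form the single distance `d` and the hand-set `alphas` are exactly the two limitations listed for DTW: one local cost function and manually tuned parameters.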

From DTW to CDTW

• Local step function:

[Figure: step pattern into cell (i, j) from its predecessors (i, j−1), (i−1, j) and (i−1, j−1), with step functions s1(i, j), s2(i, j), s3(i, j)]



Softmax?

+ Differentiable
  – Allow for an optimization of the

+ Combine several alignment paths
  – More robust to local changes

- Only give a score (not the optimal path)
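A soft accumulation over the three steps might look like the sketch below. It assumes that "softmax" means the smooth maximum log-sum-exp, that higher scores are better, and that the step functions are given as precomputed arrays; the function name `cdtw_score`, the mapping of `s1`, `s2`, `s3` to predecessors, and the initialisation `S[0, 0] = 0` are hypothetical choices for illustration.

```python
import numpy as np

def cdtw_score(s1, s2, s3):
    """Cumulative score S(I, J) under a soft-maximum recursion.

    s1, s2, s3 : (I x J) arrays of step scores for reaching (i, j) from
                 (i, j-1), (i-1, j) and (i-1, j-1) respectively (assumed mapping).
    """
    s1, s2, s3 = (np.asarray(s, dtype=float) for s in (s1, s2, s3))
    I, J = s1.shape
    S = np.full((I, J), -np.inf)
    S[0, 0] = 0.0  # assumed initialisation
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            terms = []
            if j > 0:
                terms.append(S[i, j - 1] + s1[i, j])
            if i > 0:
                terms.append(S[i - 1, j] + s2[i, j])
            if i > 0 and j > 0:
                terms.append(S[i - 1, j - 1] + s3[i, j])
            # Smooth maximum: every incoming path contributes to the score.
            S[i, j] = np.logaddexp.reduce(terms)
    return float(S[-1, -1])
```

Because log-sum-exp is differentiable in all of its arguments, gradients can flow back to whatever parameters produce the step scores; and because all paths contribute, there is no single best path to backtrack, which matches the drawback noted above.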


Features

• Acoustic descriptors: MFCC++ (D=36)
  – HTK, 25 ms, CMN, global normalization
• Features (k=1…D):
  – Local distance
  – “Local self-similarity”
  – Distance / product of the self-similarities


Decision

• Are the two sequences instances of the same word/expression?

• Learning of the parameters
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of dev set


Decision

$\frac{S(I, J)}{I + J}$

Search Procedure

Given query and utterance

1) Feature extraction

2) Candidate search in

3) CDTW comparison

4) Score post-processing


Candidate Search

• Align query with entire utterance
  – CDTW with backtracking
  – “Scores” for each point
• Extract potential starts and ends
  – Peak-picking of scores
• Filter by duration
  – Only allow warping factors < 2
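The peak-picking and duration filter could be sketched as below. The slides do not say how start and end scores are derived from the backtracking, so the two input arrays, the strict-local-maximum peak rule, and the helper names `pick_peaks` and `candidate_segments` are assumptions; only the warping-factor limit of 2 comes from the slide.

```python
import numpy as np

def pick_peaks(scores):
    """Indices of strict local maxima in a 1-D score array."""
    s = np.asarray(scores, dtype=float)
    if s.size < 3:
        return []
    inner = (s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])
    return (np.flatnonzero(inner) + 1).tolist()

def candidate_segments(start_scores, end_scores, query_len, max_warp=2.0):
    """Pair start and end peaks whose duration is compatible with the query.

    A (start, end) pair is kept when its warping factor, i.e. the ratio
    between segment length and query length (in either direction),
    stays below max_warp.
    """
    candidates = []
    for s in pick_peaks(start_scores):
        for e in pick_peaks(end_scores):
            duration = e - s + 1
            if duration <= 0:
                continue  # end must not precede start
            warp = max(duration / query_len, query_len / duration)
            if warp < max_warp:
                candidates.append((s, e))
    return candidates
```

For a query of 4 frames, a start peak at frame 1 combined with end peaks at frames 4 and 10 would keep only the segment (1, 4), since the longer one implies a warping factor of 2.5.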




CDTW Score Post-Processing

• Same decision function as for learning
  – Many false positives
  – Bias toward some queries
• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
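The per-query threshold subtraction can be illustrated with NumPy. This is a minimal sketch: the dictionary layout and the function name `subtract_query_threshold` are hypothetical, and `np.percentile` is used with its default linear interpolation, which the slides do not specify.

```python
import numpy as np

def subtract_query_threshold(scores_by_query, percentile=90.0):
    """Shift each query's CDTW scores by that query's own percentile.

    scores_by_query : dict mapping a query id to the scores of all
                      candidates found for that query.
    Returns a dict with the same keys and threshold-subtracted scores.
    """
    adjusted = {}
    for query, scores in scores_by_query.items():
        s = np.asarray(scores, dtype=float)
        adjusted[query] = s - np.percentile(s, percentile)
    return adjusted
```

After this shift, roughly the top 10% of candidates of every query have a positive score, so a single global decision threshold no longer favours queries whose raw scores are systematically high.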


Results

run       devQ-devC   evalQ-devC   devQ-evalC   evalQ-evalC
P(miss)   55.6%       59.5%        60.2%        54.5%
P(FA)     1.18%       1.13%        1.17%        1.13%
ATWV      0.263       0.333        0.164        0.290


• Large improvement over naive DTW
  – Naive DTW: ATWV = 0.065 on devQ-devC (CDTW: 0.263)
• ATWV scores depend on the run

Results

• DET curves similar

• CDTW seems to generalize well

• Decision function has to be improved


Conclusion

• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization
• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate “hard” path constraints into search


Thank you.


