DCU at the NTCIR-11 SpokenQuery&Doc Task

David N. Racca, Gareth J.F. Jones
CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Dublin, Ireland
{dracca, gjones}@computing.dcu.ie

Introduction
• Speech is more than a simple sequence of words.
• Prosodic variation encodes rich information about:
  – emotions, discourse structure, dialogue acts, focus, emphasis, contrast, topic shifting, etc.
• We examined the potential of prosodic prominence in the NTCIR-11 SpokenQuery&Doc Task.

Background and Previous Work
Prosody may be useful in speech search:
• Relationship between stress and TF-IDF scores [1].
• Spoken document retrieval (SDR) exploiting amplitude and duration [2].
• Topic tracking exploiting energy and pitch [3].
• Spoken content retrieval (SCR) exploiting pitch, loudness, and duration [4].

Data Pre-processing
Lecture and query audio (WAV) is processed as follows:
• VAD segments the audio into inter-pausal units (IPUs).
• OpenSMILE extracts F0 and loudness every 10 ms.
• Julius LVCSR produces ASR transcripts; annotation removal, ChaSen, and forced alignment produce time-aligned manual transcripts.
• Per-lecture normalisation maps raw F0 and loudness onto a comparable scale, e.g.:
  max(f0_tf-idf) = 280.44 Hz raw, 0.58 normalised
  max(f0_単語) = 236.46 Hz raw, 0.49 normalised
• The normalised prosodic features are attached to the ASR and manual transcripts (enriched transcripts), which are grouped into IPU and slide-group segments and indexed with Terrier.

Indexing
Stores normalised prosodic features for each term. For a term i in segment j, features are aggregated over its occurrences k:

  f0(i,j) = max_k { max(f0^k_(i,j)) }
  l(i,j) = max_k { max(l^k_(i,j)) }
  f0range(i,j) = max_k { max(f0^k_(i,j)) } − min_k { min(f0^k_(i,j)) }
  d(i,j) = max_k { d^k_(i,j) }

Retrieval
Increases the weights of prominent terms. Terrier matching combines BM25-style text evidence with an acoustic score:

  tf(i,j) = k1 · tf_(i,j) / ( tf_(i,j) + k1 · (1 − b + b · dl_j / avdl) )
  idf(i,C) = log( N / n_i + 1 )

  ac(i,j) = f0(i,j)                Pitch [P]
            l(i,j)                 Loudness [L]
            d(i,j)                 Duration [Dur]
            f0range(i,j)           Pitch Range [Pr]
            l(i,j) · f0(i,j)       [LP]
            l(i,j) · f0range(i,j)  [LPr]

A segment s_j is scored as rel(q, s_j) = Σ_(i ∈ M) w(i,j), where M is the set of query terms matched in s_j and

  w(i,j) = idf(i,C) · [ α · tf(i,j) + (1 − α) · ac(i,j) ]                  LI
  w(i,j) = ( θ_ir · tf(i,j) · idf(i,C) + θ_ac · ac(i,j) ) / (θ_ir + θ_ac)  G
  w(i,j) = tf(i,j) · idf(i,C)                                             TF_IDF

Results
[Figure: MAP by spoken query type (Manual, Match, Unmatch, AM-LM) for IPU segments with prosody; runs LI-LPr-0.5, LI-Pr-0.7, LI-Dur-0.3 vs TF_IDF; MAP range 0–0.14.]
[Figure: MAP by spoken query type for slide-group segments with prosody; runs LI-LPr-0.2, LI-LPr-0.5, LI-P-0.9 vs TF_IDF; MAP range 0–0.14.]
[Figure: MAP by spoken query type on manual transcripts; runs LI-Pr-0.7, LI-LPr-0.7 vs TF_IDF; MAP range 0–0.14.]
[Figure: AveP on Query 1, prosodic-based vs TF_IDF, by spoken query type; AveP range 0–0.8.]

Conclusions
• No significant differences between prosodic- and text-based runs.
• Transcript quality affects retrieval effectiveness.
• Prosodic-based models may be useful for certain queries.

References
[1] F. Crestani. Towards the use of prosodic information for spoken document retrieval. SIGIR'01, 2001.
[2] B. Chen et al. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. INTERSPEECH'01, 2001.
[3] C. Guinaudeau et al. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. INTERSPEECH'11, 2011.
[4] D.N. Racca et al. DCU search runs at MediaEval 2014 Search and Hyperlinking. MediaEval 2014 Multimedia Benchmark Workshop, 2014.

This research is supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of CNGL (www.cngl.ie).
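The LI weighting above (a BM25-style tf saturated and interpolated with an acoustic prominence score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constants K1, B, and ALPHA, the function names, and the dictionary-based data layout are all assumptions, and the acoustic score shown is the LPr variant (loudness times pitch range) computed from normalised per-occurrence values.

```python
import math

K1, B, ALPHA = 1.2, 0.75, 0.7  # BM25 constants and interpolation weight (assumed values)

def bm25_tf(raw_tf, dl, avdl, k1=K1, b=B):
    """Saturated term frequency: k1*tf / (tf + k1*(1 - b + b*dl/avdl))."""
    return k1 * raw_tf / (raw_tf + k1 * (1 - b + b * dl / avdl))

def idf(n_docs, df):
    """idf(i, C) = log(N / n_i + 1)."""
    return math.log(n_docs / df + 1)

def lpr_feature(occurrences):
    """ac(i, j) for the LPr runs: l(i,j) * f0range(i,j), aggregating
    normalised loudness and F0 frame values over the term's occurrences.
    Each occurrence is assumed to be {"loudness": [...], "f0": [...]}."""
    l = max(max(o["loudness"]) for o in occurrences)
    f0_max = max(max(o["f0"]) for o in occurrences)
    f0_min = min(min(o["f0"]) for o in occurrences)
    return l * (f0_max - f0_min)

def li_weight(raw_tf, dl, avdl, n_docs, df, ac, alpha=ALPHA):
    """LI scheme: w(i,j) = idf(i,C) * [alpha*tf(i,j) + (1 - alpha)*ac(i,j)],
    where ac is a normalised acoustic prominence score in [0, 1]."""
    return idf(n_docs, df) * (alpha * bm25_tf(raw_tf, dl, avdl) + (1 - alpha) * ac)
```

With alpha = 1 the LI weight reduces to the TF_IDF baseline, so a single scoring path can serve both runs; lowering alpha shifts mass from text evidence to prosodic prominence.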