Audio Information for Hyperlinking of TV Content Petra Galuščáková and Pavel Pecina [email protected]ff.cuni.cz Faculty of Mathematics and Physics Charles University in Prague SLAM Workshop, 30. 10. 2015

Jan 08, 2017

Transcript
Page 1: Audio Information for Hyperlinking of TV Content

Audio Information for Hyperlinking of TV Content

Petra Galuščáková and Pavel Pecina
[email protected]

Faculty of Mathematics and Physics
Charles University in Prague

SLAM Workshop, 30. 10. 2015

Page 2: Audio Information for Hyperlinking of TV Content

Hyperlinking TV Content

● Our main objective: create hyperlinks
    ● Retrieve segments similar to a given query segment from the collection of television programmes.
● Benefits:
    ● Recommendation – bring additional entertainment value
    ● Exploratory search – explore the topic and enable users to find unexplored connections

Page 3: Audio Information for Hyperlinking of TV Content

BBC Broadcast Data

● Subtitles
● Three ASR transcripts
    ● LIMSI
        – word variants occurring at the same time
        – confidence of each word variant
    ● TED-LIUM
        – confidence of each word
    ● NST-Sheffield
● Metadata
● Prosodic features

Page 4: Audio Information for Hyperlinking of TV Content

System Description

● Retrieve relevant segments
    ● Divide documents into 60-second-long segments
    ● A new segment is created every 10 seconds
    ● Index textual segments
    ● Post-filter retrieved segments
● A query segment is transformed into a textual query
● Terrier IR Framework
● Speech retrieval
    ● Suffers from problems associated with ASR systems
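The segmentation step (60-second segments, a new one every 10 seconds) can be sketched as a sliding window over time-stamped words. This is a minimal illustration, not the authors' implementation: the `Word` record, the function name `segment_transcript`, and the choice to assign a word to a window by its start time are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the programme (assumed format)


def segment_transcript(words, duration, length=60.0, step=10.0):
    """Return (start, end, text) windows of `length` seconds, one every `step` seconds."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + length, duration)
        # Assumption: a word belongs to a window if it starts inside it.
        text = " ".join(w.text for w in words if start <= w.start < end)
        segments.append((start, end, text))
        start += step
    return segments


# Toy transcript of a 90-second programme.
words = [Word("hello", 1.0), Word("world", 12.0), Word("again", 65.0)]
segments = segment_transcript(words, duration=90.0)
```

Because consecutive windows overlap by 50 seconds, the same words appear in several segments, which is presumably why retrieved segments are post-filtered afterwards.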

Page 5: Audio Information for Hyperlinking of TV Content

Speech Retrieval Problems

1. Restricted vocabulary
    ● Data and query segment expansion
    ● Combination of transcripts
2. Lack of reliability
    ● Utilizing only the most confident words of the transcripts
    ● Using confidence score
3. Lack of content
    ● Audio music information
    ● Acoustic similarity

Page 6: Audio Information for Hyperlinking of TV Content

1. Restricted Vocabulary

● The number of unique words in the transcripts is almost three times smaller than in the subtitles.
● Low-frequency words are expected to be the most informative for information retrieval.
● Expand data and query segments
    ● Metadata
    ● Content surrounding the query segment
● Combine different transcripts

Page 7: Audio Information for Hyperlinking of TV Content

Data and Query Segment Expansion

● Metadata
    ● Concatenate each data and query segment with metadata of the corresponding file.
    ● Title, episode title, description, short episode synopsis, service name, and program variant
● Content surrounding the query segment
    ● Use 200 seconds before and after the query segment.
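Both expansion strategies amount to string concatenation before indexing or querying. The sketch below is illustrative only: the metadata key names in `METADATA_FIELDS`, the `(start_time, text)` word format, and both function names are hypothetical, chosen to mirror the fields listed on the slide.

```python
# Hypothetical key names mirroring the metadata fields listed on the slide.
METADATA_FIELDS = [
    "title", "episode_title", "description",
    "synopsis", "service_name", "variant",
]


def expand_with_metadata(segment_text, metadata):
    """Append the non-empty metadata fields of the source file to the segment text."""
    extra = [metadata[f] for f in METADATA_FIELDS if metadata.get(f)]
    return " ".join([segment_text] + extra)


def expand_with_context(words, seg_start, seg_end, context=200.0):
    """Build query text from the segment plus `context` seconds on each side.

    `words` is a list of (start_time_in_seconds, text) pairs (assumed format).
    """
    lo, hi = seg_start - context, seg_end + context
    return " ".join(text for t, text in words if lo <= t < hi)


# Toy data: a query segment spanning 550 s to 650 s.
words = [(0.0, "far"), (350.0, "before"), (600.0, "inside"),
         (800.0, "after"), (1000.0, "late")]
query = expand_with_context(words, 550.0, 650.0)
expanded = expand_with_metadata("inside", {"title": "News", "description": ""})
```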

Page 8: Audio Information for Hyperlinking of TV Content

Data and Query Segment Expansion Results

Page 9: Audio Information for Hyperlinking of TV Content

MAP-bin vs. WER

[Figure: MAP-bin plotted against WER]

Page 10: Audio Information for Hyperlinking of TV Content

Data and Query Segment Expansion Results

● The improvement is significant in terms of both measures.
    ● Expansion using metadata and context may substantially reduce the query expansion problem.
● The highest MAP-tol score was achieved on the LIUM transcript.
    ● Even though the transcripts have a relatively high WER.
● The metadata and context produce a much higher relative improvement for the automatic transcripts than for the subtitles.
● The MAP-bin score corresponds with the WER.

Page 11: Audio Information for Hyperlinking of TV Content

Transcripts Combination

[Table: MAP-bin and MAP-tol scores]

Page 12: Audio Information for Hyperlinking of TV Content

Transcripts Combination

● The combination is generally helpful.
    ● Even despite the high score achieved by the LIUM transcripts
● The overall highest MAP-bin score was achieved using the union of the LIMSI and NST transcripts.
    ● Outperforms the results achieved with the subtitles
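The slides do not say how the union of two transcripts is formed. One plausible reading, assumed here, is that the texts produced by different ASR systems for the same segment are simply concatenated before indexing, so that a word missed by one system can still be matched through the other. The function name and the toy sentences are made up for illustration.

```python
def union_transcripts(*texts):
    """Concatenate the texts of one segment taken from several transcripts."""
    return " ".join(t.strip() for t in texts if t.strip())


# Toy outputs of two ASR systems for the same segment.
limsi = "the prime minister visit london"
nst = "prime minister visited london today"

combined = union_transcripts(limsi, nst)
vocabulary = set(combined.split())
```

Under this reading, words recognized by both systems also get a higher term frequency in the combined segment, which may act as an implicit form of voting.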

Page 13: Audio Information for Hyperlinking of TV Content

2. Transcript Reliability

● WER
    ● LIMSI: 57.5%
    ● TED-LIUM: 65.1%
    ● NST-Sheffield: 58.6%
● Word variants
● Word confidence
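For reference, word error rate is conventionally defined as (S + D + I) / N: the number of substituted, deleted, and inserted words needed to turn the ASR output into the reference, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance; it only illustrates the metric and says nothing about how the figures above were measured.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```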

Page 14: Audio Information for Hyperlinking of TV Content

Word Variants

● Compare utilization of the first, most reliable word and all word variants in LIMSI transcripts.

Page 15: Audio Information for Hyperlinking of TV Content

Word Confidence

● Only use words with high confidence scores
    ● Only the words from LIMSI and LIUM transcripts with a confidence score higher than a given threshold
    ● Increased both scores for the development set
    ● Did not outperform the full transcripts on the test data
    ● We also experimented with voting
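Confidence filtering can be sketched as a simple threshold over (word, confidence) pairs. The pair format, the function name, and the example threshold of 0.5 are assumptions; the slides do not state the threshold values that were tried.

```python
def filter_by_confidence(words, threshold):
    """Keep only words whose confidence is strictly higher than `threshold`.

    `words` is a list of (text, confidence) pairs (assumed format).
    """
    return [text for text, confidence in words if confidence > threshold]


# Toy ASR output with one low-confidence misrecognition.
asr_words = [("prime", 0.95), ("mister", 0.30), ("minister", 0.80)]
kept = filter_by_confidence(asr_words, threshold=0.5)
```

The trade-off is visible even in this toy case: a higher threshold removes likely errors but also shrinks an already restricted vocabulary.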

Page 16: Audio Information for Hyperlinking of TV Content

3. Lack of Content

● We only use the content of the subtitles/transcripts
● A wide range of acoustic attributes could also be utilized: applause, music, shouts, explosions, whispers, background noise, …
● Acoustic fingerprinting
● Acoustic similarity

Page 17: Audio Information for Hyperlinking of TV Content

Acoustic Fingerprinting: Motivation

● Obtain additional information from the music contained within the query segment

● Especially helpful for hyperlinking music programmes

Page 18: Audio Information for Hyperlinking of TV Content

Acoustic Fingerprinting

● 1) Minimize noise in each query segment
    ● Query segments were divided into 10-second-long passages; a new passage was created every second
● 2) Submit sub-segments to Doreso API service
● 3) Retrieve song title, artist and album
    ● Development set: 4 queries out of 30
    ● Test set: 10 queries out of 30
● 4) Concatenate title, artist, and album name with the text of the query segment
● Both retrieval scores drop
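The four steps can be sketched as follows. The recognition service is deliberately not modelled: `identify` is a hypothetical callable standing in for the external request, and the dictionary keys it returns, like the function names, are assumptions made for the example.

```python
def make_passages(seg_start, seg_end, length=10.0, step=1.0):
    """Split a query segment into `length`-second passages, one every `step` seconds."""
    passages = []
    start = seg_start
    while start + length <= seg_end:
        passages.append((start, start + length))
        start += step
    return passages


def expand_query_with_music(query_text, passages, identify):
    """Append title, artist, and album of every distinct recognized song.

    `identify` is a hypothetical callable (start, end) -> dict or None that
    stands in for the external fingerprinting service.
    """
    songs = []
    for start, end in passages:
        song = identify(start, end)
        if song is None:
            continue
        key = (song["title"], song["artist"], song["album"])
        if key not in songs:  # the same song is usually found in many passages
            songs.append(key)
    extra = " ".join(" ".join(key) for key in songs)
    return f"{query_text} {extra}".strip()


def fake_identify(start, end):
    """Stub recognizer: pretends music starts 5 seconds into the segment."""
    if start >= 5.0:
        return {"title": "Song", "artist": "Artist", "album": "Album"}
    return None


passages = make_passages(0.0, 30.0)
expanded = expand_query_with_music("query words", passages, fake_identify)
```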

Page 19: Audio Information for Hyperlinking of TV Content

Acoustic Similarity: Motivation

● Retrieve identical acoustic segments
    ● E.g. signature tunes and jingles
● Detect semantically related segments
    ● E.g. segments containing action scenes and music

Page 20: Audio Information for Hyperlinking of TV Content

Acoustic Similarity

● Calculate similarity between data and query vector sequences of prosodic features

● Find the most similar sequences near the beginning

● Linearly combine the highest acoustic similarity with text-based similarity score

● MAP-bin: 0.2689 0.2687

● MAP-tol: 0.2465 0.2473
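A rough sketch of the three steps above, under several assumptions the slides do not confirm: prosodic features are represented as one numeric vector per frame, similarity between two sequences is the mean frame-wise cosine similarity, the search is restricted to the first `max_offset` frame offsets of the data segment, and the linear combination is a convex one with a hypothetical weight of 0.1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sequence_similarity(query_seq, data_seq):
    """Mean frame-wise cosine similarity over the aligned frames."""
    n = min(len(query_seq), len(data_seq))
    if n == 0:
        return 0.0
    return sum(cosine(q, d) for q, d in zip(query_seq, data_seq)) / n


def best_acoustic_similarity(query_seq, data_seq, window, max_offset):
    """Highest similarity between the query start and windows near the data start."""
    query_window = query_seq[:window]
    scores = []
    for offset in range(max_offset + 1):
        data_window = data_seq[offset:offset + window]
        if len(data_window) < window:
            break
        scores.append(sequence_similarity(query_window, data_window))
    return max(scores, default=0.0)


def combined_score(text_score, acoustic_score, weight=0.1):
    """Linear combination of text-based and acoustic similarity (weight assumed)."""
    return (1.0 - weight) * text_score + weight * acoustic_score


# Toy two-dimensional feature sequences.
query = [[1.0, 0.0], [0.0, 1.0]]
data = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = best_acoustic_similarity(query, data, window=2, max_offset=2)
score = combined_score(0.5, best, weight=0.2)
```

In practice the weight would be tuned on the development set, since the acoustic score is only meant to adjust the text-based ranking.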

Page 21: Audio Information for Hyperlinking of TV Content

Conclusion

Page 22: Audio Information for Hyperlinking of TV Content

Overview

● Restricted vocabulary
    ● Data expansion +
    ● Transcripts combination +
● Transcript reliability
    ● Word variants +
    ● Word confidence -
● Lack of content
    ● Acoustic fingerprinting -
    ● Acoustic similarity +

Page 23: Audio Information for Hyperlinking of TV Content

Thank you

This research has been supported by the project AMALACH (grant no. DF12P01OVV022 of the program NAKI of the Ministry of Culture of the Czech Republic), the Czech Science Foundation (grant no. P103/12/G084), and the Charles University Grant Agency (grant no. 920913).