Cross-Language Access to Recorded Speech in the MALACH Project
Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka

Jan 19, 2016

Transcript
Page 1

Cross-Language Access to Recorded Speech in the MALACH Project

Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka

Page 2

Outline

• The MALACH project

• Searching speech

• A cross-language retrieval experiment

• Next steps

Page 3

The MALACH Project

• 52,000 interviews with Holocaust survivors
  – 116,000 hours (180 TB MPEG-1)
  – 32 languages, recorded in 67 countries

• Present: Manual indexing
  – 14,000 controlled vocabulary terms

• Future: Automatic indexing
  – Speech recognition
  – Translation

Page 4

Who Uses the Collection?

Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement

Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use

Based on analysis of 280 access requests

Page 5

Research Challenges

• Speech Recognition
  – Spontaneous, accented, elderly, language switching

• Computational Linguistics
  – Segmentation, classification, summarization, extraction

• Information Retrieval
  – Query formulation, search, selection, examination, use

Today

Tomorrow (Josef Psutka)

Page 6

Supporting Information Access

[Diagram: information access process. Process stages: Source Selection, Query Formulation, Search (performed by the Search System), Selection, Examination, Delivery. Intermediate artifacts: Query, Ranked List, Recording. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection.]

Page 7

Key Issues in Speech Retrieval

• Recognition accuracy
  – Content-based retrieval works when WER < 40%

• Topic segmentation
  – Average MALACH interview is 2.3 hours!

• Multi-scale summarization
  – Brief summaries: selection from a ranked list
  – Detailed summaries: minimize audio replay
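The recognition-accuracy threshold above is stated in word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, assuming whitespace tokenization; the function name and the example strings are illustrative and do not come from the slides.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair, for illustration only.
reference = "we moved to prague in the spring"
hypothesis = "we moved to prog in spring"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than were spoken.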

Page 8

English Recognition Accuracy

• 60% WER for off-the-shelf systems!
  – 3 systems (broadcast news, dictation, telephone)

• MLLR adaptation helps
  – 33% WER for fluent speech
  – 46% WER for heavy accents/disfluent speech

• Next step: retrain on transcribed interviews
  – 200 hours from 800 speakers

Page 9

Cross-Language Search

• Query formulation
  – Spoken words (free text)
  – Thesaurus descriptors

• Segment selection
  – Speech-to-text translation
  – Multi-scale indicative summaries

• Use of retrieved segments
  – Query reformulation
  – Incorporation in projects

Page 10

Ranked Retrieval System Design

[Diagram: ranked retrieval pipeline. Document side: Documents, Compute Term Weights, Build Index. Query side: Query, Translation Lexicon, Compute Term Weights. Output: Compute Document Score, Sort Scores, Ranked List.]
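The stages named in this design (term weighting, index building, query translation through a lexicon, document scoring, and sorting) can be sketched as follows. This is a toy illustration, not the system used in the experiment: the slides do not specify the weighting scheme, so the TF-IDF formula, the function names, and the sample documents and lexicon entries are all assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(documents):
    """Return {term: {doc_id: weight}} using an assumed TF-IDF weighting."""
    term_counts = {
        doc_id: Counter(text.lower().split())
        for doc_id, text in documents.items()
    }
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(documents)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(1 + n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return index


def rank(query_terms, index, lexicon=None):
    """Translate query terms, score each document, and sort by score."""
    lexicon = lexicon or {}
    query_weights = Counter()
    for term in query_terms:
        # Terms missing from the lexicon are kept untranslated.
        for target in lexicon.get(term, [term]):
            query_weights[target] += 1

    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents and lexicon entries, for illustration only.
documents = {
    "d1": "architecture in berlin",
    "d2": "berlin airlift pilots",
    "d3": "sibelius symphony",
}
lexicon = {"architektura": ["architecture"], "berlin": ["berlin"]}
index = build_index(documents)
for doc_id, score in rank(["architektura", "berlin"], index, lexicon):
    print(doc_id, round(score, 3))
```

Documents that share no term with the translated query get no score and are left out of the ranked list.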

Page 11

Evaluation Framework

[Diagram: evaluation framework. Inputs to Ranked Retrieval: Czech Queries, Czech/English Translation Lexicon, English Documents. Output: Ranked List. Evaluation combines the Ranked List with Relevance Judgments to produce a Measure of Effectiveness.]

Page 12

Czech/English Test Collection

• 113,000 English newspaper stories

• Two sets of 33 Czech queries
  – S: Very short (1-3 words)
  – L: Sentence-length

• Human “ground truth” relevance judgments
  – Pooled assessment methodology (CLEF-2000)

Page 13

Translation Lexicon

• Machine-readable dictionary
  – Lemmatized Czech query words
  – Looked each up in “PC Translator”

• Bilingual term list
  – Downloaded 800 term pairs from Ergane

• Retained untranslatable terms
  – Stripped diacritics to match proper names
  – Optionally, made minor corrections (by hand), e.g., “afrika” to “africa”
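The lookup-with-fallback procedure described on this slide can be sketched as below: each lemma is looked up in the lexicon, and a term with no entry is retained with its diacritics stripped, optionally passing through a hand-made correction table. This is only an illustration under stated assumptions. Lemmatization is assumed to have happened upstream, and the function names, the lexicon entries, and the `corrections` parameter are hypothetical, not the PC Translator or Ergane data.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'berlín' becomes 'berlin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_query(lemmas, lexicon, corrections=None):
    """Replace each lemma with all of its lexicon translations.

    Lemmas without an entry are retained after diacritic stripping,
    then passed through the optional hand-made correction table.
    """
    corrections = corrections or {}
    translated = []
    for lemma in lemmas:
        key = lemma.lower()
        if key in lexicon:
            translated.extend(lexicon[key])
        else:
            fallback = strip_diacritics(key)
            translated.append(corrections.get(fallback, fallback))
    return translated


# Hypothetical lexicon entries, for illustration only.
lexicon = {"architektura": ["architecture"], "v": ["in", "at", "within"]}
print(translate_query(["architektura", "v", "Berlín"], lexicon))
print(translate_query(["Afrika"], lexicon, corrections={"afrika": "africa"}))
```

Keeping every translation of a word is simple but lets a common word such as a preposition contribute many query terms.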

Page 14

Example Query

• Original Czech query (S)
  – Architektura v Berlíně

• Word-by-word translation into English
  – architecture architecture
  – at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
  – berlin

Page 15

Example Search Results

• Creating a new architectural vocabulary for a democratic Berlin

• UCLA merges architecture and arts into a new school

• Best of Berlin for young travelers

• Who owns the Nazi paper trail?

• A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …

• On the record: Sanderling's dark take on Sibelius

• Max Bill, 85; Controversial Swiss artist, sculptor and writer

• The week ahead: Berlin; Farewell to allies

• Roll over Beethoven; Jeff Berlin leaves the violin and classical …

• Californians had right stuff for airlift; Europe: former pilots …

Page 16

Precision-Recall Graph

[Figure: precision-recall curve. x-axis: Recall (0.0 to 1.0); y-axis: Interpolated Precision (0.0 to 1.0).]

Average Precision = 0.477

Czech title query 1, LA Times Documents, CLEF 2000 Relevance Assessments
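The measures used in these evaluation slides can be computed from a ranked list and a set of relevance judgments. The sketch below assumes the common uninterpolated definition of average precision and the usual interpolation rule (precision at recall level r is the highest precision at any recall of at least r). The slides do not state which evaluation tool or exact variant produced the reported numbers, and the function names and sample data here are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def interpolated_precision(ranked_ids, relevant_ids, recall_levels=None):
    """Interpolated precision at each recall level (default 0.0, 0.1, ..., 1.0)."""
    relevant = set(relevant_ids)
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]
    if not relevant:
        return [0.0 for _ in recall_levels]
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in recall_levels
    ]


def mean_average_precision(runs):
    """Average the per-query average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked list and judgments, for illustration only.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant))
print(interpolated_precision(ranked, relevant))
```

Averaging `average_precision` over all queries in a set gives the mean average precision, a single number for comparing systems.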

Page 17

Average Precision

[Figure: bar chart of per-query average precision. x-axis: Query (1 3 4 5 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 26 28 29 30 31 32 33 34 36 37 38 39 40); y-axis: Average Precision (0.0 to 1.0). One bar is labeled 0.477.]

Mean Average Precision = 0.188

Czech title queries, LA Times Documents, CLEF 2000 Relevance Assessments

Page 18

Results

[Figure: bar chart. x-axis conditions: No Translation, DQT, DQT + Names, Monolingual; y-axis: Mean Average Precision (0.0 to 0.5). Legend text as extracted: TTD.]

Page 19

Results

• Czech seems to pose no unusual problems
  – 55% of monolingual with simple techniques

• Suitable Czech/English resources exist
  – Czech morphology
  – Czech/English bilingual lexicon

• Multiword expression handling would help
  – Named entities, non-compositional phrases

Page 20

Some Next Steps

• Integrate Czech/English statistical MT
  – Johns Hopkins (Summer 2002 Workshop)

• Integrate with English and Czech ASR
  – IBM and Univ of West Bohemia/Charles Univ

• Integrate into an interactive retrieval system
  – University of Maryland and Shoah Foundation

Page 21

For More Information

• Cross-language and speech retrieval
  – http://www.clis.umd.edu/~dlrg/clir/
  – http://www.clis.umd.edu/~dlrg/speech/

• The MALACH project
  – http://www.clsp.jhu.edu/research/malach/

• NSF/EU Spoken Word Access Working Group
  – http://www.dcs.shef.ac.uk/spandh/projects/swag/