Cross-Language Access to Recorded Speech in the MALACH Project
Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka
Jan 19, 2016
Outline
• The MALACH project
• Searching speech
• A cross-language retrieval experiment
• Next steps
The MALACH Project
• 52,000 interviews with Holocaust survivors
  – 116,000 hours (180 TB MPEG-1)
  – 32 languages, recorded in 67 countries
• Present: Manual indexing
  – 14,000 controlled vocabulary terms
• Future: Automatic indexing
  – Speech recognition
  – Translation
Who Uses the Collection?
Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement
Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use
Based on analysis of 280 access requests
Research Challenges
• Speech Recognition
  – Spontaneous, accented, elderly, language switching
• Computational Linguistics
  – Segmentation, classification, summarization, extraction
• Information Retrieval
  – Query formulation, search, selection, examination, use
Today
Tomorrow (Josef Psutka)
Supporting Information Access
[Figure: information access process within the Search System. Source Selection → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Recording) → Examination → (Recording) → Delivery. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection.]
Key Issues in Speech Retrieval
• Recognition accuracy
  – Content-based retrieval works when WER < 40%
• Topic segmentation
  – Average MALACH interview is 2.3 hours!
• Multi-scale summarization
  – Brief summaries: selection from a ranked list
  – Detailed summaries: minimize audio replay
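The recognition accuracy threshold above is stated as a word error rate (WER). As a minimal sketch, assuming the usual definition (word-level substitutions, deletions, and insertions divided by the number of reference words) and plain whitespace tokenization, WER could be computed as follows. The function name and the tokenization are illustrative assumptions, not the scoring tool used in the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.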
English Recognition Accuracy
• 60% WER for off-the-shelf systems!
  – 3 systems (broadcast news, dictation, telephone)
• MLLR adaptation helps
  – 33% WER for fluent speech
  – 46% WER for heavy accents/disfluent speech
• Next step: retrain on transcribed interviews
  – 200 hours from 800 speakers
Cross-Language Search
• Query formulation
  – Spoken words (free text)
  – Thesaurus descriptors
• Segment selection
  – Speech-to-text translation
  – Multi-scale indicative summaries
• Use of retrieved segments
  – Query reformulation
  – Incorporation in projects
Ranked Retrieval System Design
[Figure: ranked retrieval pipeline. Documents → Compute Term Weights → Build Index. Query → Translation Lexicon → Compute Term Weights. Both paths feed Compute Document Score → Sort Scores → Ranked List.]
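The stages named in this design (compute term weights, build index, compute document score, sort scores) can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which weighting scheme was used, so a simple tf-idf weight is swapped in here, tokenization is plain whitespace splitting, and the query is assumed to have already been mapped into English through the translation lexicon. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(documents):
    """Return {term: {doc_id: weight}} using a simple tf-idf weight (an assumption)."""
    term_counts = {
        doc_id: Counter(text.lower().split())
        for doc_id, text in documents.items()
    }
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(documents)

    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            index[term][doc_id] = tf * math.log(1 + n_docs / doc_freq[term])
    return index


def rank(query_terms, index):
    """Score each document by summing weights of matching query terms, then sort."""
    scores = defaultdict(float)
    for term, query_tf in Counter(query_terms).items():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += query_tf * weight
    # Highest score first; ties broken by document id for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A query term that appears several times, as happens when several lexicon entries yield the same English word, contributes proportionally more to the score in this sketch.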
Evaluation Framework
[Figure: Czech Queries and the Czech/English Translation Lexicon feed Ranked Retrieval over English Documents. The resulting Ranked List and the Relevance Judgments feed Evaluation, which produces a Measure of Effectiveness.]
Czech/English Test Collection
• 113,000 English newspaper stories
• Two sets of 33 Czech queries
  – S: Very short (1-3 words)
  – L: Sentence-length
• Human “ground truth” relevance judgments
  – Pooled assessment methodology (CLEF-2000)
Translation Lexicon
• Machine-readable dictionary
  – Lemmatized Czech query words
  – Looked each up in “PC Translator”
• Bilingual term list
  – Downloaded 800 term pairs from Ergane
• Retained untranslatable terms
  – Stripped diacritics to match proper names
  – Optionally, made minor corrections (by hand)
    • e.g., “afrika” to “africa”
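The lookup-with-fallback procedure described above can be sketched as follows. This is an illustration, not the project's code: the lexicon is a hypothetical in-memory dictionary standing in for the dictionary and term list, the input is assumed to be already lemmatized, and diacritics are stripped with Unicode decomposition from the standard library.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'Berlín' becomes 'Berlin'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_query(lemmas, lexicon, corrections=None):
    """Translate lemmatized query words one by one.

    lexicon:     hypothetical {czech_lemma: [english translations]} mapping
    corrections: optional hand-made fixes for retained terms
    """
    corrections = corrections or {}
    english = []
    for lemma in lemmas:
        key = lemma.lower()
        if key in lexicon:
            # Keep every listed translation of the word
            english.extend(lexicon[key])
        else:
            # Untranslatable: retain the term, minus diacritics
            kept = strip_diacritics(key)
            english.append(corrections.get(kept, kept))
    return english
```

For example, `translate_query(["Afrika"], {}, {"afrika": "africa"})` retains the unknown term and applies the hand correction, returning `["africa"]`.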
Example Query
• Original Czech query (S)
  – Architektura v Berlíně
• Word-by-word translation into English
  – architecture architecture
  – at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
  – berlin
Example Search Results
• Creating a new architectural vocabulary for a democratic Berlin
• UCLA merges architecture and arts into a new school
• Best of Berlin for young travelers
• Who owns the Nazi paper trail?
• A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …
• On the record: Sanderling's dark take on Sibelius
• Max Bill, 85; Controversial Swiss artist, sculptor and writer
• The week ahead: Berlin; Farewell to allies
• Roll over Beethoven; Jeff Berlin leaves the violin and classical …
• Californians had right stuff for airlift; Europe: former pilots …
Precision-Recall Graph
[Figure: interpolated precision (y-axis, 0.0 to 1.0) versus recall (x-axis, 0.0 to 1.0)]
Average Precision = 0.477
Czech title query 1, LA Times Documents, CLEF 2000 Relevance Assessments
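For readers unfamiliar with the measures on this graph, the sketch below shows one common way to compute them, assuming the standard definitions: uninterpolated average precision is the mean of the precision values at the rank of each relevant document (counting unretrieved relevant documents as zero), and interpolated precision at a recall level is the highest precision observed at that recall or beyond. The function names are hypothetical, and the evaluation software actually used is not stated in the slides.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision at each relevant document's rank, over all relevant docs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def interpolated_precision(ranked_ids, relevant_ids, levels=None):
    """Interpolated precision at each recall level (default 0.0, 0.1, ..., 1.0)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Small tolerance guards against floating-point error at exact recall levels
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def mean_average_precision(per_query_ap):
    """Average the per-query average precision values."""
    return sum(per_query_ap) / len(per_query_ap)
```

Both functions assume each document id appears at most once in the ranked list.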
Average Precision
[Figure: bar chart of average precision (y-axis, 0.0 to 1.0) for each query (x-axis: 1 3 4 5 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 26 28 29 30 31 32 33 34 36 37 38 39 40); the bar for query 1 is labeled 0.477]
Czech title queries, LA Times Documents, CLEF 2000 Relevance Assessments
Mean Average Precision = 0.188
Results
[Figure: bar chart of mean average precision (y-axis, 0.0 to 0.5) for four conditions: No Translation, DQT, DQT + Names, Monolingual; label: TTD]
Results
• Czech seems to pose no unusual problems
  – 55% of monolingual with simple techniques
• Suitable Czech/English resources exist
  – Czech morphology
  – Czech/English bilingual lexicon
• Multiword expression handling would help
  – Named entities, non-compositional phrases
Some Next Steps
• Integrate Czech/English statistical MT
  – Johns Hopkins (Summer 2002 Workshop)
• Integrate with English and Czech ASR
  – IBM and Univ of West Bohemia/Charles Univ
• Integrate into an interactive retrieval system
  – University of Maryland and Shoah Foundation
For More Information
• Cross-language and speech retrieval
  – http://www.clis.umd.edu/~dlrg/clir/
  – http://www.clis.umd.edu/~dlrg/speech/
• The MALACH project
  – http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Working Group
  – http://www.dcs.shef.ac.uk/spandh/projects/swag/