EARS STT Workshop at ICASSP, March 2005
Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman
{ccieri,maamouri,shudong,jfiumara,strassel,graff,walkerk,[email protected]}
What Happens Next?
• Collect feedback here
• Check feasibility of new ideas
– e.g. availability of BN (tran)scripts
• Estimate cost, timeline for wish list
• Sponsors allocate funds
• EARS Board revise priorities
• Re-estimate cost, timeline for task list
• Communicate final plan
• “Start”
What Happened Next?
• Feedback was generally favorable
• Next day learned of 3 month projects
• Received 25% funding
• Preparation of utility thresholds
• Learned of TIDES/EARS end
• Learned that GALE <> TIDES+EARS
• Completed existing commitments
– STT Test Sets (MT Test Set)
– CTS Collections
• Adjusted focus to GALE preparation
Broadcast News
• Continue 2004 collection
– >2000h English: VOA, NBC/MSNBC, CNN, ABC, PBS, PRI, WB17
– >1000h Chinese: VOA, CCTV, Radio Free Asia (RFA), NTDTV, Tai Yuan
– >1000h Arabic: VOA, Al Hurra, Al Jazeera, Dubai, Jordan TV, LBC, Nile
• Select 2005 evaluation set then distribute 2004 data (February 2005)
– delivery made after eval set picked
• 2005 Collection same sources, volumes
– add semi-automatic language, source, program ID to QC process
– harvest (tran)scripts where possible
– 100 hours of transcribed Chinese BN (commercial, QTr)
– 100 hours of transcribed Arabic BN (commercial, QTr)
– collect broadcast conversations: audio and (tran)scripts
• Continue IPR negotiations
• Contribute to Experiments
– Utility of Careful vs. Commercial vs. QTr vs. CC vs. Roverized ASR
• Update pronouncing Lexicons with vocab from English, Chinese, Arabic
• Continue collection with sources adjusted for GALE
– Greater focus on broadcast conversation
– Total: 62.5 hrs/week of Arabic, 60 hrs/week of Chinese, 75 hrs/week of English
– BC: 2.5 hrs/week Arabic, 15 hrs/week Chinese, 25 hrs/week English
– Acquired IPR for several new programs: 100% of English, 50% of Arabic and Chinese
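The weekly totals and broadcast conversation (BC) figures above imply a BC share for each language. A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative and not part of any LDC tooling.

```python
# Weekly collection rates (hours/week) as quoted on the slide.
weekly_total = {"Arabic": 62.5, "Chinese": 60.0, "English": 75.0}
weekly_bc = {"Arabic": 2.5, "Chinese": 15.0, "English": 25.0}


def bc_share(language):
    """Fraction of a language's weekly collection that is broadcast conversation."""
    return weekly_bc[language] / weekly_total[language]


# Combined weekly volume across the three languages.
combined_hours = sum(weekly_total.values())

for language in weekly_total:
    print(f"{language}: {bc_share(language):.1%} of weekly hours are BC")
print(f"All languages: {combined_hours} hrs/week")
```

On these figures Arabic has by far the smallest BC share, with Chinese at one quarter and English at one third of weekly hours.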
English CTS
• Volume: complement 2003 collection to provide another 1400 hours (was 850) with subjects making 1-20 10-minute calls
• Used November 2003 Topics
• BBNT/WordWave doing transcription
• Complete collection of 1400 hours
• Finalize evaluation set
• Distribute beginning in December as transcripts are ready
• 1400 hours sent to BBN/WordWave for transcription
• 450 hours distributed to sites February 17
Chinese CTS
• New Collection at HKUST
– Target 200 hours transcribed, gender balance, regions represented
• Transcription based upon RT03
• 150 hours delivered to LDC so far
– regions not balanced across delivery increments
• Select 2005 evaluation & dev/test sets
– to control demographics across train/test sets
• Deliver training data once final increment has arrived and evaluation data extracted
• Repeat collection in 2005
– require gender, age, regional balance across collection epoch
– require word segmentation?
• Build portable platform?
• HKUST finished collection of 150 hours of CTS
– ready for release once test set extracted
– will deliver 50 more hours at end of March
– will collect & transcribe another 50 hours through June
Arabic CTS
• Fisher Protocol, platform in US
• Select 2005 evaluation set from current collection
• Continue collection until current pool sapped
• Complete audit and transcription; deliver in December
• Add ‘yellow’ tier (surface phonemic) transcription
• Build portable platform? Begin new dialect?
• Demographics changed since last test sets created
– new Dev/Test as well as Eval set required
• Finished 50 hours of Levantine Arabic CTS
• Released on 01/15/2005 as LDC2005S07 & LDC2005T03
• 50 more hours of Levantine due March 31, 2005
• 85 hours scheduled June 30, 2005 ???
• Yellow layer transcription of 15h underway
• RT rates improving: 8-10xRT on green, 15xRT yellow (assuming green)
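The real-time (RT) rates above can be turned into a rough effort estimate. The sketch below assumes that "N xRT" means N hours of transcriber effort per hour of audio, which the slide does not define, and it applies the quoted rates to the 50 hours of Levantine and the 15 hours of yellow-layer work mentioned above. The function name is illustrative only.

```python
def effort_hours(audio_hours, xrt):
    """Transcriber hours needed, assuming xrt = effort hours per audio hour."""
    return audio_hours * xrt


# Green-layer transcription of a 50-hour increment at 8-10xRT (assumed reading).
green_low = effort_hours(50, 8)
green_high = effort_hours(50, 10)

# Yellow-layer transcription of 15 hours at 15xRT, on top of an existing green layer.
yellow = effort_hours(15, 15)

print(f"Green, 50h audio: {green_low}-{green_high} transcriber hours")
print(f"Yellow, 15h audio: {yellow} transcriber hours")
```

Under that reading, the 15-hour yellow layer costs roughly half the effort of a full 50-hour green increment.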
STT Test Sets
• None
MDE
• Ported English specification v6.2 to Chinese, Arabic
• Created MDE v7 specification, tool for English
• Created Chinese and Arabic tools
• Created small pilot data set in each language
• Distributed as: LDC2004E47
GALE Preparation
• Created 13 new Fisher English topics designed to elicit ACE-worthy conversations
• Collected 500 conversations; manually selected 25% for transcription. These are transcribed and are in the ACE annotation pipeline
• LDC staff read DLI DLPT material in Arabic
• LDC staff read WSJ articles
• In preparation for GALE, adding new source