Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks
by W.K. Lo, P.C. Ching, Tan Lee and Helen Meng
The Chinese University of Hong Kong
at ROCLING XIV, 16th August 2001
Acknowledgment
• The CUCall data collection is conducted with the support of the Innovation and Technology Fund (AF/96/99)
• We are also grateful to the industrial sponsors:
  – Group Sense Limited
  – SmarTone Mobile Communication Limited
Outline
• Corpus Design and Organization
  – phonetically oriented
  – application oriented
• Data Collection and Processing
• Data Analysis
• Conclusions
Part I: Corpus Design and Organization
Overview
• extension of the CUCorpora microphone speech database
• collection of telephone speech data over fixed-line and mobile networks
• supports phonetically oriented and domain-specific applications
  – rich phonetic coverage with speaking style variations
  – words, phrases and digit strings for specific uses
CUCall Organization
Phonetically Oriented
• 5719 sentences
  – selected from the pools of the CUSENT training and testing sets
  – targeted at phonetic coverage in a biphone context
• 90 short paragraphs
  – enrich the phonetic coverage in addition to the sentence materials
  – capture the variations brought about by the lengthy nature of the reading materials
  – content is unlimited and unconstrained
  – contains all kinds of non-speech events, e.g. correction, hesitation, skipped word, …
  – questions must be simple and open-ended
Phonetically Oriented
• Criteria for the question design
  – simple enough for spontaneous response; avoid calculation, memory recall, etc.
  – answers are expected to differ from speaker to speaker
  – responses may be either long or short
  – avoid answers that involve speakers' privacy
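The biphone coverage targeted by the sentence set can be illustrated with a minimal sketch. It assumes each sentence is already transcribed as a sequence of sub-syllable units and treats a "biphone" simply as a pair of adjacent units. The function names and the toy transcriptions are hypothetical and are not taken from CUCall or CUSENT.

```python
from typing import Iterable


def biphones(units: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent unit pairs in one transcription."""
    return set(zip(units, units[1:]))


def biphone_coverage(sentences: Iterable[list[str]]) -> set[tuple[str, str]]:
    """Collect the distinct biphones seen across a set of sentences."""
    covered: set[tuple[str, str]] = set()
    for units in sentences:
        covered |= biphones(units)
    return covered


# Toy transcriptions (illustrative labels only, not corpus data).
toy_sentences = [
    ["n", "ei", "h", "ou"],
    ["h", "ou", "d", "o"],
]

covered = biphone_coverage(toy_sentences)
print(f"{len(covered)} distinct biphones covered")
```

A count like this could be compared against the inventory of possible unit pairs to see which contexts a candidate sentence set still misses.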
Application Oriented
• 1440 words and phrases
  – simple words covering various domains
    • names of places
    • listed companies
    • foreign currencies
    • navigation commands
• Digit strings
  – strings of digits of various lengths
    • all ten single digits
    • randomly generated strings of length 7, 8 and 16
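As a rough illustration of the digit-string material, the sketch below builds a prompt list containing the ten single digits plus randomly generated strings of length 7, 8 and 16. The number of strings per length, the seed, and the function name are assumptions made for the example; the slides do not say how the actual strings were generated.

```python
import random

LENGTHS = (7, 8, 16)  # string lengths named on the slide


def make_digit_prompts(per_length: int, seed: int = 0) -> list[str]:
    """Return the ten single digits followed by random digit strings."""
    rng = random.Random(seed)  # seeded so the prompt list is reproducible
    prompts = [str(d) for d in range(10)]  # all ten single digits
    for length in LENGTHS:
        for _ in range(per_length):
            prompts.append("".join(rng.choice("0123456789") for _ in range(length)))
    return prompts


# per_length=2 is an arbitrary value chosen for the example.
prompts = make_digit_prompts(per_length=2)
for p in prompts[10:]:
    print(p)
```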
Part II: Data Collection and Processing
Collection Process
• Preparation of reading materials
  – prepare reading materials as prompt sheets
  – separate male & female, fixed & mobile lines
• Distribution of prompt sheets
  – distributed hierarchically through agents
• Speakers call
  – speakers call automatic recording servers
  – they are identified by unique serial numbers
• Questionnaire return
  – information on age and telephone network type is collected
Data Collection System Set-up
[Figure: data collection system, with the following labelled components]
• Calling End: from any location, using any telephone, by people from all walks of life
• Telephone Companies: mobile/fixed-line network
• Recording Servers: fixed-line connection to local telephone companies
• Recording End: telephone outlet, telephony hardware, recording system, data storage system
• Post-processing of data for various targeted domains of applications
Note: CT board is Dialogic® D/41-ESC
Post-processing of Data
• Call validation
  – received prompt sheets are verified against the recorded speech data
  – user information is entered into databases
• Phonemic transcription
  – all accepted speech data are 100% phonemically transcribed at the initial-final level
• Partitioning of collected data
  – collected data are partitioned properly
  – speech data and the transcriptions are organized on a per-speaker basis
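The validation and per-speaker organization steps might look roughly like the sketch below. The record layout, the serial-number format, and the acceptance rule (a call is accepted only if every prompt on the sheet has a recording) are assumptions for illustration; the slides do not specify the actual criteria or data structures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    serial: str         # speaker's unique serial number (hypothetical format)
    prompt_id: str      # which prompt-sheet item was read (hypothetical field)
    transcription: str  # initial-final level transcription


def group_by_speaker(utterances: list[Utterance]) -> dict[str, list[Utterance]]:
    """Organize utterances (speech data plus transcription) per speaker."""
    by_speaker: dict[str, list[Utterance]] = defaultdict(list)
    for utt in utterances:
        by_speaker[utt.serial].append(utt)
    return dict(by_speaker)


def validate_call(sheet_prompts: set[str], recorded: list[Utterance]) -> bool:
    """Assumed rule: accept only if every sheet prompt has a recording."""
    return sheet_prompts <= {utt.prompt_id for utt in recorded}


# Toy records, not corpus data.
utts = [
    Utterance("S0001", "p1", "n ei"),
    Utterance("S0001", "p2", "h ou"),
    Utterance("S0002", "p1", "n ei"),
]
grouped = group_by_speaker(utts)
sheet = {"p1", "p2"}
results = {serial: validate_call(sheet, recs) for serial, recs in grouped.items()}
print(results)
```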
• the collection process is divided into several stages
• expected completion date: March 2002
• so far, over 200 hours of data (from 1000 speakers) have been collected
  – 120 hours of phonetically oriented data
  – 80 hours of application-specific data
• over half of the collected data have been phonemically transcribed
Conclusions
• the design and collection process for the Cantonese telephone speech corpora is presented
• the corpora are designed to cover both phonetically oriented and application-specific data
• long reading materials and open questions for spontaneous data are also included
• details of post-processing and data analysis are given