Issues in Designing a Corpus of Spoken Irish

May 11, 2015

© Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan

  • 1. Issues in Designing a Corpus of Spoken Irish. Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan. Centre for Language and Communication Studies, Trinity College Dublin, Ireland.

2. Overview
    - Linguistic Background
    - Corpus Design
    - Pilot Corpus
    - Data Collection and Recording
    - Transcription
    - Corpus Processing
    - Future work

LREC-2012: SALTMIL-AfLaT Workshop on Language technology for normalisation of less-resourced languages

3. Irish
    - Indo-European, Celtic language
    - Verb-initial language (VSO)
    - Irish is the first official language of Ireland; English is the second official language.
    - Irish is spoken as a first language (L1) in only a small number of areas known as Gaeltachtaí.
    - Irish is learned at school as a second language by the majority of the population.

4. Irish Speaking Regions: Na Gaeltachtaí
    [Map of the Gaeltacht regions, labelled by county]
    - Donegal
    - Mayo
    - Galway
    - Kerry
    - Cork
    - Waterford
    - Meath

5. Irish
    - 1.6 million of the 3.9 million population report proficiency in the spoken language.
    - The number of native speakers is 64 thousand.
    - These sociolinguistic conditions mean that a comprehensive spoken corpus can play a vital role in promoting and preserving the spoken language.

6. Motivation
    - Linguistic research
        - language change, language contact
        - phonology, syntax, semantics, pragmatics, discourse etc.
    - Lexicography (new Irish-English dictionary project due to start in 2013)
    - Teaching materials
    - Speech Recognition

7. Existing Resources
    - Spoken Language Collections
        - Caint Chonamara (1964), 1.2 mill. wds
        - Iorras Aithneach Irish (pub. 2007)
        - Doegen Records Web Project (1928-1931) (various dialects)
    - Other dialectal studies (without audio files)

8.
Motivation
    - Various difficulties with the existing resources:
        - One dialect, or one year
        - Different dialects but mainly songs, stories, monologues
        - Very little dialogue
        - Book and CD format (pdf)
        - Some phonetic transcriptions but not other linguistic annotation
        - Limited searchability

9. Motivation
    - Need a spoken corpus which is:
        - Dialectally balanced
        - Diachronically balanced
        - Gender/age balanced
        - L1 and L2 speakers
        - Text aligned with audio/video file
        - Linguistically annotated

10. Corpus Design
    - We examined the design of a number of corpora:
        - London-Lund Corpus of Spoken English
        - Lancaster/IBM Spoken English Corpus (SEC)
        - Corpus of Spoken New Zealand English
        - British National Corpus (BNC)
        - COREC (Corpus oral de referencia del Español Contemporáneo)
        - CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto)
        - ICE (The International Corpus of English)
        - CGN (Corpus Gesproken Nederlands)

11. Corpus Design
    - One common feature shared by the more recent corpora surveyed here is the extent of naturalistic conversational material they include.
    - Our design is heavily influenced by ICE and CGN.

12.
Corpus Design
    Dialogues (420, 70%)
        Private (250, 42%)
            [r] Face-to-face conversations (120, 20%)
            [r] Phone calls (50, 8.5%)
            [r] Video calls (50, 8.5%)
            [r] Interviews with teachers of Irish (30, 5%)
        Public (170, 28%)
            [r] Classroom Lessons (40, 7%)
            Broadcast Discussions (40, 7%)
            Broadcast Interviews (40, 7%)
            Parliamentary Debates (20, 3%)
            [r] Legal cross-examinations (10, 1.5%)
            [r] Business Transactions (20, 3%)
    Monologues (180, 30%)
        Unscripted (90, 15%)
            Spontaneous Commentaries (40, 7%)
            Unscripted Speeches (20, 3%)
            Demonstrations (20, 3%)
            [r] Legal Presentations (10, 1.5%)
        Scripted (90, 15%)
            Broadcast News (40, 7%)
            Broadcast Talks (40, 7%)
            Non-broadcast Talks (10, 1%)

13. Corpus Design
    - Our design considers the following variables:
        - Time frame
        - Dialectal variation
        - Sociolinguistic variation
        - Gender and age
        - Context and subject matter

14. Time Frame
    - We have decided upon the three time periods:
        - P1: 1930-1971
        - P2: 1972-1995
        - P3: 1996-present

15. Dialectal Variation
    - We aim to cover the main dialects of Irish in equal measure, i.e. not proportionally to the number of speakers of each dialect (which may have varied over the years):
        - Ulster (north)
        - Connaught (west)
        - Munster (south)

16. Sociolinguistic Variation
    - We aim to include Irish speakers from all linguistic backgrounds:
        - Traditional native speakers (L1)
        - Non-native speakers (L2)
        - Non-traditional native speakers (L1), i.e. those who were raised through Irish by L1 or L2 parents, typically in a non-Gaeltacht setting

17.
Gender and Age Variation
    - We aim to represent both males and females proportionally.
    - We aim to represent different generations, i.e. young adults, middle-aged and elderly speakers.

18. Content Variation
    - We aim to record conversations in a variety of contexts (informal, work, leisure, education etc.) and cover a variety of topics.
    - Overall we aim for a spoken corpus of 2 million words approx.

19. Pilot Corpus - GaLa

20. Pilot Corpus
    - Funded by Foras na Gaeilge
    - P3: 1996-present (contemporary)
    - Dialogues
        - Mainly public broadcast dialogues (mp3 podcasts of radio interviews and discussions).
        - We also carried out a small amount of video recording of private dialogue conversations.

21. Data Collection
    - Four pairs of volunteers agreed to be video recorded in informal conversation in the Speech Communications Laboratory, TCD.
    - Video was recorded using a Sony HDR-XR500v High Definition Handycam.
    - The audio was recorded in two ways:
        - using the onboard camera microphone, and
        - using two Sennheiser MKH-60 shotgun microphones and an Edirol 4-channel HD audio recorder.

22. Data Collection
    - Audio was recorded at a sampling rate of 96 kHz with a bit depth of 24 bits.
    - It was bounced down to a sampling rate of 44.1 kHz with a bit depth of 16 bits (the Red Book audio standard), with the higher-resolution 96 kHz files being used for archiving.

23. Podcast Extracts
    - 70 x 8 min. audio extracts were transcribed, giving 102,000 words of transcribed speech (8.5 hours approx.).
    - We also aligned and formatted some existing transcripts:
        - Frenda (2011), material transcribed for PhD research, TCD (20K)
        - Wigger (2000), Caint Chonamara (10K)
        - Dillon, G., material transcribed for PhD research, TCD (5K)
    - Overall total: 140,000 words (approx.), 106 transcripts, 151 speakers

24. Transcription
    - Spoken and written language differ in a number of important respects.
    - The syntactic structure of spontaneous spoken utterances is usually simpler.
    - Spontaneous speech: repetitions, false starts, hesitations or non-verbal communication such as a gesture or the tone of voice.
    - Dialectal pronunciations deviate substantially from standard orthographical representations.

25. Transcription Guidelines
    - Phonetic or orthographic transcription
    - We examined a number of transcription conventions already in use, including:
        - CHAT: The CHAT (Codes for the Human Analysis of Transcripts) System is a comprehensive standard for transcribing and encoding the characteristics of spoken language (MacWhinney, 2000).
        - LINDSEI: Louvain International Database of Spoken English Interlanguage, transcription guidelines, http://www.uclouvain.be/en-307849.html
        - LDC: Linguistic Data Consortium, http://www.ldc.upenn.edu /Creating/creating_annotated.

26. CHAT Guidelines
    - The CHAT (Codes for the Human Analysis of Transcripts) System (MacWhinney, 2000).
    - These guidelines were developed for the transcription of spoken interactions between children and their carers in order to study child language acquisition.
    - Inaudible segments, phonetic fragments, repetitions, overlaps, interruptions, trailing off, foreign words, proper nouns and numbers etc.

27.
CHAT Guidelines
    - The guidelines are very comprehensive but there are a few drawbacks to implementing the guidelines in full:
        - it can slow down the transcription process considerably
        - some are quite subjective (short, medium and long pauses)
        - while others are difficult to implement (retracings and reformulations)

28. LDC Transcription Guidelines
    - LDC guidelines advocate simplicity.
    - Keep the rules to a minimum in order to make transcription as easy as possible for the transcri