SPOKEN LANGUAGE CORPUS PROJECT SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES

Jan 14, 2016


Transcript
  • SPOKEN LANGUAGE CORPUS PROJECT
    SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES

  • Workshop Overview
    The Asmara Declaration - Rusandre
    What's the point of spoken language corpora? - Jens
    Overview of the project and its phases - Rusandre
    The recording phase - Jens/Mmem
    The transcription phase - Jens
    The checking phase - Jens
    The tagging phase - Leif/Rusandre
    Research output - Jens

  • THE ASMARA DECLARATION - 2000
    Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
    All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.

  • ASMARA DECLARATION - CNTD
    Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
    The effective and rapid development of science and technology in Africa depends on the use of African languages, and modern technology must be used for the development of African languages.

  • What's the point of spoken language corpora?
    Jens Allwood

    Corpus linguistics / Armchair linguistics

  • PROJECT MANAGEMENT

  • OBJECTIVES
    To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA.
    The resources will be in the form of:
    archived audio-visual recordings of activity-based natural language use;
    machine-readable transcriptions of recordings for corpus-driven searches;
    morphologically tagged corpora for corpus-based searches.

  • PROJECT PHASES
    2002 - 2004
    Ongoing
    Audio-video recordings of activity-based spoken language use (min. 200 hrs per language).
    Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
    Checking and editing of transcriptions.
    Manual morphological tagging of corpora.
    Automated tagging of corpora.
    Research outputs.

  • The recording phase
    What to record
    Activity types
    What to think about when recording natural language dialogues
    Keep it natural
    The video camera, microphone, etc.
    Keep the camera fixed!

  • Recording and transcription
    Practical exercise!

    A short recording
    Transcribe together

  • Transcription Structure
    Header (background information about transcription and recorded activity)
    Body (the actual transcription, consisting of two kinds of elements)
    Contributions (transcribed utterances of participants in the recorded activity)
    Information lines (mark various peculiar aspects in the contributions and recorded activity)

  • Example of a header
    @ Recorded activity ID:
    @ Activity type:
    @ Recorded activity title:
    @ Recorded activity date:
    @ Recorder:
    @ Participant: A = F1 ()
    @ Participant: B = M2 ()
    @ Participant: C = M3 ()
    @ Transcriber:
    @ Transcription date:
    @ Checker:
    @ Checking date:
    @ Anonymised: No
    @ Activity Medium: face-to-face
    @ Activity duration: 00:44:30
    @ Other time coding: Each section
    @ Tape: V0105
    @ Section: Family affairs
    @ Section: Crime
    @ Section: Unemployment
    @ Section: Closing
    @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga

  • Transcription header
    @ Recorded activity ID: V010501
    V = Video, 01 = project number, 05 = tape number within this project, 01 = recording number

    @ Activity type: Informal conversation

    @ Recorded activity title: Getting to know each other
    @ Recorded activity date: 20020725
    @ Recorder: Britta Zawada
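The ID layout described on this slide (one letter for the medium, then two digits each for project, tape and recording) can be unpacked mechanically. The following is a minimal Python sketch under that reading of the slide; the names `ActivityId`, `parse_activity_id` and `parse_activity_date` are hypothetical and not part of the project's tooling, and the date is assumed to be in YYYYMMDD form, as the example 20020725 suggests.

```python
from datetime import date, datetime
from typing import NamedTuple


class ActivityId(NamedTuple):
    """Hypothetical container for the parts of a recorded activity ID."""
    medium: str      # e.g. "V" for video
    project: int     # project number
    tape: int        # tape number within the project
    recording: int   # recording number on that tape


def parse_activity_id(raw: str) -> ActivityId:
    """Split an ID such as 'V010501' into its four parts.

    Assumes exactly one letter followed by six digits, as on the slide.
    """
    if len(raw) != 7 or not raw[0].isalpha() or not raw[1:].isdigit():
        raise ValueError(f"unexpected activity ID format: {raw!r}")
    return ActivityId(raw[0], int(raw[1:3]), int(raw[3:5]), int(raw[5:7]))


def parse_activity_date(raw: str) -> date:
    """Read a header date, assuming the YYYYMMDD layout."""
    return datetime.strptime(raw, "%Y%m%d").date()
```

Under the same reading, the first five characters of the ID (`V0105`) correspond to the value given in the `@ Tape:` header field.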

  • Transcription header, cont.
    @ Participant: A = F2 (Lunga)
    @ Participant: B = F1 (Bukiwe)
    F stands for female
    F1 is unique for Bukiwe in the entire corpus
    A and B are IDs for the participants

  • Transcription header, cont.
    @ Transcriber: Mvuyisi Siwisa
    @ Transcription date: 20020805

    @ Checker: Rusandre Hendrikse
    @ Checking date: 20020912

  • Transcription header, cont.
    @ Anonymised: No
    Indicates whether personal names, etc. have been changed to pseudonyms (Yes) or not (No), both in the header and in the conversation

    @ Activity Medium: face-to-face
    Normally spoken, face to face, but could also have other values, like telephone conversations.

  • Transcription header, cont.
    @ Activity duration: 00:44:30
    Duration in hours, minutes and seconds

    @ Other time coding: Each section
    There is a time line for each section

    @ Tape: V0105
    This is a part of the recorded activity ID

  • Transcription header, cont.
    @ Section: Family affairs
    @ Section: Crime
    @ Section: Unemployment
    @ Section: Closing

    @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
    Any relevant information that is not covered by any of the required headings
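Since every header line has the shape `@ Field: value`, and some fields (Participant, Section) repeat, a header can be read into a simple mapping from field name to a list of values. The sketch below assumes that one-field-per-line layout; `parse_header` and `duration_seconds` are hypothetical helper names, not functions from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_header(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect '@ Field: value' lines into {field: [values, ...]}.

    Repeated fields such as Participant or Section keep all their values,
    in the order they appear. Lines that do not start with '@' or contain
    no colon are skipped.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        text = line.strip()
        if not text.startswith("@"):
            continue
        # Split on the first colon only, so values like 00:44:30 stay whole.
        key, sep, value = text[1:].partition(":")
        if not sep:
            continue
        fields[key.strip()].append(value.strip())
    return dict(fields)


def duration_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' activity duration into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds
```

Keeping every value as a list avoids special-casing the repeated fields; a caller that expects a single value can take the first element.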

  • The body
    This is the actual transcription; the background information is in the header.
    Four kinds of lines:
    $A: a ga ba a go senya ka kwano (contribution)
    @ < nod > (information line)
    At office (section line)
    # 00:10:00 (time line)
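The four line kinds can be told apart by how a line begins. The sketch below is an illustration only: it assumes contributions start with `$` plus a speaker ID and a colon, information lines with `@`, and time lines with `#` followed by hh:mm:ss. No marker for section lines is visible in this transcript, so the sketch falls back to treating any other non-empty line as a section line; that fallback is an assumption, and `classify_line` is a hypothetical name.

```python
import re

# "$A: ..." : a contribution by the participant with ID "A"
CONTRIBUTION = re.compile(r"^\$(\w+):")
# "# 00:10:00" : a time line
TIME = re.compile(r"^#\s*\d{2}:\d{2}:\d{2}\s*$")


def classify_line(line: str) -> str:
    """Return the kind of a body line: contribution, information, time,
    section, or empty."""
    text = line.strip()
    if not text:
        return "empty"
    if CONTRIBUTION.match(text):
        return "contribution"
    if text.startswith("@"):
        return "information"
    if TIME.match(text):
        return "time"
    # Assumption: anything else is a section line, because the section
    # marker itself is not visible in this transcript.
    return "section"
```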

  • Sections
    Introduction
    $A: a ga ba a go senya ka kwano
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    ...
    At home
    $A: on{e}o:r{e}a:re thobalo

  • Contributions
    $A: < a ga ba a go senya ka kwano >
    @ < circular hand movements >
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    $C: eng
    $B: THOBOLO e ntsi thata ka kwano /// [1 ka kwa ga rona thobolo ga e kalokalo ]1
    $C: [1 < o: > ]1 tlhobo:lo < >
    @ < pointing gesture indicating a gun>
    @
    $A: on{e}o:r{e}a:re thobalo

  • Comment Lines
    $A: < a ga ba a go senya ka kwano >
    @ < circular hand movements >
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    $C: eng
    $B: THOBOLO e ntsi thata ka kwano /// [1 ka kwa ga rona thobolo ga e kalokalo ]1
    $C: [1 < o: > ]1 tlhobo:lo < >
    @ < pointing gesture indicating a gun>
    @
    $A: on{e}o:r{e}a:re thobalo

  • Overlaps
    $A: < a ga ba a go senya ka kwano >
    @ < circular hand movements >
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    $C: eng
    $B: THOBOLO e ntsi thata ka kwano /// [1 ka kwa ga rona thobolo ga e kalokalo ]1
    $C: [1 < o: > ]1 tlhobo:lo < >
    @ < pointing gesture indicating a gun>
    @
    $A: on{e}o:r{e}a:re thobalo
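In the example, the stretch `[1 ... ]1` appears in both $B's and $C's contributions, which is how simultaneous speech is marked: brackets carrying the same index belong to the same overlap. A minimal Python sketch for pairing them up follows, assuming the index comes directly after each bracket as shown; `find_overlaps` is a hypothetical helper name.

```python
import re
from typing import Dict, Iterable, List, Tuple

# "[1 some words ]1" : opening bracket + index, text, closing bracket + same index
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1(?!\d)")


def find_overlaps(
    contributions: Iterable[Tuple[str, str]]
) -> Dict[int, List[Tuple[str, str]]]:
    """Group overlapped stretches by their index.

    `contributions` is a sequence of (speaker, text) pairs. The result maps
    each overlap index to the (speaker, overlapped text) pairs that share it.
    """
    overlaps: Dict[int, List[Tuple[str, str]]] = {}
    for speaker, text in contributions:
        for match in OVERLAP.finditer(text):
            index = int(match.group(1))
            overlaps.setdefault(index, []).append((speaker, match.group(2)))
    return overlaps
```

Applied to the two overlapping contributions above, this pairs $B's "ka kwa ga rona thobolo ga e kalokalo" with $C's "< o: >" under index 1.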

  • Contrastive stress, pauses and lengthening
    $A: < a ga ba a go senya ka kwano >
    @ < circular hand movements >
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    $C: eng
    $B: THOBOLO e ntsi thata ka kwano /// [1 ka kwa ga rona thobolo ga e kalokalo ]1
    $C: [1 < o: > ]1 tlhobo:lo < >
    @ < pointing gesture indicating a gun>
    @
    $A: on{e}o:r{e}a:re thobalo
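Reading the example against the slide title, capitals (THOBOLO) appear to mark contrastive stress, slashes (`//`, `///`) pauses, and a colon inside a word (`te:ng`) lengthening. The sketch below spots these three features in the text of one contribution. It is a rough illustration under that interpretation; `annotate_prosody` is a hypothetical name, and the speaker label (`$B:`) is assumed to have been removed beforehand, since its colon would otherwise look like lengthening.

```python
import re
from typing import List, Tuple

PAUSE = re.compile(r"^/{1,3}$")    # "/", "//" or "///" standing alone
LENGTHENING = re.compile(r"\w:")   # a colon directly after a letter


def annotate_prosody(text: str) -> List[Tuple[str, str]]:
    """Return (token, feature) pairs for stress, pauses and lengthening.

    `text` is the contribution without its speaker label.
    """
    features: List[Tuple[str, str]] = []
    for token in text.split():
        if PAUSE.match(token):
            features.append((token, "pause"))
        elif len(token) > 1 and token.isalpha() and token.isupper():
            features.append((token, "stress"))
        elif LENGTHENING.search(token):
            features.append((token, "lengthening"))
    return features
```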

  • Unclear speech, reduction, and glottal stop
    $A: < a ga ba a go senya ka kwano >
    @ < circular hand movements >
    $B: nnyaa // tshenyo yona e te:ng le fa (...) bothata ke tshaba thobolo
    $C: eng
    $B: THOBOLO e ntsi thata ka kwano /// [1 ka kwa (ga rona thobolo ga) e kalokalo ]1
    $C: [1 < o: > ]1 tlhobo:lo < >
    @ < pointing gesture indicating a gun>
    @
    $A: on{e}o:r{e}a:re thobalo // ee

  • Research output
    Jens Allwood
    A distributed database (corpus)
    Networks (homepages)
    Spoken language corpus activities (seminars, workshops)

  • TAGGING SPOKEN LANGUAGE SAMPLES
    PROBLEMATIC ISSUES, CONVENTIONS & STANDARDS
    A P Hendrikse 16/03/04

  • PROBLEMATIC ISSUES
    Loans and codeswitching
    Fixed expressions
    Spoken language reductions
    Morphophonological issues
    Designing a tag set
    Manual tagging
    A drag-and-drop tagger
    Automated tagging

  • Loans and Codeswitching
    Non-indigenised codeswitching
    ndifuna
    Indigenised but non-standardised codeswitching loans
    ndiyakleyimisha? ndiyaklayimisha?
    ndiyafonisha? ndiyafowunisha?

  • Fixed Expressions
    A continuum: idioms/proverbs, prefabricated expressions, collocations
    How fixed is fixed?
    Into yokuba (*izinto zokuba)
    Nantso ke (*nantsi ke?)
    (Ke) kaloku (ke)
    Bafondini/mfondini
    Undincedile
    Ungadinwa nangomso

  • Fixed Expressions cntd
    Flagging fixed phrases
    Into_yokuba
    Ke_kaloku_ke
    Morphosyntactic tagging or not?
    Ke_kaloku_ke
    Or
    Ke_kaloku_ke
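Flagging a fixed phrase by joining its words with underscores (Into_yokuba, Ke_kaloku_ke) can be done with a simple lookup over a token list. The Python sketch below assumes a hand-maintained list of fixed phrases and prefers the longest match at each position; `flag_fixed_phrases` is a hypothetical helper, not the project's tool, and it does not address the open question of whether the flagged unit also receives morphosyntactic tags.

```python
from typing import Iterable, List


def flag_fixed_phrases(tokens: List[str], phrases: Iterable[str]) -> List[str]:
    """Join known multi-word expressions into single underscore-linked tokens.

    Matching ignores case and prefers the longest phrase at each position;
    the original spelling of the tokens is kept in the output.
    """
    lowered = [token.lower() for token in tokens]
    patterns = sorted(
        (tuple(phrase.lower().split()) for phrase in phrases),
        key=len,
        reverse=True,
    )
    flagged: List[str] = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            n = len(pattern)
            if n > 0 and tuple(lowered[i:i + n]) == pattern:
                flagged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No fixed phrase starts here: keep the token unchanged.
            flagged.append(tokens[i])
            i += 1
    return flagged
```

Optional elements, as in (Ke) kaloku (ke), can be covered in this scheme by listing each variant ("ke kaloku ke", "ke kaloku", "kaloku ke") as a separate phrase.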

  • Spoken language reductions
    Standardised reductions
    Ngokuba > ngoba
    Written standard reduction: reconstruction convention {} not used, i.e. *ngo{ku}ba
    Non-standardised reductions
    Musa ukuhamba > sukuhamba (wsr) > Suhamba (non-standardised)

  • Spoken Reductions cntd
    Reconstruction convention
    S{uku}hamba
    Tagged
    S{uku}hamba
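Under the reconstruction convention, the material in curly brackets is what the speaker left out, so one transcribed token encodes two forms: what was said (brackets and their contents dropped) and the reconstructed form (brackets dropped, contents kept). A minimal Python sketch of that reading, using a braced form that appears elsewhere in these slides (ngasendl{w}ini); the function names are hypothetical.

```python
import re

# "{...}" : reconstructed material that was not pronounced
BRACED = re.compile(r"\{([^{}]*)\}")


def spoken_form(token: str) -> str:
    """Form as actually pronounced: drop the braces and their contents."""
    return BRACED.sub("", token)


def reconstructed_form(token: str) -> str:
    """Reconstructed form: keep the braced material, drop the braces."""
    return BRACED.sub(r"\1", token)
```

Keeping both forms recoverable from one token lets searches run on either the spoken or the reconstructed form without storing the corpus twice.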

  • Morphophonological Issues
    Coalescence
    Nenkomo > nenkomo
    Neenkomo > neenkomo
    Syllabification
    Ngasendl{w}ini > ngasendl{w}ini
    Ayikafiki > ayikafiki

  • Morphophonological cntd
    Elision
    Andinamoto > andinamoto
    Stem modifications
    Emlanjeni > emlanjeni

  • Designing a tag set
    Granularity
    Lexical categories
    N, V (Tagging lexical categories is problematic in an agglutinating language)
    Syntagmatic morphological slots
    amadodana > amadodana

  • Designing cntd
    Paradigmatic instantiations within a syntagmatic slot
    gnp = ---
    Word categories
    nje (wenjenje) nje; njalo; njeya
    ke
    ke kaloku ke
    ke kaloku ke
    ke_kaloku_ke
    emlanjeni > ??

  • Designing cntd
    Spoken language expressions
    Non-word-like expressions: 2 problems
    Standardising orthographic representation
    Tags
    e: mh: uh_uh_uh

  • Designing cntd
    Word-like expressions
    thixo
    Thixo
    Thixo
    Heyi_wethu
    Nantso_ke
    Suka_(wena)

  • Manual tagging
    Manual tagging necessary for 3 reasons:
    Identifying tagging problems and problematic phenomena and revising the tag set
    Developing a training corpus
    Correcting automated tagging errors
    Manual (typing) tagging not ideal:
    Tedious
    Error-prone
    Solution: Drag-and-drop tagger

  • Drag-and-drop tagger
    Demonstration of drag-and-drop tagger