May 11, 2015

© Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan
Issues in Designing a Corpus of Spoken Irish

Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan

Centre for Language and Communication Studies
Trinity College Dublin
Ireland

LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”: Issues in Designing a Corpus of Spoken Irish

Overview

Linguistic Background
Corpus Design
Pilot Corpus
  Data Collection and Recording
  Transcription
  Corpus Processing
Future work

Irish

Indo-European, Celtic language
Verb-initial language (VSO)
Irish is the first official language of Ireland; English is the second official language.
Irish is spoken as a first language (L1) in only a small number of areas known as Gaeltachtaí.
Irish is learned at school as a second language by the majority of the population.

Irish Speaking Regions

Na Gaeltachtaí (numbered as on the map)
1. Donegal
2. Mayo
3. Galway
4. Kerry
5. Cork
6. Waterford
7. Meath

Irish

1.6 million of the 3.9 million population report proficiency in the spoken language.

The number of native speakers is 64 thousand.

These sociolinguistic conditions mean that a comprehensive spoken corpus can play a vital role in promoting and preserving the spoken language.

Motivation

Linguistic research
  language change, language contact …
  phonology, syntax, semantics, pragmatics, discourse etc.
Lexicography (new Irish-English dictionary project due to start in 2013)
Teaching materials
Speech Recognition

Existing Resources

Spoken Language Collections
  Caint Chonamara (1964), 1.2 million words
  Iorras Aithneach Irish (pub. 2007)
  Doegen Records Web Project (1928-1931) (various dialects)
  Other dialectal studies (without audio files)

Motivation

Various difficulties …
  one dialect, or one year
  Different dialects but mainly songs, stories, monologues
  Very little dialogue
  Book and CD format (pdf)
  Some phonetic transcriptions but not other linguistic annotation
  Limited searchability

Motivation

Need a spoken corpus which is:
  Dialectally balanced
  Diachronically balanced
  Gender/age balanced
  L1 and L2 speakers
  Text aligned with audio/video file
  Linguistically annotated

Corpus Design

We examined the design of a number of corpora:
  London-Lund Corpus of Spoken English
  Lancaster/IBM Spoken English Corpus (SEC)
  Corpus of Spoken New Zealand English
  British National Corpus (BNC)
  COREC (Corpus oral de referencia del Español Contemporáneo)
  CLIPS (Corpora e Lessici dell’Italiano Parlato e Scritto)
  ICE (The International Corpus of English)
  CGN (Corpus Gesproken Nederlands)

Corpus Design

One common feature shared by the more recent corpora surveyed here is the extent of naturalistic conversational material they include.

Our design is heavily influenced by ICE and CGN

Corpus Design

Dialogues (420, 70%)
  Private (250, 42%)
    [r] Face-to-face conversations (120, 20%)
    [r] Phone calls (50, 8.5%)
    [r] Video calls (50, 8.5%)
    [r] Interviews with teachers of Irish (30, 5%)
  Public (170, 28%)
    [r] Classroom Lessons (40, 7%)
    Broadcast Discussions (40, 7%)
    Broadcast Interviews (40, 7%)
    Parliamentary Debates (20, 3%)
    [r] Legal cross-examinations (10, 1.5%)
    [r] Business Transactions (20, 3%)

Monologues (180, 30%)
  Unscripted (90, 15%)
    Spontaneous Commentaries (40, 7%)
    Unscripted Speeches (20, 3%)
    Demonstrations (20, 3%)
    [r] Legal Presentations (10, 1.5%)
  Scripted (90, 15%)
    Broadcast News (40, 7%)
    Broadcast Talks (40, 7%)
    Non-broadcast Talks (10, 1%)
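
The category sizes on this slide can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the slide, while the nested dictionary, the name `DESIGN` and the helper `total` are assumptions introduced here. The slide does not say what unit the counts are in, so the script treats them as plain numbers.

```python
# Counts copied from the design slide; the nesting and names are illustrative.
DESIGN = {
    "Dialogues": {
        "Private": {
            "Face-to-face conversations": 120,
            "Phone calls": 50,
            "Video calls": 50,
            "Interviews with teachers of Irish": 30,
        },
        "Public": {
            "Classroom Lessons": 40,
            "Broadcast Discussions": 40,
            "Broadcast Interviews": 40,
            "Parliamentary Debates": 20,
            "Legal cross-examinations": 10,
            "Business Transactions": 20,
        },
    },
    "Monologues": {
        "Unscripted": {
            "Spontaneous Commentaries": 40,
            "Unscripted Speeches": 20,
            "Demonstrations": 20,
            "Legal Presentations": 10,
        },
        "Scripted": {
            "Broadcast News": 40,
            "Broadcast Talks": 40,
            "Non-broadcast Talks": 10,
        },
    },
}


def total(node):
    """Sum the leaf counts under a category (or return the count itself)."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


grand_total = total(DESIGN)
for mode, groups in DESIGN.items():
    # Share of the whole design taken by dialogues or monologues.
    print(f"{mode}: {total(groups)} ({total(groups) / grand_total:.0%})")
    for group, items in groups.items():
        print(f"  {group}: {total(items)} ({total(items) / grand_total:.0%})")
```

The leaf counts add up to 600, and the group subtotals (420 and 180; 250, 170, 90 and 90) agree with the figures given on the slide.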

Corpus Design

Our design considers the following variables:
  Time frame
  Dialectal variation
  Sociolinguistic variation
  Gender and age
  Context and subject matter

Time Frame

We have decided upon three time periods:
  P1: 1930-1971
  P2: 1972-1995
  P3: 1996-present
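
As a small illustration of how a recording could be assigned to one of these periods, here is a hedged Python sketch. The period boundaries come from the slide; the names `PERIODS` and `period_for_year` are hypothetical, and treating "present" as the current calendar year is an assumption.

```python
from datetime import date

# Period boundaries copied from the slide; P3 is open-ended ("present").
PERIODS = [("P1", 1930, 1971), ("P2", 1972, 1995), ("P3", 1996, None)]


def period_for_year(year, today=None):
    """Return the period label for a recording year, or None if out of range."""
    current_year = (today or date.today()).year
    for label, start, end in PERIODS:
        # An open-ended period runs up to the current year (an assumption).
        upper = current_year if end is None else end
        if start <= year <= upper:
            return label
    return None


print(period_for_year(1964))
print(period_for_year(2010))
```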

Dialectal Variation

We aim to cover the main dialects of Irish in equal measure, i.e. not proportionally to the number of speakers of each dialect (which may have varied over the years):
  Ulster (north)
  Connaught (west)
  Munster (south)

Sociolinguistic Variation

We aim to include Irish speakers from all linguistic backgrounds:
  ‘Traditional’ native speakers (L1)
  Non-native speakers (L2)
  ‘Non-traditional’ native speakers (L1), i.e. those who were raised through Irish by L1 or L2 parents, typically in a non-Gaeltacht setting

Gender and Age Variation

We aim to represent both males and females proportionally

We aim to represent different generations, i.e. young adults, middle-aged and elderly speakers.

Content Variation

We aim to record conversations in a variety of contexts (informal, work, leisure, education etc.) and cover a variety of topics.

Overall, we aim for a spoken corpus of approximately 2 million words.

Pilot Corpus - GaLa

Pilot Corpus

Funded by Foras na Gaeilge
P3: 1996-present (contemporary)
Dialogues
Mainly public broadcast dialogues (mp3 podcasts of radio interviews and discussions).
We also carried out a small amount of video recording of private dialogue conversations.

Data Collection

Four pairs of volunteers agreed to be video recorded in informal conversation in the Speech Communications Laboratory, TCD

Video recorded using a Sony HDR-XR500v High Definition Handycam.

The audio was recorded in two ways: using the onboard camera microphone, and using two Sennheiser MKH-60 shotgun microphones and an Edirol 4-channel HD Audio recorder.

Podcast Extracts

70 x 8 min. audio extracts were transcribed, giving 102,000 words of transcribed speech (8.5 hours approx.).

We also aligned and formatted some existing transcripts:
  Frenda (2011), material transcribed for PhD research, TCD (20K)
  Wigger (2000), Caint Chonamara (10K)
  Dillon, G., material transcribed for PhD research, TCD (5K)

Overall total: 140,000 words (approx.), 106 transcripts, 151 speakers
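
The word counts on this slide can be tallied in a few lines. The sketch below only restates the slide's figures; the dictionary `SOURCES` and its labels are introduced here for illustration.

```python
# Word counts as given on the slide (K = thousand); labels are for illustration.
SOURCES = {
    "Podcast extracts (newly transcribed)": 102_000,
    "Frenda (2011)": 20_000,
    "Wigger (2000), Caint Chonamara": 10_000,
    "Dillon, G.": 5_000,
}

total_words = sum(SOURCES.values())
print(f"Sum of listed sources: {total_words:,} words")

# Average speech rate implied by the podcast figures:
# 102,000 words in roughly 8.5 hours of audio.
words_per_minute = 102_000 / (8.5 * 60)
print(f"Implied rate: {words_per_minute:.0f} words per minute")
```

The listed sources sum to 137,000 words, which the slide rounds to approximately 140,000, and the podcast figures imply an average of about 200 words per minute.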

Transcription

Spoken and written language differ in a number of important respects.

The syntactic structure of spontaneous spoken utterances is usually simpler.

Spontaneous speech contains repetitions, false starts and hesitations, as well as non-verbal communication such as gesture or tone of voice.

Dialectal pronunciations deviate substantially from standard orthographical representations.

Transcription Guidelines

Phonetic or orthographic transcription?

We examined a number of transcription conventions already in use, including:
  CHAT: The CHAT (Codes for the Human Analysis of Transcripts) system is a comprehensive standard for transcribing and encoding the characteristics of spoken language (MacWhinney, 2000).
  LINDSEI: Louvain International Database of Spoken English Interlanguage, transcription guidelines http://www.uclouvain.be/en-307849.html
  LDC: Linguistic Data Consortium http://www.ldc.upenn.edu/Creating/creating_annotated.shtml#Transcription

CHAT Guidelines

CHAT (Codes for the Human Analysis of Transcripts) (MacWhinney, 2000).

These guidelines were developed for the transcription of spoken interactions between children and their carers in order to study child language acquisition.

They cover inaudible segments, phonetic fragments, repetitions, overlaps, interruptions, trailing off, foreign words, proper nouns, numbers, etc.

CHAT Guidelines

The guidelines are very comprehensive, but there are a few drawbacks to implementing them in full:
  it can slow down the transcription process considerably
  some codes are quite subjective (short, medium and long pauses)
  others are difficult to implement (retracings and reformulations)

LDC Transcription Guidelines

LDC guidelines advocate simplicity.

Keep the rules to a minimum in order to make transcription as easy as possible for the transcriber, which increases transcription speed, accuracy and consistency.

In addition, automatic procedures are used when possible.

Transcription

On average it takes 30 minutes to orthographically transcribe 1 minute of audio material.

The transcription process must be as straightforward and intuitive as possible.

Minimum number of codes and keystrokes: [repeated material], xxx, < … > [?], [% comment], @laugh etc., @eng, filled pauses {yeah, ehm, uh..}
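
To give a feel for what the reported transcription rate means in practice, the following sketch estimates transcriber effort. The 30:1 ratio, the 8-minute extract length and the 8.5 hours of podcast audio are figures from the talk; the function name and its default parameter are hypothetical.

```python
def transcription_hours(audio_minutes, minutes_per_audio_minute=30):
    """Estimate transcriber effort in hours for a given amount of audio."""
    return audio_minutes * minutes_per_audio_minute / 60


# One 8-minute extract at the reported average rate.
print(transcription_hours(8))          # 4.0
# The roughly 8.5 hours of podcast audio transcribed for the pilot corpus.
print(transcription_hours(8.5 * 60))   # 255.0
```

At this rate a single extract takes about four hours to transcribe, which helps explain the emphasis on keeping the codes and keystrokes to a minimum.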

Transcription

Dialectal variation
  maith ‘good’ /mah/ or /maɪ/
  an-mhaith ‘very good’ /ənə'wa/ or /ənə'waɪ/ or /ənə'va/

Initial mutations
  ag déanamh ‘doing’ /ə d´ianəv/ or /ə ʤanu/ (not a’ déanamh)

standard orthography

Transcription Guidelines

Advantages to using standard orthography:
  It makes the job of transcription easier and quicker for transcribers
  It helps minimise spelling inconsistencies among transcribers, as only standard spelling is used, apart from predefined lists of permitted exceptions

Attempting to represent actual pronunciation in orthography is difficult and prone to inconsistency. It can be more accurately captured in a separate phonetic transcription layer (which may be partially generated from the orthography).

Transcription Guidelines

Standard orthography facilitates corpus querying and lexical searches

Standard orthography facilitates automatic text processing, such as part-of-speech tagging and parsing

Transcription codes for some linguistic features (e.g. co-articulation effects, elision etc.) would require specialist training for transcribers, in order to ensure accuracy and consistency, and are better undertaken as a separate task.

Transcription Software

We tested several pieces of freely available transcription and annotation software (e.g. Praat, ELAN, Anvil, CLAN, Xtrans, Transcriber).

We chose Transcriber http://trans.sourceforge.net
  It has a straightforward user interface
  It facilitates alignment of the audio and text transcription in XML format
  Audio duration and word count information at a glance
  Transcripts can be conveniently exported as text

Transcription Software

It handles a variety of audio file types, including .wav, .mp3 (podcasts) and .ogg

The later version of the software, TranscriberAG, can handle video as well as audio

It facilitates the annotation of various features of spontaneous speech (overlap, interruptions, coughs, laughs, etc.) as well as linguistic categories (e.g. proper nouns, human/animate etc.) if desired

It can be used with foot pedals for increased speed if necessary

Transcribers

Audio segments of 8 min. in duration: broadcast discussions and interviews from Raidió na Gaeltachta podcasts.

A panel of 22 transcribers was recruited.

Work packages (filenames, speaker ids) were sent via e-mail to members of the panel, who worked from home.

They returned a time-aligned transcription and timesheet for each workpackage completed.

Transcription Checking

Each transcript was checked for accuracy against the audio file by a member of the project team.

In the case of new video recordings, the transcripts were also anonymised, i.e. names and places which could identify the participants were replaced by fictitious names to ensure anonymity.
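
The replacement step described here might be sketched as follows. This is not the project's actual procedure, which the slides do not describe; the function `anonymise`, the mapping and the names in it are all invented for illustration.

```python
import re


def anonymise(text, replacements):
    """Replace each identifying name with its fictitious counterpart."""
    if not replacements:
        return text
    # Longest names first so that a full name wins over a shorter name it contains.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(name) for name in names) + r")(?!\w)"
    )
    return pattern.sub(lambda match: replacements[match.group(1)], text)


# Hypothetical mapping; the names below are invented for illustration.
mapping = {"Máire Ní Bhriain": "Síle Ní Dhónaill", "Máire": "Síle"}
print(anonymise("Bhí Máire Ní Bhriain ann. Dúirt Máire é.", mapping))
```

A simple string match like this would miss mutated forms of a name (for example a lenited form after the vocative particle), so a manual check of each transcript would likely still be needed.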

Corpus Processing

Corpus Metadata
XCES Corpus Encoding Standard
Part-of-Speech Tagging
SketchEngine Corpus Query Tool

Corpus Metadata

All relevant details related to speakers, transcripts and transcribers are recorded in a database.

Each speaker is given a speaker code which is used in the transcript in place of the speaker’s name, in order to make speakers less recognisable.

Speaker attributes such as dialect, language acquisition type (L1-G, L1-NG, L2), gender and age are recorded where known.

Corpus Metadata

The corpus database is used to generate XML corpus headers, and to facilitate ongoing monitoring of word counts for the various corpus design categories.
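
A minimal sketch of both uses, assuming the database rows are available as Python dictionaries. The rows, ids and word counts below are invented; the attribute names follow those of the corpus's `<doc>` element, and the helper names are hypothetical. The slides do not describe how the project's own tooling does this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical database rows; ids and word counts are invented for illustration.
transcripts = [
    {"id": "t001", "period": "1996-pres", "medium": "broadcast-radio",
     "spokentype": "interview", "words": 1450},
    {"id": "t002", "period": "1996-pres", "medium": "broadcast-radio",
     "spokentype": "discussion", "words": 1320},
    {"id": "t003", "period": "1996-pres", "medium": "video",
     "spokentype": "interview", "words": 980},
]


def make_header(row):
    """Build a <doc> element whose attributes come from one database row."""
    attrs = {key: str(value) for key, value in row.items() if key != "words"}
    return ET.Element("doc", attrs)


def words_by_category(rows, field="spokentype"):
    """Total the word counts for each value of a design category."""
    counts = Counter()
    for row in rows:
        counts[row[field]] += row["words"]
    return counts


print(ET.tostring(make_header(transcripts[0]), encoding="unicode"))
print(words_by_category(transcripts))
```

Running the tally after each batch of transcripts shows which design categories are still short of their targets.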

XCES – XML Corpus Encoding Std.

For each transcript, the output of the Transcriber software was transformed into TEI compliant XCES (XML Corpus Encoding Standard) format using a Perl script and data from the corpus database.

Speech Turns

All of the transcripts to date involve conversations between at least two participants (dialogues).

It is quite common, particularly in radio interviews, for spoken interactions to take place between speakers with different dialects or between native and non-native speakers.

Speech Turns

In order to create sub-corpora on the basis of dialect, native/non-native status, speaker, age, gender etc., these features must be recorded at the level of the speaker turn rather than for the transcript as a whole.
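
Recording attributes on each turn makes sub-corpus selection a simple filter. The sketch below uses Python's standard `xml.etree.ElementTree` on an invented sample shaped like the corpus's `<speaker_turn>` markup; the sample text, the `L2` value for `actype` and the function `select_turns` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample in the same shape as the corpus's <speaker_turn> markup.
SAMPLE = """<doc id="demo">
<speaker_turn id="1" dialect="Ulaidh" gender="Bain" actype="L1 Gaeltacht">focal a haon</speaker_turn>
<speaker_turn id="2" dialect="Mumhan" gender="Fir" actype="L2">focal a dó</speaker_turn>
<speaker_turn id="3" dialect="Ulaidh" gender="Fir" actype="L2">focal a trí</speaker_turn>
</doc>"""


def select_turns(root, **criteria):
    """Return the text of every turn whose attributes match all criteria."""
    return [
        (turn.text or "").strip()
        for turn in root.iter("speaker_turn")
        if all(turn.get(name) == value for name, value in criteria.items())
    ]


root = ET.fromstring(SAMPLE)
print(select_turns(root, dialect="Ulaidh"))
print(select_turns(root, dialect="Ulaidh", actype="L2"))
```

If the attributes were stored only on the transcript, a mixed-dialect interview could not be split in this way.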

XML - XCES

<doc id = "irbs0012" title = "Barrscéalta 08 October 2010" period = "1996-pres" medium = "broadcast-radio" spokentype = "interview" text_source = "GALA-TCD" av_source = "RnaG podcast">

<speaker_turn id = "200" code = "RNG_ANC" dialect = "Ulaidh" gender = "Bain" actype = "L1 Gaeltacht" year = "2010">
caidé méid airgid a chosnódh sé na bádaí seo a thabhairt suas chun dáta agus cloígh lena rialacha úra atá tagtha isteach?
</speaker_turn>

<speaker_turn id = "559" code = "RNG_LCI" dialect = "Mumhan" gender = "Fir" actype = "L1 Gaeltacht?" year = "2010">
Bhuel ehm braitheann sé sin ar chaighdeán an bháid, abair, agus níl aon dabht faoi ach go bhfuil sé costasach, abair, [tá tá] tá tuairiscí …

Part-of-Speech Tagging

All transcripts are lemmatised and POS tagged

Using finite-state tools (xfst/foma) and Constraint Grammar (VISL cg3)

Future Work

Extensive data collection is required

Archives need to be examined for suitable material (diachronic corpus)

Quality control procedures for transcription standards need to be formalised

Testing and enhancement of POS tagging tools for spoken language

Websites

GaLa TCD Website https://www.scss.tcd.ie/SLP/gala/index.utf8.html

GaLa in the SketchEngine http://the.sketchengine.co.uk/

Go raibh maith agat!

Thank you!