The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals Caroline Williams a , Andrew Thwaites b , Paula Buttery c , Jeroen Geertzen c Billi Randall a , Meredith Shafto a , Barry Devereux a , Lorraine Tyler a a The Centre for Speech, Language and the Brain, University of Cambridge b The MRC Cognition and Brain Sciences Unit, Cambridge c Computation, Cognition and Language Group, RCEAL, University of Cambridge


Dec 14, 2015


Kolby Truelock
Page 1: The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals Caroline Williams a, Andrew.

The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals

Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c), Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a)

(a) The Centre for Speech, Language and the Brain, University of Cambridge
(b) The MRC Cognition and Brain Sciences Unit, Cambridge
(c) Computation, Cognition and Language Group, RCEAL, University of Cambridge


Acknowledgments

• This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G0500842).


Outline of talk

• Motivation for Corpus

• Data collection

• Transcription Guidelines


Motivation

• To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients

• The brain-damaged patients have mainly left-lateral damage (in known speech processing areas)

• Desire to characterise speech output in these populations

• This characterisation has not been done before with respect to language generation


Description of corpus

• The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description

• Brief statistics: 232 healthy individuals, 110 patients, ≈ 23 hours of speech, ≈ 15,000 ‘sentences’

• Spontaneous speech task: 10-minute semi-prompted monologue


The ‘cookie-theft’ picture

From the Boston Diagnostic Aphasia Examination - Goodglass & Kaplan, 1983


Participants

• Healthy individuals
– volunteers who are part of a wider panel recruited for other behavioural and neuro-imaging studies

• Patients
– aetiology is varied but damage is mainly left-lateralised
– patients were selected from a number of sources

• Neuro-imaging scans are available for a third, and growing




The recordings

• For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio.

• For patients: recordings were sometimes carried out at their home, normally with a family member present.


Transcription

• Producing a machine-parseable transcription
– XML-based
– Retaining prosodic information as far as possible
– Paying special attention to speech phenomena (repetitions, hesitations, false starts)

• Comparable corpora and existing guidelines


• DTD-validated XML
– Meta & participant data
– Interview transcription


Outline of the transcription schema

• Meta-data
– Gender
– Age
– Aetiology
– Type of damage
– Broad location of damage
– Date of recording
– Who was in the room
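
As a rough illustration of how the meta-data fields listed above might sit in an XML header, the sketch below reads a made-up record with Python's standard `xml.etree.ElementTree`. Every element and attribute name (`interview`, `meta`, `participant`, `recording`, `present`) and every value is hypothetical; the slides do not show the corpus DTD, so this is only one plausible layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names, attribute names and values are
# invented for illustration and are NOT taken from the corpus DTD.
SAMPLE = """
<interview>
  <meta>
    <participant gender="F" age="67" aetiology="stroke"
                 damage-type="lesion" damage-location="left"/>
    <recording date="2009-01-01">
      <present>participant</present>
      <present>interviewer</present>
      <present>family member</present>
    </recording>
  </meta>
</interview>
"""


def read_meta(xml_text):
    """Collect the meta-data fields of one record into a plain dict."""
    root = ET.fromstring(xml_text)
    participant = root.find("meta/participant")
    recording = root.find("meta/recording")
    return {
        "gender": participant.get("gender"),
        "age": int(participant.get("age")),
        "aetiology": participant.get("aetiology"),
        "damage_type": participant.get("damage-type"),
        "damage_location": participant.get("damage-location"),
        "date": recording.get("date"),
        # "Who was in the room" is modelled here as repeated <present> elements.
        "present": [p.text for p in recording.findall("present")],
    }


meta = read_meta(SAMPLE)
```

Keeping the fields in a flat dict makes it easy to filter interviews by population (for example by age or aetiology) before looking at the transcription itself.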


• Structural units
– Utterance
“And I’ve been in my van uhuh but i’ve been out all day”
– Segment
“(The kiddies are taking biscuits)(now one of them is falling off)”
– Sub-segment
“(erm)(mum)(washing up)”
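
The segment and sub-segment examples above mark unit boundaries with parentheses. A minimal sketch of recovering the individual units from that display notation is shown below; it assumes the parentheses are non-nested and are only the slides' shorthand, since the actual corpus presumably encodes these units as XML elements. The function name `split_units` is hypothetical.

```python
import re


def split_units(bracketed):
    """Return the text of each parenthesised unit, in order of appearance."""
    # [^()]* matches everything between one "(" and the next ")",
    # so nested parentheses are deliberately not supported.
    return [unit.strip() for unit in re.findall(r"\(([^()]*)\)", bracketed)]


segments = split_units(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
sub_segments = split_units("(erm)(mum)(washing up)")
```

Here `segments` holds the two segments of the first example and `sub_segments` the three sub-segments of the second.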


• Representing the nature of speech
– Rep tag
“it is <rep no="1">is</rep> <rep no="2">is</rep> falling over”
– ‘…’ incompleteness
“oh dear the sink is ... and oh my the children”
– Unclear tag etc.
“and <unclear reason="ambiguous">taps</unclear> running”

• Suprasegmental features
– Shifts
  • Laughing
  • Language change etc.
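
One use of the rep tag is to count repetitions and to recover the utterance without them. The sketch below does this for the slide's own example using Python's standard `xml.etree.ElementTree`. It quotes the attribute values so that the fragment is well-formed XML, and it wraps the text in a `<segment>` element whose name is an assumption rather than something shown in the slides.

```python
import xml.etree.ElementTree as ET

# Attribute values are quoted so the fragment is well-formed XML;
# the wrapping <segment> element is a hypothetical name.
SAMPLE = (
    '<segment>it is <rep no="1">is</rep> '
    '<rep no="2">is</rep> falling over</segment>'
)


def strip_repetitions(xml_text):
    """Return (text with repeated material removed, number of <rep> elements)."""
    root = ET.fromstring(xml_text)
    reps = root.findall("rep")
    parts = [root.text or ""]
    for child in root:
        if child.tag != "rep":
            # Keep the text of any other tag (e.g. <unclear>) unchanged.
            parts.append("".join(child.itertext()))
        # The tail is the plain text that follows the child element.
        parts.append(child.tail or "")
    fluent = " ".join("".join(parts).split())  # collapse runs of whitespace
    return fluent, len(reps)


fluent, n_reps = strip_repetitions(SAMPLE)
```

The per-speaker repetition count obtained this way is the kind of simple measure that could be compared across the healthy and patient groups.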


• Phonological information
– phonological information
“The sink is <tr target="flooding">blAdin</tr>”
– IPA transcriptions

• Anonymisation
– All personal names/places replaced with reference markers

• Misc
– Kinetic
– Vocal
– Incident etc.
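
The `tr` example above pairs what the speaker actually produced with the intended target word. A minimal sketch of extracting those pairs with Python's standard `xml.etree.ElementTree` follows. The attribute value is written with straight quotes so the fragment parses, and the enclosing `<segment>` element is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The slide's example with straight-quoted attribute values;
# <segment> is a hypothetical wrapper element.
SAMPLE = '<segment>The sink is <tr target="flooding">blAdin</tr></segment>'


def target_pairs(xml_text):
    """Return (produced form, intended target) for every <tr> element."""
    root = ET.fromstring(xml_text)
    return [(tr.text, tr.get("target")) for tr in root.iter("tr")]


def intended_text(xml_text):
    """Rebuild the utterance with each <tr> replaced by its target word."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "tr":
            parts.append(child.get("target", ""))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


pairs = target_pairs(SAMPLE)
intended = intended_text(SAMPLE)
```

Keeping both forms lets an analysis work either on what was said or on what was meant.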


The next phase

• On the corpus
– Addressing the gap in ages (between 25 and 63 yrs) for healthy individuals with the cookie-theft task
– Addressing the shortfall within each aetiology

• Work derived from the corpus
– Identifying ages based on the cookie-theft description
– Identifying damage based on the tasks
– Speech production issues more generally


References

• Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.


Thank you

• Any questions?