Querying Spoken Language Corpora. Thomas Schmidt, IDS Mannheim (member of the Leibniz Association, Leibniz-Gemeinschaft)

Mar 28, 2015

Page 2:

Outline
1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions

Page 3:

Background

• EXMARaLDA: System for building and querying spoken language corpora
  • Used in many individual projects, at the HZSK CLARIN Centre
  • Transcription editor, Corpus management tool, query tool EXAKT
• FOLKER: Transcription tool – same technical basis, optimised for the Research and Teaching Corpus of Spoken German (FOLK)

Page 4:

Background

• Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim
  • Dialect corpora, conversation corpora
• Database for Spoken German (DGD2): access (browsing and query) for AGD data

Page 5:

Model: Single timeline, multiple tiers

• Annotation tuples: text label + timeline reference
• Timeline: fully ordered, reference to a recording
• Tiers: collections of annotations of a specific category, a specific speaker; annotations in a tier do not overlap

Annotation Graph Framework (Bird/Liberman 2001)
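The single timeline, multiple tiers model can be made concrete with a small Python sketch. The class and field names (`Annotation`, `Timeline`, `Tier`) and the sample values are illustrative assumptions, not the schema of EXMARaLDA, ELAN, Praat or any other tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """Annotation tuple: a text label plus start/end references into the timeline."""
    label: str
    start: str  # id of a timeline point
    end: str    # id of a timeline point


@dataclass
class Timeline:
    """Fully ordered points; each may carry an offset (seconds) into the recording."""
    point_ids: list[str]
    offsets: dict[str, float] = field(default_factory=dict)

    def index(self, point_id: str) -> int:
        # Position in the list defines the order of timeline points.
        return self.point_ids.index(point_id)


@dataclass
class Tier:
    """Annotations of one category for one speaker; they must not overlap."""
    speaker: str
    category: str
    annotations: list[Annotation] = field(default_factory=list)

    def add(self, ann: Annotation, timeline: Timeline) -> None:
        s, e = timeline.index(ann.start), timeline.index(ann.end)
        if s >= e:
            raise ValueError("annotation must end after it starts")
        for other in self.annotations:
            o_s, o_e = timeline.index(other.start), timeline.index(other.end)
            if s < o_e and o_s < e:  # interval intersection test
                raise ValueError("annotations in a tier must not overlap")
        self.annotations.append(ann)


# Toy data, invented for illustration only.
timeline = Timeline(["T0", "T1", "T2", "T3"], {"T0": 0.0, "T1": 1.2, "T2": 2.0, "T3": 3.5})
verbal = Tier(speaker="A", category="verbal")
verbal.add(Annotation("okay ", "T0", "T1"), timeline)
verbal.add(Annotation("so we start here ", "T1", "T3"), timeline)
```

Enforcing the no-overlap rule when an annotation is added keeps every tier a simple sequence, which is what makes the later conversion to utterances straightforward.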

Page 6:

EXMARaLDA Basic Transcription:
• (Flat) hierarchy of events in tiers
• Use of ID and IDREFS to encode temporal relations
• No additional markup, no „deep“ semantics
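As a rough illustration of how ID/IDREF pairs encode temporal relations, the Python sketch below parses a small XML fragment and resolves each event's start and end references against a common timeline. The fragment is only loosely modelled on an EXMARaLDA basic transcription; the element and attribute names and all values are assumptions and should be checked against the actual format before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tiers hold flat lists of events, and each event
# points at two timeline items by id.
XML = """
<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="1.2"/>
    <tli id="T2" time="2.0"/>
  </common-timeline>
  <tier id="TIE0" speaker="A" category="v">
    <event start="T0" end="T1">okay </event>
    <event start="T1" end="T2">so we start here </event>
  </tier>
  <tier id="TIE1" speaker="B" category="v">
    <event start="T1" end="T2">mhm </event>
  </tier>
</basic-transcription>
""".strip()

root = ET.fromstring(XML)

# Map each timeline id to its offset in seconds.
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}


def resolved_events(root, times):
    """Return (speaker, start_seconds, end_seconds, text) for every event."""
    out = []
    for tier in root.iter("tier"):
        for ev in tier.iter("event"):
            out.append((tier.get("speaker"),
                        times[ev.get("start")],
                        times[ev.get("end")],
                        ev.text))
    return out


events = resolved_events(root, times)
```

Because B's event shares the timeline items `T1` and `T2` with A's second event, the simultaneity of the two stretches of speech is recoverable from the references alone, without any nesting in the markup.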

Page 7:

• EXMARaLDA

• ELAN

Page 8:

• EXMARaLDA

• ELAN

• Praat

Page 9:

Data formats
• Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
  – XML format for data exchange between seven tools with STMT (single timeline, multiple tiers) data models
  – improves interoperability for data creation
• Drawbacks
  – no document order (non-linear, non-hierarchical)
  – what is the „full text“ / the „primary data“ / the „character data“?
  – no explicit representation of dependencies
  – temporal structure, not linguistic structure
  – bad for querying?

Page 10:

STMT to OHCO transformation

Page 11:

STMT to OHCO transformation

• Segment chain = any temporally connected chain of annotations within one tier

• Assumption: all other hierarchical structure beneath the level of segment chains

• Correspondence: segment chain ↔ <u>
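The notion of a segment chain can be sketched in a few lines of Python. The example assumes that the events of one tier are given as `(start, end, text)` tuples whose start and end values are positions on the common timeline; that representation and the sample data are assumptions made for illustration.

```python
def segment_chains(events):
    """Group the events of ONE tier into temporally connected chains.

    Two events are connected when the end point of one is the start
    point of the next. Each resulting chain would map to one <u>.
    """
    chains = []
    for ev in sorted(events, key=lambda e: e[0]):
        if chains and chains[-1][-1][1] == ev[0]:
            chains[-1].append(ev)   # continues the current chain
        else:
            chains.append([ev])     # gap on the timeline: start a new chain
    return chains


# Toy tier: timeline positions 3 and 4 carry no annotation for this speaker.
events = [
    (0, 1, "okay "),
    (1, 2, "so we start "),
    (2, 3, "here "),
    (5, 6, "and then "),
    (6, 7, "left "),
]

chains = segment_chains(events)
utterances = ["".join(text for _, _, text in chain) for chain in chains]
```

With this input the tier yields two chains, that is, two candidate `<u>` elements; any further structure (words, pauses) would then be parsed out of the text inside each chain.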

Page 13:

Unparsed (EXAKT)

Parsed (DGD2)

Page 14:

Free annotation (EXAKT)

Token annotation (DGD2)

Page 15:

• Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)

• Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech

Page 16:

Transcripts, recordings, metadata
• Interaction metadata
  – date, „genre“, place, degree of formality, etc.
  – pertains to a (set of) transcription(s)
• Speaker metadata
  – age, sex, language biography, speech impediments, etc.
  – pertains to (a) part(s) of a transcription
• Audio and video recordings
  – for checking transcription quality
  – for obtaining information not encoded in transcripts
• Transcripts
  – not (the) primary data!
  – a „convenient index into the recording“?
  – selective, theory-dependent, …

Page 17:

Corpora

Page 18:

Corpora
• AGD Corpora: 8 mill. tokens
• CGN Corpus: 9 mill. tokens
• BNC Spoken: 10 mill. tokens
• MICASE: 2 mill. tokens
• Most other corpora: < 1 mill. tokens

(at least) one order of magnitude smaller than written corpora
Query speed is (not that) important

Page 19:

• „In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“
  – Situational context → Interaction metadata
  – Speaker metadata
  – Text data / Surface form → Transcript text
  – Interactional context → Temporal transcript structure
  – Prosodic properties → Recording

Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results
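A hedged Python sketch of what these two requirements imply for a query: token hits are filtered through both interaction and speaker metadata, and each hit keeps its time span so that a human can listen to the recording and record a judgement. All identifiers, field names and data values are invented for illustration, and the age threshold for „older“ is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    transcript_id: str
    speaker_id: str
    token: str
    start: float  # seconds into the recording
    end: float


# Toy metadata and tokens (not real corpus data).
interactions = {
    "t1": {"formality": "informal", "place": "Northern Scotland"},
    "t2": {"formality": "formal", "place": "Northern Scotland"},
}
speakers = {
    "s1": {"sex": "f", "age": 71},
    "s2": {"sex": "m", "age": 34},
}
tokens = [
    Hit("t1", "s1", "aye", 12.4, 12.7),
    Hit("t1", "s2", "aye", 15.0, 15.2),
    Hit("t2", "s1", "aye", 3.1, 3.4),
    Hit("t1", "s1", "well", 20.0, 20.3),
]


def query(tokens, form, interaction_filter, speaker_filter):
    """Surface-form query restricted by interaction and speaker metadata."""
    return [
        t for t in tokens
        if t.token == form
        and interaction_filter(interactions[t.transcript_id])
        and speaker_filter(speakers[t.speaker_id])
    ]


hits = query(
    tokens,
    "aye",
    interaction_filter=lambda m: m["formality"] == "informal",
    speaker_filter=lambda s: s["sex"] == "f" and s["age"] >= 60,
)

# Manual post-processing: intonation is not in the transcript, so each hit
# is listed with its audio span and an empty field for the listener's verdict.
review_sheet = [
    {"hit": h, "audio_span": (h.start, h.end), "rising_intonation": None}
    for h in hits
]
```

The empty `rising_intonation` field stands for the step that cannot be automated here: the prosodic property has to be judged from the recording.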

Page 20:

• „After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“
  – special word tokens (incomplete words, semi-lexical material, …)
  – non-word tokens (pauses, non-verbal articulations, …)
  – temporal measurements (pause length)

Requirement #3: Queries for „special“ tokens
Requirement #4: Queries with special properties (numerical values, repetition)
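The cut-off example can be written as a small pattern match over a typed token sequence. The sketch below is in Python; the token kinds (`"word"`, `"cutoff"`, `"pause"`) are hypothetical labels, and „repeated“ is interpreted, as an assumption, to mean that the next token begins with the cut-off fragment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    kind: str                         # "word", "cutoff" or "pause" (assumed labels)
    form: str = ""
    duration: Optional[float] = None  # seconds; only set for pauses


def cutoff_pause_repeat(tokens, min_pause=0.3):
    """Return indices of cut-off words followed by a pause longer than
    min_pause and then by a token that starts with the cut-off fragment."""
    hits = []
    for i in range(len(tokens) - 2):
        a, b, c = tokens[i:i + 3]
        if (a.kind == "cutoff"
                and b.kind == "pause"
                and b.duration is not None
                and b.duration > min_pause      # numerical condition
                and c.kind in ("word", "cutoff")
                and c.form.startswith(a.form)):  # repetition condition
            hits.append(i)
    return hits


# Toy sequence: "we sta (0.5) start at th (0.2) the"
tokens = [
    Token("word", "we"),
    Token("cutoff", "sta"),
    Token("pause", duration=0.5),
    Token("word", "start"),
    Token("word", "at"),
    Token("cutoff", "th"),
    Token("pause", duration=0.2),
    Token("word", "the"),
]

hits = cutoff_pause_repeat(tokens)
```

Only the first cut-off qualifies, because the second pause is shorter than the threshold. A plain regular expression over the transcript text could not express the numerical comparison or the back-reference to the fragment as directly.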

Page 21:

• „Filled pauses are less frequent in overlapping speech than at the beginning of turns“

• „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker‘s backchannel“

Requirement #5: Queries for position in temporal structure
Requirement #6: Multiple distance measures, query scopes
[…]
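The following Python sketch illustrates what „position in temporal structure“ and „multiple distance measures“ could mean operationally. It assumes time-aligned tokens that know their utterance and their position in it; the field names, the data, and the approximation of turn-initial by utterance-initial are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    speaker: str
    form: str
    start: float  # seconds
    end: float
    utt: int      # index of the containing utterance
    pos: int      # position within that utterance


def in_overlap(tok, tokens):
    """True if any token of another speaker intersects tok in time."""
    return any(o.speaker != tok.speaker and o.start < tok.end and tok.start < o.end
               for o in tokens)


def is_turn_initial(tok):
    """Approximation: first token of its utterance."""
    return tok.pos == 0


def token_distance(a, b):
    """Distance in tokens; only defined within one utterance."""
    return abs(a.pos - b.pos) if a.utt == b.utt else None


def time_distance(a, b):
    """Gap in seconds between two tokens; 0.0 if they overlap."""
    if a.start < b.end and b.start < a.end:
        return 0.0
    return max(a.start, b.start) - min(a.end, b.end)


# Toy data: speaker A's utterance with B's backchannel overlapping its end.
toks = [
    Tok("A", "äh", 0.0, 0.3, utt=0, pos=0),
    Tok("A", "yes", 0.3, 0.6, utt=0, pos=1),
    Tok("A", "äh", 0.6, 0.9, utt=0, pos=2),
    Tok("B", "mhm", 0.7, 1.0, utt=1, pos=0),
]

filled_pauses = [t for t in toks if t.form == "äh"]
overlapping = [t for t in filled_pauses if in_overlap(t, toks)]
initial = [t for t in filled_pauses if is_turn_initial(t)]
```

The two distance functions reflect the contrast drawn above: „near one another in an utterance“ is a token distance within one speaker's contribution, whereas „near another speaker's backchannel“ only makes sense as a distance on the timeline.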

Page 22:

• Requirements
  – Access to all types of context
  – Manual post-processing of query results
  – Queries for special tokens
  – Queries with special properties
  – Queries for position in temporal structure
  – Multiple distance measures, query scopes
  – …

Page 23:

[Diagram labels: Corpus (Recordings, Metadata, Transcripts), Query, Query result, Context, Postprocessing]

Page 24:

• EXAKT
  – Regular expression on „full text“ of <u>
  – (XPath on <u> with markup)
  – (XSL on transcripts)
• DGD2
  – Oracle full text on documents
  – SQL on <w> with attributes
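The three query styles can be compared on a single toy utterance. The Python sketch below is not the implementation of EXAKT or DGD2: it uses the standard library's `re`, `xml.etree.ElementTree` (which supports only a limited XPath subset) and `sqlite3` as a stand-in for Oracle. The `<pause>` element, the `lemma`/`pos`/`dur` attributes and the table layout are assumptions.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical parsed utterance with token-level markup.
U = ET.fromstring(
    '<u who="A">'
    '<w lemma="I" pos="PRON">I</w> '
    '<w lemma="go" pos="VERB">go</w> '
    '<pause dur="0.5"/> '
    '<w lemma="go" pos="VERB">going</w>'
    '</u>'
)

# 1) Regular expression on the "full text" of <u> (markup stripped).
full_text = "".join(U.itertext())
regex_hits = re.findall(r"\bgo\w*", full_text)

# 2) XPath-style query on <u> with markup: all words tagged as verbs.
xpath_hits = [w.text for w in U.findall(".//w[@pos='VERB']")]

# 3) SQL on <w> with attributes: one row per word token.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE w (position INTEGER, form TEXT, lemma TEXT, pos TEXT)")
con.executemany(
    "INSERT INTO w VALUES (?, ?, ?, ?)",
    [(i, w.text, w.get("lemma"), w.get("pos")) for i, w in enumerate(U.iter("w"))],
)
sql_hits = [row[0] for row in con.execute(
    "SELECT form FROM w WHERE lemma = ? ORDER BY position", ("go",)
)]
con.close()
```

All three queries return the same two tokens here, but they differ in what they can see: the regular expression only sees surface text, while the XPath and SQL variants can address annotations such as lemma and part of speech.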

Page 25:

• Demo 1: EXAKT with HaMaTaC corpus
• HaMaTaC: Hamburg Map Task Corpus
  – advanced L2 learners of German
  – solving a map task
  – Orthographic transcription with lemma, POS, disfluency annotation

Page 26:

• Demo 2: DGD2 with FOLK Corpus
• FOLK: Research & Teaching Corpus of Spoken German

Page 27:

• Future directions:
  – Support a „real“ query language: CQL
  – CQPWeb as a test case
  – User survey DGD2 (approaching 2000 users!)
  – …
  – …
  – TEI as common ground
    • for different spoken language corpora query platforms?
    • for querying spoken and written data side-by-side?