Enhancing Learning Accessibility through Fully Automatic Captioning
Maria Federico, Marco Furini
Servizio Accoglienza Studenti Disabili, Università di Modena e Reggio Emilia
Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia
W4A 2012, Lyon, April 17, 2012
Architecture for the automatic production of video lesson captions, based on automatic speech recognition (ASR) technologies. A novel caption alignment mechanism that:
1. Introduces unique audio markups into the audio stream before transcription by an ASR
2. Transforms the plain transcript produced by the ASR into a timecoded transcript
Markup Insertion
1. Identification of silence periods (i.e., when the speaker does not speak)
2. Insertion of a unique markup periodically in silence periods
It is important to find reasonable values for the silence length and for the minimum distance between two consecutive markups, so that the transcript contains no truncated words while still providing enough timing information
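The two steps above can be sketched as follows. All parameter values (sample rate, energy threshold, minimum silence length, minimum markup distance) are illustrative assumptions, not the values used in the work; `markup_positions` only computes the times at which a markup clip would be mixed into the audio.

```python
import numpy as np

SAMPLE_RATE = 16000          # Hz (assumed)
FRAME_MS = 20                # analysis frame length (assumed)
SILENCE_RMS = 0.01           # energy threshold for "silence" (assumed)
MIN_SILENCE_S = 0.5          # minimum silence length accepting a markup (assumed)
MIN_MARKUP_GAP_S = 5.0       # minimum distance between consecutive markups (assumed)

def find_silences(samples, rate=SAMPLE_RATE):
    """Step 1: return (start_s, end_s) of contiguous low-energy regions."""
    frame = int(rate * FRAME_MS / 1000)
    n = len(samples) // frame
    # short-time RMS energy per frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    silences, start = [], None
    for i, quiet in enumerate(rms < SILENCE_RMS):
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            silences.append((start * frame / rate, i * frame / rate))
            start = None
    if start is not None:
        silences.append((start * frame / rate, n * frame / rate))
    return silences

def markup_positions(silences):
    """Step 2: pick one insertion point per long-enough silence,
    honouring the minimum distance between consecutive markups."""
    positions, last = [], -MIN_MARKUP_GAP_S
    for s, e in silences:
        if e - s >= MIN_SILENCE_S and s - last >= MIN_MARKUP_GAP_S:
            positions.append(s)   # insert markup at the silence start
            last = s
    return positions
```

Because the markup is inserted only inside detected silences, it can never truncate a spoken word, which is exactly the constraint the parameters above must preserve.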
Speech2text: transcription of the audio stream, coupled with the unique markups, into plain text (including the textual form of the markups)
Any existing automatic speech recognition technology can be used
In the system prototype we used Dragon NaturallySpeaking:
1. Support for the Italian language
2. Availability of speech-to-text transcription from a digital audio file
3. Easy access to the product
4. High accuracy (99% for dictation)
Caption Alignment
Inputs: the plain transcript produced by Speech2text, and timing information about where markups have been inserted by the Markup Insertion Module
Output: transcript with timestamps
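A minimal sketch of this alignment idea, assuming the markup's textual form is a single invented token (the token `ZAPZAP` is purely hypothetical) and that the Markup Insertion Module has recorded the insertion times:

```python
MARKUP = "ZAPZAP"   # hypothetical textual form of the audio markup

def align(transcript, markup_times, total_duration):
    """Split the plain transcript at each markup occurrence and pair
    every text segment with the timestamps of the surrounding markups,
    producing a timecoded transcript as (start_s, end_s, text) tuples."""
    pieces = [p.strip() for p in transcript.split(MARKUP)]
    starts = [0.0] + list(markup_times)
    ends = list(markup_times) + [total_duration]
    return [(s, e, text) for s, e, text in zip(starts, ends, pieces) if text]
```

For example, `align("hello everyone ZAPZAP today we discuss captions", [2.0], 5.0)` yields two captions, one per segment between markups. Note that the ASR is invoked only once, on the markup-enriched audio; the alignment itself is pure text processing.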
Caption Alignment Existing solutions:
1. Alignment of manual transcript with video
2. ASR runs twice
Our solution:
1. Automatic: based on audio analysis
2. Efficient: the ASR runs just once
3. Technology transparent: any ASR can be used
Existing solutions, in contrast, require a high computational environment
Experimental study
Different computer science and linguistics professors of the Communication Sciences degree of the University of Modena and Reggio Emilia, teaching in front of a live audience
To tune the parameters used to locate the positions at which audio markups are inserted
To find the most appropriate hardware (microphone) and software (ASR) products to build the recording scenario
To investigate the transcription accuracy
Transcription accuracy
The higher the values of the silence length and of the minimum markup distance, the better the accuracy; however, these parameters also affect the length of the produced captions, since fewer markups yield fewer, longer caption segments
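Transcription accuracy of this kind is commonly quantified as word accuracy, i.e. 1 − WER (word error rate), where WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The slides do not specify the exact metric used, so this is a generic sketch:

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, computed via word-level edit distance
    (dynamic programming over the two token sequences)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return 1 - d[len(ref)][len(hyp)] / len(ref)
```

With such a metric, the trade-off above can be measured directly: re-running the pipeline with different silence-length and markup-distance settings and comparing the resulting accuracies against a reference transcript.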