Enhancing Learning Accessibility through Fully Automatic Captioning
Maria Federico, Marco Furini
Servizio Accoglienza Studenti Disabili, Università di Modena e Reggio Emilia
Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia
W4A 2012, Lyon, April 17, 2012
Architecture for the automatic production of video lesson captions, based on automatic speech recognition (ASR) technologies. A novel caption alignment mechanism that:
1. Introduces unique audio markups into the audio stream before transcription by an ASR
2. Transforms the plain transcript produced by the ASR into a timecoded transcript
Markup Insertion
1. Identification of silence periods (i.e., when the speaker does not speak)
2. Insertion of a unique markup periodically in silence periods
It is important to find reasonable values for the silence length and for the minimum distance between two consecutive markups, so that the transcript contains no truncated words while still providing enough timing information
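The two steps above can be sketched as follows. All parameter values (sample rate, energy threshold, minimum silence length, minimum markup distance) are illustrative assumptions, not the values used in the work; `markup_positions` only computes the times at which a markup clip would be mixed into the audio.

```python
import numpy as np

SAMPLE_RATE = 16000          # Hz (assumed)
FRAME_MS = 20                # analysis frame length (assumed)
SILENCE_RMS = 0.01           # energy threshold for "silence" (assumed)
MIN_SILENCE_S = 0.5          # minimum silence length accepting a markup (assumed)
MIN_MARKUP_GAP_S = 5.0       # minimum distance between consecutive markups (assumed)

def find_silences(samples, rate=SAMPLE_RATE):
    """Step 1: return (start_s, end_s) of contiguous low-energy regions."""
    frame = int(rate * FRAME_MS / 1000)
    n = len(samples) // frame
    # short-time RMS energy per frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    silences, start = [], None
    for i, quiet in enumerate(rms < SILENCE_RMS):
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            silences.append((start * frame / rate, i * frame / rate))
            start = None
    if start is not None:
        silences.append((start * frame / rate, n * frame / rate))
    return silences

def markup_positions(silences):
    """Step 2: pick one insertion point per long-enough silence,
    honouring the minimum distance between consecutive markups."""
    positions, last = [], -MIN_MARKUP_GAP_S
    for s, e in silences:
        if e - s >= MIN_SILENCE_S and s - last >= MIN_MARKUP_GAP_S:
            positions.append(s)   # insert markup at the silence start
            last = s
    return positions
```

Because the markup is inserted only inside detected silences, it can never truncate a spoken word, which is exactly the constraint the parameters above must preserve.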
Speech2text: transcription of the audio stream, coupled with the unique markups, into plain text (including the textual form of the markups)
Any existing automatic speech recognition technology can be used
In the system prototype we used Dragon NaturallySpeaking:
1. Support for the Italian language
2. Availability of speech-to-text transcription from a digital audio file
3. Easy access to the product
4. High accuracy (99% for dictation)
Caption Alignment
Inputs: the plain transcript produced by Speech2text, and timing information about where markups have been inserted by the Markup Insertion Module
Output: transcript with timestamps
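A minimal sketch of this alignment idea, assuming the markup's textual form is a single invented token (the token `ZAPZAP` is purely hypothetical) and that the Markup Insertion Module has recorded the insertion times:

```python
MARKUP = "ZAPZAP"   # hypothetical textual form of the audio markup

def align(transcript, markup_times, total_duration):
    """Split the plain transcript at each markup occurrence and pair
    every text segment with the timestamps of the surrounding markups,
    producing a timecoded transcript as (start_s, end_s, text) tuples."""
    pieces = [p.strip() for p in transcript.split(MARKUP)]
    starts = [0.0] + list(markup_times)
    ends = list(markup_times) + [total_duration]
    return [(s, e, text) for s, e, text in zip(starts, ends, pieces) if text]
```

For example, `align("hello everyone ZAPZAP today we discuss captions", [2.0], 5.0)` yields two captions, one per segment between markups. Note that the ASR is invoked only once, on the markup-enriched audio; the alignment itself is pure text processing.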
Caption Alignment Existing solutions:
1. Alignment of manual transcript with video
2. ASR runs twice
Our solution:
1. Automatic: based on audio analysis
2. Efficient: the ASR runs just once
3. Technology transparent: any ASR can be used
Existing solutions, in contrast, require a high computational environment
Experimental study
Different computer science and linguistics professors of the Communication Sciences degree of the University of Modena and Reggio Emilia, teaching in front of a live audience
To tune the parameters used to locate the positions at which audio markups are inserted
To find the most appropriate hardware (microphone) and software (ASR) products to build the recording scenario
To investigate the transcription accuracy
Transcription accuracy
The higher the values of the silence length and of the minimum markup distance, the better the accuracy; however, these parameters also affect the length of the produced captions, since fewer markups yield fewer, longer caption segments
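Transcription accuracy of this kind is commonly quantified as word accuracy, i.e. 1 − WER (word error rate), where WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The slides do not specify the exact metric used, so this is a generic sketch:

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, computed via word-level edit distance
    (dynamic programming over the two token sequences)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return 1 - d[len(ref)][len(hyp)] / len(ref)
```

With such a metric, the trade-off above can be measured directly: re-running the pipeline with different silence-length and markup-distance settings and comparing the resulting accuracies against a reference transcript.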